[RFC 0/3] lib/fastmem: fast small-object allocator

* [RFC 0/3] lib/fastmem: fast small-object allocator
@ 2026-05-25 10:36 Mattias Rönnblom
  2026-05-25 10:36 ` [RFC 1/3] doc: add fastmem programming guide Mattias Rönnblom
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-25 10:36 UTC
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Mattias Rönnblom

This RFC introduces fastmem, a general-purpose small-object allocator
for DPDK. It is intended to replace per-type mempools with a single
allocator that handles arbitrary sizes, grows on demand, and matches
mempool-level performance on the hot path.

Motivation
----------

DPDK applications commonly maintain many mempools — one per object
type (connections, sessions, timers, work items). Each must be sized
up front, wastes memory when over-provisioned, and cannot serve
objects of a different size. Fastmem eliminates this by accepting
arbitrary sizes at runtime, backed by a slab allocator that
repurposes memory across size classes as demand shifts.

Design
------

Three-layer architecture:

1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
   reserved lazily (or pre-reserved for deterministic latency).

2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
   The alignment enables O(1) slab lookup from any object pointer
   via bitmask — no radix tree or index structure. Slabs move
   freely between 18 power-of-2 size classes (8 B to 1 MiB).

3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
   path). Cache misses trigger bulk transfers to/from the shared
   bin under a spinlock.
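
The bitmask lookup in item 2 can be sketched in a few lines of
standalone C. This is only an illustration, not code from the patch:
struct slab_hdr, obj_iova() and demo_slab_create() are hypothetical
names, and the heap-backed slab stands in for a memzone. The IOVA
helper assumes that translation is the slab's IOVA base plus the
object's offset within the slab, which is what the iova_base field in
patch 2 suggests.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE ((size_t)1 << 21)        /* 2 MiB, as in the RFC */
#define SLAB_MASK (SLAB_SIZE - 1)

/* Hypothetical stand-in for the slab header; only the field needed here. */
struct slab_hdr {
	uint64_t iova_base;   /* IOVA of the first byte of the slab */
};

/* Recover the slab header from any pointer inside the slab. */
static inline struct slab_hdr *
slab_of(const void *obj)
{
	return (struct slab_hdr *)((uintptr_t)obj & ~(uintptr_t)SLAB_MASK);
}

/* O(1) translation: slab IOVA base plus the object's offset in the slab. */
static inline uint64_t
obj_iova(const void *obj)
{
	return slab_of(obj)->iova_base + ((uintptr_t)obj & SLAB_MASK);
}

/* Allocate one 2 MiB-aligned slab from the C heap (demo only). */
static struct slab_hdr *
demo_slab_create(uint64_t iova_base)
{
	struct slab_hdr *slab = aligned_alloc(SLAB_SIZE, SLAB_SIZE);

	assert(slab != NULL);
	slab->iova_base = iova_base;
	return slab;
}
```

Because every slab starts on a 2 MiB boundary, clearing the low 21
bits of any interior pointer lands on the header, regardless of the
size class the slab currently serves.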

Key properties:

- Zero per-object metadata in the production build.
- NUMA-aware, with per-socket bins and free-slab pools.
- DMA-usable memory with O(1) virt-to-IOVA translation.
- Bulk alloc/free with all-or-nothing semantics.
- Backing memory never returned during lifetime (slabs recycled).
- Non-EAL threads supported (bypass cache, take bin lock).
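
One common way to get zero per-object metadata is to keep the
free-list link inside the free slot itself, so that an allocated slot
carries nothing but caller data. The slab code in patch 2 appears to
work this way. The sketch below is a simplified, standalone version
with hypothetical names (struct mini_slab and its helpers); it assumes
a slot is at least pointer-sized and pointer-aligned, and it omits
locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature slab: a free-list head over caller memory. */
struct mini_slab {
	void *free_head;
	uint32_t free_count;
};

/* Thread every slot onto the free list. */
static void
mini_slab_init(struct mini_slab *slab, void *mem, size_t slot_size,
		uint32_t n)
{
	void *prev = NULL;
	uint32_t i;

	/* Walk from the last slot to the first, so the head ends up at
	 * slot 0 and successive pops return ascending addresses. */
	for (i = n; i > 0; i--) {
		void *slot = (char *)mem + (size_t)(i - 1) * slot_size;

		*(void **)slot = prev;   /* link lives in the free slot */
		prev = slot;
	}
	slab->free_head = prev;
	slab->free_count = n;
}

/* Pop one slot, or return NULL if the slab is exhausted. */
static void *
mini_slab_pop(struct mini_slab *slab)
{
	void *obj = slab->free_head;

	if (obj != NULL) {
		slab->free_head = *(void **)obj;
		slab->free_count--;
	}
	return obj;
}

/* Return a slot; its first bytes are overwritten with the link. */
static void
mini_slab_push(struct mini_slab *slab, void *obj)
{
	*(void **)obj = slab->free_head;
	slab->free_head = obj;
	slab->free_count++;
}
```

The cost of this layout is that a freed object's first bytes are
clobbered by the link, which is also why the minimum class size must
be at least the size of a pointer.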

API surface
-----------

  rte_fastmem_init / deinit
  rte_fastmem_reserve
  rte_fastmem_set_limit / get_limit
  rte_fastmem_alloc / alloc_socket
  rte_fastmem_alloc_bulk / alloc_bulk_socket
  rte_fastmem_free / free_bulk
  rte_fastmem_virt2iova
  rte_fastmem_cache_flush
  rte_fastmem_max_size / classes
  rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
  rte_fastmem_stats_reset

All APIs are marked __rte_experimental.

Performance
-----------

The single-object hot path is roughly 2-3x the cost of mempool
and an order of magnitude faster than rte_malloc. Under
multi-lcore contention, fastmem scales similarly to mempool,
while rte_malloc collapses.

Limitations
-----------

- Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
- Power-of-2 classes only; worst-case internal fragmentation ~50%.
- Backing memory not reclaimable short of deinit.
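
The first two limitations can be made concrete with a small
standalone sketch. The helper names are hypothetical, the rounding
uses a plain loop rather than the count-leading-zeros approach in
patch 2, and alignment requests are ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CLASS_LOG2 3   /* 8 B */
#define MAX_CLASS_LOG2 20  /* 1 MiB */

/* Smallest power-of-two class size >= size, or 0 if size exceeds 1 MiB. */
static size_t
class_size_for(size_t size)
{
	unsigned int log2 = MIN_CLASS_LOG2;

	while (log2 <= MAX_CLASS_LOG2) {
		if (((size_t)1 << log2) >= size)
			return (size_t)1 << log2;
		log2++;
	}
	return 0;
}

/* Bytes of a slot left unused by a request of the given size. */
static size_t
slot_waste(size_t size)
{
	size_t cls = class_size_for(size);

	return cls == 0 ? 0 : cls - size;
}
```

For example, a 33-byte request lands in the 64-byte class and leaves
31 bytes unused, just under half the slot, while a request of
1 MiB + 1 byte has no class and would have to go to rte_malloc.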

Future work
-----------

- Lcore-affine allocations (false-sharing-free by construction).
- Mempool ops driver for transparent drop-in use.
- Pre-resolved allocator handle binding size class and socket,
  eliminating per-call class lookup and enabling an inline
  cache-hit fast path.
- Debug mode (cookies, double-free detection, poison-on-free).
- Telemetry integration.
- EAL integration, allowing EAL-internal subsystems to use
  fastmem for their small-object allocations.

Mattias Rönnblom (3):
  doc: add fastmem programming guide
  lib: add fastmem library
  app/test: add fastmem test suite

 app/test/meson.build                  |    3 +
 app/test/test_fastmem.c               | 1682 +++++++++++++++++++++++++
 app/test/test_fastmem_perf.c          |  997 +++++++++++++++
 app/test/test_fastmem_profile.c       |  157 +++
 doc/api/doxy-api-index.md             |    1 +
 doc/api/doxy-api.conf.in              |    1 +
 doc/guides/prog_guide/fastmem_lib.rst |  301 +++++
 doc/guides/prog_guide/index.rst       |    1 +
 lib/fastmem/meson.build               |    6 +
 lib/fastmem/rte_fastmem.c             | 1486 ++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h             |  644 ++++++++++
 lib/meson.build                       |    1 +
 12 files changed, 5280 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h

-- 
2.43.0

* [RFC 1/3] doc: add fastmem programming guide
  2026-05-25 10:36 [RFC 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
@ 2026-05-25 10:36 ` Mattias Rönnblom
  2026-05-25 10:36 ` [RFC 2/3] lib: add fastmem library Mattias Rönnblom
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-25 10:36 UTC
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Mattias Rönnblom

From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Add a programming guide for the fastmem library covering usage,
API overview, design, and implementation details.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 doc/guides/prog_guide/fastmem_lib.rst | 301 ++++++++++++++++++++++++++
 doc/guides/prog_guide/index.rst       |   1 +
 2 files changed, 302 insertions(+)
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
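
Editorial note (not part of the patch): the guide's description of
cache misses, where the per-lcore cache exchanges objects with the bin
in bulk and aims for half-full, can be sketched as below. This is a
toy model under stated assumptions: struct toy_bin and struct
toy_cache are hypothetical, the capacity of 8 is arbitrary, and the
bin lock is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_CAPACITY 8u
#define CACHE_TARGET (CACHE_CAPACITY / 2)
#define BIN_CAPACITY 64u

/* Hypothetical stand-in for the shared bin: a plain stack of pointers. */
struct toy_bin {
	void *objs[BIN_CAPACITY];
	unsigned int count;
};

/* Bounded LIFO owned by a single lcore. */
struct toy_cache {
	void *objs[CACHE_CAPACITY];
	unsigned int count;
};

/* Alloc: pop from the cache; on a miss, refill up to the target. */
static void *
toy_alloc(struct toy_cache *c, struct toy_bin *bin)
{
	if (c->count == 0) {
		/* A real allocator would take the bin lock here. */
		while (c->count < CACHE_TARGET && bin->count > 0)
			c->objs[c->count++] = bin->objs[--bin->count];
		if (c->count == 0)
			return NULL;   /* bin exhausted */
	}
	return c->objs[--c->count];
}

/* Free: push to the cache; when full, first drain down to the target. */
static void
toy_free(struct toy_cache *c, struct toy_bin *bin, void *obj)
{
	if (c->count == CACHE_CAPACITY) {
		/* A real allocator would take the bin lock here. */
		while (c->count > CACHE_TARGET) {
			assert(bin->count < BIN_CAPACITY);
			bin->objs[bin->count++] = c->objs[--c->count];
		}
	}
	c->objs[c->count++] = obj;
}
```

Stopping at half-full after either kind of miss leaves room for a run
of frees as well as a run of allocations before the bin has to be
visited again.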

diff --git a/doc/guides/prog_guide/fastmem_lib.rst b/doc/guides/prog_guide/fastmem_lib.rst
new file mode 100644
index 0000000000..142408c3c2
--- /dev/null
+++ b/doc/guides/prog_guide/fastmem_lib.rst
@@ -0,0 +1,301 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2026 Ericsson AB
+
+Fastmem Library
+===============
+
+The fastmem library is a fast, general-purpose small-object
+allocator for DPDK applications. It lets an application replace
+its many per-type mempools — each sized for a single object type
+— with a single allocator that handles arbitrary object sizes,
+grows on demand, and offers mempool-level performance for the
+common allocation and free paths.
+
+Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+supports bulk operations, and uses per-lcore caches to reduce
+shared-state contention. Unlike mempool, it does not require the
+caller to declare object sizes or counts up front.
+
+
+When to use fastmem
+-------------------
+
+Use fastmem when:
+
+* Small objects (up to 1 MiB) are allocated and freed on the
+  data path with low, predictable latency requirements.
+
+* Many object types of varying sizes exist and maintaining a
+  separate mempool for each is impractical.
+
+* DMA-usable memory with efficient virtual-to-IOVA translation
+  is needed.
+
+Do not use fastmem for allocations larger than 1 MiB. Use
+``rte_malloc()`` instead.
+
+
+Initialization and teardown
+----------------------------
+
+.. code-block:: c
+
+   /* At startup, after rte_eal_init(). */
+   rte_fastmem_init();
+
+   /* Optional: pre-reserve backing memory to avoid latency
+    * spikes from on-demand memzone reservation. */
+   rte_fastmem_reserve(64 * 1024 * 1024, SOCKET_ID_ANY);
+
+   /* ... application runs ... */
+
+   /* At shutdown, after all allocations have been freed. */
+   rte_fastmem_deinit();
+
+Neither ``rte_fastmem_init()`` nor ``rte_fastmem_deinit()`` is
+thread-safe; call them from the main lcore during startup and
+shutdown.
+
+
+Allocation and free
+-------------------
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(128, 0, 0);
+   /* Use obj... */
+   rte_fastmem_free(obj);
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's NUMA
+socket. Use ``rte_fastmem_alloc_socket()`` to target a specific
+socket or to enable cross-socket fallback with ``SOCKET_ID_ANY``.
+
+Alignment
+~~~~~~~~~
+
+When ``align`` is 0, the returned pointer is aligned to at least
+``RTE_CACHE_LINE_SIZE``. A non-zero ``align`` must be a power of
+two. Specifying an alignment smaller than ``RTE_CACHE_LINE_SIZE``
+is permitted, but the returned object may then share a cache line
+with an adjacent allocation, risking false sharing.
+
+Zeroing
+~~~~~~~
+
+Pass ``RTE_FASTMEM_F_ZERO`` to receive zero-initialized memory:
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(256, 0, RTE_FASTMEM_F_ZERO);
+
+
+Bulk allocation and free
+-------------------------
+
+.. code-block:: c
+
+   void *ptrs[32];
+
+   if (rte_fastmem_alloc_bulk(ptrs, 32, 64, 0, 0) < 0)
+       /* handle error */;
+
+   /* Use objects... */
+
+   rte_fastmem_free_bulk(ptrs, 32);
+
+Bulk allocation has all-or-nothing semantics: either all
+requested objects are returned, or none are (and ``rte_errno``
+is set to ``ENOMEM``).
+
+Bulk free is most efficient when all objects belong to the same
+size class; in that case the objects are pushed into the
+per-lcore cache in a single operation.
+
+
+IOVA translation
+----------------
+
+Memory returned by fastmem is DMA-usable. To obtain the IOVA
+for use in device descriptors:
+
+.. code-block:: c
+
+   rte_iova_t iova = rte_fastmem_virt2iova(obj);
+
+The translation is O(1). The returned IOVA is valid for the
+lifetime of the allocation.
+
+
+NUMA awareness
+--------------
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's socket.
+``rte_fastmem_alloc_socket()`` accepts an explicit socket ID or
+``SOCKET_ID_ANY``:
+
+* Explicit socket: allocate only from that socket; fail with
+  ``ENOMEM`` if exhausted.
+
+* ``SOCKET_ID_ANY``: try the caller's local socket first, then
+  fall back to other sockets.
+
+
+Per-lcore caches
+----------------
+
+Each EAL thread has a private cache per size class. The common
+allocation and free paths operate entirely within this cache,
+avoiding locks. Cache misses (empty on alloc, full on free)
+trigger a bulk transfer to/from the shared bin under a lock.
+
+Non-EAL threads bypass the cache and take the bin lock on every
+operation.
+
+``rte_fastmem_cache_flush()`` drains the calling lcore's caches
+back to the shared bins. This is useful after bursty phases to
+release idle cached memory.
+
+
+Threading
+---------
+
+All allocation and free functions are thread-safe and may be
+called from any thread. An allocation made on one thread may be
+freed on any other.
+
+Fastmem uses internal spinlocks. A thread preempted while
+holding one delays other threads contending for the same lock
+(correctness is not affected, only latency).
+
+
+Pre-reserving memory
+--------------------
+
+By default, fastmem reserves backing memory lazily on first
+allocation. ``rte_fastmem_reserve(size, socket_id)`` forces
+reservation up front, ensuring subsequent allocations do not
+incur memzone-reservation latency:
+
+.. code-block:: c
+
+   /* Reserve 128 MiB on socket 0. */
+   rte_fastmem_reserve(128 * 1024 * 1024, 0);
+
+Once reserved, backing memory is never returned to the system
+during the allocator's lifetime.
+
+Memory limits
+~~~~~~~~~~~~~
+
+``rte_fastmem_set_limit(socket_id, max_bytes)`` caps how much
+backing memory may be reserved on a given socket. Once the limit is
+reached, allocations that would require new backing memory fail with
+``ENOMEM``. The default is ``SIZE_MAX`` (unlimited).
+``rte_fastmem_get_limit()`` returns the current limit for a socket.
+
+.. code-block:: c
+
+   /* Allow at most 256 MiB on socket 0. */
+   rte_fastmem_set_limit(0, 256 * 1024 * 1024);
+
+   /* Block all growth on socket 1. */
+   rte_fastmem_set_limit(1, 0);
+
+Pass ``SOCKET_ID_ANY`` to apply the same limit to all sockets.
+
+
+Size classes
+------------
+
+Fastmem uses power-of-two size classes from 8 bytes to 1 MiB
+(18 classes). A request for N bytes is served from the smallest
+class >= N. The maximum supported size is queryable via
+``rte_fastmem_max_size()``.
+
+With power-of-two classes, worst-case internal fragmentation is
+just under 50% (e.g., a 33-byte request occupies a 64-byte
+slot). Assuming a uniform distribution of request sizes, the
+average waste is 25%. In practice, DPDK workloads tend to
+cluster at or near powers of two, so typical waste is lower.
+
+Requests exceeding the maximum are rejected with ``E2BIG``.
+
+
+Implementation
+--------------
+
+Fastmem organizes memory in three layers: backing memzones, slabs,
+and per-lcore caches.
+
+Backing memory and slabs
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Backing memory is obtained from EAL as 128 MiB IOVA-contiguous
+memzones, each aligned to 2 MiB. A memzone is partitioned into
+64 fixed-size, 2 MiB **slabs**. Slabs are the unit of memory
+that moves between size classes: a free slab can be assigned to
+any bin on demand, and an empty slab (all objects freed) returns
+to the free-slab pool for reuse by another size class.
+
+The 2 MiB slab alignment is the key structural property. Given
+any object pointer, the allocator recovers the owning slab by
+masking off the low 21 bits — no radix tree, hash table, or
+memzone lookup is needed. This makes the free path fast: a
+single pointer-mask load reaches the slab header, which
+identifies the size class and bin.
+
+Each slab reserves one cache line at offset 0 for its header. The
+first slot starts at the larger of the header size and the slot
+size, and the rest of the slab is divided into slots of the class
+size. Objects carry no per-object metadata; the full slot is usable.
+
+Three-level allocation hierarchy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. **Per-lcore cache** — a bounded LIFO stack of free object
+   pointers, one per (lcore, size class, socket). Allocation
+   pops; free pushes. No lock is needed because only the owning
+   lcore accesses its cache.
+
+2. **Bin** — one per (size class, socket). Owns the partial and
+   full slab lists. A spinlock serializes bulk transfers between
+   the bin and per-lcore caches. Most traffic is absorbed by the
+   caches, so bin-lock contention is low.
+
+3. **Free-slab pool** — one per socket. A spinlock protects slab
+   acquisition and release. These events are rare relative to
+   object-level operations (a single small-object slab serves
+   thousands of allocations).
+
+On a cache miss (empty on alloc, full on free), the cache
+exchanges objects with the bin in bulk, targeting half-full to
+maximize headroom in both directions.
+
+Cache sizing
+~~~~~~~~~~~~
+
+Cache capacity varies by size class to bound per-lcore memory
+footprint:
+
+* Classes 8 B through 4 KiB: capacity 64.
+* Larger classes: capacity halves per class (32, 16, 8, 4),
+  flooring at 4.
+
+Even the largest classes remain cached. The capacity curve
+ensures that small, frequent allocations get the highest cache
+hit rate, while large allocations still avoid the bin lock on
+most operations.
+
+
+Statistics
+----------
+
+Fastmem maintains always-on, per-lcore counters that track
+allocation and free activity. Statistics are queryable at four
+levels of granularity: global summary, per size class, per lcore,
+and per lcore per class.
+
+``rte_fastmem_classes()`` returns the number of size classes and
+optionally fills an array with their sizes.
+
+See ``rte_fastmem.h`` for the full statistics API.
\ No newline at end of file
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index e6f24945b0..c85196c85e 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -28,6 +28,7 @@ Memory Management
     mempool_lib
     mbuf_lib
     multi_proc_support
+    fastmem_lib
 
 
 CPU Management
-- 
2.43.0


* [RFC 2/3] lib: add fastmem library
  2026-05-25 10:36 [RFC 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-25 10:36 ` [RFC 1/3] doc: add fastmem programming guide Mattias Rönnblom
@ 2026-05-25 10:36 ` Mattias Rönnblom
  2026-05-27 14:22   ` Stephen Hemminger
  2026-05-25 10:36 ` [RFC 3/3] app/test: add fastmem test suite Mattias Rönnblom
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-25 10:36 UTC
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Mattias Rönnblom

From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Introduce fastmem, a fast general-purpose small-object allocator
for DPDK applications. It allows an application to replace its
many per-type mempools with a single allocator that handles
arbitrary sizes, grows on demand, and offers mempool-level
performance on the hot path.

Applications that manage many object types (connections, sessions,
work items, timers) currently maintain a separate mempool for each,
requiring upfront sizing and wasting memory on over-provisioned
pools. Fastmem removes both constraints.

Key properties:

 * Huge-page-backed, NUMA-aware, DMA-usable.
 * Per-lcore caches for lock-free alloc/free on EAL threads.
 * Bulk alloc and free APIs.
 * Power-of-two size classes from 8 B to 1 MiB.
 * Backing memory grows lazily; rte_fastmem_reserve() allows
   upfront reservation to avoid latency spikes.
 * Always-on per-lcore and per-class statistics.

Bounded to small objects; requests above rte_fastmem_max_size()
are rejected. Replacing rte_malloc is currently not a goal.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 doc/api/doxy-api-index.md |    1 +
 doc/api/doxy-api.conf.in  |    1 +
 lib/fastmem/meson.build   |    6 +
 lib/fastmem/rte_fastmem.c | 1486 +++++++++++++++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h |  644 ++++++++++++++++
 lib/meson.build           |    1 +
 6 files changed, 2139 insertions(+)
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h
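
Editorial note (not part of the patch): the slot arithmetic in
slab_slot0_offset() and slab_slot_count() is easier to follow with
concrete numbers. The standalone sketch below mirrors that arithmetic
under the assumption of a 64-byte cache line; slot0_offset() and
slot_count() are illustrative names, not the patch's functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE ((size_t)1 << 21)   /* 2 MiB */
#define HEADER_SIZE ((size_t)64)      /* assumed cache-line size */

/* Offset of the first slot: past the header, and slot-aligned. */
static size_t
slot0_offset(size_t class_size)
{
	return class_size < HEADER_SIZE ? HEADER_SIZE : class_size;
}

/* Number of slots of the given class that fit in one slab. */
static uint32_t
slot_count(size_t class_size)
{
	return (uint32_t)((SLAB_SIZE - slot0_offset(class_size)) /
			class_size);
}
```

Under these assumptions the 8-byte class gets 262136 slots per slab,
the 4 KiB class gets 511, and the 1 MiB class gets exactly one, since
the first 1 MiB of the slab is taken up by the header and its padding.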

diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 9296042119..7ebf1201ce 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -70,6 +70,7 @@ The public API headers are grouped by topics:
   [memzone](@ref rte_memzone.h),
   [mempool](@ref rte_mempool.h),
   [malloc](@ref rte_malloc.h),
+  [fastmem](@ref rte_fastmem.h),
   [memcpy](@ref rte_memcpy.h)
 
 - **timers**:
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index bedd944681..4355e9fb2d 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -43,6 +43,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/efd \
                           @TOPDIR@/lib/ethdev \
                           @TOPDIR@/lib/eventdev \
+                          @TOPDIR@/lib/fastmem \
                           @TOPDIR@/lib/fib \
                           @TOPDIR@/lib/gpudev \
                           @TOPDIR@/lib/graph \
diff --git a/lib/fastmem/meson.build b/lib/fastmem/meson.build
new file mode 100644
index 0000000000..6c7834608f
--- /dev/null
+++ b/lib/fastmem/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2026 Ericsson AB
+
+sources = files('rte_fastmem.c')
+headers = files('rte_fastmem.h')
+deps += ['eal']
diff --git a/lib/fastmem/rte_fastmem.c b/lib/fastmem/rte_fastmem.c
new file mode 100644
index 0000000000..f605c538fc
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.c
@@ -0,0 +1,1486 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <eal_export.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_spinlock.h>
+
+#include <rte_fastmem.h>
+
+RTE_LOG_REGISTER_DEFAULT(fastmem_logtype, NOTICE);
+
+#define RTE_LOGTYPE_FASTMEM fastmem_logtype
+
+#define FASTMEM_LOG(level, ...) \
+	RTE_LOG_LINE(level, FASTMEM, "" __VA_ARGS__)
+
+#define FASTMEM_MEMZONE_SIZE_LOG2 27                            /* 128 MiB */
+#define FASTMEM_MEMZONE_SIZE ((size_t)1 << FASTMEM_MEMZONE_SIZE_LOG2)
+
+#define FASTMEM_SLAB_SIZE_LOG2 21                               /*   2 MiB */
+#define FASTMEM_SLAB_SIZE ((size_t)1 << FASTMEM_SLAB_SIZE_LOG2)
+#define FASTMEM_SLAB_MASK (FASTMEM_SLAB_SIZE - 1)
+
+#define FASTMEM_SLABS_PER_MEMZONE (FASTMEM_MEMZONE_SIZE / FASTMEM_SLAB_SIZE)
+
+#define FASTMEM_MAX_MEMZONES_PER_SOCKET 64
+
+#define FASTMEM_MIN_CLASS_LOG2 3                                /*   8 B */
+#define FASTMEM_MAX_CLASS_LOG2 20                               /*   1 MiB */
+#define FASTMEM_N_CLASSES (FASTMEM_MAX_CLASS_LOG2 - FASTMEM_MIN_CLASS_LOG2 + 1)
+
+#define FASTMEM_MIN_SIZE ((size_t)1 << FASTMEM_MIN_CLASS_LOG2)
+#define FASTMEM_MAX_ALLOC_SIZE ((size_t)1 << FASTMEM_MAX_CLASS_LOG2)
+
+#define FASTMEM_SLAB_HEADER_SIZE RTE_CACHE_LINE_SIZE
+
+#define FASTMEM_CACHE_BASE_CAPACITY 64
+#define FASTMEM_CACHE_FLOOR_CAPACITY 4
+#define FASTMEM_CACHE_BASE_CLASS_LOG2 12                        /* 4 KiB */
+
+struct fastmem_bin;
+
+/*
+ * Slab header at offset 0 of each 2 MiB slab. Either free (linked
+ * via next_free) or assigned to a bin (linked via list).
+ */
+struct fastmem_slab {
+	struct fastmem_bin *bin;
+	void *free_head;
+	uint32_t free_count;
+	uint32_t n_slots;
+	struct fastmem_slab *next_free;
+	TAILQ_ENTRY(fastmem_slab) list;
+	rte_iova_t iova_base;
+} __rte_aligned(FASTMEM_SLAB_HEADER_SIZE);
+
+TAILQ_HEAD(fastmem_slab_list, fastmem_slab);
+
+struct fastmem_bin {
+	rte_spinlock_t lock;
+	uint32_t slot_size;
+	uint32_t slots_per_slab;
+	uint32_t class_idx;
+	struct fastmem_slab_list partial;
+	struct fastmem_slab_list full;
+	int socket_id;
+	uint64_t slab_acquires;
+	uint64_t slab_releases;
+	uint32_t slabs_partial;
+	uint32_t slabs_full;
+};
+
+/* Per-(lcore, class, socket) bounded LIFO of free object pointers. */
+struct fastmem_cache {
+	uint32_t count;
+	uint32_t capacity;
+	uint32_t target;
+	uint64_t alloc_cache_hits;
+	uint64_t alloc_cache_misses;
+	uint64_t alloc_nomem;
+	uint64_t free_cache_hits;
+	uint64_t free_cache_misses;
+	void *objs[];
+} __rte_cache_aligned;
+
+struct fastmem_socket_state {
+	rte_spinlock_t lock;
+	struct fastmem_slab *free_head;
+	size_t reserved_bytes;
+	size_t memory_limit;
+	unsigned int n_memzones;
+	unsigned int memzone_seq;
+	const struct rte_memzone *memzones[FASTMEM_MAX_MEMZONES_PER_SOCKET];
+	struct fastmem_bin bins[FASTMEM_N_CLASSES];
+	struct fastmem_cache *caches[RTE_MAX_LCORE][FASTMEM_N_CLASSES];
+};
+
+struct fastmem {
+	struct fastmem_socket_state sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct fastmem *fastmem;
+static const struct rte_memzone *fastmem_mz;
+
+static inline unsigned int
+size_to_class(size_t size, size_t align)
+{
+	size_t effective;
+	unsigned int log2;
+
+	effective = size < FASTMEM_MIN_SIZE ? FASTMEM_MIN_SIZE : size;
+	if (align > effective)
+		effective = align;
+
+	log2 = 64u - rte_clz64(effective - 1);
+
+	if (log2 < FASTMEM_MIN_CLASS_LOG2)
+		log2 = FASTMEM_MIN_CLASS_LOG2;
+	if (log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+static inline size_t
+class_size(unsigned int class_idx)
+{
+	return (size_t)1 << (class_idx + FASTMEM_MIN_CLASS_LOG2);
+}
+
+static_assert(sizeof(struct fastmem_slab) == FASTMEM_SLAB_HEADER_SIZE,
+	"fastmem slab header must fit in exactly one cache line");
+static_assert(sizeof(struct fastmem_slab) <= FASTMEM_SLAB_SIZE,
+	"slab header larger than a slab makes no sense");
+
+static __rte_always_inline struct fastmem_slab *
+slab_of(void *obj)
+{
+	return (struct fastmem_slab *)
+		((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+}
+
+static inline size_t
+slab_slot0_offset(size_t class_size)
+{
+	return class_size < FASTMEM_SLAB_HEADER_SIZE ?
+		FASTMEM_SLAB_HEADER_SIZE : class_size;
+}
+
+static inline uint32_t
+slab_slot_count(size_t class_size)
+{
+	size_t offset = slab_slot0_offset(class_size);
+
+	return (uint32_t)((FASTMEM_SLAB_SIZE - offset) / class_size);
+}
+
+/* Must be called with bin->lock held. */
+static void
+slab_init(struct fastmem_bin *bin, struct fastmem_slab *slab)
+{
+	size_t slot_size = bin->slot_size;
+	size_t offset = slab_slot0_offset(slot_size);
+	uint32_t n = bin->slots_per_slab;
+	void *prev = NULL;
+	uint32_t i;
+
+	slab->bin = bin;
+	slab->n_slots = n;
+	slab->free_count = n;
+
+	/* Build in reverse so pops yield sequential addresses. */
+	for (i = n; i > 0; i--) {
+		void *slot = RTE_PTR_ADD(slab, offset + (i - 1) * slot_size);
+		*(void **)slot = prev;
+		prev = slot;
+	}
+	slab->free_head = prev;
+}
+
+static int
+grow_socket(struct fastmem_socket_state *socket, int socket_id)
+{
+	char name[RTE_MEMZONE_NAMESIZE];
+	const struct rte_memzone *mz;
+	unsigned int i;
+
+	if (socket->reserved_bytes + FASTMEM_MEMZONE_SIZE > socket->memory_limit) {
+		FASTMEM_LOG(ERR,
+			"reserve would exceed memory_limit (%zu) on socket %d",
+			socket->memory_limit, socket_id);
+		return -ENOMEM;
+	}
+
+	if (socket->n_memzones == FASTMEM_MAX_MEMZONES_PER_SOCKET) {
+		FASTMEM_LOG(ERR,
+			"reached per-socket memzone cap (%u) on socket %d",
+			FASTMEM_MAX_MEMZONES_PER_SOCKET, socket_id);
+		return -ENOMEM;
+	}
+
+	snprintf(name, sizeof(name), "fastmem_%d_%u", socket_id,
+			socket->memzone_seq++);
+
+	mz = rte_memzone_reserve_aligned(name, FASTMEM_MEMZONE_SIZE,
+			socket_id, RTE_MEMZONE_IOVA_CONTIG,
+			FASTMEM_SLAB_SIZE);
+	if (mz == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to reserve %zu-byte memzone '%s' on socket %d: %s",
+			(size_t)FASTMEM_MEMZONE_SIZE, name, socket_id,
+			rte_strerror(rte_errno));
+		return -ENOMEM;
+	}
+
+	socket->memzones[socket->n_memzones++] = mz;
+	socket->reserved_bytes += FASTMEM_MEMZONE_SIZE;
+
+	for (i = 0; i < FASTMEM_SLABS_PER_MEMZONE; i++) {
+		struct fastmem_slab *slab = RTE_PTR_ADD(mz->addr,
+				i * FASTMEM_SLAB_SIZE);
+
+		slab->iova_base = mz->iova + i * FASTMEM_SLAB_SIZE;
+		slab->next_free = socket->free_head;
+		socket->free_head = slab;
+	}
+
+	FASTMEM_LOG(DEBUG,
+		"reserved memzone '%s' (%zu bytes) on socket %d; %zu slabs added",
+		name, (size_t)FASTMEM_MEMZONE_SIZE, socket_id,
+		(size_t)FASTMEM_SLABS_PER_MEMZONE);
+
+	return 0;
+}
+
+static struct fastmem_slab *
+slab_acquire(struct fastmem_socket_state *socket, int socket_id)
+{
+	struct fastmem_slab *slab;
+
+	rte_spinlock_lock(&socket->lock);
+
+	if (socket->free_head == NULL) {
+		int rc = grow_socket(socket, socket_id);
+
+		if (rc < 0) {
+			rte_spinlock_unlock(&socket->lock);
+			return NULL;
+		}
+	}
+
+	slab = socket->free_head;
+	socket->free_head = slab->next_free;
+	slab->next_free = NULL;
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return slab;
+}
+
+static void
+slab_release(struct fastmem_socket_state *socket,
+		struct fastmem_slab *slab)
+{
+	rte_spinlock_lock(&socket->lock);
+
+	slab->next_free = socket->free_head;
+	socket->free_head = slab;
+
+	rte_spinlock_unlock(&socket->lock);
+}
+
+static void
+bin_init(struct fastmem_bin *bin, unsigned int class_idx, int socket_id)
+{
+	size_t slot_size = class_size(class_idx);
+
+	rte_spinlock_init(&bin->lock);
+	bin->slot_size = (uint32_t)slot_size;
+	bin->slots_per_slab = slab_slot_count(slot_size);
+	bin->class_idx = class_idx;
+	TAILQ_INIT(&bin->partial);
+	TAILQ_INIT(&bin->full);
+	bin->socket_id = socket_id;
+	bin->slab_acquires = 0;
+	bin->slab_releases = 0;
+	bin->slabs_partial = 0;
+	bin->slabs_full = 0;
+}
+
+static void
+bin_release(struct fastmem_bin *bin, struct fastmem_socket_state *socket)
+{
+	struct fastmem_slab *slab;
+
+	while ((slab = TAILQ_FIRST(&bin->partial)) != NULL) {
+		TAILQ_REMOVE(&bin->partial, slab, list);
+		slab_release(socket, slab);
+	}
+	while ((slab = TAILQ_FIRST(&bin->full)) != NULL) {
+		TAILQ_REMOVE(&bin->full, slab, list);
+		slab_release(socket, slab);
+	}
+}
+
+static unsigned int
+bin_pop_locked(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	unsigned int got = 0;
+
+	while (got < n) {
+		struct fastmem_slab *slab = TAILQ_FIRST(&bin->partial);
+		void *obj;
+
+		if (slab == NULL)
+			break;
+
+		obj = slab->free_head;
+		slab->free_head = *(void **)obj;
+		slab->free_count--;
+		objs[got++] = obj;
+
+		if (slab->free_count == 0) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			TAILQ_INSERT_HEAD(&bin->full, slab, list);
+			bin->slabs_partial--;
+			bin->slabs_full++;
+		}
+	}
+
+	return got;
+}
+
+/*
+ * Fully-drained slabs are accumulated in @p to_release for the
+ * caller to return after dropping the lock.
+ */
+static unsigned int
+bin_push_locked(struct fastmem_bin *bin, void **objs, unsigned int n,
+		struct fastmem_slab **to_release)
+{
+	unsigned int n_release = 0;
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		void *obj = objs[i];
+		struct fastmem_slab *slab = (struct fastmem_slab *)
+			((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+		bool was_full = slab->free_count == 0;
+
+		*(void **)obj = slab->free_head;
+		slab->free_head = obj;
+		slab->free_count++;
+
+		if (was_full) {
+			TAILQ_REMOVE(&bin->full, slab, list);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_full--;
+			bin->slabs_partial++;
+		}
+
+		if (slab->free_count == slab->n_slots) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			bin->slabs_partial--;
+			bin->slab_releases++;
+			to_release[n_release++] = slab;
+		}
+	}
+
+	return n_release;
+}
+
+/*
+ * Called and returns with bin->lock held, but may drop it in between.
+ */
+static int
+bin_ensure_partial_locked(struct fastmem_bin *bin,
+		struct fastmem_socket_state *socket)
+{
+	struct fastmem_slab *slab;
+
+	if (TAILQ_FIRST(&bin->partial) != NULL)
+		return 0;
+
+	rte_spinlock_unlock(&bin->lock);
+
+	slab = slab_acquire(socket, bin->socket_id);
+	if (slab == NULL) {
+		rte_spinlock_lock(&bin->lock);
+		return -ENOMEM;
+	}
+
+	rte_spinlock_lock(&bin->lock);
+
+	/* Another thread may have added a partial slab meanwhile. */
+	if (TAILQ_FIRST(&bin->partial) != NULL) {
+		rte_spinlock_unlock(&bin->lock);
+		slab_release(socket, slab);
+		rte_spinlock_lock(&bin->lock);
+		return 0;
+	}
+
+	slab_init(bin, slab);
+	TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+	bin->slabs_partial++;
+	bin->slab_acquires++;
+
+	return 0;
+}
+
+static void *
+bin_alloc_one(struct fastmem_bin *bin)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	void *obj;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (bin_pop_locked(bin, &obj, 1) == 0) {
+		int rc = bin_ensure_partial_locked(bin, socket);
+
+		if (rc < 0) {
+			rte_spinlock_unlock(&bin->lock);
+			rte_errno = -rc;
+			return NULL;
+		}
+	}
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return obj;
+}
+
+static unsigned int
+bin_alloc_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	unsigned int got = 0;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (got < n) {
+		got += bin_pop_locked(bin, objs + got, n - got);
+		if (got == n)
+			break;
+
+		if (bin_ensure_partial_locked(bin, socket) < 0)
+			break;
+	}
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return got;
+}
+
+static void
+bin_free_one(struct fastmem_bin *bin, void *obj)
+{
+	unsigned int n_release;
+	struct fastmem_slab *slab_to_release = NULL;
+	struct fastmem_socket_state *socket;
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, &obj, 1, &slab_to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	if (n_release > 0) {
+		socket = &fastmem->sockets[bin->socket_id];
+		slab_release(socket, slab_to_release);
+	}
+}
+
+static void
+bin_free_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	struct fastmem_slab *to_release[FASTMEM_CACHE_BASE_CAPACITY];
+	unsigned int n_release;
+	unsigned int i;
+
+	RTE_VERIFY(n <= RTE_DIM(to_release));
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, objs, n, to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	for (i = 0; i < n_release; i++)
+		slab_release(socket, to_release[i]);
+}
+
+static inline unsigned int
+cache_capacity(unsigned int class_idx)
+{
+	unsigned int class_log2 = class_idx + FASTMEM_MIN_CLASS_LOG2;
+	unsigned int shift;
+	unsigned int cap;
+
+	if (class_log2 <= FASTMEM_CACHE_BASE_CLASS_LOG2)
+		return FASTMEM_CACHE_BASE_CAPACITY;
+
+	shift = class_log2 - FASTMEM_CACHE_BASE_CLASS_LOG2;
+	cap = FASTMEM_CACHE_BASE_CAPACITY >> shift;
+
+	return cap < FASTMEM_CACHE_FLOOR_CAPACITY ?
+		FASTMEM_CACHE_FLOOR_CAPACITY : cap;
+}
+
+static inline struct fastmem_cache **
+cache_slot(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	if (lcore_id >= RTE_MAX_LCORE)
+		return NULL;
+	return &socket->caches[lcore_id][class_idx];
+}
+
+static struct fastmem_cache *
+cache_create(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache **slot = cache_slot(socket, class_idx, lcore_id);
+	struct fastmem_cache *cache;
+	unsigned int capacity;
+	size_t cache_size;
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	capacity = cache_capacity(class_idx);
+	cache_size = sizeof(*cache) + capacity * sizeof(void *);
+
+	/*
+	 * Allocate the cache struct from fastmem on the calling
+	 * lcore's socket (NUMA-local to the writer). Bypasses the
+	 * cache layer to avoid recursion.
+	 */
+	{
+		unsigned int cache_class =
+			size_to_class(cache_size, RTE_CACHE_LINE_SIZE);
+		unsigned int own_socket = rte_socket_id();
+		struct fastmem_socket_state *alloc_socket;
+
+		if (cache_class >= FASTMEM_N_CLASSES) {
+			FASTMEM_LOG(ERR,
+				"cache size %zu exceeds max size class",
+				cache_size);
+			return NULL;
+		}
+
+		if (own_socket >= RTE_MAX_NUMA_NODES)
+			own_socket = (unsigned int)socket->bins[0].socket_id;
+
+		alloc_socket = &fastmem->sockets[own_socket];
+
+		cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
+		if (cache == NULL) {
+			FASTMEM_LOG(ERR,
+				"failed to allocate cache for class %u on socket %u",
+				class_idx, own_socket);
+			return NULL;
+		}
+	}
+
+	cache->count = 0;
+	cache->capacity = capacity;
+	cache->target = capacity / 2;
+	cache->alloc_cache_hits = 0;
+	cache->alloc_cache_misses = 0;
+	cache->alloc_nomem = 0;
+	cache->free_cache_hits = 0;
+	cache->free_cache_misses = 0;
+
+	*slot = cache;
+
+	return cache;
+}
+
+static __rte_always_inline struct fastmem_cache *
+cache_get(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	struct fastmem_cache **slot = cache_slot(socket, class_idx, lcore_id);
+	struct fastmem_cache *cache;
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	return cache_create(socket, class_idx, lcore_id);
+}
+
+static __rte_always_inline void *
+cache_pop(struct fastmem_cache *cache, struct fastmem_bin *bin)
+{
+	if (cache->count > 0) {
+		cache->alloc_cache_hits++;
+		return cache->objs[--cache->count];
+	}
+
+	cache->count = bin_alloc_bulk(bin, cache->objs, cache->target);
+	if (cache->count == 0)
+		return NULL;
+
+	cache->alloc_cache_misses++;
+	return cache->objs[--cache->count];
+}
+
+static __rte_always_inline void
+cache_push(struct fastmem_cache *cache, struct fastmem_bin *bin, void *obj)
+{
+	unsigned int drain;
+
+	if (cache->count < cache->capacity) {
+		cache->free_cache_hits++;
+		cache->objs[cache->count++] = obj;
+		return;
+	}
+
+	cache->free_cache_misses++;
+
+	/*
+	 * Drain the oldest (bottom) half to the bin, keeping the
+	 * newest (top) half for temporal reuse.
+	 */
+	drain = cache->count - cache->target;
+	bin_free_bulk(bin, cache->objs, drain);
+	memmove(cache->objs, cache->objs + drain,
+		cache->target * sizeof(cache->objs[0]));
+	cache->count = cache->target;
+
+	cache->objs[cache->count++] = obj;
+}
+
+static void
+socket_release_caches(struct fastmem_socket_state *socket)
+{
+	unsigned int lcore;
+	unsigned int c;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache = socket->caches[lcore][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache);
+
+			socket->caches[lcore][c] = NULL;
+		}
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_init, 25.07)
+int
+rte_fastmem_init(void)
+{
+	unsigned int s, c;
+
+	if (fastmem != NULL)
+		return -EBUSY;
+
+	fastmem_mz = rte_memzone_reserve_aligned("fastmem_state",
+			sizeof(*fastmem), SOCKET_ID_ANY, 0,
+			RTE_CACHE_LINE_SIZE);
+	if (fastmem_mz == NULL)
+		return -ENOMEM;
+
+	fastmem = fastmem_mz->addr;
+	memset(fastmem, 0, sizeof(*fastmem));
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		rte_spinlock_init(&socket->lock);
+		socket->memory_limit = SIZE_MAX;
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++)
+			bin_init(&socket->bins[c], c, (int)s);
+	}
+
+	return 0;
+}
+
+static void
+release_socket(struct fastmem_socket_state *socket)
+{
+	unsigned int c, i;
+
+	for (c = 0; c < FASTMEM_N_CLASSES; c++)
+		bin_release(&socket->bins[c], socket);
+
+	for (i = 0; i < socket->n_memzones; i++)
+		rte_memzone_free(socket->memzones[i]);
+
+	socket->free_head = NULL;
+	socket->reserved_bytes = 0;
+	socket->n_memzones = 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_deinit, 25.07)
+void
+rte_fastmem_deinit(void)
+{
+	unsigned int i;
+
+	if (fastmem == NULL)
+		return;
+
+	/* Cache structs may live on other sockets; release all caches first. */
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		socket_release_caches(&fastmem->sockets[i]);
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket(&fastmem->sockets[i]);
+
+	rte_memzone_free(fastmem_mz);
+	fastmem_mz = NULL;
+	fastmem = NULL;
+}
+
+/* Same resolution order as rte_malloc's malloc_get_numa_socket(). */
+static __rte_always_inline unsigned int
+local_socket_id(void)
+{
+	unsigned int sid = rte_socket_id();
+
+	if (likely(sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = rte_lcore_to_socket_id(rte_get_main_lcore());
+	if (likely(sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	return (unsigned int)rte_socket_id_by_idx(0);
+}
+
+static int
+reserve_on_socket(int sid, size_t size)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[sid];
+	int rc = 0;
+
+	rte_spinlock_lock(&socket->lock);
+
+	while (socket->reserved_bytes < size) {
+		rc = grow_socket(socket, sid);
+		if (rc < 0)
+			break;
+	}
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return rc;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_reserve, 25.07)
+int
+rte_fastmem_reserve(size_t size, int socket_id)
+{
+	unsigned int i;
+	int rc;
+
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id != SOCKET_ID_ANY) {
+		if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+			return -EINVAL;
+		return reserve_on_socket(socket_id, size);
+	}
+
+	rc = reserve_on_socket(local_socket_id(), size);
+	if (rc == 0)
+		return 0;
+
+	for (i = 0; i < rte_socket_count(); i++) {
+		int sid = rte_socket_id_by_idx(i);
+
+		if (sid < 0 || (unsigned int)sid == local_socket_id())
+			continue;
+
+		rc = reserve_on_socket(sid, size);
+		if (rc == 0)
+			return 0;
+	}
+
+	return rc;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_set_limit, 25.07)
+int
+rte_fastmem_set_limit(int socket_id, size_t max_bytes)
+{
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id == SOCKET_ID_ANY) {
+		for (unsigned int i = 0; i < RTE_MAX_NUMA_NODES; i++)
+			fastmem->sockets[i].memory_limit = max_bytes;
+		return 0;
+	}
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	fastmem->sockets[socket_id].memory_limit = max_bytes;
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_get_limit, 25.07)
+size_t
+rte_fastmem_get_limit(int socket_id)
+{
+	if (fastmem == NULL || socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return 0;
+
+	return fastmem->sockets[socket_id].memory_limit;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_max_size, 25.07)
+size_t
+rte_fastmem_max_size(void)
+{
+	return FASTMEM_MAX_ALLOC_SIZE;
+}
+
+static __rte_always_inline void *
+alloc_from_socket(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache *cache;
+
+	cache = cache_get(socket, class_idx, lcore_id);
+	if (likely(cache != NULL))
+		return cache_pop(cache, &socket->bins[class_idx]);
+	return bin_alloc_one(&socket->bins[class_idx]);
+}
+
+static __rte_always_inline void
+do_free(void *ptr)
+{
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+
+	slab = slab_of(ptr);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+	if (likely(cache != NULL))
+		cache_push(cache, bin, ptr);
+	else
+		bin_free_one(bin, ptr);
+}
+
+static __rte_always_inline int
+do_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags, unsigned int lcore_id,
+		int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int got = 0;
+
+	RTE_ASSERT(fastmem != NULL);
+
+	if (align == 0)
+		align = RTE_CACHE_LINE_SIZE;
+	else if (unlikely((align & (align - 1)) != 0)) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return -E2BIG;
+	}
+
+	socket = &fastmem->sockets[socket_id];
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		/* Drain from cache. */
+		unsigned int avail = RTE_MIN(cache->count, n);
+
+		cache->count -= avail;
+		memcpy(ptrs, &cache->objs[cache->count],
+			avail * sizeof(void *));
+		got = avail;
+		cache->alloc_cache_hits += avail;
+
+		if (got < n) {
+			unsigned int need = n - got;
+			unsigned int want = RTE_MAX(need, cache->target);
+			unsigned int filled;
+
+			if (want <= cache->capacity) {
+				/* Refill into cache, give caller their share. */
+				filled = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					cache->objs, want);
+				if (filled > 0) {
+					cache->alloc_cache_misses += RTE_MIN(filled, need);
+				}
+				if (filled >= need) {
+					memcpy(ptrs + got,
+						cache->objs + filled - need,
+						need * sizeof(void *));
+					cache->count = filled - need;
+					got = n;
+				} else {
+					memcpy(ptrs + got, cache->objs,
+						filled * sizeof(void *));
+					got += filled;
+					cache->count = 0;
+				}
+			} else {
+				/* n exceeds cache capacity; pull directly. */
+				unsigned int direct = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, need);
+				if (direct > 0)
+					cache->alloc_cache_misses += direct;
+				got += direct;
+			}
+		}
+	} else {
+		got = bin_alloc_bulk(&socket->bins[class_idx], ptrs, n);
+	}
+
+	if (unlikely(got < n) && fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count() && got < n; i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			socket = &fastmem->sockets[sid];
+			cache = cache_get(socket, class_idx, lcore_id);
+			if (likely(cache != NULL)) {
+				unsigned int avail =
+					RTE_MIN(cache->count, n - got);
+				cache->count -= avail;
+				memcpy(ptrs + got,
+					&cache->objs[cache->count],
+					avail * sizeof(void *));
+				cache->alloc_cache_hits += avail;
+				got += avail;
+			}
+			if (got < n) {
+				unsigned int direct = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, n - got);
+				if (direct > 0 && cache != NULL)
+					cache->alloc_cache_misses += direct;
+				got += direct;
+			}
+		}
+	}
+
+	if (unlikely(got < n)) {
+		/* All-or-nothing: free the partial result and fail. */
+		struct fastmem_cache **slot;
+		unsigned int i;
+
+		for (i = 0; i < got; i++)
+			do_free(ptrs[i]);
+
+		slot = cache_slot(
+			&fastmem->sockets[socket_id], class_idx,
+			lcore_id);
+		if (slot != NULL && *slot != NULL)
+			(*slot)->alloc_nomem++;
+		rte_errno = ENOMEM;
+		return -ENOMEM;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO) {
+		size_t cs = class_size(class_idx);
+		unsigned int i;
+
+		for (i = 0; i < n; i++)
+			memset(ptrs[i], 0, cs);
+	}
+
+	return 0;
+}
+
+static __rte_always_inline void *
+do_alloc(size_t size, size_t align, unsigned int flags,
+		unsigned int lcore_id, int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_cache **slot;
+	void *obj;
+
+	RTE_ASSERT(fastmem != NULL);
+
+	if (align == 0)
+		align = RTE_CACHE_LINE_SIZE;
+	else if (unlikely((align & (align - 1)) != 0)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return NULL;
+	}
+
+	obj = alloc_from_socket(&fastmem->sockets[socket_id],
+			class_idx, lcore_id);
+
+	if (likely(obj != NULL))
+		goto out;
+
+	if (fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count(); i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			obj = alloc_from_socket(&fastmem->sockets[sid],
+					class_idx, lcore_id);
+			if (obj != NULL)
+				goto out;
+		}
+	}
+
+	slot = cache_slot(
+		&fastmem->sockets[socket_id], class_idx, lcore_id);
+	if (slot != NULL && *slot != NULL)
+		(*slot)->alloc_nomem++;
+	rte_errno = ENOMEM;
+	return NULL;
+
+out:
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc, 25.07)
+void *
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+{
+	return do_alloc(size, align, flags, rte_lcore_id(),
+			local_socket_id(), false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_socket, 25.07)
+void *
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc(size, align, flags, rte_lcore_id(),
+				local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	return do_alloc(size, align, flags, rte_lcore_id(), socket_id, false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free, 25.07)
+void
+rte_fastmem_free(void *ptr)
+{
+	if (unlikely(ptr == NULL))
+		return;
+
+	do_free(ptr);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk, 25.07)
+int
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags)
+{
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), local_socket_id(), false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk_socket, 25.07)
+int
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc_bulk(ptrs, n, size, align, flags,
+				rte_lcore_id(), local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), socket_id, false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free_bulk, 25.07)
+void
+rte_fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int lcore_id;
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int space;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	lcore_id = rte_lcore_id();
+
+	/* Fast path: check if first object gives us the bin. */
+	slab = slab_of(ptrs[0]);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+
+	if (unlikely(cache == NULL)) {
+		for (i = 0; i < n; i++)
+			do_free(ptrs[i]);
+		return;
+	}
+
+	/*
+	 * Try to push all objects into the cache in one memcpy.
+	 * If any object belongs to a different bin, fall back to
+	 * per-object free for the whole batch.
+	 */
+	space = cache->capacity - cache->count;
+	if (likely(n <= space)) {
+		/* Verify all same bin (common case). */
+		for (i = 1; i < n; i++) {
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		}
+		cache->free_cache_hits += n;
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+	/* Would overflow cache — drain first, then push. */
+	if (n <= cache->capacity) {
+		unsigned int drain;
+
+		for (i = 1; i < n; i++) {
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		}
+
+		cache->free_cache_misses += n;
+		drain = cache->count - cache->target + n;
+		if (drain > cache->count)
+			drain = cache->count;
+		if (drain > 0) {
+			bin_free_bulk(bin, cache->objs, drain);
+			cache->count -= drain;
+			memmove(cache->objs, cache->objs + drain,
+				cache->count * sizeof(cache->objs[0]));
+		}
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+slow:
+	for (i = 0; i < n; i++)
+		do_free(ptrs[i]);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_virt2iova, 25.07)
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr)
+{
+	struct fastmem_slab *slab;
+
+	RTE_ASSERT(fastmem != NULL);
+
+	slab = slab_of((void *)(uintptr_t)ptr);
+
+	return slab->iova_base + ((uintptr_t)ptr - (uintptr_t)slab);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_cache_flush, 25.07)
+void
+rte_fastmem_cache_flush(void)
+{
+	unsigned int lcore_id;
+	unsigned int s, c;
+
+	if (fastmem == NULL)
+		return;
+
+	lcore_id = rte_lcore_id();
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache);
+
+			socket->caches[lcore_id][c] = NULL;
+		}
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats, 25.07)
+int
+rte_fastmem_stats(struct rte_fastmem_stats *stats)
+{
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_stats){0};
+	stats->n_classes = FASTMEM_N_CLASSES;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		stats->bytes_backing += socket->reserved_bytes;
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			uint64_t class_allocs = 0, class_frees = 0;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				class_allocs += cache->alloc_cache_hits +
+					cache->alloc_cache_misses;
+				class_frees += cache->free_cache_hits +
+					cache->free_cache_misses;
+				stats->alloc_nomem += cache->alloc_nomem;
+			}
+			stats->alloc_total += class_allocs;
+			stats->free_total += class_frees;
+			if (class_allocs > class_frees)
+				stats->bytes_in_use += class_size(c) *
+					(class_allocs - class_frees);
+		}
+	}
+
+	return 0;
+}
+
+static inline unsigned int
+exact_class_idx(size_t sz)
+{
+	unsigned int log2;
+
+	if (sz < FASTMEM_MIN_SIZE || sz > FASTMEM_MAX_ALLOC_SIZE)
+		return FASTMEM_N_CLASSES;
+	if ((sz & (sz - 1)) != 0)
+		return FASTMEM_N_CLASSES;
+
+	log2 = (unsigned int)rte_ctz64(sz);
+	if (log2 < FASTMEM_MIN_CLASS_LOG2 || log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_class, 25.07)
+int
+rte_fastmem_stats_class(size_t class_size_arg,
+		struct rte_fastmem_class_stats *stats)
+{
+	unsigned int c;
+	uint64_t allocs, frees;
+
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+		struct fastmem_bin *bin = &socket->bins[c];
+
+		for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+			struct fastmem_cache *cache = socket->caches[l][c];
+			if (cache == NULL)
+				continue;
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+
+		stats->slab_acquires += bin->slab_acquires;
+		stats->slab_releases += bin->slab_releases;
+		stats->slabs_partial += bin->slabs_partial;
+		stats->slabs_full += bin->slabs_full;
+	}
+
+	allocs = stats->alloc_cache_hits + stats->alloc_cache_misses;
+	frees = stats->free_cache_hits + stats->free_cache_misses;
+	if (allocs > frees)
+		stats->in_use = allocs - frees;
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore, 25.07)
+int
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats)
+{
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_stats){0};
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+			if (cache == NULL)
+				continue;
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+	}
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore_class, 25.07)
+int
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size_arg,
+		struct rte_fastmem_lcore_class_stats *stats)
+{
+	unsigned int c;
+
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_cache *cache =
+			fastmem->sockets[s].caches[lcore_id][c];
+		if (cache == NULL)
+			continue;
+		stats->alloc_cache_hits += cache->alloc_cache_hits;
+		stats->alloc_cache_misses += cache->alloc_cache_misses;
+		stats->alloc_nomem += cache->alloc_nomem;
+		stats->free_cache_hits += cache->free_cache_hits;
+		stats->free_cache_misses += cache->free_cache_misses;
+	}
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_reset, 25.07)
+void
+rte_fastmem_stats_reset(void)
+{
+	if (fastmem == NULL)
+		return;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_bin *bin = &socket->bins[c];
+
+			bin->slab_acquires = 0;
+			bin->slab_releases = 0;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				cache->alloc_cache_hits = 0;
+				cache->alloc_cache_misses = 0;
+				cache->alloc_nomem = 0;
+				cache->free_cache_hits = 0;
+				cache->free_cache_misses = 0;
+			}
+		}
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_classes, 25.07)
+unsigned int
+rte_fastmem_classes(size_t *sizes)
+{
+	if (sizes != NULL)
+		for (unsigned int i = 0; i < FASTMEM_N_CLASSES; i++)
+			sizes[i] = class_size(i);
+	return FASTMEM_N_CLASSES;
+}
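Note (illustration only, not part of the patch): the class table
returned by rte_fastmem_classes() and the exact-size lookup used by the
per-class stats calls can be sketched in plain C as below. The SKETCH_*
names and the 8 B to 1 MiB bounds are assumptions taken from the cover
letter's "18 power-of-2 size classes"; the patch's own FASTMEM_* macros
and class_size() are defined outside this excerpt and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bounds (hypothetical names): 2^3 = 8 B up to 2^20 = 1 MiB. */
#define SKETCH_MIN_CLASS_LOG2 3u
#define SKETCH_MAX_CLASS_LOG2 20u
#define SKETCH_N_CLASSES \
	(SKETCH_MAX_CLASS_LOG2 - SKETCH_MIN_CLASS_LOG2 + 1u)

/* Size of class idx, assuming class i holds 2^(i + min_log2) bytes. */
static size_t
sketch_class_size(unsigned int idx)
{
	return (size_t)1 << (idx + SKETCH_MIN_CLASS_LOG2);
}

/*
 * Index of the class whose size is exactly sz, or SKETCH_N_CLASSES
 * if sz is not a power of two or lies outside the class range.
 */
static unsigned int
sketch_exact_class_idx(size_t sz)
{
	unsigned int log2 = 0;

	if (sz == 0 || (sz & (sz - 1)) != 0)
		return SKETCH_N_CLASSES;

	/* sz is a power of two, so this loop terminates. */
	while (((size_t)1 << log2) != sz)
		log2++;

	if (log2 < SKETCH_MIN_CLASS_LOG2 || log2 > SKETCH_MAX_CLASS_LOG2)
		return SKETCH_N_CLASSES;

	return log2 - SKETCH_MIN_CLASS_LOG2;
}

/* Round-trip check: every class size maps back to its own index. */
static void
sketch_check_round_trip(void)
{
	unsigned int i;

	for (i = 0; i < SKETCH_N_CLASSES; i++)
		assert(sketch_exact_class_idx(sketch_class_size(i)) == i);
}
```

Under these assumptions, sketch_exact_class_idx(64) yields index 3,
while a size such as 48 is rejected, which corresponds to the -EINVAL
path of rte_fastmem_stats_class().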
diff --git a/lib/fastmem/rte_fastmem.h b/lib/fastmem/rte_fastmem.h
new file mode 100644
index 0000000000..cd1abf9844
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.h
@@ -0,0 +1,644 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#ifndef _RTE_FASTMEM_H_
+#define _RTE_FASTMEM_H_
+
+/**
+ * @file
+ *
+ * RTE Fastmem
+ *
+ * @warning
+ * @b EXPERIMENTAL:
+ * All functions in this file may be changed or removed without prior notice.
+ *
+ * The fastmem library is a fast, general-purpose small-object
+ * allocator for DPDK applications. It is intended to allow an
+ * application to replace its many per-type mempools — each sized
+ * for a single object type (a connection, a session, a work item,
+ * a timer, etc.) — with a single allocator that handles arbitrary
+ * object sizes, grows on demand, and offers mempool-level
+ * performance for the common allocation and free paths.
+ *
+ * Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+ * supports bulk operations, and uses per-lcore caches to reduce
+ * shared-state contention. Unlike mempool, it does not require the
+ * caller to declare object sizes or counts up front.
+ *
+ * There is a single, global fastmem instance per process. The
+ * instance is brought up with rte_fastmem_init() and torn down with
+ * rte_fastmem_deinit(). Allocations are made with
+ * rte_fastmem_alloc() and freed with rte_fastmem_free().
+ *
+ * The allocator is bounded to small-object allocations. Requests
+ * larger than rte_fastmem_max_size() are rejected; callers with
+ * such needs should use rte_malloc() directly.
+ *
+ * Backing memory is reserved from DPDK memzones. Once reserved,
+ * backing memory is not returned to the system during the
+ * allocator's lifetime. Callers that need predictable latency may
+ * pre-reserve backing memory up front using rte_fastmem_reserve(),
+ * avoiding memzone-reservation overhead during steady-state
+ * operation.
+ *
+ * Alignment argument, @c align:
+ *   If non-zero, @c align specifies the minimum alignment and
+ *   must be a power of 2. If zero, the default alignment is
+ *   @c RTE_CACHE_LINE_SIZE, so that objects obtained from distinct
+ *   calls cannot false-share a cache line.
+ *
+ * Threads and per-lcore caches:
+ *   Allocate and free calls from EAL threads are served through a
+ *   per-lcore cache, which makes the common path lock-free.
+ *   Unregistered non-EAL threads do not use a cache; their
+ *   allocate and free calls go directly to shared state, take an
+ *   internal lock, and cost more per call.
+ *
+ * Non-preemptible caller:
+ *   Callers should not be preemptible while inside a fastmem call.
+ *   Fastmem uses internal spinlocks; if a caller is preempted
+ *   while holding one, any other thread that subsequently needs
+ *   the same lock stalls until the preempted caller resumes.
+ */
+
+#include <stddef.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Flag for rte_fastmem_alloc() and its variants: initialize the
+ * returned memory to zero before returning it to the caller.
+ */
+#define RTE_FASTMEM_F_ZERO RTE_BIT32(0)
+
+/**
+ * Initialize the fastmem allocator.
+ *
+ * Sets up the library's internal state. Must be called before any
+ * allocation call. Typically called once per process, after
+ * rte_eal_init() and before the application's worker threads begin
+ * making allocations.
+ *
+ * Initialization does not pre-reserve any backing memory; memzones
+ * are reserved lazily as allocations require. An application that
+ * wants to avoid memzone-reservation latency on the allocation
+ * path should follow rte_fastmem_init() with one or more calls to
+ * rte_fastmem_reserve().
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EBUSY: The allocator is already initialized.
+ *  - -ENOMEM: Unable to allocate internal state.
+ */
+__rte_experimental
+int
+rte_fastmem_init(void);
+
+/**
+ * Tear down the fastmem allocator.
+ *
+ * Releases the library's internal state and frees all backing
+ * memzones. After this call, no fastmem allocations or frees may
+ * be made until rte_fastmem_init() is called again.
+ *
+ * The caller is responsible for ensuring that no fastmem-allocated
+ * objects remain in use. Outstanding allocations at deinit time
+ * result in undefined behavior.
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ */
+__rte_experimental
+void
+rte_fastmem_deinit(void);
+
+/**
+ * Pre-reserve backing memory.
+ *
+ * Ensures that at least @p size bytes of memzone-backed memory are
+ * available to the allocator on @p socket_id, reserving additional
+ * memzones from EAL as needed to reach that total. Subsequent
+ * allocations served from the pre-reserved memory do not incur
+ * memzone-reservation cost.
+ *
+ * The reservation is a floor, not an increment: a call reserves
+ * more memory only if the socket currently holds less than
+ * @p size, and never shrinks it. Reserved memory is not returned
+ * to the system during the allocator's lifetime.
+ *
+ * A typical use is to call rte_fastmem_reserve() once at
+ * application startup, with a size chosen to cover the expected
+ * steady-state working set. Allocations and frees during
+ * steady-state operation then avoid memzone reservations entirely.
+ *
+ * @param size
+ *  The minimum amount of backing memory, in bytes, to make
+ *  available on @p socket_id. The allocator may reserve more than
+ *  the requested amount due to internal rounding (e.g., to memzone
+ *  or block granularity).
+ *
+ * @param socket_id
+ *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
+ *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or, if the caller
+ *  has none, the main lcore's socket, then the first configured one)
+ *  and falls back to other sockets if the preferred socket cannot
+ *  satisfy the reservation.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -ENOMEM: Insufficient huge-page memory to satisfy the request.
+ *  - -EINVAL: Fastmem not initialized, or invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_reserve(size_t size, int socket_id);
+
+/**
+ * Set the maximum backing memory that may be reserved on a socket.
+ *
+ * Once the limit is reached, allocations that would require new
+ * backing memory on the constrained socket fail with ENOMEM.
+ * Already-reserved memory is not released.
+ *
+ * Setting a limit below the current reserved amount is allowed and
+ * prevents further growth.
+ *
+ * @param socket_id
+ *  The NUMA socket to constrain, or SOCKET_ID_ANY to apply the
+ *  limit to all sockets.
+ * @param max_bytes
+ *  Maximum backing memory in bytes, or SIZE_MAX for unlimited (the default).
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: Fastmem not initialized, or invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_set_limit(int socket_id, size_t max_bytes);
+
+/**
+ * Get the maximum backing memory limit for a socket.
+ *
+ * @param socket_id
+ *  The NUMA socket to query.
+ * @return
+ *  Limit in bytes (SIZE_MAX if unlimited), or 0 if uninitialized or @p socket_id is invalid.
+ */
+__rte_experimental
+size_t
+rte_fastmem_get_limit(int socket_id);
+
+/**
+ * Retrieve the largest allocation size the allocator supports.
+ *
+ * Requests larger than this size are rejected by the allocation
+ * functions. The returned value is a property of the allocator
+ * implementation and does not change across the lifetime of the
+ * process.
+ *
+ * @return
+ *  The largest supported allocation size, in bytes.
+ */
+__rte_experimental
+size_t
+rte_fastmem_max_size(void);
+
+/**
+ * Allocate an object from the fastmem allocator.
+ *
+ * Allocates at least @p size bytes, aligned to at least @p align
+ * bytes. The returned memory is backed by huge pages and is
+ * DMA-usable; its IOVA can be obtained via rte_fastmem_virt2iova().
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_socket() to target a
+ * specific socket.
+ *
+ * The allocated memory must be freed with rte_fastmem_free(). An
+ * allocation may be freed from any lcore, not only the lcore that
+ * made the allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags. Use
+ *  RTE_FASTMEM_F_ZERO to obtain zero-initialized memory.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set:
+ *    - E2BIG: @p size exceeds rte_fastmem_max_size().
+ *    - EINVAL: Invalid @p align (not a power of two).
+ *    - ENOMEM: Allocation could not be served from existing
+ *      backing memory and no additional memzone could be reserved.
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+	__rte_alloc_size(1) __rte_alloc_align(2);
+
+/**
+ * Allocate an object on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc(), but targets the specified NUMA socket
+ * rather than the socket of the calling lcore. Use this variant
+ * when the lifetime or access pattern of the allocation is not
+ * tied to the calling lcore's socket.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set (see rte_fastmem_alloc()).
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+	__rte_alloc_size(1) __rte_alloc_align(2);
+
+/**
+ * Free an object previously allocated by the fastmem allocator.
+ *
+ * @p ptr must have been returned by a prior call to any fastmem
+ * allocation function, or be NULL. If @p ptr is NULL, no operation
+ * is performed.
+ *
+ * Free may be called from any lcore, regardless of which lcore
+ * made the original allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptr
+ *  Pointer to an object previously allocated by fastmem, or NULL.
+ */
+__rte_experimental
+void
+rte_fastmem_free(void *ptr);
+
+/**
+ * Allocate multiple objects in bulk.
+ *
+ * Allocates @p n objects, each of size at least @p size and aligned
+ * to at least @p align bytes, and stores the resulting pointers
+ * into @p ptrs. All @p n objects have the same size and alignment.
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_bulk_socket() to target a
+ * specific socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_alloc().
+ *
+ * On failure no objects are allocated and @p ptrs is left
+ * untouched.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ *  - -EINVAL: Invalid @p align.
+ *  - -ENOMEM: Not enough objects could be allocated to fill the
+ *    request.
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags);
+
+/**
+ * Allocate multiple objects in bulk on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc_bulk(), but targets the specified NUMA
+ * socket rather than the socket of the calling lcore.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - Negative errno on failure (see rte_fastmem_alloc_bulk()).
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id);
+
+/**
+ * Free multiple objects in bulk.
+ *
+ * Frees the @p n objects pointed to by @p ptrs. Each pointer in
+ * the array must have been returned by a prior fastmem allocation
+ * call and must not have been freed. The objects need not have
+ * the same size, alignment, or socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_free().
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_free_bulk(void **ptrs, unsigned int n);
+
+/**
+ * Obtain the IOVA for a fastmem-allocated pointer.
+ *
+ * Translates a virtual address returned by a fastmem allocation
+ * function into the corresponding IOVA, suitable for use in device
+ * DMA descriptors.
+ *
+ * The returned IOVA is valid for the lifetime of the allocation.
+ *
+ * @p ptr must have been returned by a prior fastmem allocation
+ * function. Passing any other pointer results in undefined
+ * behavior.
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation
+ *  function.
+ *
+ * @return
+ *  The IOVA corresponding to @p ptr.
+ */
+__rte_experimental
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr);
+
+/**
+ * Flush the calling lcore's per-lcore caches.
+ *
+ * Drains every cached object from the calling lcore's
+ * per-(size class, NUMA socket) caches back to their shared
+ * bins, and releases the cache state itself. A subsequent
+ * allocation or free on this lcore lazily recreates any caches
+ * it needs.
+ *
+ * This is useful in applications that have finished a bursty
+ * phase and want to release memory that would otherwise sit idle
+ * in caches. It is also useful in tests that want to observe
+ * bin-level state without per-lcore caching hiding activity.
+ *
+ * The call has no effect when invoked from a non-EAL thread.
+ *
+ * This function is not thread-safe with respect to concurrent
+ * allocations or frees on the calling lcore; call it only when
+ * the calling lcore is not making other fastmem calls.
+ */
+__rte_experimental
+void
+rte_fastmem_cache_flush(void);
+
+/**
+ * Global summary statistics.
+ */
+struct rte_fastmem_stats {
+	uint64_t bytes_backing;  /**< Bytes of backing memory (memzones) reserved from EAL. */
+	uint64_t bytes_in_use;   /**< Approximate bytes in live objects. */
+	uint64_t alloc_total;    /**< Total successful alloc operations (hits + misses). */
+	uint64_t free_total;     /**< Total free operations (hits + misses). */
+	uint64_t alloc_nomem;    /**< Alloc attempts that failed with ENOMEM. */
+	unsigned int n_classes;  /**< Number of size classes. */
+};
+
+/**
+ * Per-size-class statistics (aggregated across all lcores).
+ *
+ * Allocation and free counters count individual objects, not
+ * operations. A bulk allocation of 32 objects that hits the cache
+ * increments alloc_cache_hits by 32.
+ */
+struct rte_fastmem_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t in_use;               /**< Objects currently live (allocs - frees). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from a per-lcore cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by a per-lcore cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+	uint64_t slab_acquires;        /**< Slabs pulled from the free pool. */
+	uint64_t slab_releases;        /**< Slabs returned to the free pool. */
+	uint32_t slabs_partial;        /**< Current partial slab count. */
+	uint32_t slabs_full;           /**< Current full slab count. */
+};
+
+/**
+ * Per-lcore statistics (aggregated across all classes).
+ */
+struct rte_fastmem_lcore_stats {
+	uint64_t alloc_cache_hits;     /**< Allocs served from this lcore's caches. */
+	uint64_t alloc_cache_misses;   /**< Allocs that missed this lcore's caches. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by this lcore's caches. */
+	uint64_t free_cache_misses;    /**< Frees that bypassed this lcore's caches. */
+};
+
+/**
+ * Per-lcore, per-class statistics (no aggregation).
+ */
+struct rte_fastmem_lcore_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+};
+
+/**
+ * Get the number of size classes and optionally their sizes.
+ *
+ * @param[out] sizes
+ *   If non-NULL, filled with the size (in bytes) of each class.
+ *   The caller must provide space for at least the returned number
+ *   of entries.
+ *
+ * @return
+ *   The number of size classes.
+ */
+__rte_experimental
+unsigned int
+rte_fastmem_classes(size_t *sizes);
+
+/**
+ * Retrieve global summary statistics.
+ *
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL or fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats(struct rte_fastmem_stats *stats);
+
+/**
+ * Retrieve statistics for a single size class.
+ *
+ * @param class_size
+ *   Exact size of the class to query (must match one of the values
+ *   returned by rte_fastmem_classes()).
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized, or
+ *    @p class_size does not match any size class.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_class(size_t class_size,
+		struct rte_fastmem_class_stats *stats);
+
+/**
+ * Retrieve per-lcore statistics (aggregated across all classes).
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized, or
+ *    @p lcore_id is invalid.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats);
+
+/**
+ * Retrieve per-lcore, per-class statistics.
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param class_size
+ *   Exact size of the class to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized,
+ *    @p lcore_id is invalid, or @p class_size does not match any
+ *    size class.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size,
+		struct rte_fastmem_lcore_class_stats *stats);
+
+/**
+ * Reset all statistics counters to zero.
+ *
+ * Zeroes per-lcore cache counters and per-bin counters. Does not
+ * affect the allocator's operational state.
+ */
+__rte_experimental
+void
+rte_fastmem_stats_reset(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_FASTMEM_H_ */
diff --git a/lib/meson.build b/lib/meson.build
index 8f5cfd28a5..98ec28a901 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -40,6 +40,7 @@ libraries = [
         'efd',
         'eventdev',
         'dispatcher', # dispatcher depends on eventdev
+        'fastmem',
         'gpudev',
         'gro',
         'gso',
-- 
2.43.0
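
The argument rules documented for rte_fastmem_alloc() in the header above
(align == 0 selects cache-line alignment, a non-power-of-2 align gives
EINVAL, an oversized request gives E2BIG) can be modeled in isolation. The
sketch below is not the library's implementation: model_check_args(),
MODEL_CACHE_LINE and MODEL_MAX_SIZE are hypothetical names. The order of
the checks, and the treatment of an alignment larger than the maximum size
as E2BIG, are assumptions based on the documented return values and the
test suite in this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumed stand-ins; the real values come from RTE_CACHE_LINE_SIZE and
 * rte_fastmem_max_size(). */
#define MODEL_CACHE_LINE ((size_t)64)
#define MODEL_MAX_SIZE ((size_t)1 << 20)

/*
 * Hypothetical model of the documented argument rules. Returns 0 and
 * stores the effective alignment in *eff_align, or returns a positive
 * errno value as it would appear in rte_errno.
 */
static int
model_check_args(size_t size, size_t align, size_t *eff_align)
{
	assert(eff_align != NULL);

	if (align == 0)
		align = MODEL_CACHE_LINE;	/* documented default */
	else if ((align & (align - 1)) != 0)
		return EINVAL;			/* not a power of two */

	/* Assumed: size or alignment beyond the largest class is E2BIG. */
	if (size > MODEL_MAX_SIZE || align > MODEL_MAX_SIZE)
		return E2BIG;

	*eff_align = align;
	return 0;
}
```

A caller of the real API would see the same outcomes as a NULL return with
rte_errno set, rather than as a return code.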



* Re: [RFC 2/3] lib: add fastmem library
  2026-05-25 10:36 ` [RFC 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-27 14:22   ` Stephen Hemminger
  2026-05-27 17:25     ` Mattias Rönnblom
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2026-05-27 14:22 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On Mon, 25 May 2026 12:36:41 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> +
> +static __rte_always_inline struct fastmem_cache *
> +cache_get(struct fastmem_socket_state *socket, unsigned int class_idx,
> +		unsigned int lcore_id)

Do not use always_inline. With current compilers using always inline
makes the optimizer generate worse code. The only exceptions would
be where inline is required to make assembly work or you have good benchmark
data that proves that always_inline generates > 1% performance gain.

Too much of DPDK uses __rte_always_inline as a "cargo cult" it-is-faster setting.


* Re: [RFC 2/3] lib: add fastmem library
  2026-05-27 14:22   ` Stephen Hemminger
@ 2026-05-27 17:25     ` Mattias Rönnblom
  0 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 17:25 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On 5/27/26 16:22, Stephen Hemminger wrote:
> On Mon, 25 May 2026 12:36:41 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> +
>> +static __rte_always_inline struct fastmem_cache *
>> +cache_get(struct fastmem_socket_state *socket, unsigned int class_idx,
>> +		unsigned int lcore_id)
> 
> Do not use always_inline. With current compilers using always inline
> makes the optimizer generate worse code. The only exceptions would
> be where inline is required to make assembly work or you have good benchmark
> data that proves that always_inline generates > 1% performance gain.
> 
> Too much of DPDK uses __rte_always_inline as a "cargo cult" it-is-faster setting.

__rte_always_inline is still useful, but it is rare. For example, it may 
be required in certain situations to force constant propagation to 
actually occur.

I'm removing both inline and always_inline. Doesn't make a difference, 
so noise.
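
To illustrate the constant-propagation case mentioned above, here is a
minimal sketch, assuming GCC or Clang. ALWAYS_INLINE is a local stand-in
for __rte_always_inline, and copy_words()/copy_4_words() are hypothetical
helpers, not code from this series. When the generic helper is inlined into
a caller that passes a literal length, the compiler sees a fixed trip count
and can specialize the loop. Without the attribute, whether that happens is
left to the inliner's heuristics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for DPDK's __rte_always_inline (GCC/Clang syntax). */
#define ALWAYS_INLINE inline __attribute__((always_inline))

/*
 * Generic helper. If it is inlined at a call site where "len" is a
 * compile-time constant, the loop bound is known to the optimizer.
 */
static ALWAYS_INLINE void
copy_words(uint64_t *dst, const uint64_t *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] = src[i];
}

/* Specialized wrapper: the constant 4 is visible inside copy_words(). */
static void
copy_4_words(uint64_t *dst, const uint64_t *src)
{
	copy_words(dst, src, 4);
}
```

Whether forcing the inline pays off still has to be shown by a benchmark.
The sketch only shows the mechanism.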



* [RFC 3/3] app/test: add fastmem test suite
  2026-05-25 10:36 [RFC 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-25 10:36 ` [RFC 1/3] doc: add fastmem programming guide Mattias Rönnblom
  2026-05-25 10:36 ` [RFC 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-25 10:36 ` Mattias Rönnblom
  2026-05-26  8:57   ` [RFC v2 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-25 14:30 ` [RFC " Stephen Hemminger
  2026-05-25 18:36 ` Stephen Hemminger
  4 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-25 10:36 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Mattias Rönnblom

From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Add functional, performance, and profiling test suites for the
fastmem library.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 app/test/meson.build            |    3 +
 app/test/test_fastmem.c         | 1682 +++++++++++++++++++++++++++++++
 app/test/test_fastmem_perf.c    |  997 ++++++++++++++++++
 app/test/test_fastmem_profile.c |  157 +++
 4 files changed, 2839 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d458f9c07..d11c63be6f 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -82,6 +82,9 @@ source_file_deps = {
     'test_event_vector_adapter.c': ['eventdev', 'bus_vdev'],
     'test_eventdev.c': ['eventdev', 'bus_vdev'],
     'test_external_mem.c': [],
+    'test_fastmem.c': ['fastmem'],
+    'test_fastmem_perf.c': ['fastmem', 'mempool'],
+    'test_fastmem_profile.c': ['fastmem'],
     'test_fbarray.c': [],
     'test_fib.c': ['net', 'fib'],
     'test_fib6.c': ['rib', 'fib'],
diff --git a/app/test/test_fastmem.c b/app/test/test_fastmem.c
new file mode 100644
index 0000000000..c79ea95481
--- /dev/null
+++ b/app/test/test_fastmem.c
@@ -0,0 +1,1682 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <inttypes.h>
+#include <stdalign.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_thread.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define FASTMEM_MEMZONE_SIZE (128U << 20)
+
+/*
+ * Count memzones whose names begin with the fastmem prefix.
+ * Used to verify that rte_fastmem_reserve() really did reserve
+ * backing memzones.
+ */
+static int fastmem_memzone_count;
+
+static void
+count_fastmem_memzones_walk(const struct rte_memzone *mz, void *arg)
+{
+	RTE_SET_USED(arg);
+
+	if (strncmp(mz->name, "fastmem_", strlen("fastmem_")) == 0)
+		fastmem_memzone_count++;
+}
+
+static unsigned int
+count_fastmem_memzones(void)
+{
+	fastmem_memzone_count = 0;
+	rte_memzone_walk(count_fastmem_memzones_walk, NULL);
+	return fastmem_memzone_count;
+}
+
+static int
+test_init_deinit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	/* A subsequent init/deinit cycle must succeed. */
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "second rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_init_is_not_idempotent(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, -EBUSY,
+		"expected -EBUSY on re-init, got %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_deinit_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_max_size(void)
+{
+	size_t max;
+
+	max = rte_fastmem_max_size();
+	TEST_ASSERT(max >= (1U << 20),
+		"max_size=%zu below required 1 MiB minimum", max);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_small(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * A small reserve request (1 byte) must result in exactly
+	 * one memzone reservation: the internal rounding is to
+	 * memzone granularity.
+	 */
+	rc = rte_fastmem_reserve(1, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve() failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	rte_fastmem_deinit();
+
+	/* After deinit the memzones must be released. */
+	TEST_ASSERT_EQUAL(count_fastmem_memzones(), 0,
+		"%u fastmem memzones leaked after deinit",
+		count_fastmem_memzones());
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_multiple_memzones(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	size_t reserve_size;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * Request just over one memzone's worth; this must force
+	 * a second memzone to be reserved.
+	 */
+	reserve_size = FASTMEM_MEMZONE_SIZE + 1;
+	rc = rte_fastmem_reserve(reserve_size, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve(%zu) failed: %d",
+		reserve_size, rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 2,
+		"expected 2 new memzones for %zu-byte reserve, got %u",
+		reserve_size, after - before);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_cumulative(void)
+{
+	int socket_id;
+	unsigned int after_first, after_second;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	after_first = count_fastmem_memzones();
+
+	/*
+	 * A second call requesting the same amount that's already
+	 * reserved must not trigger any new memzone reservation.
+	 */
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "second reserve failed: %d", rc);
+
+	after_second = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after_first, after_second,
+		"reserve of already-reserved amount added memzones (%u -> %u)",
+		after_first, after_second);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_invalid_socket(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, RTE_MAX_NUMA_NODES);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for out-of-range socket, got %d", rc);
+
+	rc = rte_fastmem_reserve(1, -2);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for negative socket, got %d", rc);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_without_init(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0,
+		"expected failure without init, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_any_socket(void)
+{
+	unsigned int before, after;
+	int rc;
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * SOCKET_ID_ANY should succeed on any system with at least
+	 * one configured socket. The allocator picks the caller's
+	 * socket first and falls back to other sockets if needed.
+	 */
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0,
+		"rte_fastmem_reserve(SOCKET_ID_ANY) failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 2 tests: allocation and free.
+ */
+
+static int
+test_alloc_too_big(void)
+{
+	void *p;
+	int rc;
+
+	rte_errno = 0;
+	p = rte_fastmem_alloc(rte_fastmem_max_size() + 1, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc above max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_invalid_align(void)
+{
+	void *p;
+	int rc;
+
+	rte_errno = 0;
+	p = rte_fastmem_alloc(16, 3, 0); /* 3 is not a power of 2 */
+	TEST_ASSERT_NULL(p, "alloc with align=3 returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, EINVAL,
+		"expected rte_errno=EINVAL, got %d", rte_errno);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_small(void)
+{
+	void *p;
+	int rc;
+
+	p = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8) failed: rte_errno=%d", rte_errno);
+
+	/* Writing into the object must not crash. */
+	memset(p, 0xa5, 8);
+
+	rte_fastmem_free(p);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_various_sizes(void)
+{
+	static const size_t sizes[] = {
+		1, 8, 16, 17, 63, 64, 128, 1024, 4096,
+		64 * 1024, 256 * 1024, 1024 * 1024,
+	};
+	void *ptrs[RTE_DIM(sizes)];
+	unsigned int i;
+	int rc;
+
+	for (i = 0; i < RTE_DIM(sizes); i++) {
+		ptrs[i] = rte_fastmem_alloc(sizes[i], 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc(%zu) failed: rte_errno=%d",
+			sizes[i], rte_errno);
+		memset(ptrs[i], 0x5a, sizes[i]);
+	}
+
+	for (i = 0; i < RTE_DIM(sizes); i++)
+		rte_fastmem_free(ptrs[i]);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_alignment(void)
+{
+	static const size_t aligns[] = {
+		8, 16, 64, 256, 4096, 65536,
+	};
+	unsigned int i;
+	int rc;
+
+	for (i = 0; i < RTE_DIM(aligns); i++) {
+		void *p = rte_fastmem_alloc(1, aligns[i], 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=%zu) failed: rte_errno=%d",
+			aligns[i], rte_errno);
+		TEST_ASSERT((uintptr_t)p % aligns[i] == 0,
+			"pointer %p not aligned on %zu",
+			p, aligns[i]);
+		rte_fastmem_free(p);
+	}
+
+	/* Default (align=0) gives at least RTE_CACHE_LINE_SIZE. */
+	{
+		void *p = rte_fastmem_alloc(1, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=0) failed: rte_errno=%d", rte_errno);
+		TEST_ASSERT((uintptr_t)p % RTE_CACHE_LINE_SIZE == 0,
+			"default-align pointer %p not cache-line aligned",
+			p);
+		rte_fastmem_free(p);
+	}
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_zero_flag(void)
+{
+	uint8_t *p;
+	unsigned int i;
+	int rc;
+	bool all_zero = true;
+
+	/*
+	 * Dirty a slab first by allocating without F_ZERO, writing
+	 * a non-zero pattern, and freeing. A subsequent F_ZERO
+	 * allocation on the same slab must return zeroed memory.
+	 */
+	p = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "priming alloc failed");
+	memset(p, 0xff, 128);
+	rte_fastmem_free(p);
+
+	p = rte_fastmem_alloc(128, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_NOT_NULL(p, "F_ZERO alloc failed");
+	for (i = 0; i < 128; i++) {
+		if (p[i] != 0) {
+			all_zero = false;
+			break;
+		}
+	}
+	TEST_ASSERT(all_zero, "F_ZERO returned non-zero byte at offset %u", i);
+
+	rte_fastmem_free(p);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_reuse(void)
+{
+	void *first, *second;
+	int rc;
+
+	first = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(first, "first alloc failed");
+	rte_fastmem_free(first);
+
+	second = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(second, "second alloc failed");
+
+	/*
+	 * The slab's free list is LIFO, so the most recently freed
+	 * object is at the head of the list. A subsequent alloc in
+	 * the same class returns it.
+	 */
+	TEST_ASSERT_EQUAL(first, second,
+		"free + alloc did not reuse: first=%p second=%p",
+		first, second);
+
+	rte_fastmem_free(second);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_many_in_class(void)
+{
+	/*
+	 * Allocate more objects in one class than fit in a single
+	 * slab, forcing the bin to pull a second slab. This
+	 * exercises the partial->full transition and the cross-slab
+	 * allocation path.
+	 */
+	enum { CLASS_SIZE = 8, COUNT = 300000 };
+	void **ptrs;
+	unsigned int i;
+	int rc;
+
+	ptrs = calloc(COUNT, sizeof(*ptrs));
+	TEST_ASSERT_NOT_NULL(ptrs, "calloc for test ptrs failed");
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(CLASS_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d",
+			i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	free(ptrs);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket(void)
+{
+	void *p;
+	int socket_id;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	rte_fastmem_free(p);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing(void)
+{
+	void *small, *large;
+	int rc;
+
+	/*
+	 * Allocate and free a small object, forcing a block to be
+	 * assigned to the small class and then returned to the
+	 * free-block pool. A subsequent allocation in a different
+	 * class must be able to reuse that block.
+	 */
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+	rte_fastmem_free(small);
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large, "large alloc failed");
+	rte_fastmem_free(large);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing_no_growth(void)
+{
+	struct rte_fastmem_stats stats;
+	void *small, *large;
+	uint64_t after_small;
+	int rc;
+
+	/*
+	 * Stronger version of test_alloc_block_repurposing: assert
+	 * that the cross-class allocation does not grow the
+	 * backing memory (bytes_backing stays flat). Because the
+	 * free-block pool is shared across size classes — not
+	 * partitioned per class — the block freed from the small
+	 * class must serve the large allocation without triggering
+	 * a new memzone reservation.
+	 */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, (uint64_t)0,
+		"unexpected pre-alloc bytes_backing: %" PRIu64,
+		stats.bytes_backing);
+
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT(stats.bytes_backing > 0,
+		"bytes_backing did not grow on first alloc");
+	after_small = stats.bytes_backing;
+
+	rte_fastmem_free(small);
+	rte_fastmem_cache_flush();
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large,
+		"large alloc failed: rte_errno=%d", rte_errno);
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, after_small,
+		"cross-class alloc grew backing memory from %" PRIu64
+		" to %" PRIu64,
+		after_small, stats.bytes_backing);
+
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_null(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_free(NULL);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_content_integrity(void)
+{
+	/*
+	 * Allocate a batch of objects, fill each with a distinct
+	 * byte pattern, then verify that each object kept its own.
+	 * This catches header overwrites (slab header corrupted by
+	 * object access) and slot-overlap bugs (two pointers pointing
+	 * at overlapping slots).
+	 */
+	enum { N = 256, SIZE = 128 };
+	uint8_t *ptrs[N];
+	unsigned int i, j;
+	int rc;
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)i, SIZE);
+	}
+
+	for (i = 0; i < N; i++)
+		for (j = 0; j < SIZE; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], (uint8_t)i,
+				"corruption at ptrs[%u][%u]: got 0x%x, want 0x%x",
+				i, j, ptrs[i][j], (uint8_t)i);
+
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_too_big(void)
+{
+	void *p;
+	int rc;
+
+	/*
+	 * A small size with an alignment larger than the maximum
+	 * size class cannot be served. The class selected must be
+	 * large enough for the alignment, but no such class exists.
+	 */
+	rte_errno = 0;
+	p = rte_fastmem_alloc(1, rte_fastmem_max_size() * 2, 0);
+	TEST_ASSERT_NULL(p,
+		"alloc with align>max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_one(void)
+{
+	void *p;
+	int rc;
+
+	/* align=1 is a valid power of 2 and must be accepted. */
+	p = rte_fastmem_alloc(8, 1, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8, 1) failed: rte_errno=%d",
+		rte_errno);
+	rte_fastmem_free(p);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket_numa_placement(void)
+{
+	void *p;
+	int socket_id;
+	struct rte_memseg *ms;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	/*
+	 * Look up the memseg backing this pointer and verify its
+	 * socket. Skip the check if the lookup fails (e.g., in
+	 * --no-huge mode, rte_mem_virt2memseg() may not find the
+	 * memseg behind fastmem's allocations).
+	 */
+	ms = rte_mem_virt2memseg(p, NULL);
+	if (ms != NULL) {
+		TEST_ASSERT_EQUAL(ms->socket_id, socket_id,
+			"alloc on socket %d landed on socket %d",
+			socket_id, ms->socket_id);
+	}
+
+	rte_fastmem_free(p);
+
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 3 tests: per-lcore caches.
+ */
+
+static int
+test_cache_flush(void)
+{
+	void *p;
+	int rc;
+
+	/*
+	 * Alloc and free one object, leaving it in the cache, then
+	 * flush. Whether a subsequent alloc returns the same pointer
+	 * is unspecified, so that is not asserted. The test only
+	 * checks that the flush does not crash and that a follow-up
+	 * alloc still works.
+	 */
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "first alloc failed");
+	rte_fastmem_free(p);
+
+	rte_fastmem_cache_flush();
+
+	/* Flush again — must be idempotent. */
+	rte_fastmem_cache_flush();
+
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "post-flush alloc failed");
+	rte_fastmem_free(p);
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_cache_flush();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_exceeds_capacity(void)
+{
+	/*
+	 * Free more objects at a single size class than the cache
+	 * capacity (64 for classes <= 4 KiB). This forces the
+	 * cache-drain slow path and verifies no corruption.
+	 */
+	enum { COUNT = 200, SIZE = 64 };
+	void *ptrs[COUNT];
+	unsigned int i;
+	int rc;
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	/* Re-allocating the same count should still work. */
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"re-alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+
+	return TEST_SUCCESS;
+}
+
+struct non_eal_args {
+	int ok;
+	char pad[64];
+};
+
+static uint32_t
+non_eal_thread_main(void *arg)
+{
+	struct non_eal_args *args = arg;
+	uint8_t *p;
+
+	p = rte_fastmem_alloc(128, 0, 0);
+	if (p == NULL)
+		return 1;
+
+	memset(p, 0x7e, 128);
+
+	rte_fastmem_free(p);
+
+	args->ok = 1;
+	return 0;
+}
+
+static int
+test_non_eal_thread(void)
+{
+	rte_thread_t thread_id;
+	struct non_eal_args args = { 0 };
+	int rc;
+
+	rc = rte_thread_create(&thread_id, NULL, non_eal_thread_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(args.ok, 1,
+		"non-EAL thread did not complete alloc/free successfully");
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_returns_memory(void)
+{
+	/*
+	 * When an entire slab's worth of objects is freed, the
+	 * slab's block is returned to the free-block pool and can
+	 * be reassigned to another size class. Verify the cache
+	 * does not permanently hold objects that prevent this.
+	 *
+	 * Allocate more objects in one class than the per-lcore
+	 * cache holds, free them all, then flush the cache. After
+	 * the flush, all cached objects are drained to their bins
+	 * and empty slabs are returned to the block pool.
+	 */
+	enum { N = 200, SIZE = 64 };
+	void *ptrs[N];
+	unsigned int i;
+	int rc;
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rte_fastmem_cache_flush();
+
+	/*
+	 * An allocation in a completely different class should
+	 * succeed now, having access to any blocks freed by the
+	 * flush.
+	 */
+	{
+		void *other = rte_fastmem_alloc(65536, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(other,
+			"post-flush cross-class alloc failed");
+		rte_fastmem_free(other);
+	}
+
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_basic(void)
+{
+	enum { N = 32 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	/* Verify all pointers are non-NULL and distinct. */
+	for (unsigned int i = 0; i < N; i++) {
+		TEST_ASSERT_NOT_NULL(ptrs[i], "ptrs[%u] is NULL", i);
+		for (unsigned int j = 0; j < i; j++)
+			TEST_ASSERT(ptrs[i] != ptrs[j],
+				"ptrs[%u] == ptrs[%u]", i, j);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_zero_flag(void)
+{
+	enum { N = 8, SIZE = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, SIZE, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	for (unsigned int i = 0; i < N; i++) {
+		uint8_t *p = ptrs[i];
+
+		for (unsigned int b = 0; b < SIZE; b++)
+			TEST_ASSERT_EQUAL(p[b], 0,
+				"ptrs[%u][%u] != 0", i, b);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_exceeds_cache(void)
+{
+	/* Allocate more than cache capacity (64) in one bulk call. */
+	enum { N = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk(%d) failed: %d", N, rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_socket(void)
+{
+	enum { N = 16 };
+	void *ptrs[N];
+	int socket_id;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no sockets");
+
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* Repeat with SOCKET_ID_ANY. */
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket(ANY) failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_bulk(void)
+{
+	enum { N = 64 };
+	void *ptrs[N];
+	int rc;
+
+	/* Allocate individually, free in bulk. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* Verify memory is reusable. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "re-alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_classes(void)
+{
+	size_t sizes[32];
+	unsigned int n;
+
+	n = rte_fastmem_classes(NULL);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+
+	n = rte_fastmem_classes(sizes);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+	TEST_ASSERT_EQUAL(sizes[0], (size_t)8, "class 0 != 8");
+	TEST_ASSERT_EQUAL(sizes[n - 1], (size_t)(1 << 20),
+		"last class != 1 MiB");
+
+	for (unsigned int i = 0; i < n; i++) {
+		TEST_ASSERT(sizes[i] != 0 && (sizes[i] & (sizes[i] - 1)) == 0,
+			"class %u size %zu not power of 2", i, sizes[i]);
+		if (i > 0)
+			TEST_ASSERT(sizes[i] > sizes[i - 1],
+				"classes not ascending at %u", i);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_class(void)
+{
+	enum { N = 10 };
+	struct rte_fastmem_class_stats cs;
+	void *ptrs[N];
+	int rc;
+
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.class_size, (size_t)64, "wrong class_size");
+	TEST_ASSERT(cs.alloc_cache_hits + cs.alloc_cache_misses == N,
+		"alloc count != N: hits=%" PRIu64 " misses=%" PRIu64,
+		cs.alloc_cache_hits, cs.alloc_cache_misses);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)N, "in_use != N");
+
+	for (unsigned int i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class after free failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)0, "in_use != 0 after free");
+
+	/* Invalid class size. */
+	rc = rte_fastmem_stats_class(13, &cs);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad size");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore(void)
+{
+	struct rte_fastmem_lcore_stats ls;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore failed: %d", rc);
+	TEST_ASSERT(ls.alloc_cache_hits + ls.alloc_cache_misses > 0,
+		"no alloc activity on this lcore");
+
+	rte_fastmem_free(ptr);
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore after free failed: %d", rc);
+	TEST_ASSERT(ls.free_cache_hits + ls.free_cache_misses > 0,
+		"no free activity on this lcore");
+
+	/* Invalid lcore. */
+	rc = rte_fastmem_stats_lcore(RTE_MAX_LCORE, &ls);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad lcore");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore_class(void)
+{
+	struct rte_fastmem_lcore_class_stats lcs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(256, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore_class(rte_lcore_id(), 256, &lcs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(lcs.class_size, (size_t)256, "wrong class_size");
+	TEST_ASSERT(lcs.alloc_cache_hits + lcs.alloc_cache_misses > 0,
+		"no alloc activity");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_reset(void)
+{
+	struct rte_fastmem_stats gs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+	rte_fastmem_free(ptr);
+
+	rte_fastmem_stats_reset();
+
+	rc = rte_fastmem_stats(&gs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(gs.alloc_total, (uint64_t)0,
+		"alloc_total not zero after reset");
+	TEST_ASSERT_EQUAL(gs.free_total, (uint64_t)0,
+		"free_total not zero after reset");
+
+	return TEST_SUCCESS;
+}
+
+
+#define MIXED_LONG_LIVED_COUNT 25
+#define MIXED_SHORT_LIVED_ITERS 1000
+#define MIXED_MIN_LCORES 3
+
+static const size_t mixed_long_sizes[] = { 64, 256, 4096 };
+static const size_t mixed_short_sizes[] = { 8, 16, 32, 64, 128, 256, 512, 1024 };
+
+struct mixed_worker_args {
+	uint32_t seed;
+	int result;
+};
+
+static uint32_t
+xorshift32(uint32_t *state)
+{
+	uint32_t x = *state;
+
+	x ^= x << 13;
+	x ^= x >> 17;
+	x ^= x << 5;
+	*state = x;
+	return x;
+}
+
+static int
+mixed_worker(void *arg)
+{
+	struct mixed_worker_args *args = arg;
+	uint32_t seed = args->seed;
+	void *long_lived[MIXED_LONG_LIVED_COUNT];
+	size_t long_sizes[MIXED_LONG_LIVED_COUNT];
+	unsigned int i;
+
+	/* Allocate long-lived objects of mixed sizes. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		long_sizes[i] = mixed_long_sizes[i % RTE_DIM(mixed_long_sizes)];
+		long_lived[i] = rte_fastmem_alloc(long_sizes[i], 0, 0);
+		if (long_lived[i] == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(long_lived[i], (int)(i + 1), long_sizes[i]);
+	}
+
+	/* Rapidly cycle short-lived objects. */
+	for (i = 0; i < MIXED_SHORT_LIVED_ITERS; i++) {
+		size_t sz = mixed_short_sizes[xorshift32(&seed) %
+					      RTE_DIM(mixed_short_sizes)];
+		uint8_t pattern = (uint8_t)(i & 0xff);
+		uint8_t *p;
+
+		p = rte_fastmem_alloc(sz, 0, 0);
+		if (p == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(p, pattern, sz);
+
+		/* Verify before freeing. */
+		for (size_t j = 0; j < sz; j++) {
+			if (p[j] != pattern) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(p);
+	}
+
+	/* Verify long-lived objects are still intact. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		uint8_t *bytes = long_lived[i];
+		uint8_t expected = (uint8_t)(i + 1);
+
+		for (size_t j = 0; j < long_sizes[i]; j++) {
+			if (bytes[j] != expected) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(long_lived[i]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_mixed_lifetimes_multi_lcore(void)
+{
+	struct mixed_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int count = 0;
+	struct rte_fastmem_stats stats;
+	int rc;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		count++;
+
+	if (count < MIXED_MIN_LCORES) {
+		printf("Not enough worker lcores (%u < %u), skipping\n",
+		       count, MIXED_MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	/* Launch workers with distinct seeds. */
+	uint32_t seed = 0xdeadbeef;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].seed = seed;
+		args[lcore_id].result = TEST_FAILED;
+		seed += 0x12345678;
+		rte_eal_remote_launch(mixed_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	/* Check all workers succeeded. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	/* Verify no memory leak. */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero after test: %" PRIu64,
+		stats.bytes_in_use);
+
+
+	return TEST_SUCCESS;
+}
+
+
+/*
+ * Memory limit tests.
+ *
+ * FASTMEM_MEMZONE_SIZE is 128 MiB. We use a limit of 128 MiB
+ * (one memzone) for most tests, and large objects (256 KiB) to
+ * exhaust slabs quickly.
+ */
+
+#define LIMIT_ONE_MZ ((size_t)128 << 20)
+#define LIMIT_OBJ_SIZE ((size_t)256 * 1024)
+
+static int
+test_memory_limit_basic(void)
+{
+	int rc;
+
+	rc = rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+	TEST_ASSERT_EQUAL(rc, 0, "set_limit failed: %d", rc);
+
+	const size_t got = rte_fastmem_get_limit(rte_socket_id_by_idx(0));
+	TEST_ASSERT_EQUAL(got, LIMIT_ONE_MZ,
+		"get_limit mismatch: %zu", got);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ + 1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "second reserve should have failed");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_exhaustion(void)
+{
+	enum { MAX_PTRS = 1024 };
+	void *ptrs[MAX_PTRS];
+	unsigned int count = 0;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < MAX_PTRS; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+
+	TEST_ASSERT(count > 0, "should have allocated at least one");
+	TEST_ASSERT(count < MAX_PTRS, "should have hit the limit");
+	TEST_ASSERT_EQUAL(rte_errno, ENOMEM, "expected ENOMEM, got %d", rte_errno);
+
+	rte_fastmem_free(ptrs[count - 1]);
+	void *p = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc after free should succeed");
+	rte_fastmem_free(p);
+
+	for (unsigned int i = 0; i < count - 1; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_zero_blocks_growth(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "reserve with limit=0 should fail");
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc with limit=0 should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_below_current(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve failed: %d", rc);
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 1);
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc from existing backing should work");
+	rte_fastmem_free(p);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ * 2, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "growth beyond limit should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_socket_id_any(void)
+{
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 42);
+
+	for (unsigned int i = 0; i < rte_socket_count(); i++) {
+		const int sid = rte_socket_id_by_idx(i);
+		const size_t lim = rte_fastmem_get_limit(sid);
+
+		TEST_ASSERT_EQUAL(lim, (size_t)42,
+			"socket %d limit mismatch: %zu", sid, lim);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_unlimited(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+	rte_fastmem_set_limit(SOCKET_ID_ANY, SIZE_MAX);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve after reset failed: %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_integrity_under_oom(void)
+{
+	enum {
+		N = 128, OBJ_SIZE = 1024, EXTRA_MAX = 1024
+	};
+	uint8_t *ptrs[N];
+	void *extra[EXTRA_MAX];
+	unsigned int n_extra = 0;
+	unsigned int i;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(OBJ_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)(i & 0xff), OBJ_SIZE);
+	}
+
+	/* Exhaust remaining backing with large objects. */
+	for (n_extra = 0; n_extra < EXTRA_MAX; n_extra++) {
+		extra[n_extra] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (extra[n_extra] == NULL)
+			break;
+	}
+
+	/* Verify original objects are intact. */
+	for (i = 0; i < N; i++) {
+		const uint8_t expected = (uint8_t)(i & 0xff);
+		for (unsigned int j = 0; j < OBJ_SIZE; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], expected,
+				"corruption at [%u][%u]", i, j);
+	}
+
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+	for (i = 0; i < n_extra; i++)
+		rte_fastmem_free(extra[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_bulk_alloc_oom(void)
+{
+	enum { BULK_N = 64 };
+	enum { DRAIN_MAX = 512 };
+	void *ptrs[BULK_N];
+	void *drain[DRAIN_MAX];
+	unsigned int drained = 0;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (drained = 0; drained < DRAIN_MAX; drained++) {
+		drain[drained] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (drain[drained] == NULL)
+			break;
+	}
+
+	/* Free a few objects: enough for some allocations, but not BULK_N. */
+	const unsigned int freed = RTE_MIN(drained, 4u);
+	for (unsigned int i = 0; i < freed; i++)
+		rte_fastmem_free(drain[--drained]);
+
+	rc = rte_fastmem_alloc_bulk(ptrs, BULK_N, LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT(rc < 0, "bulk alloc should fail");
+
+	for (unsigned int i = 0; i < drained; i++)
+		rte_fastmem_free(drain[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_recovery_after_free(void)
+{
+	enum { MAX_PTRS = 512 };
+	void *ptrs[MAX_PTRS];
+	unsigned int count = 0;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < MAX_PTRS; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+	TEST_ASSERT(count > 0 && count < MAX_PTRS,
+		"expected partial fill, got %u", count);
+
+	const unsigned int half = count / 2;
+	for (unsigned int i = 0; i < half; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	for (unsigned int i = 0; i < half; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "recovery alloc[%u] failed", i);
+	}
+
+	for (unsigned int i = 0; i < count; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct limit_worker_args {
+	unsigned int alloc_count;
+	int result;
+};
+
+static int
+limit_worker(void *arg)
+{
+	struct limit_worker_args *args = arg;
+	enum { MAX_PTRS = 128 };
+	void *ptrs[MAX_PTRS];
+	unsigned int i;
+
+	args->alloc_count = 0;
+
+	for (i = 0; i < MAX_PTRS; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[i] == NULL)
+			break;
+		memset(ptrs[i], 0xab, LIMIT_OBJ_SIZE);
+		args->alloc_count++;
+	}
+
+	for (unsigned int j = 0; j < args->alloc_count; j++) {
+		uint8_t *bytes = ptrs[j];
+		for (size_t k = 0; k < LIMIT_OBJ_SIZE; k++) {
+			if (bytes[k] != 0xab) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(ptrs[j]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_memory_limit_multi_lcore_oom(void)
+{
+	struct limit_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int worker_count = 0;
+	int rc;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		worker_count++;
+
+	if (worker_count < 2) {
+		printf("Not enough workers (%u < 2), skipping\n", worker_count);
+		return TEST_SKIPPED;
+	}
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].result = TEST_FAILED;
+		rte_eal_remote_launch(limit_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	struct rte_fastmem_stats stats;
+	rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero: %" PRIu64, stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+static int
+fastmem_setup(void)
+{
+	return rte_fastmem_init();
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_deinit();
+}
+
+static struct unit_test_suite fastmem_lifecycle_testsuite = {
+	.suite_name = "fastmem lifecycle tests",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_init_deinit),
+		TEST_CASE(test_init_is_not_idempotent),
+		TEST_CASE(test_deinit_without_init),
+		TEST_CASE(test_max_size),
+		TEST_CASE(test_reserve_without_init),
+		TEST_CASE(test_cache_flush_without_init),
+		TEST_CASE(test_classes),
+		TEST_CASES_END()
+	}
+};
+
+static struct unit_test_suite fastmem_functional_testsuite = {
+	.suite_name = "fastmem functional tests",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_multiple_memzones),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_cumulative),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_invalid_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_any_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_invalid_align),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_various_sizes),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_alignment),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_reuse),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_many_in_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing_no_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_null),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_content_integrity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_one),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket_numa_placement),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_exceeds_capacity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_non_eal_thread),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush_returns_memory),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_exceeds_cache),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_bulk),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_reset),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_mixed_lifetimes_multi_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_exhaustion),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_zero_blocks_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_below_current),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_socket_id_any),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_unlimited),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_integrity_under_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_bulk_alloc_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_recovery_after_free),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_multi_lcore_oom),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_fastmem(void)
+{
+	int rc;
+
+	rc = unit_test_suite_runner(&fastmem_lifecycle_testsuite);
+	if (rc != 0)
+		return rc;
+
+	return unit_test_suite_runner(&fastmem_functional_testsuite);
+}
+
+REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_OK, ASAN_OK, test_fastmem);
diff --git a/app/test/test_fastmem_perf.c b/app/test/test_fastmem_perf.c
new file mode 100644
index 0000000000..9200847847
--- /dev/null
+++ b/app/test/test_fastmem_perf.c
@@ -0,0 +1,997 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_malloc.h>
+#include <rte_mempool.h>
+#include <rte_stdatomic.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define TEST_LOG(...) printf(__VA_ARGS__)
+
+static const size_t SIZES[] = { 8, 64, 256, 1024, 4096 };
+#define N_SIZES RTE_DIM(SIZES)
+
+/* Number of ops for warmup and measurement. */
+#define WARMUP_OPS 20000u
+#define MEASURE_OPS 2000000u
+
+/* Buffer for scenarios that allocate N then free N. */
+#define BATCH_N 256
+
+/*
+ * Allocator vtable: a thin adapter exposing single/bulk alloc
+ * and free plus per-allocator setup/teardown. Each scenario calls
+ * these indirectly so one timing loop serves all allocators.
+ */
+struct allocator {
+	const char *name;
+	int (*setup)(size_t size, unsigned int n_max);
+	void (*teardown)(void);
+	void *(*alloc)(void);
+	void (*free_obj)(void *ptr);
+	int (*alloc_bulk)(void **ptrs, unsigned int n);
+	void (*free_bulk)(void **ptrs, unsigned int n);
+};
+
+/* Fastmem adapter -------------------------------------------------- */
+
+static size_t fastmem_size;
+
+static int
+fastmem_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	fastmem_size = size;
+	return 0;
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_alloc(void)
+{
+	return rte_fastmem_alloc(fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free(void *ptr)
+{
+	rte_fastmem_free(ptr);
+}
+
+/* Mempool adapter -------------------------------------------------- */
+
+static struct rte_mempool *mempool_pool;
+
+static int
+mempool_setup(size_t size, unsigned int n_max)
+{
+	char name[RTE_MEMPOOL_NAMESIZE];
+	unsigned int cache_size;
+
+	/*
+	 * Pool size must accommodate the full batch burst plus
+	 * per-lcore cache capacity. Use mempool's maximum cache
+	 * size so its cached hot path is what gets measured.
+	 */
+	cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE;
+
+	snprintf(name, sizeof(name), "fmperf_mp_%zu", size);
+	mempool_pool = rte_mempool_create(name, n_max + cache_size * 2,
+			size, cache_size, 0, NULL, NULL, NULL, NULL,
+			SOCKET_ID_ANY, 0);
+	if (mempool_pool == NULL) {
+		TEST_LOG("mempool_create(%zu) failed\n", size);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+mempool_teardown(void)
+{
+	rte_mempool_free(mempool_pool);
+	mempool_pool = NULL;
+}
+
+static void * __rte_noinline
+mempool_alloc_one(void)
+{
+	void *obj = NULL;
+
+	if (rte_mempool_get(mempool_pool, &obj) < 0)
+		return NULL;
+	return obj;
+}
+
+static void __rte_noinline
+mempool_free_one(void *ptr)
+{
+	rte_mempool_put(mempool_pool, ptr);
+}
+
+/* rte_malloc adapter ----------------------------------------------- */
+
+static size_t malloc_size;
+
+static int
+malloc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	malloc_size = size;
+	return 0;
+}
+
+static void
+malloc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+malloc_alloc(void)
+{
+	return rte_malloc(NULL, malloc_size, 0);
+}
+
+static void __rte_noinline
+malloc_free(void *ptr)
+{
+	rte_free(ptr);
+}
+
+/* libc (glibc) malloc adapter -------------------------------------- */
+
+static size_t libc_size;
+
+static int
+libc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	/*
+	 * Round the size up to a multiple of the cache line size,
+	 * as aligned_alloc() requires size to be a multiple of the
+	 * alignment. Cache-line alignment matches the other
+	 * allocators' defaults and keeps the comparison honest.
+	 */
+	libc_size = RTE_ALIGN_CEIL(size, RTE_CACHE_LINE_SIZE);
+	return 0;
+}
+
+static void
+libc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+libc_alloc(void)
+{
+	return aligned_alloc(RTE_CACHE_LINE_SIZE, libc_size);
+}
+
+static void __rte_noinline
+libc_free(void *ptr)
+{
+	free(ptr);
+}
+
+/* Bulk adapters ---------------------------------------------------- */
+
+static int __rte_noinline
+fastmem_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_alloc_bulk(ptrs, n, fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_free_bulk(ptrs, n);
+}
+
+static int __rte_noinline
+mempool_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_mempool_get_bulk(mempool_pool, ptrs, n);
+}
+
+static void __rte_noinline
+mempool_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_mempool_put_bulk(mempool_pool, ptrs, n);
+}
+
+static int __rte_noinline
+generic_alloc_bulk(void **ptrs, unsigned int n, void *(*alloc_fn)(void))
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = alloc_fn();
+		if (ptrs[i] == NULL)
+			return -1;
+	}
+	return 0;
+}
+
+static int __rte_noinline
+malloc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, malloc_alloc);
+}
+
+static void __rte_noinline
+malloc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		malloc_free(ptrs[i]);
+}
+
+static int __rte_noinline
+libc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, libc_alloc);
+}
+
+static void __rte_noinline
+libc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		libc_free(ptrs[i]);
+}
+
+/* Adapter table ---------------------------------------------------- */
+
+static const struct allocator allocators[] = {
+	{ "fastmem",    fastmem_setup, fastmem_teardown, fastmem_alloc,     fastmem_free,     fastmem_alloc_bulk, fastmem_free_bulk },
+	{ "mempool",    mempool_setup, mempool_teardown, mempool_alloc_one, mempool_free_one, mempool_alloc_bulk, mempool_free_bulk },
+	{ "rte_malloc", malloc_setup,  malloc_teardown,  malloc_alloc,      malloc_free,      malloc_alloc_bulk,  malloc_free_bulk },
+	{ "libc",       libc_setup,    libc_teardown,    libc_alloc,        libc_free,        libc_alloc_bulk,    libc_free_bulk },
+};
+#define N_ALLOCATORS RTE_DIM(allocators)
+
+/*
+ * Scenario 1: tight alloc+free loop. A single object is cycled
+ * repeatedly. The LIFO path keeps the same pointer hot, giving
+ * a best-case measurement.
+ */
+static double
+run_tight(const struct allocator *alloc, size_t size)
+{
+	void *p;
+	uint64_t tsc;
+	unsigned int i;
+
+	if (alloc->setup(size, 1) < 0)
+		return -1.0;
+
+	/* Warmup. */
+	for (i = 0; i < WARMUP_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < MEASURE_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+	tsc = rte_rdtsc_precise() - tsc;
+
+	alloc->teardown();
+
+	return (double)tsc / MEASURE_OPS;
+err:
+	alloc->teardown();
+	return -1.0;
+}
+
+/*
+ * Scenario 2: allocate N, free N (FIFO free order). Exercises
+ * cache refill and drain paths when N exceeds cache capacity.
+ */
+static void
+run_batch(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	/* Pick iteration count so total ops ~= MEASURE_OPS. */
+	iters = MEASURE_OPS / BATCH_N;
+
+	/* Warmup. */
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 3: allocate N, free N in reverse order.
+ */
+static void
+run_batch_reverse(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	iters = MEASURE_OPS / BATCH_N;
+
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 4: multi-lcore alloc/work/free with a dummy-work
+ * baseline. Each worker runs a tight alloc → touch → free loop
+ * on its own lcore. A second run with the same dummy work but
+ * no allocator traffic establishes a baseline; the per-op
+ * allocator cost is reported as (alloc_run - baseline_run).
+ *
+ * Fixed size class and a fixed amount of dummy work per op —
+ * this scenario sweeps lcore count rather than size.
+ */
+#define MULTI_SIZE 256u
+#define MULTI_WORK_BYTES 64u
+#define MULTI_WORK_PASSES 8u   /* RMW passes over the work region. */
+#define MULTI_OPS 200000u
+#define MULTI_WARMUP 2000u
+#define MAX_MULTI_LCORES 32u
+
+/*
+ * Per-worker volatile sink. Each worker writes to its own
+ * slot, preventing dead-code elimination of touch_buffer() and
+ * avoiding cross-lcore cache-line sharing on the hot path.
+ * Padded to cache-line stride to prevent false sharing between
+ * neighboring workers' slots.
+ */
+struct worker_sink {
+	volatile uint64_t value;
+} __rte_cache_aligned;
+
+static struct worker_sink worker_sinks[RTE_MAX_LCORE];
+
+/*
+ * Out-of-line dummy workload: run MULTI_WORK_PASSES
+ * read-modify-write passes over the first 'bytes' of the
+ * buffer. Each pass reads what the previous pass wrote, so the
+ * compiler cannot unroll or parallelize across passes — the
+ * work scales linearly with MULTI_WORK_PASSES. Returns an
+ * accumulator so the caller can feed it into a volatile sink;
+ * without that, the compiler could elide the whole function.
+ *
+ * __rte_noinline so it looks identical to the compiler in both
+ * the baseline (pre-allocated scratch buffer) and alloc-path
+ * runs, making the cycle-delta subtraction valid.
+ *
+ * The work is tunably expensive in order to keep the
+ * per-iteration worker cost high relative to the allocator's
+ * critical section, so that even serialized allocators like
+ * rte_malloc spend most of their time outside the lock and the
+ * measured per-op allocator cost reflects its own work rather
+ * than its contention queue.
+ */
+static uint64_t __rte_noinline
+touch_buffer(void *buf, size_t bytes)
+{
+	uint64_t *p = buf;
+	size_t n = bytes / sizeof(uint64_t);
+	uint64_t acc = 0;
+	unsigned int pass;
+	size_t i;
+
+	/* Prime the buffer with a known pattern. */
+	for (i = 0; i < n; i++)
+		p[i] = i * 0x9E3779B97F4A7C15ULL;
+
+	/*
+	 * Dependent RMW passes: each pass reads p[i] written by
+	 * the previous pass, mixes the pass index in, and writes
+	 * back. The XOR into acc keeps the chain live.
+	 */
+	for (pass = 0; pass < MULTI_WORK_PASSES; pass++) {
+		for (i = 0; i < n; i++) {
+			uint64_t v = p[i];
+
+			v = v * 0xC2B2AE3D27D4EB4FULL + pass;
+			v ^= v >> 33;
+			p[i] = v;
+			acc ^= v;
+		}
+	}
+
+	return acc;
+}
+
+struct worker_args {
+	const struct allocator *alloc;
+	void *scratch;            /* baseline only; NULL => alloc path */
+	unsigned int iters;
+	unsigned int warmup;
+	unsigned int bulk_n;      /* 0 = single-object, >0 = bulk */
+	RTE_ATOMIC(bool) start_flag; /* barrier at worker entry */
+	uint64_t cycles;          /* out */
+	unsigned int ops;         /* out */
+	int err;                  /* out */
+};
+
+static int
+worker_run(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	/* Wait for start flag (spin-barrier set by main). */
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				return -1;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+
+	/* Measured loop. */
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				break;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i;
+
+	/* Publish accumulator to defeat dead-code elimination. */
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+static int
+worker_run_bulk(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	void *ptrs[BATCH_N];
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i, j;
+	unsigned int bulk_n = wa->bulk_n;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			return -1;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			break;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i * bulk_n;
+
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+/*
+ * Launch workers on the first 'n_workers' worker lcores, run
+ * either the baseline (scratches != NULL) or the alloc path
+ * (scratches == NULL), and return the mean per-op cycle cost
+ * averaged across participating workers.
+ *
+ * On any worker error, returns -1.0.
+ */
+static double
+run_multi_workers(const struct allocator *alloc, unsigned int n_workers,
+		void *const *scratches, unsigned int bulk_n)
+{
+	struct worker_args wargs[RTE_MAX_LCORE];
+	unsigned int worker_lcores[MAX_MULTI_LCORES];
+	unsigned int n = 0;
+	unsigned int lcore_id;
+	unsigned int i;
+	lcore_function_t *fn = bulk_n > 0 ? worker_run_bulk : worker_run;
+
+	/* Collect the first n_workers worker lcores. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		if (n >= n_workers)
+			break;
+		worker_lcores[n++] = lcore_id;
+	}
+	if (n < n_workers)
+		return -1.0;
+
+	/* Prepare per-worker args. */
+	for (i = 0; i < n_workers; i++) {
+		struct worker_args *wa = &wargs[worker_lcores[i]];
+
+		wa->alloc = alloc;
+		wa->scratch = scratches != NULL ? scratches[i] : NULL;
+		wa->iters = MULTI_OPS;
+		wa->warmup = MULTI_WARMUP;
+		wa->bulk_n = bulk_n;
+		rte_atomic_store_explicit(&wa->start_flag, false,
+				rte_memory_order_relaxed);
+	}
+
+	/* Launch workers. They spin on start_flag until released. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_remote_launch(fn, &wargs[worker_lcores[i]],
+				worker_lcores[i]);
+
+	/* Release all workers roughly simultaneously. */
+	for (i = 0; i < n_workers; i++)
+		rte_atomic_store_explicit(
+			&wargs[worker_lcores[i]].start_flag, true,
+			rte_memory_order_release);
+
+	/* Wait for completion. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_wait_lcore(worker_lcores[i]);
+
+	/* Aggregate: mean cycles per op across workers. */
+	{
+		double sum_cycles_per_op = 0.0;
+		unsigned int n_ok = 0;
+
+		for (i = 0; i < n_workers; i++) {
+			struct worker_args *wa = &wargs[worker_lcores[i]];
+
+			if (wa->err != 0 || wa->ops == 0)
+				return -1.0;
+			sum_cycles_per_op +=
+				(double)wa->cycles / (double)wa->ops;
+			n_ok++;
+		}
+		return sum_cycles_per_op / n_ok;
+	}
+}
+
+/*
+ * One sub-run of Scenario 4: given an allocator and a worker
+ * count, return (baseline, alloc_path) mean cycles per op.
+ */
+static void
+run_multi_lcore(const struct allocator *alloc, unsigned int n_workers,
+		unsigned int bulk_n, double *baseline, double *alloc_path)
+{
+	void *scratches[MAX_MULTI_LCORES] = {0};
+	unsigned int n_alloced = 0;
+	unsigned int i;
+
+	*baseline = -1.0;
+	*alloc_path = -1.0;
+
+	if (alloc->setup(MULTI_SIZE, n_workers * 64) < 0)
+		return;
+
+	/* Baseline: pre-allocate one scratch per worker. */
+	for (i = 0; i < n_workers; i++) {
+		scratches[i] = alloc->alloc();
+		if (scratches[i] == NULL)
+			goto err;
+		n_alloced++;
+	}
+
+	*baseline = run_multi_workers(alloc, n_workers, scratches, 0);
+
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	n_alloced = 0;
+
+	/* Alloc path: workers alloc+free each iter. */
+	*alloc_path = run_multi_workers(alloc, n_workers, NULL, bulk_n);
+
+	alloc->teardown();
+	return;
+err:
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	alloc->teardown();
+}
+
+/* Reporting -------------------------------------------------------- */
+
+static void
+print_header(const char *title)
+{
+	size_t i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < N_SIZES; i++)
+		TEST_LOG(" %10zu B", SIZES[i]);
+	TEST_LOG("\n");
+}
+
+static void
+print_row(const char *name, const double *values)
+{
+	size_t i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < N_SIZES; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %12s", "--");
+		else
+			TEST_LOG(" %12.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_header(const char *title, const unsigned int *lcore_counts,
+		unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < n_counts; i++)
+		TEST_LOG(" %8u lcore%c", lcore_counts[i],
+				lcore_counts[i] == 1 ? ' ' : 's');
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_row(const char *name, const double *values, unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < n_counts; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %14s", "--");
+		else
+			TEST_LOG(" %14.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+/* Driver ----------------------------------------------------------- */
+
+static int
+test_fastmem_perf(void)
+{
+	size_t i;
+	size_t a;
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	TEST_LOG("\nfastmem performance: fixed-size objects, single- and multi-lcore\n");
+	TEST_LOG("All numbers are TSC cycles.\n");
+
+	/* Scenario 1: tight alloc+free. */
+	print_header("Scenario 1: Single-object hot path — cycles per (alloc + free)");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			vals[i] = run_tight(&allocators[a], SIZES[i]);
+		print_row(allocators[a].name, vals);
+	}
+
+	/* Scenario 2: batched, FIFO free. */
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 3: batched, reverse free. */
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 4: multi-lcore alloc/work/free with baseline. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		/* Sweep lcore counts: 1, 2, 4, 8, ... up to max_workers. */
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		/* Ensure max_workers is the final column if not power of two. */
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 4 (Multi-lcore contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 4 parameters: size=%u B\n",
+				MULTI_SIZE);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a], lcore_counts[c],
+							0, &base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 4: Multi-lcore contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	/* Scenario 5: multi-lcore bulk alloc/work/free. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+		unsigned int bulk_n = 8;
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 5 (Multi-lcore bulk contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 5 parameters: size=%u B, "
+				"bulk=%u\n",
+				MULTI_SIZE, bulk_n);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a],
+							lcore_counts[c], bulk_n,
+							&base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 5: Multi-lcore bulk contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	TEST_LOG("\n");
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_perf_autotest, test_fastmem_perf);
diff --git a/app/test/test_fastmem_profile.c b/app/test/test_fastmem_profile.c
new file mode 100644
index 0000000000..9a5dc94018
--- /dev/null
+++ b/app/test/test_fastmem_profile.c
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+/*
+ * Minimal fastmem workloads intended for use with perf record /
+ * perf report. Each runs an alloc/free loop for a fixed duration
+ * so that sampling profilers can attribute cycles to individual
+ * functions and instructions within the fastmem hot path.
+ *
+ * Usage (or use fastmem_profile_cache_miss_autotest):
+ *   perf record -g -- dpdk-test --no-huge --no-pci -m 8192 \
+ *       -l 0 <<< fastmem_profile_cache_hit_autotest
+ *   perf report
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+/* Duration of each sub-test in TSC cycles (3 seconds' worth). */
+#define PROFILE_DURATION_CYCLES (3ULL * rte_get_tsc_hz())
+
+/* Allocation size for the profiling workload. */
+#define PROFILE_SIZE 256u
+
+/*
+ * Sub-test 1: tight alloc+free, exercises only the per-lcore
+ * cache (no bin interaction after warmup).
+ */
+static int
+profile_cache_hit(void)
+{
+	uint64_t deadline;
+	uint64_t ops = 0;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		void *p = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+
+		if (p == NULL)
+			return -1;
+		rte_fastmem_free(p);
+		ops++;
+	}
+
+	printf("  cache_hit: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+/*
+ * Sub-test 2: alloc N then free N, where N exceeds the cache
+ * capacity. This forces repeated cache refills and drains,
+ * exercising the bin lock and slab free-list traversal.
+ */
+#define PROFILE_BATCH 256u
+
+static int
+profile_cache_miss(void)
+{
+	void *ptrs[PROFILE_BATCH];
+	uint64_t deadline;
+	uint64_t ops = 0;
+	unsigned int i;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		for (i = 0; i < PROFILE_BATCH; i++) {
+			ptrs[i] = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+			if (ptrs[i] == NULL)
+				return -1;
+		}
+		for (i = 0; i < PROFILE_BATCH; i++)
+			rte_fastmem_free(ptrs[i]);
+		ops += PROFILE_BATCH;
+	}
+
+	printf("  cache_miss: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_hit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-hit workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_hit() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_miss(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-miss workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_miss() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_profile_cache_hit_autotest,
+		test_fastmem_profile_cache_hit);
+REGISTER_PERF_TEST(fastmem_profile_cache_miss_autotest,
+		test_fastmem_profile_cache_miss);
-- 
2.43.0



* [RFC v2 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 10:36 ` [RFC 3/3] app/test: add fastmem test suite Mattias Rönnblom
@ 2026-05-26  8:57   ` Mattias Rönnblom
  2026-05-26  8:57     ` [RFC v2 1/3] doc: add fastmem programming guide Mattias Rönnblom
                       ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-26  8:57 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Mattias Rönnblom

This RFC introduces fastmem, a general-purpose small-object allocator
for DPDK. It is intended to replace per-type mempools with a single
allocator that handles arbitrary sizes, grows on demand, and matches
mempool-level performance on the hot path.

Motivation
----------

DPDK applications commonly maintain many mempools — one per object
type (connections, sessions, timers, work items). Each must be sized
up front, wastes memory when over-provisioned, and cannot serve
objects of a different size. Fastmem eliminates this by accepting
arbitrary sizes at runtime, backed by a slab allocator that
repurposes memory across size classes as demand shifts.

Design
------

Three-layer architecture:

1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
   reserved lazily (or pre-reserved for deterministic latency).

2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
   The alignment enables O(1) slab lookup from any object pointer
   via bitmask — no radix tree or index structure. Slabs move
   freely between 18 power-of-2 size classes (8 B to 1 MiB).

3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
   path). Cache misses trigger bulk transfers to/from the shared
   bin under a spinlock.
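
As a rough illustration of the alignment trick in item 2 (this is not
code from the patch; the macro and function names are hypothetical):
because each slab is 2 MiB in size and 2 MiB-aligned, clearing the low
21 bits of an object's address gives the start of the slab it lives in.

```c
#include <stdint.h>

/*
 * Hypothetical constant for this sketch. The cover letter only states
 * that slabs are 2 MiB in size and 2 MiB-aligned.
 */
#define SKETCH_SLAB_SIZE ((uintptr_t)2 * 1024 * 1024)

/* Return the base address of the slab containing 'obj'. */
static inline uintptr_t
sketch_slab_base(const void *obj)
{
	return (uintptr_t)obj & ~(SKETCH_SLAB_SIZE - 1);
}

/* Return the byte offset of 'obj' from the start of its slab. */
static inline uintptr_t
sketch_slab_offset(const void *obj)
{
	return (uintptr_t)obj & (SKETCH_SLAB_SIZE - 1);
}
```

The lookup is a single AND, so its cost does not depend on how many
slabs exist. Where the real library keeps its per-slab bookkeeping
relative to that base address is not stated here.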

Key properties:

- Zero per-object metadata in the production build.
- NUMA-aware, with per-socket bins and free-slab pools.
- DMA-usable memory with O(1) virt-to-IOVA translation.
- Bulk alloc/free with all-or-nothing semantics.
- Backing memory never returned during lifetime (slabs recycled).
- Non-EAL threads supported (bypass cache, take bin lock).
- Secondary process support (lazy attach, no per-lcore caches).
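
The all-or-nothing bulk semantics listed above can be pictured as
follows. This is only a sketch built on libc malloc() with hypothetical
function names; it does not show how rte_fastmem_alloc_bulk() is
implemented, which this cover letter does not describe.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate 'n' objects of 'size' bytes each. Either all 'n' pointers
 * are filled in and 0 is returned, or nothing stays allocated and -1
 * is returned.
 */
static int
sketch_alloc_bulk(void **ptrs, unsigned int n, size_t size)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		ptrs[i] = malloc(size);
		if (ptrs[i] == NULL) {
			/* Roll back the objects obtained so far. */
			while (i > 0)
				free(ptrs[--i]);
			return -1;
		}
	}
	return 0;
}

/* Free 'n' objects previously returned by sketch_alloc_bulk(). */
static void
sketch_free_bulk(void **ptrs, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		free(ptrs[i]);
}
```

With this contract the caller never has to work out which entries of
the pointer array are valid after a failure.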

API surface
-----------

  rte_fastmem_init / deinit
  rte_fastmem_reserve
  rte_fastmem_set_limit / get_limit
  rte_fastmem_alloc / alloc_socket
  rte_fastmem_alloc_bulk / alloc_bulk_socket
  rte_fastmem_free / free_bulk
  rte_fastmem_hlookup / halloc / halloc_bulk / hfree / hfree_bulk
  rte_fastmem_virt2iova
  rte_fastmem_cache_flush
  rte_fastmem_max_size / classes
  rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
  rte_fastmem_stats_reset

All APIs are marked __rte_experimental.

Performance
-----------

The single-object hot path is roughly 2–3× the cost of mempool
and an order of magnitude faster than rte_malloc. Under
multi-lcore contention, fastmem scales similarly to mempool,
while rte_malloc collapses.

Limitations
-----------

- Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
- Power-of-2 classes only; worst-case internal fragmentation ~50%.
- Backing memory not reclaimable short of deinit.
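
To make the ~50% figure concrete, here is a small sketch of
power-of-2 rounding. It assumes a request is served from the smallest
class that fits its size and ignores any effect of the alignment
argument; the names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MIN_CLASS ((size_t)8)           /* smallest class: 8 B */
#define SKETCH_MAX_CLASS ((size_t)1024 * 1024) /* largest class: 1 MiB */

/*
 * Round 'size' up to the nearest power-of-2 class size.
 * Returns 0 if the request exceeds the largest class.
 */
static size_t
sketch_class_size(size_t size)
{
	size_t c = SKETCH_MIN_CLASS;

	if (size > SKETCH_MAX_CLASS)
		return 0;
	while (c < size)
		c <<= 1;
	return c;
}

/* Bytes lost to rounding for a request of 'size' bytes. */
static size_t
sketch_waste(size_t size)
{
	size_t c = sketch_class_size(size);

	assert(c != 0); /* caller must stay within the largest class */
	return c - size;
}
```

A 129-byte request lands in the 256-byte class and leaves 127 bytes
unused, just under half of the object. Doubling from 8 B to 1 MiB
gives 2^3 through 2^20, which is the 18 classes mentioned in the
design section.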

Future work
-----------

- Lcore-affine allocations (false-sharing-free by construction).
- Mempool ops driver for transparent drop-in use.
- Debug mode (cookies, double-free detection, poison-on-free).
- Telemetry integration.
- EAL integration, allowing EAL-internal subsystems to use
  fastmem for their small-object allocations.

Mattias Rönnblom (3):
  doc: add fastmem programming guide
  lib: add fastmem library
  app/test: add fastmem test suite

 app/test/meson.build                  |    3 +
 app/test/test_fastmem.c               | 1673 ++++++++++++++++++++++++
 app/test/test_fastmem_perf.c          | 1040 +++++++++++++++
 app/test/test_fastmem_profile.c       |  157 +++
 doc/api/doxy-api-index.md             |    1 +
 doc/api/doxy-api.conf.in              |    1 +
 doc/guides/prog_guide/fastmem_lib.rst |  314 +++++
 doc/guides/prog_guide/index.rst       |    1 +
 lib/fastmem/meson.build               |    6 +
 lib/fastmem/rte_fastmem.c             | 1694 +++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h             |  774 +++++++++++
 lib/meson.build                       |    1 +
 12 files changed, 5665 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h

-- 
2.43.0


* [RFC v2 1/3] doc: add fastmem programming guide
  2026-05-26  8:57   ` [RFC v2 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
@ 2026-05-26  8:57     ` Mattias Rönnblom
  2026-05-26  8:57     ` [RFC v2 2/3] lib: add fastmem library Mattias Rönnblom
  2026-05-26  8:57     ` [RFC v2 3/3] app/test: add fastmem test suite Mattias Rönnblom
  2 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-26  8:57 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Mattias Rönnblom

Add a programming guide for the fastmem library covering usage,
API overview, design, and implementation details.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 doc/guides/prog_guide/fastmem_lib.rst | 314 ++++++++++++++++++++++++++
 doc/guides/prog_guide/index.rst       |   1 +
 2 files changed, 315 insertions(+)
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst

diff --git a/doc/guides/prog_guide/fastmem_lib.rst b/doc/guides/prog_guide/fastmem_lib.rst
new file mode 100644
index 0000000000..564d34b78f
--- /dev/null
+++ b/doc/guides/prog_guide/fastmem_lib.rst
@@ -0,0 +1,314 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2026 Ericsson AB
+
+Fastmem Library
+===============
+
+The fastmem library is a fast, general-purpose small-object
+allocator for DPDK applications. It lets an application replace
+its many per-type mempools — each sized for a single object type
+— with a single allocator that handles arbitrary object sizes,
+grows on demand, and offers mempool-level performance for the
+common allocation and free paths.
+
+Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+supports bulk operations, and uses per-lcore caches to reduce
+shared-state contention. Unlike mempool, it does not require the
+caller to declare object sizes or counts up front.
+
+
+When to use fastmem
+-------------------
+
+Use fastmem when:
+
+* Small objects (up to 1 MiB) are allocated and freed on the
+  data path with low, predictable latency requirements.
+
+* Many object types of varying sizes exist and maintaining a
+  separate mempool for each is impractical.
+
+* DMA-usable memory with efficient virtual-to-IOVA translation
+  is needed.
+
+Do not use fastmem for allocations larger than 1 MiB. Use
+``rte_malloc()`` instead.
+
+
+Initialization and teardown
+----------------------------
+
+.. code-block:: c
+
+   /* At startup, after rte_eal_init(). */
+   rte_fastmem_init();
+
+   /* Optional: pre-reserve backing memory to avoid latency
+    * spikes from on-demand memzone reservation. */
+   rte_fastmem_reserve(64 * 1024 * 1024, SOCKET_ID_ANY);
+
+   /* ... application runs ... */
+
+   /* At shutdown, after all allocations have been freed. */
+   rte_fastmem_deinit();
+
+Neither ``rte_fastmem_init()`` nor ``rte_fastmem_deinit()`` is
+thread-safe; call them from the main lcore during startup and
+shutdown.
+
+
+Allocation and free
+-------------------
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(128, 0, 0);
+   /* Use obj... */
+   rte_fastmem_free(obj);
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's NUMA
+socket. Use ``rte_fastmem_alloc_socket()`` to target a specific
+socket or to enable cross-socket fallback with ``SOCKET_ID_ANY``.
+
+Alignment
+~~~~~~~~~
+
+When ``align`` is 0, the returned pointer is aligned to at least
+``RTE_CACHE_LINE_SIZE``. A non-zero ``align`` must be a power of
+two. Specifying an alignment smaller than ``RTE_CACHE_LINE_SIZE``
+is permitted but the returned object may then share a cache line
+with an adjacent allocation, risking false sharing.
+
+Zeroing
+~~~~~~~
+
+Pass ``RTE_FASTMEM_F_ZERO`` to receive zero-initialized memory:
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(256, 0, RTE_FASTMEM_F_ZERO);
+
+
+Bulk allocation and free
+-------------------------
+
+.. code-block:: c
+
+   void *ptrs[32];
+
+   if (rte_fastmem_alloc_bulk(ptrs, 32, 64, 0, 0) < 0)
+       /* handle error */;
+
+   /* Use objects... */
+
+   rte_fastmem_free_bulk(ptrs, 32);
+
+Bulk allocation has all-or-nothing semantics: either all
+requested objects are returned, or none are (and ``rte_errno``
+is set to ``ENOMEM``).
+
+Bulk free is most efficient when all objects belong to the same
+size class; in that case the objects are pushed into the
+per-lcore cache in a single operation.
+
+
+IOVA translation
+----------------
+
+Memory returned by fastmem is DMA-usable. To obtain the IOVA
+for use in device descriptors:
+
+.. code-block:: c
+
+   rte_iova_t iova = rte_fastmem_virt2iova(obj);
+
+The translation is O(1). The returned IOVA is valid for the
+lifetime of the allocation.
+
+
+NUMA awareness
+--------------
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's socket.
+``rte_fastmem_alloc_socket()`` accepts an explicit socket ID or
+``SOCKET_ID_ANY``:
+
+* Explicit socket: allocate only from that socket; fail with
+  ``ENOMEM`` if exhausted.
+
+* ``SOCKET_ID_ANY``: try the caller's local socket first, then
+  fall back to other sockets.
+
+
+Per-lcore caches
+----------------
+
+Each EAL thread has a private cache per size class. The common
+allocation and free paths operate entirely within this cache,
+avoiding locks. Cache misses (empty on alloc, full on free)
+trigger a bulk transfer to/from the shared bin under a lock.
+
+Non-EAL threads bypass the cache and take the bin lock on every
+operation.
+
+``rte_fastmem_cache_flush()`` drains the calling lcore's caches
+back to the shared bins. This is useful after bursty phases to
+release idle cached memory.
+
+
+Threading
+---------
+
+All allocation and free functions are thread-safe and may be
+called from any thread. An allocation made on one thread may be
+freed on any other.
+
+Fastmem uses internal spinlocks. A thread preempted while
+holding one delays other threads contending for the same lock
+(correctness is not affected, only latency).
+
+
+Pre-reserving memory
+--------------------
+
+By default, fastmem reserves backing memory lazily on first
+allocation. ``rte_fastmem_reserve(size, socket_id)`` forces
+reservation up front, ensuring subsequent allocations do not
+incur memzone-reservation latency:
+
+.. code-block:: c
+
+   /* Reserve 128 MiB on socket 0. */
+   rte_fastmem_reserve(128 * 1024 * 1024, 0);
+
+Once reserved, backing memory is never returned to the system
+during the allocator's lifetime.
+
+Memory limits
+~~~~~~~~~~~~~
+
+``rte_fastmem_set_limit(socket_id, max_bytes)`` caps how much
+backing memory may be reserved on a given socket. Once the limit is
+reached, allocations that would require new backing memory fail with
+``ENOMEM``. The default is ``SIZE_MAX`` (unlimited).
+``rte_fastmem_get_limit()`` returns the current limit for a socket.
+
+.. code-block:: c
+
+   /* Allow at most 256 MiB on socket 0. */
+   rte_fastmem_set_limit(0, 256 * 1024 * 1024);
+
+   /* Block all growth on socket 1. */
+   rte_fastmem_set_limit(1, 0);
+
+Pass ``SOCKET_ID_ANY`` to apply the same limit to all sockets.
+
+
+Size classes
+------------
+
+Fastmem uses power-of-two size classes from 8 bytes to 1 MiB
+(18 classes). A request for N bytes is served from the smallest
+class >= N. The maximum supported size is queryable via
+``rte_fastmem_max_size()``.
+
+With power-of-two classes, worst-case internal fragmentation is
+just under 50% (e.g., a 33-byte request occupies a 64-byte
+slot). For request sizes uniformly distributed within a class,
+the average waste is about 25% of the slot. In practice, DPDK
+workloads cluster at or near powers of two, so waste is lower.
+
+Requests exceeding the maximum are rejected with ``E2BIG``.
+
+
+Implementation
+--------------
+
+Fastmem organizes memory in three layers: backing memzones, slabs,
+and per-lcore caches.
+
+Backing memory and slabs
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Backing memory is obtained from EAL as 128 MiB IOVA-contiguous
+memzones, each aligned to 2 MiB. A memzone is partitioned into
+64 fixed-size, 2 MiB **slabs**. Slabs are the unit of memory
+that moves between size classes: a free slab can be assigned to
+any bin on demand, and an empty slab (all objects freed) returns
+to the free-slab pool for reuse by another size class.
+
+The 2 MiB slab alignment is the key structural property. Given
+any object pointer, the allocator recovers the owning slab by
+masking off the low 21 bits — no radix tree, hash table, or
+memzone lookup is needed. This makes the free path fast: a
+single pointer-mask load reaches the slab header, which
+identifies the size class and bin.
+
+Each slab starts with a one-cache-line header at offset 0. Slots
+are the size of the class and begin at the larger of the header
+size and the slot size, which keeps them naturally aligned.
+Objects carry no per-object metadata; the caller gets the full slot.
+
+Three-level allocation hierarchy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. **Per-lcore cache** — a bounded LIFO stack of free object
+   pointers, one per (lcore, size class, socket). Allocation
+   pops; free pushes. No lock is needed because only the owning
+   lcore accesses its cache.
+
+2. **Bin** — one per (size class, socket). Owns the partial and
+   full slab lists. A spinlock serializes bulk transfers between
+   the bin and per-lcore caches. Most traffic is absorbed by the
+   caches, so bin-lock contention is low.
+
+3. **Free-slab pool** — one per socket. A spinlock protects slab
+   acquisition and release. These events are rare relative to
+   object-level operations (a single small-object slab serves
+   thousands of allocations).
+
+On a cache miss (empty on alloc, full on free), the cache
+exchanges objects with the bin in bulk, targeting half-full to
+maximize headroom in both directions.
+
+Cache sizing
+~~~~~~~~~~~~
+
+Cache capacity varies by size class to bound per-lcore memory
+footprint:
+
+* Classes 8 B through 4 KiB: capacity 64.
+* Larger classes: capacity halves per class (32, 16, 8, 4),
+  flooring at 4.
+
+Even the largest classes remain cached. The capacity curve
+ensures that small, frequent allocations get the highest cache
+hit rate, while large allocations still avoid the bin lock on
+most operations.
+
+
+Statistics
+----------
+
+Fastmem maintains always-on, per-lcore counters that track
+allocation and free activity. Statistics are queryable at four
+levels of granularity: global summary, per size class, per lcore,
+and per lcore per class.
+
+``rte_fastmem_classes()`` returns the number of size classes and
+optionally fills an array with their sizes.
+
+See ``rte_fastmem.h`` for the full statistics API.
+
+
+Secondary Processes
+-------------------
+
+Fastmem works transparently in DPDK secondary processes. The shared
+state is discovered automatically on first allocation.
+
+Secondary processes do not use per-lcore caches; every allocation and
+free acquires the bin spinlock directly. This is acceptable for
+control-plane secondaries with low allocation rates. The primary
+process should pre-reserve sufficient backing memory with
+``rte_fastmem_reserve()`` since secondaries cannot grow the pool.
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index e6f24945b0..c85196c85e 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -28,6 +28,7 @@ Memory Management
     mempool_lib
     mbuf_lib
     multi_proc_support
+    fastmem_lib
 
 
 CPU Management
-- 
2.43.0


* [RFC v2 2/3] lib: add fastmem library
  2026-05-26  8:57   ` [RFC v2 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-26  8:57     ` [RFC v2 1/3] doc: add fastmem programming guide Mattias Rönnblom
@ 2026-05-26  8:57     ` Mattias Rönnblom
  2026-05-26 13:23       ` Stephen Hemminger
  2026-05-26  8:57     ` [RFC v2 3/3] app/test: add fastmem test suite Mattias Rönnblom
  2 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-26  8:57 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Mattias Rönnblom

Introduce fastmem, a fast general-purpose small-object allocator
for DPDK applications. It allows an application to replace its
many per-type mempools with a single allocator that handles
arbitrary sizes, grows on demand, and offers mempool-level
performance on the hot path.

Applications that manage many object types (connections, sessions,
work items, timers) currently maintain a separate mempool for each,
requiring upfront sizing and wasting memory on over-provisioned
pools. Fastmem removes both constraints.

Key properties:

 * Huge-page-backed, NUMA-aware, DMA-usable.
 * Per-lcore caches for lock-free alloc/free on EAL threads.
 * Bulk alloc and free APIs.
 * Power-of-two size classes from 8 B to 1 MiB.
 * Backing memory grows lazily; rte_fastmem_reserve() allows
   upfront reservation to avoid latency spikes.
 * Always-on per-lcore and per-class statistics.

Bounded to small objects; requests above rte_fastmem_max_size()
are rejected. Replacing rte_malloc is currently not a goal.

--

RFC v2:
 * Fix use-after-free in rte_fastmem_deinit() when caches were
   allocated cross-socket. Restructured teardown into three phases.
 * Add defensive bounds check to local_socket_id() final fallback.
 * Add secondary process support. Shared state is discovered lazily
   on first allocation; secondaries operate without per-lcore caches.
 * Add handle-based allocation API (rte_fastmem_hlookup,
   rte_fastmem_halloc, rte_fastmem_halloc_bulk).
 * Add test_alloc_cross_socket_deinit exercising cross-socket
   teardown path.
 * Fix clang -Wthread-safety-analysis warnings.
 * Move fastmem to alphabetical position in lib/meson.build.
 * Remove trailing double blank lines in test_fastmem.c.
 * Split programming guide into separate commit.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 doc/api/doxy-api-index.md |    1 +
 doc/api/doxy-api.conf.in  |    1 +
 lib/fastmem/meson.build   |    6 +
 lib/fastmem/rte_fastmem.c | 1694 +++++++++++++++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h |  774 +++++++++++++++++
 lib/meson.build           |    1 +
 6 files changed, 2477 insertions(+)
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h
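
Note (illustration only, not part of the patch): the slab lookup and
slot layout can be sketched in isolation as below. The constants
mirror the ones in rte_fastmem.c, except that the header size is
assumed to be 64 bytes here (the patch uses RTE_CACHE_LINE_SIZE).
The sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE_LOG2 21			/* 2 MiB, as in the patch */
#define SLAB_SIZE ((size_t)1 << SLAB_SIZE_LOG2)
#define SLAB_MASK (SLAB_SIZE - 1)
#define SLAB_HEADER_SIZE 64			/* assumed cache line size */

/* Base address of the 2 MiB-aligned slab that contains 'addr'. */
static uintptr_t
sketch_slab_base(uintptr_t addr)
{
	return addr & ~(uintptr_t)SLAB_MASK;
}

/*
 * Offset of the first slot: the header size for classes smaller than
 * the header, otherwise one full slot, so every slot stays aligned to
 * its own size.
 */
static size_t
sketch_slot0_offset(size_t class_size)
{
	return class_size < SLAB_HEADER_SIZE ? SLAB_HEADER_SIZE : class_size;
}

/* Number of slots of 'class_size' bytes that fit in one slab. */
static uint32_t
sketch_slot_count(size_t class_size)
{
	size_t offset = sketch_slot0_offset(class_size);

	return (uint32_t)((SLAB_SIZE - offset) / class_size);
}
```

Under these assumptions a 64 B slab holds 32767 objects, a 4 KiB
slab holds 511, and a 1 MiB slab holds exactly one, since the first
1 MiB of the slab is consumed by the header and alignment padding.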

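Note (illustration only, not part of the patch): the cache capacity
curve bounds how many bytes one full per-lcore cache can hold. The
sketch below restates the capacity rule from cache_capacity() with
local constants and adds a hypothetical sketch_cache_max_bytes()
helper. It counts only cached objects, not the cache structure.

```c
#include <stddef.h>

#define CACHE_BASE_CAPACITY 64		/* capacity up to the base class */
#define CACHE_FLOOR_CAPACITY 4		/* minimum capacity */
#define CACHE_BASE_CLASS_LOG2 12	/* 4 KiB */

/* Cache capacity, in objects, for a class of 2^class_log2 bytes. */
static unsigned int
sketch_cache_capacity(unsigned int class_log2)
{
	unsigned int cap;

	if (class_log2 <= CACHE_BASE_CLASS_LOG2)
		return CACHE_BASE_CAPACITY;

	/* Halve the capacity for each class above the base class. */
	cap = CACHE_BASE_CAPACITY >> (class_log2 - CACHE_BASE_CLASS_LOG2);

	return cap < CACHE_FLOOR_CAPACITY ? CACHE_FLOOR_CAPACITY : cap;
}

/* Upper bound on object bytes parked in one completely full cache. */
static size_t
sketch_cache_max_bytes(unsigned int class_log2)
{
	return (size_t)sketch_cache_capacity(class_log2) << class_log2;
}
```

From 4 KiB to 64 KiB the halving keeps a full cache at 256 KiB.
Above 64 KiB the floor of 4 takes over, so a full 1 MiB-class cache
can hold up to 4 MiB.
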
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 9296042119..7ebf1201ce 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -70,6 +70,7 @@ The public API headers are grouped by topics:
   [memzone](@ref rte_memzone.h),
   [mempool](@ref rte_mempool.h),
   [malloc](@ref rte_malloc.h),
+  [fastmem](@ref rte_fastmem.h),
   [memcpy](@ref rte_memcpy.h)
 
 - **timers**:
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index bedd944681..4355e9fb2d 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -43,6 +43,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/efd \
                           @TOPDIR@/lib/ethdev \
                           @TOPDIR@/lib/eventdev \
+                          @TOPDIR@/lib/fastmem \
                           @TOPDIR@/lib/fib \
                           @TOPDIR@/lib/gpudev \
                           @TOPDIR@/lib/graph \
diff --git a/lib/fastmem/meson.build b/lib/fastmem/meson.build
new file mode 100644
index 0000000000..6c7834608f
--- /dev/null
+++ b/lib/fastmem/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2026 Ericsson AB
+
+sources = files('rte_fastmem.c')
+headers = files('rte_fastmem.h')
+deps += ['eal']
diff --git a/lib/fastmem/rte_fastmem.c b/lib/fastmem/rte_fastmem.c
new file mode 100644
index 0000000000..84d97ac36f
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.c
@@ -0,0 +1,1694 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/queue.h>
+
+#include <eal_export.h>
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_eal.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_spinlock.h>
+
+#include <rte_fastmem.h>
+
+RTE_LOG_REGISTER_DEFAULT(fastmem_logtype, NOTICE);
+
+#define RTE_LOGTYPE_FASTMEM fastmem_logtype
+
+#define FASTMEM_LOG(level, ...) \
+	RTE_LOG_LINE(level, FASTMEM, "" __VA_ARGS__)
+
+#define FASTMEM_MEMZONE_SIZE_LOG2 27                            /* 128 MiB */
+#define FASTMEM_MEMZONE_SIZE ((size_t)1 << FASTMEM_MEMZONE_SIZE_LOG2)
+
+#define FASTMEM_SLAB_SIZE_LOG2 21                               /*   2 MiB */
+#define FASTMEM_SLAB_SIZE ((size_t)1 << FASTMEM_SLAB_SIZE_LOG2)
+#define FASTMEM_SLAB_MASK (FASTMEM_SLAB_SIZE - 1)
+
+#define FASTMEM_SLABS_PER_MEMZONE (FASTMEM_MEMZONE_SIZE / FASTMEM_SLAB_SIZE)
+
+#define FASTMEM_MAX_MEMZONES_PER_SOCKET 64
+
+#define FASTMEM_MIN_CLASS_LOG2 3                                /*   8 B */
+#define FASTMEM_MAX_CLASS_LOG2 20                               /*   1 MiB */
+#define FASTMEM_N_CLASSES (FASTMEM_MAX_CLASS_LOG2 - FASTMEM_MIN_CLASS_LOG2 + 1)
+
+#define FASTMEM_MIN_SIZE ((size_t)1 << FASTMEM_MIN_CLASS_LOG2)
+#define FASTMEM_MAX_ALLOC_SIZE ((size_t)1 << FASTMEM_MAX_CLASS_LOG2)
+
+#define FASTMEM_SLAB_HEADER_SIZE RTE_CACHE_LINE_SIZE
+
+#define FASTMEM_CACHE_BASE_CAPACITY 64
+#define FASTMEM_CACHE_FLOOR_CAPACITY 4
+#define FASTMEM_CACHE_BASE_CLASS_LOG2 12                        /* 4 KiB */
+
+struct fastmem_bin;
+
+/*
+ * Slab header at offset 0 of each 2 MiB slab. Either free (linked
+ * via next_free) or assigned to a bin (linked via list).
+ */
+struct __rte_aligned(FASTMEM_SLAB_HEADER_SIZE) fastmem_slab {
+	struct fastmem_bin *bin;
+	void *free_head;
+	uint32_t free_count;
+	uint32_t n_slots;
+	struct fastmem_slab *next_free;
+	TAILQ_ENTRY(fastmem_slab) list;
+	rte_iova_t iova_base;
+};
+
+TAILQ_HEAD(fastmem_slab_list, fastmem_slab);
+
+struct fastmem_bin {
+	rte_spinlock_t lock;
+	uint32_t slot_size;
+	uint32_t slots_per_slab;
+	uint32_t class_idx;
+	struct fastmem_slab_list partial;
+	struct fastmem_slab_list full;
+	int socket_id;
+	uint64_t slab_acquires;
+	uint64_t slab_releases;
+	uint32_t slabs_partial;
+	uint32_t slabs_full;
+};
+
+/* Per-(lcore, class, socket) bounded LIFO of free object pointers. */
+struct __rte_cache_aligned fastmem_cache {
+	uint32_t count;
+	uint32_t capacity;
+	uint32_t target;
+	uint64_t alloc_cache_hits;
+	uint64_t alloc_cache_misses;
+	uint64_t alloc_nomem;
+	uint64_t free_cache_hits;
+	uint64_t free_cache_misses;
+	void *objs[];
+};
+
+struct fastmem_socket_state {
+	rte_spinlock_t lock;
+	struct fastmem_slab *free_head;
+	size_t reserved_bytes;
+	size_t memory_limit;
+	unsigned int n_memzones;
+	unsigned int memzone_seq;
+	const struct rte_memzone *memzones[FASTMEM_MAX_MEMZONES_PER_SOCKET];
+	struct fastmem_bin bins[FASTMEM_N_CLASSES];
+	struct fastmem_cache *caches[RTE_MAX_LCORE][FASTMEM_N_CLASSES];
+};
+
+struct fastmem {
+	struct fastmem_socket_state sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct fastmem *fastmem;
+static const struct rte_memzone *fastmem_mz;
+static bool fastmem_is_primary; /* cached; avoids function call on hot path */
+
+static struct fastmem *
+fastmem_get(void)
+{
+	const struct rte_memzone *mz;
+
+	if (likely(fastmem != NULL))
+		return fastmem;
+
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		rte_errno = ENODEV;
+		return NULL;
+	}
+
+	mz = rte_memzone_lookup("fastmem_state");
+	if (mz == NULL) {
+		rte_errno = ENODEV;
+		return NULL;
+	}
+
+	fastmem_mz = mz;
+	fastmem = mz->addr;
+	return fastmem;
+}
+
+static inline unsigned int
+size_to_class(size_t size, size_t align)
+{
+	size_t effective;
+	unsigned int log2;
+
+	effective = size < FASTMEM_MIN_SIZE ? FASTMEM_MIN_SIZE : size;
+	if (align > effective)
+		effective = align;
+
+	log2 = 64u - rte_clz64(effective - 1);
+
+	if (log2 < FASTMEM_MIN_CLASS_LOG2)
+		log2 = FASTMEM_MIN_CLASS_LOG2;
+	if (log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+static inline size_t
+class_size(unsigned int class_idx)
+{
+	return (size_t)1 << (class_idx + FASTMEM_MIN_CLASS_LOG2);
+}
+
+static_assert(sizeof(struct fastmem_slab) == FASTMEM_SLAB_HEADER_SIZE,
+	"fastmem slab header must fit in exactly one cache line");
+static_assert(sizeof(struct fastmem_slab) <= FASTMEM_SLAB_SIZE,
+	"slab header larger than a slab makes no sense");
+
+static __rte_always_inline struct fastmem_slab *
+slab_of(void *obj)
+{
+	return (struct fastmem_slab *)
+		((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+}
+
+static inline size_t
+slab_slot0_offset(size_t class_size)
+{
+	return class_size < FASTMEM_SLAB_HEADER_SIZE ?
+		FASTMEM_SLAB_HEADER_SIZE : class_size;
+}
+
+static inline uint32_t
+slab_slot_count(size_t class_size)
+{
+	size_t offset = slab_slot0_offset(class_size);
+
+	return (uint32_t)((FASTMEM_SLAB_SIZE - offset) / class_size);
+}
+
+/* Must be called with bin->lock held. */
+static void
+slab_init(struct fastmem_bin *bin, struct fastmem_slab *slab)
+{
+	size_t slot_size = bin->slot_size;
+	size_t offset = slab_slot0_offset(slot_size);
+	uint32_t n = bin->slots_per_slab;
+	void *prev = NULL;
+	uint32_t i;
+
+	slab->bin = bin;
+	slab->n_slots = n;
+	slab->free_count = n;
+
+	/* Link slots in ascending order; pops yield descending addresses. */
+	for (i = 0; i < n; i++) {
+		void *slot = RTE_PTR_ADD(slab, offset + i * slot_size);
+		*(void **)slot = prev;
+		prev = slot;
+	}
+	slab->free_head = prev;
+}
+
+static int
+grow_socket(struct fastmem_socket_state *socket, int socket_id)
+{
+	char name[RTE_MEMZONE_NAMESIZE];
+	const struct rte_memzone *mz;
+	unsigned int i;
+
+	if (socket->reserved_bytes + FASTMEM_MEMZONE_SIZE > socket->memory_limit) {
+		FASTMEM_LOG(ERR,
+			"reserve would exceed memory_limit (%zu) on socket %d",
+			socket->memory_limit, socket_id);
+		return -ENOMEM;
+	}
+
+	if (socket->n_memzones == FASTMEM_MAX_MEMZONES_PER_SOCKET) {
+		FASTMEM_LOG(ERR,
+			"reached per-socket memzone cap (%u) on socket %d",
+			FASTMEM_MAX_MEMZONES_PER_SOCKET, socket_id);
+		return -ENOMEM;
+	}
+
+	snprintf(name, sizeof(name), "fastmem_%d_%u", socket_id,
+			socket->memzone_seq++);
+
+	mz = rte_memzone_reserve_aligned(name, FASTMEM_MEMZONE_SIZE,
+			socket_id, RTE_MEMZONE_IOVA_CONTIG,
+			FASTMEM_SLAB_SIZE);
+	if (mz == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to reserve %zu-byte memzone '%s' on socket %d: %s",
+			(size_t)FASTMEM_MEMZONE_SIZE, name, socket_id,
+			rte_strerror(rte_errno));
+		return -ENOMEM;
+	}
+
+	socket->memzones[socket->n_memzones++] = mz;
+	socket->reserved_bytes += FASTMEM_MEMZONE_SIZE;
+
+	for (i = 0; i < FASTMEM_SLABS_PER_MEMZONE; i++) {
+		struct fastmem_slab *slab = RTE_PTR_ADD(mz->addr,
+				i * FASTMEM_SLAB_SIZE);
+
+		slab->iova_base = mz->iova + i * FASTMEM_SLAB_SIZE;
+		slab->next_free = socket->free_head;
+		socket->free_head = slab;
+	}
+
+	FASTMEM_LOG(DEBUG,
+		"reserved memzone '%s' (%zu bytes) on socket %d; %zu slabs added",
+		name, (size_t)FASTMEM_MEMZONE_SIZE, socket_id,
+		(size_t)FASTMEM_SLABS_PER_MEMZONE);
+
+	return 0;
+}
+
+static struct fastmem_slab *
+slab_acquire(struct fastmem_socket_state *socket, int socket_id)
+{
+	struct fastmem_slab *slab;
+
+	rte_spinlock_lock(&socket->lock);
+
+	if (socket->free_head == NULL) {
+		int rc = grow_socket(socket, socket_id);
+
+		if (rc < 0) {
+			rte_spinlock_unlock(&socket->lock);
+			return NULL;
+		}
+	}
+
+	slab = socket->free_head;
+	socket->free_head = slab->next_free;
+	slab->next_free = NULL;
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return slab;
+}
+
+static void
+slab_release(struct fastmem_socket_state *socket,
+		struct fastmem_slab *slab)
+{
+	rte_spinlock_lock(&socket->lock);
+
+	slab->next_free = socket->free_head;
+	socket->free_head = slab;
+
+	rte_spinlock_unlock(&socket->lock);
+}
+
+static void
+bin_init(struct fastmem_bin *bin, unsigned int class_idx, int socket_id)
+{
+	size_t slot_size = class_size(class_idx);
+
+	rte_spinlock_init(&bin->lock);
+	bin->slot_size = (uint32_t)slot_size;
+	bin->slots_per_slab = slab_slot_count(slot_size);
+	bin->class_idx = class_idx;
+	TAILQ_INIT(&bin->partial);
+	TAILQ_INIT(&bin->full);
+	bin->socket_id = socket_id;
+	bin->slab_acquires = 0;
+	bin->slab_releases = 0;
+	bin->slabs_partial = 0;
+	bin->slabs_full = 0;
+}
+
+static void
+bin_release(struct fastmem_bin *bin, struct fastmem_socket_state *socket)
+{
+	struct fastmem_slab *slab;
+
+	while ((slab = TAILQ_FIRST(&bin->partial)) != NULL) {
+		TAILQ_REMOVE(&bin->partial, slab, list);
+		slab_release(socket, slab);
+	}
+	while ((slab = TAILQ_FIRST(&bin->full)) != NULL) {
+		TAILQ_REMOVE(&bin->full, slab, list);
+		slab_release(socket, slab);
+	}
+}
+
+static unsigned int
+bin_pop_locked(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	unsigned int got = 0;
+
+	while (got < n) {
+		struct fastmem_slab *slab = TAILQ_FIRST(&bin->partial);
+		void *obj;
+
+		if (slab == NULL)
+			break;
+
+		obj = slab->free_head;
+		slab->free_head = *(void **)obj;
+		slab->free_count--;
+		objs[got++] = obj;
+
+		if (slab->free_count == 0) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			TAILQ_INSERT_HEAD(&bin->full, slab, list);
+			bin->slabs_partial--;
+			bin->slabs_full++;
+		}
+	}
+
+	return got;
+}
+
+/*
+ * Fully-drained slabs are accumulated in @p to_release for the
+ * caller to return after dropping the lock.
+ */
+static unsigned int
+bin_push_locked(struct fastmem_bin *bin, void **objs, unsigned int n,
+		struct fastmem_slab **to_release)
+{
+	unsigned int n_release = 0;
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		void *obj = objs[i];
+		struct fastmem_slab *slab = (struct fastmem_slab *)
+			((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+		bool was_full = slab->free_count == 0;
+
+		*(void **)obj = slab->free_head;
+		slab->free_head = obj;
+		slab->free_count++;
+
+		if (was_full) {
+			TAILQ_REMOVE(&bin->full, slab, list);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_full--;
+			bin->slabs_partial++;
+		}
+
+		if (slab->free_count == slab->n_slots) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			bin->slabs_partial--;
+			bin->slab_releases++;
+			to_release[n_release++] = slab;
+		}
+	}
+
+	return n_release;
+}
+
+static void *
+bin_alloc_one(struct fastmem_bin *bin)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	void *obj;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (bin_pop_locked(bin, &obj, 1) == 0) {
+		struct fastmem_slab *slab;
+
+		if (TAILQ_FIRST(&bin->partial) != NULL)
+			continue;
+
+		rte_spinlock_unlock(&bin->lock);
+
+		slab = slab_acquire(socket, bin->socket_id);
+		if (slab == NULL) {
+			rte_errno = ENOMEM;
+			return NULL;
+		}
+
+		rte_spinlock_lock(&bin->lock);
+
+		if (unlikely(TAILQ_FIRST(&bin->partial) != NULL)) {
+			/* Release surplus slab without holding bin->lock. */
+			rte_spinlock_unlock(&bin->lock);
+			slab_release(socket, slab);
+			rte_spinlock_lock(&bin->lock);
+		} else {
+			slab_init(bin, slab);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_partial++;
+			bin->slab_acquires++;
+		}
+	}
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return obj;
+}
+
+static unsigned int
+bin_alloc_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	unsigned int got = 0;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (got < n) {
+		struct fastmem_slab *slab;
+
+		got += bin_pop_locked(bin, objs + got, n - got);
+		if (got == n)
+			break;
+
+		if (TAILQ_FIRST(&bin->partial) != NULL)
+			continue;
+
+		rte_spinlock_unlock(&bin->lock);
+
+		slab = slab_acquire(socket, bin->socket_id);
+		if (slab == NULL) {
+			rte_spinlock_lock(&bin->lock);
+			break;
+		}
+
+		rte_spinlock_lock(&bin->lock);
+
+		if (unlikely(TAILQ_FIRST(&bin->partial) != NULL)) {
+			/* Release surplus slab without holding bin->lock. */
+			rte_spinlock_unlock(&bin->lock);
+			slab_release(socket, slab);
+			rte_spinlock_lock(&bin->lock);
+		} else {
+			slab_init(bin, slab);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_partial++;
+			bin->slab_acquires++;
+		}
+	}
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return got;
+}
+
+static void
+bin_free_one(struct fastmem_bin *bin, void *obj)
+{
+	unsigned int n_release;
+	struct fastmem_slab *slab_to_release = NULL;
+	struct fastmem_socket_state *socket;
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, &obj, 1, &slab_to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	if (n_release > 0) {
+		socket = &fastmem->sockets[bin->socket_id];
+		slab_release(socket, slab_to_release);
+	}
+}
+
+static void
+bin_free_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	struct fastmem_slab *to_release[FASTMEM_CACHE_BASE_CAPACITY];
+	unsigned int n_release;
+	unsigned int i;
+
+	RTE_VERIFY(n <= RTE_DIM(to_release));
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, objs, n, to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	for (i = 0; i < n_release; i++)
+		slab_release(socket, to_release[i]);
+}
+
+static inline unsigned int
+cache_capacity(unsigned int class_idx)
+{
+	unsigned int class_log2 = class_idx + FASTMEM_MIN_CLASS_LOG2;
+	unsigned int shift;
+	unsigned int cap;
+
+	if (class_log2 <= FASTMEM_CACHE_BASE_CLASS_LOG2)
+		return FASTMEM_CACHE_BASE_CAPACITY;
+
+	shift = class_log2 - FASTMEM_CACHE_BASE_CLASS_LOG2;
+	cap = FASTMEM_CACHE_BASE_CAPACITY >> shift;
+
+	return cap < FASTMEM_CACHE_FLOOR_CAPACITY ?
+		FASTMEM_CACHE_FLOOR_CAPACITY : cap;
+}
+
+static inline struct fastmem_cache **
+cache_slot(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	if (lcore_id >= RTE_MAX_LCORE)
+		return NULL;
+	return &socket->caches[lcore_id][class_idx];
+}
+
+static struct fastmem_cache *
+cache_create(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache **slot = cache_slot(socket, class_idx, lcore_id);
+	struct fastmem_cache *cache;
+	unsigned int capacity;
+	size_t cache_size;
+	unsigned int cache_class;
+	unsigned int own_socket;
+	struct fastmem_socket_state *alloc_socket;
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	capacity = cache_capacity(class_idx);
+	cache_size = sizeof(*cache) + capacity * sizeof(void *);
+
+	/*
+	 * Allocate the cache struct from fastmem on the calling
+	 * lcore's socket (NUMA-local to the writer). Bypasses the
+	 * cache layer to avoid recursion.
+	 */
+	cache_class = size_to_class(cache_size, RTE_CACHE_LINE_SIZE);
+	own_socket = rte_socket_id();
+
+	if (cache_class >= FASTMEM_N_CLASSES) {
+		FASTMEM_LOG(ERR,
+			"cache size %zu exceeds max size class",
+			cache_size);
+		return NULL;
+	}
+
+	if (own_socket >= RTE_MAX_NUMA_NODES)
+		own_socket = (unsigned int)socket->bins[0].socket_id;
+
+	alloc_socket = &fastmem->sockets[own_socket];
+
+	cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
+	if (cache == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to allocate cache for class %u on socket %u",
+			class_idx, own_socket);
+		return NULL;
+	}
+
+	cache->count = 0;
+	cache->capacity = capacity;
+	cache->target = capacity / 2;
+	cache->alloc_cache_hits = 0;
+	cache->alloc_cache_misses = 0;
+	cache->alloc_nomem = 0;
+	cache->free_cache_hits = 0;
+	cache->free_cache_misses = 0;
+
+	*slot = cache;
+
+	return cache;
+}
+
+static __rte_always_inline struct fastmem_cache *
+cache_get(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	struct fastmem_cache **slot;
+	struct fastmem_cache *cache;
+
+	if (unlikely(!fastmem_is_primary))
+		return NULL;
+
+	slot = cache_slot(socket, class_idx, lcore_id);
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	return cache_create(socket, class_idx, lcore_id);
+}
+
+static __rte_always_inline void *
+cache_pop(struct fastmem_cache *cache, struct fastmem_bin *bin)
+{
+	if (cache->count > 0) {
+		cache->alloc_cache_hits++;
+		return cache->objs[--cache->count];
+	}
+
+	cache->count = bin_alloc_bulk(bin, cache->objs, cache->target);
+	if (cache->count == 0)
+		return NULL;
+
+	cache->alloc_cache_misses++;
+	return cache->objs[--cache->count];
+}
+
+static __rte_always_inline void
+cache_push(struct fastmem_cache *cache, struct fastmem_bin *bin, void *obj)
+{
+	unsigned int drain;
+
+	if (cache->count < cache->capacity) {
+		cache->free_cache_hits++;
+		cache->objs[cache->count++] = obj;
+		return;
+	}
+
+	cache->free_cache_misses++;
+
+	/*
+	 * Drain the oldest (bottom) half to the bin, keeping the
+	 * newest (top) half for temporal reuse.
+	 */
+	drain = cache->count - cache->target;
+	bin_free_bulk(bin, cache->objs, drain);
+	memmove(cache->objs, cache->objs + drain,
+		cache->target * sizeof(cache->objs[0]));
+	cache->count = cache->target;
+
+	cache->objs[cache->count++] = obj;
+}
+
+static void
+socket_release_caches(struct fastmem_socket_state *socket)
+{
+	unsigned int lcore;
+	unsigned int c;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache = socket->caches[lcore][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache);
+
+			socket->caches[lcore][c] = NULL;
+		}
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_init, 24.11)
+int
+rte_fastmem_init(void)
+{
+	unsigned int s, c;
+
+	if (fastmem != NULL)
+		return -EBUSY;
+
+	fastmem_mz = rte_memzone_reserve_aligned("fastmem_state",
+			sizeof(*fastmem), SOCKET_ID_ANY, 0,
+			RTE_CACHE_LINE_SIZE);
+	if (fastmem_mz == NULL)
+		return -ENOMEM;
+
+	fastmem = fastmem_mz->addr;
+	fastmem_is_primary = true;
+	memset(fastmem, 0, sizeof(*fastmem));
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		rte_spinlock_init(&socket->lock);
+		socket->memory_limit = SIZE_MAX;
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++)
+			bin_init(&socket->bins[c], c, (int)s);
+	}
+
+	return 0;
+}
+
+static void
+release_socket_caches(struct fastmem_socket_state *socket)
+{
+	socket_release_caches(socket);
+}
+
+static void
+release_socket_bins(struct fastmem_socket_state *socket)
+{
+	unsigned int c;
+
+	for (c = 0; c < FASTMEM_N_CLASSES; c++)
+		bin_release(&socket->bins[c], socket);
+}
+
+static void
+release_socket_memzones(struct fastmem_socket_state *socket)
+{
+	unsigned int i;
+
+	for (i = 0; i < socket->n_memzones; i++)
+		rte_memzone_free(socket->memzones[i]);
+
+	socket->free_head = NULL;
+	socket->reserved_bytes = 0;
+	socket->n_memzones = 0;
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_deinit, 24.11)
+rte_fastmem_deinit(void)
+{
+	unsigned int i;
+
+	if (fastmem == NULL)
+		return;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		fastmem = NULL;
+		fastmem_mz = NULL;
+		return;
+	}
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_caches(&fastmem->sockets[i]);
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_bins(&fastmem->sockets[i]);
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_memzones(&fastmem->sockets[i]);
+
+	rte_memzone_free(fastmem_mz);
+	fastmem_mz = NULL;
+	fastmem = NULL;
+}
+
+/* Same resolution order as rte_malloc's malloc_get_numa_socket(). */
+static __rte_always_inline unsigned int
+local_socket_id(void)
+{
+	int sid = (int)rte_socket_id();
+
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = (int)rte_lcore_to_socket_id(rte_get_main_lcore());
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = rte_socket_id_by_idx(0);
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	return 0;
+}
+
+static int
+reserve_on_socket(int sid, size_t size)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[sid];
+	int rc = 0;
+
+	rte_spinlock_lock(&socket->lock);
+
+	while (socket->reserved_bytes < size) {
+		rc = grow_socket(socket, sid);
+		if (rc < 0)
+			break;
+	}
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return rc;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_reserve, 24.11)
+rte_fastmem_reserve(size_t size, int socket_id)
+{
+	unsigned int i;
+	int rc;
+
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id != SOCKET_ID_ANY) {
+		if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+			return -EINVAL;
+		return reserve_on_socket(socket_id, size);
+	}
+
+	rc = reserve_on_socket(local_socket_id(), size);
+	if (rc == 0)
+		return 0;
+
+	for (i = 0; i < rte_socket_count(); i++) {
+		int sid = rte_socket_id_by_idx(i);
+
+		if (sid < 0 || (unsigned int)sid == local_socket_id())
+			continue;
+
+		rc = reserve_on_socket(sid, size);
+		if (rc == 0)
+			return 0;
+	}
+
+	return rc;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_set_limit, 24.11)
+rte_fastmem_set_limit(int socket_id, size_t max_bytes)
+{
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id == SOCKET_ID_ANY) {
+		for (unsigned int i = 0; i < RTE_MAX_NUMA_NODES; i++)
+			fastmem->sockets[i].memory_limit = max_bytes;
+		return 0;
+	}
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	fastmem->sockets[socket_id].memory_limit = max_bytes;
+	return 0;
+}
+
+size_t
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_get_limit, 24.11)
+rte_fastmem_get_limit(int socket_id)
+{
+	if (fastmem == NULL || socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return 0;
+
+	return fastmem->sockets[socket_id].memory_limit;
+}
+
+size_t
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_max_size, 24.11)
+rte_fastmem_max_size(void)
+{
+	return FASTMEM_MAX_ALLOC_SIZE;
+}
+
+static __rte_always_inline void *
+alloc_from_socket(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache *cache;
+
+	cache = cache_get(socket, class_idx, lcore_id);
+	if (likely(cache != NULL))
+		return cache_pop(cache, &socket->bins[class_idx]);
+	return bin_alloc_one(&socket->bins[class_idx]);
+}
+
+static __rte_always_inline void
+do_free(void *ptr)
+{
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+
+	slab = slab_of(ptr);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+	if (likely(cache != NULL))
+		cache_push(cache, bin, ptr);
+	else
+		bin_free_one(bin, ptr);
+}
+
+static __rte_always_inline int
+do_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags, unsigned int lcore_id,
+		int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int got = 0;
+
+	if (unlikely(fastmem_get() == NULL))
+		return -rte_errno;
+
+	if (align == 0)
+		align = RTE_CACHE_LINE_SIZE;
+	else if (unlikely((align & (align - 1)) != 0)) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return -E2BIG;
+	}
+
+	socket = &fastmem->sockets[socket_id];
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		/* Drain from cache. */
+		unsigned int avail = RTE_MIN(cache->count, n);
+
+		cache->count -= avail;
+		memcpy(ptrs, &cache->objs[cache->count],
+			avail * sizeof(void *));
+		got = avail;
+		cache->alloc_cache_hits += avail;
+
+		if (got < n) {
+			unsigned int need = n - got;
+			unsigned int want = RTE_MAX(need, cache->target);
+			unsigned int filled;
+
+			if (want <= cache->capacity) {
+				/* Refill into cache, give caller their share. */
+				filled = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					cache->objs, want);
+				if (filled > 0)
+					cache->alloc_cache_misses += RTE_MIN(filled, need);
+				if (filled >= need) {
+					memcpy(ptrs + got,
+						cache->objs + filled - need,
+						need * sizeof(void *));
+					cache->count = filled - need;
+					got = n;
+				} else {
+					memcpy(ptrs + got, cache->objs,
+						filled * sizeof(void *));
+					got += filled;
+					cache->count = 0;
+				}
+			} else {
+				/* n exceeds cache capacity; pull directly. */
+				unsigned int direct = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, need);
+				if (direct > 0)
+					cache->alloc_cache_misses += direct;
+				got += direct;
+			}
+		}
+	} else {
+		got = bin_alloc_bulk(&socket->bins[class_idx], ptrs, n);
+	}
+
+	if (unlikely(got < n) && fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count() && got < n; i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			socket = &fastmem->sockets[sid];
+			cache = cache_get(socket, class_idx, lcore_id);
+			if (likely(cache != NULL)) {
+				unsigned int avail =
+					RTE_MIN(cache->count, n - got);
+				cache->count -= avail;
+				memcpy(ptrs + got,
+					&cache->objs[cache->count],
+					avail * sizeof(void *));
+				cache->alloc_cache_hits += avail;
+				got += avail;
+			}
+			if (got < n) {
+				unsigned int direct = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, n - got);
+				if (direct > 0 && cache != NULL)
+					cache->alloc_cache_misses += direct;
+				got += direct;
+			}
+		}
+	}
+
+	if (unlikely(got < n)) {
+		/* All-or-nothing: free the partial allocation. */
+		struct fastmem_cache **slot;
+		unsigned int i;
+
+		for (i = 0; i < got; i++)
+			do_free(ptrs[i]);
+
+		slot = cache_slot(
+			&fastmem->sockets[socket_id], class_idx,
+			lcore_id);
+		if (slot != NULL && *slot != NULL)
+			(*slot)->alloc_nomem++;
+		rte_errno = ENOMEM;
+		return -ENOMEM;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO) {
+		size_t cs = class_size(class_idx);
+		unsigned int i;
+
+		for (i = 0; i < n; i++)
+			memset(ptrs[i], 0, cs);
+	}
+
+	return 0;
+}
+
+static __rte_always_inline void *
+do_alloc(size_t size, size_t align, unsigned int flags,
+		unsigned int lcore_id, int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_cache **slot;
+	void *obj;
+
+	if (unlikely(fastmem_get() == NULL))
+		return NULL;
+
+	if (align == 0)
+		align = RTE_CACHE_LINE_SIZE;
+	else if (unlikely((align & (align - 1)) != 0)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return NULL;
+	}
+
+	obj = alloc_from_socket(&fastmem->sockets[socket_id],
+			class_idx, lcore_id);
+
+	if (likely(obj != NULL))
+		goto out;
+
+	if (fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count(); i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			obj = alloc_from_socket(&fastmem->sockets[sid],
+					class_idx, lcore_id);
+			if (obj != NULL)
+				goto out;
+		}
+	}
+
+	slot = cache_slot(
+		&fastmem->sockets[socket_id], class_idx, lcore_id);
+	if (slot != NULL && *slot != NULL)
+		(*slot)->alloc_nomem++;
+	rte_errno = ENOMEM;
+	return NULL;
+
+out:
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc, 24.11)
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+{
+	return do_alloc(size, align, flags, rte_lcore_id(),
+			local_socket_id(), false);
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_socket, 24.11)
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc(size, align, flags, rte_lcore_id(),
+				local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	return do_alloc(size, align, flags, rte_lcore_id(), socket_id, false);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free, 24.11)
+rte_fastmem_free(void *ptr)
+{
+	if (unlikely(ptr == NULL))
+		return;
+
+	do_free(ptr);
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk, 24.11)
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags)
+{
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), local_socket_id(), false);
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk_socket, 24.11)
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc_bulk(ptrs, n, size, align, flags,
+				rte_lcore_id(), local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), socket_id, false);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free_bulk, 24.11)
+rte_fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int lcore_id;
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int space;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	lcore_id = rte_lcore_id();
+
+	/* Fast path: assume all objects share the first object's bin. */
+	slab = slab_of(ptrs[0]);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+
+	if (unlikely(cache == NULL)) {
+		for (i = 0; i < n; i++)
+			do_free(ptrs[i]);
+		return;
+	}
+
+	/*
+	 * Try to push all objects into the cache in one memcpy.
+	 * If any object belongs to a different bin, fall back to
+	 * per-object free for the whole batch.
+	 */
+	space = cache->capacity - cache->count;
+	if (likely(n <= space)) {
+		/* Verify all same bin (common case). */
+		for (i = 1; i < n; i++) {
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		}
+		cache->free_cache_hits += n;
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+	/* Would overflow cache — drain first, then push. */
+	if (n <= cache->capacity) {
+		unsigned int drain;
+
+		for (i = 1; i < n; i++) {
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		}
+
+		cache->free_cache_misses += n;
+		drain = cache->count - cache->target + n;
+		if (drain > cache->count)
+			drain = cache->count;
+		if (drain > 0) {
+			bin_free_bulk(bin, cache->objs, drain);
+			cache->count -= drain;
+			memmove(cache->objs, cache->objs + drain,
+				cache->count * sizeof(cache->objs[0]));
+		}
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+slow:
+	for (i = 0; i < n; i++)
+		do_free(ptrs[i]);
+}
+
+#define FASTMEM_HANDLE_CLASS_BITS 8
+
+static inline rte_fastmem_handle_t
+fastmem_handle_pack(unsigned int class_idx, int socket_id)
+{
+	return (uint32_t)class_idx |
+		((uint32_t)socket_id << FASTMEM_HANDLE_CLASS_BITS);
+}
+
+static inline unsigned int
+fastmem_handle_class(rte_fastmem_handle_t h)
+{
+	return h & ((1U << FASTMEM_HANDLE_CLASS_BITS) - 1);
+}
+
+static inline int
+fastmem_handle_socket(rte_fastmem_handle_t h)
+{
+	return (int)(h >> FASTMEM_HANDLE_CLASS_BITS);
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hlookup, 24.11)
+rte_fastmem_hlookup(size_t size, size_t align, int socket_id,
+		rte_fastmem_handle_t *handle)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+
+	if (handle == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	if (align == 0)
+		align = RTE_CACHE_LINE_SIZE;
+	else if ((align & (align - 1)) != 0)
+		return -EINVAL;
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	class_idx = size_to_class(size, align);
+	if (class_idx >= FASTMEM_N_CLASSES)
+		return -E2BIG;
+
+	/* Pre-create the cache for the calling lcore. */
+	socket = &fastmem->sockets[socket_id];
+	cache_create(socket, class_idx, rte_lcore_id());
+
+	*handle = fastmem_handle_pack(class_idx, socket_id);
+	return 0;
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_halloc, 24.11)
+rte_fastmem_halloc(rte_fastmem_handle_t handle, unsigned int flags)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	unsigned int lcore_id = rte_lcore_id();
+	struct fastmem_socket_state *socket = &fastmem->sockets[socket_id];
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	struct fastmem_cache *cache;
+	void *obj;
+
+	RTE_ASSERT(fastmem != NULL);
+	RTE_ASSERT(lcore_id < RTE_MAX_LCORE);
+
+	cache = socket->caches[lcore_id][class_idx];
+	RTE_ASSERT(cache != NULL);
+
+	obj = cache_pop(cache, bin);
+	if (unlikely(obj == NULL)) {
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_halloc_bulk, 24.11)
+rte_fastmem_halloc_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n, unsigned int flags)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+
+	return do_alloc_bulk(ptrs, n, class_size(class_idx),
+			RTE_CACHE_LINE_SIZE, flags, rte_lcore_id(),
+			socket_id, false);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hfree, 24.11)
+rte_fastmem_hfree(rte_fastmem_handle_t handle, void *ptr)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	struct fastmem_socket_state *socket = &fastmem->sockets[socket_id];
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	unsigned int lcore_id = rte_lcore_id();
+	struct fastmem_cache *cache;
+
+	if (unlikely(ptr == NULL))
+		return;
+
+	RTE_ASSERT(lcore_id < RTE_MAX_LCORE);
+
+	cache = socket->caches[lcore_id][class_idx];
+	RTE_ASSERT(cache != NULL);
+
+	cache_push(cache, bin, ptr);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hfree_bulk, 24.11)
+rte_fastmem_hfree_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	struct fastmem_socket_state *socket = &fastmem->sockets[socket_id];
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		for (i = 0; i < n; i++)
+			cache_push(cache, bin, ptrs[i]);
+	} else {
+		for (i = 0; i < n; i++)
+			bin_free_one(bin, ptrs[i]);
+	}
+}
+
+rte_iova_t
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_virt2iova, 24.11)
+rte_fastmem_virt2iova(const void *ptr)
+{
+	struct fastmem_slab *slab;
+
+	RTE_ASSERT(fastmem != NULL);
+
+	slab = slab_of((void *)(uintptr_t)ptr);
+
+	return slab->iova_base + ((uintptr_t)ptr - (uintptr_t)slab);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_cache_flush, 24.11)
+rte_fastmem_cache_flush(void)
+{
+	unsigned int lcore_id;
+	unsigned int s, c;
+
+	if (fastmem == NULL)
+		return;
+
+	lcore_id = rte_lcore_id();
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache);
+
+			socket->caches[lcore_id][c] = NULL;
+		}
+	}
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats, 24.11)
+rte_fastmem_stats(struct rte_fastmem_stats *stats)
+{
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_stats){0};
+	stats->n_classes = FASTMEM_N_CLASSES;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		stats->bytes_backing += socket->reserved_bytes;
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			uint64_t class_allocs = 0, class_frees = 0;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				class_allocs += cache->alloc_cache_hits +
+					cache->alloc_cache_misses;
+				class_frees += cache->free_cache_hits +
+					cache->free_cache_misses;
+				stats->alloc_nomem += cache->alloc_nomem;
+			}
+			stats->alloc_total += class_allocs;
+			stats->free_total += class_frees;
+			if (class_allocs > class_frees)
+				stats->bytes_in_use += class_size(c) *
+					(class_allocs - class_frees);
+		}
+	}
+
+	return 0;
+}
+
+static inline unsigned int
+exact_class_idx(size_t sz)
+{
+	unsigned int log2;
+
+	if (sz < FASTMEM_MIN_SIZE || sz > FASTMEM_MAX_ALLOC_SIZE)
+		return FASTMEM_N_CLASSES;
+	if ((sz & (sz - 1)) != 0)
+		return FASTMEM_N_CLASSES;
+
+	log2 = (unsigned int)rte_ctz64(sz);
+	if (log2 < FASTMEM_MIN_CLASS_LOG2 || log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_class, 24.11)
+rte_fastmem_stats_class(size_t class_size_arg,
+		struct rte_fastmem_class_stats *stats)
+{
+	unsigned int c;
+	uint64_t allocs, frees;
+
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+		struct fastmem_bin *bin = &socket->bins[c];
+
+		for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+			struct fastmem_cache *cache = socket->caches[l][c];
+			if (cache == NULL)
+				continue;
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+
+		stats->slab_acquires += bin->slab_acquires;
+		stats->slab_releases += bin->slab_releases;
+		stats->slabs_partial += bin->slabs_partial;
+		stats->slabs_full += bin->slabs_full;
+	}
+
+	allocs = stats->alloc_cache_hits + stats->alloc_cache_misses;
+	frees = stats->free_cache_hits + stats->free_cache_misses;
+	if (allocs > frees)
+		stats->in_use = allocs - frees;
+
+	return 0;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore, 24.11)
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats)
+{
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_stats){0};
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+			if (cache == NULL)
+				continue;
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+	}
+
+	return 0;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore_class, 24.11)
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size_arg,
+		struct rte_fastmem_lcore_class_stats *stats)
+{
+	unsigned int c;
+
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_cache *cache =
+			fastmem->sockets[s].caches[lcore_id][c];
+		if (cache == NULL)
+			continue;
+		stats->alloc_cache_hits += cache->alloc_cache_hits;
+		stats->alloc_cache_misses += cache->alloc_cache_misses;
+		stats->alloc_nomem += cache->alloc_nomem;
+		stats->free_cache_hits += cache->free_cache_hits;
+		stats->free_cache_misses += cache->free_cache_misses;
+	}
+
+	return 0;
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_reset, 24.11)
+rte_fastmem_stats_reset(void)
+{
+	if (fastmem == NULL)
+		return;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_bin *bin = &socket->bins[c];
+
+			bin->slab_acquires = 0;
+			bin->slab_releases = 0;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				cache->alloc_cache_hits = 0;
+				cache->alloc_cache_misses = 0;
+				cache->alloc_nomem = 0;
+				cache->free_cache_hits = 0;
+				cache->free_cache_misses = 0;
+			}
+		}
+	}
+}
+
+unsigned int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_classes, 24.11)
+rte_fastmem_classes(size_t *sizes)
+{
+	if (sizes != NULL)
+		for (unsigned int i = 0; i < FASTMEM_N_CLASSES; i++)
+			sizes[i] = class_size(i);
+	return FASTMEM_N_CLASSES;
+}
diff --git a/lib/fastmem/rte_fastmem.h b/lib/fastmem/rte_fastmem.h
new file mode 100644
index 0000000000..4da893e7f3
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.h
@@ -0,0 +1,774 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#ifndef _RTE_FASTMEM_H_
+#define _RTE_FASTMEM_H_
+
+/**
+ * @file
+ *
+ * RTE Fastmem
+ *
+ * @warning
+ * @b EXPERIMENTAL:
+ * All functions in this file may be changed or removed without prior notice.
+ *
+ * The fastmem library is a fast, general-purpose small-object
+ * allocator for DPDK applications. It is intended to allow an
+ * application to replace its many per-type mempools — each sized
+ * for a single object type (a connection, a session, a work item,
+ * a timer, etc.) — with a single allocator that handles arbitrary
+ * object sizes, grows on demand, and offers mempool-level
+ * performance for the common allocation and free paths.
+ *
+ * Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+ * supports bulk operations, and uses per-lcore caches to reduce
+ * shared-state contention. Unlike mempool, it does not require the
+ * caller to declare object sizes or counts up front.
+ *
+ * There is a single, global fastmem instance per process. The
+ * instance is brought up with rte_fastmem_init() and torn down with
+ * rte_fastmem_deinit(). Allocations are made with
+ * rte_fastmem_alloc() and freed with rte_fastmem_free().
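+ *
+ * A minimal lifecycle sketch, assuming EAL is already initialized;
+ * @c struct @c conn is a hypothetical application type:
+ * @code{.c}
+ *   struct conn *c; // hypothetical application object
+ *
+ *   if (rte_fastmem_init() < 0)
+ *           return -1;
+ *
+ *   c = rte_fastmem_alloc(sizeof(*c), 0, RTE_FASTMEM_F_ZERO);
+ *   if (c != NULL)
+ *           rte_fastmem_free(c);
+ *
+ *   rte_fastmem_deinit();
+ * @endcode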
+ *
+ * The allocator is limited to small-object allocations. Requests
+ * larger than rte_fastmem_max_size() are rejected; callers with
+ * such needs should use rte_malloc() directly.
+ *
+ * Backing memory is reserved from DPDK memzones. Once reserved,
+ * backing memory is not returned to the system during the
+ * allocator's lifetime. Callers that need predictable latency may
+ * pre-reserve backing memory up front using rte_fastmem_reserve(),
+ * avoiding memzone-reservation overhead during steady-state
+ * operation.
+ *
+ * Alignment argument, @c align:
+ *   If non-zero, @c align specifies the minimum alignment and
+ *   must be a power of 2. If zero, the default alignment is
+ *   @c RTE_CACHE_LINE_SIZE, so that objects obtained from distinct
+ *   calls cannot false-share a cache line.
+ *
+ * Threads and per-lcore caches:
+ *   Allocate and free calls from EAL threads are served through a
+ *   per-lcore cache, which makes the common path lock-free.
+ *   Unregistered non-EAL threads do not use a cache; their
+ *   allocate and free calls go directly to shared state, take an
+ *   internal lock, and cost more per call.
+ *
+ * Non-preemptible caller:
+ *   Callers should not be preemptible while inside a fastmem call.
+ *   Fastmem uses internal spinlocks; if a caller is preempted
+ *   while holding one, any other thread that subsequently needs
+ *   the same lock stalls until the preempted caller resumes.
+ */
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Flag for rte_fastmem_alloc() and its variants: initialize the
+ * returned memory to zero before returning it to the caller.
+ */
+#define RTE_FASTMEM_F_ZERO RTE_BIT32(0)
+
+/**
+ * Initialize the fastmem allocator.
+ *
+ * Sets up the library's internal state. Must be called before any
+ * allocation call. Typically called once per process, after
+ * rte_eal_init() and before the application's worker threads begin
+ * making allocations.
+ *
+ * Initialization does not pre-reserve any backing memory; memzones
+ * are reserved lazily as allocations require. An application that
+ * wants to avoid memzone-reservation latency on the allocation
+ * path should follow rte_fastmem_init() with one or more calls to
+ * rte_fastmem_reserve().
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EBUSY: The allocator is already initialized.
+ *  - -ENOMEM: Unable to allocate internal state.
+ */
+__rte_experimental
+int
+rte_fastmem_init(void);
+
+/**
+ * Tear down the fastmem allocator.
+ *
+ * Releases the library's internal state and frees all backing
+ * memzones. After this call, no fastmem allocations or frees may
+ * be made until rte_fastmem_init() is called again.
+ *
+ * The caller is responsible for ensuring that no fastmem-allocated
+ * objects remain in use. Outstanding allocations at deinit time
+ * result in undefined behavior.
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ */
+__rte_experimental
+void
+rte_fastmem_deinit(void);
+
+/**
+ * Pre-reserve backing memory.
+ *
+ * Ensures that at least @p size bytes of memzone-backed memory are
+ * available to the allocator on @p socket_id, reserving additional
+ * memzones from EAL as needed to reach that total. Subsequent
+ * allocations served from the pre-reserved memory do not incur
+ * memzone-reservation cost.
+ *
+ * The reservation only grows: a call reserves additional memzones
+ * only if less than @p size bytes are already reserved on
+ * @p socket_id, and reserved memory is never returned to the
+ * system during the allocator's lifetime.
+ *
+ * A typical use is to call rte_fastmem_reserve() once at
+ * application startup, with a size chosen to cover the expected
+ * steady-state working set. Allocations and frees during
+ * steady-state operation then avoid memzone reservations entirely.
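+ *
+ * For example, an application expecting a steady-state working
+ * set of about 256 MiB on socket 0 might do the following (the
+ * size and socket are illustrative assumptions):
+ * @code{.c}
+ *   int rc = rte_fastmem_reserve((size_t)256 << 20, 0);
+ *
+ *   if (rc < 0)
+ *           return rc; // -ENOMEM or -EINVAL
+ * @endcode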
+ *
+ * @param size
+ *  The minimum amount of backing memory, in bytes, to make
+ *  available on @p socket_id. The allocator may reserve more than
+ *  the requested amount due to internal rounding (e.g., to memzone
+ *  or block granularity).
+ *
+ * @param socket_id
+ *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
+ *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or, if the
+ *  caller is not bound to one, the main lcore's socket) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the reservation.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -ENOMEM: Insufficient huge-page memory to satisfy the request.
+ *  - -EINVAL: Fastmem not initialized, or invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_reserve(size_t size, int socket_id);
+
+/**
+ * Set the maximum backing memory that may be reserved on a socket.
+ *
+ * Once the limit is reached, allocations that would require new
+ * backing memory on the constrained socket fail with ENOMEM.
+ * Already-reserved memory is not released.
+ *
+ * Setting a limit below the current reserved amount is allowed and
+ * prevents further growth.
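+ *
+ * For example, to cap backing memory at 1 GiB on every socket
+ * (the figure is an illustrative assumption):
+ * @code{.c}
+ *   int rc = rte_fastmem_set_limit(SOCKET_ID_ANY, (size_t)1 << 30);
+ *
+ *   if (rc < 0)
+ *           return rc; // -EINVAL: fastmem not initialized
+ * @endcode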
+ *
+ * @param socket_id
+ *  The NUMA socket to constrain, or SOCKET_ID_ANY to apply the
+ *  limit to all sockets.
+ * @param max_bytes
+ *  Maximum backing memory in bytes, or SIZE_MAX for unlimited (the default).
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: Fastmem not initialized, or invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_set_limit(int socket_id, size_t max_bytes);
+
+/**
+ * Get the maximum backing memory limit for a socket.
+ *
+ * @param socket_id
+ *  The NUMA socket to query.
+ * @return
+ *  The limit in bytes (SIZE_MAX if unlimited), or 0 if fastmem is
+ *  not initialized or @p socket_id is invalid.
+ */
+__rte_experimental
+size_t
+rte_fastmem_get_limit(int socket_id);
+
+/**
+ * Retrieve the largest allocation size the allocator supports.
+ *
+ * Requests larger than this size are rejected by the allocation
+ * functions. The returned value is a property of the allocator
+ * implementation and does not change across the lifetime of the
+ * process.
+ *
+ * @return
+ *  The largest supported allocation size, in bytes.
+ */
+__rte_experimental
+size_t
+rte_fastmem_max_size(void);
+
+/**
+ * Allocate an object from the fastmem allocator.
+ *
+ * Allocates at least @p size bytes, aligned to at least @p align
+ * bytes. The returned memory is backed by huge pages and is
+ * DMA-usable; its IOVA can be obtained via rte_fastmem_virt2iova().
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_socket() to target a
+ * specific socket.
+ *
+ * The allocated memory must be freed with rte_fastmem_free(). An
+ * allocation may be freed from any lcore, not only the lcore that
+ * made the allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags. Use
+ *  RTE_FASTMEM_F_ZERO to obtain zero-initialized memory.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set:
+ *    - E2BIG: @p size exceeds rte_fastmem_max_size().
+ *    - EINVAL: Invalid @p align (not a power of two).
+ *    - ENOMEM: Allocation could not be served from existing
+ *      backing memory and no additional memzone could be reserved.
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+	__rte_alloc_size(1) __rte_alloc_align(2);
+
+/**
+ * Allocate an object on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc(), but targets the specified NUMA socket
+ * rather than the socket of the calling lcore. Use this variant
+ * when the lifetime or access pattern of the allocation is not
+ * tied to the calling lcore's socket.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set (see rte_fastmem_alloc()).
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+	__rte_alloc_size(1) __rte_alloc_align(2);
+
+/**
+ * Free an object previously allocated by the fastmem allocator.
+ *
+ * @p ptr must have been returned by a prior call to any fastmem
+ * allocation function, or be NULL. If @p ptr is NULL, no operation
+ * is performed.
+ *
+ * Free may be called from any lcore, regardless of which lcore
+ * made the original allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptr
+ *  Pointer to an object previously allocated by fastmem, or NULL.
+ */
+__rte_experimental
+void
+rte_fastmem_free(void *ptr);
+
+/**
+ * Allocate multiple objects in bulk.
+ *
+ * Allocates @p n objects, each of size at least @p size and aligned
+ * to at least @p align bytes, and stores the resulting pointers
+ * into @p ptrs. All @p n objects have the same size and alignment.
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_bulk_socket() to target a
+ * specific socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_alloc().
+ *
+ * On failure no objects are allocated and @p ptrs is left
+ * untouched.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ *  - -EINVAL: Invalid @p align.
+ *  - -ENOMEM: Not enough objects could be allocated to fill the
+ *    request.
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags);
+
+/**
+ * Allocate multiple objects in bulk on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc_bulk(), but targets the specified NUMA
+ * socket rather than the socket of the calling lcore.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - Negative errno on failure (see rte_fastmem_alloc_bulk()).
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id);
+
+/**
+ * Free multiple objects in bulk.
+ *
+ * Frees the @p n objects pointed to by @p ptrs. Each pointer in
+ * the array must have been returned by a prior fastmem allocation
+ * call and must not have been freed. The objects need not have
+ * the same size, alignment, or socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_free().
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_free_bulk(void **ptrs, unsigned int n);
+
+/**
+ * Opaque handle encoding a (size class, NUMA socket) pair.
+ *
+ * Obtained via rte_fastmem_hlookup(). Passing a handle to
+ * rte_fastmem_halloc() avoids the per-call size-class
+ * lookup and socket resolution, improving allocation throughput
+ * for fixed-size objects.
+ */
+typedef uint32_t rte_fastmem_handle_t;
+
+/**
+ * Look up a handle for a given object size and NUMA socket.
+ *
+ * The returned handle encodes the size class and socket, and can
+ * be passed to rte_fastmem_halloc() to allocate objects
+ * without repeating the class lookup.
+ *
+ * @param size
+ *  Object size in bytes. Must not exceed rte_fastmem_max_size().
+ *
+ * @param align
+ *  Alignment requirement (power of two), or 0 for the default
+ *  (RTE_CACHE_LINE_SIZE).
+ *
+ * @param socket_id
+ *  NUMA socket to allocate from.
+ *
+ * @param[out] handle
+ *  On success, set to the resolved handle.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: Invalid alignment or socket_id.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ */
+__rte_experimental
+int
+rte_fastmem_hlookup(size_t size, size_t align, int socket_id,
+		rte_fastmem_handle_t *handle);
+
+/**
+ * Allocate an object using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_alloc() but skips the size-class
+ * lookup and socket resolution, using the pre-resolved handle
+ * instead.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param flags
+ *  Allocation flags (e.g., RTE_FASTMEM_F_ZERO).
+ *
+ * @return
+ *  A pointer to the allocated object, or NULL on failure
+ *  (rte_errno is set).
+ */
+__rte_experimental
+void *
+rte_fastmem_halloc(rte_fastmem_handle_t handle, unsigned int flags);
+
+/**
+ * Bulk-allocate objects using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_alloc_bulk() but uses a pre-resolved
+ * handle. All-or-nothing semantics apply.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param[out] ptrs
+ *  Array to receive @p n allocated pointers.
+ *
+ * @param n
+ *  Number of objects to allocate.
+ *
+ * @param flags
+ *  Allocation flags (e.g., RTE_FASTMEM_F_ZERO).
+ *
+ * @return
+ *  - 0: All @p n objects allocated successfully.
+ *  - -ENOMEM: Allocation failed; no objects were allocated.
+ */
+__rte_experimental
+int
+rte_fastmem_halloc_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n, unsigned int flags);
+
+/**
+ * Free an object using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_free() but skips the slab-header
+ * lookup by using the class and socket encoded in the handle.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation function.
+ *  Must belong to the same size class and socket as @p handle.
+ *  NULL is permitted (no-op).
+ */
+__rte_experimental
+void
+rte_fastmem_hfree(rte_fastmem_handle_t handle, void *ptr);
+
+/**
+ * Bulk-free objects using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_free_bulk() but skips per-object
+ * slab-header lookups.
+ *
+ * All objects must belong to the same size class and socket as
+ * @p handle.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_hfree_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n);
+
+/**
+ * Obtain the IOVA for a fastmem-allocated pointer.
+ *
+ * Translates a virtual address returned by a fastmem allocation
+ * function into the corresponding IOVA, suitable for use in device
+ * DMA descriptors.
+ *
+ * The returned IOVA is valid for the lifetime of the allocation.
+ *
+ * @p ptr must have been returned by a prior fastmem allocation
+ * function. Passing any other pointer results in undefined
+ * behavior.
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation
+ *  function.
+ *
+ * @return
+ *  The IOVA corresponding to @p ptr.
+ */
+__rte_experimental
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr);
+
+/**
+ * Flush the calling lcore's per-lcore caches.
+ *
+ * Drains every cached object from the calling lcore's
+ * per-(size class, NUMA socket) caches back to their shared
+ * bins, and releases the cache state itself. A subsequent
+ * allocation or free on this lcore lazily recreates any caches
+ * it needs.
+ *
+ * This is useful in applications that have finished a bursty
+ * phase and want to release memory that would otherwise sit idle
+ * in caches. It is also useful in tests that want to observe
+ * bin-level state without per-lcore caching hiding activity.
+ *
+ * The call has no effect when invoked from a non-EAL thread.
+ *
+ * This function is not thread-safe with respect to concurrent
+ * allocations or frees on the calling lcore; call it only when
+ * the calling lcore is not making other fastmem calls.
+ */
+__rte_experimental
+void
+rte_fastmem_cache_flush(void);
+
+/**
+ * Global summary statistics.
+ */
+struct rte_fastmem_stats {
+	uint64_t bytes_backing;  /**< Bytes of backing memory (memzones) reserved from EAL. */
+	uint64_t bytes_in_use;   /**< Approximate bytes in live objects. */
+	uint64_t alloc_total;    /**< Total successful alloc operations (hits + misses). */
+	uint64_t free_total;     /**< Total free operations (hits + misses). */
+	uint64_t alloc_nomem;    /**< Alloc attempts that failed with ENOMEM. */
+	unsigned int n_classes;  /**< Number of size classes. */
+};
+
+/**
+ * Per-size-class statistics (aggregated across all lcores).
+ *
+ * Allocation and free counters count individual objects, not
+ * operations. A bulk allocation of 32 objects that hits the cache
+ * increments alloc_cache_hits by 32.
+ */
+struct rte_fastmem_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t in_use;               /**< Objects currently live (allocs - frees). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from a per-lcore cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by a per-lcore cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+	uint64_t slab_acquires;        /**< Slabs pulled from the free pool. */
+	uint64_t slab_releases;        /**< Slabs returned to the free pool. */
+	uint32_t slabs_partial;        /**< Current partial slab count. */
+	uint32_t slabs_full;           /**< Current full slab count. */
+};
+
+/**
+ * Per-lcore statistics (aggregated across all classes).
+ */
+struct rte_fastmem_lcore_stats {
+	uint64_t alloc_cache_hits;     /**< Allocs served from this lcore's caches. */
+	uint64_t alloc_cache_misses;   /**< Allocs that missed this lcore's caches. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by this lcore's caches. */
+	uint64_t free_cache_misses;    /**< Frees that bypassed this lcore's caches. */
+};
+
+/**
+ * Per-lcore, per-class statistics (no aggregation).
+ */
+struct rte_fastmem_lcore_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+};
+
+/**
+ * Get the number of size classes and optionally their sizes.
+ *
+ * @param[out] sizes
+ *   If non-NULL, filled with the size (in bytes) of each class.
+ *   The caller must provide space for at least the returned number
+ *   of entries.
+ *
+ * @return
+ *   The number of size classes.
+ */
+__rte_experimental
+unsigned int
+rte_fastmem_classes(size_t *sizes);
+
+/**
+ * Retrieve global summary statistics.
+ *
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL or fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats(struct rte_fastmem_stats *stats);
+
+/**
+ * Retrieve statistics for a single size class.
+ *
+ * @param class_size
+ *   Exact size of the class to query (must match one of the values
+ *   returned by rte_fastmem_classes()).
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized, or
+ *    @p class_size does not match any size class.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_class(size_t class_size,
+		struct rte_fastmem_class_stats *stats);
+
+/**
+ * Retrieve per-lcore statistics (aggregated across all classes).
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized, or
+ *    @p lcore_id is invalid.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats);
+
+/**
+ * Retrieve per-lcore, per-class statistics.
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param class_size
+ *   Exact size of the class to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized,
+ *    @p lcore_id is invalid, or @p class_size does not match any
+ *    size class.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size,
+		struct rte_fastmem_lcore_class_stats *stats);
+
+/**
+ * Reset all statistics counters to zero.
+ *
+ * Zeroes per-lcore cache counters and per-bin counters. Does not
+ * affect the allocator's operational state.
+ */
+__rte_experimental
+void
+rte_fastmem_stats_reset(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_FASTMEM_H_ */
diff --git a/lib/meson.build b/lib/meson.build
index 8f5cfd28a5..10906d4d53 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -38,6 +38,7 @@ libraries = [
         'distributor',
         'dmadev',  # eventdev depends on this
         'efd',
+        'fastmem',
         'eventdev',
         'dispatcher', # dispatcher depends on eventdev
         'gpudev',
-- 
2.43.0
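
For illustration only: the rte_fastmem_handle_t typedef in the patch above is documented as encoding a (size class, NUMA socket) pair. The sketch below shows one way such a value could be packed, assuming the 18 power-of-2 size classes (8 B to 1 MiB) described in the cover letter. The bit layout and every ex_* name are hypothetical and not taken from the patch; alignment handling is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_MIN_SHIFT 3u   /* smallest class: 1 << 3 = 8 B */
#define EX_MAX_SHIFT 20u  /* largest class: 1 << 20 = 1 MiB */
#define EX_CLASS_BITS 8u  /* hypothetical: low 8 bits hold the class index */
#define EX_CLASS_MASK ((1u << EX_CLASS_BITS) - 1u)

/* Hypothetical stand-in for rte_fastmem_handle_t. */
typedef uint32_t ex_handle_t;

/*
 * Map a size to the index of the smallest power-of-2 class that fits it.
 * Returns 0..17, or -1 if the size exceeds the largest class.
 */
static int
ex_class_index(size_t size)
{
	unsigned int shift = EX_MIN_SHIFT;

	if (size > ((size_t)1 << EX_MAX_SHIFT))
		return -1;
	while (((size_t)1 << shift) < size)
		shift++;
	return (int)(shift - EX_MIN_SHIFT);
}

/* Pack a class index and a socket id into one 32-bit handle. */
static ex_handle_t
ex_handle_pack(unsigned int class_idx, unsigned int socket_id)
{
	return ((uint32_t)socket_id << EX_CLASS_BITS) |
		(class_idx & EX_CLASS_MASK);
}

/* Recover the class index from a handle. */
static unsigned int
ex_handle_class(ex_handle_t h)
{
	return h & EX_CLASS_MASK;
}

/* Recover the socket id from a handle. */
static unsigned int
ex_handle_socket(ex_handle_t h)
{
	return h >> EX_CLASS_BITS;
}
```

With a layout like this, a fast path holding a handle could index its per-lcore cache array with the two decoded fields instead of recomputing them from size and socket on every call.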



* Re: [RFC v2 2/3] lib: add fastmem library
  2026-05-26  8:57     ` [RFC v2 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-26 13:23       ` Stephen Hemminger
  2026-05-27 10:12         ` Mattias Rönnblom
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2026-05-26 13:23 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On Tue, 26 May 2026 10:57:42 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> +__rte_experimental
> +void *
> +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
> +	__rte_alloc_size(1) __rte_alloc_align(2);

Should also add attribute __rte_malloc which tells compiler
that pointer returned cannot alias other memory

And add __rte_dealloc(rte_fastmem_free, 1)
which tells compiler that the returned pointer should only go
back to fastmem (not free, rte_free, etc).
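
A minimal sketch of what the two attributes do, using the plain GCC spellings on a toy ex_alloc()/ex_free() pair rather than the DPDK macros. All ex_* and EX_* names are hypothetical, the pair is simply backed by libc malloc()/free(), and the deallocator-pairing form is assumed to require GCC 11 or later (it is compiled out elsewhere). Both functions are marked noinline so that inlining cannot produce apparent allocator/deallocator mismatches.

```c
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define EX_MALLOC __attribute__((malloc))
#define EX_ALLOC_SIZE(n) __attribute__((alloc_size(n)))
#define EX_NOINLINE __attribute__((noinline))
#else
#define EX_MALLOC
#define EX_ALLOC_SIZE(n)
#define EX_NOINLINE
#endif

/* The deallocator-pairing form of the attribute needs GCC 11 or later. */
#if defined(__GNUC__) && !defined(__clang__) && (__GNUC__ >= 11)
#define EX_DEALLOC(fn, argno) __attribute__((malloc(fn, argno)))
#else
#define EX_DEALLOC(fn, argno)
#endif

/* Hypothetical deallocator; must be declared before it is named below. */
EX_NOINLINE void ex_free(void *ptr);

/*
 * Hypothetical allocator. EX_MALLOC says the result does not alias any
 * other live object; EX_DEALLOC says the result must be released with
 * ex_free(), passed as its first argument.
 */
EX_NOINLINE void *ex_alloc(size_t size)
	EX_MALLOC EX_DEALLOC(ex_free, 1) EX_ALLOC_SIZE(1);

void *
ex_alloc(size_t size)
{
	return malloc(size);
}

void
ex_free(void *ptr)
{
	free(ptr);
}
```

With a compiler that supports the pairing form, passing the result of ex_alloc() to libc free() is expected to draw a -Wmismatched-dealloc diagnostic, while ex_free(ex_alloc(16)) is accepted.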


* Re: [RFC v2 2/3] lib: add fastmem library
  2026-05-26 13:23       ` Stephen Hemminger
@ 2026-05-27 10:12         ` Mattias Rönnblom
  2026-05-27 10:18           ` Bruce Richardson
  0 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 10:12 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On 5/26/26 15:23, Stephen Hemminger wrote:
> On Tue, 26 May 2026 10:57:42 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> +__rte_experimental
>> +void *
>> +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
>> +	__rte_alloc_size(1) __rte_alloc_align(2);
> 
> Should also add attribute __rte_malloc which tells compiler
> that pointer returned cannot alias other memory
> 
> And add __rte_dealloc(rte_fastmem_free, 1)
> which tells compiler that the returned pointer should only go
> back to fastmem (not free, rte_free, etc).

Done. Only works for the single-object ops (not bulk) though.

I've had a look at how to extend fastmem to support larger allocations 
(without suggesting this is the way to go).

Seems to me that implementation should be something like
a) slab allocator for small objects.
b) a cache-less per-socket page run allocator for mid-sized objects.
c) per-object memzones for large objects.
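
The three tiers above could be selected by size alone, roughly as sketched here. The 1 MiB slab limit matches the largest size class in the current RFC; the 64 MiB cut-off between page runs and per-object memzones, and all ex_* names, are hypothetical placeholders.

```c
#include <stddef.h>

enum ex_tier {
	EX_TIER_SLAB,     /* a) slab allocator, small objects */
	EX_TIER_PAGE_RUN, /* b) per-socket page run allocator, mid-sized */
	EX_TIER_MEMZONE,  /* c) one memzone per object, large */
};

#define EX_SLAB_MAX ((size_t)1 << 20)      /* 1 MiB: largest size class */
#define EX_PAGE_RUN_MAX ((size_t)64 << 20) /* hypothetical 64 MiB cut-off */

/* Pick the tier that would serve a request of the given size. */
static enum ex_tier
ex_pick_tier(size_t size)
{
	if (size <= EX_SLAB_MAX)
		return EX_TIER_SLAB;
	if (size <= EX_PAGE_RUN_MAX)
		return EX_TIER_PAGE_RUN;
	return EX_TIER_MEMZONE;
}
```

Only the first tier would go through the per-lcore caches; the other two are assumed to take a per-socket lock or call into EAL directly.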

If one would implement that, you would essentially have a plug-in 
replacement for rte_malloc.h (maybe minus some debug and some more 
esoteric DPDK heap features).

Should fastmem be an outright replacement, or something that at least 
initially lives alongside the regular heap, maybe with a run- or 
compile-time option to make rte_malloc.h functions delegate to fastmem? 
This is unclear to me at this point. I fear the more ambitious, cleaner 
and more risky DPDK heap replacement path will go the way my attempts 
to replace rte_memcpy or rte_timer went.

I would agree with anyone saying that we should have only one heap-like 
API for memory allocations. rte_malloc.h obviously needs to stay, for 
backward compatibility reasons, if nothing else. I would like to add bulk 
alloc/free, and allow for smaller alignments than 64, since slabs can do 
that efficiently (DPDK heap per-object header is 128 bytes!). One could 
either go about that by extending rte_malloc.h or deprecating that API 
and starting anew. In the latter case, one could do many more minor 
tweaks, like removing the type pointers (only a nuance), removing the 
validate function, changing and extending the stats interface, etc.
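
To put rough numbers on the header overhead, here is a simplified per-object footprint model. It takes the 128-byte header figure above at face value, assumes heap payloads are rounded up to a 64-byte cache line, and assumes slab objects are rounded up to a power-of-2 class of at least 8 bytes with no per-object metadata. Slab headers and cache effects are ignored, and the ex_* names are hypothetical.

```c
#include <stddef.h>

#define EX_HEAP_HDR 128u  /* per-object heap header, figure quoted above */
#define EX_CACHE_LINE 64u /* assumed cache line size */

/* Bytes consumed by one heap object: header plus cache-line-rounded payload. */
static size_t
ex_heap_footprint(size_t size)
{
	size_t payload = (size + EX_CACHE_LINE - 1) / EX_CACHE_LINE *
		EX_CACHE_LINE;

	return EX_HEAP_HDR + payload;
}

/*
 * Bytes consumed by one slab object: the smallest power-of-2 class that
 * fits. The caller is assumed to pass a size of at most 1 MiB.
 */
static size_t
ex_slab_footprint(size_t size)
{
	size_t class_size = 8;

	while (class_size < size)
		class_size <<= 1;
	return class_size;
}
```

Under this model a 24-byte object costs 192 bytes on the heap and 32 bytes in a slab; a 64-byte object costs 192 bytes versus 64 bytes.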


* Re: [RFC v2 2/3] lib: add fastmem library
  2026-05-27 10:12         ` Mattias Rönnblom
@ 2026-05-27 10:18           ` Bruce Richardson
  2026-05-27 11:17             ` Mattias Rönnblom
  2026-05-27 11:17             ` Morten Brørup
  0 siblings, 2 replies; 38+ messages in thread
From: Bruce Richardson @ 2026-05-27 10:18 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Stephen Hemminger, dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On Wed, May 27, 2026 at 12:12:19PM +0200, Mattias Rönnblom wrote:
> On 5/26/26 15:23, Stephen Hemminger wrote:
> > On Tue, 26 May 2026 10:57:42 +0200
> > Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> > 
> > > +__rte_experimental
> > > +void *
> > > +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
> > > +	__rte_alloc_size(1) __rte_alloc_align(2);
> > 
> > Should also add attribute __rte_malloc which tells compiler
> > that pointer returned cannot alias other memory
> > 
> > And add __rte_dealloc(rte_fastmem_free, 1)
> > which tells compiler that the returned pointer should only go
> > back to fastmem (not free, rte_free, etc).
> 
> Done. Only works for the single-object ops (not bulk) though.
> 
> I've had a look at how to extend fastmem to support larger allocations
> (without suggesting this is the way to go).
> 
> Seems to me that implementation should be something like
> a) slab allocator for small objects.
> b) a cache-less per-socket page run allocator for mid-sized objects.
> c) per-object memzones for large objects.
> 
> If one would implement that, you would essentially have a plug-in
> replacement for rte_malloc.h (maybe minus some debug and some more esoteric
> DPDK heap features).
> 
> Should fastmem be an outright replacement, or something that at least
> initially lives alongside the regular heap, maybe with a run- or
> compile-time option to make rte_malloc.h functions delegate to fastmem? This
> is unclear to me at this point. I fear the more ambitious, cleaner and more
> risky DPDK heap replacement path will go the way my attempts to replace
> rte_memcpy or rte_timer went.
> 
> I would agree with anyone saying that we should have only one heap-like API
> for memory allocations. rte_malloc.h obviously needs to stay, for backward
> compatibility reasons, if nothing else. I would like to add bulk alloc/free,
> and allow for smaller alignments than 64, since slabs can do that
> efficiently (DPDK heap per-object header is 128 bytes!). One could either go
> about that by extending rte_malloc.h or deprecating that API and starting
> anew. In the latter case, one could do many more minor tweaks, like removing
> the type pointers (only a nuance), remove the validate function, change and
> extend the stats interface, etc.
> 
+1 for replacing rte_malloc. For a replacement, I'd tend towards aiming for
compatibility over trying to fix too many little things at once. While
changing a couple of things is ok, I'd rather not force applications to
make too many updates to their code when moving from one DPDK version to
another.

/Bruce


* Re: [RFC v2 2/3] lib: add fastmem library
  2026-05-27 10:18           ` Bruce Richardson
@ 2026-05-27 11:17             ` Mattias Rönnblom
  2026-05-27 11:17             ` Morten Brørup
  1 sibling, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 11:17 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Stephen Hemminger, dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On 5/27/26 12:18, Bruce Richardson wrote:
> On Wed, May 27, 2026 at 12:12:19PM +0200, Mattias Rönnblom wrote:
>> On 5/26/26 15:23, Stephen Hemminger wrote:
>>> On Tue, 26 May 2026 10:57:42 +0200
>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>
>>>> +__rte_experimental
>>>> +void *
>>>> +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
>>>> +	__rte_alloc_size(1) __rte_alloc_align(2);
>>>
>>> Should also add attribute __rte_malloc which tells compiler
>>> that pointer returned cannot alias other memory
>>>
>>> And add __rte_dealloc(rte_fastmem_free, 1)
>>> which tells compiler that the returned pointer should only go
>>> back to fastmem (not free, rte_free, etc).
>>
>> Done. Only works for the single-object ops (not bulk) though.
>>
>> I've had a look at how to extend fastmem to support larger allocations
>> (without suggesting this is the way to go).
>>
>> Seems to me that implementation should be something like
>> a) slab allocator for small objects.
>> b) a cache-less per-socket page run allocator for mid-sized objects.
>> c) per-object memzones for large objects.
>>
>> If one would implement that, you would essentially have a plug-in
>> replacement for rte_malloc.h (maybe minus some debug and some more esoteric
>> DPDK heap features).
>>
>> Should fastmem be an outright replacement, or something that at least
>> initially lives alongside the regular heap, maybe with a run- or
>> compile-time option to make rte_malloc.h functions delegate to fastmem? This
>> is unclear to me at this point. I fear the more ambitious, cleaner and more
>> risky DPDK heap replacement path will go the way my attempts to replace
>> rte_memcpy or rte_timer went.
>>
>> I would agree with anyone saying that we should have only one heap-like API
>> for memory allocations. rte_malloc.h obviously needs to stay, for backward
>> compatibility reasons, if nothing else. I would like to add bulk alloc/free,
>> and allow for smaller alignments than 64, since slabs can do that
>> efficiently (DPDK heap per-object header is 128 bytes!). One could either go
>> about that by extending rte_malloc.h or deprecating that API and starting
>> anew. In the latter case, one could do many more minor tweaks, like removing
>> the type pointers (only a nuance), remove the validate function, change and
>> extend the stats interface, etc.
>>
> +1 for replacing rte_malloc. For a replacement, I'd tend towards aiming for
> compatibility over trying to fix too many little things at once. While
> changing a couple of things is ok, I'd rather not force applications to
> make too many updates to their code when moving from one DPDK version to
> another.
> 

What one could attempt to do is to be fully backward compatible with 
rte_malloc.h (maybe minus some debug features that require per-object 
headers?) and then expose a new API which is functionally a superset of 
rte_malloc.h.

The new API would be something like a hybrid of rte_malloc.h, the 
mempool APIs, and the kind of API you find on an in-kernel memory 
manager (e.g., Solaris'), tuned for DPDK lcore use.

> /Bruce



* RE: [RFC v2 2/3] lib: add fastmem library
  2026-05-27 10:18           ` Bruce Richardson
  2026-05-27 11:17             ` Mattias Rönnblom
@ 2026-05-27 11:17             ` Morten Brørup
  2026-05-27 11:29               ` Mattias Rönnblom
  1 sibling, 1 reply; 38+ messages in thread
From: Morten Brørup @ 2026-05-27 11:17 UTC (permalink / raw)
  To: Bruce Richardson, Mattias Rönnblom
  Cc: Stephen Hemminger, dev, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 27 May 2026 12.18
> 
> On Wed, May 27, 2026 at 12:12:19PM +0200, Mattias Rönnblom wrote:
> > On 5/26/26 15:23, Stephen Hemminger wrote:
> > > On Tue, 26 May 2026 10:57:42 +0200
> > > Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> > >
> > > > +__rte_experimental
> > > > +void *
> > > > +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
> > > > +	__rte_alloc_size(1) __rte_alloc_align(2);
> > >
> > > Should also add attribute __rte_malloc which tells compiler
> > > that pointer returned cannot alias other memory
> > >
> > > And add __rte_dealloc(rte_fastmem_free, 1)
> > > which tells compiler that the returned pointer should only go
> > > back to fastmem (not free, rte_free, etc).
> >
> > Done. Only works for the single-object ops (not bulk) though.
> >
> > I've had a look at how to extend fastmem to support larger
> allocations
> > (without suggesting this is the way to go).
> >
> > Seems to me that implementation should be something like
> > a) slab allocator for small objects.
> > b) a cache-less per-socket page run allocator for mid-sized objects.
> > c) per-object memzones for large objects.

All allocations through the library should somehow build on memzones.
If an application wants "normal" memory, it can use libc free()/malloc().

> >
> > If one would implement that, you would essentially have a plug-in
> > replacement for rte_malloc.h (maybe minus some debug and some more
> esoteric
> > DPDK heap features).
> >
> > Should fastmem be an outright replacement, or something that at least
> > initially lives alongside the regular heap, maybe with a run- or
> > compile-time option to make rte_malloc.h functions delegate to
> fastmem? This
> > is unclear to me at this point. I fear the more ambitious, cleaner
> and more
> > risky DPDK heap replacement path will go the way my attempts to
> replace
> > rte_memcpy or rte_timer went.
> >
> > I would agree with anyone saying that we should have only one heap-
> like API
> > for memory allocations. rte_malloc.h obviously needs to stay, for
> backward
> > compatibility reasons, if nothing else. I would like to add bulk
> alloc/free,
> > and allow for smaller alignments than 64, since slabs can do that
> > efficiently (DPDK heap per-object header is 128 bytes!). One could
> either go
> > about that by extending rte_malloc.h or deprecating that API and
> starting
> > anew. In the latter case, one could do many more minor tweaks, like
> removing
> > the type pointers (only a nuance), remove the validate function,
> change and
> > extend the stats interface, etc.
> >
> +1 for replacing rte_malloc. For a replacement, I'd tend towards aiming
> for
> compatibility over trying to fix too many little things at once. While
> changing a couple of things is ok, I'd rather not force applications to
> make too many updates to their code when moving from one DPDK version
> to
> another.
> 
> /Bruce

+1 for taking the path that begins with starting anew.
The new heap library can be designed specifically for the fast path, based on years of DPDK experience with what the fast path really needs.
It will allow much broader experimentation along the way.
We can easily get rid of all slow path legacy stuff in rte_malloc, like the "type" parameter.

I think we all agree that replacing rte_malloc should be the end game.
But IMO, it is very important that the properties of rte_malloc do not impose any limits on the new heap library.

Someday, when the new library has sufficiently matured, we can discuss how it can replace the rte_malloc library.
Maybe some parts of rte_malloc are not replaceable, and need to be deprecated and removed.
Maybe we all have switched to using the new heap, and nobody is using rte_malloc anymore, so it can simply be removed. ;-)

Suggestion regarding naming:
The prefix could be "rte_mem_" instead of "rte_fastmem_" - it is shorter, and most libraries are "fast".
And then it could live in /lib/memory instead of /lib/fastmem.



* Re: [RFC v2 2/3] lib: add fastmem library
  2026-05-27 11:17             ` Morten Brørup
@ 2026-05-27 11:29               ` Mattias Rönnblom
  2026-05-27 12:03                 ` Morten Brørup
  0 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 11:29 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: Stephen Hemminger, dev, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel

On 5/27/26 13:17, Morten Brørup wrote:
>> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
>> Sent: Wednesday, 27 May 2026 12.18
>>
>> On Wed, May 27, 2026 at 12:12:19PM +0200, Mattias Rönnblom wrote:
>>> On 5/26/26 15:23, Stephen Hemminger wrote:
>>>> On Tue, 26 May 2026 10:57:42 +0200
>>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>>
>>>>> +__rte_experimental
>>>>> +void *
>>>>> +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
>>>>> +	__rte_alloc_size(1) __rte_alloc_align(2);
>>>>
>>>> Should also add attribute __rte_malloc which tells compiler
>>>> that pointer returned cannot alias other memory
>>>>
>>>> And add __rte_dealloc(rte_fastmem_free, 1)
>>>> which tells compiler that the returned pointer should only go
>>>> back to fastmem (not free, rte_free, etc).
>>>
>>> Done. Only works for the single-object ops (not bulk) though.
>>>
>>> I've had a look at how to extend fastmem to support larger
>> allocations
>>> (without suggesting this is the way to go).
>>>
>>> Seems to me that implementation should be something like
>>> a) slab allocator for small objects.
>>> b) a cache-less per-socket page run allocator for mid-sized objects.
>>> c) per-object memzones for large objects.
> 
> All allocations through the library should somehow build on memzones.
> If an application wants "normal" memory, it can use libc free()/malloc().
> 

All allocations come out of memzones. In the c) case, each object is its
own memzone. The number of memzones is very limited, so there can't be
many such objects.
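The a)/b)/c) split quoted above could be sketched as a size-based dispatch, roughly as follows. This is only an illustration, not code from the patch: every sketch_* name is hypothetical, the 1 MiB slab limit is taken from the cover letter's largest size class, and the 64 MiB cut-over to per-object memzones is an arbitrary assumption. The small budget counter stands in for the fact that memzones are a scarce resource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest slab size class per the cover letter (1 MiB). */
#define SKETCH_SLAB_MAX ((size_t)1 << 20)
/* Assumed cut-over to per-object memzones; not specified in the thread. */
#define SKETCH_PAGE_RUN_MAX ((size_t)64 << 20)

enum sketch_tier {
	SKETCH_TIER_SLAB,     /* a) slab allocator, small objects */
	SKETCH_TIER_PAGE_RUN, /* b) cache-less per-socket page runs */
	SKETCH_TIER_MEMZONE,  /* c) one memzone per object */
};

/* Hypothetical bookkeeping for the limited supply of memzones. */
struct sketch_memzone_budget {
	unsigned int left;
};

/* Pick the tier that would serve a request of the given size. */
static enum sketch_tier
sketch_pick_tier(size_t size)
{
	if (size <= SKETCH_SLAB_MAX)
		return SKETCH_TIER_SLAB;
	if (size <= SKETCH_PAGE_RUN_MAX)
		return SKETCH_TIER_PAGE_RUN;
	return SKETCH_TIER_MEMZONE;
}

/* Consume one memzone from the budget; false once it is exhausted. */
static bool
sketch_take_memzone(struct sketch_memzone_budget *budget)
{
	assert(budget != NULL);

	if (budget->left == 0)
		return false;
	budget->left--;
	return true;
}
```

A caller would check sketch_pick_tier() first and only call sketch_take_memzone() for the c) case, failing the allocation when the budget is used up.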

>>>
>>> If one would implement that, you would essentially have a plug-in
>>> replacement for rte_malloc.h (maybe minus some debug and some more
>> esoteric
>>> DPDK heap features).
>>>
>>> Should fastmem be an outright replacement, or something that at least
>>> initially lives alongside the regular heap, maybe with a run- or
>>> compile-time option to make rte_malloc.h functions delegate to
>> fastmem? This
>>> is unclear to me at this point. I fear the more ambitious, cleaner
>> and more
>>> risky DPDK heap replacement path will go the way my attempts to
>> replace
>>> rte_memcpy or rte_timer went.
>>>
>>> I would agree with anyone saying that we should have only one heap-
>> like API
>>> for memory allocations. rte_malloc.h obviously needs to stay, for
>> backward
>>> compatibility reason, if nothing else. I would like to add bulk
>> alloc/free,
>>> and allow for smaller alignments than 64, since slabs can do that
>>> efficiently (DPDK heap per-object header is 128 bytes!). One could
>> either go
>>> about that by extending rte_malloc.h or deprecating that API and
>> starting
>>> anew. In the latter case, one could do many more minor tweaks, like
>> removing
>>> the type pointers (only a nuance), remove the validate function,
>> change and
>>> extend the stats interface, etc.
>>>
>> +1 for replacing rte_malloc. For a replacement, I'd tend towards aiming
>> for
>> compatibility over trying to fix too many little things at once. While
>> changing a couple of things is ok, I'd rather not force applications to
>> make too many updates to their code when moving from one DPDK version
>> to
>> another.
>>
>> /Bruce
> 
> +1 for taking the path that begins with starting anew.
> The new heap library can be designed specifically for the fast path, based on years of DPDK experience with what the fast path really needs.
> It will allow much broader experimentation along the way.
> We can easily get rid of all slow path legacy stuff in rte_malloc, like the "type" parameter.
> 
> I think we all agree that replacing rte_malloc should be the end game.
> But IMO, it is very important that the properties of rte_malloc do not impose any limits on the new heap library.
> 
> Someday, when the new library has sufficiently matured, we can discuss how it can replace the rte_malloc library.
> Maybe some parts of rte_malloc are not replaceable, and need to be deprecated and removed.
> Maybe we all have switched to using the new heap, and nobody is using rte_malloc anymore, so it can simply be removed. ;-)
> 

It would be useful to have a run- or compile-time switch to use fastmem 
instead of the regular DPDK heap. In such a case, having it as an 
external library will complicate things during initialization, if you 
want all allocations to end up in fastmem.
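A compile-time variant of such a switch might look roughly like the sketch below. It is an assumption-laden illustration, not part of the patch: SKETCH_USE_FASTMEM is a hypothetical build option (no such DPDK macro exists), and the two stand-in backends are libc-backed placeholders that ignore alignment. Only the shapes of rte_malloc(type, size, align) and rte_fastmem_alloc(size, align, flags) are taken from the existing API and this RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical build-time option; not an existing DPDK macro. */
#ifndef SKETCH_USE_FASTMEM
#define SKETCH_USE_FASTMEM 1
#endif

/* Call counters, only here to make the chosen backend observable. */
static unsigned int sketch_fastmem_calls;
static unsigned int sketch_heap_calls;

/* Stand-in for rte_fastmem_alloc(size, align, flags); libc-backed here. */
static void *
sketch_fastmem_alloc(size_t size, size_t align, unsigned int flags)
{
	/* align == 0 means "default"; otherwise it must be a power of 2. */
	assert(align == 0 || (align & (align - 1)) == 0);
	(void)flags;
	sketch_fastmem_calls++;
	return malloc(size);
}

/* Stand-in for the existing heap behind rte_malloc(type, size, align). */
static void *
sketch_heap_alloc(const char *type, size_t size, unsigned int align)
{
	(void)type;
	(void)align;
	sketch_heap_calls++;
	return malloc(size);
}

/*
 * rte_malloc()-shaped entry point. The condition is a compile-time
 * constant, so both paths are type-checked but only one is ever taken.
 */
static void *
sketch_malloc(const char *type, size_t size, unsigned int align)
{
	if (SKETCH_USE_FASTMEM)
		return sketch_fastmem_alloc(size, align, 0);
	return sketch_heap_alloc(type, size, align);
}

/* Matching free; both stand-in backends use libc malloc(). */
static void
sketch_free(void *ptr)
{
	free(ptr);
}
```

Note how the "type" argument is simply dropped on the fastmem path, which is one of the rte_malloc properties discussed in this thread.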

Unfortunately, if it is to be an rte_malloc replacement, I think it
belongs in the EAL, just like today's heap. That doesn't mean it has to
start in the EAL.

> Suggestion regarding naming:
> The prefix could be "rte_mem_" instead of "rte_fastmem_" - it is shorter, and most libraries are "fast".
> And then it could live in /lib/memory instead of /lib/fastmem.
> 

mem is very generic, but point taken. Name should change, unclear to 
what. Depends on the role it will serve.


* RE: [RFC v2 2/3] lib: add fastmem library
  2026-05-27 11:29               ` Mattias Rönnblom
@ 2026-05-27 12:03                 ` Morten Brørup
  0 siblings, 0 replies; 38+ messages in thread
From: Morten Brørup @ 2026-05-27 12:03 UTC (permalink / raw)
  To: Mattias Rönnblom, Bruce Richardson
  Cc: Stephen Hemminger, dev, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 27 May 2026 13.30
> 
> On 5/27/26 13:17, Morten Brørup wrote:
> >> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> >> Sent: Wednesday, 27 May 2026 12.18
> >>
> >> On Wed, May 27, 2026 at 12:12:19PM +0200, Mattias Rönnblom wrote:
> >>> On 5/26/26 15:23, Stephen Hemminger wrote:
> >>>> On Tue, 26 May 2026 10:57:42 +0200
> >>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >>>>
> >>>>> +__rte_experimental
> >>>>> +void *
> >>>>> +rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
> >>>>> +	__rte_alloc_size(1) __rte_alloc_align(2);
> >>>>
> >>>> Should also add attribute __rte_malloc which tells compiler
> >>>> that pointer returned cannot alias other memory
> >>>>
> >>>> And add __rte_dealloc(rte_fastmem_free, 1)
> >>>> which tells compiler that the returned pointer should only go
> >>>> back to fastmem (not free, rte_free, etc).
> >>>
> >>> Done. Only works for the single-object ops (not bulk) though.
> >>>
> >>> I've had a look at how to extend fastmem to support larger
> >> allocations
> >>> (without suggesting this is the way to go).
> >>>
> >>> Seems to me that implementation should be something like
> >>> a) slab allocator for small objects.
> >>> b) a cache-less per-socket page run allocator for mid-sized
> objects.
> >>> c) per-object memzones for large objects.
> >
> > All allocations through the library should somehow build on memzones.
> > If an application wants "normal" memory, it can use libc
> free()/malloc().
> >
> 
> All allocations come out of memzones. In the c) case, each object is its
> own memzone. The number of memzones is very limited, so there can't be
> many such objects.
> 
> >>>
> >>> If one would implement that, you would essentially have a plug-in
> >>> replacement for rte_malloc.h (maybe minus some debug and some more
> >> esoteric
> >>> DPDK heap features).
> >>>
> >>> Should fastmem be an outright replacement, or something that at
> least
> >>> initially lives alongside the regular heap, maybe with a run- or
> >>> compile-time option to make rte_malloc.h functions delegate to
> >> fastmem? This
> >>> is unclear to me at this point. I fear the more ambitious, cleaner
> >> and more
> >>> risky DPDK heap replacement path will go the way my attempts to
> >> replace
> >>> rte_memcpy or rte_timer went.
> >>>
> >>> I would agree with anyone saying that we should have only one heap-
> >> like API
> >>> for memory allocations. rte_malloc.h obviously needs to stay, for
> >> backward
> >>> compatibility reason, if nothing else. I would like to add bulk
> >> alloc/free,
> >>> and allow for smaller alignments than 64, since slabs can do that
> >>> efficiently (DPDK heap per-object header is 128 bytes!). One could
> >> either go
> >>> about that by extending rte_malloc.h or deprecating that API and
> >> starting
> >>> anew. In the latter case, one could do many more minor tweaks, like
> >> removing
> >>> the type pointers (only a nuance), remove the validate function,
> >> change and
> >>> extend the stats interface, etc.
> >>>
> >> +1 for replacing rte_malloc. For a replacement, I'd tend towards
> aiming
> >> for
> >> compatibility over trying to fix too many little things at once.
> While
> >> changing a couple of things is ok, I'd rather not force applications
> to
> >> make too many updates to their code when moving from one DPDK
> version
> >> to
> >> another.
> >>
> >> /Bruce
> >
> > +1 for taking the path that begins with starting anew.
> > The new heap library can be designed specifically for the fast path,
> based on years of DPDK experience with what the fast path really needs.
> > It will allow much broader experimentation along the way.
> > We can easily get rid of all slow path legacy stuff in rte_malloc,
> like the "type" parameter.
> >
> > I think we all agree that replacing rte_malloc should be the end
> game.
> > But IMO, it is very important that the properties of rte_malloc do not
> impose any limits on the new heap library.
> >
> > Someday, when the new library has sufficiently matured, we can
> discuss how it can replace the rte_malloc library.
> > Maybe some parts of rte_malloc are not replaceable, and need to be
> deprecated and removed.
> > Maybe we all have switched to using the new heap, and nobody is using
> rte_malloc anymore, so it can simply be removed. ;-)
> >
> 
> It would be useful to have a run- or compile-time switch to use fastmem
> instead of the regular DPDK heap. In such a case, having it as an
> external library will complicate things during initialization, if you
> want all allocations to end up in fastmem.

IMHO, there is no need for a runtime (or startup time) switch; a build time configuration option suffices.
And it doesn't need to be there until we reach the point where we start looking into replacing rte_malloc with fastmem.
Putting it in too early might lead us to converging too early with rte_malloc, reducing the ability to experiment and further improve the library.

> 
> Unfortunately, if it is to be an rte_malloc replacement, I think it
> belongs in the EAL, just like today's heap. That doesn't mean it has to
> start in the EAL.

The EAL is bloated, so it would be nice if rte_malloc could somehow be moved out of the EAL, perhaps only partially.
If that is impossible, then yes, fastmem would probably have to live in the EAL too.

Good point that fastmem doesn't have to start in the EAL.
But if further investigation shows that rte_malloc is too deeply integrated with the EAL to be moved out, fastmem might as well start its life in the EAL too.
It might save us from some headache at the point in time where we move it from a separate library into the EAL.


> 
> > Suggestion regarding naming:
> > The prefix could be "rte_mem_" instead of "rte_fastmem_" - it is
> shorter, and most libraries are "fast".
> > And then it could live in /lib/memory instead of /lib/fastmem.
> >
> 
> mem is very generic, but point taken. Name should change, unclear to
> what. Depends on the role it will serve.


* [RFC v2 3/3] app/test: add fastmem test suite
  2026-05-26  8:57   ` [RFC v2 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-26  8:57     ` [RFC v2 1/3] doc: add fastmem programming guide Mattias Rönnblom
  2026-05-26  8:57     ` [RFC v2 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-26  8:57     ` Mattias Rönnblom
  2026-05-27 17:30       ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-26  8:57 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Mattias Rönnblom

Add functional, performance, and profiling test suites for the
fastmem library.

--

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 app/test/meson.build            |    3 +
 app/test/test_fastmem.c         | 1673 +++++++++++++++++++++++++++++++
 app/test/test_fastmem_perf.c    | 1040 +++++++++++++++++++
 app/test/test_fastmem_profile.c |  157 +++
 4 files changed, 2873 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d458f9c07..d11c63be6f 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -82,6 +82,9 @@ source_file_deps = {
     'test_event_vector_adapter.c': ['eventdev', 'bus_vdev'],
     'test_eventdev.c': ['eventdev', 'bus_vdev'],
     'test_external_mem.c': [],
+    'test_fastmem.c': ['fastmem'],
+    'test_fastmem_perf.c': ['fastmem', 'mempool'],
+    'test_fastmem_profile.c': ['fastmem'],
     'test_fbarray.c': [],
     'test_fib.c': ['net', 'fib'],
     'test_fib6.c': ['rib', 'fib'],
diff --git a/app/test/test_fastmem.c b/app/test/test_fastmem.c
new file mode 100644
index 0000000000..6981de28be
--- /dev/null
+++ b/app/test/test_fastmem.c
@@ -0,0 +1,1673 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <inttypes.h>
+#include <stdalign.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_thread.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define FASTMEM_MEMZONE_SIZE (128U << 20)
+
+/*
+ * Count memzones whose names begin with the fastmem prefix.
+ * Used to verify that rte_fastmem_reserve() really did reserve
+ * backing memzones.
+ */
+static int fastmem_memzone_count;
+
+static void
+count_fastmem_memzones_walk(const struct rte_memzone *mz, void *arg)
+{
+	RTE_SET_USED(arg);
+
+	if (strncmp(mz->name, "fastmem_", strlen("fastmem_")) == 0)
+		fastmem_memzone_count++;
+}
+
+static unsigned int
+count_fastmem_memzones(void)
+{
+	fastmem_memzone_count = 0;
+	rte_memzone_walk(count_fastmem_memzones_walk, NULL);
+	return fastmem_memzone_count;
+}
+
+static int
+test_init_deinit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	/* A subsequent init/deinit cycle must succeed. */
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "second rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_init_is_not_idempotent(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, -EBUSY,
+		"expected -EBUSY on re-init, got %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_deinit_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_max_size(void)
+{
+	size_t max;
+
+	max = rte_fastmem_max_size();
+	TEST_ASSERT(max >= (1U << 20),
+		"max_size=%zu below required 1 MiB minimum", max);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_small(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * A small reserve request (1 byte) must result in exactly
+	 * one memzone reservation: the internal rounding is to
+	 * memzone granularity.
+	 */
+	rc = rte_fastmem_reserve(1, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve() failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	rte_fastmem_deinit();
+
+	/* After deinit the memzones must be released. */
+	TEST_ASSERT_EQUAL(count_fastmem_memzones(), 0,
+		"%u fastmem memzones leaked after deinit",
+		count_fastmem_memzones());
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_multiple_memzones(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	size_t reserve_size;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * Request just over one memzone's worth; this must force
+	 * a second memzone to be reserved.
+	 */
+	reserve_size = FASTMEM_MEMZONE_SIZE + 1;
+	rc = rte_fastmem_reserve(reserve_size, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve(%zu) failed: %d",
+		reserve_size, rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 2,
+		"expected 2 new memzones for %zu-byte reserve, got %u",
+		reserve_size, after - before);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_cumulative(void)
+{
+	int socket_id;
+	unsigned int after_first, after_second;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	after_first = count_fastmem_memzones();
+
+	/*
+	 * A second call requesting the same amount that's already
+	 * reserved must not trigger any new memzone reservation.
+	 */
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "second reserve failed: %d", rc);
+
+	after_second = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after_first, after_second,
+		"reserve of already-reserved amount added memzones (%u -> %u)",
+		after_first, after_second);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_invalid_socket(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, RTE_MAX_NUMA_NODES);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for out-of-range socket, got %d", rc);
+
+	rc = rte_fastmem_reserve(1, -2);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for negative socket, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_without_init(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0,
+		"expected failure without init, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_any_socket(void)
+{
+	unsigned int before, after;
+	int rc;
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * SOCKET_ID_ANY should succeed on any system with at least
+	 * one configured socket. The allocator picks the caller's
+	 * socket first and falls back to other sockets if needed.
+	 */
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0,
+		"rte_fastmem_reserve(SOCKET_ID_ANY) failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 2 tests: allocation and free.
+ */
+
+static int
+test_alloc_too_big(void)
+{
+	void *p;
+	rte_errno = 0;
+	p = rte_fastmem_alloc(rte_fastmem_max_size() + 1, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc above max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_invalid_align(void)
+{
+	void *p;
+	rte_errno = 0;
+	p = rte_fastmem_alloc(16, 3, 0); /* 3 is not a power of 2 */
+	TEST_ASSERT_NULL(p, "alloc with align=3 returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, EINVAL,
+		"expected rte_errno=EINVAL, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_small(void)
+{
+	void *p;
+	p = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8) failed: rte_errno=%d", rte_errno);
+
+	/* Writing into the object must not crash. */
+	memset(p, 0xa5, 8);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_various_sizes(void)
+{
+	static const size_t sizes[] = {
+		1, 8, 16, 17, 63, 64, 128, 1024, 4096,
+		64 * 1024, 256 * 1024, 1024 * 1024,
+	};
+	void *ptrs[RTE_DIM(sizes)];
+	unsigned int i;
+	for (i = 0; i < RTE_DIM(sizes); i++) {
+		ptrs[i] = rte_fastmem_alloc(sizes[i], 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc(%zu) failed: rte_errno=%d",
+			sizes[i], rte_errno);
+		memset(ptrs[i], 0x5a, sizes[i]);
+	}
+
+	for (i = 0; i < RTE_DIM(sizes); i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_alignment(void)
+{
+	static const size_t aligns[] = {
+		8, 16, 64, 256, 4096, 65536,
+	};
+	unsigned int i;
+	for (i = 0; i < RTE_DIM(aligns); i++) {
+		void *p = rte_fastmem_alloc(1, aligns[i], 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=%zu) failed: rte_errno=%d",
+			aligns[i], rte_errno);
+		TEST_ASSERT((uintptr_t)p % aligns[i] == 0,
+			"pointer %p not aligned on %zu",
+			p, aligns[i]);
+		rte_fastmem_free(p);
+	}
+
+	/* Default (align=0) gives at least RTE_CACHE_LINE_SIZE. */
+	{
+		void *p = rte_fastmem_alloc(1, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=0) failed: rte_errno=%d", rte_errno);
+		TEST_ASSERT((uintptr_t)p % RTE_CACHE_LINE_SIZE == 0,
+			"default-align pointer %p not cache-line aligned",
+			p);
+		rte_fastmem_free(p);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_zero_flag(void)
+{
+	uint8_t *p;
+	unsigned int i;
+	bool all_zero = true;
+
+	/*
+	 * Dirty a slab first by allocating without F_ZERO, writing
+	 * a non-zero pattern, and freeing. A subsequent F_ZERO
+	 * allocation on the same slab must return zeroed memory.
+	 */
+	p = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "priming alloc failed");
+	memset(p, 0xff, 128);
+	rte_fastmem_free(p);
+
+	p = rte_fastmem_alloc(128, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_NOT_NULL(p, "F_ZERO alloc failed");
+	for (i = 0; i < 128; i++) {
+		if (p[i] != 0) {
+			all_zero = false;
+			break;
+		}
+	}
+	TEST_ASSERT(all_zero, "F_ZERO returned non-zero byte at offset %u", i);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_reuse(void)
+{
+	void *first, *second;
+	first = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(first, "first alloc failed");
+	rte_fastmem_free(first);
+
+	second = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(second, "second alloc failed");
+
+	/*
+	 * The slab's free list is LIFO, so the most recently freed
+	 * object is at the head of the list. A subsequent alloc in
+	 * the same class returns it.
+	 */
+	TEST_ASSERT_EQUAL(first, second,
+		"free + alloc did not reuse: first=%p second=%p",
+		first, second);
+
+	rte_fastmem_free(second);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_many_in_class(void)
+{
+	/*
+	 * Allocate more objects in one class than fit in a single
+	 * slab, forcing the bin to pull a second block. This
+	 * exercises the partial->full transition and the cross-slab
+	 * allocation path.
+	 */
+	enum { CLASS_SIZE = 8, COUNT = 300000 };
+	void **ptrs;
+	unsigned int i;
+
+	ptrs = calloc(COUNT, sizeof(*ptrs));
+	TEST_ASSERT_NOT_NULL(ptrs, "calloc for test ptrs failed");
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(CLASS_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d",
+			i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	free(ptrs);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket(void)
+{
+	void *p;
+	int socket_id;
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing(void)
+{
+	void *small, *large;
+
+	/*
+	 * Allocate and free a small object, forcing a block to be
+	 * assigned to the small class and then returned to the
+	 * free-block pool. A subsequent allocation in a different
+	 * class must be able to reuse that block.
+	 */
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+	rte_fastmem_free(small);
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large, "large alloc failed");
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing_no_growth(void)
+{
+	struct rte_fastmem_stats stats;
+	void *small, *large;
+	uint64_t after_small;
+	int rc;
+
+	/*
+	 * Stronger version of test_alloc_block_repurposing: assert
+	 * that the cross-class allocation does not grow the
+	 * backing memory (bytes_backing stays flat). Because the
+	 * free-block pool is shared across size classes — not
+	 * partitioned per class — the block freed from the small
+	 * class must serve the large allocation without triggering
+	 * a new memzone reservation.
+	 */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, (uint64_t)0,
+		"unexpected pre-alloc bytes_backing: %" PRIu64,
+		stats.bytes_backing);
+
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT(stats.bytes_backing > 0,
+		"bytes_backing did not grow on first alloc");
+	after_small = stats.bytes_backing;
+
+	rte_fastmem_free(small);
+	rte_fastmem_cache_flush();
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large,
+		"large alloc failed: rte_errno=%d", rte_errno);
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, after_small,
+		"cross-class alloc grew backing memory from %" PRIu64
+		" to %" PRIu64,
+		after_small, stats.bytes_backing);
+
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_null(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_free(NULL);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_content_integrity(void)
+{
+	/*
+	 * Allocate a batch of objects, fill each with a distinct
+	 * byte pattern, then verify none of the patterns overlap.
+	 * This catches header overwrites (slab header corrupted by
+	 * object access) and slot-overlap bugs (two pointers pointing
+	 * at overlapping slots).
+	 */
+	enum { N = 256, SIZE = 128 };
+	uint8_t *ptrs[N];
+	unsigned int i, j;
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)i, SIZE);
+	}
+
+	for (i = 0; i < N; i++)
+		for (j = 0; j < SIZE; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], (uint8_t)i,
+				"corruption at ptrs[%u][%u]: got 0x%x, want 0x%x",
+				i, j, ptrs[i][j], (uint8_t)i);
+
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_too_big(void)
+{
+	void *p;
+	/*
+	 * A small size with an alignment larger than the maximum
+	 * size class cannot be served. The class selected must be
+	 * large enough for the alignment, but no such class exists.
+	 */
+	rte_errno = 0;
+	p = rte_fastmem_alloc(1, rte_fastmem_max_size() * 2, 0);
+	TEST_ASSERT_NULL(p,
+		"alloc with align>max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_one(void)
+{
+	void *p;
+	/* align=1 is a valid power of 2 and must be accepted. */
+	p = rte_fastmem_alloc(8, 1, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8, 1) failed: rte_errno=%d",
+		rte_errno);
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket_numa_placement(void)
+{
+	void *p;
+	int socket_id;
+	struct rte_memseg *ms;
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	/*
+	 * Walk the memory to find the memseg for this pointer and
+	 * verify its socket. Skip the check if lookup fails (e.g.,
+	 * --no-huge mode may not populate memsegs for fastmem's
+	 * allocations in a way that rte_mem_virt2memseg can find).
+	 */
+	ms = rte_mem_virt2memseg(p, NULL);
+	if (ms != NULL) {
+		TEST_ASSERT_EQUAL(ms->socket_id, socket_id,
+			"alloc on socket %d landed on socket %d",
+			socket_id, ms->socket_id);
+	}
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Allocate from a socket different from the calling lcore's socket,
+ * triggering a cross-socket cache allocation. Then deinit to exercise
+ * the teardown path where a cache's backing memory lives on a
+ * different socket than the one it serves.
+ */
+static int
+test_alloc_cross_socket_deinit(void)
+{
+	int local_sid, remote_sid;
+	unsigned int i, n_sockets;
+	void *p;
+
+	local_sid = (int)rte_socket_id();
+	if (local_sid < 0 || (unsigned int)local_sid >= RTE_MAX_NUMA_NODES)
+		local_sid = rte_socket_id_by_idx(0);
+
+	n_sockets = rte_socket_count();
+	if (n_sockets < 2)
+		return TEST_SKIPPED;
+
+	/* Find a socket different from the local one. */
+	remote_sid = -1;
+	for (i = 0; i < n_sockets; i++) {
+		int sid = rte_socket_id_by_idx(i);
+		if (sid >= 0 && sid != local_sid) {
+			remote_sid = sid;
+			break;
+		}
+	}
+	if (remote_sid < 0)
+		return TEST_SKIPPED;
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, remote_sid);
+	TEST_ASSERT_NOT_NULL(p,
+		"cross-socket alloc(socket %d) failed: rte_errno=%d",
+		remote_sid, rte_errno);
+
+	rte_fastmem_free(p);
+
+	/*
+	 * Tear down and re-init to exercise deinit with cross-socket caches.
+	 */
+	rte_fastmem_deinit();
+
+	TEST_ASSERT_EQUAL(rte_fastmem_init(), 0,
+		"re-init after cross-socket deinit failed");
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 3 tests: per-lcore caches.
+ */
+
+static int
+test_cache_flush(void)
+{
+	void *p;
+	/*
+	 * Alloc and free one object, leaving it in the cache. Then
+	 * flush and check that a follow-up alloc still works. The
+	 * test deliberately does not assert whether the follow-up
+	 * alloc returns the same pointer; it only checks that the
+	 * flush neither crashes nor breaks later allocations.
+	 */
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "first alloc failed");
+	rte_fastmem_free(p);
+
+	rte_fastmem_cache_flush();
+
+	/* Flush again — must be idempotent. */
+	rte_fastmem_cache_flush();
+
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "post-flush alloc failed");
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_cache_flush();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_exceeds_capacity(void)
+{
+	/*
+	 * Free more objects at a single size class than the cache
+	 * capacity (64 for classes <= 4 KiB). This forces the
+	 * cache-drain slow path and verifies no corruption.
+	 */
+	enum { COUNT = 200, SIZE = 64 };
+	void *ptrs[COUNT];
+	unsigned int i;
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	/* Re-allocating the same count should still work. */
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"re-alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct non_eal_args {
+	int ok;
+	char pad[64];
+};
+
+static uint32_t
+non_eal_thread_main(void *arg)
+{
+	struct non_eal_args *args = arg;
+	uint8_t *p;
+
+	p = rte_fastmem_alloc(128, 0, 0);
+	if (p == NULL)
+		return 1;
+
+	memset(p, 0x7e, 128);
+
+	rte_fastmem_free(p);
+
+	args->ok = 1;
+	return 0;
+}
+
+static int
+test_non_eal_thread(void)
+{
+	rte_thread_t thread_id;
+	struct non_eal_args args = { 0 };
+	int rc;
+
+	rc = rte_thread_create(&thread_id, NULL, non_eal_thread_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(args.ok, 1,
+		"non-EAL thread did not complete alloc/free successfully");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_returns_memory(void)
+{
+	/*
+	 * When an entire slab's worth of objects is freed, the
+	 * slab's block is returned to the free-block pool and can
+	 * be reassigned to another size class. Verify the cache
+	 * does not permanently hold objects that prevent this.
+	 *
+	 * Allocate enough objects in one class to force multiple
+	 * slabs, free them all, then flush the cache. After the
+	 * flush, all cached objects are drained to their bins and
+	 * empty slabs are returned to the block pool.
+	 */
+	enum { N = 200, SIZE = 64 };
+	void *ptrs[N];
+	unsigned int i;
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rte_fastmem_cache_flush();
+
+	/*
+	 * An allocation in a completely different class should
+	 * succeed now, having access to any blocks freed by the
+	 * flush.
+	 */
+	{
+		void *other = rte_fastmem_alloc(65536, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(other,
+			"post-flush cross-class alloc failed");
+		rte_fastmem_free(other);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_basic(void)
+{
+	enum { N = 32 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	/* Verify all pointers are non-NULL and distinct. */
+	for (unsigned int i = 0; i < N; i++) {
+		TEST_ASSERT_NOT_NULL(ptrs[i], "ptrs[%u] is NULL", i);
+		for (unsigned int j = 0; j < i; j++)
+			TEST_ASSERT(ptrs[i] != ptrs[j],
+				"ptrs[%u] == ptrs[%u]", i, j);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_zero_flag(void)
+{
+	enum { N = 8, SIZE = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, SIZE, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	for (unsigned int i = 0; i < N; i++) {
+		uint8_t *p = ptrs[i];
+
+		for (unsigned int b = 0; b < SIZE; b++)
+			TEST_ASSERT_EQUAL(p[b], 0,
+				"ptrs[%u][%u] != 0", i, b);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_exceeds_cache(void)
+{
+	/* Allocate more than cache capacity (64) in one bulk call. */
+	enum { N = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk(%u) failed: %d", N, rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_socket(void)
+{
+	enum { N = 16 };
+	void *ptrs[N];
+	int socket_id;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no sockets");
+
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* SOCKET_ID_ANY */
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket(ANY) failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_bulk(void)
+{
+	enum { N = 64 };
+	void *ptrs[N];
+	/* Allocate individually, free in bulk. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* Verify memory is reusable. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "re-alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_classes(void)
+{
+	size_t sizes[32];
+	unsigned int n;
+
+	n = rte_fastmem_classes(NULL);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+
+	n = rte_fastmem_classes(sizes);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+	TEST_ASSERT_EQUAL(sizes[0], (size_t)8, "class 0 != 8");
+	TEST_ASSERT_EQUAL(sizes[n - 1], (size_t)(1 << 20),
+		"last class != 1 MiB");
+
+	for (unsigned int i = 0; i < n; i++) {
+		TEST_ASSERT(sizes[i] != 0 && (sizes[i] & (sizes[i] - 1)) == 0,
+			"class %u size %zu not power of 2", i, sizes[i]);
+		if (i > 0)
+			TEST_ASSERT(sizes[i] > sizes[i - 1],
+				"classes not ascending at %u", i);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_class(void)
+{
+	enum { N = 10 };
+	struct rte_fastmem_class_stats cs;
+	void *ptrs[N];
+	int rc;
+
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.class_size, (size_t)64, "wrong class_size");
+	TEST_ASSERT(cs.alloc_cache_hits + cs.alloc_cache_misses == N,
+		"alloc count != N: hits=%" PRIu64 " misses=%" PRIu64,
+		cs.alloc_cache_hits, cs.alloc_cache_misses);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)N, "in_use != N");
+
+	for (unsigned int i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class after free failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)0, "in_use != 0 after free");
+
+	/* Invalid class size. */
+	rc = rte_fastmem_stats_class(13, &cs);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad size");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore(void)
+{
+	struct rte_fastmem_lcore_stats ls;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore failed: %d", rc);
+	TEST_ASSERT(ls.alloc_cache_hits + ls.alloc_cache_misses > 0,
+		"no alloc activity on this lcore");
+
+	rte_fastmem_free(ptr);
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore after free failed: %d", rc);
+	TEST_ASSERT(ls.free_cache_hits + ls.free_cache_misses > 0,
+		"no free activity on this lcore");
+
+	/* Invalid lcore. */
+	rc = rte_fastmem_stats_lcore(RTE_MAX_LCORE, &ls);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad lcore");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore_class(void)
+{
+	struct rte_fastmem_lcore_class_stats lcs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(256, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore_class(rte_lcore_id(), 256, &lcs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(lcs.class_size, (size_t)256, "wrong class_size");
+	TEST_ASSERT(lcs.alloc_cache_hits + lcs.alloc_cache_misses > 0,
+		"no alloc activity");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_reset(void)
+{
+	struct rte_fastmem_stats gs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+	rte_fastmem_free(ptr);
+
+	rte_fastmem_stats_reset();
+
+	rc = rte_fastmem_stats(&gs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(gs.alloc_total, (uint64_t)0,
+		"alloc_total not zero after reset");
+	TEST_ASSERT_EQUAL(gs.free_total, (uint64_t)0,
+		"free_total not zero after reset");
+
+	return TEST_SUCCESS;
+}
+
+
+#define MIXED_LONG_LIVED_COUNT 25
+#define MIXED_SHORT_LIVED_ITERS 1000
+#define MIXED_MIN_LCORES 3u
+
+static const size_t mixed_long_sizes[] = { 64, 256, 4096 };
+static const size_t mixed_short_sizes[] = { 8, 16, 32, 64, 128, 256, 512, 1024 };
+
+struct mixed_worker_args {
+	uint32_t seed;
+	int result;
+};
+
+static uint32_t
+xorshift32(uint32_t *state)
+{
+	uint32_t x = *state;
+
+	x ^= x << 13;
+	x ^= x >> 17;
+	x ^= x << 5;
+	*state = x;
+	return x;
+}
+
+static int
+mixed_worker(void *arg)
+{
+	struct mixed_worker_args *args = arg;
+	uint32_t seed = args->seed;
+	void *long_lived[MIXED_LONG_LIVED_COUNT];
+	size_t long_sizes[MIXED_LONG_LIVED_COUNT];
+	unsigned int i;
+
+	/* Allocate long-lived objects of mixed sizes. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		long_sizes[i] = mixed_long_sizes[i % RTE_DIM(mixed_long_sizes)];
+		long_lived[i] = rte_fastmem_alloc(long_sizes[i], 0, 0);
+		if (long_lived[i] == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(long_lived[i], (int)(i + 1), long_sizes[i]);
+	}
+
+	/* Rapidly cycle short-lived objects. */
+	for (i = 0; i < MIXED_SHORT_LIVED_ITERS; i++) {
+		size_t sz = mixed_short_sizes[xorshift32(&seed) %
+					      RTE_DIM(mixed_short_sizes)];
+		uint8_t pattern = (uint8_t)(i & 0xff);
+		uint8_t *p;
+
+		p = rte_fastmem_alloc(sz, 0, 0);
+		if (p == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(p, pattern, sz);
+
+		/* Verify before freeing. */
+		for (size_t j = 0; j < sz; j++) {
+			if (p[j] != pattern) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(p);
+	}
+
+	/* Verify long-lived objects are still intact. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		uint8_t *bytes = long_lived[i];
+		uint8_t expected = (uint8_t)(i + 1);
+
+		for (size_t j = 0; j < long_sizes[i]; j++) {
+			if (bytes[j] != expected) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(long_lived[i]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_mixed_lifetimes_multi_lcore(void)
+{
+	struct mixed_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int count = 0;
+	struct rte_fastmem_stats stats;
+	int rc;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		count++;
+
+	if (count < MIXED_MIN_LCORES) {
+		printf("Not enough worker lcores (%u < %u), skipping\n",
+		       count, MIXED_MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	/* Launch workers with distinct seeds. */
+	uint32_t seed = 0xdeadbeef;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].seed = seed;
+		args[lcore_id].result = TEST_FAILED;
+		seed += 0x12345678;
+		rte_eal_remote_launch(mixed_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	/* Check all workers succeeded. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	/* Verify no memory leak. */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero after test: %" PRIu64,
+		stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+
+/*
+ * Memory limit tests.
+ *
+ * FASTMEM_MEMZONE_SIZE is 128 MiB. We use a limit of 128 MiB
+ * (one memzone) for most tests, and large objects (256 KiB) to
+ * exhaust slabs quickly.
+ */
+
+#define LIMIT_ONE_MZ ((size_t)128 << 20)
+#define LIMIT_OBJ_SIZE ((size_t)256 * 1024)
+
+static int
+test_memory_limit_basic(void)
+{
+	int rc;
+
+	rc = rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+	TEST_ASSERT_EQUAL(rc, 0, "set_limit failed: %d", rc);
+
+	const size_t got = rte_fastmem_get_limit(rte_socket_id_by_idx(0));
+	TEST_ASSERT_EQUAL(got, LIMIT_ONE_MZ,
+		"get_limit mismatch: %zu", got);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ + 1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "second reserve should have failed");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_exhaustion(void)
+{
+	const unsigned int max_ptrs = 1024;
+	void *ptrs[max_ptrs];
+	unsigned int count = 0;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < max_ptrs; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+
+	TEST_ASSERT(count > 0, "should have allocated at least one");
+	TEST_ASSERT(count < max_ptrs, "should have hit the limit");
+	TEST_ASSERT_EQUAL(rte_errno, ENOMEM, "expected ENOMEM, got %d", rte_errno);
+
+	rte_fastmem_free(ptrs[count - 1]);
+	void *p = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc after free should succeed");
+	rte_fastmem_free(p);
+
+	for (unsigned int i = 0; i < count - 1; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_zero_blocks_growth(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "reserve with limit=0 should fail");
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc with limit=0 should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_below_current(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve failed: %d", rc);
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 1);
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc from existing backing should work");
+	rte_fastmem_free(p);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ * 2, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "growth beyond limit should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_socket_id_any(void)
+{
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 42);
+
+	for (unsigned int i = 0; i < rte_socket_count(); i++) {
+		const int sid = rte_socket_id_by_idx(i);
+		const size_t lim = rte_fastmem_get_limit(sid);
+
+		TEST_ASSERT_EQUAL(lim, (size_t)42,
+			"socket %d limit mismatch: %zu", sid, lim);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_unlimited(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+	rte_fastmem_set_limit(SOCKET_ID_ANY, SIZE_MAX);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve after reset failed: %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_integrity_under_oom(void)
+{
+	const unsigned int n = 128;
+	const size_t obj_size = 1024;
+	uint8_t *ptrs[n];
+	const unsigned int extra_max = 1024;
+	void *extra[extra_max];
+	unsigned int n_extra = 0;
+	unsigned int i;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = rte_fastmem_alloc(obj_size, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)(i & 0xff), obj_size);
+	}
+
+	/* Exhaust remaining backing with large objects. */
+	for (n_extra = 0; n_extra < extra_max; n_extra++) {
+		extra[n_extra] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (extra[n_extra] == NULL)
+			break;
+	}
+
+	/* Verify original objects are intact. */
+	for (i = 0; i < n; i++) {
+		const uint8_t expected = (uint8_t)(i & 0xff);
+		for (unsigned int j = 0; j < obj_size; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], expected,
+				"corruption at [%u][%u]", i, j);
+	}
+
+	for (i = 0; i < n; i++)
+		rte_fastmem_free(ptrs[i]);
+	for (i = 0; i < n_extra; i++)
+		rte_fastmem_free(extra[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_bulk_alloc_oom(void)
+{
+	const unsigned int bulk_n = 64;
+	const unsigned int drain_max = 512;
+	void *ptrs[bulk_n];
+	void *drain[drain_max];
+	unsigned int drained = 0;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (drained = 0; drained < drain_max; drained++) {
+		drain[drained] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (drain[drained] == NULL)
+			break;
+	}
+
+	/* Free a few — enough for some but not bulk_n objects. */
+	const unsigned int freed = RTE_MIN(drained, 4u);
+	for (unsigned int i = 0; i < freed; i++)
+		rte_fastmem_free(drain[--drained]);
+
+	rc = rte_fastmem_alloc_bulk(ptrs, bulk_n, LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT(rc < 0, "bulk alloc should fail");
+
+	for (unsigned int i = 0; i < drained; i++)
+		rte_fastmem_free(drain[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_recovery_after_free(void)
+{
+	const unsigned int max_ptrs = 512;
+	void *ptrs[max_ptrs];
+	unsigned int count = 0;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < max_ptrs; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+	TEST_ASSERT(count > 0 && count < max_ptrs,
+		"expected partial fill, got %u", count);
+
+	const unsigned int half = count / 2;
+	for (unsigned int i = 0; i < half; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	for (unsigned int i = 0; i < half; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "recovery alloc[%u] failed", i);
+	}
+
+	for (unsigned int i = 0; i < count; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct limit_worker_args {
+	unsigned int alloc_count;
+	int result;
+};
+
+static int
+limit_worker(void *arg)
+{
+	struct limit_worker_args *args = arg;
+	const unsigned int max_ptrs = 128;
+	void *ptrs[max_ptrs];
+	unsigned int i;
+
+	args->alloc_count = 0;
+
+	for (i = 0; i < max_ptrs; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[i] == NULL)
+			break;
+		memset(ptrs[i], 0xab, LIMIT_OBJ_SIZE);
+		args->alloc_count++;
+	}
+
+	for (unsigned int j = 0; j < args->alloc_count; j++) {
+		uint8_t *bytes = ptrs[j];
+		for (size_t k = 0; k < LIMIT_OBJ_SIZE; k++) {
+			if (bytes[k] != 0xab) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(ptrs[j]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_memory_limit_multi_lcore_oom(void)
+{
+	struct limit_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int worker_count = 0;
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		worker_count++;
+
+	if (worker_count < 2) {
+		printf("Not enough workers (%u < 2), skipping\n", worker_count);
+		return TEST_SKIPPED;
+	}
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].result = TEST_FAILED;
+		rte_eal_remote_launch(limit_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	struct rte_fastmem_stats stats;
+	TEST_ASSERT_EQUAL(rte_fastmem_stats(&stats), 0, "stats failed");
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero: %" PRIu64, stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+static int
+fastmem_setup(void)
+{
+	return rte_fastmem_init();
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_deinit();
+}
+
+static struct unit_test_suite fastmem_lifecycle_testsuite = {
+	.suite_name = "fastmem lifecycle tests",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_init_deinit),
+		TEST_CASE(test_init_is_not_idempotent),
+		TEST_CASE(test_deinit_without_init),
+		TEST_CASE(test_max_size),
+		TEST_CASE(test_reserve_without_init),
+		TEST_CASE(test_cache_flush_without_init),
+		TEST_CASE(test_classes),
+		TEST_CASES_END()
+	}
+};
+
+static struct unit_test_suite fastmem_functional_testsuite = {
+	.suite_name = "fastmem functional tests",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_multiple_memzones),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_cumulative),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_invalid_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_any_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_invalid_align),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_various_sizes),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_alignment),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_reuse),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_many_in_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing_no_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_null),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_content_integrity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_one),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket_numa_placement),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_cross_socket_deinit),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_exceeds_capacity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_non_eal_thread),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush_returns_memory),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_exceeds_cache),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_bulk),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_reset),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_mixed_lifetimes_multi_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_exhaustion),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_zero_blocks_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_below_current),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_socket_id_any),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_unlimited),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_integrity_under_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_bulk_alloc_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_recovery_after_free),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_multi_lcore_oom),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_fastmem(void)
+{
+	int rc;
+
+	rc = unit_test_suite_runner(&fastmem_lifecycle_testsuite);
+	if (rc != 0)
+		return rc;
+
+	return unit_test_suite_runner(&fastmem_functional_testsuite);
+}
+
+REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_SKIP, ASAN_OK, test_fastmem);
diff --git a/app/test/test_fastmem_perf.c b/app/test/test_fastmem_perf.c
new file mode 100644
index 0000000000..73c0a4c6ce
--- /dev/null
+++ b/app/test/test_fastmem_perf.c
@@ -0,0 +1,1040 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_malloc.h>
+#include <rte_mempool.h>
+#include <rte_stdatomic.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define TEST_LOG(...) printf(__VA_ARGS__)
+
+static const size_t SIZES[] = { 8, 64, 256, 1024, 4096 };
+#define N_SIZES RTE_DIM(SIZES)
+
+/* Number of ops for warmup and measurement. */
+#define WARMUP_OPS 20000u
+#define MEASURE_OPS 2000000u
+
+/* Buffer for scenarios that allocate N then free N. */
+#define BATCH_N 256
+
+/*
+ * Allocator vtable: a thin adapter exposing alloc / free /
+ * per-allocator setup/teardown. Each scenario calls these
+ * indirectly so the same timing loop serves all allocators.
+ */
+struct allocator {
+	const char *name;
+	int (*setup)(size_t size, unsigned int n_max);
+	void (*teardown)(void);
+	void *(*alloc)(void);
+	void (*free_obj)(void *ptr);
+	int (*alloc_bulk)(void **ptrs, unsigned int n);
+	void (*free_bulk)(void **ptrs, unsigned int n);
+};
+
+/* Fastmem adapter -------------------------------------------------- */
+
+static size_t fastmem_size;
+
+static int
+fastmem_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	fastmem_size = size;
+	return 0;
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_alloc(void)
+{
+	return rte_fastmem_alloc(fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free(void *ptr)
+{
+	rte_fastmem_free(ptr);
+}
+
+/* Mempool adapter -------------------------------------------------- */
+
+static struct rte_mempool *mempool_pool;
+
+static int
+mempool_setup(size_t size, unsigned int n_max)
+{
+	char name[RTE_MEMPOOL_NAMESIZE];
+	unsigned int cache_size;
+
+	/*
+	 * Pool size must accommodate the full batch burst plus
+	 * per-lcore cache capacity. Use mempool's maximum cache
+	 * size so we're measuring its cached hot path.
+	 */
+	cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE;
+
+	snprintf(name, sizeof(name), "fmperf_mp_%zu", size);
+	mempool_pool = rte_mempool_create(name, n_max + cache_size * 2,
+			size, cache_size, 0, NULL, NULL, NULL, NULL,
+			SOCKET_ID_ANY, 0);
+	if (mempool_pool == NULL) {
+		TEST_LOG("mempool_create(%zu) failed\n", size);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+mempool_teardown(void)
+{
+	rte_mempool_free(mempool_pool);
+	mempool_pool = NULL;
+}
+
+static void * __rte_noinline
+mempool_alloc_one(void)
+{
+	void *obj = NULL;
+
+	if (rte_mempool_get(mempool_pool, &obj) < 0)
+		return NULL;
+	return obj;
+}
+
+static void __rte_noinline
+mempool_free_one(void *ptr)
+{
+	rte_mempool_put(mempool_pool, ptr);
+}
+
+/* rte_malloc adapter ----------------------------------------------- */
+
+static size_t malloc_size;
+
+static int
+malloc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	malloc_size = size;
+	return 0;
+}
+
+static void
+malloc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+malloc_alloc(void)
+{
+	return rte_malloc(NULL, malloc_size, 0);
+}
+
+static void __rte_noinline
+malloc_free(void *ptr)
+{
+	rte_free(ptr);
+}
+
+/* libc (glibc) malloc adapter -------------------------------------- */
+
+static size_t libc_size;
+
+static int
+libc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	/*
+	 * Round the size up to a multiple of the cache line size:
+	 * aligned_alloc() requires that, and cache-line alignment
+	 * matches the other allocators' default alignment
+	 * guarantees, keeping the comparison honest.
+	 */
+	libc_size = RTE_ALIGN_CEIL(size, RTE_CACHE_LINE_SIZE);
+	return 0;
+}
+
+static void
+libc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+libc_alloc(void)
+{
+	return aligned_alloc(RTE_CACHE_LINE_SIZE, libc_size);
+}
+
+static void __rte_noinline
+libc_free(void *ptr)
+{
+	free(ptr);
+}
+
+/* Bulk adapters ---------------------------------------------------- */
+
+static int __rte_noinline
+fastmem_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_alloc_bulk(ptrs, n, fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_free_bulk(ptrs, n);
+}
+
+/* Fastmem handle adapter ------------------------------------------- */
+
+static rte_fastmem_handle_t fastmem_handle;
+
+static int
+fastmem_h_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	return rte_fastmem_hlookup(size, 0, rte_socket_id(), &fastmem_handle);
+}
+
+static void
+fastmem_h_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_h_alloc(void)
+{
+	return rte_fastmem_halloc(fastmem_handle, 0);
+}
+
+static void __rte_noinline
+fastmem_h_free(void *ptr)
+{
+	rte_fastmem_hfree(fastmem_handle, ptr);
+}
+
+static int __rte_noinline
+fastmem_h_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_halloc_bulk(fastmem_handle, ptrs, n, 0);
+}
+
+static void __rte_noinline
+fastmem_h_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_hfree_bulk(fastmem_handle, ptrs, n);
+}
+
+/* Mempool, rte_malloc and libc bulk adapters ----------------------- */
+
+static int __rte_noinline
+mempool_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_mempool_get_bulk(mempool_pool, ptrs, n);
+}
+
+static void __rte_noinline
+mempool_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_mempool_put_bulk(mempool_pool, ptrs, n);
+}
+
+static int __rte_noinline
+generic_alloc_bulk(void **ptrs, unsigned int n, void *(*alloc_fn)(void))
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = alloc_fn();
+		if (ptrs[i] == NULL)
+			return -1;
+	}
+	return 0;
+}
+
+static int __rte_noinline
+malloc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, malloc_alloc);
+}
+
+static void __rte_noinline
+malloc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		malloc_free(ptrs[i]);
+}
+
+static int __rte_noinline
+libc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, libc_alloc);
+}
+
+static void __rte_noinline
+libc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		libc_free(ptrs[i]);
+}
+
+/* Adapter table ---------------------------------------------------- */
+
+static const struct allocator allocators[] = {
+	{ "fastmem",    fastmem_setup,   fastmem_teardown,   fastmem_alloc,     fastmem_free,     fastmem_alloc_bulk,   fastmem_free_bulk },
+	{ "fastmem_h",  fastmem_h_setup, fastmem_h_teardown, fastmem_h_alloc,   fastmem_h_free,   fastmem_h_alloc_bulk, fastmem_h_free_bulk },
+	{ "mempool",    mempool_setup,   mempool_teardown,   mempool_alloc_one, mempool_free_one, mempool_alloc_bulk,   mempool_free_bulk },
+	{ "rte_malloc", malloc_setup,    malloc_teardown,    malloc_alloc,      malloc_free,      malloc_alloc_bulk,    malloc_free_bulk },
+	{ "libc",       libc_setup,      libc_teardown,      libc_alloc,        libc_free,        libc_alloc_bulk,      libc_free_bulk },
+};
+#define N_ALLOCATORS RTE_DIM(allocators)
+
+/*
+ * Scenario 1: tight alloc+free loop. A single object is cycled
+ * repeatedly. The LIFO path keeps the same pointer hot, giving
+ * a best-case measurement.
+ */
+static double
+run_tight(const struct allocator *alloc, size_t size)
+{
+	void *p;
+	uint64_t tsc;
+	unsigned int i;
+
+	if (alloc->setup(size, 1) < 0)
+		return -1.0;
+
+	/* Warmup. */
+	for (i = 0; i < WARMUP_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < MEASURE_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+	tsc = rte_rdtsc_precise() - tsc;
+
+	alloc->teardown();
+
+	return (double)tsc / MEASURE_OPS;
+err:
+	alloc->teardown();
+	return -1.0;
+}
+
+/*
+ * Scenario 2: allocate N, free N (FIFO free order). Exercises
+ * cache refill and drain paths when N exceeds cache capacity.
+ */
+static void
+run_batch(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	/* Pick iteration count so total ops ~= MEASURE_OPS. */
+	iters = MEASURE_OPS / BATCH_N;
+
+	/* Warmup. */
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 3: allocate N, free N in reverse order.
+ */
+static void
+run_batch_reverse(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	iters = MEASURE_OPS / BATCH_N;
+
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 4: multi-lcore alloc/work/free with a dummy-work
+ * baseline. Each worker runs a tight alloc → touch → free loop
+ * on its own lcore. A second run with the same dummy work but
+ * no allocator traffic establishes a baseline; the per-op
+ * allocator cost is reported as (alloc_run - baseline_run).
+ *
+ * Fixed size class and a fixed amount of dummy work per op —
+ * this scenario sweeps lcore count rather than size.
+ */
+#define MULTI_SIZE 256u
+#define MULTI_WORK_BYTES 64u
+#define MULTI_WORK_PASSES 8u   /* RMW passes over the work region. */
+#define MULTI_OPS 200000u
+#define MULTI_WARMUP 2000u
+#define MAX_MULTI_LCORES 32u
+
+/*
+ * Per-worker volatile sink. Each worker writes to its own
+ * slot, preventing dead-code elimination of touch_buffer() and
+ * avoiding cross-lcore cache-line sharing on the hot path.
+ * Padded to cache-line stride to prevent false sharing between
+ * neighboring workers' slots.
+ */
+struct worker_sink {
+	volatile uint64_t value;
+} __rte_cache_aligned;
+
+static struct worker_sink worker_sinks[RTE_MAX_LCORE];
+
+/*
+ * Out-of-line dummy workload: run MULTI_WORK_PASSES
+ * read-modify-write passes over the first 'bytes' of the
+ * buffer. Each pass reads what the previous pass wrote, so the
+ * compiler cannot unroll or parallelize across passes — the
+ * work scales linearly with MULTI_WORK_PASSES. Returns an
+ * accumulator so the caller can feed it into a volatile sink;
+ * without that, the compiler could elide the whole function.
+ *
+ * __rte_noinline so it looks identical to the compiler in both
+ * the baseline (pre-allocated scratch buffer) and alloc-path
+ * runs, making the cycle-delta subtraction valid.
+ *
+ * The purpose of this being tunably expensive is to keep
+ * worker-per-iteration cost high relative to the allocator's
+ * critical section, so that even serialized allocators like
+ * rte_malloc spend most of their time outside the lock and the
+ * measured per-op allocator cost reflects its own work rather
+ * than its contention queue.
+ */
+static uint64_t __rte_noinline
+touch_buffer(void *buf, size_t bytes)
+{
+	uint64_t *p = buf;
+	size_t n = bytes / sizeof(uint64_t);
+	uint64_t acc = 0;
+	unsigned int pass;
+	size_t i;
+
+	/* Prime the buffer with a known pattern. */
+	for (i = 0; i < n; i++)
+		p[i] = i * 0x9E3779B97F4A7C15ULL;
+
+	/*
+	 * Dependent RMW passes: each pass reads p[i] written by
+	 * the previous pass, mixes the pass index in, and writes
+	 * back. The XOR into acc keeps the chain live.
+	 */
+	for (pass = 0; pass < MULTI_WORK_PASSES; pass++) {
+		for (i = 0; i < n; i++) {
+			uint64_t v = p[i];
+
+			v = v * 0xC2B2AE3D27D4EB4FULL + pass;
+			v ^= v >> 33;
+			p[i] = v;
+			acc ^= v;
+		}
+	}
+
+	return acc;
+}
+
+struct worker_args {
+	const struct allocator *alloc;
+	void *scratch;            /* baseline only; NULL => alloc path */
+	unsigned int iters;
+	unsigned int warmup;
+	unsigned int bulk_n;      /* 0 = single-object, >0 = bulk */
+	RTE_ATOMIC(bool) start_flag; /* barrier at worker entry */
+	uint64_t cycles;          /* out */
+	unsigned int ops;         /* out */
+	int err;                  /* out */
+};
+
+static int
+worker_run(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	/* Wait for start flag (spin-barrier set by main). */
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				return -1;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+
+	/* Measured loop. */
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				break;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i;
+
+	/* Publish accumulator to defeat dead-code elimination. */
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+static int
+worker_run_bulk(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	void *ptrs[BATCH_N];
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i, j;
+	unsigned int bulk_n = wa->bulk_n;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			return -1;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			break;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i * bulk_n;
+
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+/*
+ * Launch workers on the first 'n_workers' worker lcores, run
+ * either the baseline (scratch != NULL) or the alloc path
+ * (scratch == NULL), and return the mean per-op cycle cost
+ * averaged across participating workers.
+ *
+ * On any worker error, returns -1.0.
+ */
+static double
+run_multi_workers(const struct allocator *alloc, unsigned int n_workers,
+		void *const *scratches, unsigned int bulk_n)
+{
+	struct worker_args wargs[RTE_MAX_LCORE];
+	unsigned int worker_lcores[MAX_MULTI_LCORES];
+	unsigned int n = 0;
+	unsigned int lcore_id;
+	unsigned int i;
+	lcore_function_t *fn = bulk_n > 0 ? worker_run_bulk : worker_run;
+
+	/* Collect the first n_workers worker lcores. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		if (n >= n_workers)
+			break;
+		worker_lcores[n++] = lcore_id;
+	}
+	if (n < n_workers)
+		return -1.0;
+
+	/* Prepare per-worker args. */
+	for (i = 0; i < n_workers; i++) {
+		struct worker_args *wa = &wargs[worker_lcores[i]];
+
+		wa->alloc = alloc;
+		wa->scratch = scratches != NULL ? scratches[i] : NULL;
+		wa->iters = MULTI_OPS;
+		wa->warmup = MULTI_WARMUP;
+		wa->bulk_n = bulk_n;
+		rte_atomic_store_explicit(&wa->start_flag, false,
+				rte_memory_order_relaxed);
+	}
+
+	/* Launch workers. They spin on start_flag until released. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_remote_launch(fn, &wargs[worker_lcores[i]],
+				worker_lcores[i]);
+
+	/* Release all workers roughly simultaneously. */
+	for (i = 0; i < n_workers; i++)
+		rte_atomic_store_explicit(
+			&wargs[worker_lcores[i]].start_flag, true,
+			rte_memory_order_release);
+
+	/* Wait for completion. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_wait_lcore(worker_lcores[i]);
+
+	/* Aggregate: mean cycles per op across workers. */
+	{
+		double sum_cycles_per_op = 0.0;
+		unsigned int n_ok = 0;
+
+		for (i = 0; i < n_workers; i++) {
+			struct worker_args *wa = &wargs[worker_lcores[i]];
+
+			if (wa->err != 0 || wa->ops == 0)
+				return -1.0;
+			sum_cycles_per_op +=
+				(double)wa->cycles / (double)wa->ops;
+			n_ok++;
+		}
+		return sum_cycles_per_op / n_ok;
+	}
+}
+
+/*
+ * One sub-run of Scenario 4: given an allocator and a worker
+ * count, return (baseline, alloc_path) mean cycles per op.
+ */
+static void
+run_multi_lcore(const struct allocator *alloc, unsigned int n_workers,
+		unsigned int bulk_n, double *baseline, double *alloc_path)
+{
+	void *scratches[MAX_MULTI_LCORES] = {0};
+	unsigned int n_alloced = 0;
+	unsigned int i;
+
+	*baseline = -1.0;
+	*alloc_path = -1.0;
+
+	if (alloc->setup(MULTI_SIZE, n_workers * 64) < 0)
+		return;
+
+	/* Baseline: pre-allocate one scratch per worker. */
+	for (i = 0; i < n_workers; i++) {
+		scratches[i] = alloc->alloc();
+		if (scratches[i] == NULL)
+			goto err;
+		n_alloced++;
+	}
+
+	*baseline = run_multi_workers(alloc, n_workers, scratches, 0);
+
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	n_alloced = 0;
+
+	/* Alloc path: workers alloc+free each iter. */
+	*alloc_path = run_multi_workers(alloc, n_workers, NULL, bulk_n);
+
+	alloc->teardown();
+	return;
+err:
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	alloc->teardown();
+}
+
+/* Reporting -------------------------------------------------------- */
+
+static void
+print_header(const char *title)
+{
+	size_t i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < N_SIZES; i++)
+		TEST_LOG(" %10zu B", SIZES[i]);
+	TEST_LOG("\n");
+}
+
+static void
+print_row(const char *name, const double *values)
+{
+	size_t i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < N_SIZES; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %12s", "--");
+		else
+			TEST_LOG(" %12.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_header(const char *title, const unsigned int *lcore_counts,
+		unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < n_counts; i++)
+		TEST_LOG(" %8u lcore%c", lcore_counts[i],
+				lcore_counts[i] == 1 ? ' ' : 's');
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_row(const char *name, const double *values, unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < n_counts; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %14s", "--");
+		else
+			TEST_LOG(" %14.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+/* Driver ----------------------------------------------------------- */
+
+static int
+test_fastmem_perf(void)
+{
+	size_t i;
+	size_t a;
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	TEST_LOG("\nfastmem performance — single-lcore, fixed-size\n");
+	TEST_LOG("All numbers are TSC cycles.\n");
+
+	/* Scenario 1: tight alloc+free. */
+	print_header("Scenario 1: Single-object hot path — cycles per (alloc + free)");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			vals[i] = run_tight(&allocators[a], SIZES[i]);
+		print_row(allocators[a].name, vals);
+	}
+
+	/* Scenario 2: batched, FIFO free. */
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 3: batched, reverse free. */
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 4: multi-lcore alloc/work/free with baseline. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		/* Sweep lcore counts: 1, 2, 4, 8, ... up to max_workers. */
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		/* Ensure max_workers is the final column if not power of two. */
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 4 (Multi-lcore contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 4 parameters: size=%u B\n",
+				MULTI_SIZE);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a], lcore_counts[c],
+							0, &base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 4: Multi-lcore contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	/* Scenario 5: multi-lcore bulk alloc/work/free. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+		unsigned int bulk_n = 8;
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 5 (Multi-lcore bulk contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 5 parameters: size=%u B, "
+				"bulk=%u\n",
+				MULTI_SIZE, bulk_n);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a],
+							lcore_counts[c], bulk_n,
+							&base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 5: Multi-lcore bulk contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	TEST_LOG("\n");
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_perf_autotest, test_fastmem_perf);
diff --git a/app/test/test_fastmem_profile.c b/app/test/test_fastmem_profile.c
new file mode 100644
index 0000000000..9a5dc94018
--- /dev/null
+++ b/app/test/test_fastmem_profile.c
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+/*
+ * A minimal fastmem workload intended for use with perf record /
+ * perf report. Runs a tight alloc/free loop for a fixed duration
+ * so that sampling profilers can attribute cycles to individual
+ * functions and instructions within the fastmem hot path.
+ *
+ * Usage:
+ *   perf record -g -- dpdk-test --no-huge --no-pci -m 8192 \
+ *       -l 0 <<< fastmem_profile_cache_hit_autotest
+ *   perf report
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+/* Duration of each sub-test in TSC cycles (3 seconds at any TSC rate). */
+#define PROFILE_DURATION_CYCLES (3ULL * rte_get_tsc_hz())
+
+/* Allocation size for the profiling workload. */
+#define PROFILE_SIZE 256u
+
+/*
+ * Sub-test 1: tight alloc+free, exercises only the per-lcore
+ * cache (no bin interaction after warmup).
+ */
+static int
+profile_cache_hit(void)
+{
+	uint64_t deadline;
+	uint64_t ops = 0;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		void *p = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+
+		if (p == NULL)
+			return -1;
+		rte_fastmem_free(p);
+		ops++;
+	}
+
+	printf("  cache_hit: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+/*
+ * Sub-test 2: alloc N then free N, where N exceeds the cache
+ * capacity. This forces repeated cache refills and drains,
+ * exercising the bin lock and slab free-list traversal.
+ */
+#define PROFILE_BATCH 256u
+
+static int
+profile_cache_miss(void)
+{
+	void *ptrs[PROFILE_BATCH];
+	uint64_t deadline;
+	uint64_t ops = 0;
+	unsigned int i;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		for (i = 0; i < PROFILE_BATCH; i++) {
+			ptrs[i] = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+			if (ptrs[i] == NULL)
+				return -1;
+		}
+		for (i = 0; i < PROFILE_BATCH; i++)
+			rte_fastmem_free(ptrs[i]);
+		ops += PROFILE_BATCH;
+	}
+
+	printf("  cache_miss: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_hit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-hit workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_hit() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_miss(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-miss workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_miss() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_profile_cache_hit_autotest,
+		test_fastmem_profile_cache_hit);
+REGISTER_PERF_TEST(fastmem_profile_cache_miss_autotest,
+		test_fastmem_profile_cache_miss);
-- 
2.43.0


* [RFC v3 0/3] lib/fastmem: fast small-object allocator
  2026-05-26  8:57     ` [RFC v2 3/3] app/test: add fastmem test suite Mattias Rönnblom
@ 2026-05-27 17:30       ` Mattias Rönnblom
  2026-05-27 17:30         ` [RFC v3 1/3] doc: add fastmem programming guide Mattias Rönnblom
                           ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 17:30 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom


This RFC introduces fastmem, a general-purpose small-object allocator
for DPDK. It is intended to replace per-type mempools with a single
allocator that handles arbitrary sizes, grows on demand, and matches
mempool-level performance on the hot path.

Motivation
----------

DPDK applications commonly maintain many mempools — one per object
type (connections, sessions, timers, work items). Each must be sized
up front, wastes memory when over-provisioned, and cannot serve
objects of a different size. Fastmem eliminates this by accepting
arbitrary sizes at runtime, backed by a slab allocator that
repurposes memory across size classes as demand shifts.

Design
------

Three-layer architecture:

1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
   reserved lazily (or pre-reserved for deterministic latency).

2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
   The alignment enables O(1) slab lookup from any object pointer
   via bitmask — no radix tree or index structure. Slabs move
   freely between 18 power-of-2 size classes (8 B to 1 MiB).

3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
   path). Cache misses trigger bulk transfers to/from the shared
   bin under a spinlock.
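
The bitmask lookup in item 2 can be illustrated with a minimal
sketch. It assumes only what is stated above (2 MiB slabs that are
2 MiB-aligned); SLAB_SIZE, SLAB_MASK, slab_base_of(),
slab_offset_of() and same_slab() are hypothetical names chosen for
illustration and are not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed from the design text: slabs are 2 MiB and 2 MiB-aligned. */
#define SLAB_SIZE ((uintptr_t)2 << 20)
#define SLAB_MASK (SLAB_SIZE - 1)

/* The mask trick only works for a power-of-2 slab size. */
static_assert((SLAB_SIZE & SLAB_MASK) == 0,
	"slab size must be a power of two");

/* Hypothetical helper: base address of the slab containing addr. */
static inline uintptr_t
slab_base_of(uintptr_t addr)
{
	return addr & ~SLAB_MASK;
}

/* Hypothetical helper: byte offset of addr within its slab. */
static inline size_t
slab_offset_of(uintptr_t addr)
{
	return (size_t)(addr & SLAB_MASK);
}

/* Hypothetical helper: non-zero if both addresses share a slab. */
static inline int
same_slab(uintptr_t a, uintptr_t b)
{
	return slab_base_of(a) == slab_base_of(b);
}
```

With this layout, any address from 0x40200000 up to 0x403FFFFF
maps to the slab at 0x40200000 with a single AND, which is why no
radix tree or other index is needed on the free path.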

Key properties:

- Zero per-object metadata in the production build.
- NUMA-aware, with per-socket bins and free-slab pools.
- DMA-usable memory with O(1) virt-to-IOVA translation.
- Bulk alloc/free with all-or-nothing semantics.
- Backing memory is never returned to EAL before deinit; freed
  slabs are recycled across size classes.
- Non-EAL threads supported (bypass cache, take bin lock).
- Secondary process support (lazy attach, no per-lcore caches).
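
As an illustration of the all-or-nothing bulk semantics listed
above, here is a minimal sketch built on a toy allocator. It is not
the fastmem implementation: toy_budget, toy_alloc(), toy_free() and
bulk_alloc_all_or_nothing() are hypothetical names, and malloc()
stands in for the real backing store:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy backing allocator with a fixed budget, so failure can be provoked. */
static unsigned int toy_budget;

static void *
toy_alloc(size_t size)
{
	if (toy_budget == 0)
		return NULL;
	toy_budget--;
	return malloc(size);
}

static void
toy_free(void *ptr)
{
	free(ptr);
	toy_budget++;
}

/*
 * Hypothetical helper: fill ptrs[0..n-1] or fail without side
 * effects. Returns 0 on success, -1 if any allocation failed.
 */
static int
bulk_alloc_all_or_nothing(void **ptrs, unsigned int n, size_t size)
{
	unsigned int i;

	assert(ptrs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		ptrs[i] = toy_alloc(size);
		if (ptrs[i] == NULL) {
			/* Roll back: release everything obtained so far. */
			while (i > 0)
				toy_free(ptrs[--i]);
			return -1;
		}
	}
	return 0;
}
```

The point of the rollback is that a caller never has to inspect a
partially filled array: a negative return means no object is held.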

API surface
-----------

  rte_fastmem_init / deinit
  rte_fastmem_reserve
  rte_fastmem_set_limit / get_limit
  rte_fastmem_alloc / alloc_socket
  rte_fastmem_realloc
  rte_fastmem_alloc_bulk / alloc_bulk_socket
  rte_fastmem_free / free_bulk
  rte_fastmem_hlookup / halloc / halloc_bulk / hfree / hfree_bulk
  rte_fastmem_virt2iova
  rte_fastmem_cache_flush
  rte_fastmem_max_size / classes
  rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
  rte_fastmem_stats_reset

All APIs are marked __rte_experimental.

Performance
-----------

The single-object hot path is roughly 2–3× the cost of mempool
and an order of magnitude faster than rte_malloc. Under
multi-lcore contention, fastmem scales similarly to mempool,
while rte_malloc collapses.

Limitations
-----------

- Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
- Power-of-2 classes only; worst-case internal fragmentation ~50%.
- Backing memory not reclaimable short of deinit.
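
The ~50% figure follows from rounding a request up to the next
power of two. A minimal sketch, assuming only the class range given
in this cover letter (8 B to 1 MiB, 18 classes) and ignoring
alignment; class_size_for() and class_waste_for() are hypothetical
helpers, not functions from the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed from the cover letter: power-of-2 classes, 8 B to 1 MiB. */
#define CLASS_MIN ((size_t)8)
#define CLASS_MAX ((size_t)1 << 20)

/*
 * Hypothetical helper: smallest power-of-2 class that fits size,
 * or 0 if size is 0 or exceeds the largest class.
 */
static inline size_t
class_size_for(size_t size)
{
	size_t c = CLASS_MIN;

	if (size == 0 || size > CLASS_MAX)
		return 0;
	while (c < size)
		c <<= 1;
	return c;
}

/* Hypothetical helper: bytes lost to rounding for a valid request. */
static inline size_t
class_waste_for(size_t size)
{
	size_t c = class_size_for(size);

	assert(c != 0);
	return c - size;
}
```

A 129-byte request lands in the 256-byte class and wastes 127
bytes, just under half of the slot; a 128-byte request wastes
nothing. Requests one byte above a class boundary are the worst
case.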

Future work
-----------

- Lcore-affine allocations (false-sharing-free by construction).
- Mempool ops driver for transparent drop-in use.
- Debug mode (cookies, double-free detection, poison-on-free).
- Telemetry integration.
- EAL integration, allowing EAL-internal subsystems to use
  fastmem for their small-object allocations.

Changes in RFC v3:
- Add rte_fastmem_realloc() with full test coverage.
- Add __rte_malloc/__rte_dealloc compiler attributes; remove
  incorrect __rte_alloc_size/__rte_alloc_align.
- Extract normalize_align() helper; remove redundant inline
  directives.
- Merge lifecycle and functional test suites.
- Add realloc subsection to programming guide.

Changes in RFC v2:
- Fix cross-socket deinit use-after-free.
- Add secondary process support.
- Add handle-based allocation API.
- Fix clang warnings; misc cleanup.


Mattias Rönnblom (3):
  doc: add fastmem programming guide
  lib: add fastmem library
  app/test: add fastmem test suite

 app/test/meson.build                  |    3 +
 app/test/test_fastmem.c               | 1801 +++++++++++++++++++++++++
 app/test/test_fastmem_perf.c          | 1040 ++++++++++++++
 app/test/test_fastmem_profile.c       |  157 +++
 doc/api/doxy-api-index.md             |    1 +
 doc/api/doxy-api.conf.in              |    1 +
 doc/guides/prog_guide/fastmem_lib.rst |  328 +++++
 doc/guides/prog_guide/index.rst       |    1 +
 lib/fastmem/meson.build               |    6 +
 lib/fastmem/rte_fastmem.c             | 1748 ++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h             |  815 +++++++++++
 lib/meson.build                       |    1 +
 12 files changed, 5902 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h

-- 
2.43.0


* [RFC v3 1/3] doc: add fastmem programming guide
  2026-05-27 17:30       ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
@ 2026-05-27 17:30         ` Mattias Rönnblom
  2026-05-30  9:26           ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-27 17:30         ` [RFC v3 2/3] lib: add fastmem library Mattias Rönnblom
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 17:30 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

Add a programming guide for the fastmem library covering usage,
API overview, design, and implementation details.

--

RFC v3:
 * Add realloc subsection to Allocation and free section.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 doc/guides/prog_guide/fastmem_lib.rst | 328 ++++++++++++++++++++++++++
 doc/guides/prog_guide/index.rst       |   1 +
 2 files changed, 329 insertions(+)
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst

diff --git a/doc/guides/prog_guide/fastmem_lib.rst b/doc/guides/prog_guide/fastmem_lib.rst
new file mode 100644
index 0000000000..cbc3bcf191
--- /dev/null
+++ b/doc/guides/prog_guide/fastmem_lib.rst
@@ -0,0 +1,328 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2026 Ericsson AB
+
+Fastmem Library
+===============
+
+The fastmem library is a fast, general-purpose small-object
+allocator for DPDK applications. It lets an application replace
+its many per-type mempools — each sized for a single object type
+— with a single allocator that handles arbitrary object sizes,
+grows on demand, and offers mempool-level performance for the
+common allocation and free paths.
+
+Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+supports bulk operations, and uses per-lcore caches to reduce
+shared-state contention. Unlike mempool, it does not require the
+caller to declare object sizes or counts up front.
+
+
+When to use fastmem
+-------------------
+
+Use fastmem when:
+
+* Small objects (up to 1 MiB) are allocated and freed on the
+  data path with low, predictable latency requirements.
+
+* Many object types of varying sizes exist and maintaining a
+  separate mempool for each is impractical.
+
+* DMA-usable memory with efficient virtual-to-IOVA translation
+  is needed.
+
+Do not use fastmem for allocations larger than 1 MiB. Use
+``rte_malloc()`` instead.
+
+
+Initialization and teardown
+----------------------------
+
+.. code-block:: c
+
+   /* At startup, after rte_eal_init(). */
+   rte_fastmem_init();
+
+   /* Optional: pre-reserve backing memory to avoid latency
+    * spikes from on-demand memzone reservation. */
+   rte_fastmem_reserve(64 * 1024 * 1024, SOCKET_ID_ANY);
+
+   /* ... application runs ... */
+
+   /* At shutdown, after all allocations have been freed. */
+   rte_fastmem_deinit();
+
+Neither ``rte_fastmem_init()`` nor ``rte_fastmem_deinit()`` is
+thread-safe; call them from the main lcore during startup and
+shutdown.
+
+
+Allocation and free
+-------------------
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(128, 0, 0);
+   /* Use obj... */
+   rte_fastmem_free(obj);
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's NUMA
+socket. Use ``rte_fastmem_alloc_socket()`` to target a specific
+socket or to enable cross-socket fallback with ``SOCKET_ID_ANY``.
+
+Realloc
+~~~~~~~
+
+.. code-block:: c
+
+   obj = rte_fastmem_realloc(obj, 256, 0);
+
+``rte_fastmem_realloc()`` resizes an allocation, preserving its
+contents. If the existing allocation already satisfies the new
+size, the original pointer may be returned unchanged. Otherwise a
+new allocation is made, contents are copied, and the old
+allocation is freed. On failure, the original allocation remains
+valid.
+
+Alignment
+~~~~~~~~~
+
+When ``align`` is 0, the returned pointer is aligned to at least
+``RTE_CACHE_LINE_SIZE``. A non-zero ``align`` must be a power of
+two. Specifying an alignment smaller than ``RTE_CACHE_LINE_SIZE``
+is permitted but the returned object may then share a cache line
+with an adjacent allocation, risking false sharing.
+
+Zeroing
+~~~~~~~
+
+Pass ``RTE_FASTMEM_F_ZERO`` to receive zero-initialized memory:
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(256, 0, RTE_FASTMEM_F_ZERO);
+
+
+Bulk allocation and free
+-------------------------
+
+.. code-block:: c
+
+   void *ptrs[32];
+
+   if (rte_fastmem_alloc_bulk(ptrs, 32, 64, 0, 0) < 0)
+       /* handle error */;
+
+   /* Use objects... */
+
+   rte_fastmem_free_bulk(ptrs, 32);
+
+Bulk allocation has all-or-nothing semantics: either all
+requested objects are returned, or none are (and ``rte_errno``
+is set to ``ENOMEM``).
+
+Bulk free is most efficient when all objects belong to the same
+size class; in that case the objects are pushed into the
+per-lcore cache in a single operation.
+
+
+IOVA translation
+----------------
+
+Memory returned by fastmem is DMA-usable. To obtain the IOVA
+for use in device descriptors:
+
+.. code-block:: c
+
+   rte_iova_t iova = rte_fastmem_virt2iova(obj);
+
+The translation is O(1). The returned IOVA is valid for the
+lifetime of the allocation.
+
+
+NUMA awareness
+--------------
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's socket.
+``rte_fastmem_alloc_socket()`` accepts an explicit socket ID or
+``SOCKET_ID_ANY``:
+
+* Explicit socket: allocate only from that socket; fail with
+  ``ENOMEM`` if exhausted.
+
+* ``SOCKET_ID_ANY``: try the caller's local socket first, then
+  fall back to other sockets.
+
+
+Per-lcore caches
+----------------
+
+Each EAL thread has a private cache per size class. The common
+allocation and free paths operate entirely within this cache,
+avoiding locks. Cache misses (empty on alloc, full on free)
+trigger a bulk transfer to/from the shared bin under a lock.
+
+Non-EAL threads bypass the cache and take the bin lock on every
+operation.
+
+``rte_fastmem_cache_flush()`` drains the calling lcore's caches
+back to the shared bins. This is useful after bursty phases to
+release idle cached memory.
+
+
+Threading
+---------
+
+All allocation and free functions are thread-safe and may be
+called from any thread. An allocation made on one thread may be
+freed on any other.
+
+Fastmem uses internal spinlocks. A thread preempted while
+holding one delays other threads contending for the same lock
+(correctness is not affected, only latency).
+
+
+Pre-reserving memory
+--------------------
+
+By default, fastmem reserves backing memory lazily on first
+allocation. ``rte_fastmem_reserve(size, socket_id)`` forces
+reservation up front, ensuring subsequent allocations do not
+incur memzone-reservation latency:
+
+.. code-block:: c
+
+   /* Reserve 128 MiB on socket 0. */
+   rte_fastmem_reserve(128 * 1024 * 1024, 0);
+
+Once reserved, backing memory is never returned to the system
+during the allocator's lifetime.
+
+Memory limits
+~~~~~~~~~~~~~
+
+``rte_fastmem_set_limit(socket_id, max_bytes)`` caps how much
+backing memory may be reserved on a given socket. Once the limit is
+reached, allocations that would require new backing memory fail with
+``ENOMEM``. The default is ``SIZE_MAX`` (unlimited).
+``rte_fastmem_get_limit()`` returns the current limit for a socket.
+
+.. code-block:: c
+
+   /* Allow at most 256 MiB on socket 0. */
+   rte_fastmem_set_limit(0, 256 * 1024 * 1024);
+
+   /* Block all growth on socket 1. */
+   rte_fastmem_set_limit(1, 0);
+
+Pass ``SOCKET_ID_ANY`` to apply the same limit to all sockets.
+
+
+Size classes
+------------
+
+Fastmem uses power-of-two size classes from 8 bytes to 1 MiB
+(18 classes). A request for N bytes is served from the smallest
+class >= N. The maximum supported size is queryable via
+``rte_fastmem_max_size()``.
+
+With power-of-two classes, worst-case internal fragmentation is
+just under 50% (e.g., a 33-byte request occupies a 64-byte
+slot). Assuming a uniform distribution of request sizes, the
+average waste is 25%. In practice, DPDK workloads tend to
+cluster at or near powers of two, so typical waste is lower.
+
+Requests exceeding the maximum are rejected with ``E2BIG``.
+
+
+Implementation
+--------------
+
+Fastmem organizes memory in three layers: backing memzones, slabs,
+and per-lcore caches.
+
+Backing memory and slabs
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Backing memory is obtained from EAL as 128 MiB IOVA-contiguous
+memzones, each aligned to 2 MiB. A memzone is partitioned into
+64 fixed-size, 2 MiB **slabs**. Slabs are the unit of memory
+that moves between size classes: a free slab can be assigned to
+any bin on demand, and an empty slab (all objects freed) returns
+to the free-slab pool for reuse by another size class.
+
+The 2 MiB slab alignment is the key structural property. Given
+any object pointer, the allocator recovers the owning slab by
+masking off the low 21 bits — no radix tree, hash table, or
+memzone lookup is needed. This makes the free path fast: a
+single pointer-mask load reaches the slab header, which
+identifies the size class and bin.
+
+Each slab reserves 64 bytes at offset 0 for its header. The
+remaining space is divided into fixed-size slots equal to the
+size class. Allocated objects carry no per-object metadata; the
+full slot is available to the caller.
+
+Three-level allocation hierarchy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. **Per-lcore cache** — a bounded LIFO stack of free object
+   pointers, one per (lcore, size class, socket). Allocation
+   pops; free pushes. No lock is needed because only the owning
+   lcore accesses its cache.
+
+2. **Bin** — one per (size class, socket). Owns the partial and
+   full slab lists. A spinlock serializes bulk transfers between
+   the bin and per-lcore caches. Most traffic is absorbed by the
+   caches, so bin-lock contention is low.
+
+3. **Free-slab pool** — one per socket. A spinlock protects slab
+   acquisition and release. These events are rare relative to
+   object-level operations (a single small-object slab serves
+   thousands of allocations).
+
+On a cache miss (empty on alloc, full on free), the cache
+exchanges objects with the bin in bulk, targeting half-full to
+maximize headroom in both directions.
+
+Cache sizing
+~~~~~~~~~~~~
+
+Cache capacity varies by size class to bound per-lcore memory
+footprint:
+
+* Classes 8 B through 4 KiB: capacity 64.
+* Larger classes: capacity halves per class (32, 16, 8, 4),
+  flooring at 4.
+
+Even the largest classes remain cached. The capacity curve
+ensures that small, frequent allocations get the highest cache
+hit rate, while large allocations still avoid the bin lock on
+most operations.
+
+
+Statistics
+----------
+
+Fastmem maintains always-on, per-lcore counters that track
+allocation and free activity. Statistics are queryable at four
+levels of granularity: global summary, per size class, per lcore,
+and per lcore per class.
+
+``rte_fastmem_classes()`` returns the number of size classes and
+optionally fills an array with their sizes.
+
+See ``rte_fastmem.h`` for the full statistics API.
+
+
+Secondary processes
+-------------------
+
+Fastmem works transparently in DPDK secondary processes. The shared
+state is discovered automatically on first allocation.
+
+Secondary processes do not use per-lcore caches; every allocation and
+free acquires the bin spinlock directly. This is acceptable for
+control-plane secondaries with low allocation rates. The primary
+process should pre-reserve sufficient backing memory with
+``rte_fastmem_reserve()`` since secondaries cannot grow the pool.
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index e6f24945b0..c85196c85e 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -28,6 +28,7 @@ Memory Management
     mempool_lib
     mbuf_lib
     multi_proc_support
+    fastmem_lib
 
 
 CPU Management
-- 
2.43.0



* [RFC v4 0/3] lib/fastmem: fast small-object allocator
  2026-05-27 17:30         ` [RFC v3 1/3] doc: add fastmem programming guide Mattias Rönnblom
@ 2026-05-30  9:26           ` Mattias Rönnblom
  2026-05-30  9:26             ` [RFC v4 1/3] doc: add fastmem programming guide Mattias Rönnblom
                               ` (3 more replies)
  0 siblings, 4 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-30  9:26 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

This RFC introduces fastmem, a general-purpose small-object allocator
for DPDK. It is intended to replace per-type mempools with a single
allocator that handles arbitrary sizes, grows on demand, and comes
close to mempool performance on the hot path (see Performance below).

Motivation
----------

DPDK applications commonly maintain many mempools — one per object
type (connections, sessions, timers, work items). Each must be sized
up front, wastes memory when over-provisioned, and cannot serve
objects of a different size. Fastmem eliminates this by accepting
arbitrary sizes at runtime, backed by a slab allocator that
repurposes memory across size classes as demand shifts.

Design
------

Three-layer architecture:

1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
   reserved lazily (or pre-reserved for deterministic latency).

2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
   The alignment enables O(1) slab lookup from any object pointer
   via bitmask — no radix tree or index structure. Slabs move
   freely between 18 power-of-2 size classes (8 B to 1 MiB).

3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
   path). Cache misses trigger bulk transfers to/from the shared
   bin under a spinlock.
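
As an illustration of the mask-based lookup in item 2, here is a minimal
sketch. It is not taken from the patch: struct slab_hdr, slab_new(),
slab_of() and the SLAB_* macros are hypothetical names, aligned_alloc()
stands in for carving a slab out of a memzone, and the real header
layout may differ. The only property it relies on is that every slab is
2 MiB in size and 2 MiB-aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE ((size_t)2 << 20)              /* 2 MiB */
#define SLAB_MASK (~((uintptr_t)SLAB_SIZE - 1))  /* clears the low 21 bits */

/* Hypothetical slab header, stored at offset 0 of each slab. */
struct slab_hdr {
	uint32_t class_size; /* object size this slab currently serves */
};

/* Stand-in for carving a slab from a memzone: one aligned 2 MiB block. */
static struct slab_hdr *
slab_new(uint32_t class_size)
{
	struct slab_hdr *slab;

	/* The mask trick only works for power-of-two slab sizes. */
	assert((SLAB_SIZE & (SLAB_SIZE - 1)) == 0);

	slab = aligned_alloc(SLAB_SIZE, SLAB_SIZE);
	if (slab != NULL)
		slab->class_size = class_size;
	return slab;
}

/* Recover the owning slab from any object pointer inside it. */
static struct slab_hdr *
slab_of(const void *obj)
{
	return (struct slab_hdr *)((uintptr_t)obj & SLAB_MASK);
}
```

With slab = slab_new(128), any interior address such as
(char *)slab + 64 + 5 * 128 maps back to slab through slab_of(), so a
free path can reach the size class with one mask and one load.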

Key properties:

- Zero per-object metadata in the production build.
- NUMA-aware, with per-socket bins and free-slab pools.
- DMA-usable memory with O(1) virt-to-IOVA translation.
- Bulk alloc/free with all-or-nothing semantics.
- Backing memory never returned during lifetime (slabs recycled).
- Non-EAL threads supported (no private cache; use the shared cache).
- Secondary process support (lazy attach; shared cache instead of
  per-lcore caches).
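
The cache behaviour from Design item 3 can be modelled in a few lines.
This is a toy sketch under stated assumptions, not the patch's code:
struct bin, struct cache, cache_alloc() and cache_free() are
hypothetical, the bin is reduced to a plain pointer stack, and the
spinlock is represented only by a counter. The capacity of 64 and the
refill/drain target of half capacity follow the programming guide.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_CAP 64    /* guide: capacity for classes up to 4 KiB */
#define BIN_CAP   1024  /* arbitrary size of the toy bin */

/* Toy stand-in for the shared bin: a stack of free object pointers. */
struct bin {
	void *objs[BIN_CAP];
	size_t count;
	unsigned int lock_ops; /* times the (imaginary) bin lock was taken */
};

/* Bounded LIFO stack private to one lcore. */
struct cache {
	void *objs[CACHE_CAP];
	size_t count;
};

/* Pop from the cache; on a miss, refill to half capacity from the bin. */
static void *
cache_alloc(struct cache *c, struct bin *b)
{
	if (c->count == 0) {
		b->lock_ops++; /* real code would hold the bin spinlock */
		while (c->count < CACHE_CAP / 2 && b->count > 0)
			c->objs[c->count++] = b->objs[--b->count];
		if (c->count == 0)
			return NULL; /* bin exhausted */
	}
	return c->objs[--c->count];
}

/* Push to the cache; if it is full, first drain it to half capacity. */
static void
cache_free(struct cache *c, struct bin *b, void *obj)
{
	if (c->count == CACHE_CAP) {
		b->lock_ops++;
		while (c->count > CACHE_CAP / 2) {
			assert(b->count < BIN_CAP);
			b->objs[b->count++] = c->objs[--c->count];
		}
	}
	c->objs[c->count++] = obj;
}
```

In this model, one miss moves 32 objects under a single lock
acquisition, so the following 31 allocations (or frees) touch only the
private stack.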

API surface
-----------

  rte_fastmem_init / deinit
  rte_fastmem_reserve
  rte_fastmem_set_limit / get_limit
  rte_fastmem_alloc / alloc_socket
  rte_fastmem_realloc
  rte_fastmem_alloc_bulk / alloc_bulk_socket
  rte_fastmem_free / free_bulk
  rte_fastmem_hlookup / halloc / halloc_bulk / hfree / hfree_bulk
  rte_fastmem_virt2iova
  rte_fastmem_cache_flush
  rte_fastmem_max_size / classes
  rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
  rte_fastmem_stats_reset

All APIs are marked __rte_experimental.

Performance
-----------

The single-object hot path is roughly 2–3× the cost of mempool
and an order of magnitude faster than rte_malloc. Under
multi-lcore contention, fastmem scales similarly to mempool,
while rte_malloc collapses.

Limitations
-----------

- Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
- Power-of-2 classes only; worst-case internal fragmentation ~50%.
- Backing memory not reclaimable short of deinit.
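
The first two limitations follow directly from the class layout. A
minimal sketch of the rounding rule, assuming only what is stated above
(power-of-2 classes from 8 B to 1 MiB); class_size_for() and
class_count() are hypothetical helpers, not functions from the patch:

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CLASS ((size_t)8)        /* smallest class: 8 B */
#define MAX_CLASS ((size_t)1 << 20)  /* largest class: 1 MiB */

/* Smallest power-of-two class >= n, or 0 if n exceeds the maximum. */
static size_t
class_size_for(size_t n)
{
	size_t c = MIN_CLASS;

	if (n > MAX_CLASS)
		return 0;
	while (c < n)
		c <<= 1;
	return c;
}

/* Number of classes from MIN_CLASS to MAX_CLASS inclusive. */
static unsigned int
class_count(void)
{
	unsigned int count = 0;

	for (size_t c = MIN_CLASS; c <= MAX_CLASS; c <<= 1)
		count++;
	return count;
}
```

Here class_size_for(33) is 64, so 31 of the 64 bytes (about 48%) are
unused, which is the near-50% worst case. class_count() is 18, and any
request above 1 MiB yields 0, the case the allocator rejects.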

Future work
-----------

- Lcore-affine allocations (false-sharing-free by construction).
- Mempool ops driver for transparent drop-in use.
- Debug mode (cookies, double-free detection, poison-on-free).
- Telemetry integration.
- EAL integration, allowing EAL-internal subsystems to use
  fastmem for their small-object allocations.

Changes in RFC v4:
- Fix crash in halloc/hfree on lcores without hlookup: fall back
  to shared bin on NULL cache.
- Keep per-lcore statistics across rte_fastmem_cache_flush().
- Guard free and IOVA paths against uninitialized state.
- Lazy-attach stats readers in secondary processes; distinguish
  -ENODEV from -EINVAL.
- Protect bin statistics with the bin lock.
- Trim verbose comments.
- Add shared cache for callers without a private cache (non-EAL
  threads, secondary processes). Add rte_fastmem_stats_shared()
  and rte_fastmem_stats_shared_class().
- Document rte_fastmem_stats_reset() quiescence requirement.
- Add tests for handle alloc/free from uncached lcores, stats
  survival across flush, and shared-cache statistics.
- Update programming guide (shared cache, stats sections).

Changes in RFC v3:
- Add rte_fastmem_realloc() with full test coverage.
- Add __rte_malloc/__rte_dealloc compiler attributes; remove
  incorrect __rte_alloc_size/__rte_alloc_align.
- Extract normalize_align() helper; remove redundant inline
  directives.
- Merge lifecycle and functional test suites.
- Add realloc subsection to programming guide.

Changes in RFC v2:
- Fix cross-socket deinit use-after-free.
- Add secondary process support.
- Add handle-based allocation API.
- Fix clang warnings; misc cleanup.

Mattias Rönnblom (3):
  doc: add fastmem programming guide
  lib: add fastmem library
  app/test: add fastmem test suite

 app/test/meson.build                  |    3 +
 app/test/test_fastmem.c               | 2111 ++++++++++++++++++++++++
 app/test/test_fastmem_perf.c          | 1040 ++++++++++++
 app/test/test_fastmem_profile.c       |  157 ++
 doc/api/doxy-api-index.md             |    1 +
 doc/api/doxy-api.conf.in              |    1 +
 doc/guides/prog_guide/fastmem_lib.rst |  351 ++++
 doc/guides/prog_guide/index.rst       |    1 +
 lib/fastmem/meson.build               |    6 +
 lib/fastmem/rfc-cover-letter.txt      |  128 ++
 lib/fastmem/rte_fastmem.c             | 2123 +++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h             |  908 +++++++++++
 lib/meson.build                       |    1 +
 13 files changed, 6831 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rfc-cover-letter.txt
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h

-- 
2.43.0



* [RFC v4 1/3] doc: add fastmem programming guide
  2026-05-30  9:26           ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
@ 2026-05-30  9:26             ` Mattias Rönnblom
  2026-05-30  9:26             ` [RFC v4 2/3] lib: add fastmem library Mattias Rönnblom
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-30  9:26 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

Add a programming guide for the fastmem library covering usage,
API overview, design, and implementation details.

--

RFC v4:
 * Document per-lcore statistics surviving cache flush and
   bin-direct counters for non-cached traffic.
 * Document shared cache for callers without a private cache
   (non-EAL threads, secondary processes).

RFC v3:
 * Add realloc subsection to Allocation and free section.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 doc/guides/prog_guide/fastmem_lib.rst | 351 ++++++++++++++++++++++++++
 doc/guides/prog_guide/index.rst       |   1 +
 2 files changed, 352 insertions(+)
 create mode 100644 doc/guides/prog_guide/fastmem_lib.rst

diff --git a/doc/guides/prog_guide/fastmem_lib.rst b/doc/guides/prog_guide/fastmem_lib.rst
new file mode 100644
index 0000000000..4d7d69770c
--- /dev/null
+++ b/doc/guides/prog_guide/fastmem_lib.rst
@@ -0,0 +1,351 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2026 Ericsson AB
+
+Fastmem Library
+===============
+
+The fastmem library is a fast, general-purpose small-object
+allocator for DPDK applications. It lets an application replace
+its many per-type mempools — each sized for a single object type
+— with a single allocator that handles arbitrary object sizes,
+grows on demand, and offers mempool-level performance for the
+common allocation and free paths.
+
+Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+supports bulk operations, and uses per-lcore caches to reduce
+shared-state contention. Unlike mempool, it does not require the
+caller to declare object sizes or counts up front.
+
+
+When to use fastmem
+-------------------
+
+Use fastmem when:
+
+* Small objects (up to 1 MiB) are allocated and freed on the
+  data path with low, predictable latency requirements.
+
+* Many object types of varying sizes exist and maintaining a
+  separate mempool for each is impractical.
+
+* DMA-usable memory with efficient virtual-to-IOVA translation
+  is needed.
+
+Do not use fastmem for allocations larger than 1 MiB. Use
+``rte_malloc()`` instead.
+
+
+Initialization and teardown
+----------------------------
+
+.. code-block:: c
+
+   /* At startup, after rte_eal_init(). */
+   rte_fastmem_init();
+
+   /* Optional: pre-reserve backing memory to avoid latency
+    * spikes from on-demand memzone reservation. */
+   rte_fastmem_reserve(64 * 1024 * 1024, SOCKET_ID_ANY);
+
+   /* ... application runs ... */
+
+   /* At shutdown, after all allocations have been freed. */
+   rte_fastmem_deinit();
+
+Neither ``rte_fastmem_init()`` nor ``rte_fastmem_deinit()`` is
+thread-safe; call them from the main lcore during startup and
+shutdown.
+
+
+Allocation and free
+-------------------
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(128, 0, 0);
+   /* Use obj... */
+   rte_fastmem_free(obj);
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's NUMA
+socket. Use ``rte_fastmem_alloc_socket()`` to target a specific
+socket or to enable cross-socket fallback with ``SOCKET_ID_ANY``.
+
+Realloc
+~~~~~~~
+
+.. code-block:: c
+
+   obj = rte_fastmem_realloc(obj, 256, 0);
+
+``rte_fastmem_realloc()`` resizes an allocation, preserving its
+contents. If the existing allocation already satisfies the new
+size, the original pointer may be returned unchanged. Otherwise a
+new allocation is made, contents are copied, and the old
+allocation is freed. On failure, the original allocation remains
+valid.
+
+Alignment
+~~~~~~~~~
+
+When ``align`` is 0, the returned pointer is aligned to at least
+``RTE_CACHE_LINE_SIZE``. A non-zero ``align`` must be a power of
+two. Specifying an alignment smaller than ``RTE_CACHE_LINE_SIZE``
+is permitted but the returned object may then share a cache line
+with an adjacent allocation, risking false sharing.
+
+Zeroing
+~~~~~~~
+
+Pass ``RTE_FASTMEM_F_ZERO`` to receive zero-initialized memory:
+
+.. code-block:: c
+
+   void *obj = rte_fastmem_alloc(256, 0, RTE_FASTMEM_F_ZERO);
+
+
+Bulk allocation and free
+-------------------------
+
+.. code-block:: c
+
+   void *ptrs[32];
+
+   if (rte_fastmem_alloc_bulk(ptrs, 32, 64, 0, 0) < 0)
+       /* handle error */;
+
+   /* Use objects... */
+
+   rte_fastmem_free_bulk(ptrs, 32);
+
+Bulk allocation has all-or-nothing semantics: either all
+requested objects are returned, or none are (and ``rte_errno``
+is set to ``ENOMEM``).
+
+Bulk free is most efficient when all objects belong to the same
+size class; in that case the objects are pushed into the
+caller's cache in a single operation.
+
+
+IOVA translation
+----------------
+
+Memory returned by fastmem is DMA-usable. To obtain the IOVA
+for use in device descriptors:
+
+.. code-block:: c
+
+   rte_iova_t iova = rte_fastmem_virt2iova(obj);
+
+The translation is O(1). The returned IOVA is valid for the
+lifetime of the allocation.
+
+
+NUMA awareness
+--------------
+
+``rte_fastmem_alloc()`` allocates on the calling lcore's socket.
+``rte_fastmem_alloc_socket()`` accepts an explicit socket ID or
+``SOCKET_ID_ANY``:
+
+* Explicit socket: allocate only from that socket; fail with
+  ``ENOMEM`` if exhausted.
+
+* ``SOCKET_ID_ANY``: try the caller's local socket first, then
+  fall back to other sockets.
+
+
+Caches
+------
+
+Only threads with an lcore ID running in the **primary** process
+get a private cache per size class. The common allocation and free
+paths operate entirely within this private cache, avoiding locks.
+Cache misses (empty on alloc, full on free) trigger a bulk transfer
+to/from the shared bin under the bin lock.
+
+All other callers, that is, unregistered non-EAL threads (which have
+no lcore ID) and all threads in a secondary process (which never use
+private caches), share one **shared cache** per (size class,
+socket), protected by a per-socket spinlock. These callers still
+benefit from caching, but every operation takes the shared lock, so
+each operation costs more than on a private-cache thread.
+
+``rte_fastmem_cache_flush()`` drains the calling lcore's private
+caches back to the shared bins. This is useful after bursty phases
+to release idle cached memory. It has no effect on a thread that
+has no private cache.
+
+
+Threading
+---------
+
+All allocation and free functions are thread-safe and may be
+called from any thread. An allocation made on one thread may be
+freed on any other.
+
+Fastmem uses internal spinlocks. A thread preempted while
+holding one delays other threads contending for the same lock
+(correctness is not affected, only latency).
+
+
+Pre-reserving memory
+--------------------
+
+By default, fastmem reserves backing memory lazily on first
+allocation. ``rte_fastmem_reserve(size, socket_id)`` forces
+reservation up front, ensuring subsequent allocations do not
+incur memzone-reservation latency:
+
+.. code-block:: c
+
+   /* Reserve 128 MiB on socket 0. */
+   rte_fastmem_reserve(128 * 1024 * 1024, 0);
+
+Once reserved, backing memory is never returned to the system
+during the allocator's lifetime.
+
+Memory limits
+~~~~~~~~~~~~~
+
+``rte_fastmem_set_limit(socket_id, max_bytes)`` caps how much
+backing memory may be reserved on a given socket. Once the limit is
+reached, allocations that would require new backing memory fail with
+``ENOMEM``. The default is ``SIZE_MAX`` (unlimited).
+``rte_fastmem_get_limit()`` returns the current limit for a socket.
+
+.. code-block:: c
+
+   /* Allow at most 256 MiB on socket 0. */
+   rte_fastmem_set_limit(0, 256 * 1024 * 1024);
+
+   /* Block all growth on socket 1. */
+   rte_fastmem_set_limit(1, 0);
+
+Pass ``SOCKET_ID_ANY`` to apply the same limit to all sockets.
+
+
+Size classes
+------------
+
+Fastmem uses power-of-two size classes from 8 bytes to 1 MiB
+(18 classes). A request for N bytes is served from the smallest
+class >= N. The maximum supported size is queryable via
+``rte_fastmem_max_size()``.
+
+With power-of-two classes, worst-case internal fragmentation is
+just under 50% (e.g., a 33-byte request occupies a 64-byte
+slot). Assuming a uniform distribution of request sizes, the
+average waste is 25%. In practice, DPDK workloads tend to
+cluster at or near powers of two, so typical waste is lower.
+
+Requests exceeding the maximum are rejected with ``E2BIG``.
+
+
+Implementation
+--------------
+
+Fastmem organizes memory in three layers: backing memzones, slabs,
+and caches.
+
+Backing memory and slabs
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Backing memory is obtained from EAL as 128 MiB IOVA-contiguous
+memzones, each aligned to 2 MiB. A memzone is partitioned into
+64 fixed-size, 2 MiB **slabs**. Slabs are the unit of memory
+that moves between size classes: a free slab can be assigned to
+any bin on demand, and an empty slab (all objects freed) returns
+to the free-slab pool for reuse by another size class.
+
+The 2 MiB slab alignment is the key structural property. Given
+any object pointer, the allocator recovers the owning slab by
+masking off the low 21 bits — no radix tree, hash table, or
+memzone lookup is needed. This makes the free path fast: a
+single pointer-mask load reaches the slab header, which
+identifies the size class and bin.
+
+Each slab reserves 64 bytes at offset 0 for its header. The
+remaining space is divided into fixed-size slots equal to the
+size class. Allocated objects carry no per-object metadata; the
+full slot is available to the caller.
+
+Three-level allocation hierarchy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. **Cache** — a bounded LIFO stack of free object pointers.
+   Allocation pops; free pushes. Each primary-process thread with
+   an lcore ID gets a private cache per (lcore, size class, socket),
+   which needs no lock because only the owning lcore touches it. All
+   other callers share one cache per (size class, socket), guarded
+   by a per-socket spinlock.
+
+2. **Bin** — one per (size class, socket). Owns the partial and
+   full slab lists. A spinlock serializes bulk transfers between
+   the bin and the caches. Most traffic is absorbed by the
+   caches, so bin-lock contention is low.
+
+3. **Free-slab pool** — one per socket. A spinlock protects slab
+   acquisition and release. These events are rare relative to
+   object-level operations (a single small-object slab serves
+   thousands of allocations).
+
+On a cache miss (empty on alloc, full on free), the cache
+exchanges objects with the bin in bulk, targeting half-full to
+maximize headroom in both directions.
+
+Cache sizing
+~~~~~~~~~~~~
+
+Cache capacity varies by size class to bound per-cache memory
+footprint:
+
+* Classes 8 B through 4 KiB: capacity 64.
+* Larger classes: capacity halves per class (32, 16, 8, 4),
+  flooring at 4.
+
+Even the largest classes remain cached. The capacity curve
+ensures that small, frequent allocations get the highest cache
+hit rate, while large allocations still avoid the bin lock on
+most operations. The shared cache uses the same capacities.
+
+
+Statistics
+----------
+
+Fastmem maintains always-on counters that track allocation and
+free activity. Statistics are queryable at several levels of
+granularity: global summary, per size class, per lcore, per lcore
+per class, and for the shared cache (with
+``rte_fastmem_stats_shared()`` and
+``rte_fastmem_stats_shared_class()``).
+
+Counters are stored independently of the caches, so they survive
+``rte_fastmem_cache_flush()`` and persist until an explicit
+``rte_fastmem_stats_reset()``.
+
+Allocations and frees made without a private per-lcore cache — by
+lcore-less threads and by all threads in a secondary process — go
+through the shared cache. They cannot be attributed to an lcore, so
+they do not appear in the per-lcore or per-lcore-per-class views,
+but they are counted in the global and per-class statistics and
+reported by the shared-cache statistics functions.
+
+``rte_fastmem_classes()`` returns the number of size classes and
+optionally fills an array with their sizes.
+
+See ``rte_fastmem.h`` for the full statistics API.
+
+
+Secondary processes
+-------------------
+
+Fastmem works transparently in DPDK secondary processes. The shared
+state is discovered automatically on first allocation.
+
+Secondary processes do not use private per-lcore caches, even for
+threads that have an lcore ID; all of their traffic goes through
+the shared cache (the same one used by lcore-less primary threads).
+This is acceptable for control-plane secondaries with low allocation
+rates. The primary process should pre-reserve sufficient backing
+memory with ``rte_fastmem_reserve()`` since secondaries cannot grow
+the pool.
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index e6f24945b0..c85196c85e 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -28,6 +28,7 @@ Memory Management
     mempool_lib
     mbuf_lib
     multi_proc_support
+    fastmem_lib
 
 
 CPU Management
-- 
2.43.0



* [RFC v4 2/3] lib: add fastmem library
  2026-05-30  9:26           ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-30  9:26             ` [RFC v4 1/3] doc: add fastmem programming guide Mattias Rönnblom
@ 2026-05-30  9:26             ` Mattias Rönnblom
  2026-05-30  9:26             ` [RFC v4 3/3] app/test: add fastmem test suite Mattias Rönnblom
  2026-06-10 12:35             ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Konstantin Ananyev
  3 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-30  9:26 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

Introduce fastmem, a fast general-purpose small-object allocator
for DPDK applications. It allows an application to replace its
many per-type mempools with a single allocator that handles
arbitrary sizes, grows on demand, and offers mempool-level
performance on the hot path.

Applications that manage many object types (connections, sessions,
work items, timers) currently maintain a separate mempool for each,
requiring upfront sizing and wasting memory on over-provisioned
pools. Fastmem removes both constraints.

Key properties:

 * Huge-page-backed, NUMA-aware, DMA-usable.
 * Per-lcore caches for lock-free alloc/free on EAL threads.
 * Bulk alloc and free APIs.
 * Power-of-two size classes from 8 B to 1 MiB.
 * Backing memory grows lazily; rte_fastmem_reserve() allows
   upfront reservation to avoid latency spikes.
 * Always-on per-lcore and per-class statistics.

Bounded to small objects; requests above rte_fastmem_max_size()
are rejected. Replacing rte_malloc is currently not a goal.

--

RFC v4:
 * Fix crash in halloc/hfree on lcores without hlookup: fall
   back to shared bin on NULL cache.
 * Keep per-lcore statistics across rte_fastmem_cache_flush():
   retain the cache struct so counters survive.
 * Guard free and IOVA paths against uninitialized state.
 * Lazy-attach stats readers in secondary processes; distinguish
   -ENODEV from -EINVAL.
 * Add likely() hint to cache-present branch in
   account_alloc_nomem().
 * Protect bin statistics with the bin lock.
 * Trim verbose comments.
 * Add shared cache for callers without a private cache (non-EAL
   threads, secondary processes). Add rte_fastmem_stats_shared()
   and rte_fastmem_stats_shared_class().
 * Document rte_fastmem_stats_reset() quiescence requirement.

RFC v3:
 * Add rte_fastmem_realloc().
 * Add __rte_malloc/__rte_dealloc attributes; remove incorrect
   __rte_alloc_size/__rte_alloc_align.
 * Extract normalize_align() helper.
 * Remove inline directives from static functions.

RFC v2:
 * Fix use-after-free in rte_fastmem_deinit() with cross-socket
   caches: restructure into three-phase teardown.
 * Add secondary process support (lazy attach, safe deinit).
 * Add handle-based allocation API (rte_fastmem_hlookup,
   rte_fastmem_halloc, rte_fastmem_halloc_bulk).
 * Fix clang -Wthread-safety-analysis warnings.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 doc/api/doxy-api-index.md        |    1 +
 doc/api/doxy-api.conf.in         |    1 +
 lib/fastmem/meson.build          |    6 +
 lib/fastmem/rfc-cover-letter.txt |  128 ++
 lib/fastmem/rte_fastmem.c        | 2123 ++++++++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h        |  908 +++++++++++++
 lib/meson.build                  |    1 +
 7 files changed, 3168 insertions(+)
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rfc-cover-letter.txt
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h

diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 9296042119..7ebf1201ce 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -70,6 +70,7 @@ The public API headers are grouped by topics:
   [memzone](@ref rte_memzone.h),
   [mempool](@ref rte_mempool.h),
   [malloc](@ref rte_malloc.h),
+  [fastmem](@ref rte_fastmem.h),
   [memcpy](@ref rte_memcpy.h)
 
 - **timers**:
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index bedd944681..4355e9fb2d 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -43,6 +43,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/efd \
                           @TOPDIR@/lib/ethdev \
                           @TOPDIR@/lib/eventdev \
+                          @TOPDIR@/lib/fastmem \
                           @TOPDIR@/lib/fib \
                           @TOPDIR@/lib/gpudev \
                           @TOPDIR@/lib/graph \
diff --git a/lib/fastmem/meson.build b/lib/fastmem/meson.build
new file mode 100644
index 0000000000..6c7834608f
--- /dev/null
+++ b/lib/fastmem/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2026 Ericsson AB
+
+sources = files('rte_fastmem.c')
+headers = files('rte_fastmem.h')
+deps += ['eal']
diff --git a/lib/fastmem/rfc-cover-letter.txt b/lib/fastmem/rfc-cover-letter.txt
new file mode 100644
index 0000000000..53752c7e8b
--- /dev/null
+++ b/lib/fastmem/rfc-cover-letter.txt
@@ -0,0 +1,128 @@
+Subject: [RFC] lib/fastmem: fast small-object allocator
+
+This RFC introduces fastmem, a general-purpose small-object allocator
+for DPDK. It is intended to replace per-type mempools with a single
+allocator that handles arbitrary sizes, grows on demand, and matches
+mempool-level performance on the hot path.
+
+Motivation
+----------
+
+DPDK applications commonly maintain many mempools — one per object
+type (connections, sessions, timers, work items). Each must be sized
+up front, wastes memory when over-provisioned, and cannot serve
+objects of a different size. Fastmem eliminates this by accepting
+arbitrary sizes at runtime, backed by a slab allocator that
+repurposes memory across size classes as demand shifts.
+
+Design
+------
+
+Three-layer architecture:
+
+1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
+   reserved lazily (or pre-reserved for deterministic latency).
+
+2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
+   The alignment enables O(1) slab lookup from any object pointer
+   via bitmask — no radix tree or index structure. Slabs move
+   freely between 18 power-of-2 size classes (8 B to 1 MiB).
+
+3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
+   path). Cache misses trigger bulk transfers to/from the shared
+   bin under a spinlock.
+
+Key properties:
+
+- Zero per-object metadata in the production build.
+- NUMA-aware, with per-socket bins and free-slab pools.
+- DMA-usable memory with O(1) virt-to-IOVA translation.
+- Bulk alloc/free with all-or-nothing semantics.
+- Backing memory never returned during lifetime (slabs recycled).
+- Non-EAL threads supported (shared per-socket cache, spinlocked).
+- Secondary process support (lazy attach, shared cache only).
+
+API surface
+-----------
+
+  rte_fastmem_init / deinit
+  rte_fastmem_reserve
+  rte_fastmem_set_limit / get_limit
+  rte_fastmem_alloc / alloc_socket
+  rte_fastmem_realloc
+  rte_fastmem_alloc_bulk / alloc_bulk_socket
+  rte_fastmem_free / free_bulk
+  rte_fastmem_hlookup / halloc / halloc_bulk / hfree / hfree_bulk
+  rte_fastmem_virt2iova
+  rte_fastmem_cache_flush
+  rte_fastmem_max_size / classes
+  rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
+  rte_fastmem_stats_reset
+
+All APIs are marked __rte_experimental.
+
+Performance
+-----------
+
+The single-object hot path is roughly 2–3× the cost of mempool
+and an order of magnitude faster than rte_malloc. Under
+multi-lcore contention, fastmem scales similarly to mempool,
+while rte_malloc collapses.
+
+Limitations
+-----------
+
+- Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
+- Power-of-2 classes only; worst-case internal fragmentation ~50%.
+- Backing memory not reclaimable short of deinit.
+
+Future work
+-----------
+
+- Lcore-affine allocations (false-sharing-free by construction).
+- Mempool ops driver for transparent drop-in use.
+- Debug mode (cookies, double-free detection, poison-on-free).
+- Telemetry integration.
+- EAL integration, allowing EAL-internal subsystems to use
+  fastmem for their small-object allocations.
+
+Changes in RFC v4:
+- Fix crash in halloc/hfree on lcores without hlookup: fall back
+  to shared bin on NULL cache.
+- Keep per-lcore statistics across rte_fastmem_cache_flush().
+- Guard free and IOVA paths against uninitialized state.
+- Lazy-attach stats readers in secondary processes; distinguish
+  -ENODEV from -EINVAL.
+- Protect bin statistics with the bin lock.
+- Trim verbose comments.
+- Add shared cache for callers without a private cache (non-EAL
+  threads, secondary processes). Add rte_fastmem_stats_shared()
+  and rte_fastmem_stats_shared_class().
+- Document rte_fastmem_stats_reset() quiescence requirement.
+- Add tests for handle alloc/free from uncached lcores, stats
+  survival across flush, and shared-cache statistics.
+- Update programming guide (shared cache, stats sections).
+
+Changes in RFC v3:
+- Add rte_fastmem_realloc() with full test coverage.
+- Add __rte_malloc/__rte_dealloc compiler attributes; remove
+  incorrect __rte_alloc_size/__rte_alloc_align.
+- Extract normalize_align() helper; remove redundant inline
+  directives.
+- Merge lifecycle and functional test suites.
+- Add realloc subsection to programming guide.
+
+Changes in RFC v2:
+- Fix cross-socket deinit use-after-free.
+- Add secondary process support.
+- Add handle-based allocation API.
+- Fix clang warnings; misc cleanup.
+
+Mattias Rönnblom (3):
+  doc: add fastmem programming guide
+  lib: add fastmem library
+  app/test: add fastmem test suite
+
+ doc/guides/prog_guide/fastmem_lib.rst | ...
+ lib/fastmem/                          | ...
+ app/test/test_fastmem*.c              | ...
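
The O(1) slab lookup and the slot layout described in the Design
section of the cover letter can be sketched in standalone C as follows.
This is an illustration only: the helper names are hypothetical,
addresses are treated as plain integers, and a 64-byte cache line is
assumed for the slab header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE ((uintptr_t)1 << 21)	/* 2 MiB, 2 MiB-aligned */
#define SLAB_MASK (SLAB_SIZE - 1)
#define HEADER_SIZE 64u			/* assumes a 64-byte cache line */

/* Clearing the low 21 bits of any object address yields its slab base. */
static uintptr_t
slab_base(uintptr_t obj_addr)
{
	return obj_addr & ~SLAB_MASK;
}

/*
 * The first slot starts after the header, at an offset that keeps
 * every slot aligned to its own (power-of-2) size.
 */
static size_t
slot0_offset(size_t slot_size)
{
	return slot_size < HEADER_SIZE ? HEADER_SIZE : slot_size;
}

static size_t
slots_per_slab(size_t slot_size)
{
	return (SLAB_SIZE - slot0_offset(slot_size)) / slot_size;
}
```

Under these assumptions a slab holds 32767 objects of 64 bytes, 511
objects of 4 KiB, and exactly one object of 1 MiB, since the header
costs one slot for every class of 64 bytes or more.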
diff --git a/lib/fastmem/rte_fastmem.c b/lib/fastmem/rte_fastmem.c
new file mode 100644
index 0000000000..4add00ce80
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.c
@@ -0,0 +1,2123 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/queue.h>
+
+#include <eal_export.h>
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_eal.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_spinlock.h>
+
+#include <rte_fastmem.h>
+
+RTE_LOG_REGISTER_DEFAULT(fastmem_logtype, NOTICE);
+
+#define RTE_LOGTYPE_FASTMEM fastmem_logtype
+
+#define FASTMEM_LOG(level, ...) \
+	RTE_LOG_LINE(level, FASTMEM, "" __VA_ARGS__)
+
+#define FASTMEM_MEMZONE_SIZE_LOG2 27                            /* 128 MiB */
+#define FASTMEM_MEMZONE_SIZE ((size_t)1 << FASTMEM_MEMZONE_SIZE_LOG2)
+
+#define FASTMEM_SLAB_SIZE_LOG2 21                               /*   2 MiB */
+#define FASTMEM_SLAB_SIZE ((size_t)1 << FASTMEM_SLAB_SIZE_LOG2)
+#define FASTMEM_SLAB_MASK (FASTMEM_SLAB_SIZE - 1)
+
+#define FASTMEM_SLABS_PER_MEMZONE (FASTMEM_MEMZONE_SIZE / FASTMEM_SLAB_SIZE)
+
+#define FASTMEM_MAX_MEMZONES_PER_SOCKET 64
+
+#define FASTMEM_MIN_CLASS_LOG2 3                                /*   8 B */
+#define FASTMEM_MAX_CLASS_LOG2 20                               /*   1 MiB */
+#define FASTMEM_N_CLASSES (FASTMEM_MAX_CLASS_LOG2 - FASTMEM_MIN_CLASS_LOG2 + 1)
+
+#define FASTMEM_MIN_SIZE ((size_t)1 << FASTMEM_MIN_CLASS_LOG2)
+#define FASTMEM_MAX_ALLOC_SIZE ((size_t)1 << FASTMEM_MAX_CLASS_LOG2)
+
+#define FASTMEM_SLAB_HEADER_SIZE RTE_CACHE_LINE_SIZE
+
+#define FASTMEM_CACHE_BASE_CAPACITY 64
+#define FASTMEM_CACHE_FLOOR_CAPACITY 4
+#define FASTMEM_CACHE_BASE_CLASS_LOG2 12                        /* 4 KiB */
+
+struct fastmem_bin;
+
+/*
+ * Slab header at offset 0 of each 2 MiB slab. Either free (linked
+ * via next_free) or assigned to a bin (linked via list).
+ */
+struct fastmem_slab {
+	struct fastmem_bin *bin;
+	void *free_head;
+	uint32_t free_count;
+	uint32_t n_slots;
+	struct fastmem_slab *next_free;
+	TAILQ_ENTRY(fastmem_slab) list;
+	rte_iova_t iova_base;
+} __rte_aligned(FASTMEM_SLAB_HEADER_SIZE);
+
+TAILQ_HEAD(fastmem_slab_list, fastmem_slab);
+
+struct fastmem_bin {
+	rte_spinlock_t lock;
+	uint32_t slot_size;
+	uint32_t slots_per_slab;
+	uint32_t class_idx;
+	struct fastmem_slab_list partial;
+	struct fastmem_slab_list full;
+	int socket_id;
+	uint64_t slab_acquires;
+	uint64_t slab_releases;
+	uint32_t slabs_partial;
+	uint32_t slabs_full;
+	/*
+	 * Traffic served straight from the bin, with no cache of any kind
+	 * backing it. Reached only on the fallback where a caller has no
+	 * private per-lcore cache and the shared cache could not be created
+	 * either (cache-struct allocation failed, e.g. under a memory limit
+	 * or in an under-provisioned secondary). The normal cache-less path
+	 * goes through the shared cache and is counted there, not here.
+	 * Written under bin->lock, read locklessly by the stats functions.
+	 * Not attributable to an lcore, so it appears only in the global and
+	 * per-class statistics.
+	 */
+	uint64_t nocache_allocs;
+	uint64_t nocache_frees;
+	uint64_t nocache_nomem;
+};
+
+/*
+ * Bounded LIFO of free object pointers, holding statistics counters
+ * alongside the hot-path fields so alloc and free stay on one cache line.
+ *
+ * Used in two ways: as a private per-(lcore, class, socket) cache for
+ * lcore-id-equipped primary threads (written only by its owning lcore, so
+ * lock-free), and as a per-(class, socket) cache shared by all other
+ * callers (serialized by the socket's shared_cache_lock).
+ *
+ * Never freed once created (rte_fastmem_cache_flush() drains the objects
+ * but keeps the struct), so the counters survive a flush and stats readers
+ * may touch it safely.
+ */
+struct fastmem_cache {
+	uint32_t count;
+	uint32_t capacity;
+	uint32_t target;
+	uint64_t alloc_cache_hits;
+	uint64_t alloc_cache_misses;
+	uint64_t alloc_nomem;
+	uint64_t free_cache_hits;
+	uint64_t free_cache_misses;
+	void *objs[];
+} __rte_cache_aligned;
+
+struct fastmem_socket_state {
+	rte_spinlock_t lock;
+	struct fastmem_slab *free_head;
+	size_t reserved_bytes;
+	size_t memory_limit;
+	unsigned int n_memzones;
+	unsigned int memzone_seq;
+	const struct rte_memzone *memzones[FASTMEM_MAX_MEMZONES_PER_SOCKET];
+	struct fastmem_bin bins[FASTMEM_N_CLASSES];
+	struct fastmem_cache *caches[RTE_MAX_LCORE][FASTMEM_N_CLASSES];
+	/*
+	 * Cache shared by all callers lacking a private per-lcore cache
+	 * (lcore-less primary threads and every secondary-process thread),
+	 * guarded by one spinlock for the whole socket.
+	 */
+	rte_spinlock_t shared_cache_lock;
+	struct fastmem_cache *shared_caches[FASTMEM_N_CLASSES];
+};
+
+struct fastmem {
+	struct fastmem_socket_state sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct fastmem *fastmem;
+static const struct rte_memzone *fastmem_mz;
+static bool fastmem_is_primary; /* cached; avoids function call on hot path */
+
+/*
+ * Ensure the global fastmem state is available to this process,
+ * lazily attaching a secondary to the shared memzone on first use.
+ * Returns false (rte_errno = ENODEV) if the primary has not
+ * initialized the library.
+ */
+static bool
+fastmem_assure(void)
+{
+	const struct rte_memzone *mz;
+
+	if (likely(fastmem != NULL))
+		return true;
+
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		rte_errno = ENODEV;
+		return false;
+	}
+
+	mz = rte_memzone_lookup("fastmem_state");
+	if (mz == NULL) {
+		rte_errno = ENODEV;
+		return false;
+	}
+
+	fastmem_mz = mz;
+	fastmem = mz->addr;
+	return true;
+}
+
+static unsigned int
+size_to_class(size_t size, size_t align)
+{
+	size_t effective;
+	unsigned int log2;
+
+	effective = size < FASTMEM_MIN_SIZE ? FASTMEM_MIN_SIZE : size;
+	if (align > effective)
+		effective = align;
+
+	log2 = 64u - rte_clz64(effective - 1);
+
+	if (log2 < FASTMEM_MIN_CLASS_LOG2)
+		log2 = FASTMEM_MIN_CLASS_LOG2;
+	if (log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+static size_t
+class_size(unsigned int class_idx)
+{
+	return (size_t)1 << (class_idx + FASTMEM_MIN_CLASS_LOG2);
+}
+
+/**
+ * Normalize and validate the alignment argument.
+ * Returns true on success (align updated in place), false on invalid input.
+ */
+static bool
+normalize_align(size_t *align)
+{
+	if (*align == 0) {
+		*align = RTE_CACHE_LINE_SIZE;
+		return true;
+	}
+	return rte_is_power_of_2(*align);
+}
+
+static_assert(sizeof(struct fastmem_slab) == FASTMEM_SLAB_HEADER_SIZE,
+	"fastmem slab header must fit in exactly one cache line");
+static_assert(sizeof(struct fastmem_slab) <= FASTMEM_SLAB_SIZE,
+	"slab header larger than a slab makes no sense");
+
+static struct fastmem_slab *
+slab_of(void *obj)
+{
+	return (struct fastmem_slab *)
+		((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+}
+
+static size_t
+slab_slot0_offset(size_t class_size)
+{
+	return class_size < FASTMEM_SLAB_HEADER_SIZE ?
+		FASTMEM_SLAB_HEADER_SIZE : class_size;
+}
+
+static uint32_t
+slab_slot_count(size_t class_size)
+{
+	size_t offset = slab_slot0_offset(class_size);
+
+	return (uint32_t)((FASTMEM_SLAB_SIZE - offset) / class_size);
+}
+
+/* Must be called with bin->lock held. */
+static void
+slab_init(struct fastmem_bin *bin, struct fastmem_slab *slab)
+{
+	size_t slot_size = bin->slot_size;
+	size_t offset = slab_slot0_offset(slot_size);
+	uint32_t n = bin->slots_per_slab;
+	void *prev = NULL;
+	uint32_t i;
+
+	slab->bin = bin;
+	slab->n_slots = n;
+	slab->free_count = n;
+
+	/* Build in reverse so pops yield sequential addresses. */
+	for (i = n; i > 0; i--) {
+		void *slot = RTE_PTR_ADD(slab, offset + (i - 1) * slot_size);
+		*(void **)slot = prev;
+		prev = slot;
+	}
+	slab->free_head = prev;
+}
+
+static int
+grow_socket(struct fastmem_socket_state *socket, int socket_id)
+{
+	char name[RTE_MEMZONE_NAMESIZE];
+	const struct rte_memzone *mz;
+	unsigned int i;
+
+	if (socket->reserved_bytes + FASTMEM_MEMZONE_SIZE > socket->memory_limit) {
+		FASTMEM_LOG(ERR,
+			"reserve would exceed memory_limit (%zu) on socket %d",
+			socket->memory_limit, socket_id);
+		return -ENOMEM;
+	}
+
+	if (socket->n_memzones == FASTMEM_MAX_MEMZONES_PER_SOCKET) {
+		FASTMEM_LOG(ERR,
+			"reached per-socket memzone cap (%u) on socket %d",
+			FASTMEM_MAX_MEMZONES_PER_SOCKET, socket_id);
+		return -ENOMEM;
+	}
+
+	snprintf(name, sizeof(name), "fastmem_%d_%u", socket_id,
+			socket->memzone_seq++);
+
+	mz = rte_memzone_reserve_aligned(name, FASTMEM_MEMZONE_SIZE,
+			socket_id, RTE_MEMZONE_IOVA_CONTIG,
+			FASTMEM_SLAB_SIZE);
+	if (mz == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to reserve %zu-byte memzone '%s' on socket %d: %s",
+			(size_t)FASTMEM_MEMZONE_SIZE, name, socket_id,
+			rte_strerror(rte_errno));
+		return -ENOMEM;
+	}
+
+	socket->memzones[socket->n_memzones++] = mz;
+	socket->reserved_bytes += FASTMEM_MEMZONE_SIZE;
+
+	for (i = 0; i < FASTMEM_SLABS_PER_MEMZONE; i++) {
+		struct fastmem_slab *slab = RTE_PTR_ADD(mz->addr,
+				i * FASTMEM_SLAB_SIZE);
+
+		slab->iova_base = mz->iova + i * FASTMEM_SLAB_SIZE;
+		slab->next_free = socket->free_head;
+		socket->free_head = slab;
+	}
+
+	FASTMEM_LOG(DEBUG,
+		"reserved memzone '%s' (%zu bytes) on socket %d; %zu slabs added",
+		name, (size_t)FASTMEM_MEMZONE_SIZE, socket_id,
+		(size_t)FASTMEM_SLABS_PER_MEMZONE);
+
+	return 0;
+}
+
+static struct fastmem_slab *
+slab_acquire(struct fastmem_socket_state *socket, int socket_id)
+{
+	struct fastmem_slab *slab;
+
+	rte_spinlock_lock(&socket->lock);
+
+	if (socket->free_head == NULL) {
+		int rc = grow_socket(socket, socket_id);
+
+		if (rc < 0) {
+			rte_spinlock_unlock(&socket->lock);
+			return NULL;
+		}
+	}
+
+	slab = socket->free_head;
+	socket->free_head = slab->next_free;
+	slab->next_free = NULL;
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return slab;
+}
+
+static void
+slab_release(struct fastmem_socket_state *socket,
+		struct fastmem_slab *slab)
+{
+	rte_spinlock_lock(&socket->lock);
+
+	slab->next_free = socket->free_head;
+	socket->free_head = slab;
+
+	rte_spinlock_unlock(&socket->lock);
+}
+
+static void
+bin_init(struct fastmem_bin *bin, unsigned int class_idx, int socket_id)
+{
+	size_t slot_size = class_size(class_idx);
+
+	rte_spinlock_init(&bin->lock);
+	bin->slot_size = (uint32_t)slot_size;
+	bin->slots_per_slab = slab_slot_count(slot_size);
+	bin->class_idx = class_idx;
+	TAILQ_INIT(&bin->partial);
+	TAILQ_INIT(&bin->full);
+	bin->socket_id = socket_id;
+	bin->slab_acquires = 0;
+	bin->slab_releases = 0;
+	bin->slabs_partial = 0;
+	bin->slabs_full = 0;
+}
+
+static void
+bin_release(struct fastmem_bin *bin, struct fastmem_socket_state *socket)
+{
+	struct fastmem_slab *slab;
+
+	while ((slab = TAILQ_FIRST(&bin->partial)) != NULL) {
+		TAILQ_REMOVE(&bin->partial, slab, list);
+		slab_release(socket, slab);
+	}
+	while ((slab = TAILQ_FIRST(&bin->full)) != NULL) {
+		TAILQ_REMOVE(&bin->full, slab, list);
+		slab_release(socket, slab);
+	}
+}
+
+static unsigned int
+bin_pop_locked(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	unsigned int got = 0;
+
+	while (got < n) {
+		struct fastmem_slab *slab = TAILQ_FIRST(&bin->partial);
+		void *obj;
+
+		if (slab == NULL)
+			break;
+
+		obj = slab->free_head;
+		slab->free_head = *(void **)obj;
+		slab->free_count--;
+		objs[got++] = obj;
+
+		if (slab->free_count == 0) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			TAILQ_INSERT_HEAD(&bin->full, slab, list);
+			bin->slabs_partial--;
+			bin->slabs_full++;
+		}
+	}
+
+	return got;
+}
+
+/*
+ * Fully-drained slabs are accumulated in @p to_release for the
+ * caller to return after dropping the lock.
+ */
+static unsigned int
+bin_push_locked(struct fastmem_bin *bin, void **objs, unsigned int n,
+		struct fastmem_slab **to_release)
+{
+	unsigned int n_release = 0;
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		void *obj = objs[i];
+		struct fastmem_slab *slab = (struct fastmem_slab *)
+			((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+		bool was_full = slab->free_count == 0;
+
+		*(void **)obj = slab->free_head;
+		slab->free_head = obj;
+		slab->free_count++;
+
+		if (was_full) {
+			TAILQ_REMOVE(&bin->full, slab, list);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_full--;
+			bin->slabs_partial++;
+		}
+
+		if (slab->free_count == slab->n_slots) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			bin->slabs_partial--;
+			bin->slab_releases++;
+			to_release[n_release++] = slab;
+		}
+	}
+
+	return n_release;
+}
+
+/*
+ * Allocate a single object from the bin. Pass @p nocache true only on the
+ * no-cache fallback (a user allocation that has neither a private nor a
+ * shared cache); it counts the alloc against the bin's no-cache statistics.
+ * Internal cache machinery (refills) passes false.
+ */
+static void *
+bin_alloc_one(struct fastmem_bin *bin, bool nocache)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	void *obj;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (bin_pop_locked(bin, &obj, 1) == 0) {
+		struct fastmem_slab *slab;
+
+		if (TAILQ_FIRST(&bin->partial) != NULL)
+			continue;
+
+		rte_spinlock_unlock(&bin->lock);
+
+		slab = slab_acquire(socket, bin->socket_id);
+		if (slab == NULL) {
+			rte_errno = ENOMEM;
+			return NULL;
+		}
+
+		rte_spinlock_lock(&bin->lock);
+
+		if (unlikely(TAILQ_FIRST(&bin->partial) != NULL)) {
+			/* Release surplus slab without holding bin->lock. */
+			rte_spinlock_unlock(&bin->lock);
+			slab_release(socket, slab);
+			rte_spinlock_lock(&bin->lock);
+		} else {
+			slab_init(bin, slab);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_partial++;
+			bin->slab_acquires++;
+		}
+	}
+
+	if (nocache)
+		bin->nocache_allocs++;
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return obj;
+}
+
+/*
+ * Allocate up to @p n objects from the bin. Pass @p nocache true only on the
+ * no-cache fallback (a user allocation that has neither a private nor a
+ * shared cache); it counts the allocs against the bin's no-cache statistics.
+ * Internal cache machinery (e.g. a cache refill) passes false.
+ */
+static unsigned int
+bin_alloc_bulk(struct fastmem_bin *bin, void **objs, unsigned int n,
+		bool nocache)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	unsigned int got = 0;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (got < n) {
+		struct fastmem_slab *slab;
+
+		got += bin_pop_locked(bin, objs + got, n - got);
+		if (got == n)
+			break;
+
+		if (TAILQ_FIRST(&bin->partial) != NULL)
+			continue;
+
+		rte_spinlock_unlock(&bin->lock);
+
+		slab = slab_acquire(socket, bin->socket_id);
+		if (slab == NULL) {
+			rte_spinlock_lock(&bin->lock);
+			break;
+		}
+
+		rte_spinlock_lock(&bin->lock);
+
+		if (unlikely(TAILQ_FIRST(&bin->partial) != NULL)) {
+			/* Release surplus slab without holding bin->lock. */
+			rte_spinlock_unlock(&bin->lock);
+			slab_release(socket, slab);
+			rte_spinlock_lock(&bin->lock);
+		} else {
+			slab_init(bin, slab);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_partial++;
+			bin->slab_acquires++;
+		}
+	}
+
+	if (nocache)
+		bin->nocache_allocs += got;
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return got;
+}
+
+/*
+ * Free a single object to the bin. Pass @p nocache true only on the no-cache
+ * fallback (a user free that has neither a private nor a shared cache); it
+ * counts the free against the bin's no-cache statistics. Internal cache
+ * machinery (drain, teardown, flush) passes false.
+ */
+static void
+bin_free_one(struct fastmem_bin *bin, void *obj, bool nocache)
+{
+	unsigned int n_release;
+	struct fastmem_slab *slab_to_release = NULL;
+	struct fastmem_socket_state *socket;
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, &obj, 1, &slab_to_release);
+	if (nocache)
+		bin->nocache_frees++;
+	rte_spinlock_unlock(&bin->lock);
+
+	if (n_release > 0) {
+		socket = &fastmem->sockets[bin->socket_id];
+		slab_release(socket, slab_to_release);
+	}
+}
+
+/*
+ * Free a batch of objects to the bin. Always internal cache machinery
+ * (drain, teardown, flush), never a no-cache user free, so unlike
+ * bin_free_one() it has no nocache flag and is never counted against the
+ * bin's no-cache statistics.
+ */
+static void
+bin_free_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	struct fastmem_slab *to_release[FASTMEM_CACHE_BASE_CAPACITY];
+	unsigned int n_release;
+	unsigned int i;
+
+	RTE_VERIFY(n <= RTE_DIM(to_release));
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, objs, n, to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	for (i = 0; i < n_release; i++)
+		slab_release(socket, to_release[i]);
+}
+
+static unsigned int
+cache_capacity(unsigned int class_idx)
+{
+	unsigned int class_log2 = class_idx + FASTMEM_MIN_CLASS_LOG2;
+	unsigned int shift;
+	unsigned int cap;
+
+	if (class_log2 <= FASTMEM_CACHE_BASE_CLASS_LOG2)
+		return FASTMEM_CACHE_BASE_CAPACITY;
+
+	shift = class_log2 - FASTMEM_CACHE_BASE_CLASS_LOG2;
+	cap = FASTMEM_CACHE_BASE_CAPACITY >> shift;
+
+	return cap < FASTMEM_CACHE_FLOOR_CAPACITY ?
+		FASTMEM_CACHE_FLOOR_CAPACITY : cap;
+}
+
+static struct fastmem_cache **
+cache_slot(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	if (lcore_id >= RTE_MAX_LCORE)
+		return NULL;
+	return &socket->caches[lcore_id][class_idx];
+}
+
+/*
+ * Allocate and initialize a cache struct, itself drawn from fastmem on the
+ * calling lcore's socket, bypassing the cache layer to avoid recursion.
+ */
+static struct fastmem_cache *
+cache_alloc(struct fastmem_socket_state *socket, unsigned int class_idx)
+{
+	struct fastmem_cache *cache;
+	unsigned int capacity = cache_capacity(class_idx);
+	size_t cache_size = sizeof(*cache) + capacity * sizeof(void *);
+	unsigned int cache_class = size_to_class(cache_size, RTE_CACHE_LINE_SIZE);
+	unsigned int own_socket = rte_socket_id();
+	struct fastmem_socket_state *alloc_socket;
+
+	if (cache_class >= FASTMEM_N_CLASSES) {
+		FASTMEM_LOG(ERR,
+			"cache size %zu exceeds max size class",
+			cache_size);
+		return NULL;
+	}
+
+	if (own_socket >= RTE_MAX_NUMA_NODES)
+		own_socket = (unsigned int)socket->bins[0].socket_id;
+
+	alloc_socket = &fastmem->sockets[own_socket];
+
+	cache = bin_alloc_one(&alloc_socket->bins[cache_class], false);
+	if (cache == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to allocate cache for class %u on socket %u",
+			class_idx, own_socket);
+		return NULL;
+	}
+
+	cache->count = 0;
+	cache->capacity = capacity;
+	cache->target = capacity / 2;
+	cache->alloc_cache_hits = 0;
+	cache->alloc_cache_misses = 0;
+	cache->alloc_nomem = 0;
+	cache->free_cache_hits = 0;
+	cache->free_cache_misses = 0;
+
+	return cache;
+}
+
+static struct fastmem_cache *
+cache_create(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache **slot = cache_slot(socket, class_idx, lcore_id);
+	struct fastmem_cache *cache;
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	cache = cache_alloc(socket, class_idx);
+	if (cache == NULL)
+		return NULL;
+
+	*slot = cache;
+
+	return cache;
+}
+
+/*
+ * Get-or-create the private per-lcore cache. Returns NULL for callers that
+ * have no private cache (secondary process, or no lcore id), which then use
+ * the shared cache instead.
+ */
+static struct fastmem_cache *
+cache_get(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	struct fastmem_cache **slot;
+	struct fastmem_cache *cache;
+
+	if (unlikely(!fastmem_is_primary))
+		return NULL;
+
+	slot = cache_slot(socket, class_idx, lcore_id);
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	return cache_create(socket, class_idx, lcore_id);
+}
+
+static void *
+cache_pop(struct fastmem_cache *cache, struct fastmem_bin *bin)
+{
+	if (cache->count > 0) {
+		cache->alloc_cache_hits++;
+		return cache->objs[--cache->count];
+	}
+
+	cache->count = bin_alloc_bulk(bin, cache->objs, cache->target, false);
+	if (cache->count == 0)
+		return NULL;
+
+	cache->alloc_cache_misses++;
+	return cache->objs[--cache->count];
+}
+
+static void
+cache_push(struct fastmem_cache *cache, struct fastmem_bin *bin, void *obj)
+{
+	unsigned int drain;
+
+	if (cache->count < cache->capacity) {
+		cache->free_cache_hits++;
+		cache->objs[cache->count++] = obj;
+		return;
+	}
+
+	cache->free_cache_misses++;
+
+	/*
+	 * Drain the oldest (bottom) half to the bin, keep the newest
+	 * (top) half for temporal reuse.
+	 */
+	drain = cache->count - cache->target;
+	bin_free_bulk(bin, cache->objs, drain);
+	memmove(cache->objs, cache->objs + drain,
+		cache->target * sizeof(cache->objs[0]));
+	cache->count = cache->target;
+
+	cache->objs[cache->count++] = obj;
+}
+
+/* Get-or-create the shared cache; call with shared_cache_lock held. */
+static struct fastmem_cache *
+shared_cache_get(struct fastmem_socket_state *socket, unsigned int class_idx)
+{
+	struct fastmem_cache *cache = socket->shared_caches[class_idx];
+
+	if (cache != NULL)
+		return cache;
+
+	cache = cache_alloc(socket, class_idx);
+	if (cache == NULL)
+		return NULL;
+
+	socket->shared_caches[class_idx] = cache;
+
+	return cache;
+}
+
+/* Allocate one object via the shared cache, or straight from the bin if the
+ * cache cannot be created. */
+static void *
+shared_alloc_one(struct fastmem_socket_state *socket, unsigned int class_idx)
+{
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	struct fastmem_cache *cache;
+	void *obj;
+
+	rte_spinlock_lock(&socket->shared_cache_lock);
+
+	cache = shared_cache_get(socket, class_idx);
+	if (likely(cache != NULL)) {
+		obj = cache_pop(cache, bin);
+		rte_spinlock_unlock(&socket->shared_cache_lock);
+		return obj;
+	}
+
+	rte_spinlock_unlock(&socket->shared_cache_lock);
+
+	return bin_alloc_one(bin, true);
+}
+
+/* Allocate up to @p n objects via the shared cache; returns the count got. */
+static unsigned int
+shared_alloc_bulk(struct fastmem_socket_state *socket, unsigned int class_idx,
+		void **objs, unsigned int n)
+{
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	struct fastmem_cache *cache;
+	unsigned int got = 0;
+
+	rte_spinlock_lock(&socket->shared_cache_lock);
+
+	cache = shared_cache_get(socket, class_idx);
+	if (likely(cache != NULL)) {
+		while (got < n) {
+			void *obj = cache_pop(cache, bin);
+
+			if (obj == NULL)
+				break;
+			objs[got++] = obj;
+		}
+		rte_spinlock_unlock(&socket->shared_cache_lock);
+		return got;
+	}
+
+	rte_spinlock_unlock(&socket->shared_cache_lock);
+
+	return bin_alloc_bulk(bin, objs, n, true);
+}
+
+/* Free one object via the shared cache. */
+static void
+shared_free_one(struct fastmem_socket_state *socket, unsigned int class_idx,
+		void *obj)
+{
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	struct fastmem_cache *cache;
+
+	rte_spinlock_lock(&socket->shared_cache_lock);
+
+	cache = shared_cache_get(socket, class_idx);
+	if (likely(cache != NULL)) {
+		cache_push(cache, bin, obj);
+		rte_spinlock_unlock(&socket->shared_cache_lock);
+		return;
+	}
+
+	rte_spinlock_unlock(&socket->shared_cache_lock);
+
+	bin_free_one(bin, obj, true);
+}
+
+/* Record an alloc failure against the per-lcore cache, the shared cache, or
+ * the bin's no-cache counter, in that order of preference. */
+static void
+account_alloc_nomem(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache *cache = cache_get(socket, class_idx, lcore_id);
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+
+	if (likely(cache != NULL)) {
+		cache->alloc_nomem++;
+		return;
+	}
+
+	rte_spinlock_lock(&socket->shared_cache_lock);
+	cache = shared_cache_get(socket, class_idx);
+	if (likely(cache != NULL)) {
+		cache->alloc_nomem++;
+		rte_spinlock_unlock(&socket->shared_cache_lock);
+		return;
+	}
+	rte_spinlock_unlock(&socket->shared_cache_lock);
+
+	/* No cache of any kind: count against the bin. */
+	rte_spinlock_lock(&bin->lock);
+	bin->nocache_nomem++;
+	rte_spinlock_unlock(&bin->lock);
+}
+
+static void
+socket_release_caches(struct fastmem_socket_state *socket)
+{
+	unsigned int lcore;
+	unsigned int c;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache = socket->caches[lcore][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache, false);
+
+			socket->caches[lcore][c] = NULL;
+		}
+	}
+
+	for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+		struct fastmem_cache *cache = socket->shared_caches[c];
+		struct fastmem_slab *cache_slab;
+
+		if (cache == NULL)
+			continue;
+
+		if (cache->count > 0) {
+			bin_free_bulk(&socket->bins[c],
+				cache->objs, cache->count);
+			cache->count = 0;
+		}
+
+		cache_slab = slab_of(cache);
+		bin_free_one(cache_slab->bin, cache, false);
+
+		socket->shared_caches[c] = NULL;
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_init, 24.11)
+int
+rte_fastmem_init(void)
+{
+	unsigned int s, c;
+
+	if (fastmem != NULL)
+		return -EBUSY;
+
+	fastmem_mz = rte_memzone_reserve_aligned("fastmem_state",
+			sizeof(*fastmem), SOCKET_ID_ANY, 0,
+			RTE_CACHE_LINE_SIZE);
+	if (fastmem_mz == NULL)
+		return -ENOMEM;
+
+	fastmem = fastmem_mz->addr;
+	fastmem_is_primary = true;
+	memset(fastmem, 0, sizeof(*fastmem));
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		rte_spinlock_init(&socket->lock);
+		rte_spinlock_init(&socket->shared_cache_lock);
+		socket->memory_limit = SIZE_MAX;
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++)
+			bin_init(&socket->bins[c], c, (int)s);
+	}
+
+	return 0;
+}
+
+static void
+release_socket_caches(struct fastmem_socket_state *socket)
+{
+	socket_release_caches(socket);
+}
+
+static void
+release_socket_bins(struct fastmem_socket_state *socket)
+{
+	unsigned int c;
+
+	for (c = 0; c < FASTMEM_N_CLASSES; c++)
+		bin_release(&socket->bins[c], socket);
+}
+
+static void
+release_socket_memzones(struct fastmem_socket_state *socket)
+{
+	unsigned int i;
+
+	for (i = 0; i < socket->n_memzones; i++)
+		rte_memzone_free(socket->memzones[i]);
+
+	socket->free_head = NULL;
+	socket->reserved_bytes = 0;
+	socket->n_memzones = 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_deinit, 24.11)
+void
+rte_fastmem_deinit(void)
+{
+	unsigned int i;
+
+	if (fastmem == NULL)
+		return;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		fastmem = NULL;
+		fastmem_mz = NULL;
+		return;
+	}
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_caches(&fastmem->sockets[i]);
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_bins(&fastmem->sockets[i]);
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_memzones(&fastmem->sockets[i]);
+
+	rte_memzone_free(fastmem_mz);
+	fastmem_mz = NULL;
+	fastmem = NULL;
+}
+
+/* Same resolution order as rte_malloc's malloc_get_numa_socket(). */
+static unsigned int
+local_socket_id(void)
+{
+	int sid = (int)rte_socket_id();
+
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = (int)rte_lcore_to_socket_id(rte_get_main_lcore());
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = rte_socket_id_by_idx(0);
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	return 0;
+}
+
+static int
+reserve_on_socket(int sid, size_t size)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[sid];
+	int rc = 0;
+
+	rte_spinlock_lock(&socket->lock);
+
+	while (socket->reserved_bytes < size) {
+		rc = grow_socket(socket, sid);
+		if (rc < 0)
+			break;
+	}
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return rc;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_reserve, 24.11)
+int
+rte_fastmem_reserve(size_t size, int socket_id)
+{
+	unsigned int i;
+	int rc;
+
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id != SOCKET_ID_ANY) {
+		if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+			return -EINVAL;
+		return reserve_on_socket(socket_id, size);
+	}
+
+	rc = reserve_on_socket(local_socket_id(), size);
+	if (rc == 0)
+		return 0;
+
+	for (i = 0; i < rte_socket_count(); i++) {
+		int sid = rte_socket_id_by_idx(i);
+
+		if (sid < 0 || (unsigned int)sid == local_socket_id())
+			continue;
+
+		rc = reserve_on_socket(sid, size);
+		if (rc == 0)
+			return 0;
+	}
+
+	return rc;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_set_limit, 24.11)
+int
+rte_fastmem_set_limit(int socket_id, size_t max_bytes)
+{
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id == SOCKET_ID_ANY) {
+		for (unsigned int i = 0; i < RTE_MAX_NUMA_NODES; i++)
+			fastmem->sockets[i].memory_limit = max_bytes;
+		return 0;
+	}
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	fastmem->sockets[socket_id].memory_limit = max_bytes;
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_get_limit, 24.11)
+size_t
+rte_fastmem_get_limit(int socket_id)
+{
+	if (fastmem == NULL || socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return 0;
+
+	return fastmem->sockets[socket_id].memory_limit;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_max_size, 24.11)
+size_t
+rte_fastmem_max_size(void)
+{
+	return FASTMEM_MAX_ALLOC_SIZE;
+}
+
+static void *
+alloc_from_socket(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache *cache;
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+
+	cache = cache_get(socket, class_idx, lcore_id);
+	if (likely(cache != NULL))
+		return cache_pop(cache, bin);
+
+	return shared_alloc_one(socket, class_idx);
+}
+
+static void
+do_free(void *ptr)
+{
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+
+	if (unlikely(!fastmem_assure()))
+		return;
+
+	slab = slab_of(ptr);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+	if (likely(cache != NULL))
+		cache_push(cache, bin, ptr);
+	else
+		shared_free_one(socket, bin->class_idx, ptr);
+}
+
+static int
+do_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags, unsigned int lcore_id,
+		int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int got = 0;
+
+	if (unlikely(!fastmem_assure()))
+		return -rte_errno;
+
+	if (unlikely(!normalize_align(&align))) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return -E2BIG;
+	}
+
+	socket = &fastmem->sockets[socket_id];
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		/* Drain from cache. */
+		unsigned int avail = RTE_MIN(cache->count, n);
+
+		cache->count -= avail;
+		memcpy(ptrs, &cache->objs[cache->count],
+			avail * sizeof(void *));
+		got = avail;
+		cache->alloc_cache_hits += avail;
+
+		if (got < n) {
+			unsigned int need = n - got;
+			unsigned int want = RTE_MAX(need, cache->target);
+			unsigned int filled;
+
+			if (want <= cache->capacity) {
+				/* Refill into cache, give caller their share. */
+				filled = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					cache->objs, want, false);
+				if (filled > 0)
+					cache->alloc_cache_misses += RTE_MIN(filled, need);
+				if (filled >= need) {
+					memcpy(ptrs + got,
+						cache->objs + filled - need,
+						need * sizeof(void *));
+					cache->count = filled - need;
+					got = n;
+				} else {
+					memcpy(ptrs + got, cache->objs,
+						filled * sizeof(void *));
+					got += filled;
+					cache->count = 0;
+				}
+			} else {
+				/*
+				 * n exceeds cache capacity; pull directly,
+				 * but count as cache misses since the caller
+				 * has a cache.
+				 */
+				unsigned int pulled = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, need, false);
+				if (pulled > 0)
+					cache->alloc_cache_misses += pulled;
+				got += pulled;
+			}
+		}
+	} else {
+		got = shared_alloc_bulk(socket, class_idx, ptrs, n);
+	}
+
+	if (unlikely(got < n) && fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count() && got < n; i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			socket = &fastmem->sockets[sid];
+			cache = cache_get(socket, class_idx, lcore_id);
+			if (likely(cache != NULL)) {
+				unsigned int avail =
+					RTE_MIN(cache->count, n - got);
+				cache->count -= avail;
+				memcpy(ptrs + got,
+					&cache->objs[cache->count],
+					avail * sizeof(void *));
+				cache->alloc_cache_hits += avail;
+				got += avail;
+			}
+			if (got < n) {
+				if (cache != NULL) {
+					unsigned int pulled = bin_alloc_bulk(
+						&socket->bins[class_idx],
+						ptrs + got, n - got, false);
+					if (pulled > 0)
+						cache->alloc_cache_misses += pulled;
+					got += pulled;
+				} else {
+					got += shared_alloc_bulk(socket,
+						class_idx, ptrs + got, n - got);
+				}
+			}
+		}
+	}
+
+	if (unlikely(got < n)) {
+		/* All-or-nothing: free the partial result and fail. */
+		unsigned int i;
+
+		for (i = 0; i < got; i++)
+			do_free(ptrs[i]);
+
+		account_alloc_nomem(&fastmem->sockets[socket_id], class_idx,
+			lcore_id);
+		rte_errno = ENOMEM;
+		return -ENOMEM;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO) {
+		size_t cs = class_size(class_idx);
+		unsigned int i;
+
+		for (i = 0; i < n; i++)
+			memset(ptrs[i], 0, cs);
+	}
+
+	return 0;
+}
+
+static void *
+do_alloc(size_t size, size_t align, unsigned int flags,
+		unsigned int lcore_id, int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	void *obj;
+
+	if (unlikely(!fastmem_assure()))
+		return NULL;
+
+	if (unlikely(!normalize_align(&align))) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return NULL;
+	}
+
+	obj = alloc_from_socket(&fastmem->sockets[socket_id],
+			class_idx, lcore_id);
+
+	if (likely(obj != NULL))
+		goto out;
+
+	if (fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count(); i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			obj = alloc_from_socket(&fastmem->sockets[sid],
+					class_idx, lcore_id);
+			if (obj != NULL)
+				goto out;
+		}
+	}
+
+	account_alloc_nomem(&fastmem->sockets[socket_id], class_idx, lcore_id);
+	rte_errno = ENOMEM;
+	return NULL;
+
+out:
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc, 24.11)
+void *
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+{
+	return do_alloc(size, align, flags, rte_lcore_id(),
+			local_socket_id(), false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_socket, 24.11)
+void *
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc(size, align, flags, rte_lcore_id(),
+				local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	return do_alloc(size, align, flags, rte_lcore_id(), socket_id, false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free, 24.11)
+void
+rte_fastmem_free(void *ptr)
+{
+	if (unlikely(ptr == NULL))
+		return;
+
+	do_free(ptr);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_realloc, 24.11)
+void *
+rte_fastmem_realloc(void *ptr, size_t size, size_t align)
+{
+	struct fastmem_slab *slab;
+	unsigned int old_class, new_class;
+	size_t old_size;
+	void *new_ptr;
+
+	if (ptr == NULL)
+		return rte_fastmem_alloc(size, align, 0);
+
+	if (size == 0) {
+		rte_fastmem_free(ptr);
+		return NULL;
+	}
+
+	if (unlikely(!normalize_align(&align))) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	new_class = size_to_class(size, align);
+	if (unlikely(new_class >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return NULL;
+	}
+
+	slab = slab_of(ptr);
+	old_class = slab->bin->class_idx;
+
+	if (new_class == old_class)
+		return ptr;
+
+	new_ptr = rte_fastmem_alloc(size, align, 0);
+	if (unlikely(new_ptr == NULL))
+		return NULL;
+
+	old_size = class_size(old_class);
+	memcpy(new_ptr, ptr, RTE_MIN(old_size, size));
+	rte_fastmem_free(ptr);
+
+	return new_ptr;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk, 24.11)
+int
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags)
+{
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), local_socket_id(), false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk_socket, 24.11)
+int
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc_bulk(ptrs, n, size, align, flags,
+				rte_lcore_id(), local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), socket_id, false);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free_bulk, 24.11)
+void
+rte_fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int lcore_id;
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int space;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	if (unlikely(!fastmem_assure()))
+		return;
+
+	lcore_id = rte_lcore_id();
+
+	/* Fast path: assume all objects share the first object's bin. */
+	slab = slab_of(ptrs[0]);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+
+	if (unlikely(cache == NULL)) {
+		for (i = 0; i < n; i++)
+			do_free(ptrs[i]);
+		return;
+	}
+
+	/*
+	 * Try to push all objects into the cache in one memcpy.
+	 * If any object belongs to a different bin, fall back to
+	 * per-object free for the whole batch.
+	 */
+	space = cache->capacity - cache->count;
+	if (likely(n <= space)) {
+		/* Verify all same bin (common case). */
+		for (i = 1; i < n; i++)
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		cache->free_cache_hits += n;
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+	/* Would overflow cache — drain first, then push. */
+	if (n <= cache->capacity) {
+		unsigned int drain;
+
+		for (i = 1; i < n; i++)
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+
+		cache->free_cache_misses += n;
+		drain = cache->count - cache->target + n;
+		if (drain > cache->count)
+			drain = cache->count;
+		if (drain > 0) {
+			bin_free_bulk(bin, cache->objs, drain);
+			cache->count -= drain;
+			memmove(cache->objs, cache->objs + drain,
+				cache->count * sizeof(cache->objs[0]));
+		}
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+slow:
+	for (i = 0; i < n; i++)
+		do_free(ptrs[i]);
+}
+
+#define FASTMEM_HANDLE_CLASS_BITS 8
+
+static rte_fastmem_handle_t
+fastmem_handle_pack(unsigned int class_idx, int socket_id)
+{
+	return (uint32_t)class_idx |
+		((uint32_t)socket_id << FASTMEM_HANDLE_CLASS_BITS);
+}
+
+static unsigned int
+fastmem_handle_class(rte_fastmem_handle_t h)
+{
+	return h & ((1U << FASTMEM_HANDLE_CLASS_BITS) - 1);
+}
+
+static int
+fastmem_handle_socket(rte_fastmem_handle_t h)
+{
+	return (int)(h >> FASTMEM_HANDLE_CLASS_BITS);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hlookup, 24.11)
+int
+rte_fastmem_hlookup(size_t size, size_t align, int socket_id,
+		rte_fastmem_handle_t *handle)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+
+	if (handle == NULL)
+		return -EINVAL;
+
+	if (unlikely(!fastmem_assure()))
+		return -rte_errno;
+
+	if (!normalize_align(&align))
+		return -EINVAL;
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	class_idx = size_to_class(size, align);
+	if (class_idx >= FASTMEM_N_CLASSES)
+		return -E2BIG;
+
+	/* Pre-create the cache for the calling lcore. */
+	socket = &fastmem->sockets[socket_id];
+	cache_create(socket, class_idx, rte_lcore_id());
+
+	*handle = fastmem_handle_pack(class_idx, socket_id);
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_halloc, 24.11)
+void *
+rte_fastmem_halloc(rte_fastmem_handle_t handle, unsigned int flags)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	unsigned int lcore_id = rte_lcore_id();
+	struct fastmem_socket_state *socket;
+	struct fastmem_bin *bin;
+	struct fastmem_cache *cache;
+	void *obj;
+
+	if (unlikely(!fastmem_assure()))
+		return NULL;
+
+	socket = &fastmem->sockets[socket_id];
+	bin = &socket->bins[class_idx];
+
+	cache = cache_get(socket, class_idx, lcore_id);
+	if (likely(cache != NULL))
+		obj = cache_pop(cache, bin);
+	else
+		obj = shared_alloc_one(socket, class_idx);
+
+	if (unlikely(obj == NULL)) {
+		account_alloc_nomem(socket, class_idx, lcore_id);
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_halloc_bulk, 24.11)
+int
+rte_fastmem_halloc_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n, unsigned int flags)
+{
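+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	size_t cs = class_size(class_idx);
+
+	/*
+	 * Use the class size as the alignment too, so that
+	 * size_to_class() maps back to the handle's own class even
+	 * for classes smaller than a cache line.
+	 */
+	return do_alloc_bulk(ptrs, n, cs, cs, flags, rte_lcore_id(),
+			socket_id, false);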
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hfree, 24.11)
+void
+rte_fastmem_hfree(rte_fastmem_handle_t handle, void *ptr)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	unsigned int lcore_id = rte_lcore_id();
+	struct fastmem_socket_state *socket;
+	struct fastmem_bin *bin;
+	struct fastmem_cache *cache;
+
+	if (unlikely(ptr == NULL))
+		return;
+
+	if (unlikely(!fastmem_assure()))
+		return;
+
+	socket = &fastmem->sockets[socket_id];
+	bin = &socket->bins[class_idx];
+
+	cache = cache_get(socket, class_idx, lcore_id);
+	if (likely(cache != NULL))
+		cache_push(cache, bin, ptr);
+	else
+		shared_free_one(socket, class_idx, ptr);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hfree_bulk, 24.11)
+void
+rte_fastmem_hfree_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	struct fastmem_socket_state *socket;
+	struct fastmem_bin *bin;
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	if (unlikely(!fastmem_assure()))
+		return;
+
+	socket = &fastmem->sockets[socket_id];
+	bin = &socket->bins[class_idx];
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		for (i = 0; i < n; i++)
+			cache_push(cache, bin, ptrs[i]);
+	} else {
+		for (i = 0; i < n; i++)
+			shared_free_one(socket, class_idx, ptrs[i]);
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_virt2iova, 24.11)
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr)
+{
+	struct fastmem_slab *slab;
+
+	if (unlikely(!fastmem_assure()))
+		return RTE_BAD_IOVA;
+
+	slab = slab_of((void *)(uintptr_t)ptr);
+
+	return slab->iova_base + ((uintptr_t)ptr - (uintptr_t)slab);
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_cache_flush, 24.11)
+void
+rte_fastmem_cache_flush(void)
+{
+	unsigned int lcore_id;
+	unsigned int s, c;
+
+	if (fastmem == NULL)
+		return;
+
+	lcore_id = rte_lcore_id();
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+
+			if (cache == NULL)
+				continue;
+
+			/*
+			 * Drain the objects back to the bin, but keep the
+			 * cache struct: it holds the lcore's statistics,
+			 * which must survive the flush.
+			 */
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+		}
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats, 24.11)
+int
+rte_fastmem_stats(struct rte_fastmem_stats *stats)
+{
+	if (stats == NULL)
+		return -EINVAL;
+	if (!fastmem_assure())
+		return -ENODEV;
+
+	*stats = (struct rte_fastmem_stats){0};
+	stats->n_classes = FASTMEM_N_CLASSES;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		stats->bytes_backing += socket->reserved_bytes;
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_bin *bin = &socket->bins[c];
+			uint64_t class_allocs, class_frees;
+
+			class_allocs = bin->nocache_allocs;
+			class_frees = bin->nocache_frees;
+			stats->alloc_nomem += bin->nocache_nomem;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+
+				if (cache == NULL)
+					continue;
+
+				class_allocs += cache->alloc_cache_hits +
+					cache->alloc_cache_misses;
+				class_frees += cache->free_cache_hits +
+					cache->free_cache_misses;
+				stats->alloc_nomem += cache->alloc_nomem;
+			}
+
+			struct fastmem_cache *shared = socket->shared_caches[c];
+
+			if (shared != NULL) {
+				class_allocs += shared->alloc_cache_hits +
+					shared->alloc_cache_misses;
+				class_frees += shared->free_cache_hits +
+					shared->free_cache_misses;
+				stats->alloc_nomem += shared->alloc_nomem;
+			}
+
+			stats->alloc_total += class_allocs;
+			stats->free_total += class_frees;
+			if (class_allocs > class_frees)
+				stats->bytes_in_use += class_size(c) *
+					(class_allocs - class_frees);
+		}
+	}
+
+	return 0;
+}
+
+static unsigned int
+exact_class_idx(size_t sz)
+{
+	unsigned int log2;
+
+	if (sz < FASTMEM_MIN_SIZE || sz > FASTMEM_MAX_ALLOC_SIZE)
+		return FASTMEM_N_CLASSES;
+	if ((sz & (sz - 1)) != 0)
+		return FASTMEM_N_CLASSES;
+
+	log2 = (unsigned int)rte_ctz64(sz);
+	if (log2 < FASTMEM_MIN_CLASS_LOG2 || log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_class, 24.11)
+int
+rte_fastmem_stats_class(size_t class_size_arg,
+		struct rte_fastmem_class_stats *stats)
+{
+	unsigned int c;
+	uint64_t allocs, frees;
+
+	if (stats == NULL)
+		return -EINVAL;
+	if (!fastmem_assure())
+		return -ENODEV;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+		struct fastmem_bin *bin = &socket->bins[c];
+
+		for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+			struct fastmem_cache *cache = socket->caches[l][c];
+
+			if (cache == NULL)
+				continue;
+
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+
+		struct fastmem_cache *shared = socket->shared_caches[c];
+
+		if (shared != NULL) {
+			stats->alloc_cache_hits += shared->alloc_cache_hits;
+			stats->alloc_cache_misses += shared->alloc_cache_misses;
+			stats->alloc_nomem += shared->alloc_nomem;
+			stats->free_cache_hits += shared->free_cache_hits;
+			stats->free_cache_misses += shared->free_cache_misses;
+		}
+
+		/* No-cache fallback traffic; fold into the miss counters. */
+		stats->alloc_cache_misses += bin->nocache_allocs;
+		stats->free_cache_misses += bin->nocache_frees;
+		stats->alloc_nomem += bin->nocache_nomem;
+
+		stats->slab_acquires += bin->slab_acquires;
+		stats->slab_releases += bin->slab_releases;
+		stats->slabs_partial += bin->slabs_partial;
+		stats->slabs_full += bin->slabs_full;
+	}
+
+	allocs = stats->alloc_cache_hits + stats->alloc_cache_misses;
+	frees = stats->free_cache_hits + stats->free_cache_misses;
+	if (allocs > frees)
+		stats->in_use = allocs - frees;
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore, 24.11)
+int
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats)
+{
+	if (stats == NULL)
+		return -EINVAL;
+	if (!fastmem_assure())
+		return -ENODEV;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_stats){0};
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+
+			if (cache == NULL)
+				continue;
+
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+	}
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore_class, 24.11)
+int
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size_arg,
+		struct rte_fastmem_lcore_class_stats *stats)
+{
+	unsigned int c;
+
+	if (stats == NULL)
+		return -EINVAL;
+	if (!fastmem_assure())
+		return -ENODEV;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_cache *cache =
+			fastmem->sockets[s].caches[lcore_id][c];
+
+		if (cache == NULL)
+			continue;
+
+		stats->alloc_cache_hits += cache->alloc_cache_hits;
+		stats->alloc_cache_misses += cache->alloc_cache_misses;
+		stats->alloc_nomem += cache->alloc_nomem;
+		stats->free_cache_hits += cache->free_cache_hits;
+		stats->free_cache_misses += cache->free_cache_misses;
+	}
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_shared, 24.11)
+int
+rte_fastmem_stats_shared(struct rte_fastmem_lcore_stats *stats)
+{
+	if (stats == NULL)
+		return -EINVAL;
+	if (!fastmem_assure())
+		return -ENODEV;
+
+	*stats = (struct rte_fastmem_lcore_stats){0};
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache = socket->shared_caches[c];
+
+			if (cache == NULL)
+				continue;
+
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+	}
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_shared_class, 24.11)
+int
+rte_fastmem_stats_shared_class(size_t class_size_arg,
+		struct rte_fastmem_lcore_class_stats *stats)
+{
+	unsigned int c;
+
+	if (stats == NULL)
+		return -EINVAL;
+	if (!fastmem_assure())
+		return -ENODEV;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_cache *cache =
+			fastmem->sockets[s].shared_caches[c];
+
+		if (cache == NULL)
+			continue;
+
+		stats->alloc_cache_hits += cache->alloc_cache_hits;
+		stats->alloc_cache_misses += cache->alloc_cache_misses;
+		stats->alloc_nomem += cache->alloc_nomem;
+		stats->free_cache_hits += cache->free_cache_hits;
+		stats->free_cache_misses += cache->free_cache_misses;
+	}
+
+	return 0;
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_reset, 24.11)
+void
+rte_fastmem_stats_reset(void)
+{
+	if (fastmem == NULL)
+		return;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_bin *bin = &socket->bins[c];
+
+			rte_spinlock_lock(&bin->lock);
+			bin->slab_acquires = 0;
+			bin->slab_releases = 0;
+			bin->nocache_allocs = 0;
+			bin->nocache_frees = 0;
+			bin->nocache_nomem = 0;
+			rte_spinlock_unlock(&bin->lock);
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				cache->alloc_cache_hits = 0;
+				cache->alloc_cache_misses = 0;
+				cache->alloc_nomem = 0;
+				cache->free_cache_hits = 0;
+				cache->free_cache_misses = 0;
+			}
+
+			rte_spinlock_lock(&socket->shared_cache_lock);
+			struct fastmem_cache *shared = socket->shared_caches[c];
+			if (shared != NULL) {
+				shared->alloc_cache_hits = 0;
+				shared->alloc_cache_misses = 0;
+				shared->alloc_nomem = 0;
+				shared->free_cache_hits = 0;
+				shared->free_cache_misses = 0;
+			}
+			rte_spinlock_unlock(&socket->shared_cache_lock);
+		}
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_classes, 24.11)
+unsigned int
+rte_fastmem_classes(size_t *sizes)
+{
+	if (sizes != NULL)
+		for (unsigned int i = 0; i < FASTMEM_N_CLASSES; i++)
+			sizes[i] = class_size(i);
+	return FASTMEM_N_CLASSES;
+}
diff --git a/lib/fastmem/rte_fastmem.h b/lib/fastmem/rte_fastmem.h
new file mode 100644
index 0000000000..8526d2a001
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.h
@@ -0,0 +1,908 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#ifndef _RTE_FASTMEM_H_
+#define _RTE_FASTMEM_H_
+
+/**
+ * @file
+ *
+ * RTE Fastmem
+ *
+ * @warning
+ * @b EXPERIMENTAL:
+ * All functions in this file may be changed or removed without prior notice.
+ *
+ * The fastmem library is a fast, general-purpose small-object
+ * allocator for DPDK applications. It is intended to allow an
+ * application to replace its many per-type mempools — each sized
+ * for a single object type (a connection, a session, a work item,
+ * a timer, etc.) — with a single allocator that handles arbitrary
+ * object sizes, grows on demand, and offers mempool-level
+ * performance for the common allocation and free paths.
+ *
+ * Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+ * supports bulk operations, and uses per-lcore caches to reduce
+ * shared-state contention. Unlike mempool, it does not require the
+ * caller to declare object sizes or counts up front.
+ *
+ * There is a single, global fastmem instance per process. The
+ * instance is brought up with rte_fastmem_init() and torn down with
+ * rte_fastmem_deinit(). Allocations are made with
+ * rte_fastmem_alloc() and freed with rte_fastmem_free().
+ *
+ * The allocator is limited to small-object allocations. Requests
+ * larger than rte_fastmem_max_size() are rejected; callers with
+ * such needs should use rte_malloc() directly.
+ *
+ * Backing memory is reserved from DPDK memzones. Once reserved,
+ * backing memory is not returned to the system during the
+ * allocator's lifetime. Callers that need predictable latency may
+ * pre-reserve backing memory up front using rte_fastmem_reserve(),
+ * avoiding memzone-reservation overhead during steady-state
+ * operation.
+ *
+ * Alignment argument, @c align:
+ *   If non-zero, @c align specifies the minimum alignment and
+ *   must be a power of 2. If zero, the default alignment is
+ *   @c RTE_CACHE_LINE_SIZE, so that objects obtained from distinct
+ *   calls cannot false-share a cache line.
+ *
+ * Threads and caches:
+ *   Only threads with an lcore id running in the primary process
+ *   get a private per-lcore cache, which makes their common path
+ *   lock-free. Every other caller — unregistered non-EAL threads
+ *   (which have no lcore id), and all threads in a secondary
+ *   process — instead shares a single spinlock-protected cache per
+ *   (size class, socket). These callers still benefit from caching,
+ *   but pay for the shared lock and so cost more per call than a
+ *   private-cache thread.
+ *
+ * Non-preemptible caller:
+ *   Callers should not be preemptible while inside a fastmem call.
+ *   Fastmem uses internal spinlocks; if a caller is preempted
+ *   while holding one, any other thread that subsequently needs
+ *   the same lock stalls until the preempted caller resumes.
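+ *
+ * Usage sketch:
+ *   The outline below is illustrative only, not a normative
+ *   example. It assumes EAL is already initialized; @c struct conn
+ *   and the 256 MiB pre-reservation are hypothetical.
+ *   @code{.c}
+ *   struct conn *conn;
+ *   int rc;
+ *
+ *   rc = rte_fastmem_init();
+ *   if (rc < 0)
+ *           return rc;
+ *
+ *   // Optional: pre-reserve backing memory to keep memzone
+ *   // reservation off the allocation path.
+ *   rc = rte_fastmem_reserve((size_t)256 << 20, SOCKET_ID_ANY);
+ *   if (rc < 0) {
+ *           rte_fastmem_deinit();
+ *           return rc;
+ *   }
+ *
+ *   // align == 0 selects the default, RTE_CACHE_LINE_SIZE.
+ *   conn = rte_fastmem_alloc(sizeof(*conn), 0, RTE_FASTMEM_F_ZERO);
+ *   if (conn != NULL)
+ *           rte_fastmem_free(conn);
+ *
+ *   rte_fastmem_deinit();
+ *   @endcode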
+ */
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Flag for rte_fastmem_alloc() and its variants: initialize the
+ * returned memory to zero before returning it to the caller.
+ */
+#define RTE_FASTMEM_F_ZERO RTE_BIT32(0)
+
+/**
+ * Initialize the fastmem allocator.
+ *
+ * Sets up the library's internal state. Must be called before any
+ * allocation call. Typically called once per process, after
+ * rte_eal_init() and before the application's worker threads begin
+ * making allocations.
+ *
+ * Initialization does not pre-reserve any backing memory; memzones
+ * are reserved lazily as allocations require. An application that
+ * wants to avoid memzone-reservation latency on the allocation
+ * path should follow rte_fastmem_init() with one or more calls to
+ * rte_fastmem_reserve().
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EBUSY: The allocator is already initialized.
+ *  - -ENOMEM: Unable to allocate internal state.
+ */
+__rte_experimental
+int
+rte_fastmem_init(void);
+
+/**
+ * Tear down the fastmem allocator.
+ *
+ * Releases the library's internal state and frees all backing
+ * memzones. After this call, no fastmem allocations or frees may
+ * be made until rte_fastmem_init() is called again.
+ *
+ * The caller is responsible for ensuring that no fastmem-allocated
+ * objects remain in use. Outstanding allocations at deinit time
+ * result in undefined behavior.
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ */
+__rte_experimental
+void
+rte_fastmem_deinit(void);
+
+/**
+ * Pre-reserve backing memory.
+ *
+ * Ensures that at least @p size bytes of memzone-backed memory are
+ * available to the allocator on @p socket_id, reserving additional
+ * memzones from EAL as needed to reach that total. Subsequent
+ * allocations served from the pre-reserved memory do not incur
+ * memzone-reservation cost.
+ *
+ * The reservation is cumulative: repeated calls to
+ * rte_fastmem_reserve() with the same @p socket_id grow the
+ * reservation monotonically. Reserved memory is never returned to
+ * the system during the allocator's lifetime.
+ *
+ * A typical use is to call rte_fastmem_reserve() once at
+ * application startup, with a size chosen to cover the expected
+ * steady-state working set. Allocations and frees during
+ * steady-state operation then avoid memzone reservations entirely.
+ *
+ * @param size
+ *  The minimum amount of backing memory, in bytes, to make
+ *  available on @p socket_id. The allocator may reserve more than
+ *  the requested amount due to internal rounding (e.g., to memzone
+ *  or block granularity).
+ *
+ * @param socket_id
+ *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
+ *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the reservation.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -ENOMEM: Insufficient huge-page memory to satisfy the request.
+ *  - -EINVAL: Invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_reserve(size_t size, int socket_id);
+
+/**
+ * Set the maximum backing memory that may be reserved on a socket.
+ *
+ * Once the limit is reached, allocations that would require new
+ * backing memory on the constrained socket fail with ENOMEM.
+ * Already-reserved memory is not released.
+ *
+ * Setting a limit below the current reserved amount is allowed and
+ * prevents further growth.
+ *
+ * @param socket_id
+ *  The NUMA socket to constrain, or SOCKET_ID_ANY to apply the
+ *  limit to all sockets.
+ * @param max_bytes
+ *  Maximum backing memory in bytes, or SIZE_MAX for unlimited (the default).
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: Fastmem not initialized, or invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_set_limit(int socket_id, size_t max_bytes);
+
+/**
+ * Get the maximum backing memory limit for a socket.
+ *
+ * @param socket_id
+ *  The NUMA socket to query.
+ * @return
+ *  The limit in bytes, or SIZE_MAX if unlimited.
+ */
+__rte_experimental
+size_t
+rte_fastmem_get_limit(int socket_id);
+
+/**
+ * Retrieve the largest allocation size the allocator supports.
+ *
+ * Requests larger than this size are rejected by the allocation
+ * functions. The returned value is a property of the allocator
+ * implementation and does not change across the lifetime of the
+ * process.
+ *
+ * @return
+ *  The largest supported allocation size, in bytes.
+ */
+__rte_experimental
+size_t
+rte_fastmem_max_size(void);
+
+/* Forward declaration for __rte_dealloc attribute. */
+void rte_fastmem_free(void *ptr);
+
+/**
+ * Allocate an object from the fastmem allocator.
+ *
+ * Allocates at least @p size bytes, aligned to at least @p align
+ * bytes. The returned memory is backed by huge pages and is
+ * DMA-usable; its IOVA can be obtained via rte_fastmem_virt2iova().
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_socket() to target a
+ * specific socket.
+ *
+ * The allocated memory must be freed with rte_fastmem_free(). An
+ * allocation may be freed from any lcore, not only the lcore that
+ * made the allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags. Use
+ *  RTE_FASTMEM_F_ZERO to obtain zero-initialized memory.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set:
+ *    - E2BIG: @p size exceeds rte_fastmem_max_size().
+ *    - EINVAL: Invalid @p align (not a power of two).
+ *    - ENOMEM: Allocation could not be served from existing
+ *      backing memory and no additional memzone could be reserved.
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+	__rte_malloc __rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Allocate an object on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc(), but targets the specified NUMA socket
+ * rather than the socket of the calling lcore. Use this variant
+ * when the lifetime or access pattern of the allocation is not
+ * tied to the calling lcore's socket.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set (see rte_fastmem_alloc()).
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+	__rte_malloc __rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Resize a fastmem allocation, preserving existing contents.
+ *
+ * If @p ptr is NULL, equivalent to rte_fastmem_alloc(size, align, 0).
+ * If @p size is 0, frees @p ptr and returns NULL.
+ *
+ * If the existing allocation can already satisfy the new size and
+ * alignment, the original pointer may be returned unchanged.
+ * Otherwise, a new allocation is made, the contents are copied
+ * (up to the minimum of old and new sizes), and the old allocation
+ * is freed.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptr
+ *  Pointer to an existing fastmem allocation, or NULL.
+ *
+ * @param size
+ *  New requested size in bytes. If 0, the allocation is freed.
+ *
+ * @param align
+ *  If 0, alignment is at least @c RTE_CACHE_LINE_SIZE. Otherwise,
+ *  must be a power of 2.
+ *
+ * @return
+ *  - A pointer to the resized allocation on success.
+ *  - NULL on failure, with @c rte_errno set:
+ *    - E2BIG: @p size exceeds rte_fastmem_max_size().
+ *    - EINVAL: Invalid @p align.
+ *    - ENOMEM: Allocation could not be served.
+ *  On failure, the original allocation at @p ptr remains valid.
+ */
+__rte_experimental
+void *
+rte_fastmem_realloc(void *ptr, size_t size, size_t align)
+	__rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Free an object previously allocated by the fastmem allocator.
+ *
+ * @p ptr must have been returned by a prior call to any fastmem
+ * allocation function, or be NULL. If @p ptr is NULL, no operation
+ * is performed.
+ *
+ * Free may be called from any lcore, regardless of which lcore
+ * made the original allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptr
+ *  Pointer to an object previously allocated by fastmem, or NULL.
+ */
+__rte_experimental
+void
+rte_fastmem_free(void *ptr);
+
+/**
+ * Allocate multiple objects in bulk.
+ *
+ * Allocates @p n objects, each of size at least @p size and aligned
+ * to at least @p align bytes, and stores the resulting pointers
+ * into @p ptrs. All @p n objects have the same size and alignment.
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_bulk_socket() to target a
+ * specific socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_alloc().
+ *
+ * On failure no objects are allocated and @p ptrs is left
+ * untouched.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ *  - -EINVAL: Invalid @p align.
+ *  - -ENOMEM: Not enough objects could be allocated to fill the
+ *    request.
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags);
+
+/**
+ * Allocate multiple objects in bulk on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc_bulk(), but targets the specified NUMA
+ * socket rather than the socket of the calling lcore.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - Negative errno on failure (see rte_fastmem_alloc_bulk()).
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id);
+
+/**
+ * Free multiple objects in bulk.
+ *
+ * Frees the @p n objects pointed to by @p ptrs. Each pointer in
+ * the array must have been returned by a prior fastmem allocation
+ * call and must not have been freed. The objects need not have
+ * the same size, alignment, or socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_free().
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_free_bulk(void **ptrs, unsigned int n);
+
+/**
+ * Opaque handle encoding a (size class, NUMA socket) pair.
+ *
+ * Obtained via rte_fastmem_hlookup(). Passing a handle to
+ * rte_fastmem_halloc() avoids the per-call size-class
+ * lookup and socket resolution, improving allocation throughput
+ * for fixed-size objects.
+ */
+typedef uint32_t rte_fastmem_handle_t;
+
+/**
+ * Look up a handle for a given object size and NUMA socket.
+ *
+ * The returned handle encodes the size class and socket, and can
+ * be passed to rte_fastmem_halloc() to allocate objects
+ * without repeating the class lookup.
+ *
+ * @param size
+ *  Object size in bytes. Must not exceed rte_fastmem_max_size().
+ *
+ * @param align
+ *  Alignment requirement (power of two), or 0 for the default
+ *  (RTE_CACHE_LINE_SIZE).
+ *
+ * @param socket_id
+ *  NUMA socket to allocate from.
+ *
+ * @param[out] handle
+ *  On success, set to the resolved handle.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: Invalid alignment or socket_id.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ */
+__rte_experimental
+int
+rte_fastmem_hlookup(size_t size, size_t align, int socket_id,
+		rte_fastmem_handle_t *handle);
+
+/**
+ * Allocate an object using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_alloc() but skips the size-class
+ * lookup and socket resolution, using the pre-resolved handle
+ * instead.
+ *
+ * A handle is not tied to the lcore that produced it: it may be
+ * shared across threads and used from any thread, including from
+ * lcores that never called rte_fastmem_hlookup() and from non-EAL
+ * threads. As with rte_fastmem_alloc(), callers without a private
+ * per-lcore cache use the shared cache instead.
+ *
+ * This function is MT-safe.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param flags
+ *  Allocation flags (e.g., RTE_FASTMEM_F_ZERO).
+ *
+ * @return
+ *  A pointer to the allocated object, or NULL on failure
+ *  (rte_errno is set).
+ */
+__rte_experimental
+void *
+rte_fastmem_halloc(rte_fastmem_handle_t handle, unsigned int flags)
+	__rte_malloc __rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Bulk-allocate objects using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_alloc_bulk() but uses a pre-resolved
+ * handle. All-or-nothing semantics apply.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param[out] ptrs
+ *  Array to receive @p n allocated pointers.
+ *
+ * @param n
+ *  Number of objects to allocate.
+ *
+ * @param flags
+ *  Allocation flags (e.g., RTE_FASTMEM_F_ZERO).
+ *
+ * @return
+ *  - 0: All @p n objects allocated successfully.
+ *  - -ENOMEM: Allocation failed; no objects were allocated.
+ */
+__rte_experimental
+int
+rte_fastmem_halloc_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n, unsigned int flags);
+
+/**
+ * Free an object using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_free() but skips the slab-header
+ * lookup by using the class and socket encoded in the handle.
+ *
+ * Like rte_fastmem_halloc(), this may be called from any thread,
+ * regardless of which thread produced the handle or the object.
+ *
+ * This function is MT-safe.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation function.
+ *  Must belong to the same size class and socket as @p handle.
+ *  NULL is permitted (no-op).
+ */
+__rte_experimental
+void
+rte_fastmem_hfree(rte_fastmem_handle_t handle, void *ptr);
+
+/**
+ * Bulk-free objects using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_free_bulk() but skips per-object
+ * slab-header lookups.
+ *
+ * All objects must belong to the same size class and socket as
+ * @p handle.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_hfree_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n);
+
+/**
+ * Obtain the IOVA for a fastmem-allocated pointer.
+ *
+ * Translates a virtual address returned by a fastmem allocation
+ * function into the corresponding IOVA, suitable for use in device
+ * DMA descriptors.
+ *
+ * The returned IOVA is valid for the lifetime of the allocation.
+ *
+ * @p ptr must have been returned by a prior fastmem allocation
+ * function. Passing any other pointer results in undefined
+ * behavior.
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation
+ *  function.
+ *
+ * @return
+ *  The IOVA corresponding to @p ptr, or @c RTE_BAD_IOVA if the
+ *  library is not initialized (and could not be attached to).
+ */
+__rte_experimental
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr);
+
+/**
+ * Flush the calling lcore's per-lcore caches.
+ *
+ * Drains every cached object from the calling lcore's
+ * per-(size class, NUMA socket) caches back to their shared
+ * bins, and releases the cache state itself. A subsequent
+ * allocation or free on this lcore lazily recreates any caches
+ * it needs.
+ *
+ * This is useful in applications that have finished a bursty
+ * phase and want to release memory that would otherwise sit idle
+ * in caches. It is also useful in tests that want to observe
+ * bin-level state without per-lcore caching hiding activity.
+ *
+ * Only private per-lcore caches are flushed. The call has no
+ * effect when invoked from a thread that has no private cache (an
+ * lcore-less thread, or any thread in a secondary process); the
+ * shared cache is never flushed.
+ *
+ * This function is not thread-safe with respect to concurrent
+ * allocations or frees on the calling lcore; call it only when
+ * the calling lcore is not making other fastmem calls.
+ */
+__rte_experimental
+void
+rte_fastmem_cache_flush(void);
+
+/**
+ * Global summary statistics.
+ */
+struct rte_fastmem_stats {
+	uint64_t bytes_backing;  /**< Bytes of backing memory (memzones) reserved from EAL. */
+	uint64_t bytes_in_use;   /**< Approximate bytes in live objects. */
+	uint64_t alloc_total;    /**< Total successful alloc operations (hits + misses). */
+	uint64_t free_total;     /**< Total free operations (hits + misses). */
+	uint64_t alloc_nomem;    /**< Alloc attempts that failed with ENOMEM. */
+	unsigned int n_classes;  /**< Number of size classes. */
+};
+
+/**
+ * Per-size-class statistics (aggregated across all lcores).
+ *
+ * Allocation and free counters count individual objects, not
+ * operations. A bulk allocation of 32 objects that hits the cache
+ * increments alloc_cache_hits by 32.
+ */
+struct rte_fastmem_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t in_use;               /**< Objects currently live (allocs - frees). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from a per-lcore cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by a per-lcore cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+	uint64_t slab_acquires;        /**< Slabs pulled from the free pool. */
+	uint64_t slab_releases;        /**< Slabs returned to the free pool. */
+	uint32_t slabs_partial;        /**< Current partial slab count. */
+	uint32_t slabs_full;           /**< Current full slab count. */
+};
+
+/**
+ * Per-lcore statistics (aggregated across all classes).
+ *
+ * Covers activity served through the lcore's private per-lcore
+ * cache, which exists only for lcore-id-equipped threads in the
+ * primary process. Allocations and frees made without a private
+ * cache (lcore-less threads, and any thread in a secondary
+ * process) are not attributed to any lcore and so do not appear
+ * here; they are visible in the global and per-class statistics,
+ * and in the shared-cache statistics retrieved with
+ * rte_fastmem_stats_shared().
+ *
+ * This structure is also used to report the shared cache; see
+ * rte_fastmem_stats_shared().
+ */
+struct rte_fastmem_lcore_stats {
+	uint64_t alloc_cache_hits;     /**< Allocs served from this lcore's caches. */
+	uint64_t alloc_cache_misses;   /**< Allocs that missed this lcore's caches. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by this lcore's caches. */
+	uint64_t free_cache_misses;    /**< Frees that bypassed this lcore's caches. */
+};
+
+/**
+ * Per-lcore, per-class statistics (no aggregation).
+ *
+ * Also used to report the shared cache for a single class; see
+ * rte_fastmem_stats_shared_class().
+ */
+struct rte_fastmem_lcore_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+};
+
+/**
+ * Get the number of size classes and optionally their sizes.
+ *
+ * @param[out] sizes
+ *   If non-NULL, filled with the size (in bytes) of each class.
+ *   The caller must provide space for at least the returned number
+ *   of entries.
+ *
+ * @return
+ *   The number of size classes.
+ */
+__rte_experimental
+unsigned int
+rte_fastmem_classes(size_t *sizes);
+
+/**
+ * Retrieve global summary statistics.
+ *
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * May be called from a secondary process, which lazily attaches to
+ * the shared state on first use.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL.
+ *  - -ENODEV: fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats(struct rte_fastmem_stats *stats);
+
+/**
+ * Retrieve statistics for a single size class.
+ *
+ * @param class_size
+ *   Exact size of the class to query (must match one of the values
+ *   returned by rte_fastmem_classes()).
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * May be called from a secondary process, which lazily attaches to
+ * the shared state on first use.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, or @p class_size does not match any
+ *    size class.
+ *  - -ENODEV: fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_class(size_t class_size,
+		struct rte_fastmem_class_stats *stats);
+
+/**
+ * Retrieve per-lcore statistics (aggregated across all classes).
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * May be called from a secondary process, which lazily attaches to
+ * the shared state on first use.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, or @p lcore_id is invalid.
+ *  - -ENODEV: fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats);
+
+/**
+ * Retrieve per-lcore, per-class statistics.
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param class_size
+ *   Exact size of the class to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * May be called from a secondary process, which lazily attaches to
+ * the shared state on first use.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, @p lcore_id is invalid, or
+ *    @p class_size does not match any size class.
+ *  - -ENODEV: fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size,
+		struct rte_fastmem_lcore_class_stats *stats);
+
+/**
+ * Retrieve statistics for the shared cache (aggregated across all
+ * classes).
+ *
+ * The shared cache serves every caller without a private per-lcore
+ * cache: threads without an lcore id, and all threads in a
+ * secondary process (where private per-lcore caches are never
+ * used). Its activity is reported here rather than under any
+ * single lcore.
+ *
+ * @param[out] stats
+ *   Structure to fill. The per-lcore stats layout is reused.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL.
+ *  - -ENODEV: fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_shared(struct rte_fastmem_lcore_stats *stats);
+
+/**
+ * Retrieve shared-cache statistics for a single size class.
+ *
+ * @param class_size
+ *   Exact size of the class to query.
+ * @param[out] stats
+ *   Structure to fill. The per-lcore-per-class stats layout is reused.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, or @p class_size does not match any
+ *    size class.
+ *  - -ENODEV: fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_shared_class(size_t class_size,
+		struct rte_fastmem_lcore_class_stats *stats);
+
+/**
+ * Reset all statistics counters to zero.
+ *
+ * Zeroes the per-lcore, shared-cache, and per-bin counters. Does
+ * not affect the allocator's operational state.
+ *
+ * For an accurate reset, call when no other threads are
+ * actively allocating or freeing.
+ */
+__rte_experimental
+void
+rte_fastmem_stats_reset(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_FASTMEM_H_ */
diff --git a/lib/meson.build b/lib/meson.build
index 8f5cfd28a5..10906d4d53 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -38,6 +38,7 @@ libraries = [
         'distributor',
         'dmadev',  # eventdev depends on this
         'efd',
+        'fastmem',
         'eventdev',
         'dispatcher', # dispatcher depends on eventdev
         'gpudev',
-- 
2.43.0
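
The argument rules documented in rte_fastmem.h (default alignment when
align is 0, power-of-2 requirement otherwise, size bounded by
rte_fastmem_max_size()) can be sketched as a small stand-alone check.
This is an illustration only: check_request(), CACHE_LINE_SIZE and
MAX_SIZE are hypothetical names, the 64-byte and 1 MiB values are
assumptions, and the order in which the library tests size and
alignment is not stated in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Assumed values, for illustration only: a 64-byte cache line (stand-in
 * for RTE_CACHE_LINE_SIZE) and a 1 MiB maximum object size (stand-in
 * for the value returned by rte_fastmem_max_size()).
 */
#define CACHE_LINE_SIZE ((size_t)64)
#define MAX_SIZE ((size_t)1 << 20)

/*
 * Hypothetical helper, not part of the fastmem API. Validates a
 * (size, align) pair following the documented rules and stores the
 * effective alignment in *eff_align on success.
 *
 * Returns 0 on success, -E2BIG if size is too large, or -EINVAL if
 * align is non-zero and not a power of 2.
 */
static int
check_request(size_t size, size_t align, size_t *eff_align)
{
	assert(eff_align != NULL);

	/* Assumption: the size check comes first. */
	if (size > MAX_SIZE)
		return -E2BIG;

	/* align == 0 selects the default, cache-line alignment. */
	if (align == 0) {
		*eff_align = CACHE_LINE_SIZE;
		return 0;
	}

	/* A power of 2 has exactly one bit set. */
	if ((align & (align - 1)) != 0)
		return -EINVAL;

	*eff_align = align;
	return 0;
}
```

The single-pointer allocation functions report the same conditions
through rte_errno and a NULL return rather than a negative return
value; the bulk variants return the negative errno directly.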


* [RFC v4 3/3] app/test: add fastmem test suite
  2026-05-30  9:26           ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-30  9:26             ` [RFC v4 1/3] doc: add fastmem programming guide Mattias Rönnblom
  2026-05-30  9:26             ` [RFC v4 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-30  9:26             ` Mattias Rönnblom
  2026-06-10 12:35             ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Konstantin Ananyev
  3 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-30  9:26 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

Add functional, performance, and profiling test suites for the
fastmem library.

--

RFC v4:
 * Add tests for handle alloc/free from uncached lcores and
   non-EAL threads.
 * Add tests that statistics survive cache flush.
 * Add test for shared-cache statistics.
 * Refactor tests to use per-test setup/teardown.

RFC v3:
 * Add realloc test cases (same class, grow, shrink, NULL ptr,
   zero size, too big, invalid align).
 * Merge lifecycle and functional test suites into one.
 * Suppress -Wuse-after-free in test_alloc_reuse (intentional
   pointer comparison after free).

RFC v2:
 * Add test_alloc_cross_socket_deinit exercising cross-socket
   teardown path.
 * Remove trailing double blank lines in test_fastmem.c.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
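The memzone counts expected by test_reserve_small() and
test_reserve_multiple_memzones() follow from rounding a request up to
whole memzones. A minimal sketch of that arithmetic, assuming the
128 MiB granularity of FASTMEM_MEMZONE_SIZE and no memory reserved
beforehand; memzones_needed() and MEMZONE_SIZE are hypothetical names
used for illustration, not part of the fastmem API:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed memzone granularity: 128 MiB, as in FASTMEM_MEMZONE_SIZE. */
#define MEMZONE_SIZE ((size_t)128 << 20)

/*
 * Hypothetical helper: number of whole memzones needed to back at
 * least `size` bytes, starting from an empty reservation.
 */
static size_t
memzones_needed(size_t size)
{
	assert(size > 0);

	/* Ceiling division, written so that it cannot overflow. */
	return (size - 1) / MEMZONE_SIZE + 1;
}
```

Under these assumptions a 1-byte request maps to one memzone and a
request of FASTMEM_MEMZONE_SIZE + 1 bytes maps to two, which is what
the two tests assert.
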
 app/test/meson.build            |    3 +
 app/test/test_fastmem.c         | 2111 +++++++++++++++++++++++++++++++
 app/test/test_fastmem_perf.c    | 1040 +++++++++++++++
 app/test/test_fastmem_profile.c |  157 +++
 4 files changed, 3311 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 3f9340f2f5..fe375e97f3 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -82,6 +82,9 @@ source_file_deps = {
     'test_event_vector_adapter.c': ['eventdev', 'bus_vdev'],
     'test_eventdev.c': ['eventdev', 'bus_vdev'],
     'test_external_mem.c': [],
+    'test_fastmem.c': ['fastmem'],
+    'test_fastmem_perf.c': ['fastmem', 'mempool'],
+    'test_fastmem_profile.c': ['fastmem'],
     'test_fbarray.c': [],
     'test_fib.c': ['net', 'fib'],
     'test_fib6.c': ['rib', 'fib'],
diff --git a/app/test/test_fastmem.c b/app/test/test_fastmem.c
new file mode 100644
index 0000000000..24ba1e671a
--- /dev/null
+++ b/app/test/test_fastmem.c
@@ -0,0 +1,2111 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <inttypes.h>
+#include <stdalign.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_thread.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define FASTMEM_MEMZONE_SIZE (128U << 20)
+
+/*
+ * Count memzones whose names begin with the fastmem prefix.
+ * Used to verify that rte_fastmem_reserve() really did reserve
+ * backing memzones.
+ */
+static int fastmem_memzone_count;
+
+static void
+count_fastmem_memzones_walk(const struct rte_memzone *mz, void *arg)
+{
+	RTE_SET_USED(arg);
+
+	if (strncmp(mz->name, "fastmem_", strlen("fastmem_")) == 0)
+		fastmem_memzone_count++;
+}
+
+static unsigned int
+count_fastmem_memzones(void)
+{
+	fastmem_memzone_count = 0;
+	rte_memzone_walk(count_fastmem_memzones_walk, NULL);
+	return fastmem_memzone_count;
+}
+
+static int
+test_init_deinit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	/* A subsequent init/deinit cycle must succeed. */
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "second rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_init_is_not_idempotent(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, -EBUSY,
+		"expected -EBUSY on re-init, got %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_deinit_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_max_size(void)
+{
+	size_t max;
+
+	max = rte_fastmem_max_size();
+	TEST_ASSERT(max >= (1U << 20),
+		"max_size=%zu below required 1 MiB minimum", max);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_small(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * A small reserve request (1 byte) must result in exactly
+	 * one memzone reservation: the internal rounding is to
+	 * memzone granularity.
+	 */
+	rc = rte_fastmem_reserve(1, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve() failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	rte_fastmem_deinit();
+
+	/* After deinit the memzones must be released. */
+	TEST_ASSERT_EQUAL(count_fastmem_memzones(), 0,
+		"%u fastmem memzones leaked after deinit",
+		count_fastmem_memzones());
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_multiple_memzones(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	size_t reserve_size;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * Request just over one memzone's worth; this must force
+	 * a second memzone to be reserved.
+	 */
+	reserve_size = FASTMEM_MEMZONE_SIZE + 1;
+	rc = rte_fastmem_reserve(reserve_size, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve(%zu) failed: %d",
+		reserve_size, rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 2,
+		"expected 2 new memzones for %zu-byte reserve, got %u",
+		reserve_size, after - before);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_cumulative(void)
+{
+	int socket_id;
+	unsigned int after_first, after_second;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	after_first = count_fastmem_memzones();
+
+	/*
+	 * A second call requesting the same amount that's already
+	 * reserved must not trigger any new memzone reservation.
+	 */
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "second reserve failed: %d", rc);
+
+	after_second = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after_first, after_second,
+		"reserve of already-reserved amount added memzones (%u -> %u)",
+		after_first, after_second);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_invalid_socket(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, RTE_MAX_NUMA_NODES);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for out-of-range socket, got %d", rc);
+
+	rc = rte_fastmem_reserve(1, -2);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for negative socket, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_without_init(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0,
+		"expected failure without init, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_any_socket(void)
+{
+	unsigned int before, after;
+	int rc;
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * SOCKET_ID_ANY should succeed on any system with at least
+	 * one configured socket. The allocator picks the caller's
+	 * socket first and falls back to other sockets if needed.
+	 */
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0,
+		"rte_fastmem_reserve(SOCKET_ID_ANY) failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 2 tests: allocation and free.
+ */
+
+static int
+test_alloc_too_big(void)
+{
+	void *p;
+	rte_errno = 0;
+	p = rte_fastmem_alloc(rte_fastmem_max_size() + 1, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc above max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_invalid_align(void)
+{
+	void *p;
+	rte_errno = 0;
+	p = rte_fastmem_alloc(16, 3, 0); /* 3 is not a power of 2 */
+	TEST_ASSERT_NULL(p, "alloc with align=3 returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, EINVAL,
+		"expected rte_errno=EINVAL, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_small(void)
+{
+	void *p;
+	p = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8) failed: rte_errno=%d", rte_errno);
+
+	/* Writing into the object must not crash. */
+	memset(p, 0xa5, 8);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_various_sizes(void)
+{
+	static const size_t sizes[] = {
+		1, 8, 16, 17, 63, 64, 128, 1024, 4096,
+		64 * 1024, 256 * 1024, 1024 * 1024,
+	};
+	void *ptrs[RTE_DIM(sizes)];
+	unsigned int i;
+	for (i = 0; i < RTE_DIM(sizes); i++) {
+		ptrs[i] = rte_fastmem_alloc(sizes[i], 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc(%zu) failed: rte_errno=%d",
+			sizes[i], rte_errno);
+		memset(ptrs[i], 0x5a, sizes[i]);
+	}
+
+	for (i = 0; i < RTE_DIM(sizes); i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_alignment(void)
+{
+	static const size_t aligns[] = {
+		8, 16, 64, 256, 4096, 65536,
+	};
+	unsigned int i;
+	for (i = 0; i < RTE_DIM(aligns); i++) {
+		void *p = rte_fastmem_alloc(1, aligns[i], 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=%zu) failed: rte_errno=%d",
+			aligns[i], rte_errno);
+		TEST_ASSERT((uintptr_t)p % aligns[i] == 0,
+			"pointer %p not aligned on %zu",
+			p, aligns[i]);
+		rte_fastmem_free(p);
+	}
+
+	/* Default (align=0) gives at least RTE_CACHE_LINE_SIZE alignment. */
+	{
+		void *p = rte_fastmem_alloc(1, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=0) failed: rte_errno=%d", rte_errno);
+		TEST_ASSERT((uintptr_t)p % RTE_CACHE_LINE_SIZE == 0,
+			"default-align pointer %p not cache-line aligned",
+			p);
+		rte_fastmem_free(p);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_zero_flag(void)
+{
+	uint8_t *p;
+	unsigned int i;
+	bool all_zero = true;
+
+	/*
+	 * Dirty a slab first by allocating without F_ZERO, writing
+	 * a non-zero pattern, and freeing. A subsequent F_ZERO
+	 * allocation on the same slab must return zeroed memory.
+	 */
+	p = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "priming alloc failed");
+	memset(p, 0xff, 128);
+	rte_fastmem_free(p);
+
+	p = rte_fastmem_alloc(128, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_NOT_NULL(p, "F_ZERO alloc failed");
+	for (i = 0; i < 128; i++) {
+		if (p[i] != 0) {
+			all_zero = false;
+			break;
+		}
+	}
+	TEST_ASSERT(all_zero, "F_ZERO returned non-zero byte at offset %u", i);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+#if defined(__GNUC__) && !defined(__clang__)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wuse-after-free"
+#endif
+static int
+test_alloc_reuse(void)
+{
+	void *first, *second;
+
+	first = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(first, "first alloc failed");
+	rte_fastmem_free(first);
+
+	second = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(second, "second alloc failed");
+
+	/*
+	 * The slab's free list is LIFO, so the most recently freed
+	 * object is at the head of the list. A subsequent alloc in
+	 * the same class returns it.
+	 */
+	TEST_ASSERT_EQUAL(first, second,
+		"free + alloc did not reuse: first=%p second=%p",
+		first, second);
+
+	rte_fastmem_free(second);
+
+	return TEST_SUCCESS;
+}
+#if defined(__GNUC__) && !defined(__clang__)
+#pragma GCC diagnostic pop
+#endif
+
+static int
+test_alloc_many_in_class(void)
+{
+	/*
+	 * Allocate more 8-byte objects than fit in a single 2 MiB
+	 * slab (at most 262144), forcing the bin to pull a second
+	 * block. This exercises the partial->full transition and
+	 * the cross-slab allocation path.
+	 */
+	enum { CLASS_SIZE = 8, COUNT = 300000 };
+	void **ptrs;
+	unsigned int i;
+
+	ptrs = calloc(COUNT, sizeof(*ptrs));
+	TEST_ASSERT_NOT_NULL(ptrs, "calloc for test ptrs failed");
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(CLASS_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d",
+			i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	free(ptrs);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket(void)
+{
+	void *p;
+	int socket_id;
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing(void)
+{
+	void *small, *large;
+
+	/*
+	 * Allocate and free a small object, forcing a block to be
+	 * assigned to the small class and then returned to the
+	 * free-block pool. A subsequent allocation in a different
+	 * class must be able to reuse that block.
+	 */
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+	rte_fastmem_free(small);
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large, "large alloc failed");
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing_no_growth(void)
+{
+	struct rte_fastmem_stats stats;
+	void *small, *large;
+	uint64_t after_small;
+	int rc;
+
+	/*
+	 * Stronger version of test_alloc_block_repurposing: assert
+	 * that the cross-class allocation does not grow the
+	 * backing memory (bytes_backing stays flat). Because the
+	 * free-block pool is shared across size classes — not
+	 * partitioned per class — the block freed from the small
+	 * class must serve the large allocation without triggering
+	 * a new memzone reservation.
+	 */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, (uint64_t)0,
+		"unexpected pre-alloc bytes_backing: %" PRIu64,
+		stats.bytes_backing);
+
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT(stats.bytes_backing > 0,
+		"bytes_backing did not grow on first alloc");
+	after_small = stats.bytes_backing;
+
+	rte_fastmem_free(small);
+	rte_fastmem_cache_flush();
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large,
+		"large alloc failed: rte_errno=%d", rte_errno);
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, after_small,
+		"cross-class alloc grew backing memory from %" PRIu64
+		" to %" PRIu64,
+		after_small, stats.bytes_backing);
+
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_null(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_free(NULL);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_content_integrity(void)
+{
+	/*
+	 * Allocate a batch of objects, fill each with a distinct
+	 * byte pattern, then verify none of the patterns overlap.
+	 * This catches header overwrites (slab header corrupted by
+	 * object access) and slot-overlap bugs (two pointers pointing
+	 * at overlapping slots).
+	 */
+	enum { N = 256, SIZE = 128 };
+	uint8_t *ptrs[N];
+	unsigned int i, j;
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)i, SIZE);
+	}
+
+	for (i = 0; i < N; i++)
+		for (j = 0; j < SIZE; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], (uint8_t)i,
+				"corruption at ptrs[%u][%u]: got 0x%x, want 0x%x",
+				i, j, ptrs[i][j], (uint8_t)i);
+
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_too_big(void)
+{
+	void *p;
+	/*
+	 * A small size with an alignment larger than the maximum
+	 * size class cannot be served. The class selected must be
+	 * large enough for the alignment, but no such class exists.
+	 */
+	rte_errno = 0;
+	p = rte_fastmem_alloc(1, rte_fastmem_max_size() * 2, 0);
+	TEST_ASSERT_NULL(p,
+		"alloc with align>max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_one(void)
+{
+	void *p;
+	/* align=1 is a valid power of 2 and must be accepted. */
+	p = rte_fastmem_alloc(8, 1, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8, 1) failed: rte_errno=%d",
+		rte_errno);
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket_numa_placement(void)
+{
+	void *p;
+	int socket_id;
+	struct rte_memseg *ms;
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	/*
+	 * Walk the memory to find the memseg for this pointer and
+	 * verify its socket. Skip the check if lookup fails (e.g.,
+	 * --no-huge mode may not populate memsegs for fastmem's
+	 * allocations in a way that rte_mem_virt2memseg can find).
+	 */
+	ms = rte_mem_virt2memseg(p, NULL);
+	if (ms != NULL) {
+		TEST_ASSERT_EQUAL(ms->socket_id, socket_id,
+			"alloc on socket %d landed on socket %d",
+			socket_id, ms->socket_id);
+	}
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Allocate from a socket different from the calling lcore's socket,
+ * triggering a cross-socket cache allocation. Then deinit to exercise
+ * the teardown path where a cache's backing memory lives on a
+ * different socket than the one it serves.
+ */
+static int
+test_alloc_cross_socket_deinit(void)
+{
+	int local_sid, remote_sid;
+	unsigned int i, n_sockets;
+	void *p;
+
+	local_sid = (int)rte_socket_id();
+	if (local_sid < 0 || (unsigned int)local_sid >= RTE_MAX_NUMA_NODES)
+		local_sid = rte_socket_id_by_idx(0);
+
+	n_sockets = rte_socket_count();
+	if (n_sockets < 2)
+		return TEST_SKIPPED;
+
+	/* Find a socket different from the local one. */
+	remote_sid = -1;
+	for (i = 0; i < n_sockets; i++) {
+		int sid = rte_socket_id_by_idx(i);
+		if (sid >= 0 && sid != local_sid) {
+			remote_sid = sid;
+			break;
+		}
+	}
+	if (remote_sid < 0)
+		return TEST_SKIPPED;
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, remote_sid);
+	TEST_ASSERT_NOT_NULL(p,
+		"cross-socket alloc(socket %d) failed: rte_errno=%d",
+		remote_sid, rte_errno);
+
+	rte_fastmem_free(p);
+
+	/* Tear down and re-init to exercise the deinit path with
+	 * cross-socket caches.
+	 */
+	rte_fastmem_deinit();
+
+	TEST_ASSERT_EQUAL(rte_fastmem_init(), 0,
+		"re-init after cross-socket deinit failed");
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 3 tests: per-lcore caches.
+ */
+
+static int
+test_cache_flush(void)
+{
+	void *p;
+	/*
+	 * Alloc and free one object, leaving it in the cache. Then
+	 * flush and check only that the flush does not crash and
+	 * that a follow-up alloc still works. Whether that alloc
+	 * returns the same pointer as before is deliberately not
+	 * asserted.
+	 */
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "first alloc failed");
+	rte_fastmem_free(p);
+
+	rte_fastmem_cache_flush();
+
+	/* Flush again — must be idempotent. */
+	rte_fastmem_cache_flush();
+
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "post-flush alloc failed");
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_cache_flush();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_exceeds_capacity(void)
+{
+	/*
+	 * Free more objects in a single size class than the cache
+	 * capacity (64 for classes <= 4 KiB). This forces the
+	 * cache-drain slow path and verifies no corruption.
+	 */
+	enum { COUNT = 200, SIZE = 64 };
+	void *ptrs[COUNT];
+	unsigned int i;
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	/* Re-alloc the same count should still work. */
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"re-alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct non_eal_args {
+	int ok;
+	char pad[64];
+};
+
+static uint32_t
+non_eal_thread_main(void *arg)
+{
+	struct non_eal_args *args = arg;
+	uint8_t *p;
+
+	p = rte_fastmem_alloc(128, 0, 0);
+	if (p == NULL)
+		return 1;
+
+	memset(p, 0x7e, 128);
+
+	rte_fastmem_free(p);
+
+	args->ok = 1;
+	return 0;
+}
+
+static int
+test_non_eal_thread(void)
+{
+	rte_thread_t thread_id;
+	struct non_eal_args args = { 0 };
+	int rc;
+
+	rc = rte_thread_create(&thread_id, NULL, non_eal_thread_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(args.ok, 1,
+		"non-EAL thread did not complete alloc/free successfully");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_returns_memory(void)
+{
+	/*
+	 * When an entire slab's worth of objects is freed, the
+	 * slab's block is returned to the free-block pool and can
+	 * be reassigned to another size class. Verify the cache
+	 * does not permanently hold objects that prevent this.
+	 *
+	 * Allocate more objects in one class than the per-lcore
+	 * cache holds, free them all, then flush the cache. After
+	 * the flush, all cached objects are drained to their bins
+	 * and empty slabs are returned to the block pool.
+	 */
+	enum { N = 200, SIZE = 64 };
+	void *ptrs[N];
+	unsigned int i;
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rte_fastmem_cache_flush();
+
+	/*
+	 * An allocation in a completely different class should
+	 * succeed now, having access to any blocks freed by the
+	 * flush.
+	 */
+	{
+		void *other = rte_fastmem_alloc(65536, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(other,
+			"post-flush cross-class alloc failed");
+		rte_fastmem_free(other);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_basic(void)
+{
+	enum { N = 32 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	/* Verify all pointers are non-NULL and distinct. */
+	for (unsigned int i = 0; i < N; i++) {
+		TEST_ASSERT_NOT_NULL(ptrs[i], "ptrs[%u] is NULL", i);
+		for (unsigned int j = 0; j < i; j++)
+			TEST_ASSERT(ptrs[i] != ptrs[j],
+				"ptrs[%u] == ptrs[%u]", i, j);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_zero_flag(void)
+{
+	enum { N = 8, SIZE = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, SIZE, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	for (unsigned int i = 0; i < N; i++) {
+		uint8_t *p = ptrs[i];
+
+		for (unsigned int b = 0; b < SIZE; b++)
+			TEST_ASSERT_EQUAL(p[b], 0,
+				"ptrs[%u][%u] != 0", i, b);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_exceeds_cache(void)
+{
+	/* Allocate more than cache capacity (64) in one bulk call. */
+	enum { N = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk(%d) failed: %d", N, rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_socket(void)
+{
+	enum { N = 16 };
+	void *ptrs[N];
+	int socket_id;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no sockets");
+
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* SOCKET_ID_ANY */
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket(ANY) failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_bulk(void)
+{
+	enum { N = 64 };
+	void *ptrs[N];
+	/* Allocate individually, free in bulk. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* Verify memory is reusable. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "re-alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_classes(void)
+{
+	size_t sizes[32];
+	unsigned int n;
+
+	n = rte_fastmem_classes(NULL);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+
+	n = rte_fastmem_classes(sizes);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+	TEST_ASSERT_EQUAL(sizes[0], (size_t)8, "class 0 != 8");
+	TEST_ASSERT_EQUAL(sizes[n - 1], (size_t)(1 << 20),
+		"last class != 1 MiB");
+
+	for (unsigned int i = 0; i < n; i++) {
+		TEST_ASSERT(sizes[i] != 0 && (sizes[i] & (sizes[i] - 1)) == 0,
+			"class %u size %zu not power of 2", i, sizes[i]);
+		if (i > 0)
+			TEST_ASSERT(sizes[i] > sizes[i - 1],
+				"classes not ascending at %u", i);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_class(void)
+{
+	enum { N = 10 };
+	struct rte_fastmem_class_stats cs;
+	void *ptrs[N];
+	int rc;
+
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.class_size, (size_t)64, "wrong class_size");
+	TEST_ASSERT(cs.alloc_cache_hits + cs.alloc_cache_misses == N,
+		"alloc count != N: hits=%" PRIu64 " misses=%" PRIu64,
+		cs.alloc_cache_hits, cs.alloc_cache_misses);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)N, "in_use != N");
+
+	for (unsigned int i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class after free failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)0, "in_use != 0 after free");
+
+	/* Invalid class size. */
+	rc = rte_fastmem_stats_class(13, &cs);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad size");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore(void)
+{
+	struct rte_fastmem_lcore_stats ls;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore failed: %d", rc);
+	TEST_ASSERT(ls.alloc_cache_hits + ls.alloc_cache_misses > 0,
+		"no alloc activity on this lcore");
+
+	rte_fastmem_free(ptr);
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore after free failed: %d", rc);
+	TEST_ASSERT(ls.free_cache_hits + ls.free_cache_misses > 0,
+		"no free activity on this lcore");
+
+	/* Invalid lcore. */
+	rc = rte_fastmem_stats_lcore(RTE_MAX_LCORE, &ls);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad lcore");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore_class(void)
+{
+	struct rte_fastmem_lcore_class_stats lcs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(256, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore_class(rte_lcore_id(), 256, &lcs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(lcs.class_size, (size_t)256, "wrong class_size");
+	TEST_ASSERT(lcs.alloc_cache_hits + lcs.alloc_cache_misses > 0,
+		"no alloc activity");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_reset(void)
+{
+	struct rte_fastmem_stats gs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+	rte_fastmem_free(ptr);
+
+	rte_fastmem_stats_reset();
+
+	rc = rte_fastmem_stats(&gs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(gs.alloc_total, (uint64_t)0,
+		"alloc_total not zero after reset");
+	TEST_ASSERT_EQUAL(gs.free_total, (uint64_t)0,
+		"free_total not zero after reset");
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Counters are stored separately from the per-lcore caches, so a
+ * cache flush (which frees the cache structs) must not discard
+ * accumulated statistics.
+ */
+static int
+test_stats_survive_cache_flush(void)
+{
+	enum { N = 10 };
+	struct rte_fastmem_class_stats before, after;
+	struct rte_fastmem_lcore_stats lbefore, lafter;
+	void *ptrs[N];
+	unsigned int i;
+	int rc;
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rc = rte_fastmem_stats_class(64, &before);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class failed: %d", rc);
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &lbefore);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore failed: %d", rc);
+
+	TEST_ASSERT(before.alloc_cache_hits + before.alloc_cache_misses == N,
+		"expected %d allocs before flush", N);
+
+	rte_fastmem_cache_flush();
+
+	rc = rte_fastmem_stats_class(64, &after);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class after flush failed: %d", rc);
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &lafter);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore after flush failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(after.alloc_cache_hits, before.alloc_cache_hits,
+		"alloc_cache_hits lost across flush: %" PRIu64 " -> %" PRIu64,
+		before.alloc_cache_hits, after.alloc_cache_hits);
+	TEST_ASSERT_EQUAL(after.alloc_cache_misses, before.alloc_cache_misses,
+		"alloc_cache_misses lost across flush: %" PRIu64 " -> %" PRIu64,
+		before.alloc_cache_misses, after.alloc_cache_misses);
+	TEST_ASSERT_EQUAL(after.free_cache_hits, before.free_cache_hits,
+		"free_cache_hits lost across flush: %" PRIu64 " -> %" PRIu64,
+		before.free_cache_hits, after.free_cache_hits);
+	TEST_ASSERT_EQUAL(lafter.alloc_cache_hits + lafter.alloc_cache_misses,
+		lbefore.alloc_cache_hits + lbefore.alloc_cache_misses,
+		"per-lcore alloc counters lost across flush");
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Allocations made by a non-EAL thread cannot be attributed to an
+ * lcore, but must still be reflected in the global and per-class
+ * statistics.
+ */
+static uint32_t
+stats_non_eal_main(void *arg)
+{
+	struct non_eal_args *args = arg;
+	void *ptrs[8];
+	unsigned int i;
+
+	for (i = 0; i < RTE_DIM(ptrs); i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		if (ptrs[i] == NULL)
+			return 1;
+	}
+	for (i = 0; i < RTE_DIM(ptrs); i++)
+		rte_fastmem_free(ptrs[i]);
+
+	args->ok = 1;
+	return 0;
+}
+
+static int
+test_stats_count_non_eal(void)
+{
+	enum { N = 8 };
+	struct rte_fastmem_stats before, after;
+	struct non_eal_args args = { 0 };
+	rte_thread_t thread_id;
+	int rc;
+
+	rte_fastmem_stats_reset();
+
+	rc = rte_fastmem_stats(&before);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+
+	rc = rte_thread_create(&thread_id, NULL, stats_non_eal_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+	TEST_ASSERT_EQUAL(args.ok, 1, "non-EAL thread alloc/free failed");
+
+	rc = rte_fastmem_stats(&after);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(after.alloc_total - before.alloc_total, (uint64_t)N,
+		"non-EAL allocs not counted globally: delta=%" PRIu64,
+		after.alloc_total - before.alloc_total);
+	TEST_ASSERT_EQUAL(after.free_total - before.free_total, (uint64_t)N,
+		"non-EAL frees not counted globally: delta=%" PRIu64,
+		after.free_total - before.free_total);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * A non-EAL thread has no lcore id, so its traffic must land in the
+ * shared cache and be reported by rte_fastmem_stats_shared().
+ */
+static int
+test_stats_shared_non_eal(void)
+{
+	enum { N = 8 };
+	struct rte_fastmem_lcore_stats sh;
+	struct rte_fastmem_lcore_class_stats shc;
+	struct non_eal_args args = { 0 };
+	rte_thread_t thread_id;
+	int rc;
+
+	rte_fastmem_stats_reset();
+
+	rc = rte_thread_create(&thread_id, NULL, stats_non_eal_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+	TEST_ASSERT_EQUAL(args.ok, 1, "non-EAL thread alloc/free failed");
+
+	rc = rte_fastmem_stats_shared(&sh);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_shared failed: %d", rc);
+	TEST_ASSERT_EQUAL(sh.alloc_cache_hits + sh.alloc_cache_misses,
+		(uint64_t)N, "shared allocs not counted: %" PRIu64,
+		sh.alloc_cache_hits + sh.alloc_cache_misses);
+	TEST_ASSERT_EQUAL(sh.free_cache_hits + sh.free_cache_misses,
+		(uint64_t)N, "shared frees not counted: %" PRIu64,
+		sh.free_cache_hits + sh.free_cache_misses);
+
+	/* stats_non_eal_main allocates 64-byte objects. */
+	rc = rte_fastmem_stats_shared_class(64, &shc);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_shared_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(shc.class_size, (size_t)64, "wrong class_size");
+	TEST_ASSERT_EQUAL(shc.alloc_cache_hits + shc.alloc_cache_misses,
+		(uint64_t)N, "shared class allocs not counted: %" PRIu64,
+		shc.alloc_cache_hits + shc.alloc_cache_misses);
+
+	/* The shared traffic must not be attributed to any lcore. */
+	struct rte_fastmem_lcore_stats ls;
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore failed: %d", rc);
+	TEST_ASSERT_EQUAL(ls.alloc_cache_hits + ls.alloc_cache_misses,
+		(uint64_t)0, "shared traffic leaked into lcore stats");
+
+	/* Error paths. */
+	rc = rte_fastmem_stats_shared(NULL);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for NULL stats");
+	rc = rte_fastmem_stats_shared_class(13, &shc);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad size");
+
+	return TEST_SUCCESS;
+}
+
+
+#define MIXED_LONG_LIVED_COUNT 25
+#define MIXED_SHORT_LIVED_ITERS 1000
+#define MIXED_MIN_LCORES 3
+
+static const size_t mixed_long_sizes[] = { 64, 256, 4096 };
+static const size_t mixed_short_sizes[] = { 8, 16, 32, 64, 128, 256, 512, 1024 };
+
+struct mixed_worker_args {
+	uint32_t seed;
+	int result;
+};
+
+static uint32_t
+xorshift32(uint32_t *state)
+{
+	uint32_t x = *state;
+
+	x ^= x << 13;
+	x ^= x >> 17;
+	x ^= x << 5;
+	*state = x;
+	return x;
+}
+
+static int
+mixed_worker(void *arg)
+{
+	struct mixed_worker_args *args = arg;
+	uint32_t seed = args->seed;
+	void *long_lived[MIXED_LONG_LIVED_COUNT];
+	size_t long_sizes[MIXED_LONG_LIVED_COUNT];
+	unsigned int i;
+
+	/* Allocate long-lived objects of mixed sizes. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		long_sizes[i] = mixed_long_sizes[i % RTE_DIM(mixed_long_sizes)];
+		long_lived[i] = rte_fastmem_alloc(long_sizes[i], 0, 0);
+		if (long_lived[i] == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(long_lived[i], (int)(i + 1), long_sizes[i]);
+	}
+
+	/* Rapidly cycle short-lived objects. */
+	for (i = 0; i < MIXED_SHORT_LIVED_ITERS; i++) {
+		size_t sz = mixed_short_sizes[xorshift32(&seed) %
+					      RTE_DIM(mixed_short_sizes)];
+		uint8_t pattern = (uint8_t)(i & 0xff);
+		uint8_t *p;
+
+		p = rte_fastmem_alloc(sz, 0, 0);
+		if (p == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(p, pattern, sz);
+
+		/* Verify before freeing. */
+		for (size_t j = 0; j < sz; j++) {
+			if (p[j] != pattern) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(p);
+	}
+
+	/* Verify long-lived objects are still intact. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		uint8_t *bytes = long_lived[i];
+		uint8_t expected = (uint8_t)(i + 1);
+
+		for (size_t j = 0; j < long_sizes[i]; j++) {
+			if (bytes[j] != expected) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(long_lived[i]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_mixed_lifetimes_multi_lcore(void)
+{
+	struct mixed_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int count = 0;
+	struct rte_fastmem_stats stats;
+	int rc;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		count++;
+
+	if (count < MIXED_MIN_LCORES) {
+		printf("Not enough worker lcores (%u < %u), skipping\n",
+		       count, MIXED_MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	/* Launch workers with distinct seeds. */
+	uint32_t seed = 0xdeadbeef;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].seed = seed;
+		args[lcore_id].result = TEST_FAILED;
+		seed += 0x12345678;
+		rte_eal_remote_launch(mixed_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	/* Check all workers succeeded. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	/* Verify no memory leak. */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero after test: %" PRIu64,
+		stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+
+/*
+ * Memory limit tests.
+ *
+ * FASTMEM_MEMZONE_SIZE is 128 MiB. We use a limit of 128 MiB
+ * (one memzone) for most tests, and large objects (256 KiB) to
+ * exhaust slabs quickly.
+ */
+
+#define LIMIT_ONE_MZ ((size_t)128 << 20)
+#define LIMIT_OBJ_SIZE ((size_t)256 * 1024)
+
+static int
+test_memory_limit_basic(void)
+{
+	int rc;
+
+	rc = rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+	TEST_ASSERT_EQUAL(rc, 0, "set_limit failed: %d", rc);
+
+	const size_t got = rte_fastmem_get_limit(rte_socket_id_by_idx(0));
+	TEST_ASSERT_EQUAL(got, LIMIT_ONE_MZ,
+		"get_limit mismatch: %zu", got);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ + 1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "second reserve should have failed");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_exhaustion(void)
+{
+	const unsigned int max_ptrs = 1024;
+	void *ptrs[max_ptrs];
+	unsigned int count = 0;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < max_ptrs; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+
+	TEST_ASSERT(count > 0, "should have allocated at least one");
+	TEST_ASSERT(count < max_ptrs, "should have hit the limit");
+	TEST_ASSERT_EQUAL(rte_errno, ENOMEM, "expected ENOMEM, got %d", rte_errno);
+
+	rte_fastmem_free(ptrs[count - 1]);
+	void *p = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc after free should succeed");
+	rte_fastmem_free(p);
+
+	for (unsigned int i = 0; i < count - 1; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_zero_blocks_growth(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "reserve with limit=0 should fail");
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc with limit=0 should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_below_current(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve failed: %d", rc);
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 1);
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc from existing backing should work");
+	rte_fastmem_free(p);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ * 2, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "growth beyond limit should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_socket_id_any(void)
+{
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 42);
+
+	for (unsigned int i = 0; i < rte_socket_count(); i++) {
+		const int sid = rte_socket_id_by_idx(i);
+		const size_t lim = rte_fastmem_get_limit(sid);
+
+		TEST_ASSERT_EQUAL(lim, (size_t)42,
+			"socket %d limit mismatch: %zu", sid, lim);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_unlimited(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+	rte_fastmem_set_limit(SOCKET_ID_ANY, SIZE_MAX);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve after reset failed: %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_integrity_under_oom(void)
+{
+	const unsigned int n = 128;
+	const size_t obj_size = 1024;
+	uint8_t *ptrs[n];
+	const unsigned int extra_max = 1024;
+	void *extra[extra_max];
+	unsigned int n_extra = 0;
+	unsigned int i;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = rte_fastmem_alloc(obj_size, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)(i & 0xff), obj_size);
+	}
+
+	/* Exhaust remaining backing with large objects. */
+	for (n_extra = 0; n_extra < extra_max; n_extra++) {
+		extra[n_extra] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (extra[n_extra] == NULL)
+			break;
+	}
+
+	/* Verify original objects are intact. */
+	for (i = 0; i < n; i++) {
+		const uint8_t expected = (uint8_t)(i & 0xff);
+		for (unsigned int j = 0; j < obj_size; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], expected,
+				"corruption at [%u][%u]", i, j);
+	}
+
+	for (i = 0; i < n; i++)
+		rte_fastmem_free(ptrs[i]);
+	for (i = 0; i < n_extra; i++)
+		rte_fastmem_free(extra[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_bulk_alloc_oom(void)
+{
+	const unsigned int bulk_n = 64;
+	const unsigned int drain_max = 512;
+	void *ptrs[bulk_n];
+	void *drain[drain_max];
+	unsigned int drained = 0;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (drained = 0; drained < drain_max; drained++) {
+		drain[drained] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (drain[drained] == NULL)
+			break;
+	}
+
+	/* Free a few — enough for some but not bulk_n objects. */
+	const unsigned int freed = RTE_MIN(drained, 4u);
+	for (unsigned int i = 0; i < freed; i++)
+		rte_fastmem_free(drain[--drained]);
+
+	rc = rte_fastmem_alloc_bulk(ptrs, bulk_n, LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT(rc < 0, "bulk alloc should fail");
+
+	for (unsigned int i = 0; i < drained; i++)
+		rte_fastmem_free(drain[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_recovery_after_free(void)
+{
+	const unsigned int max_ptrs = 512;
+	void *ptrs[max_ptrs];
+	unsigned int count = 0;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < max_ptrs; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+	TEST_ASSERT(count > 0 && count < max_ptrs,
+		"expected partial fill, got %u", count);
+
+	const unsigned int half = count / 2;
+	for (unsigned int i = 0; i < half; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	for (unsigned int i = 0; i < half; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "recovery alloc[%u] failed", i);
+	}
+
+	for (unsigned int i = 0; i < count; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct limit_worker_args {
+	unsigned int alloc_count;
+	int result;
+};
+
+static int
+limit_worker(void *arg)
+{
+	struct limit_worker_args *args = arg;
+	const unsigned int max_ptrs = 128;
+	void *ptrs[max_ptrs];
+	unsigned int i;
+
+	args->alloc_count = 0;
+
+	for (i = 0; i < max_ptrs; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[i] == NULL)
+			break;
+		memset(ptrs[i], 0xab, LIMIT_OBJ_SIZE);
+		args->alloc_count++;
+	}
+
+	for (unsigned int j = 0; j < args->alloc_count; j++) {
+		uint8_t *bytes = ptrs[j];
+		for (size_t k = 0; k < LIMIT_OBJ_SIZE; k++) {
+			if (bytes[k] != 0xab) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(ptrs[j]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_memory_limit_multi_lcore_oom(void)
+{
+	struct limit_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int worker_count = 0;
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		worker_count++;
+
+	if (worker_count < 2) {
+		printf("Not enough workers (%u < 2), skipping\n", worker_count);
+		return TEST_SKIPPED;
+	}
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].result = TEST_FAILED;
+		rte_eal_remote_launch(limit_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	struct rte_fastmem_stats stats;
+	rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero: %" PRIu64, stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_same_class(void)
+{
+	void *ptr = rte_fastmem_alloc(32, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	/* Realloc to a larger size within the same class (64 B class). */
+	void *ptr2 = rte_fastmem_realloc(ptr, 33, 0);
+	TEST_ASSERT_NOT_NULL(ptr2, "realloc failed");
+	TEST_ASSERT_EQUAL(ptr, ptr2,
+		"realloc returned different pointer for same class");
+
+	/* Realloc to exact class boundary — still same class. */
+	void *ptr3 = rte_fastmem_realloc(ptr2, 64, 0);
+	TEST_ASSERT_NOT_NULL(ptr3, "realloc failed");
+	TEST_ASSERT_EQUAL(ptr2, ptr3,
+		"realloc returned different pointer for same class");
+
+	rte_fastmem_free(ptr3);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_grow(void)
+{
+	const uint8_t pattern = 0xab;
+	void *ptr = rte_fastmem_alloc(16, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	memset(ptr, pattern, 16);
+
+	/* Grow beyond current class. */
+	void *ptr2 = rte_fastmem_realloc(ptr, 128, 0);
+	TEST_ASSERT_NOT_NULL(ptr2, "realloc grow failed");
+
+	/* Verify contents preserved. */
+	uint8_t *bytes = ptr2;
+	for (unsigned int i = 0; i < 16; i++)
+		TEST_ASSERT_EQUAL(bytes[i], pattern,
+			"content corrupted at byte %u", i);
+
+	rte_fastmem_free(ptr2);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_shrink(void)
+{
+	const uint8_t pattern = 0xcd;
+	void *ptr = rte_fastmem_alloc(256, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	memset(ptr, pattern, 256);
+
+	/* Shrink to a smaller class. */
+	void *ptr2 = rte_fastmem_realloc(ptr, 16, 0);
+	TEST_ASSERT_NOT_NULL(ptr2, "realloc shrink failed");
+
+	/* Verify contents preserved up to new size. */
+	uint8_t *bytes = ptr2;
+	for (unsigned int i = 0; i < 16; i++)
+		TEST_ASSERT_EQUAL(bytes[i], pattern,
+			"content corrupted at byte %u", i);
+
+	rte_fastmem_free(ptr2);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_null_ptr(void)
+{
+	/* NULL ptr should behave like alloc. */
+	void *ptr = rte_fastmem_realloc(NULL, 64, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "realloc(NULL) failed");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_zero_size(void)
+{
+	void *ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	/* Size 0 should free the object and return NULL. */
+	void *ptr2 = rte_fastmem_realloc(ptr, 0, 0);
+	TEST_ASSERT_NULL(ptr2, "realloc(size=0) should return NULL");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_too_big(void)
+{
+	void *ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	void *ptr2 = rte_fastmem_realloc(ptr, rte_fastmem_max_size() + 1, 0);
+	TEST_ASSERT_NULL(ptr2, "realloc should fail for oversized request");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG, "expected E2BIG");
+
+	/* Original pointer should still be valid. */
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_invalid_align(void)
+{
+	void *ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	void *ptr2 = rte_fastmem_realloc(ptr, 64, 3);
+	TEST_ASSERT_NULL(ptr2, "realloc should fail for non-power-of-2 align");
+	TEST_ASSERT_EQUAL(rte_errno, EINVAL, "expected EINVAL");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Handle-based allocation API.
+ */
+
+static int
+test_halloc_basic(void)
+{
+	rte_fastmem_handle_t handle;
+	void *ptrs[16];
+	void *p;
+	int rc;
+	unsigned int i;
+
+	rc = rte_fastmem_hlookup(64, 0, rte_socket_id_by_idx(0), &handle);
+	TEST_ASSERT_EQUAL(rc, 0, "hlookup failed: %d", rc);
+
+	p = rte_fastmem_halloc(handle, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_NOT_NULL(p, "halloc failed: rte_errno=%d", rte_errno);
+	memset(p, 0x5a, 64);
+	rte_fastmem_hfree(handle, p);
+
+	/* NULL pointer free is a no-op. */
+	rte_fastmem_hfree(handle, NULL);
+
+	rc = rte_fastmem_halloc_bulk(handle, ptrs, RTE_DIM(ptrs), 0);
+	TEST_ASSERT_EQUAL(rc, 0, "halloc_bulk failed: %d", rc);
+	for (i = 0; i < RTE_DIM(ptrs); i++)
+		TEST_ASSERT_NOT_NULL(ptrs[i], "halloc_bulk[%u] NULL", i);
+	rte_fastmem_hfree_bulk(handle, ptrs, RTE_DIM(ptrs));
+
+	return TEST_SUCCESS;
+}
+
+struct halloc_worker_args {
+	rte_fastmem_handle_t handle;
+	int result;
+};
+
+/*
+ * Allocate and free using a handle that was looked up on a
+ * different lcore. The worker lcore has no pre-existing cache for
+ * the handle's size class, so this exercises the path where
+ * halloc/hfree must lazily create (or bypass) the per-lcore cache.
+ */
+static int
+halloc_worker(void *arg)
+{
+	struct halloc_worker_args *args = arg;
+	void *ptrs[8];
+	uint8_t *p;
+	unsigned int i;
+
+	args->result = TEST_FAILED;
+
+	p = rte_fastmem_halloc(args->handle, 0);
+	if (p == NULL)
+		return -1;
+	memset(p, 0x3c, 64);
+	rte_fastmem_hfree(args->handle, p);
+
+	if (rte_fastmem_halloc_bulk(args->handle, ptrs, RTE_DIM(ptrs), 0) < 0)
+		return -1;
+	for (i = 0; i < RTE_DIM(ptrs); i++) {
+		if (ptrs[i] == NULL)
+			return -1;
+	}
+	rte_fastmem_hfree_bulk(args->handle, ptrs, RTE_DIM(ptrs));
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_halloc_other_lcore(void)
+{
+	struct halloc_worker_args args;
+	rte_fastmem_handle_t handle;
+	unsigned int lcore_id;
+	int rc;
+
+	lcore_id = rte_get_next_lcore(-1, 1, 0);
+	if (lcore_id == RTE_MAX_LCORE)
+		return TEST_SKIPPED;
+
+	/* Look up the handle on the main lcore only. */
+	rc = rte_fastmem_hlookup(64, 0, rte_socket_id_by_idx(0), &handle);
+	TEST_ASSERT_EQUAL(rc, 0, "hlookup failed: %d", rc);
+
+	args.handle = handle;
+	args.result = TEST_FAILED;
+
+	rte_eal_remote_launch(halloc_worker, &args, lcore_id);
+	rc = rte_eal_wait_lcore(lcore_id);
+	TEST_ASSERT_EQUAL(rc, 0, "worker returned %d", rc);
+	TEST_ASSERT_EQUAL(args.result, TEST_SUCCESS,
+		"halloc/hfree failed on an lcore that did not call hlookup");
+
+	return TEST_SUCCESS;
+}
+
+static uint32_t
+halloc_non_eal_main(void *arg)
+{
+	struct halloc_worker_args *args = arg;
+
+	return halloc_worker(args) == 0 ? 0 : 1;
+}
+
+static int
+test_halloc_non_eal_thread(void)
+{
+	struct halloc_worker_args args;
+	rte_fastmem_handle_t handle;
+	rte_thread_t thread_id;
+	int rc;
+
+	rc = rte_fastmem_hlookup(64, 0, rte_socket_id_by_idx(0), &handle);
+	TEST_ASSERT_EQUAL(rc, 0, "hlookup failed: %d", rc);
+
+	args.handle = handle;
+	args.result = TEST_FAILED;
+
+	rc = rte_thread_create(&thread_id, NULL, halloc_non_eal_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(args.result, TEST_SUCCESS,
+		"halloc/hfree failed on a non-EAL thread");
+
+	return TEST_SUCCESS;
+}
+
+static int
+fastmem_setup(void)
+{
+	return rte_fastmem_init();
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_deinit();
+}
+
+static struct unit_test_suite fastmem_testsuite = {
+	.suite_name = "fastmem tests",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_init_deinit),
+		TEST_CASE(test_init_is_not_idempotent),
+		TEST_CASE(test_deinit_without_init),
+		TEST_CASE(test_max_size),
+		TEST_CASE(test_reserve_without_init),
+		TEST_CASE(test_cache_flush_without_init),
+		TEST_CASE(test_classes),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_multiple_memzones),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_cumulative),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_invalid_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_any_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_invalid_align),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_various_sizes),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_alignment),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_reuse),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_many_in_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing_no_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_null),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_content_integrity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_one),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket_numa_placement),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_cross_socket_deinit),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_exceeds_capacity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_non_eal_thread),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush_returns_memory),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_exceeds_cache),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_bulk),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_reset),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_survive_cache_flush),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_count_non_eal),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_shared_non_eal),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_mixed_lifetimes_multi_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_exhaustion),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_zero_blocks_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_below_current),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_socket_id_any),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_unlimited),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_integrity_under_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_bulk_alloc_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_recovery_after_free),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_multi_lcore_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_same_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_grow),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_shrink),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_null_ptr),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_zero_size),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_invalid_align),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_halloc_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_halloc_other_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_halloc_non_eal_thread),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_fastmem(void)
+{
+	return unit_test_suite_runner(&fastmem_testsuite);
+}
+
+REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_SKIP, ASAN_OK, test_fastmem);
diff --git a/app/test/test_fastmem_perf.c b/app/test/test_fastmem_perf.c
new file mode 100644
index 0000000000..73c0a4c6ce
--- /dev/null
+++ b/app/test/test_fastmem_perf.c
@@ -0,0 +1,1040 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_malloc.h>
+#include <rte_mempool.h>
+#include <rte_stdatomic.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define TEST_LOG(...) printf(__VA_ARGS__)
+
+static const size_t SIZES[] = { 8, 64, 256, 1024, 4096 };
+#define N_SIZES RTE_DIM(SIZES)
+
+/* Number of ops for warmup and measurement. */
+#define WARMUP_OPS 20000u
+#define MEASURE_OPS 2000000u
+
+/* Buffer for scenarios that allocate N then free N. */
+#define BATCH_N 256
+
+/*
+ * Allocator vtable: a thin adapter exposing alloc / free /
+ * per-allocator setup/teardown. Each scenario calls these
+ * indirectly so the same timing loop serves all allocators.
+ */
+struct allocator {
+	const char *name;
+	int (*setup)(size_t size, unsigned int n_max);
+	void (*teardown)(void);
+	void *(*alloc)(void);
+	void (*free_obj)(void *ptr);
+	int (*alloc_bulk)(void **ptrs, unsigned int n);
+	void (*free_bulk)(void **ptrs, unsigned int n);
+};
+
+/* Fastmem adapter -------------------------------------------------- */
+
+static size_t fastmem_size;
+
+static int
+fastmem_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	fastmem_size = size;
+	return 0;
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_alloc(void)
+{
+	return rte_fastmem_alloc(fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free(void *ptr)
+{
+	rte_fastmem_free(ptr);
+}
+
+/* Mempool adapter -------------------------------------------------- */
+
+static struct rte_mempool *mempool_pool;
+
+static int
+mempool_setup(size_t size, unsigned int n_max)
+{
+	char name[RTE_MEMPOOL_NAMESIZE];
+	unsigned int cache_size;
+
+	/*
+	 * Pool size must accommodate the full batch burst plus
+	 * per-lcore cache capacity. Use mempool's maximum cache
+	 * size so we're measuring its cached hot path.
+	 */
+	cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE;
+
+	snprintf(name, sizeof(name), "fmperf_mp_%zu", size);
+	mempool_pool = rte_mempool_create(name, n_max + cache_size * 2,
+			size, cache_size, 0, NULL, NULL, NULL, NULL,
+			SOCKET_ID_ANY, 0);
+	if (mempool_pool == NULL) {
+		TEST_LOG("mempool_create(%zu) failed\n", size);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+mempool_teardown(void)
+{
+	rte_mempool_free(mempool_pool);
+	mempool_pool = NULL;
+}
+
+static void * __rte_noinline
+mempool_alloc_one(void)
+{
+	void *obj = NULL;
+
+	if (rte_mempool_get(mempool_pool, &obj) < 0)
+		return NULL;
+	return obj;
+}
+
+static void __rte_noinline
+mempool_free_one(void *ptr)
+{
+	rte_mempool_put(mempool_pool, ptr);
+}
+
+/* rte_malloc adapter ----------------------------------------------- */
+
+static size_t malloc_size;
+
+static int
+malloc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	malloc_size = size;
+	return 0;
+}
+
+static void
+malloc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+malloc_alloc(void)
+{
+	return rte_malloc(NULL, malloc_size, 0);
+}
+
+static void __rte_noinline
+malloc_free(void *ptr)
+{
+	rte_free(ptr);
+}
+
+/* libc (glibc) malloc adapter -------------------------------------- */
+
+static size_t libc_size;
+
+static int
+libc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	/*
+	 * Round the size up to a multiple of the cache-line size
+	 * to match the other allocators' default alignment and
+	 * keep the comparison honest. aligned_alloc() requires
+	 * size to be a multiple of the alignment.
+	 */
+	libc_size = RTE_ALIGN_CEIL(size, RTE_CACHE_LINE_SIZE);
+	return 0;
+}
+
+static void
+libc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+libc_alloc(void)
+{
+	return aligned_alloc(RTE_CACHE_LINE_SIZE, libc_size);
+}
+
+static void __rte_noinline
+libc_free(void *ptr)
+{
+	free(ptr);
+}
+
+/* Bulk adapters ---------------------------------------------------- */
+
+static int __rte_noinline
+fastmem_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_alloc_bulk(ptrs, n, fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_free_bulk(ptrs, n);
+}
+
+/* Fastmem handle adapter ------------------------------------------- */
+
+static rte_fastmem_handle_t fastmem_handle;
+
+static int
+fastmem_h_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	return rte_fastmem_hlookup(size, 0, rte_socket_id(), &fastmem_handle);
+}
+
+static void
+fastmem_h_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_h_alloc(void)
+{
+	return rte_fastmem_halloc(fastmem_handle, 0);
+}
+
+static void __rte_noinline
+fastmem_h_free(void *ptr)
+{
+	rte_fastmem_hfree(fastmem_handle, ptr);
+}
+
+static int __rte_noinline
+fastmem_h_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_halloc_bulk(fastmem_handle, ptrs, n, 0);
+}
+
+static void __rte_noinline
+fastmem_h_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_hfree_bulk(fastmem_handle, ptrs, n);
+}
+
+/* Mempool, rte_malloc and libc bulk adapters ----------------------- */
+
+static int __rte_noinline
+mempool_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_mempool_get_bulk(mempool_pool, ptrs, n);
+}
+
+static void __rte_noinline
+mempool_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_mempool_put_bulk(mempool_pool, ptrs, n);
+}
+
+static int __rte_noinline
+generic_alloc_bulk(void **ptrs, unsigned int n, void *(*alloc_fn)(void))
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = alloc_fn();
+		if (ptrs[i] == NULL)
+			return -1;
+	}
+	return 0;
+}
+
+static int __rte_noinline
+malloc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, malloc_alloc);
+}
+
+static void __rte_noinline
+malloc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		malloc_free(ptrs[i]);
+}
+
+static int __rte_noinline
+libc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, libc_alloc);
+}
+
+static void __rte_noinline
+libc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		libc_free(ptrs[i]);
+}
+
+/* Adapter table ---------------------------------------------------- */
+
+static const struct allocator allocators[] = {
+	{ "fastmem",    fastmem_setup,   fastmem_teardown,   fastmem_alloc,     fastmem_free,     fastmem_alloc_bulk,   fastmem_free_bulk },
+	{ "fastmem_h",  fastmem_h_setup, fastmem_h_teardown, fastmem_h_alloc,   fastmem_h_free,   fastmem_h_alloc_bulk, fastmem_h_free_bulk },
+	{ "mempool",    mempool_setup,   mempool_teardown,   mempool_alloc_one, mempool_free_one, mempool_alloc_bulk,   mempool_free_bulk },
+	{ "rte_malloc", malloc_setup,    malloc_teardown,    malloc_alloc,      malloc_free,      malloc_alloc_bulk,    malloc_free_bulk },
+	{ "libc",       libc_setup,      libc_teardown,      libc_alloc,        libc_free,        libc_alloc_bulk,      libc_free_bulk },
+};
+#define N_ALLOCATORS RTE_DIM(allocators)
+
+/*
+ * Scenario 1: tight alloc+free loop. A single object is cycled
+ * repeatedly. The LIFO path keeps the same pointer hot, giving
+ * a best-case measurement.
+ */
+static double
+run_tight(const struct allocator *alloc, size_t size)
+{
+	void *p;
+	uint64_t tsc;
+	unsigned int i;
+
+	if (alloc->setup(size, 1) < 0)
+		return -1.0;
+
+	/* Warmup. */
+	for (i = 0; i < WARMUP_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < MEASURE_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+	tsc = rte_rdtsc_precise() - tsc;
+
+	alloc->teardown();
+
+	return (double)tsc / MEASURE_OPS;
+err:
+	alloc->teardown();
+	return -1.0;
+}
+
+/*
+ * Scenario 2: allocate N, free N (FIFO free order). Exercises
+ * cache refill and drain paths when N exceeds cache capacity.
+ */
+static void
+run_batch(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	/* Pick iteration count so total ops ~= MEASURE_OPS. */
+	iters = MEASURE_OPS / BATCH_N;
+
+	/* Warmup. */
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 3: allocate N, free N in reverse order.
+ */
+static void
+run_batch_reverse(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	iters = MEASURE_OPS / BATCH_N;
+
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 4: multi-lcore alloc/work/free with a dummy-work
+ * baseline. Each worker runs a tight alloc → touch → free loop
+ * on its own lcore. A second run with the same dummy work but
+ * no allocator traffic establishes a baseline; the per-op
+ * allocator cost is reported as (alloc_run - baseline_run).
+ *
+ * Fixed size class and a fixed amount of dummy work per op —
+ * this scenario sweeps lcore count rather than size.
+ */
+#define MULTI_SIZE 256u
+#define MULTI_WORK_BYTES 64u
+#define MULTI_WORK_PASSES 8u   /* RMW passes over the work region. */
+#define MULTI_OPS 200000u
+#define MULTI_WARMUP 2000u
+#define MAX_MULTI_LCORES 32u
+
+/*
+ * Per-worker volatile sink. Each worker writes to its own
+ * slot, preventing dead-code elimination of touch_buffer() and
+ * avoiding cross-lcore cache-line sharing on the hot path.
+ * Padded to cache-line stride to prevent false sharing between
+ * neighboring workers' slots.
+ */
+struct worker_sink {
+	volatile uint64_t value;
+} __rte_cache_aligned;
+
+static struct worker_sink worker_sinks[RTE_MAX_LCORE];
+
+/*
+ * Out-of-line dummy workload: run MULTI_WORK_PASSES
+ * read-modify-write passes over the first 'bytes' of the
+ * buffer. Each pass reads what the previous pass wrote, so the
+ * compiler cannot unroll or parallelize across passes — the
+ * work scales linearly with MULTI_WORK_PASSES. Returns an
+ * accumulator so the caller can feed it into a volatile sink;
+ * without that, the compiler could elide the whole function.
+ *
+ * __rte_noinline so it looks identical to the compiler in both
+ * the baseline (pre-allocated scratch buffer) and alloc-path
+ * runs, making the cycle-delta subtraction valid.
+ *
+ * The purpose of this being tunably expensive is to keep
+ * worker-per-iteration cost high relative to the allocator's
+ * critical section, so that even serialized allocators like
+ * rte_malloc spend most of their time outside the lock and the
+ * measured per-op allocator cost reflects its own work rather
+ * than its contention queue.
+ */
+static uint64_t __rte_noinline
+touch_buffer(void *buf, size_t bytes)
+{
+	uint64_t *p = buf;
+	size_t n = bytes / sizeof(uint64_t);
+	uint64_t acc = 0;
+	unsigned int pass;
+	size_t i;
+
+	/* Prime the buffer with a known pattern. */
+	for (i = 0; i < n; i++)
+		p[i] = i * 0x9E3779B97F4A7C15ULL;
+
+	/*
+	 * Dependent RMW passes: each pass reads p[i] written by
+	 * the previous pass, mixes the pass index in, and writes
+	 * back. The XOR into acc keeps the chain live.
+	 */
+	for (pass = 0; pass < MULTI_WORK_PASSES; pass++) {
+		for (i = 0; i < n; i++) {
+			uint64_t v = p[i];
+
+			v = v * 0xC2B2AE3D27D4EB4FULL + pass;
+			v ^= v >> 33;
+			p[i] = v;
+			acc ^= v;
+		}
+	}
+
+	return acc;
+}
+
+struct worker_args {
+	const struct allocator *alloc;
+	void *scratch;            /* baseline only; NULL => alloc path */
+	unsigned int iters;
+	unsigned int warmup;
+	unsigned int bulk_n;      /* 0 = single-object, >0 = bulk */
+	RTE_ATOMIC(bool) start_flag; /* barrier at worker entry */
+	uint64_t cycles;          /* out */
+	unsigned int ops;         /* out */
+	int err;                  /* out */
+};
+
+static int
+worker_run(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	/* Wait for start flag (spin-barrier set by main). */
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				return -1;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+
+	/* Measured loop. */
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				break;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i;
+
+	/* Publish accumulator to defeat dead-code elimination. */
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+static int
+worker_run_bulk(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	void *ptrs[BATCH_N];
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i, j;
+	unsigned int bulk_n = wa->bulk_n;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			return -1;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			break;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i * bulk_n;
+
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+/*
+ * Launch workers on the first 'n_workers' worker lcores, run
+ * either the baseline (scratch != NULL) or the alloc path
+ * (scratch == NULL), and return the mean per-op cycle cost
+ * averaged across participating workers.
+ *
+ * On any worker error, returns -1.0.
+ */
+static double
+run_multi_workers(const struct allocator *alloc, unsigned int n_workers,
+		void *const *scratches, unsigned int bulk_n)
+{
+	struct worker_args wargs[RTE_MAX_LCORE];
+	unsigned int worker_lcores[MAX_MULTI_LCORES];
+	unsigned int n = 0;
+	unsigned int lcore_id;
+	unsigned int i;
+	lcore_function_t *fn = bulk_n > 0 ? worker_run_bulk : worker_run;
+
+	/* Collect the first n_workers worker lcores. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		if (n >= n_workers)
+			break;
+		worker_lcores[n++] = lcore_id;
+	}
+	if (n < n_workers)
+		return -1.0;
+
+	/* Prepare per-worker args. */
+	for (i = 0; i < n_workers; i++) {
+		struct worker_args *wa = &wargs[worker_lcores[i]];
+
+		wa->alloc = alloc;
+		wa->scratch = scratches != NULL ? scratches[i] : NULL;
+		wa->iters = MULTI_OPS;
+		wa->warmup = MULTI_WARMUP;
+		wa->bulk_n = bulk_n;
+		rte_atomic_store_explicit(&wa->start_flag, false,
+				rte_memory_order_relaxed);
+	}
+
+	/* Launch workers. They spin on start_flag until released. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_remote_launch(fn, &wargs[worker_lcores[i]],
+				worker_lcores[i]);
+
+	/* Release all workers roughly simultaneously. */
+	for (i = 0; i < n_workers; i++)
+		rte_atomic_store_explicit(
+			&wargs[worker_lcores[i]].start_flag, true,
+			rte_memory_order_release);
+
+	/* Wait for completion. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_wait_lcore(worker_lcores[i]);
+
+	/* Aggregate: mean cycles per op across workers. */
+	{
+		double sum_cycles_per_op = 0.0;
+		unsigned int n_ok = 0;
+
+		for (i = 0; i < n_workers; i++) {
+			struct worker_args *wa = &wargs[worker_lcores[i]];
+
+			if (wa->err != 0 || wa->ops == 0)
+				return -1.0;
+			sum_cycles_per_op +=
+				(double)wa->cycles / (double)wa->ops;
+			n_ok++;
+		}
+		return sum_cycles_per_op / n_ok;
+	}
+}
+
+/*
+ * One sub-run of Scenario 4: given an allocator and a worker
+ * count, return (baseline, alloc_path) mean cycles per op.
+ */
+static void
+run_multi_lcore(const struct allocator *alloc, unsigned int n_workers,
+		unsigned int bulk_n, double *baseline, double *alloc_path)
+{
+	void *scratches[MAX_MULTI_LCORES] = {0};
+	unsigned int n_alloced = 0;
+	unsigned int i;
+
+	*baseline = -1.0;
+	*alloc_path = -1.0;
+
+	if (alloc->setup(MULTI_SIZE, n_workers * 64) < 0)
+		return;
+
+	/* Baseline: pre-allocate one scratch per worker. */
+	for (i = 0; i < n_workers; i++) {
+		scratches[i] = alloc->alloc();
+		if (scratches[i] == NULL)
+			goto err;
+		n_alloced++;
+	}
+
+	*baseline = run_multi_workers(alloc, n_workers, scratches, 0);
+
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	n_alloced = 0;
+
+	/* Alloc path: workers alloc+free each iter. */
+	*alloc_path = run_multi_workers(alloc, n_workers, NULL, bulk_n);
+
+	alloc->teardown();
+	return;
+err:
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	alloc->teardown();
+}
+
+/* Reporting -------------------------------------------------------- */
+
+static void
+print_header(const char *title)
+{
+	size_t i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < N_SIZES; i++)
+		TEST_LOG(" %10zu B", SIZES[i]);
+	TEST_LOG("\n");
+}
+
+static void
+print_row(const char *name, const double *values)
+{
+	size_t i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < N_SIZES; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %12s", "--");
+		else
+			TEST_LOG(" %12.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_header(const char *title, const unsigned int *lcore_counts,
+		unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < n_counts; i++)
+		TEST_LOG(" %8u lcore%c", lcore_counts[i],
+				lcore_counts[i] == 1 ? ' ' : 's');
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_row(const char *name, const double *values, unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < n_counts; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %14s", "--");
+		else
+			TEST_LOG(" %14.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+/* Driver ----------------------------------------------------------- */
+
+static int
+test_fastmem_perf(void)
+{
+	size_t i;
+	size_t a;
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	TEST_LOG("\nfastmem performance — single-lcore, fixed-size\n");
+	TEST_LOG("All numbers are TSC cycles.\n");
+
+	/* Scenario 1: tight alloc+free. */
+	print_header("Scenario 1: Single-object hot path — cycles per (alloc + free)");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			vals[i] = run_tight(&allocators[a], SIZES[i]);
+		print_row(allocators[a].name, vals);
+	}
+
+	/* Scenario 2: batched, FIFO free. */
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 3: batched, reverse free. */
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 4: multi-lcore alloc/work/free with baseline. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		/* Sweep lcore counts: 1, 2, 4, 8, ... up to max_workers. */
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		/* Ensure max_workers is the final column if not power of two. */
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 4 (Multi-lcore contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 4 parameters: size=%u B\n",
+				MULTI_SIZE);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a], lcore_counts[c],
+							0, &base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 4: Multi-lcore contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	/* Scenario 5: multi-lcore bulk alloc/work/free. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+		unsigned int bulk_n = 8;
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 5 (Multi-lcore bulk contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 5 parameters: size=%u B, "
+				"bulk=%u\n",
+				MULTI_SIZE, bulk_n);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a],
+							lcore_counts[c], bulk_n,
+							&base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 5: Multi-lcore bulk contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	TEST_LOG("\n");
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_perf_autotest, test_fastmem_perf);
diff --git a/app/test/test_fastmem_profile.c b/app/test/test_fastmem_profile.c
new file mode 100644
index 0000000000..9a5dc94018
--- /dev/null
+++ b/app/test/test_fastmem_profile.c
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+/*
+ * A minimal fastmem workload intended for use with perf record /
+ * perf report. Runs a tight alloc/free loop for a fixed duration
+ * so that sampling profilers can attribute cycles to individual
+ * functions and instructions within the fastmem hot path.
+ *
+ * Usage:
+ *   perf record -g -- dpdk-test --no-huge --no-pci -m 8192 \
+ *       -l 0 <<< fastmem_profile_cache_hit_autotest
+ *   perf report
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+/* Duration of each sub-test in TSC cycles (3 s at the actual TSC rate). */
+#define PROFILE_DURATION_CYCLES (3ULL * rte_get_tsc_hz())
+
+/* Allocation size for the profiling workload. */
+#define PROFILE_SIZE 256u
+
+/*
+ * Sub-test 1: tight alloc+free, exercises only the per-lcore
+ * cache (no bin interaction after warmup).
+ */
+static int
+profile_cache_hit(void)
+{
+	uint64_t deadline;
+	uint64_t ops = 0;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		void *p = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+
+		if (p == NULL)
+			return -1;
+		rte_fastmem_free(p);
+		ops++;
+	}
+
+	printf("  cache_hit: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+/*
+ * Sub-test 2: alloc N then free N, where N exceeds the cache
+ * capacity. This forces repeated cache refills and drains,
+ * exercising the bin lock and slab free-list traversal.
+ */
+#define PROFILE_BATCH 256u
+
+static int
+profile_cache_miss(void)
+{
+	void *ptrs[PROFILE_BATCH];
+	uint64_t deadline;
+	uint64_t ops = 0;
+	unsigned int i;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		for (i = 0; i < PROFILE_BATCH; i++) {
+			ptrs[i] = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+			if (ptrs[i] == NULL)
+				return -1;
+		}
+		for (i = 0; i < PROFILE_BATCH; i++)
+			rte_fastmem_free(ptrs[i]);
+		ops += PROFILE_BATCH;
+	}
+
+	printf("  cache_miss: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_hit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-hit workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_hit() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_miss(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-miss workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_miss() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_profile_cache_hit_autotest,
+		test_fastmem_profile_cache_hit);
+REGISTER_PERF_TEST(fastmem_profile_cache_miss_autotest,
+		test_fastmem_profile_cache_miss);
-- 
2.43.0



* RE: [RFC v4 0/3] lib/fastmem: fast small-object allocator
  2026-05-30  9:26           ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
                               ` (2 preceding siblings ...)
  2026-05-30  9:26             ` [RFC v4 3/3] app/test: add fastmem test suite Mattias Rönnblom
@ 2026-06-10 12:35             ` Konstantin Ananyev
  3 siblings, 0 replies; 38+ messages in thread
From: Konstantin Ananyev @ 2026-06-10 12:35 UTC (permalink / raw)
  To: Mattias Rönnblom, dev@dpdk.org
  Cc: Morten Brørup, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger, Bruce Richardson

Hi Mattias,

> This RFC introduces fastmem, a general-purpose small-object allocator
> for DPDK. It is intended to replace per-type mempools with a single
> allocator that handles arbitrary sizes, grows on demand, and matches
> mempool-level performance on the hot path.

As stated before, I submitted an RFC for the one we use internally:
https://patchwork.dpdk.org/project/dpdk/patch/20260610103918.96857-1-konstantin.ananyev@huawei.com/
Many things and ideas are similar, some are not.
Below I have tried to summarize the main differences (as I see them).
I do understand that our use cases and requirements differ,
but maybe we can find a blend that fits all of us.
Two more things are probably necessary to move forward:
- a unified set of stress/performance test cases that we can run
  against all three: mempool/fastmem/memtank.
- some sort of guinea pig: a DPDK sub-component where this new lib
  can be applied. We could go straight for the mbuf, but that is probably
  quite an ambitious choice for a first integration. Again, this is just one
  of the possible usage scenarios.

Let me know your thoughts on this.
Thanks
Konstantin 

> Motivation
> ----------
> 
> DPDK applications commonly maintain many mempools — one per object
> type (connections, sessions, timers, work items). Each must be sized
> up front, wastes memory when over-provisioned, and cannot serve
> objects of a different size. Fastmem eliminates this by accepting
> arbitrary sizes at runtime, backed by a slab allocator that
> repurposes memory across size classes as demand shifts.

I agree about the first one - it is a big problem that you have to over-provision
everything with mempool these days.
As for forcing the user to explicitly create multiple pools - for me that is not such a big problem;
after all, in most cases the user knows upfront the size of the objects he needs to alloc/free.
AFAIK the majority of SLAB-based allocators these days support both flavors:
the user can create/maintain his own SLAB for specific object types, or use a generic
alloc/free that is backed by a bunch of SLABs underneath, each serving a specific size.
Maybe we can do the same and support both too.

> 
> Design
> ------
> 
> Three-layer architecture:
> 
> 1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
>    reserved lazily (or pre-reserved for deterministic latency).
> 2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.

In many cases, the user doesn't need DMA-capable memory for such objects,
so simple rte_malloc or even libc malloc might be enough.
I understand the intention - it is probably the fastest way to do things -
but I think it is way too constrained.
Maybe the best approach is to do what memtank does:
allow the user to define his own alloc/free/init callbacks; then the fastmem
approach becomes just one of the possible cases.
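
A minimal sketch of what such pluggable backing-memory callbacks could
look like. All names here (backend_ops, chunk_create, the libc_* helpers)
are hypothetical and are not taken from memtank or fastmem; the only point
is that the allocator core calls through the hooks, so a memzone-backed
variant would be just another set of callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical backend hooks; not the memtank or fastmem API. */
struct backend_ops {
	void *(*alloc_chunk)(size_t size, void *ctx); /* get backing memory */
	void (*free_chunk)(void *chunk, void *ctx);   /* give it back */
	void (*init_obj)(void *obj, void *ctx);       /* optional constructor */
	void *ctx;                                    /* opaque user context */
};

/* Example backend: plain libc heap, no DMA or hugepage guarantees. */
static void *
libc_alloc(size_t size, void *ctx)
{
	(void)ctx;
	return malloc(size);
}

static void
libc_free(void *chunk, void *ctx)
{
	(void)ctx;
	free(chunk);
}

/* Example constructor: ctx points at the object size; zero the object. */
static void
zero_init(void *obj, void *ctx)
{
	memset(obj, 0, *(size_t *)ctx);
}

/* Obtain one chunk holding n objects and run the constructor on each. */
static void *
chunk_create(const struct backend_ops *ops, size_t obj_size, unsigned int n)
{
	unsigned char *chunk;
	unsigned int i;

	assert(ops != NULL && ops->alloc_chunk != NULL);

	chunk = ops->alloc_chunk(obj_size * n, ops->ctx);
	if (chunk == NULL)
		return NULL;

	if (ops->init_obj != NULL)
		for (i = 0; i < n; i++)
			ops->init_obj(chunk + (size_t)i * obj_size, ops->ctx);

	return chunk;
}
```

With this shape, the choice between libc, rte_malloc and IOVA-contiguous
memzones is made once, at pool creation time, and stays off the hot path.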

>    The alignment enables O(1) slab lookup from any object pointer
>    via bitmask — no radix tree or index structure. Slabs move
>    freely between 18 power-of-2 size classes (8 B to 1 MiB).

That is a cool idea, and I thought about doing the same: limit the possible sizes
for a SLAB to power-of-two values.
But then I realized that we still need to store some extra metadata inside
the object for stats and sanity-checking. So an extra 8B for a SLAB
pointer doesn't make much difference.
But again - I think we can support both and make it configurable at creation time.
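
To make the two variants concrete, here is a minimal sketch of both ways
to get from an object pointer back to its slab. The 2 MiB slab size is the
one from the RFC; the struct and function names are hypothetical and the
layout is simplified (this is not the fastmem or memtank code).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SLAB_SIZE ((size_t)1 << 21)	/* 2 MiB, 2 MiB-aligned */
#define SLAB_MASK (SLAB_SIZE - 1)

/* Simplified slab header living at offset 0 of the slab. */
struct slab_hdr {
	uint32_t slot_size;
};

/* Variant A: no per-object metadata; clear the low 21 address bits. */
static struct slab_hdr *
slab_of_masked(const void *obj)
{
	return (struct slab_hdr *)((uintptr_t)obj & ~(uintptr_t)SLAB_MASK);
}

/* Variant B: a pointer-sized header placed right before each object. */
struct obj_hdr {
	struct slab_hdr *slab;
};

static struct slab_hdr *
slab_of_header(const void *obj)
{
	const struct obj_hdr *h = (const struct obj_hdr *)obj - 1;

	return h->slab;
}

/* Allocate one slab; the alignment is what makes variant A possible. */
static struct slab_hdr *
slab_create(uint32_t slot_size)
{
	struct slab_hdr *s = aligned_alloc(SLAB_SIZE, SLAB_SIZE);

	if (s != NULL)
		s->slot_size = slot_size;
	return s;
}
```

Variant A needs the alignment guarantee but costs no memory per object;
variant B works with arbitrarily aligned backing memory at the price of
one pointer per object.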

> 3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
>    path). Cache misses trigger bulk transfers to/from the shared
>    bin under a spinlock.

memtank lacks per-lcore caches right now, mostly due to lack of time
to implement them. It is definitely a good feature - the way to go.

> Key properties:
> 
> - Zero per-object metadata in the production build.
> - NUMA-aware, with per-socket bins and free-slab pools.
> - DMA-usable memory with O(1) virt-to-IOVA translation.
> - Bulk alloc/free with all-or-nothing semantics.

Personally, I don't find all-or-nothing semantics very convenient.
For most of the cases we care about, we use best-effort semantics.
Again, we can probably support both, same as the rte_ring API.
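
For illustration, a toy LIFO pool offering both flavors, in the spirit of
the bulk/burst pairs in rte_ring (e.g. rte_ring_dequeue_bulk() vs.
rte_ring_dequeue_burst()). The toy_* names are hypothetical and the sketch
is single-threaded; it only shows the difference in semantics.

```c
#include <assert.h>
#include <stddef.h>

struct toy_pool {
	void **objs;		/* stack of free object pointers */
	unsigned int count;	/* number of pointers on the stack */
};

/* Best effort: hand out up to n objects; return how many were taken. */
static unsigned int
toy_alloc_burst(struct toy_pool *p, void **out, unsigned int n)
{
	unsigned int i;

	assert(p != NULL && out != NULL);

	if (n > p->count)
		n = p->count;
	for (i = 0; i < n; i++)
		out[i] = p->objs[--p->count];
	return n;
}

/* All or nothing: hand out exactly n objects, or none and return -1. */
static int
toy_alloc_bulk(struct toy_pool *p, void **out, unsigned int n)
{
	if (n > p->count)
		return -1;
	(void)toy_alloc_burst(p, out, n);
	return 0;
}
```

The bulk variant can be layered on top of the burst one, as above, so
offering both does not duplicate the underlying refill logic.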

> - Backing memory never returned during lifetime (slabs recycled).

For our case it is important to be able to return memory back to the system.
The memtank lib supports it (though of course some fragmentation is possible).
Again, it is much easier with separate pools.
What I really like about fastmem is that one SLAB can re-use memory from a different
one; that seems useful and might mitigate memory footprint growth to some extent.
Again, I suppose both flavors can coexist:
individual pools can grow/shrink, while fastmem (a bunch of predefined pools) will not.

> - Non-EAL threads supported (bypass cache, take bin lock).
> - Secondary process support (lazy attach, no per-lcore caches).
> 


* [RFC v3 2/3] lib: add fastmem library
  2026-05-27 17:30       ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-27 17:30         ` [RFC v3 1/3] doc: add fastmem programming guide Mattias Rönnblom
@ 2026-05-27 17:30         ` Mattias Rönnblom
  2026-05-28  9:11           ` Morten Brørup
  2026-05-27 17:30         ` [RFC v3 3/3] app/test: add fastmem test suite Mattias Rönnblom
  2026-05-28  9:02         ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Morten Brørup
  3 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 17:30 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

Introduce fastmem, a fast general-purpose small-object allocator
for DPDK applications. It allows an application to replace its
many per-type mempools with a single allocator that handles
arbitrary sizes, grows on demand, and offers mempool-level
performance on the hot path.

Applications that manage many object types (connections, sessions,
work items, timers) currently maintain a separate mempool for each,
requiring upfront sizing and wasting memory on over-provisioned
pools. Fastmem removes both constraints.

Key properties:

 * Huge-page-backed, NUMA-aware, DMA-usable.
 * Per-lcore caches for lock-free alloc/free on EAL threads.
 * Bulk alloc and free APIs.
 * Power-of-two size classes from 8 B to 1 MiB.
 * Backing memory grows lazily; rte_fastmem_reserve() allows
   upfront reservation to avoid latency spikes.
 * Always-on per-lcore and per-class statistics.

Bounded to small objects; requests above rte_fastmem_max_size()
are rejected. Replacing rte_malloc is currently not a goal.

--

RFC v3:
 * Add rte_fastmem_realloc().
 * Add __rte_malloc and __rte_dealloc attributes to allocation functions.
 * Remove __rte_alloc_size and __rte_alloc_align attributes.
   These told the compiler the object is exactly the requested
   size, but fastmem rounds up to the size class and the caller
   may use the full class size. The mismatch caused false
   _FORTIFY_SOURCE buffer-overflow aborts.
 * Extract normalize_align() helper replacing repeated inline
   alignment validation logic.
 * Remove inline directives from static functions (redundant;
   both GCC and clang inline them at -O2 regardless).

RFC v2:
 * Fix use-after-free in rte_fastmem_deinit() when caches were
   allocated cross-socket. Restructured teardown into three phases.
 * Add defensive bounds check to local_socket_id() final fallback.
 * Add secondary process support. Shared state is discovered lazily
   on first allocation; secondaries operate without per-lcore caches.
 * Add handle-based allocation API (rte_fastmem_hlookup,
   rte_fastmem_halloc, rte_fastmem_halloc_bulk).
 * Fix clang -Wthread-safety-analysis warnings.
 * Move fastmem to alphabetical position in lib/meson.build.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 doc/api/doxy-api-index.md |    1 +
 doc/api/doxy-api.conf.in  |    1 +
 lib/fastmem/meson.build   |    6 +
 lib/fastmem/rte_fastmem.c | 1748 +++++++++++++++++++++++++++++++++++++
 lib/fastmem/rte_fastmem.h |  815 +++++++++++++++++
 lib/meson.build           |    1 +
 6 files changed, 2572 insertions(+)
 create mode 100644 lib/fastmem/meson.build
 create mode 100644 lib/fastmem/rte_fastmem.c
 create mode 100644 lib/fastmem/rte_fastmem.h

diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 9296042119..7ebf1201ce 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -70,6 +70,7 @@ The public API headers are grouped by topics:
   [memzone](@ref rte_memzone.h),
   [mempool](@ref rte_mempool.h),
   [malloc](@ref rte_malloc.h),
+  [fastmem](@ref rte_fastmem.h),
   [memcpy](@ref rte_memcpy.h)
 
 - **timers**:
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index bedd944681..4355e9fb2d 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -43,6 +43,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/efd \
                           @TOPDIR@/lib/ethdev \
                           @TOPDIR@/lib/eventdev \
+                          @TOPDIR@/lib/fastmem \
                           @TOPDIR@/lib/fib \
                           @TOPDIR@/lib/gpudev \
                           @TOPDIR@/lib/graph \
diff --git a/lib/fastmem/meson.build b/lib/fastmem/meson.build
new file mode 100644
index 0000000000..6c7834608f
--- /dev/null
+++ b/lib/fastmem/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2026 Ericsson AB
+
+sources = files('rte_fastmem.c')
+headers = files('rte_fastmem.h')
+deps += ['eal']
diff --git a/lib/fastmem/rte_fastmem.c b/lib/fastmem/rte_fastmem.c
new file mode 100644
index 0000000000..5eff2ff693
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.c
@@ -0,0 +1,1748 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_eal.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_spinlock.h>
+
+#include <rte_fastmem.h>
+
+#include <eal_export.h>
+
+RTE_LOG_REGISTER_DEFAULT(fastmem_logtype, NOTICE);
+
+#define RTE_LOGTYPE_FASTMEM fastmem_logtype
+
+#define FASTMEM_LOG(level, ...) \
+	RTE_LOG_LINE(level, FASTMEM, "" __VA_ARGS__)
+
+#define FASTMEM_MEMZONE_SIZE_LOG2 27                            /* 128 MiB */
+#define FASTMEM_MEMZONE_SIZE ((size_t)1 << FASTMEM_MEMZONE_SIZE_LOG2)
+
+#define FASTMEM_SLAB_SIZE_LOG2 21                               /*   2 MiB */
+#define FASTMEM_SLAB_SIZE ((size_t)1 << FASTMEM_SLAB_SIZE_LOG2)
+#define FASTMEM_SLAB_MASK (FASTMEM_SLAB_SIZE - 1)
+
+#define FASTMEM_SLABS_PER_MEMZONE (FASTMEM_MEMZONE_SIZE / FASTMEM_SLAB_SIZE)
+
+#define FASTMEM_MAX_MEMZONES_PER_SOCKET 64
+
+#define FASTMEM_MIN_CLASS_LOG2 3                                /*   8 B */
+#define FASTMEM_MAX_CLASS_LOG2 20                               /*   1 MiB */
+#define FASTMEM_N_CLASSES (FASTMEM_MAX_CLASS_LOG2 - FASTMEM_MIN_CLASS_LOG2 + 1)
+
+#define FASTMEM_MIN_SIZE ((size_t)1 << FASTMEM_MIN_CLASS_LOG2)
+#define FASTMEM_MAX_ALLOC_SIZE ((size_t)1 << FASTMEM_MAX_CLASS_LOG2)
+
+#define FASTMEM_SLAB_HEADER_SIZE RTE_CACHE_LINE_SIZE
+
+#define FASTMEM_CACHE_BASE_CAPACITY 64
+#define FASTMEM_CACHE_FLOOR_CAPACITY 4
+#define FASTMEM_CACHE_BASE_CLASS_LOG2 12                        /* 4 KiB */
+
+struct fastmem_bin;
+
+/*
+ * Slab header at offset 0 of each 2 MiB slab. Either free (linked
+ * via next_free) or assigned to a bin (linked via list).
+ */
+struct __rte_aligned(FASTMEM_SLAB_HEADER_SIZE) fastmem_slab {
+	struct fastmem_bin *bin;
+	void *free_head;
+	uint32_t free_count;
+	uint32_t n_slots;
+	struct fastmem_slab *next_free;
+	TAILQ_ENTRY(fastmem_slab) list;
+	rte_iova_t iova_base;
+};
+
+TAILQ_HEAD(fastmem_slab_list, fastmem_slab);
+
+struct fastmem_bin {
+	rte_spinlock_t lock;
+	uint32_t slot_size;
+	uint32_t slots_per_slab;
+	uint32_t class_idx;
+	struct fastmem_slab_list partial;
+	struct fastmem_slab_list full;
+	int socket_id;
+	uint64_t slab_acquires;
+	uint64_t slab_releases;
+	uint32_t slabs_partial;
+	uint32_t slabs_full;
+};
+
+/* Per-(lcore, class, socket) bounded LIFO of free object pointers. */
+struct __rte_cache_aligned fastmem_cache {
+	uint32_t count;
+	uint32_t capacity;
+	uint32_t target;
+	uint64_t alloc_cache_hits;
+	uint64_t alloc_cache_misses;
+	uint64_t alloc_nomem;
+	uint64_t free_cache_hits;
+	uint64_t free_cache_misses;
+	void *objs[];
+};
+
+struct fastmem_socket_state {
+	rte_spinlock_t lock;
+	struct fastmem_slab *free_head;
+	size_t reserved_bytes;
+	size_t memory_limit;
+	unsigned int n_memzones;
+	unsigned int memzone_seq;
+	const struct rte_memzone *memzones[FASTMEM_MAX_MEMZONES_PER_SOCKET];
+	struct fastmem_bin bins[FASTMEM_N_CLASSES];
+	struct fastmem_cache *caches[RTE_MAX_LCORE][FASTMEM_N_CLASSES];
+};
+
+struct fastmem {
+	struct fastmem_socket_state sockets[RTE_MAX_NUMA_NODES];
+};
+
+static struct fastmem *fastmem;
+static const struct rte_memzone *fastmem_mz;
+static bool fastmem_is_primary; /* cached; avoids function call on hot path */
+
+static struct fastmem *
+fastmem_get(void)
+{
+	const struct rte_memzone *mz;
+
+	if (likely(fastmem != NULL))
+		return fastmem;
+
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		rte_errno = ENODEV;
+		return NULL;
+	}
+
+	mz = rte_memzone_lookup("fastmem_state");
+	if (mz == NULL) {
+		rte_errno = ENODEV;
+		return NULL;
+	}
+
+	fastmem_mz = mz;
+	fastmem = mz->addr;
+	return fastmem;
+}
+
+static unsigned int
+size_to_class(size_t size, size_t align)
+{
+	size_t effective;
+	unsigned int log2;
+
+	effective = size < FASTMEM_MIN_SIZE ? FASTMEM_MIN_SIZE : size;
+	if (align > effective)
+		effective = align;
+
+	log2 = 64u - rte_clz64(effective - 1);
+
+	if (log2 < FASTMEM_MIN_CLASS_LOG2)
+		log2 = FASTMEM_MIN_CLASS_LOG2;
+	if (log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+static size_t
+class_size(unsigned int class_idx)
+{
+	return (size_t)1 << (class_idx + FASTMEM_MIN_CLASS_LOG2);
+}
+
+/**
+ * Normalize and validate the alignment argument.
+ * Returns true on success (align updated in place), false on invalid input.
+ */
+static bool
+normalize_align(size_t *align)
+{
+	if (*align == 0) {
+		*align = RTE_CACHE_LINE_SIZE;
+		return true;
+	}
+	return rte_is_power_of_2(*align);
+}
+
+static_assert(sizeof(struct fastmem_slab) == FASTMEM_SLAB_HEADER_SIZE,
+	"fastmem slab header must fit in exactly one cache line");
+static_assert(sizeof(struct fastmem_slab) <= FASTMEM_SLAB_SIZE,
+	"slab header larger than a slab makes no sense");
+
+static struct fastmem_slab *
+slab_of(void *obj)
+{
+	return (struct fastmem_slab *)
+		((uintptr_t)obj & ~(uintptr_t)FASTMEM_SLAB_MASK);
+}
+
+static size_t
+slab_slot0_offset(size_t class_size)
+{
+	return class_size < FASTMEM_SLAB_HEADER_SIZE ?
+		FASTMEM_SLAB_HEADER_SIZE : class_size;
+}
+
+static uint32_t
+slab_slot_count(size_t class_size)
+{
+	size_t offset = slab_slot0_offset(class_size);
+
+	return (uint32_t)((FASTMEM_SLAB_SIZE - offset) / class_size);
+}
+
+/* Must be called with bin->lock held. */
+static void
+slab_init(struct fastmem_bin *bin, struct fastmem_slab *slab)
+{
+	size_t slot_size = bin->slot_size;
+	size_t offset = slab_slot0_offset(slot_size);
+	uint32_t n = bin->slots_per_slab;
+	void *prev = NULL;
+	uint32_t i;
+
+	slab->bin = bin;
+	slab->n_slots = n;
+	slab->free_count = n;
+
+	/* Link slots from last to first so pops yield ascending addresses. */
+	for (i = n; i > 0; i--) {
+		void *slot = RTE_PTR_ADD(slab, offset + (i - 1) * slot_size);
+		*(void **)slot = prev;
+		prev = slot;
+	}
+	slab->free_head = prev;
+}
+
+/* Must be called with socket->lock held. */
+static int
+grow_socket(struct fastmem_socket_state *socket, int socket_id)
+{
+	char name[RTE_MEMZONE_NAMESIZE];
+	const struct rte_memzone *mz;
+	unsigned int i;
+
+	if (socket->reserved_bytes + FASTMEM_MEMZONE_SIZE > socket->memory_limit) {
+		FASTMEM_LOG(ERR,
+			"reserve would exceed memory_limit (%zu) on socket %d",
+			socket->memory_limit, socket_id);
+		return -ENOMEM;
+	}
+
+	if (socket->n_memzones == FASTMEM_MAX_MEMZONES_PER_SOCKET) {
+		FASTMEM_LOG(ERR,
+			"reached per-socket memzone cap (%u) on socket %d",
+			FASTMEM_MAX_MEMZONES_PER_SOCKET, socket_id);
+		return -ENOMEM;
+	}
+
+	snprintf(name, sizeof(name), "fastmem_%d_%u", socket_id,
+			socket->memzone_seq++);
+
+	mz = rte_memzone_reserve_aligned(name, FASTMEM_MEMZONE_SIZE,
+			socket_id, RTE_MEMZONE_IOVA_CONTIG,
+			FASTMEM_SLAB_SIZE);
+	if (mz == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to reserve %zu-byte memzone '%s' on socket %d: %s",
+			(size_t)FASTMEM_MEMZONE_SIZE, name, socket_id,
+			rte_strerror(rte_errno));
+		return -ENOMEM;
+	}
+
+	socket->memzones[socket->n_memzones++] = mz;
+	socket->reserved_bytes += FASTMEM_MEMZONE_SIZE;
+
+	for (i = 0; i < FASTMEM_SLABS_PER_MEMZONE; i++) {
+		struct fastmem_slab *slab = RTE_PTR_ADD(mz->addr,
+				i * FASTMEM_SLAB_SIZE);
+
+		slab->iova_base = mz->iova + i * FASTMEM_SLAB_SIZE;
+		slab->next_free = socket->free_head;
+		socket->free_head = slab;
+	}
+
+	FASTMEM_LOG(DEBUG,
+		"reserved memzone '%s' (%zu bytes) on socket %d; %zu slabs added",
+		name, (size_t)FASTMEM_MEMZONE_SIZE, socket_id,
+		(size_t)FASTMEM_SLABS_PER_MEMZONE);
+
+	return 0;
+}
+
+static struct fastmem_slab *
+slab_acquire(struct fastmem_socket_state *socket, int socket_id)
+{
+	struct fastmem_slab *slab;
+
+	rte_spinlock_lock(&socket->lock);
+
+	if (socket->free_head == NULL) {
+		int rc = grow_socket(socket, socket_id);
+
+		if (rc < 0) {
+			rte_spinlock_unlock(&socket->lock);
+			return NULL;
+		}
+	}
+
+	slab = socket->free_head;
+	socket->free_head = slab->next_free;
+	slab->next_free = NULL;
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return slab;
+}
+
+static void
+slab_release(struct fastmem_socket_state *socket,
+		struct fastmem_slab *slab)
+{
+	rte_spinlock_lock(&socket->lock);
+
+	slab->next_free = socket->free_head;
+	socket->free_head = slab;
+
+	rte_spinlock_unlock(&socket->lock);
+}
+
+static void
+bin_init(struct fastmem_bin *bin, unsigned int class_idx, int socket_id)
+{
+	size_t slot_size = class_size(class_idx);
+
+	rte_spinlock_init(&bin->lock);
+	bin->slot_size = (uint32_t)slot_size;
+	bin->slots_per_slab = slab_slot_count(slot_size);
+	bin->class_idx = class_idx;
+	TAILQ_INIT(&bin->partial);
+	TAILQ_INIT(&bin->full);
+	bin->socket_id = socket_id;
+	bin->slab_acquires = 0;
+	bin->slab_releases = 0;
+	bin->slabs_partial = 0;
+	bin->slabs_full = 0;
+}
+
+static void
+bin_release(struct fastmem_bin *bin, struct fastmem_socket_state *socket)
+{
+	struct fastmem_slab *slab;
+
+	while ((slab = TAILQ_FIRST(&bin->partial)) != NULL) {
+		TAILQ_REMOVE(&bin->partial, slab, list);
+		slab_release(socket, slab);
+	}
+	while ((slab = TAILQ_FIRST(&bin->full)) != NULL) {
+		TAILQ_REMOVE(&bin->full, slab, list);
+		slab_release(socket, slab);
+	}
+}
+
+static unsigned int
+bin_pop_locked(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	unsigned int got = 0;
+
+	while (got < n) {
+		struct fastmem_slab *slab = TAILQ_FIRST(&bin->partial);
+		void *obj;
+
+		if (slab == NULL)
+			break;
+
+		obj = slab->free_head;
+		slab->free_head = *(void **)obj;
+		slab->free_count--;
+		objs[got++] = obj;
+
+		if (slab->free_count == 0) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			TAILQ_INSERT_HEAD(&bin->full, slab, list);
+			bin->slabs_partial--;
+			bin->slabs_full++;
+		}
+	}
+
+	return got;
+}
+
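+/*
+ * Must be called with bin->lock held. Slabs that become entirely
+ * free are unlinked from the bin and collected in @p to_release, so
+ * the caller can return them to the socket pool after dropping the
+ * lock. Returns the number of slabs collected.
+ */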
+static unsigned int
+bin_push_locked(struct fastmem_bin *bin, void **objs, unsigned int n,
+		struct fastmem_slab **to_release)
+{
+	unsigned int n_release = 0;
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		void *obj = objs[i];
+		struct fastmem_slab *slab = slab_of(obj);
+		bool was_full = slab->free_count == 0;
+
+		*(void **)obj = slab->free_head;
+		slab->free_head = obj;
+		slab->free_count++;
+
+		if (was_full) {
+			TAILQ_REMOVE(&bin->full, slab, list);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_full--;
+			bin->slabs_partial++;
+		}
+
+		if (slab->free_count == slab->n_slots) {
+			TAILQ_REMOVE(&bin->partial, slab, list);
+			bin->slabs_partial--;
+			bin->slab_releases++;
+			to_release[n_release++] = slab;
+		}
+	}
+
+	return n_release;
+}
+
+static void *
+bin_alloc_one(struct fastmem_bin *bin)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	void *obj;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (bin_pop_locked(bin, &obj, 1) == 0) {
+		struct fastmem_slab *slab;
+
+		if (TAILQ_FIRST(&bin->partial) != NULL)
+			continue;
+
+		rte_spinlock_unlock(&bin->lock);
+
+		slab = slab_acquire(socket, bin->socket_id);
+		if (slab == NULL) {
+			rte_errno = ENOMEM;
+			return NULL;
+		}
+
+		rte_spinlock_lock(&bin->lock);
+
+		if (unlikely(TAILQ_FIRST(&bin->partial) != NULL)) {
+			/* Release surplus slab without holding bin->lock. */
+			rte_spinlock_unlock(&bin->lock);
+			slab_release(socket, slab);
+			rte_spinlock_lock(&bin->lock);
+		} else {
+			slab_init(bin, slab);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_partial++;
+			bin->slab_acquires++;
+		}
+	}
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return obj;
+}
+
+static unsigned int
+bin_alloc_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	unsigned int got = 0;
+
+	rte_spinlock_lock(&bin->lock);
+
+	while (got < n) {
+		struct fastmem_slab *slab;
+
+		got += bin_pop_locked(bin, objs + got, n - got);
+		if (got == n)
+			break;
+
+		if (TAILQ_FIRST(&bin->partial) != NULL)
+			continue;
+
+		rte_spinlock_unlock(&bin->lock);
+
+		slab = slab_acquire(socket, bin->socket_id);
+		if (slab == NULL) {
+			rte_spinlock_lock(&bin->lock);
+			break;
+		}
+
+		rte_spinlock_lock(&bin->lock);
+
+		if (unlikely(TAILQ_FIRST(&bin->partial) != NULL)) {
+			/* Release surplus slab without holding bin->lock. */
+			rte_spinlock_unlock(&bin->lock);
+			slab_release(socket, slab);
+			rte_spinlock_lock(&bin->lock);
+		} else {
+			slab_init(bin, slab);
+			TAILQ_INSERT_HEAD(&bin->partial, slab, list);
+			bin->slabs_partial++;
+			bin->slab_acquires++;
+		}
+	}
+
+	rte_spinlock_unlock(&bin->lock);
+
+	return got;
+}
+
+static void
+bin_free_one(struct fastmem_bin *bin, void *obj)
+{
+	unsigned int n_release;
+	struct fastmem_slab *slab_to_release = NULL;
+	struct fastmem_socket_state *socket;
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, &obj, 1, &slab_to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	if (n_release > 0) {
+		socket = &fastmem->sockets[bin->socket_id];
+		slab_release(socket, slab_to_release);
+	}
+}
+
+static void
+bin_free_bulk(struct fastmem_bin *bin, void **objs, unsigned int n)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[bin->socket_id];
+	struct fastmem_slab *to_release[FASTMEM_CACHE_BASE_CAPACITY];
+	unsigned int n_release;
+	unsigned int i;
+
+	RTE_VERIFY(n <= RTE_DIM(to_release));
+
+	rte_spinlock_lock(&bin->lock);
+	n_release = bin_push_locked(bin, objs, n, to_release);
+	rte_spinlock_unlock(&bin->lock);
+
+	for (i = 0; i < n_release; i++)
+		slab_release(socket, to_release[i]);
+}
+
+static unsigned int
+cache_capacity(unsigned int class_idx)
+{
+	unsigned int class_log2 = class_idx + FASTMEM_MIN_CLASS_LOG2;
+	unsigned int shift;
+	unsigned int cap;
+
+	if (class_log2 <= FASTMEM_CACHE_BASE_CLASS_LOG2)
+		return FASTMEM_CACHE_BASE_CAPACITY;
+
+	shift = class_log2 - FASTMEM_CACHE_BASE_CLASS_LOG2;
+	cap = FASTMEM_CACHE_BASE_CAPACITY >> shift;
+
+	return cap < FASTMEM_CACHE_FLOOR_CAPACITY ?
+		FASTMEM_CACHE_FLOOR_CAPACITY : cap;
+}
+
+static struct fastmem_cache **
+cache_slot(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	if (lcore_id >= RTE_MAX_LCORE)
+		return NULL;
+	return &socket->caches[lcore_id][class_idx];
+}
+
+static struct fastmem_cache *
+cache_create(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache **slot = cache_slot(socket, class_idx, lcore_id);
+	struct fastmem_cache *cache;
+	unsigned int capacity;
+	size_t cache_size;
+	unsigned int cache_class;
+	unsigned int own_socket;
+	struct fastmem_socket_state *alloc_socket;
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	capacity = cache_capacity(class_idx);
+	cache_size = sizeof(*cache) + capacity * sizeof(void *);
+
+	/*
+	 * Allocate the cache struct from fastmem on the calling
+	 * lcore's socket (NUMA-local to the writer). Bypasses the
+	 * cache layer to avoid recursion.
+	 */
+	cache_class = size_to_class(cache_size, RTE_CACHE_LINE_SIZE);
+	own_socket = rte_socket_id();
+
+	if (cache_class >= FASTMEM_N_CLASSES) {
+		FASTMEM_LOG(ERR,
+			"cache size %zu exceeds max size class",
+			cache_size);
+		return NULL;
+	}
+
+	if (own_socket >= RTE_MAX_NUMA_NODES)
+		own_socket = (unsigned int)socket->bins[0].socket_id;
+
+	alloc_socket = &fastmem->sockets[own_socket];
+
+	cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
+	if (cache == NULL) {
+		FASTMEM_LOG(ERR,
+			"failed to allocate cache for class %u on socket %u",
+			class_idx, own_socket);
+		return NULL;
+	}
+
+	cache->count = 0;
+	cache->capacity = capacity;
+	cache->target = capacity / 2;
+	cache->alloc_cache_hits = 0;
+	cache->alloc_cache_misses = 0;
+	cache->alloc_nomem = 0;
+	cache->free_cache_hits = 0;
+	cache->free_cache_misses = 0;
+
+	*slot = cache;
+
+	return cache;
+}
+
+static struct fastmem_cache *
+cache_get(struct fastmem_socket_state *socket, unsigned int class_idx,
+		unsigned int lcore_id)
+{
+	struct fastmem_cache **slot;
+	struct fastmem_cache *cache;
+
+	if (unlikely(!fastmem_is_primary))
+		return NULL;
+
+	slot = cache_slot(socket, class_idx, lcore_id);
+
+	if (slot == NULL)
+		return NULL;
+
+	cache = *slot;
+	if (cache != NULL)
+		return cache;
+
+	return cache_create(socket, class_idx, lcore_id);
+}
+
+static void *
+cache_pop(struct fastmem_cache *cache, struct fastmem_bin *bin)
+{
+	if (cache->count > 0) {
+		cache->alloc_cache_hits++;
+		return cache->objs[--cache->count];
+	}
+
+	cache->count = bin_alloc_bulk(bin, cache->objs, cache->target);
+	if (cache->count == 0)
+		return NULL;
+
+	cache->alloc_cache_misses++;
+	return cache->objs[--cache->count];
+}
+
+static void
+cache_push(struct fastmem_cache *cache, struct fastmem_bin *bin, void *obj)
+{
+	unsigned int drain;
+
+	if (cache->count < cache->capacity) {
+		cache->free_cache_hits++;
+		cache->objs[cache->count++] = obj;
+		return;
+	}
+
+	cache->free_cache_misses++;
+
+	/*
+	 * Drain the oldest (bottom) half to the bin, keeping the
+	 * newest (top) half for temporal reuse.
+	 */
+	drain = cache->count - cache->target;
+	bin_free_bulk(bin, cache->objs, drain);
+	memmove(cache->objs, cache->objs + drain,
+		cache->target * sizeof(cache->objs[0]));
+	cache->count = cache->target;
+
+	cache->objs[cache->count++] = obj;
+}
+
+static void
+socket_release_caches(struct fastmem_socket_state *socket)
+{
+	unsigned int lcore;
+	unsigned int c;
+
+	for (lcore = 0; lcore < RTE_MAX_LCORE; lcore++) {
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache = socket->caches[lcore][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache);
+
+			socket->caches[lcore][c] = NULL;
+		}
+	}
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_init, 24.11)
+rte_fastmem_init(void)
+{
+	unsigned int s, c;
+
+	if (fastmem != NULL)
+		return -EBUSY;
+
+	fastmem_mz = rte_memzone_reserve_aligned("fastmem_state",
+			sizeof(*fastmem), SOCKET_ID_ANY, 0,
+			RTE_CACHE_LINE_SIZE);
+	if (fastmem_mz == NULL)
+		return -ENOMEM;
+
+	fastmem = fastmem_mz->addr;
+	fastmem_is_primary = true;
+	memset(fastmem, 0, sizeof(*fastmem));
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		rte_spinlock_init(&socket->lock);
+		socket->memory_limit = SIZE_MAX;
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++)
+			bin_init(&socket->bins[c], c, (int)s);
+	}
+
+	return 0;
+}
+
+static void
+release_socket_caches(struct fastmem_socket_state *socket)
+{
+	socket_release_caches(socket);
+}
+
+static void
+release_socket_bins(struct fastmem_socket_state *socket)
+{
+	unsigned int c;
+
+	for (c = 0; c < FASTMEM_N_CLASSES; c++)
+		bin_release(&socket->bins[c], socket);
+}
+
+static void
+release_socket_memzones(struct fastmem_socket_state *socket)
+{
+	unsigned int i;
+
+	for (i = 0; i < socket->n_memzones; i++)
+		rte_memzone_free(socket->memzones[i]);
+
+	socket->free_head = NULL;
+	socket->reserved_bytes = 0;
+	socket->n_memzones = 0;
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_deinit, 24.11)
+rte_fastmem_deinit(void)
+{
+	unsigned int i;
+
+	if (fastmem == NULL)
+		return;
+
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY) {
+		fastmem = NULL;
+		fastmem_mz = NULL;
+		return;
+	}
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_caches(&fastmem->sockets[i]);
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_bins(&fastmem->sockets[i]);
+
+	for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
+		release_socket_memzones(&fastmem->sockets[i]);
+
+	rte_memzone_free(fastmem_mz);
+	fastmem_mz = NULL;
+	fastmem = NULL;
+}
+
+/* Same resolution order as rte_malloc's malloc_get_numa_socket(). */
+static unsigned int
+local_socket_id(void)
+{
+	int sid = (int)rte_socket_id();
+
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = (int)rte_lcore_to_socket_id(rte_get_main_lcore());
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	sid = rte_socket_id_by_idx(0);
+	if (likely(sid >= 0 && sid < RTE_MAX_NUMA_NODES))
+		return sid;
+
+	return 0;
+}
+
+static int
+reserve_on_socket(int sid, size_t size)
+{
+	struct fastmem_socket_state *socket = &fastmem->sockets[sid];
+	int rc = 0;
+
+	rte_spinlock_lock(&socket->lock);
+
+	while (socket->reserved_bytes < size) {
+		rc = grow_socket(socket, sid);
+		if (rc < 0)
+			break;
+	}
+
+	rte_spinlock_unlock(&socket->lock);
+
+	return rc;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_reserve, 24.11)
+rte_fastmem_reserve(size_t size, int socket_id)
+{
+	unsigned int i;
+	int rc;
+
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id != SOCKET_ID_ANY) {
+		if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+			return -EINVAL;
+		return reserve_on_socket(socket_id, size);
+	}
+
+	rc = reserve_on_socket(local_socket_id(), size);
+	if (rc == 0)
+		return 0;
+
+	for (i = 0; i < rte_socket_count(); i++) {
+		int sid = rte_socket_id_by_idx(i);
+
+		if (sid < 0 || (unsigned int)sid == local_socket_id())
+			continue;
+
+		rc = reserve_on_socket(sid, size);
+		if (rc == 0)
+			return 0;
+	}
+
+	return rc;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_set_limit, 24.11)
+rte_fastmem_set_limit(int socket_id, size_t max_bytes)
+{
+	if (fastmem == NULL)
+		return -EINVAL;
+
+	if (socket_id == SOCKET_ID_ANY) {
+		for (unsigned int i = 0; i < RTE_MAX_NUMA_NODES; i++)
+			fastmem->sockets[i].memory_limit = max_bytes;
+		return 0;
+	}
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	fastmem->sockets[socket_id].memory_limit = max_bytes;
+	return 0;
+}
+
+size_t
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_get_limit, 24.11)
+rte_fastmem_get_limit(int socket_id)
+{
+	if (fastmem == NULL || socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return 0;
+
+	return fastmem->sockets[socket_id].memory_limit;
+}
+
+size_t
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_max_size, 24.11)
+rte_fastmem_max_size(void)
+{
+	return FASTMEM_MAX_ALLOC_SIZE;
+}
+
+static void *
+alloc_from_socket(struct fastmem_socket_state *socket,
+		unsigned int class_idx, unsigned int lcore_id)
+{
+	struct fastmem_cache *cache;
+
+	cache = cache_get(socket, class_idx, lcore_id);
+	if (likely(cache != NULL))
+		return cache_pop(cache, &socket->bins[class_idx]);
+	return bin_alloc_one(&socket->bins[class_idx]);
+}
+
+static void
+do_free(void *ptr)
+{
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+
+	slab = slab_of(ptr);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+	if (likely(cache != NULL))
+		cache_push(cache, bin, ptr);
+	else
+		bin_free_one(bin, ptr);
+}
+
+static int
+do_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags, unsigned int lcore_id,
+		int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int got = 0;
+
+	if (unlikely(fastmem_get() == NULL))
+		return -rte_errno;
+
+	if (unlikely(!normalize_align(&align))) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return -E2BIG;
+	}
+
+	socket = &fastmem->sockets[socket_id];
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		/* Drain from cache. */
+		unsigned int avail = RTE_MIN(cache->count, n);
+
+		cache->count -= avail;
+		memcpy(ptrs, &cache->objs[cache->count],
+			avail * sizeof(void *));
+		got = avail;
+		cache->alloc_cache_hits += avail;
+
+		if (got < n) {
+			unsigned int need = n - got;
+			unsigned int want = RTE_MAX(need, cache->target);
+			unsigned int filled;
+
+			if (want <= cache->capacity) {
+				/* Refill into cache, give caller their share. */
+				filled = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					cache->objs, want);
+				if (filled > 0)
+					cache->alloc_cache_misses += RTE_MIN(filled, need);
+				if (filled >= need) {
+					memcpy(ptrs + got,
+						cache->objs + filled - need,
+						need * sizeof(void *));
+					cache->count = filled - need;
+					got = n;
+				} else {
+					memcpy(ptrs + got, cache->objs,
+						filled * sizeof(void *));
+					got += filled;
+					cache->count = 0;
+				}
+			} else {
+				/* n exceeds cache capacity; pull directly. */
+				unsigned int direct = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, need);
+				if (direct > 0)
+					cache->alloc_cache_misses += direct;
+				got += direct;
+			}
+		}
+	} else {
+		got = bin_alloc_bulk(&socket->bins[class_idx], ptrs, n);
+	}
+
+	if (unlikely(got < n) && fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count() && got < n; i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			socket = &fastmem->sockets[sid];
+			cache = cache_get(socket, class_idx, lcore_id);
+			if (likely(cache != NULL)) {
+				unsigned int avail =
+					RTE_MIN(cache->count, n - got);
+				cache->count -= avail;
+				memcpy(ptrs + got,
+					&cache->objs[cache->count],
+					avail * sizeof(void *));
+				cache->alloc_cache_hits += avail;
+				got += avail;
+			}
+			if (got < n) {
+				unsigned int direct = bin_alloc_bulk(
+					&socket->bins[class_idx],
+					ptrs + got, n - got);
+				if (direct > 0 && cache != NULL)
+					cache->alloc_cache_misses += direct;
+				got += direct;
+			}
+		}
+	}
+
+	if (unlikely(got < n)) {
+		/* All-or-nothing: give back the partial allocation. */
+		struct fastmem_cache **slot;
+		unsigned int i;
+
+		for (i = 0; i < got; i++)
+			do_free(ptrs[i]);
+
+		slot = cache_slot(
+			&fastmem->sockets[socket_id], class_idx,
+			lcore_id);
+		if (slot != NULL && *slot != NULL)
+			(*slot)->alloc_nomem++;
+		rte_errno = ENOMEM;
+		return -ENOMEM;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO) {
+		size_t cs = class_size(class_idx);
+		unsigned int i;
+
+		for (i = 0; i < n; i++)
+			memset(ptrs[i], 0, cs);
+	}
+
+	return 0;
+}
+
+static void *
+do_alloc(size_t size, size_t align, unsigned int flags,
+		unsigned int lcore_id, int socket_id, bool fallback)
+{
+	unsigned int class_idx;
+	struct fastmem_cache **slot;
+	void *obj;
+
+	if (unlikely(fastmem_get() == NULL))
+		return NULL;
+
+	if (unlikely(!normalize_align(&align))) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	class_idx = size_to_class(size, align);
+	if (unlikely(class_idx >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return NULL;
+	}
+
+	obj = alloc_from_socket(&fastmem->sockets[socket_id],
+			class_idx, lcore_id);
+
+	if (likely(obj != NULL))
+		goto out;
+
+	if (fallback) {
+		unsigned int i;
+
+		for (i = 0; i < rte_socket_count(); i++) {
+			int sid = rte_socket_id_by_idx(i);
+
+			if (sid < 0 || sid == socket_id)
+				continue;
+
+			obj = alloc_from_socket(&fastmem->sockets[sid],
+					class_idx, lcore_id);
+			if (obj != NULL)
+				goto out;
+		}
+	}
+
+	slot = cache_slot(
+		&fastmem->sockets[socket_id], class_idx, lcore_id);
+	if (slot != NULL && *slot != NULL)
+		(*slot)->alloc_nomem++;
+	rte_errno = ENOMEM;
+	return NULL;
+
+out:
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc, 24.11)
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+{
+	return do_alloc(size, align, flags, rte_lcore_id(),
+			local_socket_id(), false);
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_socket, 24.11)
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc(size, align, flags, rte_lcore_id(),
+				local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	return do_alloc(size, align, flags, rte_lcore_id(), socket_id, false);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free, 24.11)
+rte_fastmem_free(void *ptr)
+{
+	if (unlikely(ptr == NULL))
+		return;
+
+	do_free(ptr);
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_realloc, 24.11)
+rte_fastmem_realloc(void *ptr, size_t size, size_t align)
+{
+	struct fastmem_slab *slab;
+	unsigned int old_class, new_class;
+	size_t old_size;
+	void *new_ptr;
+
+	if (ptr == NULL)
+		return rte_fastmem_alloc(size, align, 0);
+
+	if (size == 0) {
+		rte_fastmem_free(ptr);
+		return NULL;
+	}
+
+	if (unlikely(!normalize_align(&align))) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	new_class = size_to_class(size, align);
+	if (unlikely(new_class >= FASTMEM_N_CLASSES)) {
+		rte_errno = E2BIG;
+		return NULL;
+	}
+
+	slab = slab_of(ptr);
+	old_class = slab->bin->class_idx;
+
+	if (new_class == old_class)
+		return ptr;
+
+	new_ptr = rte_fastmem_alloc(size, align, 0);
+	if (unlikely(new_ptr == NULL))
+		return NULL;
+
+	old_size = class_size(old_class);
+	memcpy(new_ptr, ptr, RTE_MIN(old_size, size));
+	rte_fastmem_free(ptr);
+
+	return new_ptr;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk, 24.11)
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags)
+{
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), local_socket_id(), false);
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_alloc_bulk_socket, 24.11)
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id)
+{
+	if (socket_id == SOCKET_ID_ANY)
+		return do_alloc_bulk(ptrs, n, size, align, flags,
+				rte_lcore_id(), local_socket_id(), true);
+
+	if (unlikely(socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)) {
+		rte_errno = EINVAL;
+		return -EINVAL;
+	}
+
+	return do_alloc_bulk(ptrs, n, size, align, flags,
+			rte_lcore_id(), socket_id, false);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_free_bulk, 24.11)
+rte_fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int lcore_id;
+	struct fastmem_slab *slab;
+	struct fastmem_bin *bin;
+	struct fastmem_socket_state *socket;
+	struct fastmem_cache *cache;
+	unsigned int space;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	lcore_id = rte_lcore_id();
+
+	/* Use the first object's bin to select the cache. */
+	slab = slab_of(ptrs[0]);
+	bin = slab->bin;
+	socket = &fastmem->sockets[bin->socket_id];
+	cache = cache_get(socket, bin->class_idx, lcore_id);
+
+	if (unlikely(cache == NULL)) {
+		for (i = 0; i < n; i++)
+			do_free(ptrs[i]);
+		return;
+	}
+
+	/*
+	 * Try to push all objects into the cache in one memcpy.
+	 * If any object belongs to a different bin, nothing has been
+	 * pushed yet, so fall back to per-object free for the whole
+	 * batch.
+	 */
+	space = cache->capacity - cache->count;
+	if (likely(n <= space)) {
+		/* Verify all same bin (common case). */
+		for (i = 1; i < n; i++) {
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		}
+		cache->free_cache_hits += n;
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+	/* Batch fits the cache but not its free space: drain first, then push. */
+	if (n <= cache->capacity) {
+		unsigned int drain;
+
+		for (i = 1; i < n; i++) {
+			if (slab_of(ptrs[i])->bin != bin)
+				goto slow;
+		}
+
+		cache->free_cache_misses += n;
+		drain = cache->count - cache->target + n;
+		if (drain > cache->count)
+			drain = cache->count;
+		if (drain > 0) {
+			bin_free_bulk(bin, cache->objs, drain);
+			cache->count -= drain;
+			memmove(cache->objs, cache->objs + drain,
+				cache->count * sizeof(cache->objs[0]));
+		}
+		memcpy(&cache->objs[cache->count], ptrs,
+			n * sizeof(void *));
+		cache->count += n;
+		return;
+	}
+
+slow:
+	for (i = 0; i < n; i++)
+		do_free(ptrs[i]);
+}
+
+#define FASTMEM_HANDLE_CLASS_BITS 8
+
+static rte_fastmem_handle_t
+fastmem_handle_pack(unsigned int class_idx, int socket_id)
+{
+	return (uint32_t)class_idx |
+		((uint32_t)socket_id << FASTMEM_HANDLE_CLASS_BITS);
+}
+
+static unsigned int
+fastmem_handle_class(rte_fastmem_handle_t h)
+{
+	return h & ((1U << FASTMEM_HANDLE_CLASS_BITS) - 1);
+}
+
+static int
+fastmem_handle_socket(rte_fastmem_handle_t h)
+{
+	return (int)(h >> FASTMEM_HANDLE_CLASS_BITS);
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hlookup, 24.11)
+rte_fastmem_hlookup(size_t size, size_t align, int socket_id,
+		rte_fastmem_handle_t *handle)
+{
+	unsigned int class_idx;
+	struct fastmem_socket_state *socket;
+
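+	if (handle == NULL)
+		return -EINVAL;
+
+	/* Attach to (or verify) the shared state before touching it. */
+	if (fastmem_get() == NULL)
+		return -rte_errno;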
+
+	if (!normalize_align(&align))
+		return -EINVAL;
+
+	if (socket_id < 0 || socket_id >= RTE_MAX_NUMA_NODES)
+		return -EINVAL;
+
+	class_idx = size_to_class(size, align);
+	if (class_idx >= FASTMEM_N_CLASSES)
+		return -E2BIG;
+
+	/* Pre-create the cache for the calling lcore. */
+	socket = &fastmem->sockets[socket_id];
+	cache_create(socket, class_idx, rte_lcore_id());
+
+	*handle = fastmem_handle_pack(class_idx, socket_id);
+	return 0;
+}
+
+void *
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_halloc, 24.11)
+rte_fastmem_halloc(rte_fastmem_handle_t handle, unsigned int flags)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	unsigned int lcore_id = rte_lcore_id();
+	struct fastmem_socket_state *socket = &fastmem->sockets[socket_id];
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	struct fastmem_cache *cache;
+	void *obj;
+
+	RTE_ASSERT(fastmem != NULL);
+	RTE_ASSERT(lcore_id < RTE_MAX_LCORE);
+
+	cache = socket->caches[lcore_id][class_idx];
+	RTE_ASSERT(cache != NULL);
+
+	obj = cache_pop(cache, bin);
+	if (unlikely(obj == NULL)) {
+		rte_errno = ENOMEM;
+		return NULL;
+	}
+
+	if (flags & RTE_FASTMEM_F_ZERO)
+		memset(obj, 0, class_size(class_idx));
+
+	return obj;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_halloc_bulk, 24.11)
+rte_fastmem_halloc_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n, unsigned int flags)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+
+	return do_alloc_bulk(ptrs, n, class_size(class_idx),
+			RTE_CACHE_LINE_SIZE, flags, rte_lcore_id(),
+			socket_id, false);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hfree, 24.11)
+rte_fastmem_hfree(rte_fastmem_handle_t handle, void *ptr)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	struct fastmem_socket_state *socket = &fastmem->sockets[socket_id];
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	unsigned int lcore_id = rte_lcore_id();
+	struct fastmem_cache *cache;
+
+	if (unlikely(ptr == NULL))
+		return;
+
+	RTE_ASSERT(lcore_id < RTE_MAX_LCORE);
+
+	cache = socket->caches[lcore_id][class_idx];
+	RTE_ASSERT(cache != NULL);
+
+	cache_push(cache, bin, ptr);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_hfree_bulk, 24.11)
+rte_fastmem_hfree_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n)
+{
+	unsigned int class_idx = fastmem_handle_class(handle);
+	int socket_id = fastmem_handle_socket(handle);
+	struct fastmem_socket_state *socket = &fastmem->sockets[socket_id];
+	struct fastmem_bin *bin = &socket->bins[class_idx];
+	unsigned int lcore_id;
+	struct fastmem_cache *cache;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return;
+
+	lcore_id = rte_lcore_id();
+	cache = cache_get(socket, class_idx, lcore_id);
+
+	if (likely(cache != NULL)) {
+		for (i = 0; i < n; i++)
+			cache_push(cache, bin, ptrs[i]);
+	} else {
+		for (i = 0; i < n; i++)
+			bin_free_one(bin, ptrs[i]);
+	}
+}
+
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_virt2iova, 24.11)
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr)
+{
+	struct fastmem_slab *slab;
+
+	RTE_ASSERT(fastmem != NULL);
+
+	slab = slab_of((void *)(uintptr_t)ptr);
+
+	return slab->iova_base + ((uintptr_t)ptr - (uintptr_t)slab);
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_cache_flush, 24.11)
+rte_fastmem_cache_flush(void)
+{
+	unsigned int lcore_id;
+	unsigned int s, c;
+
+	if (fastmem == NULL)
+		return;
+
+	lcore_id = rte_lcore_id();
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	for (s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+			struct fastmem_slab *cache_slab;
+
+			if (cache == NULL)
+				continue;
+
+			if (cache->count > 0) {
+				bin_free_bulk(&socket->bins[c],
+					cache->objs, cache->count);
+				cache->count = 0;
+			}
+
+			cache_slab = slab_of(cache);
+			bin_free_one(cache_slab->bin, cache);
+
+			socket->caches[lcore_id][c] = NULL;
+		}
+	}
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats, 24.11)
+rte_fastmem_stats(struct rte_fastmem_stats *stats)
+{
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_stats){0};
+	stats->n_classes = FASTMEM_N_CLASSES;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		stats->bytes_backing += socket->reserved_bytes;
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			uint64_t class_allocs = 0, class_frees = 0;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				class_allocs += cache->alloc_cache_hits +
+					cache->alloc_cache_misses;
+				class_frees += cache->free_cache_hits +
+					cache->free_cache_misses;
+				stats->alloc_nomem += cache->alloc_nomem;
+			}
+			stats->alloc_total += class_allocs;
+			stats->free_total += class_frees;
+			if (class_allocs > class_frees)
+				stats->bytes_in_use += class_size(c) *
+					(class_allocs - class_frees);
+		}
+	}
+
+	return 0;
+}
+
+static unsigned int
+exact_class_idx(size_t sz)
+{
+	unsigned int log2;
+
+	if (sz < FASTMEM_MIN_SIZE || sz > FASTMEM_MAX_ALLOC_SIZE)
+		return FASTMEM_N_CLASSES;
+	if ((sz & (sz - 1)) != 0)
+		return FASTMEM_N_CLASSES;
+
+	log2 = (unsigned int)rte_ctz64(sz);
+	if (log2 < FASTMEM_MIN_CLASS_LOG2 || log2 > FASTMEM_MAX_CLASS_LOG2)
+		return FASTMEM_N_CLASSES;
+
+	return log2 - FASTMEM_MIN_CLASS_LOG2;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_class, 24.11)
+rte_fastmem_stats_class(size_t class_size_arg,
+		struct rte_fastmem_class_stats *stats)
+{
+	unsigned int c;
+	uint64_t allocs, frees;
+
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+		struct fastmem_bin *bin = &socket->bins[c];
+
+		for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+			struct fastmem_cache *cache = socket->caches[l][c];
+			if (cache == NULL)
+				continue;
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+
+		stats->slab_acquires += bin->slab_acquires;
+		stats->slab_releases += bin->slab_releases;
+		stats->slabs_partial += bin->slabs_partial;
+		stats->slabs_full += bin->slabs_full;
+	}
+
+	allocs = stats->alloc_cache_hits + stats->alloc_cache_misses;
+	frees = stats->free_cache_hits + stats->free_cache_misses;
+	if (allocs > frees)
+		stats->in_use = allocs - frees;
+
+	return 0;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore, 24.11)
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats)
+{
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_stats){0};
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_cache *cache =
+				socket->caches[lcore_id][c];
+			if (cache == NULL)
+				continue;
+			stats->alloc_cache_hits += cache->alloc_cache_hits;
+			stats->alloc_cache_misses += cache->alloc_cache_misses;
+			stats->alloc_nomem += cache->alloc_nomem;
+			stats->free_cache_hits += cache->free_cache_hits;
+			stats->free_cache_misses += cache->free_cache_misses;
+		}
+	}
+
+	return 0;
+}
+
+int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_lcore_class, 24.11)
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size_arg,
+		struct rte_fastmem_lcore_class_stats *stats)
+{
+	unsigned int c;
+
+	if (stats == NULL || fastmem == NULL)
+		return -EINVAL;
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	c = exact_class_idx(class_size_arg);
+	if (c >= FASTMEM_N_CLASSES)
+		return -EINVAL;
+
+	*stats = (struct rte_fastmem_lcore_class_stats){0};
+	stats->class_size = class_size(c);
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_cache *cache =
+			fastmem->sockets[s].caches[lcore_id][c];
+		if (cache == NULL)
+			continue;
+		stats->alloc_cache_hits += cache->alloc_cache_hits;
+		stats->alloc_cache_misses += cache->alloc_cache_misses;
+		stats->alloc_nomem += cache->alloc_nomem;
+		stats->free_cache_hits += cache->free_cache_hits;
+		stats->free_cache_misses += cache->free_cache_misses;
+	}
+
+	return 0;
+}
+
+void
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_stats_reset, 24.11)
+rte_fastmem_stats_reset(void)
+{
+	if (fastmem == NULL)
+		return;
+
+	for (unsigned int s = 0; s < RTE_MAX_NUMA_NODES; s++) {
+		struct fastmem_socket_state *socket = &fastmem->sockets[s];
+
+		for (unsigned int c = 0; c < FASTMEM_N_CLASSES; c++) {
+			struct fastmem_bin *bin = &socket->bins[c];
+
+			bin->slab_acquires = 0;
+			bin->slab_releases = 0;
+
+			for (unsigned int l = 0; l < RTE_MAX_LCORE; l++) {
+				struct fastmem_cache *cache =
+					socket->caches[l][c];
+				if (cache == NULL)
+					continue;
+				cache->alloc_cache_hits = 0;
+				cache->alloc_cache_misses = 0;
+				cache->alloc_nomem = 0;
+				cache->free_cache_hits = 0;
+				cache->free_cache_misses = 0;
+			}
+		}
+	}
+}
+
+unsigned int
+RTE_EXPORT_EXPERIMENTAL_SYMBOL(rte_fastmem_classes, 24.11)
+rte_fastmem_classes(size_t *sizes)
+{
+	if (sizes != NULL)
+		for (unsigned int i = 0; i < FASTMEM_N_CLASSES; i++)
+			sizes[i] = class_size(i);
+	return FASTMEM_N_CLASSES;
+}
diff --git a/lib/fastmem/rte_fastmem.h b/lib/fastmem/rte_fastmem.h
new file mode 100644
index 0000000000..1d74660da1
--- /dev/null
+++ b/lib/fastmem/rte_fastmem.h
@@ -0,0 +1,815 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#ifndef _RTE_FASTMEM_H_
+#define _RTE_FASTMEM_H_
+
+/**
+ * @file
+ *
+ * RTE Fastmem
+ *
+ * @warning
+ * @b EXPERIMENTAL:
+ * All functions in this file may be changed or removed without prior notice.
+ *
+ * The fastmem library is a fast, general-purpose small-object
+ * allocator for DPDK applications. It is intended to allow an
+ * application to replace its many per-type mempools — each sized
+ * for a single object type (a connection, a session, a work item,
+ * a timer, etc.) — with a single allocator that handles arbitrary
+ * object sizes, grows on demand, and offers mempool-level
+ * performance for the common allocation and free paths.
+ *
+ * Like mempool, fastmem is backed by huge pages, is NUMA-aware,
+ * supports bulk operations, and uses per-lcore caches to reduce
+ * shared-state contention. Unlike mempool, it does not require the
+ * caller to declare object sizes or counts up front.
+ *
+ * There is a single, global fastmem instance per process. The
+ * instance is brought up with rte_fastmem_init() and torn down with
+ * rte_fastmem_deinit(). Allocations are made with
+ * rte_fastmem_alloc() and freed with rte_fastmem_free().
+ *
+ * The allocator is bounded to small-object allocations. Requests
+ * larger than rte_fastmem_max_size() are rejected; callers with
+ * such needs should use rte_malloc() directly.
+ *
+ * Backing memory is reserved from DPDK memzones. Once reserved,
+ * backing memory is not returned to the system during the
+ * allocator's lifetime. Callers that need predictable latency may
+ * pre-reserve backing memory up front using rte_fastmem_reserve(),
+ * avoiding memzone-reservation overhead during steady-state
+ * operation.
+ *
+ * Alignment argument, @c align:
+ *   If non-zero, @c align specifies the minimum alignment and
+ *   must be a power of 2. If zero, the default alignment is
+ *   @c RTE_CACHE_LINE_SIZE, so that objects obtained from distinct
+ *   calls cannot false-share a cache line.
+ *
+ * Threads and per-lcore caches:
+ *   Allocate and free calls from EAL threads are served through a
+ *   per-lcore cache, which makes the common path lock-free.
+ *   Unregistered non-EAL threads do not use a cache; their
+ *   allocate and free calls go directly to shared state, take an
+ *   internal lock, and cost more per call.
+ *
+ * Non-preemptible caller:
+ *   Callers should not be preemptible while inside a fastmem call.
+ *   Fastmem uses internal spinlocks; if a caller is preempted
+ *   while holding one, any other thread that subsequently needs
+ *   the same lock stalls until the preempted caller resumes.
+ */
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Flag for rte_fastmem_alloc() and its variants: initialize the
+ * returned memory to zero before returning it to the caller.
+ */
+#define RTE_FASTMEM_F_ZERO RTE_BIT32(0)
+
+/**
+ * Initialize the fastmem allocator.
+ *
+ * Sets up the library's internal state. Must be called before any
+ * allocation call. Typically called once per process, after
+ * rte_eal_init() and before the application's worker threads begin
+ * making allocations.
+ *
+ * Initialization does not pre-reserve any backing memory; memzones
+ * are reserved lazily as allocations require. An application that
+ * wants to avoid memzone-reservation latency on the allocation
+ * path should follow rte_fastmem_init() with one or more calls to
+ * rte_fastmem_reserve().
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EBUSY: The allocator is already initialized.
+ *  - -ENOMEM: Unable to allocate internal state.
+ */
+__rte_experimental
+int
+rte_fastmem_init(void);
+
+/**
+ * Tear down the fastmem allocator.
+ *
+ * Releases the library's internal state and frees all backing
+ * memzones. After this call, no fastmem allocations or frees may
+ * be made until rte_fastmem_init() is called again.
+ *
+ * The caller is responsible for ensuring that no fastmem-allocated
+ * objects remain in use. Outstanding allocations at deinit time
+ * result in undefined behavior.
+ *
+ * This function is not thread-safe and must not be called
+ * concurrently with any other fastmem function.
+ */
+__rte_experimental
+void
+rte_fastmem_deinit(void);
+
+/**
+ * Pre-reserve backing memory.
+ *
+ * Ensures that at least @p size bytes of memzone-backed memory are
+ * available to the allocator on @p socket_id, reserving additional
+ * memzones from EAL as needed to reach that total. Subsequent
+ * allocations served from the pre-reserved memory do not incur
+ * memzone-reservation cost.
+ *
+ * The reservation is cumulative: repeated calls to
+ * rte_fastmem_reserve() with the same @p socket_id grow the
+ * reservation monotonically. Reserved memory is never returned to
+ * the system during the allocator's lifetime.
+ *
+ * A typical use is to call rte_fastmem_reserve() once at
+ * application startup, with a size chosen to cover the expected
+ * steady-state working set. Allocations and frees during
+ * steady-state operation then avoid memzone reservations entirely.
+ *
+ * @param size
+ *  The minimum amount of backing memory, in bytes, to make
+ *  available on @p socket_id. The allocator may reserve more than
+ *  the requested amount due to internal rounding (e.g., to memzone
+ *  or block granularity).
+ *
+ * @param socket_id
+ *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
+ *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the reservation.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -ENOMEM: Insufficient huge-page memory to satisfy the request.
+ *  - -EINVAL: Invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_reserve(size_t size, int socket_id);
+
+/**
+ * Set the maximum backing memory that may be reserved on a socket.
+ *
+ * Once the limit is reached, allocations that would require new
+ * backing memory on the constrained socket fail with ENOMEM.
+ * Already-reserved memory is not released.
+ *
+ * Setting a limit below the current reserved amount is allowed and
+ * prevents further growth.
+ *
+ * @param socket_id
+ *  The NUMA socket to constrain, or SOCKET_ID_ANY to apply the
+ *  limit to all sockets.
+ * @param max_bytes
+ *  Maximum backing memory in bytes, or SIZE_MAX for unlimited (the default).
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: Fastmem not initialized, or invalid @p socket_id.
+ */
+__rte_experimental
+int
+rte_fastmem_set_limit(int socket_id, size_t max_bytes);
+
+/**
+ * Get the maximum backing memory limit for a socket.
+ *
+ * @param socket_id
+ *  The NUMA socket to query.
+ * @return
+ *  The limit in bytes, or SIZE_MAX if unlimited.
+ */
+__rte_experimental
+size_t
+rte_fastmem_get_limit(int socket_id);
+
+/**
+ * Retrieve the largest allocation size the allocator supports.
+ *
+ * Requests larger than this size are rejected by the allocation
+ * functions. The returned value is a property of the allocator
+ * implementation and does not change across the lifetime of the
+ * process.
+ *
+ * @return
+ *  The largest supported allocation size, in bytes.
+ */
+__rte_experimental
+size_t
+rte_fastmem_max_size(void);
+
+/* Forward declaration for __rte_dealloc attribute. */
+void rte_fastmem_free(void *ptr);
+
+/**
+ * Allocate an object from the fastmem allocator.
+ *
+ * Allocates at least @p size bytes, aligned to at least @p align
+ * bytes. The returned memory is backed by huge pages and is
+ * DMA-usable; its IOVA can be obtained via rte_fastmem_virt2iova().
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_socket() to target a
+ * specific socket.
+ *
+ * The allocated memory must be freed with rte_fastmem_free(). An
+ * allocation may be freed from any lcore, not only the lcore that
+ * made the allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags. Use
+ *  RTE_FASTMEM_F_ZERO to obtain zero-initialized memory.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set:
+ *    - E2BIG: @p size exceeds rte_fastmem_max_size().
+ *    - EINVAL: Invalid @p align (not a power of two).
+ *    - ENOMEM: Allocation could not be served from existing
+ *      backing memory and no additional memzone could be reserved.
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc(size_t size, size_t align, unsigned int flags)
+	__rte_malloc __rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Allocate an object on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc(), but targets the specified NUMA socket
+ * rather than the socket of the calling lcore. Use this variant
+ * when the lifetime or access pattern of the allocation is not
+ * tied to the calling lcore's socket.
+ *
+ * This function is MT-safe.
+ *
+ * @param size
+ *  Requested allocation size, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, the returned pointer will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, the returned pointer will
+ *  be aligned on a multiple of @p align, which must be a power of
+ *  2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - A pointer to the allocated object on success.
+ *  - NULL on failure, with @c rte_errno set (see rte_fastmem_alloc()).
+ */
+__rte_experimental
+void *
+rte_fastmem_alloc_socket(size_t size, size_t align, unsigned int flags,
+		int socket_id)
+	__rte_malloc __rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Resize a fastmem allocation, preserving existing contents.
+ *
+ * If @p ptr is NULL, equivalent to rte_fastmem_alloc(size, align, 0).
+ * If @p size is 0, frees @p ptr and returns NULL.
+ *
+ * If the existing allocation can already satisfy the new size and
+ * alignment, the original pointer may be returned unchanged.
+ * Otherwise, a new allocation is made, the contents are copied
+ * (up to the minimum of old and new sizes), and the old allocation
+ * is freed.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptr
+ *  Pointer to an existing fastmem allocation, or NULL.
+ *
+ * @param size
+ *  New requested size in bytes. If 0, the allocation is freed.
+ *
+ * @param align
+ *  If 0, alignment is at least @c RTE_CACHE_LINE_SIZE. Otherwise,
+ *  must be a power of 2.
+ *
+ * @return
+ *  - A pointer to the resized allocation on success.
+ *  - NULL on failure, with @c rte_errno set:
+ *    - E2BIG: @p size exceeds rte_fastmem_max_size().
+ *    - EINVAL: Invalid @p align.
+ *    - ENOMEM: Allocation could not be served.
+ *  On failure, the original allocation at @p ptr remains valid.
+ */
+__rte_experimental
+void *
+rte_fastmem_realloc(void *ptr, size_t size, size_t align)
+	__rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Free an object previously allocated by the fastmem allocator.
+ *
+ * @p ptr must have been returned by a prior call to any fastmem
+ * allocation function, or be NULL. If @p ptr is NULL, no operation
+ * is performed.
+ *
+ * Free may be called from any lcore, regardless of which lcore
+ * made the original allocation.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptr
+ *  Pointer to an object previously allocated by fastmem, or NULL.
+ */
+__rte_experimental
+void
+rte_fastmem_free(void *ptr);
+
+/**
+ * Allocate multiple objects in bulk.
+ *
+ * Allocates @p n objects, each of size at least @p size and aligned
+ * to at least @p align bytes, and stores the resulting pointers
+ * into @p ptrs. All @p n objects have the same size and alignment.
+ *
+ * On NUMA systems, the memory is allocated on the socket of the
+ * calling lcore. Use rte_fastmem_alloc_bulk_socket() to target a
+ * specific socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_alloc().
+ *
+ * On failure no objects are allocated and @p ptrs is left
+ * untouched.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ *  - -EINVAL: Invalid @p align.
+ *  - -ENOMEM: Not enough objects could be allocated to fill the
+ *    request.
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk(void **ptrs, unsigned int n, size_t size, size_t align,
+		unsigned int flags);
+
+/**
+ * Allocate multiple objects in bulk on a specific NUMA socket.
+ *
+ * Like rte_fastmem_alloc_bulk(), but targets the specified NUMA
+ * socket rather than the socket of the calling lcore.
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of at least @p n pointers into which the newly
+ *  allocated object pointers are written.
+ *
+ * @param n
+ *  The number of objects to allocate.
+ *
+ * @param size
+ *  Requested size of each object, in bytes. Must not exceed
+ *  rte_fastmem_max_size().
+ *
+ * @param align
+ *  If 0, returned pointers will be aligned to at least
+ *  @c RTE_CACHE_LINE_SIZE. Otherwise, returned pointers will be
+ *  aligned on a multiple of @p align, which must be a power of 2.
+ *
+ * @param flags
+ *  A bitwise OR of zero or more RTE_FASTMEM_F_* flags.
+ *
+ * @param socket_id
+ *  The NUMA socket on which to allocate, or SOCKET_ID_ANY to
+ *  leave the choice to the allocator. With SOCKET_ID_ANY, the
+ *  allocator starts on the calling lcore's socket (or the first
+ *  configured socket if the caller is not bound to one) and falls
+ *  back to other sockets if the preferred socket cannot satisfy
+ *  the request.
+ *
+ * @return
+ *  - 0: All @p n objects were allocated and stored in @p ptrs.
+ *  - Negative errno on failure (see rte_fastmem_alloc_bulk()).
+ */
+__rte_experimental
+int
+rte_fastmem_alloc_bulk_socket(void **ptrs, unsigned int n, size_t size,
+		size_t align, unsigned int flags, int socket_id);
+
+/**
+ * Free multiple objects in bulk.
+ *
+ * Frees the @p n objects pointed to by @p ptrs. Each pointer in
+ * the array must have been returned by a prior fastmem allocation
+ * call and must not have been freed. The objects need not have
+ * the same size, alignment, or socket.
+ *
+ * The bulk path amortizes per-object overhead and is typically
+ * faster than @p n individual calls to rte_fastmem_free().
+ *
+ * This function is MT-safe.
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_free_bulk(void **ptrs, unsigned int n);
+
+/**
+ * Opaque handle encoding a (size class, NUMA socket) pair.
+ *
+ * Obtained via rte_fastmem_hlookup(). Passing a handle to
+ * rte_fastmem_halloc() avoids the per-call size-class
+ * lookup and socket resolution, improving allocation throughput
+ * for fixed-size objects.
+ */
+typedef uint32_t rte_fastmem_handle_t;
+
+/**
+ * Look up a handle for a given object size and NUMA socket.
+ *
+ * The returned handle encodes the size class and socket, and can
+ * be passed to rte_fastmem_halloc() to allocate objects
+ * without repeating the class lookup.
+ *
+ * @param size
+ *  Object size in bytes. Must not exceed rte_fastmem_max_size().
+ *
+ * @param align
+ *  Alignment requirement (power of two), or 0 for the default
+ *  (RTE_CACHE_LINE_SIZE).
+ *
+ * @param socket_id
+ *  NUMA socket to allocate from. SOCKET_ID_ANY is not accepted.
+ *
+ * @param[out] handle
+ *  On success, set to the resolved handle.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p handle is NULL, or invalid @p align or @p socket_id.
+ *  - -E2BIG: @p size exceeds rte_fastmem_max_size().
+ */
+__rte_experimental
+int
+rte_fastmem_hlookup(size_t size, size_t align, int socket_id,
+		rte_fastmem_handle_t *handle);
+
+/**
+ * Allocate an object using a pre-resolved handle.
+ *
+ * Like rte_fastmem_alloc(), but takes the size class and socket from
+ * @p handle. The calling lcore's cache for that pair must already
+ * exist; rte_fastmem_hlookup() creates it for the lcore it runs on.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param flags
+ *  Allocation flags (e.g., RTE_FASTMEM_F_ZERO).
+ *
+ * @return
+ *  A pointer to the allocated object, or NULL on failure
+ *  (rte_errno is set).
+ */
+__rte_experimental
+void *
+rte_fastmem_halloc(rte_fastmem_handle_t handle, unsigned int flags)
+	__rte_malloc __rte_dealloc(rte_fastmem_free, 1);
+
+/**
+ * Bulk-allocate objects using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_alloc_bulk() but uses a pre-resolved
+ * handle. All-or-nothing semantics apply.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param[out] ptrs
+ *  Array to receive @p n allocated pointers.
+ *
+ * @param n
+ *  Number of objects to allocate.
+ *
+ * @param flags
+ *  Allocation flags (e.g., RTE_FASTMEM_F_ZERO).
+ *
+ * @return
+ *  - 0: All @p n objects allocated successfully.
+ *  - -ENOMEM: Allocation failed; no objects were allocated.
+ */
+__rte_experimental
+int
+rte_fastmem_halloc_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n, unsigned int flags);
+
+/**
+ * Free an object using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_free() but skips the slab-header
+ * lookup by using the class and socket encoded in the handle.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation function.
+ *  Must belong to the same size class and socket as @p handle.
+ *  NULL is permitted (no-op).
+ */
+__rte_experimental
+void
+rte_fastmem_hfree(rte_fastmem_handle_t handle, void *ptr);
+
+/**
+ * Bulk-free objects using a pre-resolved handle.
+ *
+ * Equivalent to rte_fastmem_free_bulk() but skips per-object
+ * slab-header lookups.
+ *
+ * All objects must belong to the same size class and socket as
+ * @p handle.
+ *
+ * @param handle
+ *  A handle previously obtained from rte_fastmem_hlookup().
+ *
+ * @param ptrs
+ *  An array of @p n pointers to fastmem-allocated objects.
+ *
+ * @param n
+ *  The number of objects to free.
+ */
+__rte_experimental
+void
+rte_fastmem_hfree_bulk(rte_fastmem_handle_t handle,
+		void **ptrs, unsigned int n);
+
+/**
+ * Obtain the IOVA for a fastmem-allocated pointer.
+ *
+ * Translates a virtual address returned by a fastmem allocation
+ * function into the corresponding IOVA, suitable for use in device
+ * DMA descriptors.
+ *
+ * The returned IOVA is valid for the lifetime of the allocation.
+ *
+ * @p ptr must have been returned by a prior fastmem allocation
+ * function. Passing any other pointer results in undefined
+ * behavior.
+ *
+ * @param ptr
+ *  A pointer previously returned by a fastmem allocation
+ *  function.
+ *
+ * @return
+ *  The IOVA corresponding to @p ptr.
+ */
+__rte_experimental
+rte_iova_t
+rte_fastmem_virt2iova(const void *ptr);
+
+/**
+ * Flush the calling lcore's per-lcore caches.
+ *
+ * Drains every cached object from the calling lcore's
+ * per-(size class, NUMA socket) caches back to their shared
+ * bins, and releases the cache state itself. A subsequent
+ * allocation or free on this lcore lazily recreates any caches
+ * it needs.
+ *
+ * This is useful in applications that have finished a bursty
+ * phase and want to release memory that would otherwise sit idle
+ * in caches. It is also useful in tests that want to observe
+ * bin-level state without per-lcore caching hiding activity.
+ *
+ * No effect when called from an unregistered non-EAL thread.
+ *
+ * This function is not thread-safe with respect to concurrent
+ * allocations or frees on the calling lcore; call it only when
+ * the calling lcore is not making other fastmem calls.
+ */
+__rte_experimental
+void
+rte_fastmem_cache_flush(void);
+
+/**
+ * Global summary statistics.
+ */
+struct rte_fastmem_stats {
+	uint64_t bytes_backing;  /**< Bytes of backing memory (memzones) reserved from EAL. */
+	uint64_t bytes_in_use;   /**< Approximate bytes in live objects. */
+	uint64_t alloc_total;    /**< Total objects allocated (cache hits + misses). */
+	uint64_t free_total;     /**< Total objects freed (cache hits + misses). */
+	uint64_t alloc_nomem;    /**< Alloc attempts that failed with ENOMEM. */
+	unsigned int n_classes;  /**< Number of size classes. */
+};
+
+/**
+ * Per-size-class statistics (aggregated across all lcores).
+ *
+ * Allocation and free counters count individual objects, not
+ * operations. A bulk allocation of 32 objects that hits the cache
+ * increments alloc_cache_hits by 32.
+ */
+struct rte_fastmem_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t in_use;               /**< Objects currently live (allocs - frees). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from a per-lcore cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by a per-lcore cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+	uint64_t slab_acquires;        /**< Slabs pulled from the free pool. */
+	uint64_t slab_releases;        /**< Slabs returned to the free pool. */
+	uint32_t slabs_partial;        /**< Current partial slab count. */
+	uint32_t slabs_full;           /**< Current full slab count. */
+};
+
+/**
+ * Per-lcore statistics (aggregated across all classes).
+ */
+struct rte_fastmem_lcore_stats {
+	uint64_t alloc_cache_hits;     /**< Allocs served from this lcore's caches. */
+	uint64_t alloc_cache_misses;   /**< Allocs that missed this lcore's caches. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by this lcore's caches. */
+	uint64_t free_cache_misses;    /**< Frees that bypassed this lcore's caches. */
+};
+
+/**
+ * Per-lcore, per-class statistics (no aggregation).
+ */
+struct rte_fastmem_lcore_class_stats {
+	size_t class_size;             /**< Usable size of this class (bytes). */
+	uint64_t alloc_cache_hits;     /**< Allocs served from cache. */
+	uint64_t alloc_cache_misses;   /**< Allocs that triggered a bin refill. */
+	uint64_t alloc_nomem;          /**< Alloc attempts that failed with ENOMEM. */
+	uint64_t free_cache_hits;      /**< Frees absorbed by cache. */
+	uint64_t free_cache_misses;    /**< Frees that triggered a bin drain. */
+};
+
+/**
+ * Get the number of size classes and optionally their sizes.
+ *
+ * @param[out] sizes
+ *   If non-NULL, filled with the size (in bytes) of each class.
+ *   The caller must provide space for at least the returned number
+ *   of entries.
+ *
+ * @return
+ *   The number of size classes.
+ */
+__rte_experimental
+unsigned int
+rte_fastmem_classes(size_t *sizes);
+
+/**
+ * Retrieve global summary statistics.
+ *
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL or fastmem is not initialized.
+ */
+__rte_experimental
+int
+rte_fastmem_stats(struct rte_fastmem_stats *stats);
+
+/**
+ * Retrieve statistics for a single size class.
+ *
+ * @param class_size
+ *   Exact size of the class to query (must match one of the values
+ *   returned by rte_fastmem_classes()).
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized, or
+ *    @p class_size does not match any size class.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_class(size_t class_size,
+		struct rte_fastmem_class_stats *stats);
+
+/**
+ * Retrieve per-lcore statistics (aggregated across all classes).
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized, or
+ *    @p lcore_id is invalid.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore(unsigned int lcore_id,
+		struct rte_fastmem_lcore_stats *stats);
+
+/**
+ * Retrieve per-lcore, per-class statistics.
+ *
+ * @param lcore_id
+ *   The lcore to query.
+ * @param class_size
+ *   Exact size of the class to query.
+ * @param[out] stats
+ *   Structure to fill.
+ *
+ * @return
+ *  - 0: Success.
+ *  - -EINVAL: @p stats is NULL, fastmem is not initialized,
+ *    @p lcore_id is invalid, or @p class_size does not match any
+ *    size class.
+ */
+__rte_experimental
+int
+rte_fastmem_stats_lcore_class(unsigned int lcore_id, size_t class_size,
+		struct rte_fastmem_lcore_class_stats *stats);
+
+/**
+ * Reset all statistics counters to zero.
+ *
+ * Zeroes per-lcore cache counters and per-bin counters. Does not
+ * affect the allocator's operational state.
+ */
+__rte_experimental
+void
+rte_fastmem_stats_reset(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_FASTMEM_H_ */
diff --git a/lib/meson.build b/lib/meson.build
index 8f5cfd28a5..10906d4d53 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -38,6 +38,7 @@ libraries = [
         'distributor',
         'dmadev',  # eventdev depends on this
         'efd',
+        'fastmem',
         'eventdev',
         'dispatcher', # dispatcher depends on eventdev
         'gpudev',
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* RE: [RFC v3 2/3] lib: add fastmem library
  2026-05-27 17:30         ` [RFC v3 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-28  9:11           ` Morten Brørup
  2026-05-28 14:45             ` Varghese, Vipin
  0 siblings, 1 reply; 38+ messages in thread
From: Morten Brørup @ 2026-05-28  9:11 UTC (permalink / raw)
  To: Bruce Richardson, Vipin Varghese, Mattias Rönnblom, dev
  Cc: Konstantin Ananyev, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger

> +/**
> + * Pre-reserve backing memory.
> + *
> + * Ensures that at least @p size bytes of memzone-backed memory are
> + * available to the allocator on @p socket_id, reserving additional
> + * memzones from EAL as needed to reach that total. Subsequent
> + * allocations served from the pre-reserved memory do not incur
> + * memzone-reservation cost.
> + *
> + * The reservation is cumulative: repeated calls to
> + * rte_fastmem_reserve() with the same @p socket_id grow the
> + * reservation monotonically. Reserved memory is never returned to
> + * the system during the allocator's lifetime.
> + *
> + * A typical use is to call rte_fastmem_reserve() once at
> + * application startup, with a size chosen to cover the expected
> + * steady-state working set. Allocations and frees during
> + * steady-state operation then avoid memzone reservations entirely.
> + *
> + * @param size
> + *  The minimum amount of backing memory, in bytes, to make
> + *  available on @p socket_id. The allocator may reserve more than
> + *  the requested amount due to internal rounding (e.g., to memzone
> + *  or block granularity).
> + *
> + * @param socket_id
> + *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
> + *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
> + *  allocator starts on the calling lcore's socket (or the first
> + *  configured socket if the caller is not bound to one) and falls
> + *  back to other sockets if the preferred socket cannot satisfy
> + *  the reservation.
> + *
> + * @return
> + *  - 0: Success.
> + *  - -ENOMEM: Insufficient huge-page memory to satisfy the request.
> + *  - -EINVAL: Invalid @p socket_id.
> + */
> +__rte_experimental
> +int
> +rte_fastmem_reserve(size_t size, int socket_id);

@Bruce,
I vaguely recall that we discussed something about busses and sockets a long time ago, but I cannot remember the details.
Is socket_id the right type (and parameter name) to identify a memory bus?

@Vipin,
You have been working on topology awareness. Same question to you:
Is socket_id the right type (and parameter name) to identify a memory bus?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC v3 2/3] lib: add fastmem library
  2026-05-28  9:11           ` Morten Brørup
@ 2026-05-28 14:45             ` Varghese, Vipin
  2026-05-28 19:56               ` Morten Brørup
  0 siblings, 1 reply; 38+ messages in thread
From: Varghese, Vipin @ 2026-05-28 14:45 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson, Mattias Rönnblom,
	dev@dpdk.org
  Cc: Konstantin Ananyev, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger

Hi @Morten Brørup

<snipped>

>
> > +/**
> > + * Pre-reserve backing memory.
> > + *
> > + * Ensures that at least @p size bytes of memzone-backed memory are
> > + * available to the allocator on @p socket_id, reserving additional
> > + * memzones from EAL as needed to reach that total. Subsequent
> > + * allocations served from the pre-reserved memory do not incur
> > + * memzone-reservation cost.
> > + *
> > + * The reservation is cumulative: repeated calls to
> > + * rte_fastmem_reserve() with the same @p socket_id grow the
> > + * reservation monotonically. Reserved memory is never returned to
> > + * the system during the allocator's lifetime.
> > + *
> > + * A typical use is to call rte_fastmem_reserve() once at
> > + * application startup, with a size chosen to cover the expected
> > + * steady-state working set. Allocations and frees during
> > + * steady-state operation then avoid memzone reservations entirely.
> > + *
> > + * @param size
> > + *  The minimum amount of backing memory, in bytes, to make
> > + *  available on @p socket_id. The allocator may reserve more than
> > + *  the requested amount due to internal rounding (e.g., to memzone
> > + *  or block granularity).
> > + *
> > + * @param socket_id
> > + *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
> > + *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
> > + *  allocator starts on the calling lcore's socket (or the first
> > + *  configured socket if the caller is not bound to one) and falls
> > + *  back to other sockets if the preferred socket cannot satisfy
> > + *  the reservation.
> > + *
> > + * @return
> > + *  - 0: Success.
> > + *  - -ENOMEM: Insufficient huge-page memory to satisfy the request.
> > + *  - -EINVAL: Invalid @p socket_id.
> > + */
> > +__rte_experimental
> > +int
> > +rte_fastmem_reserve(size_t size, int socket_id);
>
> @Bruce,
> I vaguely recall that we discussed something about busses and sockets a long time
> ago, but I cannot remember the details.
> Is socket_id the right type (and parameter name) to identify a memory bus?
>
> @Vipin,
> You have been working on topology awareness. Same question to you:
> Is socket_id the right type (and parameter name) to identify a memory bus?

Short answer: socket_id is no longer a precise or sufficient abstraction to represent a memory bus.
Based on the topology work with libhwloc, we’ve observed the following across Ampere, Intel, and AMD platforms:

Features like SNC (Sub-NUMA Clustering) on Intel and NPS (NUMA nodes Per Socket) on AMD change how socket_id maps to hardware.
In these modes:

1) A single physical socket can expose multiple NUMA domains.
2) These NUMA domains align more closely with memory controller groupings (i.e., memory buses) rather than the full socket.

Depending on the architecture:
a) Memory controllers may be collocated with compute cores or placed on separate tiles.
b) As a result, socket_id can represent different scopes (full socket vs. sub-socket domains), making it inconsistent.

Hence practically: In some configurations, socket_id ≈ memory domain. In others, it is coarser than the actual memory bus topology.
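
To make the distinction concrete, here is a small standalone C sketch. The table is made up for illustration (two physical packages exposing four NUMA domains each, roughly what an NPS4- or SNC4-style configuration might look like), and none of the names are DPDK or hwloc APIs. It only shows that "same package" and "same NUMA domain" stop being the same question in these modes.

```c
#include <stddef.h>

#define SKETCH_N_DOMAINS 8

/*
 * Hypothetical topology, not read from any real machine:
 * index = NUMA domain id, value = physical package (socket).
 */
static const int sketch_domain_to_package[SKETCH_N_DOMAINS] = {
	0, 0, 0, 0, 1, 1, 1, 1
};

/* Number of NUMA domains exposed by one physical package. */
static size_t
sketch_domains_in_package(int package)
{
	size_t i, n = 0;

	for (i = 0; i < SKETCH_N_DOMAINS; i++)
		if (sketch_domain_to_package[i] == package)
			n++;

	return n;
}

/* Two domains share a package: the coarse, per-socket view. */
static int
sketch_same_package(int domain_a, int domain_b)
{
	return sketch_domain_to_package[domain_a] ==
		sketch_domain_to_package[domain_b];
}

/* Two domains are the same allocation-locality unit: the fine view. */
static int
sketch_same_domain(int domain_a, int domain_b)
{
	return domain_a == domain_b;
}
```

With this table, domains 0 and 3 sit in the same package but are different NUMA domains, so an identifier that only names the package cannot tell them apart.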

To address this ambiguity, in the topology patches (v5/v6), we are moving toward clearer separation:

a. Cache domains (L1/L2/L3/L4) for compute locality
b. NUMA domains (memory + IO) as the unit for allocation locality

This direction better reflects real hardware and avoids overloading socket_id with multiple meanings.

Happy to align this with the topology model we’re introducing so the abstraction remains consistent going forward.
Thanks,
Vipin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC v3 2/3] lib: add fastmem library
  2026-05-28 14:45             ` Varghese, Vipin
@ 2026-05-28 19:56               ` Morten Brørup
  2026-05-29 14:29                 ` Varghese, Vipin
  2026-05-30 16:22                 ` Mattias Rönnblom
  0 siblings, 2 replies; 38+ messages in thread
From: Morten Brørup @ 2026-05-28 19:56 UTC (permalink / raw)
  To: Varghese, Vipin, Bruce Richardson, Mattias Rönnblom, dev
  Cc: Konstantin Ananyev, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger

> From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
> Sent: Thursday, 28 May 2026 16.45
> 
> Public
> 
> Hi @Morten Brørup
> 
> <snipped>
> 
> >
> > > +/**
> > > + * Pre-reserve backing memory.
> > > + *
> > > + * Ensures that at least @p size bytes of memzone-backed memory
> are
> > > + * available to the allocator on @p socket_id, reserving
> additional
> > > + * memzones from EAL as needed to reach that total. Subsequent
> > > + * allocations served from the pre-reserved memory do not incur
> > > + * memzone-reservation cost.
> > > + *
> > > + * The reservation is cumulative: repeated calls to
> > > + * rte_fastmem_reserve() with the same @p socket_id grow the
> > > + * reservation monotonically. Reserved memory is never returned to
> > > + * the system during the allocator's lifetime.
> > > + *
> > > + * A typical use is to call rte_fastmem_reserve() once at
> > > + * application startup, with a size chosen to cover the expected
> > > + * steady-state working set. Allocations and frees during
> > > + * steady-state operation then avoid memzone reservations
> entirely.
> > > + *
> > > + * @param size
> > > + *  The minimum amount of backing memory, in bytes, to make
> > > + *  available on @p socket_id. The allocator may reserve more than
> > > + *  the requested amount due to internal rounding (e.g., to
> memzone
> > > + *  or block granularity).
> > > + *
> > > + * @param socket_id
> > > + *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
> > > + *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
> > > + *  allocator starts on the calling lcore's socket (or the first
> > > + *  configured socket if the caller is not bound to one) and falls
> > > + *  back to other sockets if the preferred socket cannot satisfy
> > > + *  the reservation.
> > > + *
> > > + * @return
> > > + *  - 0: Success.
> > > + *  - -ENOMEM: Insufficient huge-page memory to satisfy the
> request.
> > > + *  - -EINVAL: Invalid @p socket_id.
> > > + */
> > > +__rte_experimental
> > > +int
> > > +rte_fastmem_reserve(size_t size, int socket_id);
> >
> > @Bruce,
> > I vaguely recall that we discussed something about busses and sockets
> a long time
> > ago, but I cannot remember the details.
> > Is socket_id the right type (and parameter name) to identify a memory
> bus?
> >
> > @Vipin,
> > You have been working on topology awareness. Same question to you:
> > Is socket_id the right type (and parameter name) to identify a memory
> bus?
> 
> Short answer: socket_id is no longer a precise or sufficient
> abstraction to represent a memory bus.
> Based on the topology work with libhwloc, we’ve observed the following
> across Ampere, Intel, and AMD platforms:
> 
> Features like SNC (Sub-NUMA Clustering) on Intel and NPS (NUMA nodes
> Per Socket) on AMD change how socket_id maps to hardware.
> In these modes:
> 
> 1) A single physical socket can expose multiple NUMA domains.
> 2) These NUMA domains align more closely with memory controller
> groupings (i.e., memory buses) rather than the full socket.
> 
> 
> Depending on the architecture:
> a) Memory controllers may be collocated with compute cores or placed on
> separate tiles.
> b) As a result, socket_id can represent different scopes (full socket
> vs. sub-socket domains), making it inconsistent.
> 
> 
> 
> Hence practically: In some configurations, socket_id ≈ memory domain.
> In others, it is coarser than the actual memory bus topology.
> 
> To address this ambiguity, in the topology patches (v5/v6), we are
> moving toward clearer separation:
> 
> a. Cache domains (L1/L2/L3/L4) for compute locality
> b. NUMA domains (memory + IO) as the unit for allocation locality
> 
> This direction better reflects real hardware and avoids overloading
> socket_id with multiple meanings.
> 
> Happy to align this with the topology model we’re introducing so the
> abstraction remains consistent going forward.
> Thanks,
> Vipin

Thank you for the quick and detailed response, Vipin!

I haven't looked deeply into the v5/v6 topology patches yet (it's on my TODO list).

The rte_fastmem library builds on top of the rte_memzone library.

So, if the rte_memzone library is updated to replace the meaning of its "socket_id" parameter with some NUMA domain identifier (we had better rename "socket_id" to "numa_domain_id"), then the rte_fastmem library could remain unaffected, and its "socket_id" parameter would be passed on directly to the rte_memzone library's "numa_domain_id"?

This is my conclusion: At this point, proper support for allocating memory in specific NUMA domains is an rte_memzone library issue, and nothing to worry about for the rte_fastmem library - it will be automatically supported in rte_fastmem when supported by rte_memzone.
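
A minimal standalone sketch of what I mean, under stated assumptions: the 128 MiB memzone size is taken from the cover letter, the round-up matches what the test suite expects (1 byte gives one memzone, 128 MiB + 1 gives two), and every name below is hypothetical. stub_memzone_reserve() merely stands in for the memzone layer; it is not the rte_memzone API, and this is not the patch's implementation.

```c
#include <stddef.h>

/* 128 MiB backing memzones, as stated in the cover letter. */
#define SKETCH_MEMZONE_SIZE ((size_t)128 << 20)

/* Whole memzones needed to cover 'size' bytes (rounds up). */
static size_t
sketch_memzones_needed(size_t size)
{
	return size / SKETCH_MEMZONE_SIZE +
		(size % SKETCH_MEMZONE_SIZE != 0);
}

/* Bookkeeping for the stand-in memzone layer. */
static int stub_last_id = -2;
static size_t stub_calls;

/* Stand-in for the memzone layer: records the id it was handed. */
static int
stub_memzone_reserve(size_t len, int id)
{
	(void)len;
	stub_last_id = id;
	stub_calls++;

	return 0;
}

/*
 * Reserve enough memzones to cover 'size' bytes. The id is
 * forwarded untouched, so whatever it means (socket, NUMA
 * domain, node) is decided entirely by the layer below.
 */
static int
sketch_reserve(size_t size, int socket_id)
{
	size_t i, n = sketch_memzones_needed(size);

	for (i = 0; i < n; i++)
		if (stub_memzone_reserve(SKETCH_MEMZONE_SIZE, socket_id) != 0)
			return -1;

	return 0;
}
```

The point is only that the allocator never interprets the identifier itself, so renaming or redefining it in the memzone layer would not require logic changes here.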



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC v3 2/3] lib: add fastmem library
  2026-05-28 19:56               ` Morten Brørup
@ 2026-05-29 14:29                 ` Varghese, Vipin
  2026-05-30 16:22                 ` Mattias Rönnblom
  1 sibling, 0 replies; 38+ messages in thread
From: Varghese, Vipin @ 2026-05-29 14:29 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson, Mattias Rönnblom,
	dev@dpdk.org
  Cc: Konstantin Ananyev, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger

<snipped>

> > >
> > > > +/**
> > > > + * Pre-reserve backing memory.
> > > > + *
> > > > + * Ensures that at least @p size bytes of memzone-backed memory
> > are
> > > > + * available to the allocator on @p socket_id, reserving
> > additional
> > > > + * memzones from EAL as needed to reach that total. Subsequent
> > > > + * allocations served from the pre-reserved memory do not incur
> > > > + * memzone-reservation cost.
> > > > + *
> > > > + * The reservation is cumulative: repeated calls to
> > > > + * rte_fastmem_reserve() with the same @p socket_id grow the
> > > > + * reservation monotonically. Reserved memory is never returned
> > > > + to
> > > > + * the system during the allocator's lifetime.
> > > > + *
> > > > + * A typical use is to call rte_fastmem_reserve() once at
> > > > + * application startup, with a size chosen to cover the expected
> > > > + * steady-state working set. Allocations and frees during
> > > > + * steady-state operation then avoid memzone reservations
> > entirely.
> > > > + *
> > > > + * @param size
> > > > + *  The minimum amount of backing memory, in bytes, to make
> > > > + *  available on @p socket_id. The allocator may reserve more
> > > > + than
> > > > + *  the requested amount due to internal rounding (e.g., to
> > memzone
> > > > + *  or block granularity).
> > > > + *
> > > > + * @param socket_id
> > > > + *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
> > > > + *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
> > > > + *  allocator starts on the calling lcore's socket (or the first
> > > > + *  configured socket if the caller is not bound to one) and
> > > > + falls
> > > > + *  back to other sockets if the preferred socket cannot satisfy
> > > > + *  the reservation.
> > > > + *
> > > > + * @return
> > > > + *  - 0: Success.
> > > > + *  - -ENOMEM: Insufficient huge-page memory to satisfy the
> > request.
> > > > + *  - -EINVAL: Invalid @p socket_id.
> > > > + */
> > > > +__rte_experimental
> > > > +int
> > > > +rte_fastmem_reserve(size_t size, int socket_id);
> > >
> > > @Bruce,
> > > I vaguely recall that we discussed something about busses and
> > > sockets
> > a long time
> > > ago, but I cannot remember the details.
> > > Is socket_id the right type (and parameter name) to identify a
> > > memory
> > bus?
> > >
> > > @Vipin,
> > > You have been working on topology awareness. Same question to you:
> > > Is socket_id the right type (and parameter name) to identify a
> > > memory
> > bus?
> >
> > Short answer: socket_id is no longer a precise or sufficient
> > abstraction to represent a memory bus.
> > Based on the topology work with libhwloc, we’ve observed the following
> > across Ampere, Intel, and AMD platforms:
> >
> > Features like SNC (Sub-NUMA Clustering) on Intel and NPS (NUMA nodes
> > Per Socket) on AMD change how socket_id maps to hardware.
> > In these modes:
> >
> > 1) A single physical socket can expose multiple NUMA domains.
> > 2) These NUMA domains align more closely with memory controller
> > groupings (i.e., memory buses) rather than the full socket.
> >
> >
> > Depending on the architecture:
> > a) Memory controllers may be collocated with compute cores or placed
> > on separate tiles.
> > b) As a result, socket_id can represent different scopes (full socket
> > vs. sub-socket domains), making it inconsistent.
> >
> >
> >
> > Hence practically: In some configurations, socket_id ≈ memory domain.
> > In others, it is coarser than the actual memory bus topology.
> >
> > To address this ambiguity, in the topology patches (v5/v6), we are
> > moving toward clearer separation:
> >
> > a. Cache domains (L1/L2/L3/L4) for compute locality
> > b. NUMA domains (memory + IO) as the unit for allocation locality
> >
> > This direction better reflects real hardware and avoids overloading
> > socket_id with multiple meanings.
> >
> > Happy to align this with the topology model we’re introducing so the
> > abstraction remains consistent going forward.
> > Thanks,
> > Vipin
>
> Thank you for the quick and detailed response, Vipin!
>
> I haven't looked deeply into the v5/v6 topology patches yet (it's on my TODO list).
>
> The rte_fastmem library builds on top of the rte_memzone library.
>
> So, if the rte_memzone library is updated to replace the meaning of its "socket_id"
> parameter with some NUMA domain identifier (we better rename the "socket_id" to a
> new name "numa_domain_id"), then the rte_fastmem library could remain
> unaffected, and its "socket_id" parameter would be passed on directly to the
> rte_memzone library's "numa_domain_id"?

Valid point. I am finishing the last changes for v6 and will push it out by the weekend.
+1 to changing the terminology.

>
> This is my conclusion: At this point, proper support for allocating memory in specific
> NUMA domains is an rte_memzone library issue, and nothing to worry about for the
> rte_fastmem library - it will be automatically supported in rte_fastmem when
> supported by rte_memzone.
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC v3 2/3] lib: add fastmem library
  2026-05-28 19:56               ` Morten Brørup
  2026-05-29 14:29                 ` Varghese, Vipin
@ 2026-05-30 16:22                 ` Mattias Rönnblom
  1 sibling, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-30 16:22 UTC (permalink / raw)
  To: Morten Brørup, Varghese, Vipin, Bruce Richardson, dev
  Cc: Konstantin Ananyev, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger

On 5/28/26 21:56, Morten Brørup wrote:
>> From: Varghese, Vipin [mailto:Vipin.Varghese@amd.com]
>> Sent: Thursday, 28 May 2026 16.45
>>
>> Public
>>
>> Hi @Morten Brørup
>>
>> <snipped>
>>
>>>
>>>> +/**
>>>> + * Pre-reserve backing memory.
>>>> + *
>>>> + * Ensures that at least @p size bytes of memzone-backed memory
>> are
>>>> + * available to the allocator on @p socket_id, reserving
>> additional
>>>> + * memzones from EAL as needed to reach that total. Subsequent
>>>> + * allocations served from the pre-reserved memory do not incur
>>>> + * memzone-reservation cost.
>>>> + *
>>>> + * The reservation is cumulative: repeated calls to
>>>> + * rte_fastmem_reserve() with the same @p socket_id grow the
>>>> + * reservation monotonically. Reserved memory is never returned to
>>>> + * the system during the allocator's lifetime.
>>>> + *
>>>> + * A typical use is to call rte_fastmem_reserve() once at
>>>> + * application startup, with a size chosen to cover the expected
>>>> + * steady-state working set. Allocations and frees during
>>>> + * steady-state operation then avoid memzone reservations
>> entirely.
>>>> + *
>>>> + * @param size
>>>> + *  The minimum amount of backing memory, in bytes, to make
>>>> + *  available on @p socket_id. The allocator may reserve more than
>>>> + *  the requested amount due to internal rounding (e.g., to
>> memzone
>>>> + *  or block granularity).
>>>> + *
>>>> + * @param socket_id
>>>> + *  The NUMA socket on which to reserve memory, or SOCKET_ID_ANY
>>>> + *  to leave the choice to the allocator. With SOCKET_ID_ANY, the
>>>> + *  allocator starts on the calling lcore's socket (or the first
>>>> + *  configured socket if the caller is not bound to one) and falls
>>>> + *  back to other sockets if the preferred socket cannot satisfy
>>>> + *  the reservation.
>>>> + *
>>>> + * @return
>>>> + *  - 0: Success.
>>>> + *  - -ENOMEM: Insufficient huge-page memory to satisfy the
>> request.
>>>> + *  - -EINVAL: Invalid @p socket_id.
>>>> + */
>>>> +__rte_experimental
>>>> +int
>>>> +rte_fastmem_reserve(size_t size, int socket_id);
>>>
>>> @Bruce,
>>> I vaguely recall that we discussed something about busses and sockets
>> a long time
>>> ago, but I cannot remember the details.
>>> Is socket_id the right type (and parameter name) to identify a memory
>> bus?
>>>
>>> @Vipin,
>>> You have been working on topology awareness. Same question to you:
>>> Is socket_id the right type (and parameter name) to identify a memory
>> bus?
>>
>> Short answer: socket_id is no longer a precise or sufficient
>> abstraction to represent a memory bus.
>> Based on the topology work with libhwloc, we’ve observed the following
>> across Ampere, Intel, and AMD platforms:
>>
>> Features like SNC (Sub-NUMA Clustering) on Intel and NPS (NUMA nodes
>> Per Socket) on AMD change how socket_id maps to hardware.
>> In these modes:
>>
>> 1) A single physical socket can expose multiple NUMA domains.
>> 2) These NUMA domains align more closely with memory controller
>> groupings (i.e., memory buses) rather than the full socket.
>>
>>
>> Depending on the architecture:
>> a) Memory controllers may be collocated with compute cores or placed on
>> separate tiles.
>> b) As a result, socket_id can represent different scopes (full socket
>> vs. sub-socket domains), making it inconsistent.
>>
>>
>>
>> Hence practically: In some configurations, socket_id ≈ memory domain.
>> In others, it is coarser than the actual memory bus topology.
>>
>> To address this ambiguity, in the topology patches (v5/v6), we are
>> moving toward clearer separation:
>>
>> a. Cache domains (L1/L2/L3/L4) for compute locality
>> b. NUMA domains (memory + IO) as the unit for allocation locality
>>
>> This direction better reflects real hardware and avoids overloading
>> socket_id with multiple meanings.
>>
>> Happy to align this with the topology model we’re introducing so the
>> abstraction remains consistent going forward.
>> Thanks,
>> Vipin
> 
> Thank you for the quick and detailed response, Vipin!
> 
> I haven't looked deeply into the v5/v6 topology patches yet (it's on my TODO list).
> 
> The rte_fastmem library builds on top of the rte_memzone library.
> 
> So, if the rte_memzone library is updated to replace the meaning of its "socket_id" parameter with some NUMA domain identifier (we better rename the "socket_id" to a new name "numa_domain_id"), then the rte_fastmem library could remain unaffected, and its "socket_id" parameter would be passed on directly to the rte_memzone library's "numa_domain_id"?
> 

What is a "domain" here? Is it the same as what is usually (in my
experience) referred to as a NUMA node?

I would just call it "node_id".

> This is my conclusion: At this point, proper support for allocating memory in specific NUMA domains is an rte_memzone library issue, and nothing to worry about for the rte_fastmem library - it will be automatically supported in rte_fastmem when supported by rte_memzone.
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* [RFC v3 3/3] app/test: add fastmem test suite
  2026-05-27 17:30       ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
  2026-05-27 17:30         ` [RFC v3 1/3] doc: add fastmem programming guide Mattias Rönnblom
  2026-05-27 17:30         ` [RFC v3 2/3] lib: add fastmem library Mattias Rönnblom
@ 2026-05-27 17:30         ` Mattias Rönnblom
  2026-05-28  9:02         ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Morten Brørup
  3 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-27 17:30 UTC (permalink / raw)
  To: dev
  Cc: Morten Brørup, Konstantin Ananyev, Mattias Rönnblom,
	Yogaraj Baskaravel, Stephen Hemminger, Bruce Richardson,
	Mattias Rönnblom

Add functional, performance, and profiling test suites for the
fastmem library.

--

RFC v3:
 * Add realloc test cases (same class, grow, shrink, NULL ptr,
   zero size, too big, invalid align).
 * Merge lifecycle and functional test suites into one.
 * Suppress -Wuse-after-free in test_alloc_reuse (intentional
   pointer comparison after free).

RFC v2:
 * Add test_alloc_cross_socket_deinit exercising cross-socket
   teardown path.
 * Remove trailing double blank lines in test_fastmem.c.

Signed-off-by: Mattias Rönnblom <hofors@lysator.liu.se>
---
 app/test/meson.build            |    3 +
 app/test/test_fastmem.c         | 1801 +++++++++++++++++++++++++++++++
 app/test/test_fastmem_perf.c    | 1040 ++++++++++++++++++
 app/test/test_fastmem_profile.c |  157 +++
 4 files changed, 3001 insertions(+)
 create mode 100644 app/test/test_fastmem.c
 create mode 100644 app/test/test_fastmem_perf.c
 create mode 100644 app/test/test_fastmem_profile.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d458f9c07..d11c63be6f 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -82,6 +82,9 @@ source_file_deps = {
     'test_event_vector_adapter.c': ['eventdev', 'bus_vdev'],
     'test_eventdev.c': ['eventdev', 'bus_vdev'],
     'test_external_mem.c': [],
+    'test_fastmem.c': ['fastmem'],
+    'test_fastmem_perf.c': ['fastmem', 'mempool'],
+    'test_fastmem_profile.c': ['fastmem'],
     'test_fbarray.c': [],
     'test_fib.c': ['net', 'fib'],
     'test_fib6.c': ['rib', 'fib'],
diff --git a/app/test/test_fastmem.c b/app/test/test_fastmem.c
new file mode 100644
index 0000000000..3ec9022b2d
--- /dev/null
+++ b/app/test/test_fastmem.c
@@ -0,0 +1,1801 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <errno.h>
+#include <inttypes.h>
+#include <stdalign.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_thread.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define FASTMEM_MEMZONE_SIZE (128U << 20)
+
+/*
+ * Count memzones whose names begin with the fastmem prefix.
+ * Used to verify that rte_fastmem_reserve() really did reserve
+ * backing memzones.
+ */
+static int fastmem_memzone_count;
+
+static void
+count_fastmem_memzones_walk(const struct rte_memzone *mz, void *arg)
+{
+	RTE_SET_USED(arg);
+
+	if (strncmp(mz->name, "fastmem_", strlen("fastmem_")) == 0)
+		fastmem_memzone_count++;
+}
+
+static unsigned int
+count_fastmem_memzones(void)
+{
+	fastmem_memzone_count = 0;
+	rte_memzone_walk(count_fastmem_memzones_walk, NULL);
+	return fastmem_memzone_count;
+}
+
+static int
+test_init_deinit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	/* A subsequent init/deinit cycle must succeed. */
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "second rte_fastmem_init() failed: %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_init_is_not_idempotent(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_init() failed: %d", rc);
+
+	rc = rte_fastmem_init();
+	TEST_ASSERT_EQUAL(rc, -EBUSY,
+		"expected -EBUSY on re-init, got %d", rc);
+
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_deinit_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_deinit();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_max_size(void)
+{
+	size_t max;
+
+	max = rte_fastmem_max_size();
+	TEST_ASSERT(max >= (1U << 20),
+		"max_size=%zu below required 1 MiB minimum", max);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_small(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * A small reserve request (1 byte) must result in exactly
+	 * one memzone reservation: the internal rounding is to
+	 * memzone granularity.
+	 */
+	rc = rte_fastmem_reserve(1, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve() failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	rte_fastmem_deinit();
+
+	/* After deinit the memzones must be released. */
+	TEST_ASSERT_EQUAL(count_fastmem_memzones(), 0,
+		"%u fastmem memzones leaked after deinit",
+		count_fastmem_memzones());
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_multiple_memzones(void)
+{
+	int socket_id;
+	unsigned int before, after;
+	size_t reserve_size;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * Request just over one memzone's worth; this must force
+	 * a second memzone to be reserved.
+	 */
+	reserve_size = FASTMEM_MEMZONE_SIZE + 1;
+	rc = rte_fastmem_reserve(reserve_size, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_reserve(%zu) failed: %d",
+		reserve_size, rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 2,
+		"expected 2 new memzones for %zu-byte reserve, got %u",
+		reserve_size, after - before);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_cumulative(void)
+{
+	int socket_id;
+	unsigned int after_first, after_second;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	after_first = count_fastmem_memzones();
+
+	/*
+	 * A second call requesting the same amount that's already
+	 * reserved must not trigger any new memzone reservation.
+	 */
+	rc = rte_fastmem_reserve(FASTMEM_MEMZONE_SIZE, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "second reserve failed: %d", rc);
+
+	after_second = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after_first, after_second,
+		"reserve of already-reserved amount added memzones (%u -> %u)",
+		after_first, after_second);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_invalid_socket(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, RTE_MAX_NUMA_NODES);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for out-of-range socket, got %d", rc);
+
+	rc = rte_fastmem_reserve(1, -2);
+	TEST_ASSERT_EQUAL(rc, -EINVAL,
+		"expected -EINVAL for negative socket, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_without_init(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0,
+		"expected failure without init, got %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_reserve_any_socket(void)
+{
+	unsigned int before, after;
+	int rc;
+
+	before = count_fastmem_memzones();
+
+	/*
+	 * SOCKET_ID_ANY should succeed on any system with at least
+	 * one configured socket. The allocator picks the caller's
+	 * socket first and falls back to other sockets if needed.
+	 */
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0,
+		"rte_fastmem_reserve(SOCKET_ID_ANY) failed: %d", rc);
+
+	after = count_fastmem_memzones();
+	TEST_ASSERT_EQUAL(after - before, 1,
+		"expected 1 new memzone, got %u", after - before);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 2 tests: allocation and free.
+ */
+
+static int
+test_alloc_too_big(void)
+{
+	void *p;
+	rte_errno = 0;
+	p = rte_fastmem_alloc(rte_fastmem_max_size() + 1, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc above max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_invalid_align(void)
+{
+	void *p;
+	rte_errno = 0;
+	p = rte_fastmem_alloc(16, 3, 0); /* 3 is not a power of 2 */
+	TEST_ASSERT_NULL(p, "alloc with align=3 returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, EINVAL,
+		"expected rte_errno=EINVAL, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_small(void)
+{
+	void *p;
+	p = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8) failed: rte_errno=%d", rte_errno);
+
+	/* Writing into the object must not crash. */
+	memset(p, 0xa5, 8);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_free_various_sizes(void)
+{
+	static const size_t sizes[] = {
+		1, 8, 16, 17, 63, 64, 128, 1024, 4096,
+		64 * 1024, 256 * 1024, 1024 * 1024,
+	};
+	void *ptrs[RTE_DIM(sizes)];
+	unsigned int i;
+	for (i = 0; i < RTE_DIM(sizes); i++) {
+		ptrs[i] = rte_fastmem_alloc(sizes[i], 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc(%zu) failed: rte_errno=%d",
+			sizes[i], rte_errno);
+		memset(ptrs[i], 0x5a, sizes[i]);
+	}
+
+	for (i = 0; i < RTE_DIM(sizes); i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_alignment(void)
+{
+	static const size_t aligns[] = {
+		8, 16, 64, 256, 4096, 65536,
+	};
+	unsigned int i;
+	for (i = 0; i < RTE_DIM(aligns); i++) {
+		void *p = rte_fastmem_alloc(1, aligns[i], 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=%zu) failed: rte_errno=%d",
+			aligns[i], rte_errno);
+		TEST_ASSERT((uintptr_t)p % aligns[i] == 0,
+			"pointer %p not aligned on %zu",
+			p, aligns[i]);
+		rte_fastmem_free(p);
+	}
+
+	/* Default (align=0) gives at least RTE_CACHE_LINE_SIZE. */
+	{
+		void *p = rte_fastmem_alloc(1, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(p,
+			"alloc(1, align=0) failed: rte_errno=%d", rte_errno);
+		TEST_ASSERT((uintptr_t)p % RTE_CACHE_LINE_SIZE == 0,
+			"default-align pointer %p not cache-line aligned",
+			p);
+		rte_fastmem_free(p);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_zero_flag(void)
+{
+	uint8_t *p;
+	unsigned int i;
+	bool all_zero = true;
+
+	/*
+	 * Dirty a slab first by allocating without F_ZERO, writing
+	 * a non-zero pattern, and freeing. A subsequent F_ZERO
+	 * allocation on the same slab must return zeroed memory.
+	 */
+	p = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "priming alloc failed");
+	memset(p, 0xff, 128);
+	rte_fastmem_free(p);
+
+	p = rte_fastmem_alloc(128, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_NOT_NULL(p, "F_ZERO alloc failed");
+	for (i = 0; i < 128; i++) {
+		if (p[i] != 0) {
+			all_zero = false;
+			break;
+		}
+	}
+	TEST_ASSERT(all_zero, "F_ZERO returned non-zero byte at offset %u", i);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+#if defined(__GNUC__) && !defined(__clang__)
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wuse-after-free"
+#endif
+static int
+test_alloc_reuse(void)
+{
+	void *first, *second;
+
+	first = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(first, "first alloc failed");
+	rte_fastmem_free(first);
+
+	second = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(second, "second alloc failed");
+
+	/*
+	 * The slab's free list is LIFO, so the most recently freed
+	 * object is at the head of the list. A subsequent alloc in
+	 * the same class returns it.
+	 */
+	TEST_ASSERT_EQUAL(first, second,
+		"free + alloc did not reuse: first=%p second=%p",
+		first, second);
+
+	rte_fastmem_free(second);
+
+	return TEST_SUCCESS;
+}
+#if defined(__GNUC__) && !defined(__clang__)
+#pragma GCC diagnostic pop
+#endif
+
+static int
+test_alloc_many_in_class(void)
+{
+	/*
+	 * Allocate more objects in one class than fit in a single
+	 * slab (a 2 MiB slab holds at most 262144 8-byte objects),
+	 * forcing the bin to pull a second block. This exercises the
+	 * partial->full transition and the cross-slab allocation path.
+	 */
+	enum { CLASS_SIZE = 8, COUNT = 300000 };
+	void **ptrs;
+	unsigned int i;
+
+	ptrs = calloc(COUNT, sizeof(*ptrs));
+	TEST_ASSERT_NOT_NULL(ptrs, "calloc for test ptrs failed");
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(CLASS_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d",
+			i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	free(ptrs);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket(void)
+{
+	void *p;
+	int socket_id;
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing(void)
+{
+	void *small, *large;
+
+	/*
+	 * Allocate and free a small object, forcing a block to be
+	 * assigned to the small class and then returned to the
+	 * free-block pool. A subsequent allocation in a different
+	 * class must be able to reuse that block.
+	 */
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+	rte_fastmem_free(small);
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large, "large alloc failed");
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_block_repurposing_no_growth(void)
+{
+	struct rte_fastmem_stats stats;
+	void *small, *large;
+	uint64_t after_small;
+	int rc;
+
+	/*
+	 * Stronger version of test_alloc_block_repurposing: assert
+	 * that the cross-class allocation does not grow the
+	 * backing memory (bytes_backing stays flat). Because the
+	 * free-block pool is shared across size classes — not
+	 * partitioned per class — the block freed from the small
+	 * class must serve the large allocation without triggering
+	 * a new memzone reservation.
+	 */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, (uint64_t)0,
+		"unexpected pre-alloc bytes_backing: %" PRIu64,
+		stats.bytes_backing);
+
+	small = rte_fastmem_alloc(8, 0, 0);
+	TEST_ASSERT_NOT_NULL(small, "small alloc failed");
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT(stats.bytes_backing > 0,
+		"bytes_backing did not grow on first alloc");
+	after_small = stats.bytes_backing;
+
+	rte_fastmem_free(small);
+	rte_fastmem_cache_flush();
+
+	large = rte_fastmem_alloc(256 * 1024, 0, 0);
+	TEST_ASSERT_NOT_NULL(large,
+		"large alloc failed: rte_errno=%d", rte_errno);
+
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_stats() failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_backing, after_small,
+		"cross-class alloc grew backing memory from %" PRIu64
+		" to %" PRIu64,
+		after_small, stats.bytes_backing);
+
+	rte_fastmem_free(large);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_null(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_free(NULL);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_content_integrity(void)
+{
+	/*
+	 * Allocate a batch of objects, fill each with a distinct
+	 * byte pattern, then verify none of the patterns overlap.
+	 * This catches header overwrites (slab header corrupted by
+	 * object access) and slot-overlap bugs (two pointers pointing
+	 * at overlapping slots).
+	 */
+	enum { N = 256, SIZE = 128 };
+	uint8_t *ptrs[N];
+	unsigned int i, j;
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)i, SIZE);
+	}
+
+	for (i = 0; i < N; i++)
+		for (j = 0; j < SIZE; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], (uint8_t)i,
+				"corruption at ptrs[%u][%u]: got 0x%x, want 0x%x",
+				i, j, ptrs[i][j], (uint8_t)i);
+
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_too_big(void)
+{
+	void *p;
+	/*
+	 * A small size with an alignment larger than the maximum
+	 * size class cannot be served. The class selected must be
+	 * large enough for the alignment, but no such class exists.
+	 */
+	rte_errno = 0;
+	p = rte_fastmem_alloc(1, rte_fastmem_max_size() * 2, 0);
+	TEST_ASSERT_NULL(p,
+		"alloc with align>max_size returned non-NULL");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG,
+		"expected rte_errno=E2BIG, got %d", rte_errno);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_align_one(void)
+{
+	void *p;
+	/* align=1 is a valid power of 2 and must be accepted. */
+	p = rte_fastmem_alloc(8, 1, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc(8, 1) failed: rte_errno=%d",
+		rte_errno);
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_socket_numa_placement(void)
+{
+	void *p;
+	int socket_id;
+	struct rte_memseg *ms;
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no available sockets");
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, socket_id);
+	TEST_ASSERT_NOT_NULL(p,
+		"alloc_socket(%d) failed: rte_errno=%d",
+		socket_id, rte_errno);
+
+	/*
+	 * Look up the memseg backing this pointer and verify its
+	 * socket. Skip the check if the lookup fails (e.g., in
+	 * --no-huge mode fastmem's allocations may not be backed by
+	 * a memseg that rte_mem_virt2memseg() can find).
+	 */
+	ms = rte_mem_virt2memseg(p, NULL);
+	if (ms != NULL) {
+		TEST_ASSERT_EQUAL(ms->socket_id, socket_id,
+			"alloc on socket %d landed on socket %d",
+			socket_id, ms->socket_id);
+	}
+
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Allocate from a socket different from the calling lcore's socket,
+ * triggering a cross-socket cache allocation. Then deinit to exercise
+ * the teardown path where a cache's backing memory lives on a
+ * different socket than the one it serves.
+ */
+static int
+test_alloc_cross_socket_deinit(void)
+{
+	int local_sid, remote_sid;
+	unsigned int i, n_sockets;
+	void *p;
+
+	local_sid = (int)rte_socket_id();
+	if (local_sid < 0 || (unsigned int)local_sid >= RTE_MAX_NUMA_NODES)
+		local_sid = rte_socket_id_by_idx(0);
+
+	n_sockets = rte_socket_count();
+	if (n_sockets < 2)
+		return TEST_SKIPPED;
+
+	/* Find a socket different from the local one. */
+	remote_sid = -1;
+	for (i = 0; i < n_sockets; i++) {
+		int sid = rte_socket_id_by_idx(i);
+		if (sid >= 0 && sid != local_sid) {
+			remote_sid = sid;
+			break;
+		}
+	}
+	if (remote_sid < 0)
+		return TEST_SKIPPED;
+
+	p = rte_fastmem_alloc_socket(64, 0, 0, remote_sid);
+	TEST_ASSERT_NOT_NULL(p,
+		"cross-socket alloc(socket %d) failed: rte_errno=%d",
+		remote_sid, rte_errno);
+
+	rte_fastmem_free(p);
+
+	/*
+	 * Tear down and re-init to exercise the deinit path with cross-socket caches.
+	 */
+	rte_fastmem_deinit();
+
+	TEST_ASSERT_EQUAL(rte_fastmem_init(), 0,
+		"re-init after cross-socket deinit failed");
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Stage 3 tests: per-lcore caches.
+ */
+
+static int
+test_cache_flush(void)
+{
+	void *p;
+	/*
+	 * Alloc and free one object, leaving it in the cache, then
+	 * flush. A subsequent alloc may or may not return the same
+	 * pointer, so nothing is asserted about its identity; the
+	 * test only checks that the flush does not crash and that
+	 * a follow-up alloc still works.
+	 */
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "first alloc failed");
+	rte_fastmem_free(p);
+
+	rte_fastmem_cache_flush();
+
+	/* Flush again — must be idempotent. */
+	rte_fastmem_cache_flush();
+
+	p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "post-flush alloc failed");
+	rte_fastmem_free(p);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_without_init(void)
+{
+	/* Must be a no-op, not a crash. */
+	rte_fastmem_cache_flush();
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_exceeds_capacity(void)
+{
+	/*
+	 * Free more objects at a single size class than the cache
+	 * capacity (64 for classes <= 4 KiB). This forces the
+	 * cache-drain slow path and verifies no corruption.
+	 */
+	enum { COUNT = 200, SIZE = 64 };
+	void *ptrs[COUNT];
+	unsigned int i;
+
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	/* Re-alloc the same count should still work. */
+	for (i = 0; i < COUNT; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i],
+			"re-alloc[%u] failed: rte_errno=%d", i, rte_errno);
+	}
+
+	for (i = 0; i < COUNT; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct non_eal_args {
+	int ok;
+	char pad[64];
+};
+
+static uint32_t
+non_eal_thread_main(void *arg)
+{
+	struct non_eal_args *args = arg;
+	uint8_t *p;
+
+	p = rte_fastmem_alloc(128, 0, 0);
+	if (p == NULL)
+		return 1;
+
+	memset(p, 0x7e, 128);
+
+	rte_fastmem_free(p);
+
+	args->ok = 1;
+	return 0;
+}
+
+static int
+test_non_eal_thread(void)
+{
+	rte_thread_t thread_id;
+	struct non_eal_args args = { 0 };
+	int rc;
+
+	rc = rte_thread_create(&thread_id, NULL, non_eal_thread_main, &args);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_create() failed: %d", rc);
+
+	rc = rte_thread_join(thread_id, NULL);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_thread_join() failed: %d", rc);
+
+	TEST_ASSERT_EQUAL(args.ok, 1,
+		"non-EAL thread did not complete alloc/free successfully");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cache_flush_returns_memory(void)
+{
+	/*
+	 * When an entire slab's worth of objects is freed, the
+	 * slab's block is returned to the free-block pool and can
+	 * be reassigned to another size class. Verify the cache
+	 * does not permanently hold objects that prevent this.
+	 *
+	 * Allocate enough objects in one class to force multiple
+	 * slabs, free them all, then flush the cache. After the
+	 * flush, all cached objects are drained to their bins and
+	 * empty slabs are returned to the block pool.
+	 */
+	enum { N = 200, SIZE = 64 };
+	void *ptrs[N];
+	unsigned int i;
+
+	for (i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+	for (i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rte_fastmem_cache_flush();
+
+	/*
+	 * An allocation in a completely different class should
+	 * succeed now, having access to any blocks freed by the
+	 * flush.
+	 */
+	{
+		void *other = rte_fastmem_alloc(65536, 0, 0);
+
+		TEST_ASSERT_NOT_NULL(other,
+			"post-flush cross-class alloc failed");
+		rte_fastmem_free(other);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_basic(void)
+{
+	enum { N = 32 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	/* Verify all pointers are non-NULL and distinct. */
+	for (unsigned int i = 0; i < N; i++) {
+		TEST_ASSERT_NOT_NULL(ptrs[i], "ptrs[%u] is NULL", i);
+		for (unsigned int j = 0; j < i; j++)
+			TEST_ASSERT(ptrs[i] != ptrs[j],
+				"ptrs[%u] == ptrs[%u]", i, j);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_zero_flag(void)
+{
+	enum { N = 8, SIZE = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, SIZE, 0, RTE_FASTMEM_F_ZERO);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk failed: %d", rc);
+
+	for (unsigned int i = 0; i < N; i++) {
+		uint8_t *p = ptrs[i];
+
+		for (unsigned int b = 0; b < SIZE; b++)
+			TEST_ASSERT_EQUAL(p[b], 0,
+				"ptrs[%u][%u] != 0", i, b);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_exceeds_cache(void)
+{
+	/* Allocate more than cache capacity (64) in one bulk call. */
+	enum { N = 128 };
+	void *ptrs[N];
+	int rc;
+
+	rc = rte_fastmem_alloc_bulk(ptrs, N, 64, 0, 0);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk(%u) failed: %d", N, rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_alloc_bulk_socket(void)
+{
+	enum { N = 16 };
+	void *ptrs[N];
+	int socket_id;
+	int rc;
+
+	socket_id = rte_socket_id_by_idx(0);
+	TEST_ASSERT(socket_id >= 0, "no sockets");
+
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, socket_id);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* SOCKET_ID_ANY */
+	rc = rte_fastmem_alloc_bulk_socket(ptrs, N, 64, 0, 0, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "alloc_bulk_socket(ANY) failed: %d", rc);
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_free_bulk(void)
+{
+	enum { N = 64 };
+	void *ptrs[N];
+	/* Allocate individually, free in bulk. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	/* Verify memory is reusable. */
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "re-alloc[%u] failed", i);
+	}
+
+	rte_fastmem_free_bulk(ptrs, N);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_classes(void)
+{
+	size_t sizes[32];
+	unsigned int n;
+
+	n = rte_fastmem_classes(NULL);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+
+	n = rte_fastmem_classes(sizes);
+	TEST_ASSERT_EQUAL(n, 18u, "expected 18 classes, got %u", n);
+	TEST_ASSERT_EQUAL(sizes[0], (size_t)8, "class 0 != 8");
+	TEST_ASSERT_EQUAL(sizes[n - 1], (size_t)(1 << 20),
+		"last class != 1 MiB");
+
+	for (unsigned int i = 0; i < n; i++) {
+		TEST_ASSERT(sizes[i] != 0 && (sizes[i] & (sizes[i] - 1)) == 0,
+			"class %u size %zu not power of 2", i, sizes[i]);
+		if (i > 0)
+			TEST_ASSERT(sizes[i] > sizes[i - 1],
+				"classes not ascending at %u", i);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_class(void)
+{
+	enum { N = 10 };
+	struct rte_fastmem_class_stats cs;
+	void *ptrs[N];
+	int rc;
+
+	for (unsigned int i = 0; i < N; i++) {
+		ptrs[i] = rte_fastmem_alloc(64, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+	}
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.class_size, (size_t)64, "wrong class_size");
+	TEST_ASSERT(cs.alloc_cache_hits + cs.alloc_cache_misses == N,
+		"alloc count != N: hits=%" PRIu64 " misses=%" PRIu64,
+		cs.alloc_cache_hits, cs.alloc_cache_misses);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)N, "in_use != N");
+
+	for (unsigned int i = 0; i < N; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	rc = rte_fastmem_stats_class(64, &cs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_class after free failed: %d", rc);
+	TEST_ASSERT_EQUAL(cs.in_use, (uint64_t)0, "in_use != 0 after free");
+
+	/* Invalid class size. */
+	rc = rte_fastmem_stats_class(13, &cs);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad size");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore(void)
+{
+	struct rte_fastmem_lcore_stats ls;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(128, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore failed: %d", rc);
+	TEST_ASSERT(ls.alloc_cache_hits + ls.alloc_cache_misses > 0,
+		"no alloc activity on this lcore");
+
+	rte_fastmem_free(ptr);
+
+	rc = rte_fastmem_stats_lcore(rte_lcore_id(), &ls);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore after free failed: %d", rc);
+	TEST_ASSERT(ls.free_cache_hits + ls.free_cache_misses > 0,
+		"no free activity on this lcore");
+
+	/* Invalid lcore. */
+	rc = rte_fastmem_stats_lcore(RTE_MAX_LCORE, &ls);
+	TEST_ASSERT_EQUAL(rc, -EINVAL, "expected -EINVAL for bad lcore");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_lcore_class(void)
+{
+	struct rte_fastmem_lcore_class_stats lcs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(256, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	rc = rte_fastmem_stats_lcore_class(rte_lcore_id(), 256, &lcs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats_lcore_class failed: %d", rc);
+	TEST_ASSERT_EQUAL(lcs.class_size, (size_t)256, "wrong class_size");
+	TEST_ASSERT(lcs.alloc_cache_hits + lcs.alloc_cache_misses > 0,
+		"no alloc activity");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_stats_reset(void)
+{
+	struct rte_fastmem_stats gs;
+	void *ptr;
+	int rc;
+
+	ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+	rte_fastmem_free(ptr);
+
+	rte_fastmem_stats_reset();
+
+	rc = rte_fastmem_stats(&gs);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(gs.alloc_total, (uint64_t)0,
+		"alloc_total not zero after reset");
+	TEST_ASSERT_EQUAL(gs.free_total, (uint64_t)0,
+		"free_total not zero after reset");
+
+	return TEST_SUCCESS;
+}
+
+
+#define MIXED_LONG_LIVED_COUNT 25
+#define MIXED_SHORT_LIVED_ITERS 1000
+#define MIXED_MIN_LCORES 3
+
+static const size_t mixed_long_sizes[] = { 64, 256, 4096 };
+static const size_t mixed_short_sizes[] = { 8, 16, 32, 64, 128, 256, 512, 1024 };
+
+struct mixed_worker_args {
+	uint32_t seed;
+	int result;
+};
+
+static uint32_t
+xorshift32(uint32_t *state)
+{
+	uint32_t x = *state;
+
+	x ^= x << 13;
+	x ^= x >> 17;
+	x ^= x << 5;
+	*state = x;
+	return x;
+}
+
+static int
+mixed_worker(void *arg)
+{
+	struct mixed_worker_args *args = arg;
+	uint32_t seed = args->seed;
+	void *long_lived[MIXED_LONG_LIVED_COUNT];
+	size_t long_sizes[MIXED_LONG_LIVED_COUNT];
+	unsigned int i;
+
+	/* Allocate long-lived objects of mixed sizes. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		long_sizes[i] = mixed_long_sizes[i % RTE_DIM(mixed_long_sizes)];
+		long_lived[i] = rte_fastmem_alloc(long_sizes[i], 0, 0);
+		if (long_lived[i] == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(long_lived[i], (int)(i + 1), long_sizes[i]);
+	}
+
+	/* Rapidly cycle short-lived objects. */
+	for (i = 0; i < MIXED_SHORT_LIVED_ITERS; i++) {
+		size_t sz = mixed_short_sizes[xorshift32(&seed) %
+					      RTE_DIM(mixed_short_sizes)];
+		uint8_t pattern = (uint8_t)(i & 0xff);
+		uint8_t *p;
+
+		p = rte_fastmem_alloc(sz, 0, 0);
+		if (p == NULL) {
+			args->result = TEST_FAILED;
+			return -1;
+		}
+		memset(p, pattern, sz);
+
+		/* Verify before freeing. */
+		for (size_t j = 0; j < sz; j++) {
+			if (p[j] != pattern) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(p);
+	}
+
+	/* Verify long-lived objects are still intact. */
+	for (i = 0; i < MIXED_LONG_LIVED_COUNT; i++) {
+		uint8_t *bytes = long_lived[i];
+		uint8_t expected = (uint8_t)(i + 1);
+
+		for (size_t j = 0; j < long_sizes[i]; j++) {
+			if (bytes[j] != expected) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(long_lived[i]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_mixed_lifetimes_multi_lcore(void)
+{
+	struct mixed_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int count = 0;
+	struct rte_fastmem_stats stats;
+	int rc;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		count++;
+
+	if (count < MIXED_MIN_LCORES) {
+		printf("Not enough worker lcores (%u < %u), skipping\n",
+		       count, MIXED_MIN_LCORES);
+		return TEST_SKIPPED;
+	}
+
+	/* Launch workers with distinct seeds. */
+	uint32_t seed = 0xdeadbeef;
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].seed = seed;
+		args[lcore_id].result = TEST_FAILED;
+		seed += 0x12345678;
+		rte_eal_remote_launch(mixed_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	/* Check all workers succeeded. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	/* Verify no memory leak. */
+	rc = rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(rc, 0, "stats failed: %d", rc);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero after test: %" PRIu64,
+		stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+
+/*
+ * Memory limit tests.
+ *
+ * FASTMEM_MEMZONE_SIZE is 128 MiB. We use a limit of 128 MiB
+ * (one memzone) for most tests, and large objects (256 KiB) to
+ * exhaust slabs quickly.
+ */
+
+#define LIMIT_ONE_MZ ((size_t)128 << 20)
+#define LIMIT_OBJ_SIZE ((size_t)256 * 1024)
+
+static int
+test_memory_limit_basic(void)
+{
+	int rc;
+
+	rc = rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+	TEST_ASSERT_EQUAL(rc, 0, "rte_fastmem_set_limit() failed: %d", rc);
+
+	const size_t got = rte_fastmem_get_limit(0);
+	TEST_ASSERT_EQUAL(got, LIMIT_ONE_MZ,
+		"rte_fastmem_get_limit() mismatch: %zu", got);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "first reserve failed: %d", rc);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ + 1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "second reserve should have failed");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_exhaustion(void)
+{
+	const unsigned int max_ptrs = 1024;
+	void *ptrs[max_ptrs];
+	unsigned int count = 0;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < max_ptrs; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+
+	TEST_ASSERT(count > 0, "should have allocated at least one");
+	TEST_ASSERT(count < max_ptrs, "should have hit the limit");
+	TEST_ASSERT_EQUAL(rte_errno, ENOMEM, "expected ENOMEM, got %d", rte_errno);
+
+	rte_fastmem_free(ptrs[count - 1]);
+	void *p = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc after free should succeed");
+	rte_fastmem_free(p);
+
+	for (unsigned int i = 0; i < count - 1; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_zero_blocks_growth(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+
+	rc = rte_fastmem_reserve(1, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "reserve with limit=0 should fail");
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NULL(p, "alloc with limit=0 should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_below_current(void)
+{
+	int rc;
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve failed: %d", rc);
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 1);
+
+	void *p = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(p, "alloc from existing backing should work");
+	rte_fastmem_free(p);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ * 2, SOCKET_ID_ANY);
+	TEST_ASSERT(rc < 0, "growth beyond limit should fail");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_socket_id_any(void)
+{
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 42);
+
+	for (unsigned int i = 0; i < rte_socket_count(); i++) {
+		const int sid = rte_socket_id_by_idx(i);
+		const size_t lim = rte_fastmem_get_limit(sid);
+
+		TEST_ASSERT_EQUAL(lim, (size_t)42,
+			"socket %d limit mismatch: %zu", sid, lim);
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_unlimited(void)
+{
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, 0);
+	rte_fastmem_set_limit(SOCKET_ID_ANY, SIZE_MAX);
+
+	rc = rte_fastmem_reserve(LIMIT_ONE_MZ, SOCKET_ID_ANY);
+	TEST_ASSERT_EQUAL(rc, 0, "reserve after reset failed: %d", rc);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_alloc_integrity_under_oom(void)
+{
+	const unsigned int n = 128;
+	const size_t obj_size = 1024;
+	uint8_t *ptrs[n];
+	const unsigned int extra_max = 1024;
+	void *extra[extra_max];
+	unsigned int n_extra = 0;
+	unsigned int i;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = rte_fastmem_alloc(obj_size, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "alloc[%u] failed", i);
+		memset(ptrs[i], (int)(i & 0xff), obj_size);
+	}
+
+	/* Exhaust remaining backing with large objects. */
+	for (n_extra = 0; n_extra < extra_max; n_extra++) {
+		extra[n_extra] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (extra[n_extra] == NULL)
+			break;
+	}
+
+	/* Verify original objects are intact. */
+	for (i = 0; i < n; i++) {
+		const uint8_t expected = (uint8_t)(i & 0xff);
+		for (unsigned int j = 0; j < obj_size; j++)
+			TEST_ASSERT_EQUAL(ptrs[i][j], expected,
+				"corruption at [%u][%u]", i, j);
+	}
+
+	for (i = 0; i < n; i++)
+		rte_fastmem_free(ptrs[i]);
+	for (i = 0; i < n_extra; i++)
+		rte_fastmem_free(extra[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_bulk_alloc_oom(void)
+{
+	const unsigned int bulk_n = 64;
+	const unsigned int drain_max = 512;
+	void *ptrs[bulk_n];
+	void *drain[drain_max];
+	unsigned int drained = 0;
+	int rc;
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (drained = 0; drained < drain_max; drained++) {
+		drain[drained] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (drain[drained] == NULL)
+			break;
+	}
+
+	/* Free a few — enough for some but not bulk_n objects. */
+	const unsigned int freed = RTE_MIN(drained, 4u);
+	for (unsigned int i = 0; i < freed; i++)
+		rte_fastmem_free(drain[--drained]);
+
+	rc = rte_fastmem_alloc_bulk(ptrs, bulk_n, LIMIT_OBJ_SIZE, 0, 0);
+	TEST_ASSERT(rc < 0, "bulk alloc should fail");
+
+	for (unsigned int i = 0; i < drained; i++)
+		rte_fastmem_free(drain[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_memory_limit_recovery_after_free(void)
+{
+	const unsigned int max_ptrs = 512;
+	void *ptrs[max_ptrs];
+	unsigned int count = 0;
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	for (count = 0; count < max_ptrs; count++) {
+		ptrs[count] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[count] == NULL)
+			break;
+	}
+	TEST_ASSERT(count > 0 && count < max_ptrs,
+		"expected partial fill, got %u", count);
+
+	const unsigned int half = count / 2;
+	for (unsigned int i = 0; i < half; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	for (unsigned int i = 0; i < half; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		TEST_ASSERT_NOT_NULL(ptrs[i], "recovery alloc[%u] failed", i);
+	}
+
+	for (unsigned int i = 0; i < count; i++)
+		rte_fastmem_free(ptrs[i]);
+
+	return TEST_SUCCESS;
+}
+
+struct limit_worker_args {
+	unsigned int alloc_count;
+	int result;
+};
+
+static int
+limit_worker(void *arg)
+{
+	struct limit_worker_args *args = arg;
+	const unsigned int max_ptrs = 128;
+	void *ptrs[max_ptrs];
+	unsigned int i;
+
+	args->alloc_count = 0;
+
+	for (i = 0; i < max_ptrs; i++) {
+		ptrs[i] = rte_fastmem_alloc(LIMIT_OBJ_SIZE, 0, 0);
+		if (ptrs[i] == NULL)
+			break;
+		memset(ptrs[i], 0xab, LIMIT_OBJ_SIZE);
+		args->alloc_count++;
+	}
+
+	for (unsigned int j = 0; j < args->alloc_count; j++) {
+		uint8_t *bytes = ptrs[j];
+		for (size_t k = 0; k < LIMIT_OBJ_SIZE; k++) {
+			if (bytes[k] != 0xab) {
+				args->result = TEST_FAILED;
+				return -1;
+			}
+		}
+		rte_fastmem_free(ptrs[j]);
+	}
+
+	args->result = TEST_SUCCESS;
+	return 0;
+}
+
+static int
+test_memory_limit_multi_lcore_oom(void)
+{
+	struct limit_worker_args args[RTE_MAX_LCORE];
+	unsigned int lcore_id;
+	unsigned int worker_count = 0;
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		worker_count++;
+
+	if (worker_count < 2) {
+		printf("Not enough workers (%u < 2), skipping\n", worker_count);
+		return TEST_SKIPPED;
+	}
+
+	rte_fastmem_set_limit(SOCKET_ID_ANY, LIMIT_ONE_MZ);
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		args[lcore_id].result = TEST_FAILED;
+		rte_eal_remote_launch(limit_worker, &args[lcore_id], lcore_id);
+	}
+
+	rte_eal_mp_wait_lcore();
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		TEST_ASSERT_EQUAL(args[lcore_id].result, TEST_SUCCESS,
+			"worker on lcore %u failed", lcore_id);
+	}
+
+	struct rte_fastmem_stats stats;
+	rte_fastmem_stats(&stats);
+	TEST_ASSERT_EQUAL(stats.bytes_in_use, (uint64_t)0,
+		"bytes_in_use not zero: %" PRIu64, stats.bytes_in_use);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_same_class(void)
+{
+	void *ptr = rte_fastmem_alloc(32, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	/* Realloc to a larger size within the same class (64 B class). */
+	void *ptr2 = rte_fastmem_realloc(ptr, 33, 0);
+	TEST_ASSERT_NOT_NULL(ptr2, "realloc failed");
+	TEST_ASSERT_EQUAL(ptr, ptr2,
+		"realloc returned different pointer for same class");
+
+	/* Realloc to exact class boundary — still same class. */
+	void *ptr3 = rte_fastmem_realloc(ptr2, 64, 0);
+	TEST_ASSERT_NOT_NULL(ptr3, "realloc failed");
+	TEST_ASSERT_EQUAL(ptr2, ptr3,
+		"realloc returned different pointer for same class");
+
+	rte_fastmem_free(ptr3);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_grow(void)
+{
+	const uint8_t pattern = 0xab;
+	void *ptr = rte_fastmem_alloc(16, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	memset(ptr, pattern, 16);
+
+	/* Grow beyond current class. */
+	void *ptr2 = rte_fastmem_realloc(ptr, 128, 0);
+	TEST_ASSERT_NOT_NULL(ptr2, "realloc grow failed");
+
+	/* Verify contents preserved. */
+	uint8_t *bytes = ptr2;
+	for (unsigned int i = 0; i < 16; i++)
+		TEST_ASSERT_EQUAL(bytes[i], pattern,
+			"content corrupted at byte %u", i);
+
+	rte_fastmem_free(ptr2);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_shrink(void)
+{
+	const uint8_t pattern = 0xcd;
+	void *ptr = rte_fastmem_alloc(256, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	memset(ptr, pattern, 256);
+
+	/* Shrink to a smaller class. */
+	void *ptr2 = rte_fastmem_realloc(ptr, 16, 0);
+	TEST_ASSERT_NOT_NULL(ptr2, "realloc shrink failed");
+
+	/* Verify contents preserved up to new size. */
+	uint8_t *bytes = ptr2;
+	for (unsigned int i = 0; i < 16; i++)
+		TEST_ASSERT_EQUAL(bytes[i], pattern,
+			"content corrupted at byte %u", i);
+
+	rte_fastmem_free(ptr2);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_null_ptr(void)
+{
+	/* NULL ptr should behave like alloc. */
+	void *ptr = rte_fastmem_realloc(NULL, 64, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "realloc(NULL) failed");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_zero_size(void)
+{
+	void *ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	/* size 0 should free and return NULL. */
+	void *ptr2 = rte_fastmem_realloc(ptr, 0, 0);
+	TEST_ASSERT_NULL(ptr2, "realloc(size=0) should return NULL");
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_too_big(void)
+{
+	void *ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	void *ptr2 = rte_fastmem_realloc(ptr, rte_fastmem_max_size() + 1, 0);
+	TEST_ASSERT_NULL(ptr2, "realloc should fail for oversized request");
+	TEST_ASSERT_EQUAL(rte_errno, E2BIG, "expected E2BIG");
+
+	/* Original pointer should still be valid. */
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+test_realloc_invalid_align(void)
+{
+	void *ptr = rte_fastmem_alloc(64, 0, 0);
+	TEST_ASSERT_NOT_NULL(ptr, "alloc failed");
+
+	void *ptr2 = rte_fastmem_realloc(ptr, 64, 3);
+	TEST_ASSERT_NULL(ptr2, "realloc should fail for non-power-of-2 align");
+	TEST_ASSERT_EQUAL(rte_errno, EINVAL, "expected EINVAL");
+
+	rte_fastmem_free(ptr);
+	return TEST_SUCCESS;
+}
+
+static int
+fastmem_setup(void)
+{
+	return rte_fastmem_init();
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_deinit();
+}
+
+static struct unit_test_suite fastmem_testsuite = {
+	.suite_name = "fastmem tests",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_init_deinit),
+		TEST_CASE(test_init_is_not_idempotent),
+		TEST_CASE(test_deinit_without_init),
+		TEST_CASE(test_max_size),
+		TEST_CASE(test_reserve_without_init),
+		TEST_CASE(test_cache_flush_without_init),
+		TEST_CASE(test_classes),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_multiple_memzones),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_cumulative),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_invalid_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_reserve_any_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_invalid_align),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_small),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_free_various_sizes),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_alignment),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_reuse),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_many_in_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_block_repurposing_no_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_null),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_content_integrity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_align_one),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_socket_numa_placement),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_cross_socket_deinit),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_exceeds_capacity),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_non_eal_thread),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_cache_flush_returns_memory),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_zero_flag),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_exceeds_cache),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_alloc_bulk_socket),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_free_bulk),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_lcore_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_stats_reset),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_mixed_lifetimes_multi_lcore),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_basic),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_exhaustion),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_zero_blocks_growth),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_below_current),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_socket_id_any),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_unlimited),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_alloc_integrity_under_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_bulk_alloc_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_recovery_after_free),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_memory_limit_multi_lcore_oom),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_same_class),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_grow),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_shrink),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_null_ptr),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_zero_size),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_too_big),
+		TEST_CASE_ST(fastmem_setup, fastmem_teardown,
+			test_realloc_invalid_align),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_fastmem(void)
+{
+	return unit_test_suite_runner(&fastmem_testsuite);
+}
+
+REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_SKIP, ASAN_OK, test_fastmem);
diff --git a/app/test/test_fastmem_perf.c b/app/test/test_fastmem_perf.c
new file mode 100644
index 0000000000..73c0a4c6ce
--- /dev/null
+++ b/app/test/test_fastmem_perf.c
@@ -0,0 +1,1040 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_malloc.h>
+#include <rte_mempool.h>
+#include <rte_stdatomic.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+#define TEST_LOG(...) printf(__VA_ARGS__)
+
+static const size_t SIZES[] = { 8, 64, 256, 1024, 4096 };
+#define N_SIZES RTE_DIM(SIZES)
+
+/* Number of ops for warmup and measurement. */
+#define WARMUP_OPS 20000u
+#define MEASURE_OPS 2000000u
+
+/* Buffer for scenarios that allocate N then free N. */
+#define BATCH_N 256
+
+/*
+ * Allocator vtable: a thin adapter exposing alloc / free /
+ * per-allocator setup/teardown. Each scenario calls these
+ * indirectly so the same timing loop serves all allocators.
+ */
+struct allocator {
+	const char *name;
+	int (*setup)(size_t size, unsigned int n_max);
+	void (*teardown)(void);
+	void *(*alloc)(void);
+	void (*free_obj)(void *ptr);
+	int (*alloc_bulk)(void **ptrs, unsigned int n);
+	void (*free_bulk)(void **ptrs, unsigned int n);
+};
+
+/* Fastmem adapter -------------------------------------------------- */
+
+static size_t fastmem_size;
+
+static int
+fastmem_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	fastmem_size = size;
+	return 0;
+}
+
+static void
+fastmem_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_alloc(void)
+{
+	return rte_fastmem_alloc(fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free(void *ptr)
+{
+	rte_fastmem_free(ptr);
+}
+
+/* Mempool adapter -------------------------------------------------- */
+
+static struct rte_mempool *mempool_pool;
+
+static int
+mempool_setup(size_t size, unsigned int n_max)
+{
+	char name[RTE_MEMPOOL_NAMESIZE];
+	unsigned int cache_size;
+
+	/*
+	 * Pool size must accommodate the full batch burst plus
+	 * per-lcore cache capacity. Use mempool's maximum cache
+	 * size so we're measuring its cached hot path.
+	 */
+	cache_size = RTE_MEMPOOL_CACHE_MAX_SIZE;
+
+	snprintf(name, sizeof(name), "fmperf_mp_%zu", size);
+	mempool_pool = rte_mempool_create(name, n_max + cache_size * 2,
+			size, cache_size, 0, NULL, NULL, NULL, NULL,
+			SOCKET_ID_ANY, 0);
+	if (mempool_pool == NULL) {
+		TEST_LOG("mempool_create(%zu) failed\n", size);
+		return -1;
+	}
+
+	return 0;
+}
+
+static void
+mempool_teardown(void)
+{
+	rte_mempool_free(mempool_pool);
+	mempool_pool = NULL;
+}
+
+static void * __rte_noinline
+mempool_alloc_one(void)
+{
+	void *obj = NULL;
+
+	if (rte_mempool_get(mempool_pool, &obj) < 0)
+		return NULL;
+	return obj;
+}
+
+static void __rte_noinline
+mempool_free_one(void *ptr)
+{
+	rte_mempool_put(mempool_pool, ptr);
+}
+
+/* rte_malloc adapter ----------------------------------------------- */
+
+static size_t malloc_size;
+
+static int
+malloc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	malloc_size = size;
+	return 0;
+}
+
+static void
+malloc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+malloc_alloc(void)
+{
+	return rte_malloc(NULL, malloc_size, 0);
+}
+
+static void __rte_noinline
+malloc_free(void *ptr)
+{
+	rte_free(ptr);
+}
+
+/* libc (glibc) malloc adapter -------------------------------------- */
+
+static size_t libc_size;
+
+static int
+libc_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	/*
+	 * Round up to cache-line alignment to match the other
+	 * allocators' default alignment guarantees and keep the
+	 * comparison honest. aligned_alloc() requires size to be
+	 * a multiple of the alignment.
+	 */
+	libc_size = RTE_ALIGN_CEIL(size, RTE_CACHE_LINE_SIZE);
+	return 0;
+}
+
+static void
+libc_teardown(void)
+{
+}
+
+static void * __rte_noinline
+libc_alloc(void)
+{
+	return aligned_alloc(RTE_CACHE_LINE_SIZE, libc_size);
+}
+
+static void __rte_noinline
+libc_free(void *ptr)
+{
+	free(ptr);
+}
+
+/* Bulk adapters ---------------------------------------------------- */
+
+static int __rte_noinline
+fastmem_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_alloc_bulk(ptrs, n, fastmem_size, 0, 0);
+}
+
+static void __rte_noinline
+fastmem_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_free_bulk(ptrs, n);
+}
+
+/* Fastmem handle adapter ------------------------------------------- */
+
+static rte_fastmem_handle_t fastmem_handle;
+
+static int
+fastmem_h_setup(size_t size, unsigned int n_max __rte_unused)
+{
+	return rte_fastmem_hlookup(size, 0, rte_socket_id(), &fastmem_handle);
+}
+
+static void
+fastmem_h_teardown(void)
+{
+	rte_fastmem_cache_flush();
+}
+
+static void * __rte_noinline
+fastmem_h_alloc(void)
+{
+	return rte_fastmem_halloc(fastmem_handle, 0);
+}
+
+static void __rte_noinline
+fastmem_h_free(void *ptr)
+{
+	rte_fastmem_hfree(fastmem_handle, ptr);
+}
+
+static int __rte_noinline
+fastmem_h_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_fastmem_halloc_bulk(fastmem_handle, ptrs, n, 0);
+}
+
+static void __rte_noinline
+fastmem_h_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_fastmem_hfree_bulk(fastmem_handle, ptrs, n);
+}
+
+/* Mempool, rte_malloc and libc bulk adapters ----------------------- */
+
+static int __rte_noinline
+mempool_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return rte_mempool_get_bulk(mempool_pool, ptrs, n);
+}
+
+static void __rte_noinline
+mempool_free_bulk(void **ptrs, unsigned int n)
+{
+	rte_mempool_put_bulk(mempool_pool, ptrs, n);
+}
+
+static int __rte_noinline
+generic_alloc_bulk(void **ptrs, unsigned int n, void *(*alloc_fn)(void))
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		ptrs[i] = alloc_fn();
+		if (ptrs[i] == NULL)
+			return -1;
+	}
+	return 0;
+}
+
+static int __rte_noinline
+malloc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, malloc_alloc);
+}
+
+static void __rte_noinline
+malloc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		malloc_free(ptrs[i]);
+}
+
+static int __rte_noinline
+libc_alloc_bulk(void **ptrs, unsigned int n)
+{
+	return generic_alloc_bulk(ptrs, n, libc_alloc);
+}
+
+static void __rte_noinline
+libc_free_bulk(void **ptrs, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++)
+		libc_free(ptrs[i]);
+}
+
+/* Adapter table ---------------------------------------------------- */
+
+static const struct allocator allocators[] = {
+	{ "fastmem",    fastmem_setup,   fastmem_teardown,   fastmem_alloc,     fastmem_free,     fastmem_alloc_bulk,   fastmem_free_bulk },
+	{ "fastmem_h",  fastmem_h_setup, fastmem_h_teardown, fastmem_h_alloc,   fastmem_h_free,   fastmem_h_alloc_bulk, fastmem_h_free_bulk },
+	{ "mempool",    mempool_setup,   mempool_teardown,   mempool_alloc_one, mempool_free_one, mempool_alloc_bulk,   mempool_free_bulk },
+	{ "rte_malloc", malloc_setup,    malloc_teardown,    malloc_alloc,      malloc_free,      malloc_alloc_bulk,    malloc_free_bulk },
+	{ "libc",       libc_setup,      libc_teardown,      libc_alloc,        libc_free,        libc_alloc_bulk,      libc_free_bulk },
+};
+#define N_ALLOCATORS RTE_DIM(allocators)
+
+/*
+ * Scenario 1: tight alloc+free loop. A single object is cycled
+ * repeatedly. The LIFO path keeps the same pointer hot, giving
+ * a best-case measurement.
+ */
+static double
+run_tight(const struct allocator *alloc, size_t size)
+{
+	void *p;
+	uint64_t tsc;
+	unsigned int i;
+
+	if (alloc->setup(size, 1) < 0)
+		return -1.0;
+
+	/* Warmup. */
+	for (i = 0; i < WARMUP_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+
+	tsc = rte_rdtsc_precise();
+	for (i = 0; i < MEASURE_OPS; i++) {
+		p = alloc->alloc();
+		if (p == NULL)
+			goto err;
+		alloc->free_obj(p);
+	}
+	tsc = rte_rdtsc_precise() - tsc;
+
+	alloc->teardown();
+
+	return (double)tsc / MEASURE_OPS;
+err:
+	alloc->teardown();
+	return -1.0;
+}
+
+/*
+ * Scenario 2: allocate N, free N (FIFO free order). Exercises
+ * cache refill and drain paths when N exceeds cache capacity.
+ */
+static void
+run_batch(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	/* Pick iteration count so total ops ~= MEASURE_OPS. */
+	iters = MEASURE_OPS / BATCH_N;
+
+	/* Warmup. */
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++)
+			alloc->free_obj(ptrs[i]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 3: allocate N, free N in reverse order.
+ */
+static void
+run_batch_reverse(const struct allocator *alloc, size_t size,
+		double *cycles_alloc, double *cycles_free)
+{
+	void *ptrs[BATCH_N];
+	uint64_t tsc_alloc = 0, tsc_free = 0;
+	unsigned int iter, i;
+	unsigned int iters;
+
+	*cycles_alloc = -1.0;
+	*cycles_free = -1.0;
+
+	if (alloc->setup(size, BATCH_N) < 0)
+		return;
+
+	iters = MEASURE_OPS / BATCH_N;
+
+	for (iter = 0; iter < WARMUP_OPS / BATCH_N; iter++) {
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+	}
+
+	for (iter = 0; iter < iters; iter++) {
+		uint64_t t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = 0; i < BATCH_N; i++) {
+			ptrs[i] = alloc->alloc();
+			if (ptrs[i] == NULL)
+				goto err;
+		}
+		tsc_alloc += rte_rdtsc_precise() - t0;
+
+		t0 = rte_rdtsc_precise();
+		for (i = BATCH_N; i > 0; i--)
+			alloc->free_obj(ptrs[i - 1]);
+		tsc_free += rte_rdtsc_precise() - t0;
+	}
+
+	alloc->teardown();
+
+	*cycles_alloc = (double)tsc_alloc / (iters * BATCH_N);
+	*cycles_free = (double)tsc_free / (iters * BATCH_N);
+	return;
+err:
+	alloc->teardown();
+}
+
+/*
+ * Scenario 4: multi-lcore alloc/work/free with a dummy-work
+ * baseline. Each worker runs a tight alloc → touch → free loop
+ * on its own lcore. A second run with the same dummy work but
+ * no allocator traffic establishes a baseline; the per-op
+ * allocator cost is reported as (alloc_run - baseline_run).
+ *
+ * Fixed size class and a fixed amount of dummy work per op —
+ * this scenario sweeps lcore count rather than size.
+ */
+#define MULTI_SIZE 256u
+#define MULTI_WORK_BYTES 64u
+#define MULTI_WORK_PASSES 8u   /* RMW passes over the work region. */
+#define MULTI_OPS 200000u
+#define MULTI_WARMUP 2000u
+#define MAX_MULTI_LCORES 32u
+
+/*
+ * Per-worker volatile sink. Each worker writes to its own
+ * slot, preventing dead-code elimination of touch_buffer() and
+ * avoiding cross-lcore cache-line sharing on the hot path.
+ * Padded to cache-line stride to prevent false sharing between
+ * neighboring workers' slots.
+ */
+struct __rte_cache_aligned worker_sink {
+	volatile uint64_t value;
+};
+
+static struct worker_sink worker_sinks[RTE_MAX_LCORE];
+
+/*
+ * Out-of-line dummy workload: run MULTI_WORK_PASSES
+ * read-modify-write passes over the first 'bytes' of the
+ * buffer. Each pass reads what the previous pass wrote, so the
+ * compiler cannot unroll or parallelize across passes — the
+ * work scales linearly with MULTI_WORK_PASSES. Returns an
+ * accumulator so the caller can feed it into a volatile sink;
+ * without that, the compiler could elide the whole function.
+ *
+ * __rte_noinline so it looks identical to the compiler in both
+ * the baseline (pre-allocated scratch buffer) and alloc-path
+ * runs, making the cycle-delta subtraction valid.
+ *
+ * The workload is tunably expensive so that the per-iteration
+ * work stays large relative to the allocator's critical
+ * section. Even serialized allocators like rte_malloc then
+ * spend most of their time outside the lock, and the measured
+ * per-op allocator cost reflects the allocator's own work
+ * rather than time spent queueing on contention.
+ */
+static uint64_t __rte_noinline
+touch_buffer(void *buf, size_t bytes)
+{
+	uint64_t *p = buf;
+	size_t n = bytes / sizeof(uint64_t);
+	uint64_t acc = 0;
+	unsigned int pass;
+	size_t i;
+
+	/* Prime the buffer with a known pattern. */
+	for (i = 0; i < n; i++)
+		p[i] = i * 0x9E3779B97F4A7C15ULL;
+
+	/*
+	 * Dependent RMW passes: each pass reads p[i] written by
+	 * the previous pass, mixes the pass index in, and writes
+	 * back. The XOR into acc keeps the chain live.
+	 */
+	for (pass = 0; pass < MULTI_WORK_PASSES; pass++) {
+		for (i = 0; i < n; i++) {
+			uint64_t v = p[i];
+
+			v = v * 0xC2B2AE3D27D4EB4FULL + pass;
+			v ^= v >> 33;
+			p[i] = v;
+			acc ^= v;
+		}
+	}
+
+	return acc;
+}
+
+struct worker_args {
+	const struct allocator *alloc;
+	void *scratch;            /* baseline only; NULL => alloc path */
+	unsigned int iters;
+	unsigned int warmup;
+	unsigned int bulk_n;      /* 0 = single-object, >0 = bulk */
+	RTE_ATOMIC(bool) start_flag; /* barrier at worker entry */
+	uint64_t cycles;          /* out */
+	unsigned int ops;         /* out */
+	int err;                  /* out */
+};
+
+static int
+worker_run(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	/* Wait for start flag (spin-barrier set by main). */
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				return -1;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+
+	/* Measured loop. */
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		void *p;
+
+		if (wa->scratch != NULL)
+			p = wa->scratch;
+		else {
+			p = wa->alloc->alloc();
+			if (p == NULL) {
+				wa->err = -1;
+				break;
+			}
+		}
+		acc ^= touch_buffer(p, MULTI_WORK_BYTES);
+		if (wa->scratch == NULL)
+			wa->alloc->free_obj(p);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i;
+
+	/* Publish accumulator to defeat dead-code elimination. */
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+static int
+worker_run_bulk(void *arg)
+{
+	struct worker_args *wa = arg;
+	unsigned int lcore = rte_lcore_id();
+	void *ptrs[BATCH_N];
+	uint64_t acc = 0;
+	uint64_t t0;
+	unsigned int i, j;
+	unsigned int bulk_n = wa->bulk_n;
+
+	wa->err = 0;
+	wa->ops = 0;
+	wa->cycles = 0;
+
+	while (!rte_atomic_load_explicit(&wa->start_flag,
+			rte_memory_order_acquire))
+		rte_pause();
+
+	/* Warmup. */
+	for (i = 0; i < wa->warmup; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			return -1;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+
+	t0 = rte_rdtsc_precise();
+	for (i = 0; i < wa->iters; i++) {
+		if (wa->alloc->alloc_bulk(ptrs, bulk_n) < 0) {
+			wa->err = -1;
+			break;
+		}
+		for (j = 0; j < bulk_n; j++)
+			acc ^= touch_buffer(ptrs[j], MULTI_WORK_BYTES);
+		wa->alloc->free_bulk(ptrs, bulk_n);
+	}
+	wa->cycles = rte_rdtsc_precise() - t0;
+	wa->ops = i * bulk_n;
+
+	worker_sinks[lcore].value ^= acc;
+
+	return 0;
+}
+
+/*
+ * Launch workers on the first 'n_workers' worker lcores, run
+ * either the baseline (scratch != NULL) or the alloc path
+ * (scratch == NULL), and return the mean per-op cycle cost
+ * averaged across participating workers.
+ *
+ * On any worker error, returns -1.0.
+ */
+static double
+run_multi_workers(const struct allocator *alloc, unsigned int n_workers,
+		void *const *scratches, unsigned int bulk_n)
+{
+	struct worker_args wargs[RTE_MAX_LCORE];
+	unsigned int worker_lcores[MAX_MULTI_LCORES];
+	unsigned int n = 0;
+	unsigned int lcore_id;
+	unsigned int i;
+	lcore_function_t *fn = bulk_n > 0 ? worker_run_bulk : worker_run;
+
+	/* Collect the first n_workers worker lcores. */
+	RTE_LCORE_FOREACH_WORKER(lcore_id) {
+		if (n >= n_workers)
+			break;
+		worker_lcores[n++] = lcore_id;
+	}
+	if (n < n_workers)
+		return -1.0;
+
+	/* Prepare per-worker args. */
+	for (i = 0; i < n_workers; i++) {
+		struct worker_args *wa = &wargs[worker_lcores[i]];
+
+		wa->alloc = alloc;
+		wa->scratch = scratches != NULL ? scratches[i] : NULL;
+		wa->iters = MULTI_OPS;
+		wa->warmup = MULTI_WARMUP;
+		wa->bulk_n = bulk_n;
+		rte_atomic_store_explicit(&wa->start_flag, false,
+				rte_memory_order_relaxed);
+	}
+
+	/* Launch workers. They spin on start_flag until released. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_remote_launch(fn, &wargs[worker_lcores[i]],
+				worker_lcores[i]);
+
+	/* Release all workers roughly simultaneously. */
+	for (i = 0; i < n_workers; i++)
+		rte_atomic_store_explicit(
+			&wargs[worker_lcores[i]].start_flag, true,
+			rte_memory_order_release);
+
+	/* Wait for completion. */
+	for (i = 0; i < n_workers; i++)
+		rte_eal_wait_lcore(worker_lcores[i]);
+
+	/* Aggregate: mean cycles per op across workers. */
+	{
+		double sum_cycles_per_op = 0.0;
+		unsigned int n_ok = 0;
+
+		for (i = 0; i < n_workers; i++) {
+			struct worker_args *wa = &wargs[worker_lcores[i]];
+
+			if (wa->err != 0 || wa->ops == 0)
+				return -1.0;
+			sum_cycles_per_op +=
+				(double)wa->cycles / (double)wa->ops;
+			n_ok++;
+		}
+		return sum_cycles_per_op / n_ok;
+	}
+}
+
+/*
+ * One sub-run of Scenario 4: given an allocator and a worker
+ * count, return (baseline, alloc_path) mean cycles per op.
+ */
+static void
+run_multi_lcore(const struct allocator *alloc, unsigned int n_workers,
+		unsigned int bulk_n, double *baseline, double *alloc_path)
+{
+	void *scratches[MAX_MULTI_LCORES] = {0};
+	unsigned int n_alloced = 0;
+	unsigned int i;
+
+	*baseline = -1.0;
+	*alloc_path = -1.0;
+
+	if (alloc->setup(MULTI_SIZE, n_workers * 64) < 0)
+		return;
+
+	/* Baseline: pre-allocate one scratch per worker. */
+	for (i = 0; i < n_workers; i++) {
+		scratches[i] = alloc->alloc();
+		if (scratches[i] == NULL)
+			goto err;
+		n_alloced++;
+	}
+
+	*baseline = run_multi_workers(alloc, n_workers, scratches, 0);
+
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	n_alloced = 0;
+
+	/* Alloc path: workers alloc+free each iter. */
+	*alloc_path = run_multi_workers(alloc, n_workers, NULL, bulk_n);
+
+	alloc->teardown();
+	return;
+err:
+	for (i = 0; i < n_alloced; i++)
+		alloc->free_obj(scratches[i]);
+	alloc->teardown();
+}
+
+/* Reporting -------------------------------------------------------- */
+
+static void
+print_header(const char *title)
+{
+	size_t i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < N_SIZES; i++)
+		TEST_LOG(" %10zu B", SIZES[i]);
+	TEST_LOG("\n");
+}
+
+static void
+print_row(const char *name, const double *values)
+{
+	size_t i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < N_SIZES; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %12s", "--");
+		else
+			TEST_LOG(" %12.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_header(const char *title, const unsigned int *lcore_counts,
+		unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("\n=== %s ===\n", title);
+	TEST_LOG("%-12s", "allocator");
+	for (i = 0; i < n_counts; i++)
+		TEST_LOG(" %8u lcore%c", lcore_counts[i],
+				lcore_counts[i] == 1 ? ' ' : 's');
+	TEST_LOG("\n");
+}
+
+static void
+print_multi_row(const char *name, const double *values, unsigned int n_counts)
+{
+	unsigned int i;
+
+	TEST_LOG("%-12s", name);
+	for (i = 0; i < n_counts; i++) {
+		if (values[i] < 0)
+			TEST_LOG(" %14s", "--");
+		else
+			TEST_LOG(" %14.1f", values[i]);
+	}
+	TEST_LOG("\n");
+}
+
+/* Driver ----------------------------------------------------------- */
+
+static int
+test_fastmem_perf(void)
+{
+	size_t i;
+	size_t a;
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		TEST_LOG("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	TEST_LOG("\nfastmem performance — single-lcore, fixed-size\n");
+	TEST_LOG("All numbers are TSC cycles.\n");
+
+	/* Scenario 1: tight alloc+free. */
+	print_header("Scenario 1: Single-object hot path — cycles per (alloc + free)");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			vals[i] = run_tight(&allocators[a], SIZES[i]);
+		print_row(allocators[a].name, vals);
+	}
+
+	/* Scenario 2: batched, FIFO free. */
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 2: Batch alloc, FIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 3: batched, reverse free. */
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per alloc");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_alloc);
+	}
+	print_header("Scenario 3: Batch alloc, LIFO free — cycles per free");
+	for (a = 0; a < N_ALLOCATORS; a++) {
+		double vals_alloc[N_SIZES], vals_free[N_SIZES];
+
+		for (i = 0; i < N_SIZES; i++)
+			run_batch_reverse(&allocators[a], SIZES[i],
+				&vals_alloc[i], &vals_free[i]);
+		print_row(allocators[a].name, vals_free);
+	}
+
+	/* Scenario 4: multi-lcore alloc/work/free with baseline. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		/* Sweep lcore counts: 1, 2, 4, 8, ... up to max_workers. */
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		/* Ensure max_workers is the final column if not power of two. */
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 4 (Multi-lcore contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 4 parameters: size=%u B\n",
+				MULTI_SIZE);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a], lcore_counts[c],
+							0, &base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 4: Multi-lcore contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	/* Scenario 5: multi-lcore bulk alloc/work/free. */
+	{
+		unsigned int max_workers = rte_lcore_count() - 1;
+		unsigned int lcore_counts[8];
+		unsigned int n_counts = 0;
+		unsigned int w;
+		double base_vals[N_ALLOCATORS][8];
+		double alloc_vals[N_ALLOCATORS][8];
+		double delta_vals[N_ALLOCATORS][8];
+		unsigned int bulk_n = 8;
+
+		if (max_workers > MAX_MULTI_LCORES)
+			max_workers = MAX_MULTI_LCORES;
+
+		for (w = 1; w <= max_workers && n_counts < RTE_DIM(lcore_counts); w *= 2)
+			lcore_counts[n_counts++] = w;
+		if (n_counts > 0 && lcore_counts[n_counts - 1] != max_workers &&
+				n_counts < RTE_DIM(lcore_counts) && max_workers >= 1)
+			lcore_counts[n_counts++] = max_workers;
+
+		if (n_counts == 0) {
+			TEST_LOG("\nScenario 5 (Multi-lcore bulk contention) skipped: no worker lcores available.\n");
+		} else {
+			TEST_LOG("\nScenario 5 parameters: size=%u B, "
+				"bulk=%u\n",
+				MULTI_SIZE, bulk_n);
+
+			for (a = 0; a < N_ALLOCATORS; a++) {
+				unsigned int c;
+
+				for (c = 0; c < n_counts; c++)
+					run_multi_lcore(&allocators[a],
+							lcore_counts[c], bulk_n,
+							&base_vals[a][c],
+							&alloc_vals[a][c]);
+				for (c = 0; c < n_counts; c++) {
+					if (base_vals[a][c] < 0 || alloc_vals[a][c] < 0)
+						delta_vals[a][c] = -1.0;
+					else
+						delta_vals[a][c] = alloc_vals[a][c] -
+							base_vals[a][c];
+				}
+			}
+
+			TEST_LOG("Baseline (domain logic only): %.1f cycles/op\n",
+					base_vals[0][0]);
+
+			print_multi_header("Scenario 5: Multi-lcore bulk contention — allocator overhead (cycles/op)",
+					lcore_counts, n_counts);
+			for (a = 0; a < N_ALLOCATORS; a++)
+				print_multi_row(allocators[a].name,
+						delta_vals[a], n_counts);
+		}
+	}
+
+	TEST_LOG("\n");
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_perf_autotest, test_fastmem_perf);
diff --git a/app/test/test_fastmem_profile.c b/app/test/test_fastmem_profile.c
new file mode 100644
index 0000000000..9a5dc94018
--- /dev/null
+++ b/app/test/test_fastmem_profile.c
@@ -0,0 +1,157 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Ericsson AB
+ */
+
+/*
+ * A minimal fastmem workload intended for use with perf record /
+ * perf report. Runs a tight alloc/free loop for a fixed duration
+ * so that sampling profilers can attribute cycles to individual
+ * functions and instructions within the fastmem hot path.
+ *
+ * Usage:
+ *   perf record -g -- dpdk-test --no-huge --no-pci -m 8192 \
+ *       -l 0 <<< fastmem_profile_autotest
+ *   perf report
+ */
+
+#include <inttypes.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_lcore.h>
+#include <rte_memory.h>
+
+#include <rte_fastmem.h>
+
+#include "test.h"
+
+/* Duration of each sub-test: 3 seconds, expressed in TSC cycles. */
+#define PROFILE_DURATION_CYCLES (3ULL * rte_get_tsc_hz())
+
+/* Allocation size for the profiling workload. */
+#define PROFILE_SIZE 256u
+
+/*
+ * Sub-test 1: tight alloc+free, exercises only the per-lcore
+ * cache (no bin interaction after warmup).
+ */
+static int
+profile_cache_hit(void)
+{
+	uint64_t deadline;
+	uint64_t ops = 0;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		void *p = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+
+		if (p == NULL)
+			return -1;
+		rte_fastmem_free(p);
+		ops++;
+	}
+
+	printf("  cache_hit: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+/*
+ * Sub-test 2: alloc N then free N, where N exceeds the cache
+ * capacity. This forces repeated cache refills and drains,
+ * exercising the bin lock and slab free-list traversal.
+ */
+#define PROFILE_BATCH 256u
+
+static int
+profile_cache_miss(void)
+{
+	void *ptrs[PROFILE_BATCH];
+	uint64_t deadline;
+	uint64_t ops = 0;
+	unsigned int i;
+
+	deadline = rte_rdtsc() + PROFILE_DURATION_CYCLES;
+
+	while (rte_rdtsc() < deadline) {
+		for (i = 0; i < PROFILE_BATCH; i++) {
+			ptrs[i] = rte_fastmem_alloc(PROFILE_SIZE, 0, 0);
+			if (ptrs[i] == NULL)
+				return -1;
+		}
+		for (i = 0; i < PROFILE_BATCH; i++)
+			rte_fastmem_free(ptrs[i]);
+		ops += PROFILE_BATCH;
+	}
+
+	printf("  cache_miss: %" PRIu64 " ops\n", ops);
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_hit(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-hit workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_hit() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+static int
+test_fastmem_profile_cache_miss(void)
+{
+	int rc;
+
+	rc = rte_fastmem_init();
+	if (rc < 0) {
+		printf("rte_fastmem_init() failed: %d\n", rc);
+		return -1;
+	}
+
+	rc = rte_fastmem_reserve(128 * 1024 * 1024, SOCKET_ID_ANY);
+	if (rc < 0) {
+		printf("rte_fastmem_reserve() failed: %d\n", rc);
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	printf("fastmem profile: cache-miss workload (size=%u, ~%u s)\n",
+		PROFILE_SIZE, 3);
+
+	if (profile_cache_miss() < 0) {
+		rte_fastmem_deinit();
+		return -1;
+	}
+
+	rte_fastmem_deinit();
+	return 0;
+}
+
+REGISTER_PERF_TEST(fastmem_profile_cache_hit_autotest,
+		test_fastmem_profile_cache_hit);
+REGISTER_PERF_TEST(fastmem_profile_cache_miss_autotest,
+		test_fastmem_profile_cache_miss);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* RE: [RFC v3 0/3] lib/fastmem: fast small-object allocator
  2026-05-27 17:30       ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
                           ` (2 preceding siblings ...)
  2026-05-27 17:30         ` [RFC v3 3/3] app/test: add fastmem test suite Mattias Rönnblom
@ 2026-05-28  9:02         ` Morten Brørup
  3 siblings, 0 replies; 38+ messages in thread
From: Morten Brørup @ 2026-05-28  9:02 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: Konstantin Ananyev, Mattias Rönnblom, Yogaraj Baskaravel,
	Stephen Hemminger, Bruce Richardson

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Wednesday, 27 May 2026 19.31
> 
> 
> This RFC introduces fastmem, a general-purpose small-object allocator
> for DPDK. It is intended to replace per-type mempools with a single
> allocator that handles arbitrary sizes, grows on demand, and matches
> mempool-level performance on the hot path.
> 
> Motivation
> ----------
> 
> DPDK applications commonly maintain many mempools — one per object
> type (connections, sessions, timers, work items). Each must be sized
> up front, wastes memory when over-provisioned, and cannot serve
> objects of a different size. Fastmem eliminates this by accepting
> arbitrary sizes at runtime, backed by a slab allocator that
> repurposes memory across size classes as demand shifts.
> 
> Design
> ------
> 
> Three-layer architecture:
> 
> 1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
>    reserved lazily (or pre-reserved for deterministic latency).
> 
> 2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
>    The alignment enables O(1) slab lookup from any object pointer
>    via bitmask — no radix tree or index structure. Slabs move
>    freely between 18 power-of-2 size classes (8 B to 1 MiB).
> 
> 3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
>    path). Cache misses trigger bulk transfers to/from the shared
>    bin under a spinlock.
> 
> Key properties:
> 
> - Zero per-object metadata in the production build.
> - NUMA-aware, with per-socket bins and free-slab pools.
> - DMA-usable memory with O(1) virt-to-IOVA translation.
> - Bulk alloc/free with all-or-nothing semantics.
> - Backing memory never returned during lifetime (slabs recycled).
> - Non-EAL threads supported (bypass cache, take bin lock).
> - Secondary process support (lazy attach, no per-lcore caches).
> 
> API surface
> -----------
> 
>   rte_fastmem_init / deinit
>   rte_fastmem_reserve
>   rte_fastmem_set_limit / get_limit
>   rte_fastmem_alloc / alloc_socket
>   rte_fastmem_realloc
>   rte_fastmem_alloc_bulk / alloc_bulk_socket
>   rte_fastmem_free / free_bulk
>   rte_fastmem_hlookup / halloc / halloc_bulk / hfree / hfree_bulk
>   rte_fastmem_virt2iova
>   rte_fastmem_cache_flush
>   rte_fastmem_max_size / classes
>   rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
>   rte_fastmem_stats_reset
> 
> All APIs are marked __rte_experimental.
> 
> Performance
> -----------
> 
> The single-object hot path is roughly 2–3× the cost of mempool
> and an order of magnitude faster than rte_malloc. Under
> multi-lcore contention, fastmem scales similarly to mempool,
> while rte_malloc collapses.
> 
> Limitations
> -----------
> 
> - Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
> - Power-of-2 classes only; worst-case internal fragmentation ~50%.
> - Backing memory not reclaimable short of deinit.
> 
> Future work
> -----------
> 
> - Lcore-affine allocations (false-sharing-free by construction).
> - Mempool ops driver for transparent drop-in use.

Regarding mempool support.
As you already mentioned, some mempools hold fully or partially initialized objects.
Releasing such an object to the heap would require an ability to reconstruct it on allocation from the heap.

In some cases, object reconstruction might be possible through callbacks or some other means.
And in some cases, object reconstruction might be practically impossible.
In all cases, object reconstruction has a performance cost, which needs to be weighed against the memory saved by freeing the objects back to the heap. This trade-off is specific to each mempool, the kind of objects it holds, and how the mempool is being used.

If we look specifically at the mbuf mempool, an mbuf consists of metadata (struct rte_mbuf and possibly struct rte_mbuf_ext_shared_info) and the packet buffer itself.
The mbuf structure supports using external buffers for the packet buffer, which does not need reconstruction if dynamically allocated from the heap.
It seems viable to keep the metadata parts of the mbufs in a mempool, and dynamically allocate/free their packet buffers on mbuf allocation/free. A shim mempool ops driver could relatively easily implement this. It might require a few additions to the mbuf and/or mempool libraries too, but that would be acceptable.

<feature creep>
Another thing regarding mbuf packet data:
Some NICs require packet buffers of 2048 bytes, but we also allocate a default headroom of 128 bytes in front of it, so the default packet buffer size (RTE_MBUF_DEFAULT_BUF_SIZE) is not 2^N, but 2048+128=2176 [1].

[1]: https://elixir.bootlin.com/dpdk/v26.03/source/lib/mbuf/rte_mbuf_core.h#L408

Allocating fastmem buffers of 4 KiB and only using 2.1 KiB seems wasteful.
Could the fastmem library support a shortlist of magic object sizes that are not 2^N?
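
To put a number on the waste, here is a minimal sketch, assuming plain power-of-2 size classes as described in the cover letter. The helper names next_pow2() and pow2_waste() are made up for illustration and are not part of the fastmem API.

```c
#include <assert.h>
#include <stddef.h>

/* Round 'v' up to the next power of two (v must be > 0).
 * Illustrative helper only, not a fastmem function. */
static size_t
next_pow2(size_t v)
{
	size_t p = 1;

	assert(v > 0);
	while (p < v)
		p <<= 1;
	return p;
}

/* Bytes left unused when 'size' is served from a 2^N size class. */
static size_t
pow2_waste(size_t size)
{
	return next_pow2(size) - size;
}
```

For 2048+128=2176 bytes, next_pow2() gives 4096, so pow2_waste() reports 1920 unused bytes per buffer, roughly 47% of the class size.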

The magic sizes should be explicitly configured at run-time. (The mbuf library must inform the fastmem library of the requested data_room_size before it populates the mbuf mempool.)
The shortlist should have a fixed max length, maybe 4 as default, preferably build-time configurable.
Removing a magic size from the shortlist need not be supported. Only adding magic sizes is required.
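
A rough sketch of what such an add-only shortlist could look like, purely to make the proposal concrete. Everything here (struct magic_sizes, MAGIC_SIZES_MAX, magic_size_add(), magic_size_lookup()) is hypothetical and not taken from the fastmem patches.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed default length; the idea is to make it build-time configurable. */
#define MAGIC_SIZES_MAX 4

struct magic_sizes {
	size_t sizes[MAGIC_SIZES_MAX];
	unsigned int count;
};

/* Register a non-2^N size. Returns 0 on success (also when the size is
 * already registered), -1 if the size is zero or the shortlist is full.
 * There is deliberately no remove operation. */
static int
magic_size_add(struct magic_sizes *ms, size_t size)
{
	unsigned int i;

	assert(ms != NULL);
	if (size == 0)
		return -1;
	for (i = 0; i < ms->count; i++)
		if (ms->sizes[i] == size)
			return 0;
	if (ms->count == MAGIC_SIZES_MAX)
		return -1;
	ms->sizes[ms->count++] = size;
	return 0;
}

/* Smallest registered magic size that fits 'size', or 0 if none does. */
static size_t
magic_size_lookup(const struct magic_sizes *ms, size_t size)
{
	size_t best = 0;
	unsigned int i;

	for (i = 0; i < ms->count; i++)
		if (ms->sizes[i] >= size && (best == 0 || ms->sizes[i] < best))
			best = ms->sizes[i];
	return best;
}
```

With a list this short, a linear scan is likely cheap enough; a lookup returning 0 would mean "fall back to the regular power-of-2 class".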

The magic sizes will be relatively large (assume 512 bytes or more), so adding a fastmem object metadata structure to each magic-sized object is acceptable, if necessary.

From a fastmem library perspective, WDYT?
</feature creep>

> - Debug mode (cookies, double-free detection, poison-on-free).
> - Telemetry integration.
> - EAL integration, allowing EAL-internal subsystems to use
>   fastmem for their small-object allocations.
> 
> Changes in RFC v3:
> - Add rte_fastmem_realloc() with full test coverage.
> - Add __rte_malloc/__rte_dealloc compiler attributes; remove
>   incorrect __rte_alloc_size/__rte_alloc_align.
> - Extract normalize_align() helper; remove redundant inline
>   directives.
> - Merge lifecycle and functional test suites.
> - Add realloc subsection to programming guide.
> 
> Changes in RFC v2:
> - Fix cross-socket deinit use-after-free.
> - Add secondary process support.
> - Add handle-based allocation API.
> - Fix clang warnings; misc cleanup.
> 
> 
> Mattias Rönnblom (3):
>   doc: add fastmem programming guide
>   lib: add fastmem library
>   app/test: add fastmem test suite
> 
>  app/test/meson.build                  |    3 +
>  app/test/test_fastmem.c               | 1801 +++++++++++++++++++++++++
>  app/test/test_fastmem_perf.c          | 1040 ++++++++++++++
>  app/test/test_fastmem_profile.c       |  157 +++
>  doc/api/doxy-api-index.md             |    1 +
>  doc/api/doxy-api.conf.in              |    1 +
>  doc/guides/prog_guide/fastmem_lib.rst |  328 +++++
>  doc/guides/prog_guide/index.rst       |    1 +
>  lib/fastmem/meson.build               |    6 +
>  lib/fastmem/rte_fastmem.c             | 1748 ++++++++++++++++++++++++
>  lib/fastmem/rte_fastmem.h             |  815 +++++++++++
>  lib/meson.build                       |    1 +
>  12 files changed, 5902 insertions(+)
>  create mode 100644 app/test/test_fastmem.c
>  create mode 100644 app/test/test_fastmem_perf.c
>  create mode 100644 app/test/test_fastmem_profile.c
>  create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
>  create mode 100644 lib/fastmem/meson.build
>  create mode 100644 lib/fastmem/rte_fastmem.c
>  create mode 100644 lib/fastmem/rte_fastmem.h
> 
> --
> 2.43.0



* Re: [RFC 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 10:36 [RFC 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
                   ` (2 preceding siblings ...)
  2026-05-25 10:36 ` [RFC 3/3] app/test: add fastmem test suite Mattias Rönnblom
@ 2026-05-25 14:30 ` Stephen Hemminger
  2026-05-25 19:39   ` Mattias Rönnblom
  2026-05-25 18:36 ` Stephen Hemminger
  4 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2026-05-25 14:30 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On Mon, 25 May 2026 12:36:39 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> This RFC introduces fastmem, a general-purpose small-object allocator
> for DPDK. It is intended to replace per-type mempools with a single
> allocator that handles arbitrary sizes, grows on demand, and matches
> mempool-level performance on the hot path.

Makes sense. What about a simple inline wrapper to allow full-replacement
testing and A/B performance comparison?


* Re: [RFC 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 14:30 ` [RFC " Stephen Hemminger
@ 2026-05-25 19:39   ` Mattias Rönnblom
  2026-05-25 22:18     ` Stephen Hemminger
  0 siblings, 1 reply; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-25 19:39 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On 5/25/26 16:30, Stephen Hemminger wrote:
> On Mon, 25 May 2026 12:36:39 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> This RFC introduces fastmem, a general-purpose small-object allocator
>> for DPDK. It is intended to replace per-type mempools with a single
>> allocator that handles arbitrary sizes, grows on demand, and matches
>> mempool-level performance on the hot path.
> 
> Makes sense, what a simple wrapper inline to allow full replacement
> testing/performance A/B comparison?

Do you mean a mempool or a heap wrapper? Or both?

I haven't looked into what options there are with mempools. A mempool 
driver should be possible, but then I guess one might attempt a 
wholesale mempool-compatible API as well.

The role(s) fastmem could serve are:
a) An lcore/fast path small-object allocator when you don't know the 
object size and/or count beforehand (i.e., what the cover letter says).
b) Do what mempools do, in addition to a.
c) Do what the rte_malloc heap does, but lcore/fast path-friendly. In 
other words, option a but with larger objects too.
d) Something that's both b and c.

I haven't really formed an opinion yet, other than that option a seems 
like a natural first step.

Fastmem is significantly slower than mempools for the moment. Claude 
will tell you to inline, but that doesn't help (at least not in the 
micro benchmarks). Then it will tell you to go remove the statistics, 
which also doesn't help. (Latency is data dependency-driven, so stats 
load/store/compute runs on resources that otherwise would have been idle.)

What does help, however, is to pre-compute socket- and bin-related info
and put it into a handle, which the application may optionally use to
quickly retrieve objects of a certain size from a certain socket. Still
slower than mempool, though.
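
A toy sketch of the handle idea, not the actual rte_fastmem_h* implementation: the size-to-class mapping is resolved once when the handle is set up, so the per-call path would not need to repeat it. The 18 classes from 8 B to 1 MiB are from the cover letter; struct toy_handle, size_to_class() and toy_handle_init() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_CLASS_SHIFT 3  /* smallest class is 8 B */
#define N_CLASSES 18       /* 8 B .. 1 MiB, powers of two */

/* Map a size to its power-of-2 class index, or -1 if it is too large.
 * This is the lookup a handle lets the caller skip on each call. */
static int
size_to_class(size_t size)
{
	int c;

	for (c = 0; c < N_CLASSES; c++)
		if (size <= ((size_t)1 << (MIN_CLASS_SHIFT + c)))
			return c;
	return -1;
}

/* Hypothetical handle: class and socket resolved once, up front. */
struct toy_handle {
	int class_idx;
	int socket_id;
};

/* Returns 0 on success, -1 if 'size' exceeds the largest class. */
static int
toy_handle_init(struct toy_handle *h, size_t size, int socket_id)
{
	int c = size_to_class(size);

	assert(h != NULL);
	if (c < 0)
		return -1;
	h->class_idx = c;
	h->socket_id = socket_id;
	return 0;
}
```

An allocation through such a handle could then go straight to the cache or bin selected by class_idx and socket_id, without recomputing either per call.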

> === Scenario 1: Single-object hot path — cycles per (alloc + free) ===
> allocator             8 B         64 B        256 B       1024 B       4096 B
> fastmem              16.9         16.7         17.7         17.6         17.9
> fastmem_h             9.5          9.4          9.5          9.5          9.4
> mempool               6.9          6.9          6.9          7.0          6.6
> rte_malloc           93.7         93.8         94.8        100.1        130.0
> libc                118.8        119.2         20.4         20.4        111.0
> 
> === Scenario 2: Batch alloc, FIFO free — cycles per alloc ===
> allocator             8 B         64 B        256 B       1024 B       4096 B
> fastmem              10.1         10.2         10.8         12.7         14.7
> fastmem_h             6.8          6.7          7.4          9.3         11.4
> mempool               4.2          4.1          4.1          4.1          4.1
> rte_malloc           58.6         58.5         62.1         67.5         68.5
> libc                104.8        104.6         73.7        203.9       1254.0

Intel(R) Xeon(R) Gold 6421N / Ubuntu 24.04 / clang

Best regards,
	Mattias


* Re: [RFC 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 19:39   ` Mattias Rönnblom
@ 2026-05-25 22:18     ` Stephen Hemminger
  2026-05-26  7:01       ` Mattias Rönnblom
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2026-05-25 22:18 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On Mon, 25 May 2026 21:39:20 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> On 5/25/26 16:30, Stephen Hemminger wrote:
> > On Mon, 25 May 2026 12:36:39 +0200
> > Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> >   
> >> This RFC introduces fastmem, a general-purpose small-object allocator
> >> for DPDK. It is intended to replace per-type mempools with a single
> >> allocator that handles arbitrary sizes, grows on demand, and matches
> >> mempool-level performance on the hot path.  
> > 
> > Makes sense, what a simple wrapper inline to allow full replacement
> > testing/performance A/B comparison?  
> 
> Do you mean a mempool or a heap wrapper? Or both?
> 
> I haven't looked into what options there are with mempools. A mempool 
> driver should be possible, but then I guess one might attempt a 
> whole-sale mempool-compatible API as well.

My thinking is that yet another allocator in DPDK is just another source
of confusion and bugs. BUT if it can consolidate and fully replace
one or more existing allocators, then it would be a great improvement.

Mempools are fast, but fixed and space inefficient.
Rte_malloc is slow, but flexible.

Also, whatever is added needs to play well with static
and dynamic checkers.


* Re: [RFC 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 22:18     ` Stephen Hemminger
@ 2026-05-26  7:01       ` Mattias Rönnblom
  0 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-26  7:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On 5/26/26 00:18, Stephen Hemminger wrote:
> On Mon, 25 May 2026 21:39:20 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> On 5/25/26 16:30, Stephen Hemminger wrote:
>>> On Mon, 25 May 2026 12:36:39 +0200
>>> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
>>>    
>>>> This RFC introduces fastmem, a general-purpose small-object allocator
>>>> for DPDK. It is intended to replace per-type mempools with a single
>>>> allocator that handles arbitrary sizes, grows on demand, and matches
>>>> mempool-level performance on the hot path.
>>>
>>> Makes sense, what a simple wrapper inline to allow full replacement
>>> testing/performance A/B comparison?
>>
>> Do you mean a mempool or a heap wrapper? Or both?
>>
>> I haven't looked into what options there are with mempools. A mempool
>> driver should be possible, but then I guess one might attempt a
>> whole-sale mempool-compatible API as well.
> 
> My thinking is a yet another allocator in DPDK is just another source
> of confusion and bugs. BUT if it can consolidate and fully replace
> one or more existing allocators then it would be great improvement.
> 
> Mempools are fast, but fixed and space inefficient.
> Rte_malloc is slow, but flexible.
> 
> Also, need to make whatever is added play well with static
> and dynamic checkers.

I'm not sure it's possible to replace mempools with a slab allocator 
like fastmem. They have different semantics, and I suspect that there 
are times when you prefer a mempool.

# Object Identity & Content Preservation

A mempool always returns one of the same pre-populated objects, with its 
contents untouched since last use. This enables pre-initialized fields, 
hardware-registered buffers, and constructors that run only once.
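
A toy illustration of this property, using a made-up fixed pool rather than the rte_mempool API (struct conn, struct toy_pool and the toy_pool_*() functions are all hypothetical): the constructor runs once at pool init, and an object's fields survive a put/get cycle.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_N 4

struct conn {
	int hw_queue;   /* set once by the "constructor" at pool init */
	int use_count;  /* application state, left untouched by put/get */
};

struct toy_pool {
	struct conn objs[POOL_N];
	struct conn *free_stack[POOL_N]; /* LIFO of free objects */
	unsigned int n_free;
};

static void
toy_pool_init(struct toy_pool *p)
{
	unsigned int i;

	for (i = 0; i < POOL_N; i++) {
		p->objs[i].hw_queue = (int)i; /* constructor: runs only here */
		p->objs[i].use_count = 0;
		p->free_stack[i] = &p->objs[i];
	}
	p->n_free = POOL_N;
}

/* Returns NULL when the pool is exhausted; never allocates memory. */
static struct conn *
toy_pool_get(struct toy_pool *p)
{
	return p->n_free > 0 ? p->free_stack[--p->n_free] : NULL;
}

/* Returns the object to the pool without touching its contents. */
static void
toy_pool_put(struct toy_pool *p, struct conn *c)
{
	assert(p->n_free < POOL_N);
	p->free_stack[p->n_free++] = c;
}
```

A general-purpose allocator makes no such promise: the memory handed out by the next allocation may last have held an object of a different type.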

# Safe Use-After-Free

Returned objects remain valid, typed memory even after release. Stale 
references do not segfault or observe unrelated data, enabling RCU-style 
deferred reclamation.

# Bounded, Failure-Free Operation

A mempool operates with a fixed number of objects and performs no 
runtime memory allocation. This guarantees deterministic latency, 
natural backpressure, and eliminates `ENOMEM` failures after initialization.

# Known IOVA at Initialization Time

All object addresses, both virtual and physical, are fixed and 
enumerable from creation time. This enables pre-programming DMA 
descriptors and IOMMU registration.

# Memory Accounting

A mempool provides an exact, attributable memory footprint per pool, 
without sharing backing memory across unrelated users.

# Dense, Enumerable Object Set

Objects share a common base address and fixed stride, enabling efficient 
iteration and pointer compression.
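
As a sketch of what pointer compression means here, assuming a single contiguous backing region with a fixed stride (struct dense_pool and both helpers are hypothetical names, not mempool API): a full pointer can be swapped for a 32-bit index and recovered later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct dense_pool {
	uint8_t *base;    /* address of object 0 */
	size_t stride;    /* bytes between consecutive objects */
	uint32_t n_objs;  /* number of objects in the pool */
};

/* Compress an object pointer into a 32-bit index. */
static uint32_t
obj_to_index(const struct dense_pool *p, const void *obj)
{
	size_t off = (size_t)((const uint8_t *)obj - p->base);

	assert(off % p->stride == 0 && off / p->stride < p->n_objs);
	return (uint32_t)(off / p->stride);
}

/* Expand a 32-bit index back into an object pointer. */
static void *
index_to_obj(const struct dense_pool *p, uint32_t idx)
{
	assert(idx < p->n_objs);
	return p->base + (size_t)idx * p->stride;
}
```

Iterating over all objects is then a loop over idx from 0 to n_objs - 1. With slabs shared between size classes, there is no single base and stride covering all objects of one type.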

Considering many apps use DPDK for I/O and other hardware abstraction 
only, and carry all other OS/kernel/platform type infrastructure 
themselves, replacing the mempools with something else will likely cause 
a lot of friction.

A fastmem-backed mempool backend (with limitations)? Sure.

Replacing rte_malloc seems easier, but I haven't looked into that in 
detail yet.


* Re: [RFC 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 10:36 [RFC 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
                   ` (3 preceding siblings ...)
  2026-05-25 14:30 ` [RFC " Stephen Hemminger
@ 2026-05-25 18:36 ` Stephen Hemminger
  2026-05-25 19:43   ` Mattias Rönnblom
  4 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2026-05-25 18:36 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On Mon, 25 May 2026 12:36:39 +0200
Mattias Rönnblom <hofors@lysator.liu.se> wrote:

> This RFC introduces fastmem, a general-purpose small-object allocator
> for DPDK. It is intended to replace per-type mempools with a single
> allocator that handles arbitrary sizes, grows on demand, and matches
> mempool-level performance on the hot path.
> 
> Motivation
> ----------
> 
> DPDK applications commonly maintain many mempools — one per object
> type (connections, sessions, timers, work items). Each must be sized
> up front, wastes memory when over-provisioned, and cannot serve
> objects of a different size. Fastmem eliminates this by accepting
> arbitrary sizes at runtime, backed by a slab allocator that
> repurposes memory across size classes as demand shifts.
> 
> Design
> ------
> 
> Three-layer architecture:
> 
> 1. Backing memory: 128 MiB IOVA-contiguous memzones from EAL,
>    reserved lazily (or pre-reserved for deterministic latency).
> 
> 2. Slabs: 2 MiB, 2 MiB-aligned regions carved from memzones.
>    The alignment enables O(1) slab lookup from any object pointer
>    via bitmask — no radix tree or index structure. Slabs move
>    freely between 18 power-of-2 size classes (8 B to 1 MiB).
> 
> 3. Per-lcore caches: bounded LIFO stacks (no locks on the hot
>    path). Cache misses trigger bulk transfers to/from the shared
>    bin under a spinlock.
> 
> Key properties:
> 
> - Zero per-object metadata in the production build.
> - NUMA-aware, with per-socket bins and free-slab pools.
> - DMA-usable memory with O(1) virt-to-IOVA translation.
> - Bulk alloc/free with all-or-nothing semantics.
> - Backing memory never returned during lifetime (slabs recycled).
> - Non-EAL threads supported (bypass cache, take bin lock).
> 
> API surface
> -----------
> 
>   rte_fastmem_init / deinit
>   rte_fastmem_reserve
>   rte_fastmem_set_limit / get_limit
>   rte_fastmem_alloc / alloc_socket
>   rte_fastmem_alloc_bulk / alloc_bulk_socket
>   rte_fastmem_free / free_bulk
>   rte_fastmem_virt2iova
>   rte_fastmem_cache_flush
>   rte_fastmem_max_size / classes
>   rte_fastmem_stats / stats_class / stats_lcore / stats_lcore_class
>   rte_fastmem_stats_reset
> 
> All APIs are marked __rte_experimental.
> 
> Performance
> -----------
> 
> The single-object hot path is roughly 2-3x the cost of mempool
> and an order of magnitude faster than rte_malloc. Under
> multi-lcore contention, fastmem scales similarly to mempool,
> while rte_malloc collapses.
> 
> Limitations
> -----------
> 
> - Maximum allocation: 1 MiB. Larger requests should use rte_malloc.
> - Power-of-2 classes only; worst-case internal fragmentation ~50%.
> - Backing memory not reclaimable short of deinit.
> 
> Future work
> -----------
> 
> - Lcore-affine allocations (false-sharing-free by construction).
> - Mempool ops driver for transparent drop-in use.
> - Pre-resolved allocator handle binding size class and socket,
>   eliminating per-call class lookup and enabling an inline
>   cache-hit fast path.
> - Debug mode (cookies, double-free detection, poison-on-free).
> - Telemetry integration.
> - EAL integration, allowing EAL-internal subsystems to use
>   fastmem for their small-object allocations.
> 
> Mattias Rönnblom (3):
>   doc: add fastmem programming guide
>   lib: add fastmem library
>   app/test: add fastmem test suite
> 
>  app/test/meson.build                  |    3 +
>  app/test/test_fastmem.c               | 1682 +++++++++++++++++++++++++
>  app/test/test_fastmem_perf.c          |  997 +++++++++++++++
>  app/test/test_fastmem_profile.c       |  157 +++
>  doc/api/doxy-api-index.md             |    1 +
>  doc/api/doxy-api.conf.in              |    1 +
>  doc/guides/prog_guide/fastmem_lib.rst |  301 +++++
>  doc/guides/prog_guide/index.rst       |    1 +
>  lib/fastmem/meson.build               |    6 +
>  lib/fastmem/rte_fastmem.c             | 1486 ++++++++++++++++++++++
>  lib/fastmem/rte_fastmem.h             |  644 ++++++++++
>  lib/meson.build                       |    1 +
>  12 files changed, 5280 insertions(+)
>  create mode 100644 app/test/test_fastmem.c
>  create mode 100644 app/test/test_fastmem_perf.c
>  create mode 100644 app/test/test_fastmem_profile.c
>  create mode 100644 doc/guides/prog_guide/fastmem_lib.rst
>  create mode 100644 lib/fastmem/meson.build
>  create mode 100644 lib/fastmem/rte_fastmem.c
>  create mode 100644 lib/fastmem/rte_fastmem.h
> 
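
For readers following the quoted design notes, the bitmask slab lookup
and the 18 power-of-2 size classes can be sketched as below. This is an
illustration only: toy_slab_of(), toy_class_of() and the TOY_* constants
are hypothetical names, not taken from the patch, and the real
implementation may compute the class differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLAB_SIZE ((uintptr_t)1 << 21) /* 2 MiB, also the alignment */
#define TOY_MIN_SIZE  ((size_t)8)          /* smallest class: 8 B */
#define TOY_MAX_SIZE  ((size_t)1 << 20)    /* largest class: 1 MiB */

/* Slabs are 2 MiB-aligned, so clearing the low 21 bits of any object
 * pointer yields the start of the slab that contains it. */
static inline void *toy_slab_of(const void *obj)
{
	return (void *)((uintptr_t)obj & ~(TOY_SLAB_SIZE - 1));
}

/* Index of the smallest power-of-2 class that fits 'size':
 * class 0 is 8 B, class 17 is 1 MiB, giving 18 classes in total. */
static inline unsigned int toy_class_of(size_t size)
{
	unsigned int cls = 0;
	size_t cap = TOY_MIN_SIZE;

	assert(size <= TOY_MAX_SIZE);
	while (cap < size) {
		cap <<= 1;
		cls++;
	}
	return cls;
}
```

A request of 9 bytes lands in the 16-byte class, which is where the
roughly 50% worst-case internal fragmentation comes from.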

This is a largish patchset, so I ran an AI review with the full Claude model.

Series review: [RFC 0/3] add fastmem allocator
Reviewed against the v1 RFC posted 2026-05-25.

[RFC 1/3] doc: add fastmem programming guide

Info: doc/guides/prog_guide/fastmem_lib.rst -- "\ No newline at end of file"
   The new RST file does not end with a newline.

[RFC 2/3] lib: add fastmem library

Error: lib/fastmem/rte_fastmem.c -- use-after-free during rte_fastmem_deinit()
   when caches were allocated cross-socket.

   cache_create() places the cache struct on the *calling thread's* socket,
   not on the socket the cache serves:

       unsigned int own_socket = rte_socket_id();
       ...
       alloc_socket = &fastmem->sockets[own_socket];
       cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
       ...
       *slot = cache;          /* slot is in socket K's caches[][] */

   So an lcore on socket S that calls rte_fastmem_alloc_socket(..., K) with
   S != K creates a cache whose memory lives in socket S's memzone but is
   reachable through socket K's caches[lcore][class].

   rte_fastmem_deinit() then walks sockets in index order:

       for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
               release_socket(&fastmem->sockets[i]);

   and release_socket() does, in this order:

       socket_release_caches(socket);            /* (1) */
       for (c...) bin_release(&socket->bins[c], socket);  /* (2) */
       for (i...) rte_memzone_free(socket->memzones[i]);  /* (3) */

   When i = S, step (3) frees socket S's memzones. When i = K (K > S),
   socket_release_caches(K) runs:

       cache_slab = slab_of(cache);             /* in socket S's freed mz */
       bin_free_one(cache_slab->bin, cache);    /* reads cache_slab->bin */

   cache_slab points into a freed memzone, so cache_slab->bin and the
   subsequent push (slab->free_head = obj; slab->free_count++; in
   bin_push_locked()) read and write released memory. slab_release() may
   then re-attach the slab to socket S's free_head, which was zeroed and
   whose backing is gone.

   This is triggered by any application that allocates from a non-local
   socket via SOCKET_ID_ANY fallback or explicit socket_id, which the
   programming guide describes as a normal mode of operation. The
   existing test_alloc_socket and test_alloc_socket_numa_placement use
   rte_socket_id_by_idx(0) (the local socket) so the bug is not
   exercised by the test suite.

   Either order the teardown in three phases (all caches across all
   sockets first, then all bins, then all memzones), or allocate the
   cache struct from the socket it serves rather than the calling
   thread's socket.
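
The ordering problem and the phased-teardown fix can be modelled with a
toy program. This is a sketch under simplifying assumptions, not the
fastmem code: struct toy_socket and the toy_* functions are
hypothetical, the bin-release phase is reduced to a comment, and a
"stale" access is simply counted instead of actually touching freed
memory.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SOCKETS 2

/* Hypothetical stand-in for fastmem's per-socket state. */
struct toy_socket {
	bool memzone_live; /* backing memzones still reserved */
	bool has_cache;    /* a cache is reachable via this socket's caches[][] */
	int cache_home;    /* socket whose memzone holds that cache struct */
};

/* Release one socket's cache. Returns 1 if doing so touched a slab in
 * an already-freed memzone (the use-after-free), else 0. */
static int toy_release_caches(struct toy_socket *socks, int i)
{
	int stale = 0;

	if (socks[i].has_cache) {
		int home = socks[i].cache_home;

		assert(home >= 0 && home < TOY_SOCKETS);
		stale = !socks[home].memzone_live;
		socks[i].has_cache = false;
	}
	return stale;
}

/* Ordering described in the review: caches and memzones per socket. */
static int toy_deinit_per_socket(struct toy_socket *socks)
{
	int stale = 0;

	for (int i = 0; i < TOY_SOCKETS; i++) {
		stale += toy_release_caches(socks, i);
		socks[i].memzone_live = false;
	}
	return stale;
}

/* Phased ordering: all caches first, memzones last. */
static int toy_deinit_phased(struct toy_socket *socks)
{
	int stale = 0;

	for (int i = 0; i < TOY_SOCKETS; i++)
		stale += toy_release_caches(socks, i);
	/* (all bins would be released here) */
	for (int i = 0; i < TOY_SOCKETS; i++)
		socks[i].memzone_live = false;
	return stale;
}
```

With socket 1's cache homed in socket 0's memzone (S = 0, K = 1), the
per-socket ordering reports one stale access and the phased ordering
reports none.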

Warning: lib/fastmem/rte_fastmem.c -- non-atomic access to shared 64-bit
   statistics counters.

   cache->alloc_cache_hits, alloc_cache_misses, alloc_nomem,
   free_cache_hits, free_cache_misses, and the bin counters
   slab_acquires, slab_releases, slabs_partial, slabs_full are
   incremented as plain C reads/writes by the owning lcore and read
   from another thread via rte_fastmem_stats(), rte_fastmem_stats_class(),
   rte_fastmem_stats_lcore(), and rte_fastmem_stats_lcore_class(). Per
   the C standard this is a data race, and hence undefined behavior,
   on every architecture; where 64-bit accesses are not naturally
   atomic the reads can also tear. Even on x86-64, -fsanitize=thread
   reports it.

   Use rte_atomic_fetch_add_explicit() with rte_memory_order_relaxed on
   the producer side and rte_atomic_load_explicit() with relaxed
   ordering on the reader side. Per AGENTS.md / the DPDK convention,
   relaxed ordering is appropriate for these counters.
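
As a rough illustration of the relaxed-counter pattern, here is a
sketch using plain C11 <stdatomic.h> as a stand-in for the rte_atomic_*
wrappers named above. struct toy_stats and the toy_* helpers are
hypothetical. The single-writer variant is an assumption about how the
cost could be kept low when only the owning lcore ever updates a
counter; it is not something the patch is known to do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cache statistics block. */
struct toy_stats {
	_Atomic uint64_t alloc_cache_hits;
};

/* General form: atomic read-modify-write, safe with several writers. */
static inline void toy_stats_hit(struct toy_stats *s)
{
	assert(s != NULL);
	atomic_fetch_add_explicit(&s->alloc_cache_hits, 1,
				  memory_order_relaxed);
}

/* Single-writer form: a relaxed load followed by a relaxed store.
 * Only valid when exactly one thread ever writes the counter. */
static inline void toy_stats_hit_single_writer(struct toy_stats *s)
{
	uint64_t v;

	assert(s != NULL);
	v = atomic_load_explicit(&s->alloc_cache_hits, memory_order_relaxed);
	atomic_store_explicit(&s->alloc_cache_hits, v + 1,
			      memory_order_relaxed);
}

/* Reader side: a relaxed load never observes a torn value. */
static inline uint64_t toy_stats_read(struct toy_stats *s)
{
	assert(s != NULL);
	return atomic_load_explicit(&s->alloc_cache_hits,
				    memory_order_relaxed);
}
```

Relaxed ordering gives tear-free values without imposing any ordering
on surrounding memory accesses, which is all a diagnostic counter needs.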

Warning: lib/fastmem/rte_fastmem.c -- pointer publish in cache_create()
   without release ordering.

       *slot = cache;
       return cache;

   The struct fields (count, capacity, target, the stats counters) are
   written before this store but with no fence or release barrier. A
   concurrent stats reader doing socket->caches[l][c] followed by
   cache->* could observe the pointer but not all initialized fields.
   Even ignoring the stats reader, rte_fastmem_cache_flush() invoked
   from a different lcore on the same cache (not currently possible by
   API contract, but the field is technically reachable) would race.
   Pair with rte_atomic_store_explicit(..., rte_memory_order_release)
   and a matching acquire load on the reader path.
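
The release/acquire pairing can be sketched as follows, again with C11
<stdatomic.h> standing in for rte_atomic_store_explicit() and
rte_atomic_load_explicit(). struct toy_cache, toy_publish() and
toy_lookup() are hypothetical, and the field values are arbitrary
apart from target being half of capacity as noted in this review.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical subset of the cache struct. */
struct toy_cache {
	unsigned int count;
	unsigned int capacity;
	unsigned int target;
};

/* Writer: initialize every field, then publish the pointer with
 * release ordering so the initialization is visible to any thread
 * that observes the pointer through an acquire load. */
static void toy_publish(struct toy_cache *_Atomic *slot,
			struct toy_cache *cache, unsigned int capacity)
{
	assert(slot != NULL && cache != NULL);
	cache->count = 0;
	cache->capacity = capacity;
	cache->target = capacity / 2;
	atomic_store_explicit(slot, cache, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above.
 * Returns NULL if no cache has been published yet. */
static struct toy_cache *toy_lookup(struct toy_cache *_Atomic *slot)
{
	assert(slot != NULL);
	return atomic_load_explicit(slot, memory_order_acquire);
}
```

A reader that gets a non-NULL pointer from toy_lookup() is then
guaranteed to see the initialized count, capacity and target.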

Warning: lib/fastmem/rte_fastmem.c -- spurious ENOMEM window during slab
   release.

   bin_push_locked() removes a fully-drained slab from bin->partial
   before bin_free_one() drops the bin lock; slab_release() then puts
   it on socket->free_head under the socket lock. Between the unlock
   and slab_release(), another lcore allocating in any class on the
   same socket can see free_head == NULL, hit the memory_limit (or
   FASTMEM_MAX_MEMZONES_PER_SOCKET) check in grow_socket(), and return
   ENOMEM even though the slab is about to become available. Not a
   correctness issue but visible to applications that pin tightly to
   their limit.

Info: lib/fastmem/rte_fastmem.c local_socket_id() final fallback:

       return (unsigned int)rte_socket_id_by_idx(0);

   rte_socket_id_by_idx() returns int and is documented to return -1 on
   error. If there are zero configured sockets the cast yields UINT_MAX
   and fastmem->sockets[UINT_MAX] is out of bounds. Realistically there
   is always at least one socket, but a defensive check (return 0, or
   fail allocation explicitly) would avoid the corner case.
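
One possible shape for the defensive check, written as a standalone
helper so it does not depend on DPDK headers. toy_socket_index() and
TOY_MAX_NUMA_NODES are hypothetical; in the library the bound would be
RTE_MAX_NUMA_NODES and the input would be the int returned by
rte_socket_id_by_idx(0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NUMA_NODES 32 /* assumption: stands in for RTE_MAX_NUMA_NODES */

/* Validate a signed socket id before it is used as an array index.
 * Returns false for -1 (the error value) or any out-of-range id, so
 * the caller can fail the allocation instead of indexing out of bounds. */
static bool toy_socket_index(int socket_id, unsigned int *out)
{
	assert(out != NULL);
	if (socket_id < 0 || socket_id >= TOY_MAX_NUMA_NODES)
		return false;
	*out = (unsigned int)socket_id;
	return true;
}
```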

Info: lib/fastmem/rte_fastmem.c cache_pop() refills to cache->target
   (half capacity) rather than to capacity. Subsequent single-object
   allocs only get target-1 hits before the next bin trip. Likely
   intentional for fairness with bulk callers, but worth a comment.
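
The hit/miss arithmetic can be checked with a small model. This is a
sketch that assumes a miss refills the cache to target and then hands
out one object, as described above; struct toy_lifo and toy_alloc()
are hypothetical and ignore frees and bulk operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a per-lcore cache's fill level and counters. */
struct toy_lifo {
	unsigned int count;  /* objects currently cached */
	unsigned int target; /* refill level (half of capacity) */
	unsigned long hits;
	unsigned long misses;
};

/* Model one single-object allocation. */
static void toy_alloc(struct toy_lifo *c)
{
	assert(c != NULL && c->target > 0);
	if (c->count == 0) {
		/* Cache miss: one bin trip refills up to target. */
		c->misses++;
		c->count = c->target;
	} else {
		c->hits++;
	}
	c->count--; /* hand one object to the caller */
}
```

Starting from an empty cache with target = 32, the first allocation
misses, the next 31 hit, and the 33rd misses again.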

Info: lib/meson.build inserts 'fastmem' between 'dispatcher' and
   'gpudev'. The natural alphabetical position is between 'efd' and
   'fib'; fastmem has no dependency on dispatcher.

[RFC 3/3] app/test: add fastmem test suite

Warning: app/test/test_fastmem.c -- REGISTER_FAST_TEST uses NOHUGE_OK
   but the functional tests need real memzone-backed memory.

       REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_OK, ASAN_OK,
                          test_fastmem);

   test_fastmem runs both the lifecycle suite (no allocations) and the
   functional suite, which requests 128 MiB IOVA-contiguous memzones.
   In --no-huge mode IOVA-contiguous reservation of that size is not
   reliable, so NOHUGE_SKIP is more honest. If you want the lifecycle
   tests to remain no-huge-friendly, register them as a separate
   test command.

Warning: app/test/test_fastmem.c -- the suite never exercises
   cross-socket cache allocation.

   test_alloc_socket and test_alloc_socket_numa_placement both use
   rte_socket_id_by_idx(0) (the local socket). Add a test that runs on
   a worker lcore whose rte_socket_id() differs from the target
   socket_id passed to rte_fastmem_alloc_socket(), then calls
   rte_fastmem_deinit(). This would have caught the deinit UAF above.

Info: app/test/test_fastmem.c -- several test functions declare an
   uninitialized `int rc;` that is never read or written (e.g.
   test_alloc_too_big, test_alloc_invalid_align, test_alloc_free_small,
   test_alloc_alignment, test_alloc_socket, test_alloc_block_repurposing
   and others). Drop the declarations.

Info: app/test/test_fastmem.c trailing blank-line clusters (two blank
   lines before "return TEST_SUCCESS;" in test_reserve_multiple_memzones,
   test_reserve_cumulative, test_reserve_invalid_socket,
   test_reserve_any_socket, test_alloc_too_big, ...). Drop the extra
   blank line.


* Re: [RFC 0/3] lib/fastmem: fast small-object allocator
  2026-05-25 18:36 ` Stephen Hemminger
@ 2026-05-25 19:43   ` Mattias Rönnblom
  0 siblings, 0 replies; 38+ messages in thread
From: Mattias Rönnblom @ 2026-05-25 19:43 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, Morten Brørup, Konstantin Ananyev,
	Mattias Rönnblom, Yogaraj Baskaravel

On 5/25/26 20:36, Stephen Hemminger wrote:
> On Mon, 25 May 2026 12:36:39 +0200
> Mattias Rönnblom <hofors@lysator.liu.se> wrote:
> 
>> [...]
> 
> This is a largish patchset, so I ran an AI review with the full Claude model.
> 
> Series review: [RFC 0/3] add fastmem allocator
> Reviewed against the v1 RFC posted 2026-05-25.
> 
> 
> [RFC 1/3] doc: add fastmem programming guide
> 
> Info: doc/guides/prog_guide/fastmem_lib.rst -- "\ No newline at end of file"
>     The new RST file does not end with a newline.
> 
> 
> [RFC 2/3] lib: add fastmem library
> 
> Error: lib/fastmem/rte_fastmem.c -- use-after-free during rte_fastmem_deinit()
>     when caches were allocated cross-socket.
> 
>     cache_create() places the cache struct on the *calling thread's* socket,
>     not on the socket the cache serves:
> 
>         unsigned int own_socket = rte_socket_id();
>         ...
>         alloc_socket = &fastmem->sockets[own_socket];
>         cache = bin_alloc_one(&alloc_socket->bins[cache_class]);
>         ...
>         *slot = cache;          /* slot is in socket K's caches[][] */
> 
>     So an lcore on socket S that calls rte_fastmem_alloc_socket(..., K) with
>     S != K creates a cache whose memory lives in socket S's memzone but is
>     reachable through socket K's caches[lcore][class].
> 
>     rte_fastmem_deinit() then walks sockets in index order:
> 
>         for (i = 0; i < RTE_MAX_NUMA_NODES; i++)
>                 release_socket(&fastmem->sockets[i]);
> 
>     and release_socket() does, in this order:
> 
>         socket_release_caches(socket);            /* (1) */
>         for (c...) bin_release(&socket->bins[c], socket);  /* (2) */
>         for (i...) rte_memzone_free(socket->memzones[i]);  /* (3) */
> 
>     When i = S, step (3) frees socket S's memzones. When i = K (K > S),
>     socket_release_caches(K) runs:
> 
>         cache_slab = slab_of(cache);             /* in socket S's freed mz */
>         bin_free_one(cache_slab->bin, cache);    /* reads cache_slab->bin */
> 
>     cache_slab points into a freed memzone, so cache_slab->bin and the
>     subsequent push (slab->free_head = obj; slab->free_count++; in
>     bin_push_locked()) read and write released memory. slab_release() may
>     then re-attach the slab to socket S's free_head, which was zeroed and
>     whose backing is gone.
> 
>     This is triggered by any application that allocates from a non-local
>     socket via SOCKET_ID_ANY fallback or explicit socket_id, which the
>     programming guide describes as a normal mode of operation. The
>     existing test_alloc_socket and test_alloc_socket_numa_placement use
>     rte_socket_id_by_idx(0) (the local socket) so the bug is not
>     exercised by the test suite.
> 
>     Either order the teardown in three phases (all caches across all
>     sockets first, then all bins, then all memzones), or allocate the
>     cache struct from the socket it serves rather than the calling
>     thread's socket.
> 
> Warning: lib/fastmem/rte_fastmem.c -- non-atomic access to shared 64-bit
>     statistics counters.
> 
>     cache->alloc_cache_hits, alloc_cache_misses, alloc_nomem,
>     free_cache_hits, free_cache_misses, and the bin counters
>     slab_acquires, slab_releases, slabs_partial, slabs_full are
>     incremented as plain C reads/writes by the owning lcore and read
>     from another thread via rte_fastmem_stats(), rte_fastmem_stats_class(),
>     rte_fastmem_stats_lcore(), and rte_fastmem_stats_lcore_class(). Per
>     the C standard this is a data race, and hence undefined behavior,
>     on every architecture; where 64-bit accesses are not naturally
>     atomic the reads can also tear. Even on x86-64, -fsanitize=thread
>     reports it.
> 
>     Use rte_atomic_fetch_add_explicit() with rte_memory_order_relaxed on
>     the producer side and rte_atomic_load_explicit() with relaxed
>     ordering on the reader side. Per AGENTS.md / the DPDK convention,
>     relaxed ordering is appropriate for these counters.
> 
> Warning: lib/fastmem/rte_fastmem.c -- pointer publish in cache_create()
>     without release ordering.
> 
>         *slot = cache;
>         return cache;
> 
>     The struct fields (count, capacity, target, the stats counters) are
>     written before this store but with no fence or release barrier. A
>     concurrent stats reader doing socket->caches[l][c] followed by
>     cache->* could observe the pointer but not all initialized fields.
>     Even ignoring the stats reader, rte_fastmem_cache_flush() invoked
>     from a different lcore on the same cache (not currently possible by
>     API contract, but the field is technically reachable) would race.
>     Pair with rte_atomic_store_explicit(..., rte_memory_order_release)
>     and a matching acquire load on the reader path.
> 
> Warning: lib/fastmem/rte_fastmem.c -- spurious ENOMEM window during slab
>     release.
> 
>     bin_push_locked() removes a fully-drained slab from bin->partial
>     before bin_free_one() drops the bin lock; slab_release() then puts
>     it on socket->free_head under the socket lock. Between the unlock
>     and slab_release(), another lcore allocating in any class on the
>     same socket can see free_head == NULL, hit the memory_limit (or
>     FASTMEM_MAX_MEMZONES_PER_SOCKET) check in grow_socket(), and return
>     ENOMEM even though the slab is about to become available. Not a
>     correctness issue but visible to applications that pin tightly to
>     their limit.
> 
> Info: lib/fastmem/rte_fastmem.c local_socket_id() final fallback:
> 
>         return (unsigned int)rte_socket_id_by_idx(0);
> 
>     rte_socket_id_by_idx() returns int and is documented to return -1 on
>     error. If there are zero configured sockets the cast yields UINT_MAX
>     and fastmem->sockets[UINT_MAX] is out of bounds. Realistically there
>     is always at least one socket, but a defensive check (return 0, or
>     fail allocation explicitly) would avoid the corner case.
> 
> Info: lib/fastmem/rte_fastmem.c cache_pop() refills to cache->target
>     (half capacity) rather than to capacity. Subsequent single-object
>     allocs only get target-1 hits before the next bin trip. Likely
>     intentional for fairness with bulk callers, but worth a comment.
> 
> Info: lib/meson.build inserts 'fastmem' between 'dispatcher' and
>     'gpudev'. The natural alphabetical position is between 'efd' and
>     'fib'; fastmem has no dependency on dispatcher.
> 
> 
> [RFC 3/3] app/test: add fastmem test suite
> 
> Warning: app/test/test_fastmem.c -- REGISTER_FAST_TEST uses NOHUGE_OK
>     but the functional tests need real memzone-backed memory.
> 
>         REGISTER_FAST_TEST(fastmem_autotest, NOHUGE_OK, ASAN_OK,
>                            test_fastmem);
> 
>     test_fastmem runs both the lifecycle suite (no allocations) and the
>     functional suite, which requests 128 MiB IOVA-contiguous memzones.
>     In --no-huge mode IOVA-contiguous reservation of that size is not
>     reliable, so NOHUGE_SKIP is more honest. If you want the lifecycle
>     tests to remain no-huge-friendly, register them as a separate
>     test command.
> 
> Warning: app/test/test_fastmem.c -- the suite never exercises
>     cross-socket cache allocation.
> 
>     test_alloc_socket and test_alloc_socket_numa_placement both use
>     rte_socket_id_by_idx(0) (the local socket). Add a test that runs on
>     a worker lcore whose rte_socket_id() differs from the target
>     socket_id passed to rte_fastmem_alloc_socket(), then calls
>     rte_fastmem_deinit(). This would have caught the deinit UAF above.
> 
> Info: app/test/test_fastmem.c -- several test functions declare an
>     uninitialized `int rc;` that is never read or written (e.g.
>     test_alloc_too_big, test_alloc_invalid_align, test_alloc_free_small,
>     test_alloc_alignment, test_alloc_socket, test_alloc_block_repurposing
>     and others). Drop the declarations.
> 
> Info: app/test/test_fastmem.c trailing blank-line clusters (two blank
>     lines before "return TEST_SUCCESS;" in test_reserve_multiple_memzones,
>     test_reserve_cumulative, test_reserve_invalid_socket,
>     test_reserve_any_socket, test_alloc_too_big, ...). Drop the extra
>     blank line.

Thanks. I've addressed the above issues, and the fixes will be available
in RFC v2, except for the following findings (numbered by their order
under [RFC 2/3]), which I have left as-is:

#2 - Non-atomic stats counters

     These are diagnostic counters that are read cross-thread. On the
     64-bit architectures DPDK supports, aligned uint64_t loads and
     stores are atomic in practice; a torn read (e.g., on 32-bit x86)
     at worst yields a transiently incorrect counter value. Not worth
     the ceremony.

#3 - Pointer publish without release ordering

     On weakly-ordered architectures a stats reader could briefly see
     uninitialized counter values for a newly-created cache. Acceptable
     for diagnostic data.

#4 - Spurious ENOMEM window during slab release

     Narrow timing window, not a correctness bug. Closing it would
     require holding the bin lock across slab_release(), reintroducing
     the contention the design avoids.



Thread overview: 38+ messages
2026-05-25 10:36 [RFC 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
2026-05-25 10:36 ` [RFC 1/3] doc: add fastmem programming guide Mattias Rönnblom
2026-05-25 10:36 ` [RFC 2/3] lib: add fastmem library Mattias Rönnblom
2026-05-27 14:22   ` Stephen Hemminger
2026-05-27 17:25     ` Mattias Rönnblom
2026-05-25 10:36 ` [RFC 3/3] app/test: add fastmem test suite Mattias Rönnblom
2026-05-26  8:57   ` [RFC v2 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
2026-05-26  8:57     ` [RFC v2 1/3] doc: add fastmem programming guide Mattias Rönnblom
2026-05-26  8:57     ` [RFC v2 2/3] lib: add fastmem library Mattias Rönnblom
2026-05-26 13:23       ` Stephen Hemminger
2026-05-27 10:12         ` Mattias Rönnblom
2026-05-27 10:18           ` Bruce Richardson
2026-05-27 11:17             ` Mattias Rönnblom
2026-05-27 11:17             ` Morten Brørup
2026-05-27 11:29               ` Mattias Rönnblom
2026-05-27 12:03                 ` Morten Brørup
2026-05-26  8:57     ` [RFC v2 3/3] app/test: add fastmem test suite Mattias Rönnblom
2026-05-27 17:30       ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
2026-05-27 17:30         ` [RFC v3 1/3] doc: add fastmem programming guide Mattias Rönnblom
2026-05-30  9:26           ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Mattias Rönnblom
2026-05-30  9:26             ` [RFC v4 1/3] doc: add fastmem programming guide Mattias Rönnblom
2026-05-30  9:26             ` [RFC v4 2/3] lib: add fastmem library Mattias Rönnblom
2026-05-30  9:26             ` [RFC v4 3/3] app/test: add fastmem test suite Mattias Rönnblom
2026-06-10 12:35             ` [RFC v4 0/3] lib/fastmem: fast small-object allocator Konstantin Ananyev
2026-05-27 17:30         ` [RFC v3 2/3] lib: add fastmem library Mattias Rönnblom
2026-05-28  9:11           ` Morten Brørup
2026-05-28 14:45             ` Varghese, Vipin
2026-05-28 19:56               ` Morten Brørup
2026-05-29 14:29                 ` Varghese, Vipin
2026-05-30 16:22                 ` Mattias Rönnblom
2026-05-27 17:30         ` [RFC v3 3/3] app/test: add fastmem test suite Mattias Rönnblom
2026-05-28  9:02         ` [RFC v3 0/3] lib/fastmem: fast small-object allocator Morten Brørup
2026-05-25 14:30 ` [RFC " Stephen Hemminger
2026-05-25 19:39   ` Mattias Rönnblom
2026-05-25 22:18     ` Stephen Hemminger
2026-05-26  7:01       ` Mattias Rönnblom
2026-05-25 18:36 ` Stephen Hemminger
2026-05-25 19:43   ` Mattias Rönnblom
