From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
daniel@iogearbox.net, kafai@meta.com, kernel-team@meta.com,
eddyz87@gmail.com, memxor@gmail.com,
herbert@gondor.apana.org.au
Cc: Mykyta Yatsenko <yatsenko@meta.com>,
Emil Tsalapatis <emil@etsalapatis.com>
Subject: [PATCH bpf-next v4 00/11] bpf: Introduce resizable hash map
Date: Wed, 13 May 2026 15:36:03 -0700
Message-ID: <20260513-rhash-v4-0-dd3d541ccb0b@meta.com>
This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide a resizable hash map for BPF.
The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:
1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
under-provisioning hurts performance)
BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.
The implementation wraps the kernel's rhashtable with BPF map operations:
- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- In-place updates for improved performance.
- max_entries serves as a hard limit, not bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking, similar to
  the existing BPF hashmap's raw_spin_lock_irqsave(); insertions and
  deletions may fail.
- Iterations are best-effort: if resizes, insertions, or deletions take
  place concurrently, an iteration may visit the same element multiple
  times or skip elements.
- Insertions are locked out while the special-fields destructor runs, to
  guarantee its completion.
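To make the points above concrete, a map of the new type could be declared
in a BPF program roughly like this (a sketch, not code from the series; the
map name and sizes are hypothetical, only the type and required flag follow
the description above):

```c
/* Sketch of a BPF_MAP_TYPE_RHASH definition using the libbpf
 * SEC(".maps") convention. BPF_F_NO_PREALLOC is required because
 * elements are allocated on demand; max_entries is a hard element
 * limit, not a bucket count.
 */
struct {
	__uint(type, BPF_MAP_TYPE_RHASH);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__uint(max_entries, 1 << 20);	/* hard limit on elements */
	__type(key, __u64);
	__type(value, __u64);
} resizable_map SEC(".maps");
```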
The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Update implementation
---------------------
The current implementation of BPF_MAP_TYPE_RHASH does not provide the
same strong guarantees on value consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer. BPF_MAP_TYPE_RHASH does memcpy in place with no lock held.
rhash trades consistency for speed: concurrent readers can observe
partially updated data, and two concurrent writers to the same key can
interleave, producing mixed values. This is similar to the arraymap
update implementation, including the handling of special fields.
As a solution, users may use BPF_F_LOCK to guarantee consistent reads
and write serialization.
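The BPF_F_LOCK usage described above could look roughly as follows from
user space (a minimal sketch; the value layout, map fd, and key are
hypothetical, and the value must embed a struct bpf_spin_lock for
BPF_F_LOCK to apply):

```c
/* Value layout with an embedded spin lock (hypothetical example). */
struct val {
	struct bpf_spin_lock lock;
	__u64 counter;
};

struct val v = { .counter = 42 };

/* Locked in-place write: serialized against other F_LOCK writers. */
bpf_map_update_elem(map_fd, &key, &v, BPF_F_LOCK);

/* Locked read: sees a consistent value, never a torn one. */
bpf_map_lookup_elem_flags(map_fd, &key, &v, BPF_F_LOCK);
```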
Summary of the read consistency guarantees:
map type | write mechanism | read consistency
-------------+------------------+--------------------------
htab | alloc, swap ptr | always consistent (RCU)
htab F_LOCK | in-place + lock | consistent if reader locks
-------------+------------------+--------------------------
rhtab | in-place memcpy | torn reads
rhtab F_LOCK | in-place + lock | consistent if reader locks
Benchmarks
----------
1. LOOKUP (single producer, M events/sec)
key | max | nr | htab | rhtab | ratio | delta
----+-----+-------+---------+---------+-------+-------
8 | 1K | 750 | 99.85 | 81.92 | 0.82x | -18 %
8 | 1K | 1K | 100.71 | 80.19 | 0.80x | -20 %
8 | 1M | 750K | 23.37 | 72.09 | 3.08x | +208 %
8 | 1M | 1M | 13.39 | 53.72 | 4.01x | +301 %
32 | 1K | 750 | 51.57 | 42.78 | 0.83x | -17 %
32 | 1K | 1K | 50.81 | 45.83 | 0.90x | -10 %
32 | 1M | 750K | 11.27 | 15.29 | 1.36x | +36 %
32 | 1M | 1M | 7.32 | 8.75 | 1.19x | +19 %
256 | 1K | 750 | 7.58 | 7.88 | 1.04x | +4 %
256 | 1K | 1K | 7.43 | 7.81 | 1.05x | +5 %
256 | 1M | 750K | 3.69 | 4.27 | 1.16x | +16 %
256 | 1M | 1M | 2.60 | 3.12 | 1.20x | +20 %
Pattern:
* Small map (1K): htab wins for 8 / 32 byte keys by 10-20 %
because the preallocated bucket array fits in L1. Equalises
at 256 byte keys.
* Large map (1M): rhtab wins everywhere, up to 4x at high load
factor with 8 byte keys.
* Higher load factor amplifies rhtab's lead: rhtab grows the
bucket array; htab stays at user-declared max.
2. FULL UPDATE (M events/sec per producer, -p 7)
htab per-producer:
20.33 22.02 19.27 23.61 24.18 23.17 21.07
mean 21.94 range 19.27 - 24.18
rhtab per-producer:
133.51 129.47 74.52 129.29 102.26 129.98 107.64
mean 115.24 range 74.52 - 133.51
speedup (mean): 5.25x (+425 %)
In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.
3. MEMORY (overwrite, -p 8, no --preallocated)
value_size | htab ops/s | rhtab ops/s | htab mem | rhtab mem
-----------+-------------+-------------+----------+----------
32 B | 122.87 k/s | 133.04 k/s | 2.47 MiB | 2.49 MiB
4096 B | 64.43 k/s | 65.38 k/s | 6.74 MiB | 6.44 MiB
rhtab/htab : +8 % ops, +0.8 % mem (32 B)
+1 % ops, -4 % mem (4096 B)
SUMMARY
* Small / well-fitting map: htab is faster (cache-friendly
fixed bucket array), but only by ~10-20 %.
* Large / high-load-factor map: rhtab is dramatically faster
(1.2x to 4x) because rhashtable resizes to keep the load
factor sane while htab stays stuck at user-declared max.
* Update-heavy workloads: rhtab is ~5x faster per producer
via in-place memcpy.
* Memory benchmark: effectively on par
---
Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based
iteration for BPF (also drops earlier rhashtable_walk_enter_from()
proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
bucket-lock paths: closes same-CPU NMI tracing recursion without
rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating
user buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn
via static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the special-
fields commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com
Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
  fail on resize when the table is fully iterated inside the kernel,
  don't fail on resize when iteration goes through userspace.
  Exception: rhtab_map_free_internal_structs() is safe to iterate fully
  in the kernel, with no risk of an infinite loop, because no user holds
  a reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/
Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable to use the walk API
  instead of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com
---
Mykyta Yatsenko (11):
rhashtable: Add rhashtable_next_key() API
rhashtable: Add selftest for rhashtable_next_key()
bpf: Implement resizable hashmap basic functions
bpf: Implement iteration ops for resizable hashtab
bpf: Allow special fields in resizable hashtab
bpf: Optimize word-sized keys for resizable hashtable
libbpf: Support resizable hashtable
selftests/bpf: Add basic tests for resizable hash map
selftests/bpf: Add BPF iterator tests for resizable hash map
bpftool: Add rhash map documentation
selftests/bpf: Add resizable hashmap to benchmarks
include/linux/bpf_types.h | 1 +
include/linux/rhashtable.h | 85 +++
include/uapi/linux/bpf.h | 6 +
kernel/bpf/hashtab.c | 805 ++++++++++++++++++++-
kernel/bpf/map_iter.c | 3 +-
kernel/bpf/syscall.c | 5 +
kernel/bpf/verifier.c | 1 +
lib/test_rhashtable.c | 73 ++
tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
tools/bpf/bpftool/map.c | 2 +-
tools/include/uapi/linux/bpf.h | 6 +
tools/lib/bpf/libbpf.c | 1 +
tools/lib/bpf/libbpf_probes.c | 3 +
tools/testing/selftests/bpf/bench.c | 6 +
.../bpf/benchs/bench_bpf_hashmap_full_update.c | 34 +-
.../bpf/benchs/bench_bpf_hashmap_lookup.c | 31 +-
.../testing/selftests/bpf/benchs/bench_htab_mem.c | 35 +-
tools/testing/selftests/bpf/prog_tests/rhash.c | 183 +++++
.../selftests/bpf/progs/bpf_iter_bpf_rhash_map.c | 34 +
tools/testing/selftests/bpf/progs/rhash.c | 248 +++++++
20 files changed, 1546 insertions(+), 18 deletions(-)
---
base-commit: b01b1156ce614469a6b9bd9a23408eb4181e1ae7
change-id: 20251103-rhash-7b70069923d8
Best regards,
--
Mykyta Yatsenko <yatsenko@meta.com>