public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf-next v3 00/10] bpf: Introduce resizable hash map
@ 2026-04-24 19:50 Mykyta Yatsenko
  2026-04-24 19:50 ` [PATCH bpf-next v3 01/10] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
                   ` (9 more replies)
  0 siblings, 10 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-04-24 19:50 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko, Emil Tsalapatis

This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
 under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- max_entries serves as a hard limit, not bucket count

The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Update implementation
---------------------
Current implementation of the BPF_MAP_TYPE_RHASH does not provide
the same strong guarantees on the values consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer. BPF_MAP_TYPE_RHASH does memcpy in place with no lock held.
rhash trades consistency for speed, concurrent readers can observe
partially updated data. Two concurrent writers to the same key can
also interleave, producing mixed values. This is similar to arraymap
update implementation, including handling of the special fields.
As a solution, user may use BPF_F_LOCK to guarantee consistent reads
and write serialization.

Summary of the read consistency guarantees:

  map type     |  write mechanism |  read consistency
  -------------+------------------+--------------------------
  htab         |  alloc, swap ptr |  always consistent (RCU)
  htab  F_LOCK |  in-place + lock |  consistent if reader locks
  -------------+------------------+--------------------------
  rhtab        |  in-place memcpy |  torn reads
  rhtab F_LOCK |  in-place + lock |  consistent if reader locks

Benchmarks
----------

Lookup performance (bpf-rhashmap-lookup vs bpf-hashmap-lookup, events/sec):

  key_size | max_entries | nr_entries |    htab   |   rhtab  |  tbl load
  ---------+-------------+------------+-----------+----------+------------
     4     |    1,000    |    500     |  17.22M   |  16.92M  |   50%
     8     |  1,000,000  |  200,000   |  15.21M   |  13.39M  |   20%
     8     |  1,000,000  |  700,000   |  13.16M   |  13.62M  |   70%
     8     |  1,000,000  | 1,000,000  |   7.93M   |   9.52M  |  100%
    16     |  1,000,000  |  700,000   |  10.77M   |  11.85M  |   70%
    32     |  1,000,000  |  700,000   |  10.24M   |  10.69M  |   70%

rhtab matches or exceeds htab at higher load factors and larger keys.
At full occupancy (1M/1M) rhtab is 20% faster since htab's fixed
bucket count leads to longer chains while rhtab resizes to maintain
a healthy load factor.

In-place update (bpf-*-full-update, events/sec):
  htab :  7.62M
  rhtab: 14.94M  (+96%)

rhtab updates values in place via memcpy; htab allocates a new element
and swaps the pointer for each update.

Memory (htab-mem / rhtab-mem, 8 producers, overwrite workload):

    value_size |  ops/s  |  memory  |  peak memory
    -----------+---------+----------+-------------
    htab    32 | 18.27k  | 0.98 MiB |  1.24 MiB
    rhtab   32 | 27.05k  | 0.99 MiB |  0.99 MiB
    htab  4096 | 14.00k  | 2.23 MiB |  2.23 MiB
    rhtab 4096 | 17.00k  | 1.99 MiB |  1.99 MiB

rhtab is 21-48% faster on concurrent overwrites due to in-place
updates avoiding per-element allocation.
htab's peak memory exceeds steady-state (1.24 vs 0.98 MiB)
because each update allocates a new element before freeing the
old one. rhtab updates in place so peak equals steady-state.

Custom hash functions
---------------------

  Hash function      | key=4     | key=8     | key=32
  -------------------+-----------+-----------+-----------
  jhash (inlined)    | 15-16M    | 16M*      | 15M
  jhash (indirect)   | 11.6M     | 11.9M     | 11.0M
  xxh64 (indirect)   | 11.4M     | 11.9M     | 10.8M
  hsiphash (indirect)| 11.8M     | 11.6M     | 10.7M

Conclusions:
1. The indirect call dispatch is the bottleneck, not the hash
function. All three hash functions perform identically
(~11-12M) because the overhead of going through rhtab->params
(non-constant params -> ht->p.hashfn indirect call.
2. The zeroed-params trick is critical - it lets the compiler
see params.hashfn == NULL at compile time, inlining the
jhash dispatch and avoiding the indirect call entirely. That's
the ~15M vs ~11M difference (~35% improvement).

If we substitute inlined jhash with custom function in rhashtable:

Hash function | key=4   | key=8   | key=32  | key=256
--------------+---------+---------+---------+---------
jhash         | 15.6M   | 15.0M   | 15.0M   |  5.9M
xxh64         | 15.5M   | 14.8M   | 14.8M   | 10.3M

At key=256, xxh64 is 75% faster than jhash (10.3M vs 5.9M).

Asm code for resizable hashmap lookup
-------------------------------------
       : 0                0xffffffff81491860 <rhtab_lookup_elem>:
  0.01 :   ffffffff81491860:        endbr64
  0.08 :   ffffffff81491864:        pushq   %rbp
  0.09 :   ffffffff81491865:        pushq   %r15
  0.02 :   ffffffff81491867:        pushq   %r14
  0.02 :   ffffffff81491869:        pushq   %r13
  0.34 :   ffffffff8149186b:        pushq   %r12
  0.02 :   ffffffff8149186d:        pushq   %rbx
  0.06 :   ffffffff8149186e:        subq    $0x20, %rsp
  0.00 :   ffffffff81491872:        movq    %rsi, %rbx
  0.01 :   ffffffff81491875:        movq    %rdi, %r13
  0.01 :   ffffffff81491878:        movq    0x120(%rdi), %r14
  0.23 :   ffffffff8149187f:        movq    %rdi, (%rsp)
  0.75 :   ffffffff81491883:        movl    0x8(%r14), %edx
  0.66 :   ffffffff81491887:        movzwl  0x132(%r13), %esi
  0.04 :   ffffffff8149188f:        movq    %rbx, %rdi
  0.00 :   ffffffff81491892:        callq   0xffffffff8148e250 <jhash>
  0.05 :   ffffffff81491897:        movl    (%r14), %esi
 21.55 :   ffffffff8149189a:        decl    %esi
  0.47 :   ffffffff8149189c:        andl    %eax, %esi
  0.00 :   ffffffff8149189e:        cmpl    $0x0, 0x4(%r14)
  0.02 :   ffffffff814918a3:        movq    %r14, 0x18(%rsp)
  0.00 :   ffffffff814918a8:        jne     0xffffffff8149193c <rhtab_lookup_elem+0xdc>
  0.03 :   ffffffff814918ae:        movl    %esi, %eax
  0.00 :   ffffffff814918b0:        leaq    (%r14,%rax,8), %rax
  0.25 :   ffffffff814918b4:        addq    $0x40, %rax
  0.67 :   ffffffff814918b8:        movq    %rax, %rcx
  0.00 :   ffffffff814918bb:        orq     $0x1, %rcx
  0.00 :   ffffffff814918bf:        movq    %rcx, 0x8(%rsp)
  0.38 :   ffffffff814918c4:        movq    %rax, 0x10(%rsp)
  0.00 :   ffffffff814918c9:        movq    (%rax), %r12
  8.00 :   ffffffff814918cc:        andq    $-0x2, %r12
  0.00 :   ffffffff814918d0:        je      0xffffffff81491929 <rhtab_lookup_elem+0xc9>
  0.00 :   ffffffff814918d2:        movzwl  0x136(%r13), %r14d
  0.14 :   ffffffff814918da:        negq    %r14
  0.00 :   ffffffff814918dd:        movzwl  0x134(%r13), %r15d
  0.00 :   ffffffff814918e5:        movzwl  0x132(%r13), %r13d
  0.00 :   ffffffff814918ed:        leaq    (%r12,%r14), %rbp
  0.68 :   ffffffff814918f1:        leaq    (%r15,%rbp), %rdi
  2.02 :   ffffffff814918f5:        movq    %rbx, %rsi
  2.63 :   ffffffff814918f8:        movq    %r13, %rdx
  0.00 :   ffffffff814918fb:        callq   0xffffffff81ef69a0 <bcmp>
  6.30 :   ffffffff81491900:        testl   %eax, %eax
 34.73 :   ffffffff81491902:        je      0xffffffff81491920 <rhtab_lookup_elem+0xc0>
  0.00 :   ffffffff81491904:        movq    (%r12), %r12
  1.03 :   ffffffff81491908:        testb   $0x1, %r12b
  0.00 :   ffffffff8149190c:        je      0xffffffff814918ed <rhtab_lookup_elem+0x8d>
  0.00 :   ffffffff8149190e:        cmpq    0x8(%rsp), %r12
  0.00 :   ffffffff81491913:        movq    (%rsp), %r13
  0.00 :   ffffffff81491917:        movq    0x10(%rsp), %rax
  0.00 :   ffffffff8149191c:        jne     0xffffffff814918c9 <rhtab_lookup_elem+0x69>
  0.00 :   ffffffff8149191e:        jmp     0xffffffff81491929 <rhtab_lookup_elem+0xc9>
  0.00 :   ffffffff81491920:        testq   %r12, %r12
  7.31 :   ffffffff81491923:        movq    (%rsp), %r13
  0.43 :   ffffffff81491927:        jne     0xffffffff8149194b <rhtab_lookup_elem+0xeb>
  0.00 :   ffffffff81491929:        movq    0x18(%rsp), %r14
  0.00 :   ffffffff8149192e:        movq    0x30(%r14), %r14
  0.00 :   ffffffff81491932:        testq   %r14, %r14
  0.00 :   ffffffff81491935:        je      0xffffffff81491949 <rhtab_lookup_elem+0xe9>
  0.00 :   ffffffff81491937:        jmp     0xffffffff81491883 <rhtab_lookup_elem+0x23>
  0.00 :   ffffffff8149193c:        movq    %r14, %rdi
  0.00 :   ffffffff8149193f:        callq   0xffffffff818711a0 <rht_bucket_nested>
  0.00 :   ffffffff81491944:        jmp     0xffffffff814918b8 <rhtab_lookup_elem+0x58>
  0.00 :   ffffffff81491949:        xorl    %ebp, %ebp
  0.00 :   ffffffff8149194b:        movq    %rbp, %rax
  0.00 :   ffffffff8149194e:        addq    $0x20, %rsp
  0.23 :   ffffffff81491952:        popq    %rbx
  8.98 :   ffffffff81491953:        popq    %r12
  0.07 :   ffffffff81491955:        popq    %r13
  0.04 :   ffffffff81491957:        popq    %r14
  1.47 :   ffffffff81491959:        popq    %r15
  0.18 :   ffffffff8149195b:        popq    %rbp
  0.00 :   ffffffff8149195c:        jmp     0xffffffff81f182b0 <__pi___x86_return_thunk>

---
Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
fail on resize where we fully iterate on table inside kernel, dont fail on
resize where iteration goes through userspace. Exception -
rhtab_map_free_internal_structs() should be just safe to iterate fully
in kernel, no risk of infinite loop, because no user holding reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/

Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable, use walk API, instead
of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com

---
Mykyta Yatsenko (10):
      bpf: Implement resizable hashmap basic functions
      rhashtable: Add rhashtable_walk_enter_from()
      bpf: Implement get_next_key() resizable hashtab
      bpf: Implement batch ops and iterators for resizable hashtab
      bpf: Allow timers, workqueues and task_work in resizable hashtab
      libbpf: Support resizable hashtable
      selftests/bpf: Add basic tests for resizable hash map
      selftests/bpf: Add BPF iterator tests for resizable hash map
      bpftool: Add rhash map documentation
      selftests/bpf: Add resizable hashmap to benchmarks

 include/linux/bpf_types.h                          |   1 +
 include/linux/rhashtable.h                         |  31 +-
 include/uapi/linux/bpf.h                           |   1 +
 kernel/bpf/hashtab.c                               | 730 ++++++++++++++++++++-
 kernel/bpf/map_iter.c                              |   3 +-
 kernel/bpf/syscall.c                               |   4 +
 kernel/bpf/verifier.c                              |   1 +
 lib/rhashtable.c                                   |  41 ++
 lib/test_rhashtable.c                              | 131 ++++
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |   2 +-
 tools/bpf/bpftool/map.c                            |   2 +-
 tools/include/uapi/linux/bpf.h                     |   1 +
 tools/lib/bpf/libbpf.c                             |   1 +
 tools/lib/bpf/libbpf_probes.c                      |   3 +
 tools/testing/selftests/bpf/bench.c                |   6 +
 .../bpf/benchs/bench_bpf_hashmap_full_update.c     |  34 +-
 .../bpf/benchs/bench_bpf_hashmap_lookup.c          |  31 +-
 .../testing/selftests/bpf/benchs/bench_htab_mem.c  |  35 +-
 tools/testing/selftests/bpf/prog_tests/rhash.c     | 127 ++++
 .../selftests/bpf/progs/bpf_iter_bpf_rhash_map.c   |  34 +
 tools/testing/selftests/bpf/progs/rhash.c          | 239 +++++++
 21 files changed, 1435 insertions(+), 23 deletions(-)
---
base-commit: 05103382104b8ffc701f7b2c79379a0361ecad30
change-id: 20251103-rhash-7b70069923d8

Best regards,
--  
Mykyta Yatsenko <yatsenko@meta.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-04-28 13:20 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-24 19:50 [PATCH bpf-next v3 00/10] bpf: Introduce resizable hash map Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 01/10] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
2026-04-24 20:40   ` sashiko-bot
2026-04-25 20:41     ` Mykyta Yatsenko
2026-04-24 20:45   ` bot+bpf-ci
2026-04-25 20:50     ` Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 02/10] rhashtable: Add rhashtable_walk_enter_from() Mykyta Yatsenko
2026-04-24 20:15   ` sashiko-bot
2026-04-24 20:45   ` bot+bpf-ci
2026-04-28 10:35   ` Herbert Xu
2026-04-24 19:50 ` [PATCH bpf-next v3 03/10] bpf: Implement get_next_key() resizable hashtab Mykyta Yatsenko
2026-04-28 10:33   ` Herbert Xu
2026-04-28 13:20     ` Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 04/10] bpf: Implement batch ops and iterators for " Mykyta Yatsenko
2026-04-24 20:28   ` sashiko-bot
2026-04-25 21:24     ` Mykyta Yatsenko
2026-04-27 13:36       ` Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 05/10] bpf: Allow timers, workqueues and task_work in " Mykyta Yatsenko
2026-04-24 21:05   ` sashiko-bot
2026-04-25 21:29     ` Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 06/10] libbpf: Support resizable hashtable Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 07/10] selftests/bpf: Add basic tests for resizable hash map Mykyta Yatsenko
2026-04-24 20:02   ` sashiko-bot
2026-04-24 20:32   ` bot+bpf-ci
2026-04-24 19:50 ` [PATCH bpf-next v3 08/10] selftests/bpf: Add BPF iterator " Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 09/10] bpftool: Add rhash map documentation Mykyta Yatsenko
2026-04-24 19:50 ` [PATCH bpf-next v3 10/10] selftests/bpf: Add resizable hashmap to benchmarks Mykyta Yatsenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox