[PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map
@ 2026-05-28 17:47 Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
                   ` (10 more replies)
  0 siblings, 11 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko, Emil Tsalapatis

This patch series introduces BPF_MAP_TYPE_RHASH, a new hash map type that
leverages the kernel's rhashtable to provide resizable hash map for BPF.

The existing BPF_MAP_TYPE_HASH uses a fixed number of buckets determined at
map creation time. While this works well for many use cases, it presents
challenges when:

1. The number of elements is unknown at creation time
2. The element count varies significantly during runtime
3. Memory efficiency is important (over-provisioning wastes memory,
 under-provisioning hurts performance)

BPF_MAP_TYPE_RHASH addresses these issues by using rhashtable, which
automatically grows and shrinks based on load factor.

The implementation wraps the kernel's rhashtable with BPF map operations:

- Uses bpf_mem_alloc for RCU-safe memory management
- Supports all standard map operations (lookup, update, delete, get_next_key)
- Supports batch operations (lookup_batch, lookup_and_delete_batch)
- Supports BPF iterators for traversal
- Supports BPF_F_LOCK for spin locks in values
- Requires BPF_F_NO_PREALLOC flag (elements allocated on demand)
- In-place updates for improved performance.
- max_entries serves as a hard limit, not bucket count
- Uses bit_spin_lock() + local_irq_save() for bucket locking,
similar to existing BPF hashmap's raw_spin_lock_irqsave(), insertions and
deletes may fail.
- Iterations are best-effort, if resize, insertions, deletions take place
concurrently, iterations may visit same elements multiple times or skip
elements.
- Lock out insertions, when running special fields destructor to guarantee
its completion.

The series includes comprehensive tests:
- Basic operations in test_maps (lookup, update, delete, get_next_key)
- BPF program tests for lookup/update/delete semantics
- Seq file tests

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Update implementation
---------------------
Current implementation of the BPF_MAP_TYPE_RHASH does not provide
the same strong guarantees on the values consistency under concurrent
reads/writes as BPF_MAP_TYPE_HASH.
BPF_MAP_TYPE_HASH allocates a new element and atomically swaps the
pointer. BPF_MAP_TYPE_RHASH does memcpy in place with no lock held.
rhash trades consistency for speed, concurrent readers can observe
partially updated data. Two concurrent writers to the same key can
also interleave, producing mixed values. This is similar to arraymap
update implementation, including handling of the special fields.
As a solution, user may use BPF_F_LOCK to guarantee consistent reads
and write serialization.

Summary of the read consistency guarantees:

  map type     |  write mechanism |  read consistency
  -------------+------------------+--------------------------
  htab         |  alloc, swap ptr |  always consistent (RCU)
  htab  F_LOCK |  in-place + lock |  consistent if reader locks
  -------------+------------------+--------------------------
  rhtab        |  in-place memcpy |  torn reads
  rhtab F_LOCK |  in-place + lock |  consistent if reader locks

Benchmarks
----------
1. LOOKUP  (single producer, M events/sec)
  key | max | nr    |    htab |   rhtab | ratio | delta
  ----+-----+-------+---------+---------+-------+-------
    8 |  1K |   750 |   99.85 |   81.92 | 0.82x |  -18 %
    8 |  1K |    1K |  100.71 |   80.19 | 0.80x |  -20 %
    8 |  1M |  750K |   23.37 |   72.09 | 3.08x | +208 %
    8 |  1M |    1M |   13.39 |   53.72 | 4.01x | +301 %
   32 |  1K |   750 |   51.57 |   42.78 | 0.83x |  -17 %
   32 |  1K |    1K |   50.81 |   45.83 | 0.90x |  -10 %
   32 |  1M |  750K |   11.27 |   15.29 | 1.36x |  +36 %
   32 |  1M |    1M |    7.32 |    8.75 | 1.19x |  +19 %
  256 |  1K |   750 |    7.58 |    7.88 | 1.04x |   +4 %
  256 |  1K |    1K |    7.43 |    7.81 | 1.05x |   +5 %
  256 |  1M |  750K |    3.69 |    4.27 | 1.16x |  +16 %
  256 |  1M |    1M |    2.60 |    3.12 | 1.20x |  +20 %

Pattern:
  * Small map (1K): htab wins for 8 / 32 byte keys by 10-20 %
    because the preallocated bucket array fits in L1.  Equalises
    at 256 byte keys.
  * Large map (1M): rhtab wins everywhere, up to 4x at high load
    factor with 8 byte keys.
  * Higher load factor amplifies rhtab's lead: rhtab grows the
    bucket array; htab stays at user-declared max.

2. FULL UPDATE  (M events/sec per producer, -p 7)

  htab  per-producer:
    20.33   22.02   19.27   23.61   24.18   23.17   21.07
    mean  21.94   range  19.27 - 24.18

  rhtab per-producer:
   133.51  129.47   74.52  129.29  102.26  129.98  107.64
    mean 115.24   range  74.52 - 133.51

  speedup (mean): 5.25x   (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.

3. MEMORY  (overwrite, -p 8, no --preallocated)

  value_size |  htab ops/s | rhtab ops/s | htab mem | rhtab mem
  -----------+-------------+-------------+----------+----------
       32 B  |  122.87 k/s |  133.04 k/s | 2.47 MiB | 2.49 MiB
     4096 B  |   64.43 k/s |   65.38 k/s | 6.74 MiB | 6.44 MiB
  rhtab/htab :  +8 % ops, +0.8 % mem   (32 B)
                +1 % ops,  -4  % mem (4096 B)

SUMMARY

  * Small / well-fitting map: htab is faster (cache-friendly
    fixed bucket array), but only by ~10-20 %.
  * Large / high-load-factor map: rhtab is dramatically faster
    (1.2x to 4x) because rhashtable resizes to keep the load
    factor sane while htab stays stuck at user-declared max.
  * Update-heavy workloads: rhtab is ~5x faster per producer
    via in-place memcpy.
  * Memory benchmark: effectively on par

---
Changes in v5:
- rhashtable_next_key: add kdoc WARNING to highlight lack of rehash
  detection and unbounded iteration (Herbert).
- rhashtable: selftest now checks IS_ERR() before PTR_ERR comparison
  on the missing-key path (bot+bpf-ci).
- hashtab: drop dead stub bodies and unused map_ops registrations
  from the basic commit; iteration commit adds bodies, structs, and
  registrations together. .map_get_next_key keeps a stub registration
  in the basic commit because the syscall dispatcher does not
  NULL-check it; iteration commit replaces the stub body with the
  real implementation (bot+bpf-ci).
- hashtab: fix batch cursor advancement. v4 stashed the lookahead
  element key but then resumed via next_key(cursor), skipping that
  element across batch boundaries and orphaning it on
  lookup_and_delete_batch. v5 stashes the lookahead key and looks
  it up directly on the next batch entry (bot+bpf-ci, sashiko v3).
- hashtab: document torn-read race in rhtab_map_update_existing,
  matching arraymap semantics (bot+bpf-ci).
- Link to v4: https://patch.msgid.link/20260513-rhash-v4-0-dd3d541ccb0b@meta.com

Changes in v4:
- rhashtable: introduce rhashtable_next_key(), drop walker-based
  iteration for BPF (also drops earlier rhashtable_walk_enter_from()
  proposal).
- map_extra: presize hint via lower 32 bits (nelem_hint), capped at
  U16_MAX.
- Automatic shrinking enabled (was missing despite being advertised).
- Reject key_size > U16_MAX (rhashtable_params.key_len is u16).
- Replace irqs_disabled() guard with bpf_disable_instrumentation around
  bucket-lock paths: closes same-CPU NMI tracing recursion without
  rejecting legitimate IRQ-context callers.
- lookup_and_delete reordered: unlink before copy to avoid populating
  user buffer on concurrent-unlink -ENOENT.
- update_existing reordered: copy then free_fields, matching arraymap.
- Word-sized key fast path (sizeof(long) bytes), inlined hashfn/cmpfn
  via static-const rhashtable_params; works on both 32-bit and 64-bit.
- check_and_init_map_value() on insert (zero special-field bytes from
  recycled bpf_mem_alloc memory; previously bpf_spin_lock could read
  garbage and qspinlock would deadlock).
- BPF_SPIN_LOCK / BPF_RES_SPIN_LOCK allowlist moved to the special-
  fields commit so each commit is bisect-safe.
- Link to v3: https://patch.msgid.link/20260424-rhash-v3-0-d0fa0ce4379b@meta.com

Changes in v3:
- Squash all commits implementing basic functions into one (Alexei)
- Remove selftests that were not necessary (Alexei)
- Resize detection for kernel full iterations, error out on resize (Alexei)
- Remove second lookup in get_next_key() (Emil)
- __acquires(RCU)/__releases(RCU) on seq_start/seq_stop (Emil)
- Use bpf_map_check_op_flags() where it makes sense (Leon)
- Benchmarks refresh, experiment with alternative hash functions
- Rely on iterator invalidation during rehash to handle table resizes:
fail on resize where we fully iterate on table inside kernel, dont fail on
resize where iteration goes through userspace. Exception -
rhtab_map_free_internal_structs() should be just safe to iterate fully
in kernel, no risk of infinite loop, because no user holding reference.
- Handle special fields during in-place updates (Emil, sashiko)
- Link to v2: https://lore.kernel.org/all/20260408-rhash-v2-0-3b3675da1f6e@meta.com/

Changes in v2:
- Added benchmarks
- Reworked all functions that walk the rhashtable, use walk API, instead
of directly accessing tbl and future_tbl
- Added rhashtable_walk_enter_from() into rhashtable to support O(1)
iteration continuations
- Link to v1: https://lore.kernel.org/r/20260205-rhash-v1-0-30dd6d63c462@meta.com

---
Mykyta Yatsenko (11):
      rhashtable: Add rhashtable_next_key() API
      rhashtable: Add selftest for rhashtable_next_key()
      bpf: Implement resizable hashmap basic functions
      bpf: Implement iteration ops for resizable hashtab
      bpf: Allow special fields in resizable hashtab
      bpf: Optimize word-sized keys for resizable hashtable
      libbpf: Support resizable hashtable
      selftests/bpf: Add basic tests for resizable hash map
      selftests/bpf: Add BPF iterator tests for resizable hash map
      bpftool: Add rhash map documentation
      selftests/bpf: Add resizable hashmap to benchmarks

 include/linux/bpf_types.h                          |   1 +
 include/linux/rhashtable.h                         |  91 +++
 include/uapi/linux/bpf.h                           |   6 +
 kernel/bpf/hashtab.c                               | 814 ++++++++++++++++++++-
 kernel/bpf/map_iter.c                              |   3 +-
 kernel/bpf/syscall.c                               |   5 +
 kernel/bpf/verifier.c                              |   1 +
 lib/test_rhashtable.c                              |  73 ++
 tools/bpf/bpftool/Documentation/bpftool-map.rst    |   2 +-
 tools/bpf/bpftool/map.c                            |   2 +-
 tools/include/uapi/linux/bpf.h                     |   6 +
 tools/lib/bpf/libbpf.c                             |   1 +
 tools/lib/bpf/libbpf_probes.c                      |   3 +
 tools/testing/selftests/bpf/bench.c                |   6 +
 .../bpf/benchs/bench_bpf_hashmap_full_update.c     |  34 +-
 .../bpf/benchs/bench_bpf_hashmap_lookup.c          |  31 +-
 .../testing/selftests/bpf/benchs/bench_htab_mem.c  |  35 +-
 tools/testing/selftests/bpf/prog_tests/rhash.c     | 183 +++++
 .../selftests/bpf/progs/bpf_iter_bpf_rhash_map.c   |  34 +
 tools/testing/selftests/bpf/progs/rhash.c          | 248 +++++++
 20 files changed, 1561 insertions(+), 18 deletions(-)
---
base-commit: d34771f2f3db4ea0dd532cdd579002fc95dc483a
change-id: 20251103-rhash-7b70069923d8

Best regards,
--  
Mykyta Yatsenko <yatsenko@meta.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:16   ` sashiko-bot
  2026-05-28 18:26   ` bot+bpf-ci
  2026-05-28 17:47 ` [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key() Mykyta Yatsenko
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.

  void *rhashtable_next_key(struct rhashtable *ht,
                            const void *prev_key,
                            const struct rhashtable_params params);

Caller holds RCU; passes NULL prev_key for the first element or
the previously returned key to advance. Walks tbl->future_tbl
chain so in-flight rehashes are observed.

Best-effort semantics: an element being migrated may transiently
appear in both source and destination tables and be observed
twice. Concurrent inserts and shrinks may be skipped. Termination
of a full iteration is not guaranteed under continuous adversarial
rehash; callers must bound their loop externally. If prev_key is
not present in any in-flight table, returns ERR_PTR(-ENOENT).

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/rhashtable.h | 91 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 91 insertions(+)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index ef5230cece36..d6eb3ba476ea 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -650,6 +650,97 @@ static __always_inline struct rhash_head *__rhashtable_lookup(
 	return NULL;
 }
 
+/* Internal: scan one element forward from prev_key's position.
+ * Returns first rhash_head whose bucket >= prev_key's bucket and
+ * (if within prev_key's bucket) appears after prev_key in chain.
+ * Returns first element if prev_key is NULL or ERR_PTR(-ENOENT) if prev_key
+ * is not found.
+ */
+static __always_inline struct rhash_head *__rhashtable_next_in_table(
+	struct rhashtable *ht, struct bucket_table *tbl,
+	const void *prev_key, const struct rhashtable_params params)
+	__must_hold_shared(RCU)
+{
+	struct rhashtable_compare_arg arg = { .ht = ht, .key = prev_key };
+	struct rhash_head *he;
+	unsigned int b = 0;
+	bool found = false;
+
+	if (prev_key) {
+		b = rht_key_hashfn(ht, tbl, prev_key, params);
+		rht_for_each_rcu(he, tbl, b) {
+			if (found)
+				return he;
+			if (params.obj_cmpfn ?
+			    !params.obj_cmpfn(&arg, rht_obj(ht, he)) :
+			    !rhashtable_compare(&arg, rht_obj(ht, he)))
+				found = true;
+		}
+		if (!found)
+			return ERR_PTR(-ENOENT);
+		b++;
+	}
+
+	for (; b < tbl->size; b++)
+		rht_for_each_rcu(he, tbl, b)
+			return he;
+	return NULL;
+}
+
+/**
+ * rhashtable_next_key - return next element after a given key
+ * @ht:		hash table
+ * @prev_key:	pointer to previous key, or NULL for the first element
+ * @params:	hash table parameters
+ *
+ * WARNING: this walk is highly unstable. Unlike rhashtable_walk_*(),
+ * it cannot detect a concurrent resize or rehash, so a full iteration
+ * is NOT guaranteed to terminate under adversarial or sustained
+ * rehashing. Callers MUST tolerate skipped and duplicated elements and
+ * SHOULD bound their loop externally.
+ *
+ * Returns the next element in best-effort iteration order, walking the
+ * @tbl chain (including any future_tbl in flight). Caller must hold RCU.
+ *
+ * Pass @prev_key == NULL to obtain the first element. To iterate, set
+ * @prev_key to the key of the previously returned element on each call,
+ * and stop when NULL is returned.
+ *
+ * Best-effort semantics:
+ *   - Across the tbl->future_tbl chain, an element being migrated may
+ *     transiently appear in both tables and be observed twice.
+ *   - Concurrent inserts may or may not be observed.
+ *   - Termination of a full iteration loop is NOT guaranteed under
+ *     adversarial continuous rehash; callers MUST tolerate skips and
+ *     repeats and SHOULD bound their loop externally.
+ *   - If prev_key was concurrently deleted and is not present in any
+ *     in-flight table, returns ERR_PTR(-ENOENT).
+ *
+ * Returns entry of the next element, or NULL when iteration is exhausted
+ * or ERR_PTR(-ENOENT) if prev_key is not found.
+ */
+static inline void *rhashtable_next_key(
+	struct rhashtable *ht, const void *prev_key,
+	const struct rhashtable_params params)
+	__must_hold_shared(RCU)
+{
+	struct bucket_table *tbl;
+	struct rhash_head *he;
+
+	tbl = rht_dereference_rcu(ht->tbl, ht);
+	do {
+		he = __rhashtable_next_in_table(ht, tbl, prev_key, params);
+		if (!IS_ERR_OR_NULL(he))
+			return rht_obj(ht, he);
+		if (!he)
+			prev_key = NULL;
+		/* Ensure we see any new future_tbl attached during a rehash. */
+		smp_rmb();
+		tbl = rht_dereference_rcu(tbl->future_tbl, ht);
+	} while (tbl);
+	return he; /* NULL or -ENOENT */
+}
+
 /**
  * rhashtable_lookup - search hash table
  * @ht:		hash table

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key()
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:26   ` sashiko-bot
  2026-05-28 17:47 ` [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Insert n elements, then verify:
  - NULL prev_key walks from the beginning, visiting all n
  - non-existing prev_key returns ERR_PTR(-ENOENT)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 lib/test_rhashtable.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 0b33559a910b..c7a0817b0be5 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -679,6 +679,76 @@ static int threadfunc(void *data)
 	return err;
 }
 
+static int __init test_rhashtable_next_key(void)
+{
+	struct rhashtable_params params = test_rht_params;
+	struct test_obj_val key_missing = { .id = 99999, .tid = 0 };
+	struct test_obj_val *prev_key = NULL;
+	struct rhashtable ht;
+	struct test_obj *objs, *cur;
+	int i, count = 0, err;
+	int visited_keys[8] = { 0 };
+	const int n = ARRAY_SIZE(visited_keys);
+
+	err = rhashtable_init(&ht, &params);
+	if (err)
+		return err;
+
+	objs = kcalloc(n, sizeof(*objs), GFP_KERNEL);
+	if (!objs) {
+		rhashtable_destroy(&ht);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < n; i++) {
+		objs[i].value.id = i;
+		err = rhashtable_insert_fast(&ht, &objs[i].node, params);
+		if (err)
+			goto out;
+	}
+
+	rcu_read_lock();
+
+	/* NULL prev_key: walk from the beginning, expect all n elements. */
+	while ((cur = rhashtable_next_key(&ht, prev_key, params))) {
+		if (IS_ERR(cur)) {
+			err = -EINVAL;
+			goto unlock;
+		}
+		count++;
+		prev_key = &cur->value;
+		visited_keys[cur->value.id] = 1;
+		if (count > n)
+			break;
+	}
+
+	if (count != n) {
+		err = -EINVAL;
+		goto unlock;
+	}
+
+	for (i = 0; i < n; i++) {
+		if (!visited_keys[i]) {
+			err = -EINVAL;
+			goto unlock;
+		}
+	}
+
+	/* Non-existing prev_key: must return ERR_PTR(-ENOENT). */
+	cur = rhashtable_next_key(&ht, &key_missing, params);
+	if (!IS_ERR(cur) || PTR_ERR(cur) != -ENOENT)
+		err = -EINVAL;
+
+unlock:
+	rcu_read_unlock();
+out:
+	for (i = 0; i < n; i++)
+		rhashtable_remove_fast(&ht, &objs[i].node, params);
+	kfree(objs);
+	rhashtable_destroy(&ht);
+	return err;
+}
+
 static int __init test_rht_init(void)
 {
 	unsigned int entries;
@@ -738,6 +808,9 @@ static int __init test_rht_init(void)
 
 	test_insert_duplicates_run();
 
+	pr_info("Testing rhashtable_next_key: %s\n",
+		test_rhashtable_next_key() == 0 ? "pass" : "FAIL");
+
 	if (!tcount)
 		return 0;
 

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key() Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:42   ` bot+bpf-ci
  2026-05-28 17:47 ` [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab Mykyta Yatsenko
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Use rhashtable_lookup_likely() for lookups, rhashtable_remove_fast()
for deletes, and rhashtable_lookup_get_insert_fast() for inserts.

Updates modify values in place under RCU rather than allocating a
new element and swapping the pointer (as regular htab does). This
trades read consistency for performance: concurrent readers may
see partial updates. Users requiring consistent reads should use
BPF_F_LOCK.

Initialize rhashtable with bpf_mem_alloc element cache. Require
BPF_F_NO_PREALLOC. Limit max_entries to 2^31. Free elements via
rhashtable_free_and_destroy() callback to handle internal structs.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 include/linux/bpf_types.h      |   1 +
 include/uapi/linux/bpf.h       |   6 +
 kernel/bpf/hashtab.c           | 301 +++++++++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |   3 +
 kernel/bpf/verifier.c          |   1 +
 tools/include/uapi/linux/bpf.h |   6 +
 6 files changed, 318 insertions(+)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index b13de31e163f..56e4c3f983d3 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -134,6 +134,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_BLOOM_FILTER, bloom_filter_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_USER_RINGBUF, user_ringbuf_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARENA, arena_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_INSN_ARRAY, insn_array_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_RHASH, rhtab_map_ops)
 
 BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
 BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aec171ccb6ef..bed9b1b4d5ef 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1047,6 +1047,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_CGRP_STORAGE,
 	BPF_MAP_TYPE_ARENA,
 	BPF_MAP_TYPE_INSN_ARRAY,
+	BPF_MAP_TYPE_RHASH,
 	__MAX_BPF_MAP_TYPE
 };
 
@@ -1545,6 +1546,11 @@ union bpf_attr {
 		 *
 		 * BPF_MAP_TYPE_ARENA - contains the address where user space
 		 * is going to mmap() the arena. It has to be page aligned.
+		 *
+		 * BPF_MAP_TYPE_RHASH - initial table size hint
+		 * (nelem_hint). 0 = use rhashtable default. Must be
+		 * <= min(max_entries, U16_MAX). Upper 32 bits reserved,
+		 * must be zero.
 		 */
 		__u64	map_extra;
 
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 3dd9b4924ae4..8944c9f070ca 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -9,6 +9,7 @@
 #include <linux/rculist_nulls.h>
 #include <linux/rcupdate_wait.h>
 #include <linux/random.h>
+#include <linux/rhashtable.h>
 #include <uapi/linux/btf.h>
 #include <linux/rcupdate_trace.h>
 #include <linux/btf_ids.h>
@@ -2739,3 +2740,303 @@ const struct bpf_map_ops htab_of_maps_map_ops = {
 	BATCH_OPS(htab),
 	.map_btf_id = &htab_map_btf_ids[0],
 };
+
+struct rhtab_elem {
+	struct rhash_head node;
+	/* key bytes, then value bytes follow */
+	u8 data[] __aligned(8);
+};
+
+struct bpf_rhtab {
+	struct bpf_map map;
+	struct rhashtable ht;
+	struct bpf_mem_alloc ma;
+	u32 elem_size;
+};
+
+static const struct rhashtable_params rhtab_params = {
+	.head_offset = offsetof(struct rhtab_elem, node),
+	.key_offset  = offsetof(struct rhtab_elem, data),
+};
+
+static inline void *rhtab_elem_value(struct rhtab_elem *l, u32 key_size)
+{
+	return l->data + round_up(key_size, 8);
+}
+
+static struct bpf_map *rhtab_map_alloc(union bpf_attr *attr)
+{
+	struct rhashtable_params params;
+	struct bpf_rhtab *rhtab;
+	int err = 0;
+
+	rhtab = bpf_map_area_alloc(sizeof(*rhtab), NUMA_NO_NODE);
+	if (!rhtab)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&rhtab->map, attr);
+
+	if (rhtab->map.max_entries > 1UL << 31) {
+		err = -E2BIG;
+		goto free_rhtab;
+	}
+
+	rhtab->elem_size = sizeof(struct rhtab_elem) + round_up(rhtab->map.key_size, 8) +
+			   round_up(rhtab->map.value_size, 8);
+
+	params = rhtab_params;
+	params.key_len = rhtab->map.key_size;
+	params.nelem_hint = (u32)attr->map_extra;
+	params.automatic_shrinking = true;
+
+	err = rhashtable_init(&rhtab->ht, &params);
+	if (err)
+		goto free_rhtab;
+
+	/* Set max_elems after rhashtable_init() since init zeroes the struct */
+	rhtab->ht.max_elems = rhtab->map.max_entries;
+
+	err = bpf_mem_alloc_init(&rhtab->ma, rhtab->elem_size, false);
+	if (err)
+		goto destroy_rhtab;
+
+	return &rhtab->map;
+
+destroy_rhtab:
+	rhashtable_destroy(&rhtab->ht);
+free_rhtab:
+	bpf_map_area_free(rhtab);
+	return ERR_PTR(err);
+}
+
+static int rhtab_map_alloc_check(union bpf_attr *attr)
+{
+	if (!(attr->map_flags & BPF_F_NO_PREALLOC))
+		return -EINVAL;
+
+	if (attr->map_flags & BPF_F_ZERO_SEED)
+		return -EINVAL;
+
+	if (attr->key_size > U16_MAX)
+		return -E2BIG;
+
+	if (attr->map_extra >> 32)
+		return -EINVAL;
+
+	if ((u32)attr->map_extra > U16_MAX)
+		return -E2BIG;
+
+	if ((u32)attr->map_extra > attr->max_entries)
+		return -EINVAL;
+
+	return htab_map_alloc_check(attr);
+}
+
+static void rhtab_free_elem(void *ptr, void *arg)
+{
+	struct bpf_rhtab *rhtab = arg;
+	struct rhtab_elem *elem = ptr;
+
+	bpf_map_free_internal_structs(&rhtab->map, rhtab_elem_value(elem, rhtab->map.key_size));
+	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
+}
+
+static void rhtab_map_free(struct bpf_map *map)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+
+	rhashtable_free_and_destroy(&rhtab->ht, rhtab_free_elem, rhtab);
+	bpf_mem_alloc_destroy(&rhtab->ma);
+	bpf_map_area_free(rhtab);
+}
+
+static void *rhtab_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+
+	/* Hold RCU lock in case sleepable program calls via gen_lookup */
+	guard(rcu)();
+
+	return rhashtable_lookup_likely(&rhtab->ht, key, rhtab_params);
+}
+
+static void *rhtab_map_lookup_elem(struct bpf_map *map, void *key) __must_hold(RCU)
+{
+	struct rhtab_elem *l;
+
+	l = rhtab_lookup_elem(map, key);
+	return l ? rhtab_elem_value(l, map->key_size) : NULL;
+}
+
+static void rhtab_read_elem_value(struct bpf_map *map, void *dst, struct rhtab_elem *elem,
+				  u64 flags)
+{
+	void *src = rhtab_elem_value(elem, map->key_size);
+
+	if (flags & BPF_F_LOCK)
+		copy_map_value_locked(map, dst, src, true);
+	else
+		copy_map_value(map, dst, src);
+}
+
+static int rhtab_delete_elem(struct bpf_rhtab *rhtab, struct rhtab_elem *elem, void *copy,
+			     u64 flags)
+{
+	int err;
+
+	/*
+	 * disable_instrumentation() mitigates the deadlock for programs running in NMI context.
+	 * rhashtable locks bucket with local_irq_save(). Only NMI programs may reenter
+	 * rhashtable code, bpf_disable_instrumentation() disables programs running in NMI, except
+	 * raw tracepoints, which we don't have in rhashtable.
+	 */
+	bpf_disable_instrumentation();
+	err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params);
+	bpf_enable_instrumentation();
+
+	if (err)
+		return err;
+
+	if (copy) {
+		rhtab_read_elem_value(&rhtab->map, copy, elem, flags);
+		check_and_init_map_value(&rhtab->map, copy);
+	}
+
+	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
+	return 0;
+}
+
+
+static long rhtab_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	struct rhtab_elem *elem;
+
+	guard(rcu)();
+
+	elem = rhtab_lookup_elem(map, key);
+	if (!elem)
+		return -ENOENT;
+
+	return rhtab_delete_elem(rhtab, elem, NULL, 0);
+}
+
+static int rhtab_map_lookup_and_delete_elem(struct bpf_map *map, void *key, void *value, u64 flags)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	struct rhtab_elem *elem;
+	int err;
+
+	err = bpf_map_check_op_flags(map, flags, BPF_F_LOCK);
+	if (err)
+		return err;
+
+	guard(rcu)();
+
+	elem = rhtab_lookup_elem(map, key);
+	if (!elem)
+		return -ENOENT;
+
+	return rhtab_delete_elem(rhtab, elem, value, flags);
+}
+
+static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
+				      u64 map_flags)
+{
+	void *old_val = rhtab_elem_value(elem, map->key_size);
+
+	if (map_flags & BPF_NOEXIST)
+		return -EEXIST;
+
+	if (map_flags & BPF_F_LOCK)
+		copy_map_value_locked(map, old_val, value, false);
+	else
+		copy_map_value(map, old_val, value);
+	return 0;
+}
+
+static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	struct rhtab_elem *elem, *tmp;
+
+	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
+		return -EINVAL;
+
+	if ((map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK))
+		return -EINVAL;
+
+	guard(rcu)();
+	elem = rhtab_lookup_elem(map, key);
+	if (elem)
+		return rhtab_map_update_existing(map, elem, value, map_flags);
+
+	if (map_flags & BPF_EXIST)
+		return -ENOENT;
+
+	/* Check max_entries limit before inserting new element */
+	if (atomic_read(&rhtab->ht.nelems) >= map->max_entries)
+		return -E2BIG;
+
+	elem = bpf_mem_cache_alloc(&rhtab->ma);
+	if (!elem)
+		return -ENOMEM;
+
+	memcpy(elem->data, key, map->key_size);
+	copy_map_value(map, rhtab_elem_value(elem, map->key_size), value);
+
+	/* Prevent deadlock for NMI programs attempting to take bucket lock */
+	bpf_disable_instrumentation();
+	tmp = rhashtable_lookup_get_insert_fast(&rhtab->ht, &elem->node, rhtab_params);
+	bpf_enable_instrumentation();
+
+	if (tmp) {
+		bpf_mem_cache_free(&rhtab->ma, elem);
+		if (IS_ERR(tmp))
+			return PTR_ERR(tmp);
+
+		return rhtab_map_update_existing(map, tmp, value, map_flags);
+	}
+
+	return 0;
+}
+
+static int rhtab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
+{
+	struct bpf_insn *insn = insn_buf;
+	const int ret = BPF_REG_0;
+
+	BUILD_BUG_ON(!__same_type(&rhtab_lookup_elem,
+				  (void *(*)(struct bpf_map *map, void *key)) NULL));
+	*insn++ = BPF_EMIT_CALL(rhtab_lookup_elem);
+	*insn++ = BPF_JMP_IMM(BPF_JEQ, ret, 0, 1);
+	*insn++ = BPF_ALU64_IMM(BPF_ADD, ret,
+				offsetof(struct rhtab_elem, data) + round_up(map->key_size, 8));
+
+	return insn - insn_buf;
+}
+
+static void rhtab_map_free_internal_structs(struct bpf_map *map)
+{
+}
+
+static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	return -EOPNOTSUPP;
+}
+
+BTF_ID_LIST_SINGLE(rhtab_map_btf_ids, struct, bpf_rhtab)
+const struct bpf_map_ops rhtab_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc_check = rhtab_map_alloc_check,
+	.map_alloc = rhtab_map_alloc,
+	.map_free = rhtab_map_free,
+	.map_get_next_key = rhtab_map_get_next_key,
+	.map_release_uref = rhtab_map_free_internal_structs,
+	.map_lookup_elem = rhtab_map_lookup_elem,
+	.map_lookup_and_delete_elem = rhtab_map_lookup_and_delete_elem,
+	.map_update_elem = rhtab_map_update_elem,
+	.map_delete_elem = rhtab_map_delete_elem,
+	.map_gen_lookup = rhtab_map_gen_lookup,
+	.map_btf_id = &rhtab_map_btf_ids[0],
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8d4ea700aac6..12c07310016a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1398,6 +1398,7 @@ static int __map_create(union bpf_attr *attr, bpfptr_t uattr, struct bpf_verifie
 
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
 	    attr->map_type != BPF_MAP_TYPE_ARENA &&
+	    attr->map_type != BPF_MAP_TYPE_RHASH &&
 	    attr->map_extra != 0) {
 		bpf_log(log, "Invalid map_extra.\n");
 		return -EINVAL;
@@ -1473,6 +1474,7 @@ static int __map_create(union bpf_attr *attr, bpfptr_t uattr, struct bpf_verifie
 	case BPF_MAP_TYPE_CGROUP_ARRAY:
 	case BPF_MAP_TYPE_ARRAY_OF_MAPS:
 	case BPF_MAP_TYPE_HASH:
+	case BPF_MAP_TYPE_RHASH:
 	case BPF_MAP_TYPE_PERCPU_HASH:
 	case BPF_MAP_TYPE_HASH_OF_MAPS:
 	case BPF_MAP_TYPE_RINGBUF:
@@ -2238,6 +2240,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		   map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH ||
+		   map->map_type == BPF_MAP_TYPE_RHASH ||
 		   map->map_type == BPF_MAP_TYPE_STACK_TRACE) {
 		if (!bpf_map_is_offloaded(map)) {
 			bpf_disable_instrumentation();
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c8d980fdd709..41388926b4f7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -17795,6 +17795,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 	if (prog->sleepable)
 		switch (map->map_type) {
 		case BPF_MAP_TYPE_HASH:
+		case BPF_MAP_TYPE_RHASH:
 		case BPF_MAP_TYPE_LRU_HASH:
 		case BPF_MAP_TYPE_ARRAY:
 		case BPF_MAP_TYPE_PERCPU_HASH:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 37142e6d911a..7d0b282ba674 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1047,6 +1047,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_CGRP_STORAGE,
 	BPF_MAP_TYPE_ARENA,
 	BPF_MAP_TYPE_INSN_ARRAY,
+	BPF_MAP_TYPE_RHASH,
 	__MAX_BPF_MAP_TYPE
 };
 
@@ -1545,6 +1546,11 @@ union bpf_attr {
 		 *
 		 * BPF_MAP_TYPE_ARENA - contains the address where user space
 		 * is going to mmap() the arena. It has to be page aligned.
+		 *
+		 * BPF_MAP_TYPE_RHASH - initial table size hint
+		 * (nelem_hint). 0 = use rhashtable default. Must be
+		 * <= min(max_entries, U16_MAX). Upper 32 bits reserved,
+		 * must be zero.
 		 */
 		__u64	map_extra;
 

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (2 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:28   ` sashiko-bot
  2026-05-28 17:47 ` [PATCH bpf-next v5 05/11] bpf: Allow special fields in " Mykyta Yatsenko
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Implement rhtab_map_get_next_key(), batch lookup/lookup-and-delete,
and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH.

All three use rhashtable_next_key() with the previous key as cursor.
This avoids the rhashtable walker state machine and its spinlock
acquisition per enter/exit, and lets callers stay stateless across
syscalls. ERR_PTR(-ENOENT) from rhashtable_next_key() (cursor was
concurrently deleted) is handled per callsite: get_next_key falls
back to the first key (matches BPF UAPI semantics of htab); the
seq_file iterator and batch terminate iteration.

Also implements rhtab_map_mem_usage() to report memory consumption.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/hashtab.c  | 361 +++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/bpf/map_iter.c |   3 +-
 2 files changed, 362 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8944c9f070ca..a03fa5fa6545 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -3021,10 +3021,363 @@ static void rhtab_map_free_internal_structs(struct bpf_map *map)
 }
 
 static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+	__must_hold_shared(RCU)
 {
-	return -EOPNOTSUPP;
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	struct rhtab_elem *elem;
+
+	elem = rhashtable_next_key(&rhtab->ht, key, rhtab_params);
+
+	/* if not found, return the first key */
+	if (PTR_ERR(elem) == -ENOENT)
+		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
+
+	if (!elem)
+		return -ENOENT;
+
+	memcpy(next_key, elem->data, map->key_size);
+	return 0;
+}
+
+static void rhtab_map_seq_show_elem(struct bpf_map *map, void *key, struct seq_file *m)
+{
+	void *value;
+
+	/* Guarantee that hashtab value is not freed */
+	guard(rcu)();
+
+	value = rhtab_map_lookup_elem(map, key);
+	if (!value)
+		return;
+
+	btf_type_seq_show(map->btf, map->btf_key_type_id, key, m);
+	seq_puts(m, ": ");
+	btf_type_seq_show(map->btf, map->btf_value_type_id, value, m);
+	seq_putc(m, '\n');
+}
+
+static long bpf_each_rhash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
+				void *callback_ctx, u64 flags)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	void *prev_key = NULL;
+	struct rhtab_elem *elem;
+	int num_elems = 0;
+	void *key_buf = NULL;
+	u64 ret = 0;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	/*
+	 * Buffer for stashing the cursor across rcu_read_unlock() in the
+	 * cond_resched path.
+	 */
+	key_buf = kmalloc_nolock(map->key_size, 0, NUMA_NO_NODE);
+	if (!key_buf)
+		return -ENOMEM;
+
+	rcu_read_lock();
+	/*
+	 * Best-effort iteration: if rhashtable is concurrently resized or
+	 * elements are deleted/inserted, there may be missed or duplicate
+	 * elements visited.
+	 */
+	while ((elem = rhashtable_next_key(&rhtab->ht, prev_key, rhtab_params))) {
+		if (IS_ERR(elem))
+			break;
+		num_elems++;
+		ret = callback_fn((u64)(long)map,
+				  (u64)(long)elem->data,
+				  (u64)(long)rhtab_elem_value(elem, map->key_size),
+				  (u64)(long)callback_ctx, 0);
+		if (ret)
+			break;
+
+		prev_key = elem->data;	/* valid while RCU held */
+
+		if (need_resched()) {
+			memcpy(key_buf, elem->data, map->key_size);
+			rcu_read_unlock();
+			cond_resched();
+			rcu_read_lock();
+			prev_key = key_buf;
+		}
+	}
+	rcu_read_unlock();
+
+	kfree_nolock(key_buf);
+	return num_elems;
+}
+
+static u64 rhtab_map_mem_usage(const struct bpf_map *map)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	u64 num_entries;
+
+	/* Excludes rhashtable bucket overhead (~ nelems * sizeof(void *) at 75% load). */
+	num_entries = atomic_read(&rhtab->ht.nelems);
+	return sizeof(struct bpf_rhtab) + rhtab->elem_size * num_entries;
+}
+
+static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
+					       const union bpf_attr *attr,
+					       union bpf_attr __user *uattr,
+					       bool do_delete)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	void __user *uvalues = u64_to_user_ptr(attr->batch.values);
+	void __user *ukeys = u64_to_user_ptr(attr->batch.keys);
+	void __user *ubatch = u64_to_user_ptr(attr->batch.in_batch);
+	void *cursor = NULL, *keys = NULL, *values = NULL, *dst_key, *dst_val;
+	struct rhtab_elem **del_elems = NULL;
+	u32 max_count, total, key_size, value_size, i;
+	bool has_next_cursor = false;
+	struct rhtab_elem *elem;
+	u64 elem_map_flags, map_flags;
+	int ret = 0;
+
+	elem_map_flags = attr->batch.elem_flags;
+	ret = bpf_map_check_op_flags(map, elem_map_flags, BPF_F_LOCK);
+	if (ret)
+		return ret;
+
+	map_flags = attr->batch.flags;
+	if (map_flags)
+		return -EINVAL;
+
+	max_count = attr->batch.count;
+	if (!max_count)
+		return 0;
+
+	if (put_user(0, &uattr->batch.count))
+		return -EFAULT;
+
+	key_size = map->key_size;
+	value_size = map->value_size;
+
+	keys = kvmalloc_array(max_count, key_size, GFP_USER | __GFP_NOWARN);
+	values = kvmalloc_array(max_count, value_size, GFP_USER | __GFP_NOWARN);
+	if (do_delete)
+		del_elems = kvmalloc_array(max_count, sizeof(void *),
+					   GFP_USER | __GFP_NOWARN);
+	cursor = kmalloc(key_size, GFP_USER | __GFP_NOWARN);
+
+	if (!keys || !values || !cursor || (do_delete && !del_elems)) {
+		ret = -ENOMEM;
+		goto free;
+	}
+
+	if (ubatch && copy_from_user(cursor, ubatch, key_size)) {
+		ret = -EFAULT;
+		goto free;
+	}
+
+	dst_key = keys;
+	dst_val = values;
+	total = 0;
+
+	rcu_read_lock();
+
+	/*
+	 * Cursor stores the key of the next-to-process element (stashed by
+	 * the previous batch). Look it up directly so the element is included
+	 * here rather than skipped by next_key(). If the cursor was deleted
+	 * concurrently (or by the previous do_delete batch), lookup returns
+	 * NULL and userspace must restart from a NULL cursor.
+	 */
+	if (ubatch)
+		elem = rhtab_lookup_elem(map, cursor);
+	else
+		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
+
+	while (elem && !IS_ERR(elem) && total < max_count) {
+		memcpy(dst_key, elem->data, key_size);
+		rhtab_read_elem_value(map, dst_val, elem, elem_map_flags);
+		check_and_init_map_value(map, dst_val);
+
+		if (do_delete)
+			del_elems[total] = elem;
+
+		elem = rhashtable_next_key(&rhtab->ht, dst_key, rhtab_params);
+		dst_key += key_size;
+		dst_val += value_size;
+		total++;
+
+		/* Bail to userspace to avoid stalls. */
+		if (need_resched())
+			break;
+	}
+
+	if (elem && !IS_ERR(elem)) {
+		/* Stash next-to-process key as cursor for the next batch. */
+		memcpy(cursor, elem->data, key_size);
+		has_next_cursor = true;
+	}
+
+	if (do_delete) {
+		for (i = 0; i < total; i++)
+			rhtab_delete_elem(rhtab, del_elems[i], NULL, 0);
+	}
+
+	rcu_read_unlock();
+
+	if (total == 0) {
+		ret = -ENOENT;
+		goto free;
+	}
+
+	/* No more elements after this batch. */
+	if (!has_next_cursor)
+		ret = -ENOENT;
+
+	if (copy_to_user(ukeys, keys, total * key_size) ||
+	    copy_to_user(uvalues, values, total * value_size) ||
+	    put_user(total, &uattr->batch.count) ||
+	    (has_next_cursor &&
+	     copy_to_user(u64_to_user_ptr(attr->batch.out_batch),
+			  cursor, key_size))) {
+		ret = -EFAULT;
+		goto free;
+	}
+
+free:
+	kfree(cursor);
+	kvfree(keys);
+	kvfree(values);
+	kvfree(del_elems);
+	return ret;
+}
+
+static int rhtab_map_lookup_batch(struct bpf_map *map, const union bpf_attr *attr,
+				  union bpf_attr __user *uattr)
+{
+	return __rhtab_map_lookup_and_delete_batch(map, attr, uattr, false);
+}
+
+static int rhtab_map_lookup_and_delete_batch(struct bpf_map *map, const union bpf_attr *attr,
+					     union bpf_attr __user *uattr)
+{
+	return __rhtab_map_lookup_and_delete_batch(map, attr, uattr, true);
 }
 
+struct bpf_iter_seq_rhash_map_info {
+	struct bpf_map *map;
+	struct bpf_rhtab *rhtab;
+	void *last_key;
+	bool last_key_valid;
+};
+
+static void *bpf_rhash_map_seq_start(struct seq_file *seq, loff_t *pos)
+	__acquires_shared(RCU)
+{
+	struct bpf_iter_seq_rhash_map_info *info = seq->private;
+	void *key = *pos > 0 && info->last_key_valid ? info->last_key : NULL;
+	struct rhtab_elem *elem;
+
+	rcu_read_lock();
+	elem = rhashtable_next_key(&info->rhtab->ht, key,
+				   rhtab_params);
+	if (IS_ERR_OR_NULL(elem))
+		return NULL;
+	if (*pos == 0)
+		++*pos;
+	return elem;
+}
+
+static void *bpf_rhash_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct bpf_iter_seq_rhash_map_info *info = seq->private;
+	struct rhtab_elem *elem = v;
+	void *next;
+
+	memcpy(info->last_key, elem->data, info->map->key_size);
+	info->last_key_valid = true;
+	++*pos;
+
+	next = rhashtable_next_key(&info->rhtab->ht, info->last_key,
+				   rhtab_params);
+	if (IS_ERR(next))
+		return NULL;
+	return next;
+}
+
+static int __bpf_rhash_map_seq_show(struct seq_file *seq,
+				    struct rhtab_elem *elem)
+{
+	struct bpf_iter_seq_rhash_map_info *info = seq->private;
+	struct bpf_iter__bpf_map_elem ctx = {};
+	struct bpf_iter_meta meta;
+	struct bpf_prog *prog;
+	int ret = 0;
+
+	meta.seq = seq;
+	prog = bpf_iter_get_info(&meta, elem == NULL);
+	if (prog) {
+		ctx.meta = &meta;
+		ctx.map = info->map;
+		if (elem) {
+			ctx.key = elem->data;
+			ctx.value = rhtab_elem_value(elem, info->map->key_size);
+		}
+		ret = bpf_iter_run_prog(prog, &ctx);
+	}
+
+	return ret;
+}
+
+static int bpf_rhash_map_seq_show(struct seq_file *seq, void *v)
+{
+	return __bpf_rhash_map_seq_show(seq, v);
+}
+
+static void bpf_rhash_map_seq_stop(struct seq_file *seq, void *v)
+	__releases_shared(RCU)
+{
+	if (!v)
+		(void)__bpf_rhash_map_seq_show(seq, NULL);
+
+	rcu_read_unlock();
+}
+
+static int bpf_iter_init_rhash_map(void *priv_data, struct bpf_iter_aux_info *aux)
+{
+	struct bpf_iter_seq_rhash_map_info *info = priv_data;
+	struct bpf_map *map = aux->map;
+
+	info->last_key_valid = false;
+	info->last_key = kmalloc(map->key_size, GFP_USER);
+	if (!info->last_key)
+		return -ENOMEM;
+
+	bpf_map_inc_with_uref(map);
+	info->map = map;
+	info->rhtab = container_of(map, struct bpf_rhtab, map);
+	return 0;
+}
+
+static void bpf_iter_fini_rhash_map(void *priv_data)
+{
+	struct bpf_iter_seq_rhash_map_info *info = priv_data;
+
+	kfree(info->last_key);
+	bpf_map_put_with_uref(info->map);
+}
+
+static const struct seq_operations bpf_rhash_map_seq_ops = {
+	.start = bpf_rhash_map_seq_start,
+	.next = bpf_rhash_map_seq_next,
+	.stop = bpf_rhash_map_seq_stop,
+	.show = bpf_rhash_map_seq_show,
+};
+
+static const struct bpf_iter_seq_info rhash_iter_seq_info = {
+	.seq_ops = &bpf_rhash_map_seq_ops,
+	.init_seq_private = bpf_iter_init_rhash_map,
+	.fini_seq_private = bpf_iter_fini_rhash_map,
+	.seq_priv_size = sizeof(struct bpf_iter_seq_rhash_map_info),
+};
+
 BTF_ID_LIST_SINGLE(rhtab_map_btf_ids, struct, bpf_rhtab)
 const struct bpf_map_ops rhtab_map_ops = {
 	.map_meta_equal = bpf_map_meta_equal,
@@ -3038,5 +3391,11 @@ const struct bpf_map_ops rhtab_map_ops = {
 	.map_update_elem = rhtab_map_update_elem,
 	.map_delete_elem = rhtab_map_delete_elem,
 	.map_gen_lookup = rhtab_map_gen_lookup,
+	.map_seq_show_elem = rhtab_map_seq_show_elem,
+	.map_set_for_each_callback_args = map_set_for_each_callback_args,
+	.map_for_each_callback = bpf_each_rhash_elem,
+	.map_mem_usage = rhtab_map_mem_usage,
+	BATCH_OPS(rhtab),
 	.map_btf_id = &rhtab_map_btf_ids[0],
+	.iter_seq_info = &rhash_iter_seq_info,
 };
diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c
index 261a03ea73d3..4a2aafbe28b4 100644
--- a/kernel/bpf/map_iter.c
+++ b/kernel/bpf/map_iter.c
@@ -119,7 +119,8 @@ static int bpf_iter_attach_map(struct bpf_prog *prog,
 		is_percpu = true;
 	else if (map->map_type != BPF_MAP_TYPE_HASH &&
 		 map->map_type != BPF_MAP_TYPE_LRU_HASH &&
-		 map->map_type != BPF_MAP_TYPE_ARRAY)
+		 map->map_type != BPF_MAP_TYPE_ARRAY &&
+		 map->map_type != BPF_MAP_TYPE_RHASH)
 		goto put_map;
 
 	key_acc_size = prog->aux->max_rdonly_access;

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 05/11] bpf: Allow special fields in resizable hashtab
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (3 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:26   ` bot+bpf-ci
  2026-05-28 17:47 ` [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable Mykyta Yatsenko
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Add support for timers, workqueues, task work and spin locks.
Without this, users needing deferred callbacks or BPF_F_LOCK in a
dynamically-sized map have no option - fixed-size htab is the only
map supporting these field types. Resizable hashtab should offer the
same capability.

Properly clean up BTF record fields on element delete and map
teardown by wiring up bpf_obj_free_fields through a memory allocator
destructor, matching the pattern used by htab for non-prealloc maps.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/hashtab.c | 110 ++++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/bpf/syscall.c |   2 +
 2 files changed, 102 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a03fa5fa6545..50e9d57b2790 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -497,28 +497,26 @@ static void htab_dtor_ctx_free(void *ctx)
 	kfree(ctx);
 }
 
-static int htab_set_dtor(struct bpf_htab *htab, void (*dtor)(void *, void *))
+static int bpf_ma_set_dtor(struct bpf_map *map, struct bpf_mem_alloc *ma,
+			   void (*dtor)(void *, void *))
 {
-	u32 key_size = htab->map.key_size;
-	struct bpf_mem_alloc *ma;
 	struct htab_btf_record *hrec;
 	int err;
 
 	/* No need for dtors. */
-	if (IS_ERR_OR_NULL(htab->map.record))
+	if (IS_ERR_OR_NULL(map->record))
 		return 0;
 
 	hrec = kzalloc(sizeof(*hrec), GFP_KERNEL);
 	if (!hrec)
 		return -ENOMEM;
-	hrec->key_size = key_size;
-	hrec->record = btf_record_dup(htab->map.record);
+	hrec->key_size = map->key_size;
+	hrec->record = btf_record_dup(map->record);
 	if (IS_ERR(hrec->record)) {
 		err = PTR_ERR(hrec->record);
 		kfree(hrec);
 		return err;
 	}
-	ma = htab_is_percpu(htab) ? &htab->pcpu_ma : &htab->ma;
 	bpf_mem_alloc_set_dtor(ma, dtor, htab_dtor_ctx_free, hrec);
 	return 0;
 }
@@ -535,9 +533,9 @@ static int htab_map_check_btf(struct bpf_map *map, const struct btf *btf,
 	 * populated in htab_map_alloc(), so it will always appear as NULL.
 	 */
 	if (htab_is_percpu(htab))
-		return htab_set_dtor(htab, htab_pcpu_mem_dtor);
+		return bpf_ma_set_dtor(map, &htab->pcpu_ma, htab_pcpu_mem_dtor);
 	else
-		return htab_set_dtor(htab, htab_mem_dtor);
+		return bpf_ma_set_dtor(map, &htab->ma, htab_mem_dtor);
 }
 
 static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
@@ -2752,6 +2750,7 @@ struct bpf_rhtab {
 	struct rhashtable ht;
 	struct bpf_mem_alloc ma;
 	u32 elem_size;
+	bool freeing_internal;
 };
 
 static const struct rhashtable_params rhtab_params = {
@@ -2832,6 +2831,28 @@ static int rhtab_map_alloc_check(union bpf_attr *attr)
 	return htab_map_alloc_check(attr);
 }
 
+static void rhtab_check_and_free_fields(struct bpf_rhtab *rhtab,
+					struct rhtab_elem *elem)
+{
+	if (IS_ERR_OR_NULL(rhtab->map.record))
+		return;
+
+	bpf_obj_free_fields(rhtab->map.record,
+			    rhtab_elem_value(elem, rhtab->map.key_size));
+}
+
+static void rhtab_mem_dtor(void *obj, void *ctx)
+{
+	struct htab_btf_record *hrec = ctx;
+	struct rhtab_elem *elem = obj;
+
+	if (IS_ERR_OR_NULL(hrec->record))
+		return;
+
+	bpf_obj_free_fields(hrec->record,
+			    rhtab_elem_value(elem, hrec->key_size));
+}
+
 static void rhtab_free_elem(void *ptr, void *arg)
 {
 	struct bpf_rhtab *rhtab = arg;
@@ -2901,7 +2922,8 @@ static int rhtab_delete_elem(struct bpf_rhtab *rhtab, struct rhtab_elem *elem, v
 		rhtab_read_elem_value(&rhtab->map, copy, elem, flags);
 		check_and_init_map_value(&rhtab->map, copy);
 	}
-
+	/* Release internal structs: kptr, bpf_timer, task_work, wq */
+	rhtab_check_and_free_fields(rhtab, elem);
 	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
 	return 0;
 }
@@ -2943,6 +2965,7 @@ static int rhtab_map_lookup_and_delete_elem(struct bpf_map *map, void *key, void
 static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
 				      u64 map_flags)
 {
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
 	void *old_val = rhtab_elem_value(elem, map->key_size);
 
 	if (map_flags & BPF_NOEXIST)
@@ -2952,6 +2975,17 @@ static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *el
 		copy_map_value_locked(map, old_val, value, false);
 	else
 		copy_map_value(map, old_val, value);
+
+	/*
+	 * Torn reads: a concurrent reader without BPF_F_LOCK may observe
+	 * the value mid-copy. Callers requiring consistent reads must use
+	 * BPF_F_LOCK, matching arraymap semantics.
+	 *
+	 * copy_map_value() skips special-field offsets, so old timers/
+	 * kptrs/etc. still sit in the slot. Cancel them after the copy
+	 * to match arraymap's update semantics.
+	 */
+	rhtab_check_and_free_fields(rhtab, elem);
 	return 0;
 }
 
@@ -2974,6 +3008,14 @@ static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u
 	if (map_flags & BPF_EXIST)
 		return -ENOENT;
 
+	/*
+	 * Reject new insertions while map_release_uref cleanup walks the
+	 * table. Without this, new elements could keep triggering rehash
+	 * and prevent the walk from terminating.
+	 */
+	if (READ_ONCE(rhtab->freeing_internal))
+		return -EBUSY;
+
 	/* Check max_entries limit before inserting new element */
 	if (atomic_read(&rhtab->ht.nelems) >= map->max_entries)
 		return -E2BIG;
@@ -2984,6 +3026,7 @@ static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u
 
 	memcpy(elem->data, key, map->key_size);
 	copy_map_value(map, rhtab_elem_value(elem, map->key_size), value);
+	check_and_init_map_value(map, rhtab_elem_value(elem, map->key_size));
 
 	/* Prevent deadlock for NMI programs attempting to take bucket lock */
 	bpf_disable_instrumentation();
@@ -3016,8 +3059,54 @@ static int rhtab_map_gen_lookup(struct bpf_map *map, struct bpf_insn *insn_buf)
 	return insn - insn_buf;
 }
 
+static int rhtab_map_check_btf(struct bpf_map *map, const struct btf *btf,
+			       const struct btf_type *key_type,
+			       const struct btf_type *value_type)
+{
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+
+	return bpf_ma_set_dtor(map, &rhtab->ma, rhtab_mem_dtor);
+}
+
 static void rhtab_map_free_internal_structs(struct bpf_map *map)
 {
+	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
+	struct rhashtable_iter iter;
+	struct rhtab_elem *elem;
+
+	if (!bpf_map_has_internal_structs(map))
+		return;
+
+	/*
+	 * Block new insertions. Once observed, no new growth is triggered,
+	 * so any in-flight rehash will drain and the walker is guaranteed
+	 * to stop returning -EAGAIN. Treat -EAGAIN as "rehash in progress,
+	 * retry"; do not wait for the worker.
+	 */
+	WRITE_ONCE(rhtab->freeing_internal, true);
+
+	rhashtable_walk_enter(&rhtab->ht, &iter);
+	rhashtable_walk_start(&iter);
+
+	while ((elem = rhashtable_walk_next(&iter))) {
+		if (IS_ERR(elem)) {
+			if (PTR_ERR(elem) == -EAGAIN)
+				continue;
+			break;
+		}
+
+		bpf_map_free_internal_structs(map, rhtab_elem_value(elem, map->key_size));
+
+		if (need_resched()) { /* Avoid stalls on large maps */
+			rhashtable_walk_stop(&iter);
+			cond_resched();
+			rhashtable_walk_start(&iter);
+		}
+	}
+
+	rhashtable_walk_stop(&iter);
+	rhashtable_walk_exit(&iter);
+	WRITE_ONCE(rhtab->freeing_internal, false);
 }
 
 static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
@@ -3386,6 +3475,7 @@ const struct bpf_map_ops rhtab_map_ops = {
 	.map_free = rhtab_map_free,
 	.map_get_next_key = rhtab_map_get_next_key,
 	.map_release_uref = rhtab_map_free_internal_structs,
+	.map_check_btf = rhtab_map_check_btf,
 	.map_lookup_elem = rhtab_map_lookup_elem,
 	.map_lookup_and_delete_elem = rhtab_map_lookup_and_delete_elem,
 	.map_update_elem = rhtab_map_update_elem,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 12c07310016a..4d97bbce3578 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1280,6 +1280,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 			case BPF_SPIN_LOCK:
 			case BPF_RES_SPIN_LOCK:
 				if (map->map_type != BPF_MAP_TYPE_HASH &&
+				    map->map_type != BPF_MAP_TYPE_RHASH &&
 				    map->map_type != BPF_MAP_TYPE_ARRAY &&
 				    map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE &&
 				    map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
@@ -1294,6 +1295,7 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 			case BPF_WORKQUEUE:
 			case BPF_TASK_WORK:
 				if (map->map_type != BPF_MAP_TYPE_HASH &&
+				    map->map_type != BPF_MAP_TYPE_RHASH &&
 				    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
 				    map->map_type != BPF_MAP_TYPE_ARRAY) {
 					ret = -EOPNOTSUPP;

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (4 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 05/11] bpf: Allow special fields in " Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:25   ` sashiko-bot
  2026-05-28 17:47 ` [PATCH bpf-next v5 07/11] libbpf: Support " Mykyta Yatsenko
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Specialize the lookup/update/delete/get_next_key/batch/iter paths for
keys whose size matches sizeof(long) (4 bytes on 32-bit, 8 bytes on
64-bit). A static-const rhashtable_params lets the compiler inline a
custom XOR-fold hashfn and a single-word equality cmpfn, eliminating
the indirect jhash dispatch. The same hashfn and cmpfn are installed
into rhashtable's stored params so the rehash worker and slow-path
inserts agree with the inlined fast paths.

Dispatch lives in a single rhtab_next_key() helper and inline branches
on map->key_size at the other callsites.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/hashtab.c | 74 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 14 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 50e9d57b2790..f141b69e3bb0 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -2763,6 +2763,31 @@ static inline void *rhtab_elem_value(struct rhtab_elem *l, u32 key_size)
 	return l->data + round_up(key_size, 8);
 }
 
+/* Specialize hash function and objcmp for long sized key */
+static __always_inline int rhtab_key_cmp_long(struct rhashtable_compare_arg *arg,
+					      const void *ptr)
+{
+	const unsigned long key1 = *(const unsigned long *)arg->key;
+	const struct rhtab_elem *key2 = ptr;
+
+	return key1 != *(const unsigned long *)key2->data;
+}
+
+static __always_inline u32 rhtab_hashfn_long(const void *data, u32 len, u32 seed)
+{
+	u64 k = *(const unsigned long *)data;
+
+	return (u32)(k ^ (k >> 32)) ^ seed;
+}
+
+static const struct rhashtable_params rhtab_params_long = {
+	.head_offset = offsetof(struct rhtab_elem, node),
+	.key_offset  = offsetof(struct rhtab_elem, data),
+	.key_len     = sizeof(long),
+	.hashfn      = rhtab_hashfn_long,
+	.obj_cmpfn   = rhtab_key_cmp_long,
+};
+
 static struct bpf_map *rhtab_map_alloc(union bpf_attr *attr)
 {
 	struct rhashtable_params params;
@@ -2788,6 +2813,11 @@ static struct bpf_map *rhtab_map_alloc(union bpf_attr *attr)
 	params.nelem_hint = (u32)attr->map_extra;
 	params.automatic_shrinking = true;
 
+	if (rhtab->map.key_size == sizeof(long)) {
+		params.hashfn = rhtab_hashfn_long;
+		params.obj_cmpfn = rhtab_key_cmp_long;
+	}
+
 	err = rhashtable_init(&rhtab->ht, &params);
 	if (err)
 		goto free_rhtab;
@@ -2871,6 +2901,14 @@ static void rhtab_map_free(struct bpf_map *map)
 	bpf_map_area_free(rhtab);
 }
 
+static __always_inline struct rhtab_elem *
+rhtab_next_key(struct bpf_rhtab *rhtab, const void *prev_key)
+{
+	if (rhtab->map.key_size == sizeof(long))
+		return rhashtable_next_key(&rhtab->ht, prev_key, rhtab_params_long);
+	return rhashtable_next_key(&rhtab->ht, prev_key, rhtab_params);
+}
+
 static void *rhtab_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
@@ -2878,6 +2916,9 @@ static void *rhtab_lookup_elem(struct bpf_map *map, void *key)
 	/* Hold RCU lock in case sleepable program calls via gen_lookup */
 	guard(rcu)();
 
+	if (map->key_size == sizeof(long))
+		return rhashtable_lookup_likely(&rhtab->ht, key, rhtab_params_long);
+
 	return rhashtable_lookup_likely(&rhtab->ht, key, rhtab_params);
 }
 
@@ -2912,7 +2953,12 @@ static int rhtab_delete_elem(struct bpf_rhtab *rhtab, struct rhtab_elem *elem, v
 	 * raw tracepoints, which we don't have in rhashtable.
 	 */
 	bpf_disable_instrumentation();
-	err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params);
+
+	if (rhtab->map.key_size == sizeof(long))
+		err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params_long);
+	else
+		err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params);
+
 	bpf_enable_instrumentation();
 
 	if (err)
@@ -3030,7 +3076,12 @@ static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u
 
 	/* Prevent deadlock for NMI programs attempting to take bucket lock */
 	bpf_disable_instrumentation();
-	tmp = rhashtable_lookup_get_insert_fast(&rhtab->ht, &elem->node, rhtab_params);
+
+	if (map->key_size == sizeof(long))
+		tmp = rhashtable_lookup_get_insert_fast(&rhtab->ht, &elem->node, rhtab_params_long);
+	else
+		tmp = rhashtable_lookup_get_insert_fast(&rhtab->ht, &elem->node, rhtab_params);
+
 	bpf_enable_instrumentation();
 
 	if (tmp) {
@@ -3115,11 +3166,9 @@ static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key
 	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
 	struct rhtab_elem *elem;
 
-	elem = rhashtable_next_key(&rhtab->ht, key, rhtab_params);
-
-	/* if not found, return the first key */
+	elem = rhtab_next_key(rhtab, key);
 	if (PTR_ERR(elem) == -ENOENT)
-		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
+		elem = rhtab_next_key(rhtab, NULL);
 
 	if (!elem)
 		return -ENOENT;
@@ -3172,7 +3221,7 @@ static long bpf_each_rhash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
 	 * elements are deleted/inserted, there may be missed or duplicate
 	 * elements visited.
 	 */
-	while ((elem = rhashtable_next_key(&rhtab->ht, prev_key, rhtab_params))) {
+	while ((elem = rhtab_next_key(rhtab, prev_key))) {
 		if (IS_ERR(elem))
 			break;
 		num_elems++;
@@ -3278,8 +3327,7 @@ static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
 	if (ubatch)
 		elem = rhtab_lookup_elem(map, cursor);
 	else
-		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
-
+		elem = rhtab_next_key(rhtab, NULL);
 	while (elem && !IS_ERR(elem) && total < max_count) {
 		memcpy(dst_key, elem->data, key_size);
 		rhtab_read_elem_value(map, dst_val, elem, elem_map_flags);
@@ -3288,7 +3336,7 @@ static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
 		if (do_delete)
 			del_elems[total] = elem;
 
-		elem = rhashtable_next_key(&rhtab->ht, dst_key, rhtab_params);
+		elem = rhtab_next_key(rhtab, dst_key);
 		dst_key += key_size;
 		dst_val += value_size;
 		total++;
@@ -3365,8 +3413,7 @@ static void *bpf_rhash_map_seq_start(struct seq_file *seq, loff_t *pos)
 	struct rhtab_elem *elem;
 
 	rcu_read_lock();
-	elem = rhashtable_next_key(&info->rhtab->ht, key,
-				   rhtab_params);
+	elem = rhtab_next_key(info->rhtab, key);
 	if (IS_ERR_OR_NULL(elem))
 		return NULL;
 	if (*pos == 0)
@@ -3384,8 +3431,7 @@ static void *bpf_rhash_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
 	info->last_key_valid = true;
 	++*pos;
 
-	next = rhashtable_next_key(&info->rhtab->ht, info->last_key,
-				   rhtab_params);
+	next = rhtab_next_key(info->rhtab, info->last_key);
 	if (IS_ERR(next))
 		return NULL;
 	return next;

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 07/11] libbpf: Support resizable hashtable
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (5 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map Mykyta Yatsenko
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko, Emil Tsalapatis

From: Mykyta Yatsenko <yatsenko@meta.com>

Add BPF_MAP_TYPE_RHASH to libbpf's map type name table and feature
probing so that libbpf-based tools can create and identify resizable
hash maps.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
---
 tools/lib/bpf/libbpf.c        | 1 +
 tools/lib/bpf/libbpf_probes.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ab2071fdd3e8..1354bcbc8b30 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -192,6 +192,7 @@ static const char * const map_type_name[] = {
 	[BPF_MAP_TYPE_CGRP_STORAGE]		= "cgrp_storage",
 	[BPF_MAP_TYPE_ARENA]			= "arena",
 	[BPF_MAP_TYPE_INSN_ARRAY]		= "insn_array",
+	[BPF_MAP_TYPE_RHASH]			= "rhash",
 };
 
 static const char * const prog_type_name[] = {
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index b70d9637ecf5..e40819465ddc 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -309,6 +309,9 @@ static int probe_map_create(enum bpf_map_type map_type)
 		value_size	= sizeof(__u64);
 		opts.map_flags	= BPF_F_NO_PREALLOC;
 		break;
+	case BPF_MAP_TYPE_RHASH:
+		opts.map_flags	= BPF_F_NO_PREALLOC;
+		break;
 	case BPF_MAP_TYPE_CGROUP_STORAGE:
 	case BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE:
 		key_size	= sizeof(struct bpf_cgroup_storage_key);

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (6 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 07/11] libbpf: Support " Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 18:41   ` sashiko-bot
  2026-05-28 17:47 ` [PATCH bpf-next v5 09/11] selftests/bpf: Add BPF iterator " Mykyta Yatsenko
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Test basic map operations (lookup, update, delete) for
BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
key insertion and deletion of nonexistent keys.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/rhash.c | 120 ++++++++++++
 tools/testing/selftests/bpf/progs/rhash.c      | 248 +++++++++++++++++++++++++
 2 files changed, 368 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/rhash.c b/tools/testing/selftests/bpf/prog_tests/rhash.c
new file mode 100644
index 000000000000..69686bf69ba5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/rhash.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <string.h>
+#include <stdio.h>
+#include "rhash.skel.h"
+#include <linux/bpf.h>
+#include <linux/perf_event.h>
+#include <sys/syscall.h>
+
+static void rhash_run(const char *prog_name)
+{
+	struct rhash *skel;
+	struct bpf_program *prog;
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	int err;
+
+	skel = rhash__open();
+	if (!ASSERT_OK_PTR(skel, "rhash__open"))
+		return;
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+		goto cleanup;
+	bpf_program__set_autoload(prog, true);
+
+	err = rhash__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+	if (!ASSERT_OK(err, "prog run"))
+		goto cleanup;
+
+	if (!ASSERT_OK(opts.retval, "prog retval"))
+		goto cleanup;
+
+	if (!ASSERT_OK(skel->bss->err, "bss->err"))
+		goto cleanup;
+
+cleanup:
+	rhash__destroy(skel);
+}
+
+static int rhash_map_create(__u32 max_entries, __u64 map_extra)
+{
+	LIBBPF_OPTS(bpf_map_create_opts, opts,
+		    .map_flags = BPF_F_NO_PREALLOC,
+		    .map_extra = map_extra);
+
+	return bpf_map_create(BPF_MAP_TYPE_RHASH, "rhash_extra",
+			      sizeof(__u32), sizeof(__u64), max_entries, &opts);
+}
+
+static void rhash_map_extra_presize(void)
+{
+	const __u32 max_entries = 1024;
+	const __u32 nelem_hint = 256;
+	struct bpf_map_info info = {};
+	__u32 info_len = sizeof(info);
+	__u64 val = 0;
+	__u32 key;
+	int fd, i;
+
+	fd = rhash_map_create(max_entries, nelem_hint);
+	if (!ASSERT_GE(fd, 0, "rhash_map_create presize"))
+		return;
+
+	if (!ASSERT_OK(bpf_map_get_info_by_fd(fd, &info, &info_len), "info"))
+		goto close;
+	ASSERT_EQ(info.map_extra, nelem_hint, "info.map_extra");
+
+	for (i = 0; i < (int)nelem_hint; i++) {
+		key = i;
+		if (!ASSERT_OK(bpf_map_update_elem(fd, &key, &val, BPF_NOEXIST),
+			       "update"))
+			goto close;
+	}
+close:
+	close(fd);
+}
+
+static void rhash_map_extra_too_big(void)
+{
+	int fd;
+
+	fd = rhash_map_create(1U << 20, 0x10000);
+	if (!ASSERT_LT(fd, 0, "rhash_map_create hint > U16_MAX"))
+		close(fd);
+}
+
+void test_rhash(void)
+{
+	if (test__start_subtest("test_rhash_lookup_update"))
+		rhash_run("test_rhash_lookup_update");
+
+	if (test__start_subtest("test_rhash_update_delete"))
+		rhash_run("test_rhash_update_delete");
+
+	if (test__start_subtest("test_rhash_update_elements"))
+		rhash_run("test_rhash_update_elements");
+
+	if (test__start_subtest("test_rhash_update_exist"))
+		rhash_run("test_rhash_update_exist");
+
+	if (test__start_subtest("test_rhash_update_any"))
+		rhash_run("test_rhash_update_any");
+
+	if (test__start_subtest("test_rhash_noexist_duplicate"))
+		rhash_run("test_rhash_noexist_duplicate");
+
+	if (test__start_subtest("test_rhash_delete_nonexistent"))
+		rhash_run("test_rhash_delete_nonexistent");
+
+	if (test__start_subtest("test_rhash_map_extra_presize"))
+		rhash_map_extra_presize();
+
+	if (test__start_subtest("test_rhash_map_extra_too_big"))
+		rhash_map_extra_too_big();
+}
diff --git a/tools/testing/selftests/bpf/progs/rhash.c b/tools/testing/selftests/bpf/progs/rhash.c
new file mode 100644
index 000000000000..fc2dac3a719e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rhash.c
@@ -0,0 +1,248 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <stdbool.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+
+#define ENOENT 2
+#define EEXIST 17
+
+char _license[] SEC("license") = "GPL";
+
+int err;
+
+struct elem {
+	char arr[128];
+	int val;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RHASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 128);
+	__type(key, int);
+	__type(value, struct elem);
+} rhmap SEC(".maps");
+
+SEC("syscall")
+int test_rhash_lookup_update(void *ctx)
+{
+	int key = 5;
+	struct elem empty = {.val = 3, .arr = {0}};
+	struct elem *e;
+
+	err = 1;
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (e)
+		return 1;
+
+	err = bpf_map_update_elem(&rhmap, &key, &empty, BPF_NOEXIST);
+	if (err)
+		return 1;
+
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (!e || e->val != empty.val) {
+		err = 2;
+		return 2;
+	}
+
+	err = 0;
+	return 0;
+}
+
+SEC("syscall")
+int test_rhash_update_delete(void *ctx)
+{
+	int key = 6;
+	struct elem empty = {.val = 4, .arr = {0}};
+	struct elem *e;
+
+	err = 1;
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (e)
+		return 1;
+
+	err = bpf_map_update_elem(&rhmap, &key, &empty, BPF_NOEXIST);
+	if (err)
+		return 2;
+
+	err = bpf_map_delete_elem(&rhmap, &key);
+	if (err)
+		return 3;
+
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (e) {
+		err = 4;
+		return 4;
+	}
+
+	err = 0;
+	return 0;
+}
+
+SEC("syscall")
+int test_rhash_update_elements(void *ctx)
+{
+	int key = 0;
+	struct elem empty = {.val = 4, .arr = {0}};
+	struct elem *e;
+	int i;
+
+	err = 1;
+
+	for (i = 0; i < 128; ++i) {
+		key = i;
+		e = bpf_map_lookup_elem(&rhmap, &key);
+		if (e)
+			return 1;
+
+		empty.val = key;
+		err = bpf_map_update_elem(&rhmap, &key, &empty, BPF_NOEXIST);
+		if (err)
+			return 2;
+
+		e = bpf_map_lookup_elem(&rhmap, &key);
+		if (!e || e->val != key) {
+			err = 4;
+			return 4;
+		}
+	}
+
+	for (i = 0; i < 128; ++i) {
+		key = i;
+		err = bpf_map_delete_elem(&rhmap, &key);
+		if (err)
+			return 3;
+
+		e = bpf_map_lookup_elem(&rhmap, &key);
+		if (e) {
+			err = 5;
+			return 5;
+		}
+	}
+
+	err = 0;
+	return 0;
+}
+
+SEC("syscall")
+int test_rhash_update_exist(void *ctx)
+{
+	int key = 10;
+	struct elem val1 = {.val = 100, .arr = {0}};
+	struct elem val2 = {.val = 200, .arr = {0}};
+	struct elem *e;
+	int ret;
+
+	err = 1;
+
+	/* BPF_EXIST on non-existent key should fail with -ENOENT */
+	ret = bpf_map_update_elem(&rhmap, &key, &val1, BPF_EXIST);
+	if (ret != -ENOENT)
+		return 1;
+
+	/* Insert element first */
+	ret = bpf_map_update_elem(&rhmap, &key, &val1, BPF_NOEXIST);
+	if (ret)
+		return 2;
+
+	/* Verify initial value */
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (!e || e->val != 100)
+		return 3;
+
+	/* BPF_EXIST on existing key should succeed and update value */
+	ret = bpf_map_update_elem(&rhmap, &key, &val2, BPF_EXIST);
+	if (ret)
+		return 4;
+
+	/* Verify value was updated */
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (!e || e->val != 200)
+		return 5;
+
+	/* Cleanup */
+	bpf_map_delete_elem(&rhmap, &key);
+	err = 0;
+	return 0;
+}
+
+SEC("syscall")
+int test_rhash_update_any(void *ctx)
+{
+	int key = 11;
+	struct elem val1 = {.val = 111, .arr = {0}};
+	struct elem val2 = {.val = 222, .arr = {0}};
+	struct elem *e;
+	int ret;
+
+	err = 1;
+
+	/* BPF_ANY on non-existent key should insert */
+	ret = bpf_map_update_elem(&rhmap, &key, &val1, BPF_ANY);
+	if (ret)
+		return 1;
+
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (!e || e->val != 111)
+		return 2;
+
+	/* BPF_ANY on existing key should update */
+	ret = bpf_map_update_elem(&rhmap, &key, &val2, BPF_ANY);
+	if (ret)
+		return 3;
+
+	e = bpf_map_lookup_elem(&rhmap, &key);
+	if (!e || e->val != 222)
+		return 4;
+
+	/* Cleanup */
+	bpf_map_delete_elem(&rhmap, &key);
+	err = 0;
+	return 0;
+}
+
+SEC("syscall")
+int test_rhash_noexist_duplicate(void *ctx)
+{
+	int key = 12;
+	struct elem val = {.val = 600, .arr = {0}};
+	int ret;
+
+	err = 1;
+
+	/* Insert element */
+	ret = bpf_map_update_elem(&rhmap, &key, &val, BPF_NOEXIST);
+	if (ret)
+		return 1;
+
+	/* Try to insert again with BPF_NOEXIST - should fail with -EEXIST */
+	ret = bpf_map_update_elem(&rhmap, &key, &val, BPF_NOEXIST);
+	if (ret != -EEXIST)
+		return 2;
+
+	/* Cleanup */
+	bpf_map_delete_elem(&rhmap, &key);
+	err = 0;
+	return 0;
+}
+
+SEC("syscall")
+int test_rhash_delete_nonexistent(void *ctx)
+{
+	int key = 99999;
+	int ret;
+
+	err = 1;
+
+	/* Delete non-existent key should return -ENOENT */
+	ret = bpf_map_delete_elem(&rhmap, &key);
+	if (ret != -ENOENT)
+		return 1;
+
+	err = 0;
+	return 0;
+}

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 09/11] selftests/bpf: Add BPF iterator tests for resizable hash map
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (7 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 10/11] bpftool: Add rhash map documentation Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 11/11] selftests/bpf: Add resizable hashmap to benchmarks Mykyta Yatsenko
  10 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Test basic BPF iterator functionality for BPF_MAP_TYPE_RHASH,
verifying all elements are visited.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/rhash.c     | 63 ++++++++++++++++++++++
 .../selftests/bpf/progs/bpf_iter_bpf_rhash_map.c   | 34 ++++++++++++
 2 files changed, 97 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/rhash.c b/tools/testing/selftests/bpf/prog_tests/rhash.c
index 69686bf69ba5..98bb66907b7f 100644
--- a/tools/testing/selftests/bpf/prog_tests/rhash.c
+++ b/tools/testing/selftests/bpf/prog_tests/rhash.c
@@ -4,6 +4,7 @@
 #include <string.h>
 #include <stdio.h>
 #include "rhash.skel.h"
+#include "bpf_iter_bpf_rhash_map.skel.h"
 #include <linux/bpf.h>
 #include <linux/perf_event.h>
 #include <sys/syscall.h>
@@ -89,6 +90,65 @@ static void rhash_map_extra_too_big(void)
 		close(fd);
 }
 
+static void rhash_iter_test(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	struct bpf_iter_bpf_rhash_map *skel;
+	int err, i, len, map_fd, iter_fd;
+	union bpf_iter_link_info linfo;
+	u32 expected_key_sum = 0, key;
+	struct bpf_link *link;
+	u64 val = 0;
+	char buf[64];
+
+	skel = bpf_iter_bpf_rhash_map__open();
+	if (!ASSERT_OK_PTR(skel, "bpf_iter_bpf_rhash_map__open"))
+		return;
+
+	err = bpf_iter_bpf_rhash_map__load(skel);
+	if (!ASSERT_OK(err, "bpf_iter_bpf_rhash_map__load"))
+		goto out;
+
+	map_fd = bpf_map__fd(skel->maps.rhashmap);
+
+	/* Populate map with test data */
+	for (i = 0; i < 64; i++) {
+		key = i + 1;
+		expected_key_sum += key;
+
+		err = bpf_map_update_elem(map_fd, &key, &val, BPF_NOEXIST);
+		if (!ASSERT_OK(err, "map_update"))
+			goto out;
+	}
+
+	memset(&linfo, 0, sizeof(linfo));
+	linfo.map.map_fd = map_fd;
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.dump_bpf_rhash_map, &opts);
+	if (!ASSERT_OK_PTR(link, "attach_iter"))
+		goto out;
+
+	iter_fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_GE(iter_fd, 0, "create_iter"))
+		goto free_link;
+
+	do {
+		len = read(iter_fd, buf, sizeof(buf));
+	} while (len > 0);
+
+	ASSERT_EQ(skel->bss->key_sum, expected_key_sum, "key_sum");
+	ASSERT_EQ(skel->bss->elem_count, 64, "elem_count");
+
+	close(iter_fd);
+
+free_link:
+	bpf_link__destroy(link);
+out:
+	bpf_iter_bpf_rhash_map__destroy(skel);
+}
+
 void test_rhash(void)
 {
 	if (test__start_subtest("test_rhash_lookup_update"))
@@ -117,4 +177,7 @@ void test_rhash(void)
 
 	if (test__start_subtest("test_rhash_map_extra_too_big"))
 		rhash_map_extra_too_big();
+
+	if (test__start_subtest("test_rhash_iter"))
+		rhash_iter_test();
 }
diff --git a/tools/testing/selftests/bpf/progs/bpf_iter_bpf_rhash_map.c b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_rhash_map.c
new file mode 100644
index 000000000000..86f6c0d5eadb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_iter_bpf_rhash_map.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RHASH);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__uint(max_entries, 64);
+	__type(key, __u32);
+	__type(value, __u64);
+} rhashmap SEC(".maps");
+
+__u32 key_sum = 0;
+__u64 val_sum = 0;
+__u32 elem_count = 0;
+__u32 err = 0;
+
+SEC("iter/bpf_map_elem")
+int dump_bpf_rhash_map(struct bpf_iter__bpf_map_elem *ctx)
+{
+	__u32 *key = ctx->key;
+	__u64 *val = ctx->value;
+
+	if (!key || !val)
+		return 0;
+
+	key_sum += *key;
+	val_sum += *val;
+	elem_count++;
+	return 0;
+}

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 10/11] bpftool: Add rhash map documentation
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (8 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 09/11] selftests/bpf: Add BPF iterator " Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  2026-05-28 17:47 ` [PATCH bpf-next v5 11/11] selftests/bpf: Add resizable hashmap to benchmarks Mykyta Yatsenko
  10 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko, Emil Tsalapatis

From: Mykyta Yatsenko <yatsenko@meta.com>

Make bpftool documentation aware of the resizable hash map.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 2 +-
 tools/bpf/bpftool/map.c                         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 1af3305ea2b2..5daf3de5c744 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -56,7 +56,7 @@ MAP COMMANDS
 |     | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage**
 |     | **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage**
 |     | **task_storage** | **bloom_filter** | **user_ringbuf** | **cgrp_storage** | **arena**
-|     | **insn_array** }
+|     | **insn_array** | **rhash** }
 
 DESCRIPTION
 ===========
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 7ebf7dbcfba4..71a45d96617e 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -1478,7 +1478,7 @@ static int do_help(int argc, char **argv)
 		"                 cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n"
 		"                 queue | stack | sk_storage | struct_ops | ringbuf | inode_storage |\n"
 		"                 task_storage | bloom_filter | user_ringbuf | cgrp_storage | arena |\n"
-		"                 insn_array }\n"
+		"                 insn_array | rhash }\n"
 		"       " HELP_SPEC_OPTIONS " |\n"
 		"                    {-f|--bpffs} | {-n|--nomount} }\n"
 		"",

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH bpf-next v5 11/11] selftests/bpf: Add resizable hashmap to benchmarks
  2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
                   ` (9 preceding siblings ...)
  2026-05-28 17:47 ` [PATCH bpf-next v5 10/11] bpftool: Add rhash map documentation Mykyta Yatsenko
@ 2026-05-28 17:47 ` Mykyta Yatsenko
  10 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-28 17:47 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
	herbert
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Support resizable hashmap in BPF map benchmarks.

1. LOOKUP  (single producer, M events/sec)

  key | max | nr    |    htab |   rhtab | ratio | delta
  ----+-----+-------+---------+---------+-------+-------
    8 |  1K |   750 |   99.85 |   81.92 | 0.82x |  -18 %
    8 |  1K |    1K |  100.71 |   80.19 | 0.80x |  -20 %
    8 |  1M |  750K |   23.37 |   72.09 | 3.08x | +208 %
    8 |  1M |    1M |   13.39 |   53.72 | 4.01x | +301 %
   32 |  1K |   750 |   51.57 |   42.78 | 0.83x |  -17 %
   32 |  1K |    1K |   50.81 |   45.83 | 0.90x |  -10 %
   32 |  1M |  750K |   11.27 |   15.29 | 1.36x |  +36 %
   32 |  1M |    1M |    7.32 |    8.75 | 1.19x |  +19 %
  256 |  1K |   750 |    7.58 |    7.88 | 1.04x |   +4 %
  256 |  1K |    1K |    7.43 |    7.81 | 1.05x |   +5 %
  256 |  1M |  750K |    3.69 |    4.27 | 1.16x |  +16 %
  256 |  1M |    1M |    2.60 |    3.12 | 1.20x |  +20 %

Pattern:
  * Small map (1K): htab wins for 8 / 32 byte keys by 10-20%
  * Large map (1M): rhtab wins everywhere, up to 4x at high load
    factor with 8 byte keys.
  * Higher load factor amplifies rhtab's lead: rhtab grows the
    bucket array; htab stays at user-declared max.

2. FULL UPDATE  (M events/sec per producer)

  htab  per-producer:
    20.33   22.02   19.27   23.61   24.18   23.17   21.07
    mean  21.94   range  19.27 - 24.18

  rhtab per-producer:
   133.51  129.47   74.52  129.29  102.26  129.98  107.64
    mean 115.24   range  74.52 - 133.51

  speedup (mean): 5.25x   (+425 %)

In-place memcpy avoids the per-update alloc + RCU pointer swap
that htab pays.

3. MEMORY

  value_size |  htab ops/s | rhtab ops/s | htab mem | rhtab mem
  -----------+-------------+-------------+----------+----------
       32 B  |  122.87 k/s |  133.04 k/s | 2.47 MiB | 2.49 MiB
     4096 B  |   64.43 k/s |   65.38 k/s | 6.74 MiB | 6.44 MiB
  rhtab/htab :  +8 % ops, +0.8 % mem   (32 B)
                +1 % ops,  -4  % mem (4096 B)

Throughput effectively tied

SUMMARY

  * Small / well-fitting map: htab is faster (cache-friendly
    fixed bucket array), but only by ~10-20 %.
  * Large / high-load-factor map: rhtab is dramatically faster
    (1.2x to 4x) because rhashtable resizes to keep the load
    factor sane while htab stays stuck at user-declared max.
  * Update-heavy workloads: rhtab is ~5x faster per producer
    via in-place memcpy.
  * Memory benchmark: effectively on par.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/bench.c                |  6 ++++
 .../bpf/benchs/bench_bpf_hashmap_full_update.c     | 34 +++++++++++++++++++--
 .../bpf/benchs/bench_bpf_hashmap_lookup.c          | 31 +++++++++++++++++--
 .../testing/selftests/bpf/benchs/bench_htab_mem.c  | 35 ++++++++++++++++++++--
 4 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 6155ce455c27..3d9d2cd7764b 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -560,13 +560,16 @@ extern const struct bench bench_bpf_loop;
 extern const struct bench bench_strncmp_no_helper;
 extern const struct bench bench_strncmp_helper;
 extern const struct bench bench_bpf_hashmap_full_update;
+extern const struct bench bench_bpf_rhashmap_full_update;
 extern const struct bench bench_local_storage_cache_seq_get;
 extern const struct bench bench_local_storage_cache_interleaved_get;
 extern const struct bench bench_local_storage_cache_hashmap_control;
 extern const struct bench bench_local_storage_tasks_trace;
 extern const struct bench bench_bpf_hashmap_lookup;
+extern const struct bench bench_bpf_rhashmap_lookup;
 extern const struct bench bench_local_storage_create;
 extern const struct bench bench_htab_mem;
+extern const struct bench bench_rhtab_mem;
 extern const struct bench bench_crypto_encrypt;
 extern const struct bench bench_crypto_decrypt;
 extern const struct bench bench_sockmap;
@@ -640,13 +643,16 @@ static const struct bench *benchs[] = {
 	&bench_strncmp_no_helper,
 	&bench_strncmp_helper,
 	&bench_bpf_hashmap_full_update,
+	&bench_bpf_rhashmap_full_update,
 	&bench_local_storage_cache_seq_get,
 	&bench_local_storage_cache_interleaved_get,
 	&bench_local_storage_cache_hashmap_control,
 	&bench_local_storage_tasks_trace,
 	&bench_bpf_hashmap_lookup,
+	&bench_bpf_rhashmap_lookup,
 	&bench_local_storage_create,
 	&bench_htab_mem,
+	&bench_rhtab_mem,
 	&bench_crypto_encrypt,
 	&bench_crypto_decrypt,
 	&bench_sockmap,
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
index ee1dc12c5e5e..7278fa860397 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_full_update.c
@@ -34,19 +34,29 @@ static void measure(struct bench_res *res)
 {
 }
 
-static void setup(void)
+static void hashmap_full_update_setup(enum bpf_map_type map_type)
 {
 	struct bpf_link *link;
 	int map_fd, i, max_entries;
 
 	setup_libbpf();
 
-	ctx.skel = bpf_hashmap_full_update_bench__open_and_load();
+	ctx.skel = bpf_hashmap_full_update_bench__open();
 	if (!ctx.skel) {
 		fprintf(stderr, "failed to open skeleton\n");
 		exit(1);
 	}
 
+	bpf_map__set_type(ctx.skel->maps.hash_map_bench, map_type);
+	if (map_type == BPF_MAP_TYPE_RHASH)
+		bpf_map__set_map_flags(ctx.skel->maps.hash_map_bench,
+				       BPF_F_NO_PREALLOC);
+
+	if (bpf_hashmap_full_update_bench__load(ctx.skel)) {
+		fprintf(stderr, "failed to load skeleton\n");
+		exit(1);
+	}
+
 	ctx.skel->bss->nr_loops = MAX_LOOP_NUM;
 
 	link = bpf_program__attach(ctx.skel->progs.benchmark);
@@ -62,6 +72,16 @@ static void setup(void)
 		bpf_map_update_elem(map_fd, &i, &i, BPF_ANY);
 }
 
+static void setup(void)
+{
+	hashmap_full_update_setup(BPF_MAP_TYPE_HASH);
+}
+
+static void rhash_setup(void)
+{
+	hashmap_full_update_setup(BPF_MAP_TYPE_RHASH);
+}
+
 static void hashmap_report_final(struct bench_res res[], int res_cnt)
 {
 	unsigned int nr_cpus = bpf_num_possible_cpus();
@@ -87,3 +107,13 @@ const struct bench bench_bpf_hashmap_full_update = {
 	.report_progress = NULL,
 	.report_final = hashmap_report_final,
 };
+
+const struct bench bench_bpf_rhashmap_full_update = {
+	.name = "bpf-rhashmap-full-update",
+	.validate = validate,
+	.setup = rhash_setup,
+	.producer_thread = producer,
+	.measure = measure,
+	.report_progress = NULL,
+	.report_final = hashmap_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
index 279ff1b8b5b2..5264b7b20e39 100644
--- a/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
+++ b/tools/testing/selftests/bpf/benchs/bench_bpf_hashmap_lookup.c
@@ -148,9 +148,10 @@ static inline void patch_key(u32 i, u32 *key)
 	/* the rest of key is random */
 }
 
-static void setup(void)
+static void hashmap_lookup_setup(enum bpf_map_type map_type)
 {
 	struct bpf_link *link;
+	__u32 map_flags;
 	int map_fd;
 	int ret;
 	int i;
@@ -163,10 +164,15 @@ static void setup(void)
 		exit(1);
 	}
 
+	map_flags = args.map_flags;
+	if (map_type == BPF_MAP_TYPE_RHASH)
+		map_flags |= BPF_F_NO_PREALLOC;
+
+	bpf_map__set_type(ctx.skel->maps.hash_map_bench, map_type);
 	bpf_map__set_max_entries(ctx.skel->maps.hash_map_bench, args.max_entries);
 	bpf_map__set_key_size(ctx.skel->maps.hash_map_bench, args.key_size);
 	bpf_map__set_value_size(ctx.skel->maps.hash_map_bench, 8);
-	bpf_map__set_map_flags(ctx.skel->maps.hash_map_bench, args.map_flags);
+	bpf_map__set_map_flags(ctx.skel->maps.hash_map_bench, map_flags);
 
 	ctx.skel->bss->nr_entries = args.nr_entries;
 	ctx.skel->bss->nr_loops = args.nr_loops / args.nr_entries;
@@ -197,6 +203,16 @@ static void setup(void)
 	}
 }
 
+static void setup(void)
+{
+	hashmap_lookup_setup(BPF_MAP_TYPE_HASH);
+}
+
+static void rhash_setup(void)
+{
+	hashmap_lookup_setup(BPF_MAP_TYPE_RHASH);
+}
+
 static inline double events_from_time(u64 time)
 {
 	if (time)
@@ -275,3 +291,14 @@ const struct bench bench_bpf_hashmap_lookup = {
 	.report_progress = NULL,
 	.report_final = hashmap_report_final,
 };
+
+const struct bench bench_bpf_rhashmap_lookup = {
+	.name = "bpf-rhashmap-lookup",
+	.argp = &bench_hashmap_lookup_argp,
+	.validate = validate,
+	.setup = rhash_setup,
+	.producer_thread = producer,
+	.measure = measure,
+	.report_progress = NULL,
+	.report_final = hashmap_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/bench_htab_mem.c b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
index 297e32390cd1..1ee217d97434 100644
--- a/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
+++ b/tools/testing/selftests/bpf/benchs/bench_htab_mem.c
@@ -152,7 +152,7 @@ static const struct htab_mem_use_case *htab_mem_find_use_case_or_exit(const char
 	exit(1);
 }
 
-static void htab_mem_setup(void)
+static void htab_mem_setup_impl(enum bpf_map_type map_type)
 {
 	struct bpf_map *map;
 	const char **names;
@@ -178,10 +178,11 @@ static void htab_mem_setup(void)
 	}
 
 	map = ctx.skel->maps.htab;
+	bpf_map__set_type(map, map_type);
 	bpf_map__set_value_size(map, args.value_size);
 	/* Ensure that different CPUs can operate on different subset */
 	bpf_map__set_max_entries(map, MAX(8192, 64 * env.nr_cpus));
-	if (args.preallocated)
+	if (map_type != BPF_MAP_TYPE_RHASH && args.preallocated)
 		bpf_map__set_map_flags(map, bpf_map__map_flags(map) & ~BPF_F_NO_PREALLOC);
 
 	names = ctx.uc->progs;
@@ -220,6 +221,16 @@ static void htab_mem_setup(void)
 	exit(1);
 }
 
+static void htab_mem_setup(void)
+{
+	htab_mem_setup_impl(BPF_MAP_TYPE_HASH);
+}
+
+static void rhtab_mem_setup(void)
+{
+	htab_mem_setup_impl(BPF_MAP_TYPE_RHASH);
+}
+
 static void htab_mem_add_fn(pthread_barrier_t *notify)
 {
 	while (true) {
@@ -338,6 +349,15 @@ static void htab_mem_report_final(struct bench_res res[], int res_cnt)
 	cleanup_cgroup_environment();
 }
 
+static void rhtab_mem_validate(void)
+{
+	if (args.preallocated) {
+		fprintf(stderr, "rhash map does not support preallocation\n");
+		exit(1);
+	}
+	htab_mem_validate();
+}
+
 const struct bench bench_htab_mem = {
 	.name = "htab-mem",
 	.argp = &bench_htab_mem_argp,
@@ -348,3 +368,14 @@ const struct bench bench_htab_mem = {
 	.report_progress = htab_mem_report_progress,
 	.report_final = htab_mem_report_final,
 };
+
+const struct bench bench_rhtab_mem = {
+	.name = "rhtab-mem",
+	.argp = &bench_htab_mem_argp,
+	.validate = rhtab_mem_validate,
+	.setup = rhtab_mem_setup,
+	.producer_thread = htab_mem_producer,
+	.measure = htab_mem_measure,
+	.report_progress = htab_mem_report_progress,
+	.report_final = htab_mem_report_final,
+};

-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API
  2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
@ 2026-05-28 18:16   ` sashiko-bot
  2026-05-29 16:43     ` Mykyta Yatsenko
  2026-05-28 18:26   ` bot+bpf-ci
  1 sibling, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-28 18:16 UTC (permalink / raw)
  To: Mykyta Yatsenko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [Critical] rhashtable_next_key completely skips unmigrated elements if prev_key has already been migrated during a concurrent rehash.
- [High] rhashtable_next_key enters an infinite loop if the hash table contains duplicate keys.
- [High] Concurrent deletion of the prev_key element aborts the iteration prematurely and returns a dangerous ERR_PTR(-ENOENT) pointer.
- [High] rhashtable_next_key silently ignores all duplicate elements when used on rhlist-enabled tables.
--

commit 2aba1f7c85d79ee4c37a0c5013226a79bf52f189
Author: Mykyta Yatsenko <yatsenko@meta.com>
Subject: rhashtable: Add rhashtable_next_key() API

Introduce a simpler iteration mechanism for rhashtable that lets
the caller continue from an arbitrary position by supplying the
previous key, without the per-iterator state of the
rhashtable_walk_* API.

> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index ef5230cece364..d6eb3ba476ea2 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h
> @@ -650,6 +650,97 @@ static __always_inline struct rhash_head *__rhashtable_lookup(
>  	return NULL;
>  }
>  
> +/* Internal: scan one element forward from prev_key's position.
> + * Returns first rhash_head whose bucket >= prev_key's bucket and
> + * (if within prev_key's bucket) appears after prev_key in chain.
> + * Returns first element if prev_key is NULL or ERR_PTR(-ENOENT) if prev_key
> + * is not found.
> + */
> +static __always_inline struct rhash_head *__rhashtable_next_in_table(
> +	struct rhashtable *ht, struct bucket_table *tbl,
> +	const void *prev_key, const struct rhashtable_params params)
> +	__must_hold_shared(RCU)
> +{
> +	struct rhashtable_compare_arg arg = { .ht = ht, .key = prev_key };
> +	struct rhash_head *he;
> +	unsigned int b = 0;
> +	bool found = false;
> +
> +	if (prev_key) {
> +		b = rht_key_hashfn(ht, tbl, prev_key, params);
> +		rht_for_each_rcu(he, tbl, b) {

[Severity: High]
Does this silently skip duplicate elements when the table is configured with
rhlist = true?

For rhlist-enabled tables (like rhltable), the primary hash chain only connects
the first element of each unique key group, while duplicates are stored in
secondary linked lists via the rhlist_head structure.

Since this iteration relies exclusively on rht_for_each_rcu() and doesn't
traverse those secondary lists, it seems it will only yield the most recently
inserted element for each unique key.

> +			if (found)
> +				return he;
> +			if (params.obj_cmpfn ?
> +			    !params.obj_cmpfn(&arg, rht_obj(ht, he)) :
> +			    !rhashtable_compare(&arg, rht_obj(ht, he)))
> +				found = true;

[Severity: High]
Can this lead to an infinite loop and RCU stall if the hash table contains
duplicate keys?

Since standard rhashtables permit duplicate keys, rhashtable_compare() will
match the first occurrence of prev_key in the chain and then yield the
immediate next element.

If the next element is also a duplicate of prev_key, when the caller resumes
iteration with it, the search restarts from the bucket head, matches the
first occurrence again, and repeatedly yields the same second duplicate
without making forward progress.

> +		}
> +		if (!found)
> +			return ERR_PTR(-ENOENT);

[Severity: High]
Is it safe to return ERR_PTR(-ENOENT) if prev_key is concurrently deleted?

Unlike standard rhashtable iterators, this abruptly aborts the iteration if the
element is removed, missing the remainder of the map.

Additionally, generic callers utilizing standard loop termination checks
(e.g., while (obj)) could inadvertently dereference this error pointer and
cause a kernel panic.

> +		b++;
> +	}
> +
> +	for (; b < tbl->size; b++)
> +		rht_for_each_rcu(he, tbl, b)
> +			return he;
> +	return NULL;
> +}

[ ... ]

> +static inline void *rhashtable_next_key(
> +	struct rhashtable *ht, const void *prev_key,
> +	const struct rhashtable_params params)
> +	__must_hold_shared(RCU)
> +{
> +	struct bucket_table *tbl;
> +	struct rhash_head *he;
> +
> +	tbl = rht_dereference_rcu(ht->tbl, ht);
> +	do {
> +		he = __rhashtable_next_in_table(ht, tbl, prev_key, params);
> +		if (!IS_ERR_OR_NULL(he))
> +			return rht_obj(ht, he);
> +		if (!he)
> +			prev_key = NULL;
> +		/* Ensure we see any new future_tbl attached during a rehash. */
> +		smp_rmb();
> +		tbl = rht_dereference_rcu(tbl->future_tbl, ht);

[Severity: Critical]
Does this logic completely skip unmigrated elements if prev_key has already been
migrated during a concurrent rehash?

If prev_key was migrated, __rhashtable_next_in_table() will fail to find it in
the old table and return ERR_PTR(-ENOENT).

The code then falls back to searching future_tbl where it finds the key and
continues. However, by jumping to future_tbl, the iterator permanently abandons
the old table (ht->tbl) and will never visit its remaining unmigrated buckets,
causing significant data omission.

> +	} while (tbl);
> +	return he; /* NULL or -ENOENT */
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260528-rhash-v5-0-7205191b6c57@meta.com?part=1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable
  2026-05-28 17:47 ` [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable Mykyta Yatsenko
@ 2026-05-28 18:25   ` sashiko-bot
  2026-06-01 16:38     ` Mykyta Yatsenko
  0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-28 18:25 UTC (permalink / raw)
  To: Mykyta Yatsenko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Trivial hash collisions in `rhtab_hashfn_long` defeat rhashtable's DoS mitigation, leading to infinite rehashing and 100% CPU livelock.
- [High] Direct dereferencing of potentially unaligned map key pointers in the optimized hash and compare functions will cause kernel panics on strict-alignment architectures.
--

commit 40f8c726140c6268e5f797c092818cebd3cfc282
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Optimize word-sized keys for resizable hashtable

This commit specializes the lookup, update, and delete paths for word-sized
keys in resizable hashtables. It introduces a custom XOR-fold hash function
and single-word equality comparison to bypass the indirect jhash dispatch.

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[ ... ]

> +/* Specialize hash function and objcmp for long sized key */
> +static __always_inline int rhtab_key_cmp_long(struct rhashtable_compare_arg *arg,
> +					      const void *ptr)
> +{
> +	const unsigned long key1 = *(const unsigned long *)arg->key;

[Severity: High]
Could dereferencing this pointer directly cause alignment faults on
architectures that require strict alignment? Since BPF programs can pass
unaligned pointers (such as from packet data in skb->data) directly to map
helpers, using get_unaligned() might be necessary here to prevent kernel
panics.

> +	const struct rhtab_elem *key2 = ptr;
> +
> +	return key1 != *(const unsigned long *)key2->data;
> +}
> +
> +static __always_inline u32 rhtab_hashfn_long(const void *data, u32 len, u32 seed)
> +{
> +	u64 k = *(const unsigned long *)data;
> +
> +	return (u32)(k ^ (k >> 32)) ^ seed;

[Severity: High]
Does applying the seed after folding the key make hash collisions entirely
independent of the seed?

If a BPF program or an attacker inserts keys constructed as:
k = (Y << 32) | (X ^ Y)

These keys will all fold to exactly X before the XOR with the seed,
meaning f(key1) ^ seed == f(key2) ^ seed for any random seed.

If rhashtable detects a long bucket chain and triggers a background rehash
with a new seed, won't all these colliding keys move together to the same
new bucket? This might trap the rhashtable worker in an infinite loop of
rehashing, leading to a 100% CPU livelock.

> +}
> +

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260528-rhash-v5-0-7205191b6c57@meta.com?part=6

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key()
  2026-05-28 17:47 ` [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key() Mykyta Yatsenko
@ 2026-05-28 18:26   ` sashiko-bot
  2026-05-29 16:53     ` Mykyta Yatsenko
  0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-28 18:26 UTC (permalink / raw)
  To: Mykyta Yatsenko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The selftest strictly enforces that exactly 'n' elements are visited, which violates the rhashtable_next_key() API contract and leads to spurious test failures if concurrent rehashing occurs.
--

commit 764349a8abc231043942a78a0c94a8a3422ff178
Author: Mykyta Yatsenko <yatsenko@meta.com>

rhashtable: Add selftest for rhashtable_next_key()

This commit adds a selftest for rhashtable_next_key() to verify that walking
with a NULL prev_key visits all elements, and that providing a non-existing
prev_key returns ERR_PTR(-ENOENT).

> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
> index 0b33559a910b0..c7a0817b0be59 100644
> --- a/lib/test_rhashtable.c
> +++ b/lib/test_rhashtable.c
[ ... ]
> @@ -679,6 +679,76 @@ static int threadfunc(void *data)
>  	return err;
>  }
>  
> +static int __init test_rhashtable_next_key(void)
> +{
> +	struct rhashtable_params params = test_rht_params;
> +	struct test_obj_val key_missing = { .id = 99999, .tid = 0 };
> +	struct test_obj_val *prev_key = NULL;
> +	struct rhashtable ht;
> +	struct test_obj *objs, *cur;
> +	int i, count = 0, err;
> +	int visited_keys[8] = { 0 };
> +	const int n = ARRAY_SIZE(visited_keys);
> +
> +	err = rhashtable_init(&ht, &params);
> +	if (err)
> +		return err;
> +
> +	objs = kcalloc(n, sizeof(*objs), GFP_KERNEL);
> +	if (!objs) {
> +		rhashtable_destroy(&ht);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0; i < n; i++) {
> +		objs[i].value.id = i;
> +		err = rhashtable_insert_fast(&ht, &objs[i].node, params);
> +		if (err)
> +			goto out;
> +	}
> +
> +	rcu_read_lock();
> +
> +	/* NULL prev_key: walk from the beginning, expect all n elements. */
> +	while ((cur = rhashtable_next_key(&ht, prev_key, params))) {
> +		if (IS_ERR(cur)) {
> +			err = -EINVAL;
> +			goto unlock;
> +		}
> +		count++;
> +		prev_key = &cur->value;
> +		visited_keys[cur->value.id] = 1;
> +		if (count > n)
> +			break;
> +	}
> +
> +	if (count != n) {
> +		err = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	for (i = 0; i < n; i++) {
> +		if (!visited_keys[i]) {
> +			err = -EINVAL;
> +			goto unlock;
> +		}
> +	}

[Severity: Medium]
Will this strict check cause spurious test failures during concurrent rehashing?

The rhashtable_next_key() API documentation explicitly warns that its walk is
highly unstable during concurrent rehashing, stating that callers must
tolerate skipped and duplicated elements.

This test uses test_rht_params, which inherits its nelem_hint from the
user-provided size module parameter. If the test is executed with a small
initial size (e.g., size=4), inserting 8 elements will cross the 75%
threshold and trigger a background rehash by scheduling ht->run_work.

Because the test iterates over the keys immediately without waiting for this
background work to finish (e.g., via flush_work(&ht.run_work) or by locally
overriding nelem_hint to avoid resizing), the traversal can run concurrently
with the rehash. This could cause rhashtable_next_key() to legitimately
return duplicates or skip elements, causing the test to spuriously fail.

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260528-rhash-v5-0-7205191b6c57@meta.com?part=2

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API
  2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
  2026-05-28 18:16   ` sashiko-bot
@ 2026-05-28 18:26   ` bot+bpf-ci
  1 sibling, 0 replies; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-28 18:26 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, memxor, herbert
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3566 bytes --]

> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
> index ef5230cece36..d6eb3ba476ea 100644
> --- a/include/linux/rhashtable.h
> +++ b/include/linux/rhashtable.h

[ ... ]

> @@ -650,6 +650,97 @@ static __always_inline struct rhash_head *__rhashtable_lookup(
>  	return NULL;
>  }
>
> +/**
> + * rhashtable_next_key - return next element after a given key
> + * @ht:		hash table
> + * @prev_key:	pointer to previous key, or NULL for the first element
> + * @params:	hash table parameters
> + *
> + * WARNING: this walk is highly unstable. Unlike rhashtable_walk_*(),
> + * it cannot detect a concurrent resize or rehash, so a full iteration
> + * is NOT guaranteed to terminate under adversarial or sustained
> + * rehashing. Callers MUST tolerate skipped and duplicated elements and
> + * SHOULD bound their loop externally.
> + *
> + * Returns the next element in best-effort iteration order, walking the
> + * @tbl chain (including any future_tbl in flight). Caller must hold RCU.
> + *
> + * Pass @prev_key == NULL to obtain the first element. To iterate, set
> + * @prev_key to the key of the previously returned element on each call,
> + * and stop when NULL is returned.
> + *
> + * Best-effort semantics:
> + *   - Across the tbl->future_tbl chain, an element being migrated may
> + *     transiently appear in both tables and be observed twice.
> + *   - Concurrent inserts may or may not be observed.
> + *   - Termination of a full iteration loop is NOT guaranteed under
> + *     adversarial continuous rehash; callers MUST tolerate skips and
> + *     repeats and SHOULD bound their loop externally.
> + *   - If prev_key was concurrently deleted and is not present in any
> + *     in-flight table, returns ERR_PTR(-ENOENT).
> + *
> + * Returns entry of the next element, or NULL when iteration is exhausted
> + * or ERR_PTR(-ENOENT) if prev_key is not found.
> + */
> +static inline void *rhashtable_next_key(
> +	struct rhashtable *ht, const void *prev_key,
> +	const struct rhashtable_params params)
> +	__must_hold_shared(RCU)
> +{

Herbert Xu (rhashtable maintainer) raised a concern in v3 about rehashing
causing infinite iteration:

https://lore.kernel.org/bpf/afCNDOhbszgMlGB7@gondor.apana.org.au/

> The same issue from v2 still exists here.  If a rehash occurs in
> between calls to rht_map_get_next_key you could be iterating over
> the table forever.

In v4, a new rhashtable_next_key() API was introduced to address this,
with documentation noting the issue. Alexei Starovoitov asked Herbert
"Is it an acceptable api? I suspect you don't like it" but Herbert has
not responded.

v5 added a WARNING in kdoc attributed to "(Herbert)", but this appears
to acknowledge his concern rather than indicate approval.

Has the rhashtable maintainer explicitly reviewed or approved this new
API design?

> +	struct bucket_table *tbl;
> +	struct rhash_head *he;
> +
> +	tbl = rht_dereference_rcu(ht->tbl, ht);
> +	do {
> +		he = __rhashtable_next_in_table(ht, tbl, prev_key, params);
> +		if (!IS_ERR_OR_NULL(he))
> +			return rht_obj(ht, he);
> +		if (!he)
> +			prev_key = NULL;
> +		/* Ensure we see any new future_tbl attached during a rehash. */
> +		smp_rmb();
> +		tbl = rht_dereference_rcu(tbl->future_tbl, ht);
> +	} while (tbl);
> +	return he; /* NULL or -ENOENT */
> +}

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26592533594

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 05/11] bpf: Allow special fields in resizable hashtab
  2026-05-28 17:47 ` [PATCH bpf-next v5 05/11] bpf: Allow special fields in " Mykyta Yatsenko
@ 2026-05-28 18:26   ` bot+bpf-ci
  2026-06-01 13:05     ` Mykyta Yatsenko
  0 siblings, 1 reply; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-28 18:26 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, memxor, herbert
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2528 bytes --]

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index a03fa5fa6545..50e9d57b2790 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[ ... ]

> @@ -2943,6 +2965,7 @@ static int rhtab_map_lookup_and_delete_elem(struct bpf_map *map, void *key, void
>  static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
>  				      u64 map_flags)
>  {
> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
>  	void *old_val = rhtab_elem_value(elem, map->key_size);
>
>  	if (map_flags & BPF_NOEXIST)
> @@ -2952,6 +2975,17 @@ static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *el
>  		copy_map_value_locked(map, old_val, value, false);
>  	else
>  		copy_map_value(map, old_val, value);
> +
> +	/*
> +	 * Torn reads: a concurrent reader without BPF_F_LOCK may observe
> +	 * the value mid-copy. Callers requiring consistent reads must use
> +	 * BPF_F_LOCK, matching arraymap semantics.
> +	 *
> +	 * copy_map_value() skips special-field offsets, so old timers/
> +	 * kptrs/etc. still sit in the slot. Cancel them after the copy
> +	 * to match arraymap's update semantics.
> +	 */
> +	rhtab_check_and_free_fields(rhtab, elem);
>  	return 0;
>  }

In the v4 discussion on lore, a concern was raised about the ordering of
copy_map_value() and rhtab_check_and_free_fields() operations. When
copy_map_value() is called, it skips special field offsets, leaving the
old timer/kptr/etc. active in the slot during the copy.

Could a concurrent BPF program access those special fields in the window
between the copy and the free operations? Specifically, in the
BPF_F_LOCK case where copy_map_value_locked() is used, if a concurrent
program acquires the spinlock and accesses the value between the copy
completion and rhtab_check_and_free_fields() call, would it still see
the old timer/kptr that gets canceled immediately after?

The comment mentions matching arraymap's update semantics and acknowledges
torn reads for the non-locked case, but it's not clear whether the locked
case fully prevents concurrent access to special fields that are about to
be freed.

Reference: https://lore.kernel.org/bpf/252dc423a3bc75b3f4d936e7706085da71939a88a7ef148926c317d463d20a12@mail.kernel.org/

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26592533594

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab
  2026-05-28 17:47 ` [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab Mykyta Yatsenko
@ 2026-05-28 18:28   ` sashiko-bot
  2026-05-29 17:53     ` Mykyta Yatsenko
  0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-28 18:28 UTC (permalink / raw)
  To: Mykyta Yatsenko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Critical] Missing error pointer check in `rhtab_map_get_next_key` leads to unhandled error pointer dereferences and kernel panics.
- [High] `bpf_each_rhash_elem` illegally calls `cond_resched()` and drops `rcu_read_lock()` from within a BPF map iterator callback, which will trigger kernel panics in atomic contexts.
- [High] Batch lookup and `seq_file` iterators prematurely terminate when encountering concurrent deletions, breaking BPF map iteration guarantees and causing silent data omission.
--

commit afc331365f30658242015d82e7bdd35a014f432d
Author: Mykyta Yatsenko <yatsenko@meta.com>

bpf: Implement iteration ops for resizable hashtab

Implement rhtab_map_get_next_key(), batch lookup/lookup-and-delete,
and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH.

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 8944c9f070ca8..a03fa5fa6545b 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -3021,10 +3021,363 @@ static void rhtab_map_free_internal_structs(struct bpf_map *map)
>  }
>  
>  static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +	__must_hold_shared(RCU)
>  {
> -	return -EOPNOTSUPP;
> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
> +	struct rhtab_elem *elem;
> +
> +	elem = rhashtable_next_key(&rhtab->ht, key, rhtab_params);
> +
> +	/* if not found, return the first key */
> +	if (PTR_ERR(elem) == -ENOENT)
> +		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
> +
> +	if (!elem)
> +		return -ENOENT;
> +
> +	memcpy(next_key, elem->data, map->key_size);

[Severity: Critical]
Could this dereference an error pointer if rhashtable_next_key() returns
ERR_PTR(-EAGAIN)? 

The first check only handles -ENOENT. If we get -EAGAIN (which can happen
during concurrent resizing), it bypasses the PTR_ERR check, passes the
!elem check since it isn't NULL, and then passes the error pointer directly
into memcpy().

> +	return 0;
> +}

[ ... ]

> +static long bpf_each_rhash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
> +				void *callback_ctx, u64 flags)
> +{
> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
> +	void *prev_key = NULL;

[ ... ]

> +	rcu_read_lock();
> +	/*
> +	 * Best-effort iteration: if rhashtable is concurrently resized or
> +	 * elements are deleted/inserted, there may be missed or duplicate
> +	 * elements visited.
> +	 */
> +	while ((elem = rhashtable_next_key(&rhtab->ht, prev_key, rhtab_params))) {
> +		if (IS_ERR(elem))
> +			break;
> +		num_elems++;
> +		ret = callback_fn((u64)(long)map,
> +				  (u64)(long)elem->data,
> +				  (u64)(long)rhtab_elem_value(elem, map->key_size),
> +				  (u64)(long)callback_ctx, 0);
> +		if (ret)
> +			break;
> +
> +		prev_key = elem->data;	/* valid while RCU held */
> +
> +		if (need_resched()) {
> +			memcpy(key_buf, elem->data, map->key_size);
> +			rcu_read_unlock();
> +			cond_resched();

[Severity: High]
Can this cond_resched() trigger "BUG: scheduling while atomic" panics?

BPF programs using the bpf_for_each_map_elem helper often run in
non-sleepable contexts (like XDP, TC, or tracing) under an outer RCU read
lock or with preemption disabled. 

While this drops the inner RCU lock, it doesn't exit the outer atomic
context, so sleeping here might violate those constraints and cause system
instability.

> +			rcu_read_lock();
> +			prev_key = key_buf;
> +		}
> +	}
> +	rcu_read_unlock();

[ ... ]

> +static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
> +					       const union bpf_attr *attr,
> +					       union bpf_attr __user *uattr,
> +					       bool do_delete)
> +{

[ ... ]

> +	rcu_read_lock();
> +
> +	/*
> +	 * Cursor stores the key of the next-to-process element (stashed by
> +	 * the previous batch). Look it up directly so the element is included
> +	 * here rather than skipped by next_key(). If the cursor was deleted
> +	 * concurrently (or by the previous do_delete batch), lookup returns
> +	 * NULL and userspace must restart from a NULL cursor.
> +	 */
> +	if (ubatch)
> +		elem = rhtab_lookup_elem(map, cursor);
> +	else
> +		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
> +
> +	while (elem && !IS_ERR(elem) && total < max_count) {

[Severity: High]
Will this prematurely terminate the batch iteration if a concurrent deletion
occurs? 

If the cursor element was concurrently deleted, rhashtable_next_key() returns
ERR_PTR(-ENOENT). This loop checks !IS_ERR(elem) and will break out early.

This results in the function returning -ENOENT to userspace. Under BPF UAPI
semantics, userspace interprets -ENOENT as the end of the map, which might
cause silent data omission instead of safely recovering or continuing.

> +		memcpy(dst_key, elem->data, key_size);
> +		rhtab_read_elem_value(map, dst_val, elem, elem_map_flags);
> +		check_and_init_map_value(map, dst_val);

[ ... ]

> +static void *bpf_rhash_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> +{
> +	struct bpf_iter_seq_rhash_map_info *info = seq->private;
> +	struct rhtab_elem *elem = v;
> +	void *next;
> +
> +	memcpy(info->last_key, elem->data, info->map->key_size);
> +	info->last_key_valid = true;
> +	++*pos;
> +
> +	next = rhashtable_next_key(&info->rhtab->ht, info->last_key,
> +				   rhtab_params);
> +	if (IS_ERR(next))
> +		return NULL;

[Severity: High]
Does this also prematurely terminate the seq_file iteration on concurrent
deletions? 

Similar to the batch logic above, returning NULL when next is
ERR_PTR(-ENOENT) signals the end of the sequence to the iterator, which could
silently truncate the iteration results.

> +	return next;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260528-rhash-v5-0-7205191b6c57@meta.com?part=4

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map
  2026-05-28 17:47 ` [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map Mykyta Yatsenko
@ 2026-05-28 18:41   ` sashiko-bot
  2026-06-01 16:41     ` Mykyta Yatsenko
  0 siblings, 1 reply; 27+ messages in thread
From: sashiko-bot @ 2026-05-28 18:41 UTC (permalink / raw)
  To: Mykyta Yatsenko; +Cc: bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Missing file descriptor validation for `bpf_program__fd(prog)` after retrieving the program via `bpf_object__find_program_by_name()`.
--

commit 85c8ece43292fc1ae20a51c42265be508a83ec61
Author: Mykyta Yatsenko <yatsenko@meta.com>

selftests/bpf: Add basic tests for resizable hash map
    
Test basic map operations (lookup, update, delete) for
BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
key insertion and deletion of nonexistent keys.

> diff --git a/tools/testing/selftests/bpf/prog_tests/rhash.c b/tools/testing/selftests/bpf/prog_tests/rhash.c
> new file mode 100644
> index 0000000000000..69686bf69ba57
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/rhash.c
[ ... ]
> +static void rhash_run(const char *prog_name)
> +{
> +	struct rhash *skel;
> +	struct bpf_program *prog;
> +	LIBBPF_OPTS(bpf_test_run_opts, opts);
> +	int err;
> +
> +	skel = rhash__open();
> +	if (!ASSERT_OK_PTR(skel, "rhash__open"))
> +		return;
> +
> +	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
> +	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
> +		goto cleanup;
> +	bpf_program__set_autoload(prog, true);
> +
> +	err = rhash__load(skel);
> +	if (!ASSERT_OK(err, "skel_load"))
> +		goto cleanup;
> +
> +	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);

[Severity: Low]
Does the file descriptor returned by bpf_program__fd(prog) need to be
validated here? 

Since prog was obtained via the manual lookup API
bpf_object__find_program_by_name(), should we verify the fd is valid (>= 0)
using ASSERT_GE before passing it to bpf_prog_test_run_opts()?

> +	if (!ASSERT_OK(err, "prog run"))
> +		goto cleanup;
> +
> +	if (!ASSERT_OK(opts.retval, "prog retval"))
> +		goto cleanup;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260528-rhash-v5-0-7205191b6c57@meta.com?part=8

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions
  2026-05-28 17:47 ` [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
@ 2026-05-28 18:42   ` bot+bpf-ci
  2026-05-29 17:08     ` Mykyta Yatsenko
  0 siblings, 1 reply; 27+ messages in thread
From: bot+bpf-ci @ 2026-05-28 18:42 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	eddyz87, memxor, herbert
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 10910 bytes --]

> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 3dd9b4924ae4..8944c9f070ca 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c

[ ... ]

> +static void rhtab_free_elem(void *ptr, void *arg)
> +{
> +	struct bpf_rhtab *rhtab = arg;
> +	struct rhtab_elem *elem = ptr;
> +
> +	bpf_map_free_internal_structs(&rhtab->map, rhtab_elem_value(elem, rhtab->map.key_size));
> +	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
> +}

bpf_map_free_internal_structs only handles timers, workqueues, and
task_work (not kptrs, list_heads, rb_roots, or refcounted objects). In
contrast, htab's htab_elem_free calls check_and_free_fields which
eventually invokes bpf_obj_free_fields for the full field list. As a
result, when this restriction is later lifted, rhtab_free_elem would
leak kptrs and skip list/rbtree drains during map teardown.

The issue is latent at this commit because syscall.c map_check_btf
rejects all those field types for RHASH maps (map->record is NULL), but
the follow-up commit that enables the field types must rewrite this
entire function and add a memalloc destructor with bpf_obj_free_fields.

[ ... ]

> +static int rhtab_delete_elem(struct bpf_rhtab *rhtab, struct rhtab_elem *elem, void *copy,
> +			     u64 flags)
> +{
> +	int err;
> +
> +	/*
> +	 * disable_instrumentation() mitigates the deadlock for programs running in NMI context.
> +	 * rhashtable locks bucket with local_irq_save(). Only NMI programs may reenter
> +	 * rhashtable code, bpf_disable_instrumentation() disables programs running in NMI, except
> +	 * raw tracepoints, which we don't have in rhashtable.
> +	 */
> +	bpf_disable_instrumentation();
> +	err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params);
> +	bpf_enable_instrumentation();

Does bpf_disable_instrumentation prevent the softirq deadlock scenario?

The v4 bot review noted that if a syscall holds the bucket lock and
is interrupted by hardware IRQ that schedules softirq (e.g., XDP or TC
program), the softirq would deadlock trying to acquire the same bucket
lock. bpf_disable_instrumentation only prevents NMI-based tracing
programs, not softirq programs triggered by hardware IRQs.

Alexei Starovoitov suggested in v1 that update/delete operations should
check 'if (irqs_disabled()) return -EBUSY' to prevent this class of
deadlock. That check is not present in this version.

Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/

> +
> +	if (err)
> +		return err;
> +
> +	if (copy) {
> +		rhtab_read_elem_value(&rhtab->map, copy, elem, flags);
> +		check_and_init_map_value(&rhtab->map, copy);
> +	}
> +
> +	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
> +	return 0;
> +}

The element is freed with bpf_mem_cache_free_rcu without calling
bpf_obj_free_fields or check_and_free_fields on the value. Compare with
htab's free_htab_elem which calls check_and_free_fields before freeing.

Once special fields are enabled for RHASH maps (by a later commit),
deleting an element with kptrs, list_heads, or rb_roots would leak those
resources. The later commit that enables special fields adds
rhtab_check_and_free_fields here.

[ ... ]

> +static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
> +				      u64 map_flags)
> +{
> +	void *old_val = rhtab_elem_value(elem, map->key_size);
> +
> +	if (map_flags & BPF_NOEXIST)
> +		return -EEXIST;
> +
> +	if (map_flags & BPF_F_LOCK)
> +		copy_map_value_locked(map, old_val, value, false);
> +	else
> +		copy_map_value(map, old_val, value);
> +	return 0;
> +}

copy_map_value deliberately skips special-field offsets so it doesn't
clobber kptr/timer/list_head pointers. After this in-place update
returns, the element's value contains the new scalar bytes but still
holds the OLD kptrs, timers, and work-queue handles at the special
offsets.

Arraymap and htab both invoke bpf_obj_free_fields or cancel timers on
the old value after the scalar copy to release the per-element
resources. The follow-up commit that enables special fields adds
rhtab_check_and_free_fields after the copy to match this pattern.

[ ... ]

> +static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags)
> +{
> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
> +	struct rhtab_elem *elem, *tmp;
> +
> +	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
> +		return -EINVAL;
> +
> +	if ((map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK))
> +		return -EINVAL;

Similar deadlock question as rhtab_delete_elem: does
bpf_disable_instrumentation (used later in this function) prevent the
softirq deadlock where a syscall holds the bucket lock and is interrupted
by hardware IRQ that schedules softirq?

Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/

[ ... ]

> +	elem = bpf_mem_cache_alloc(&rhtab->ma);
> +	if (!elem)
> +		return -ENOMEM;
> +
> +	memcpy(elem->data, key, map->key_size);
> +	copy_map_value(map, rhtab_elem_value(elem, map->key_size), value);

bpf_mem_cache_alloc returns memory that is not zeroed at special-field
offsets (the cache may recycle freed objects). copy_map_value
deliberately skips those offsets. The freshly inserted element's value
therefore has uninitialized bytes at every special-field offset (timer
pointer, kptr, list_head, rb_root, etc.).

The canonical pattern is to call check_and_init_map_value after
copy_map_value to zero-initialize or call the field-specific init for
every special-field offset. htab does this in alloc_htab_elem and
pcpu_init_value. The follow-up commit adds check_and_init_map_value
here.

[ ... ]

> +
> +BTF_ID_LIST_SINGLE(rhtab_map_btf_ids, struct, bpf_rhtab)
> +const struct bpf_map_ops rhtab_map_ops = {
> +	.map_meta_equal = bpf_map_meta_equal,
> +	.map_alloc_check = rhtab_map_alloc_check,
> +	.map_alloc = rhtab_map_alloc,
> +	.map_free = rhtab_map_free,
> +	.map_get_next_key = rhtab_map_get_next_key,
> +	.map_release_uref = rhtab_map_free_internal_structs,
> +	.map_lookup_elem = rhtab_map_lookup_elem,
> +	.map_lookup_and_delete_elem = rhtab_map_lookup_and_delete_elem,
> +	.map_update_elem = rhtab_map_update_elem,
> +	.map_delete_elem = rhtab_map_delete_elem,
> +	.map_gen_lookup = rhtab_map_gen_lookup,
> +	.map_btf_id = &rhtab_map_btf_ids[0],
> +};

rhtab_map_ops does not define a .map_mem_usage callback. At this commit,
kernel/bpf/syscall.c __map_create enforces:

  if (!ops->map_mem_usage)
    return -EINVAL;

This check sits before map_alloc. As a result, every attempt to create a
BPF_MAP_TYPE_RHASH map fails with -EINVAL at the bpf(BPF_MAP_CREATE)
syscall. No code path inside rhtab_map_alloc / rhtab_map_update_elem /
etc. is reachable from user space.

The commit's stated runtime semantics (use rhashtable_lookup_likely for
lookups, in-place updates) are untestable because no map of this type
can exist. The bpf_map_show_fdinfo path also relies on
map->ops->map_mem_usage(map) without a NULL check, so even if the create
check were bypassed, /proc/[pid]/fdinfo/[map_fd] would NULL-deref.

The follow-up commit 'bpf: Implement iteration ops for resizable
hashtab' adds rhtab_map_mem_usage and wires .map_mem_usage =
rhtab_map_mem_usage in rhtab_map_ops. Each patch must compile and
introduce a working feature when applied to a kernel that bisects to it.
This commit is the first to expose BPF_MAP_TYPE_RHASH to user space (it
touches uapi bpf.h and the syscall.c switch), so this missing callback
breaks bisection.

[ ... ]

> +static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
> +{
> +	return -EOPNOTSUPP;
> +}

rhtab_map_get_next_key always returns -EOPNOTSUPP, meaning userspace
cannot iterate keys of a BPF_MAP_TYPE_RHASH map via the standard syscall
API, and tools like bpftool dump cannot list rhtab map contents.

The series previously added rhashtable_next_key API which provides
best-effort iteration, but it was never wired up.

Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/

[ ... ]

Multiple map_ops are registered as functional callbacks but are stubs
that were removed in v5. However, in v5 many callbacks that should be
present are missing entirely (map_seq_show_elem, map_for_each_callback,
map_mem_usage, BATCH_OPS, iter_seq_info).

While returning -EOPNOTSUPP is graceful (no crash), registering ops
pointers that fail rather than relying on the dispatcher's NULL check
defeats the verifier's compile-time validation: the verifier will accept
BPF programs that call bpf_for_each_map_elem on an rhtab map, only to
fail at runtime. Should these callbacks be NULL instead?

Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/

[ ... ]

> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 8d4ea700aac6..12c07310016a 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c

[ ... syscall.c changes for map_extra and map_lookup_and_delete_elem allowlist ... ]

The commit message states: 'Updates modify values in place under RCU
rather than allocating a new element... This trades read consistency for
performance: concurrent readers may see partial updates. Users requiring
consistent reads should use BPF_F_LOCK.'

The rhtab implementation in kernel/bpf/hashtab.c is wired for BPF_F_LOCK:
rhtab_map_update_elem checks btf_record_has_field(map->record,
BPF_SPIN_LOCK), and rhtab_read_elem_value / rhtab_map_update_existing
call copy_map_value_locked when BPF_F_LOCK is set.

However, this patch does not add BPF_MAP_TYPE_RHASH to the BPF_SPIN_LOCK
allow-list in map_check_btf at kernel/bpf/syscall.c (the case
BPF_SPIN_LOCK block that lists HASH, ARRAY, CGROUP_STORAGE, SK_STORAGE,
INODE_STORAGE, TASK_STORAGE, CGRP_STORAGE).

When a user attempts to create an RHASH map whose value type contains a
struct bpf_spin_lock (the only way to enable BPF_F_LOCK on a map),
__map_create calls map_check_btf, the BPF_SPIN_LOCK case checks the
allow-list, does not find BPF_MAP_TYPE_RHASH, and returns -EOPNOTSUPP.
The map cannot be created.

Therefore the BPF_F_LOCK code paths in rhtab_map_* are unreachable in
this commit, contradicting the commit message. A user bisecting a tree
at this commit and trying to use the documented BPF_F_LOCK feature on an
RHASH map will get -EOPNOTSUPP at map creation time.

The follow-up commit 'bpf: Allow special fields in resizable hashtab'
adds BPF_MAP_TYPE_RHASH to both the BPF_SPIN_LOCK and
BPF_TIMER/BPF_WORKQUEUE/BPF_TASK_WORK switch cases in map_check_btf.

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26592533594

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API
  2026-05-28 18:16   ` sashiko-bot
@ 2026-05-29 16:43     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-29 16:43 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf

On 5/28/26 7:16 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
> - [Critical] rhashtable_next_key completely skips unmigrated elements if prev_key has already been migrated during a concurrent rehash.
> - [High] rhashtable_next_key enters an infinite loop if the hash table contains duplicate keys.
> - [High] Concurrent deletion of the prev_key element aborts the iteration prematurely and returns a dangerous ERR_PTR(-ENOENT) pointer.
> - [High] rhashtable_next_key silently ignores all duplicate elements when used on rhlist-enabled tables.
> --
> 
> commit 2aba1f7c85d79ee4c37a0c5013226a79bf52f189
> Author: Mykyta Yatsenko <yatsenko@meta.com>
> Subject: rhashtable: Add rhashtable_next_key() API
> 
> Introduce a simpler iteration mechanism for rhashtable that lets
> the caller continue from an arbitrary position by supplying the
> previous key, without the per-iterator state of the
> rhashtable_walk_* API.
> 
>> diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
>> index ef5230cece364..d6eb3ba476ea2 100644
>> --- a/include/linux/rhashtable.h
>> +++ b/include/linux/rhashtable.h
>> @@ -650,6 +650,97 @@ static __always_inline struct rhash_head *__rhashtable_lookup(
>>  	return NULL;
>>  }
>>  
>> +/* Internal: scan one element forward from prev_key's position.
>> + * Returns first rhash_head whose bucket >= prev_key's bucket and
>> + * (if within prev_key's bucket) appears after prev_key in chain.
>> + * Returns first element if prev_key is NULL or ERR_PTR(-ENOENT) if prev_key
>> + * is not found.
>> + */
>> +static __always_inline struct rhash_head *__rhashtable_next_in_table(
>> +	struct rhashtable *ht, struct bucket_table *tbl,
>> +	const void *prev_key, const struct rhashtable_params params)
>> +	__must_hold_shared(RCU)
>> +{
>> +	struct rhashtable_compare_arg arg = { .ht = ht, .key = prev_key };
>> +	struct rhash_head *he;
>> +	unsigned int b = 0;
>> +	bool found = false;
>> +
>> +	if (prev_key) {
>> +		b = rht_key_hashfn(ht, tbl, prev_key, params);
>> +		rht_for_each_rcu(he, tbl, b) {
> 
> [Severity: High]
> Does this silently skip duplicate elements when the table is configured with
> rhlist = true?
> 
> For rhlist-enabled tables (like rhltable), the primary hash chain only connects
> the first element of each unique key group, while duplicates are stored in
> secondary linked lists via the rhlist_head structure.
> 
> Since this iteration relies exclusively on rht_for_each_rcu() and doesn't
> traverse those secondary lists, it seems it will only yield the most recently
> inserted element for each unique key.
> 

rhltable is not supported, I can return -EOPNOTSUPP for it:
if (unlikely(ht->rhlist))
    return ERR_PTR(-EOPNOTSUPP);

>> +			if (found)
>> +				return he;
>> +			if (params.obj_cmpfn ?
>> +			    !params.obj_cmpfn(&arg, rht_obj(ht, he)) :
>> +			    !rhashtable_compare(&arg, rht_obj(ht, he)))
>> +				found = true;
> 
> [Severity: High]
> Can this lead to an infinite loop and RCU stall if the hash table contains
> duplicate keys?
> 
> Since standard rhashtables permit duplicate keys, rhashtable_compare() will
> match the first occurrence of prev_key in the chain and then yield the
> immediate next element.
> 
> If the next element is also a duplicate of prev_key, when the caller resumes
> iteration with it, the search restarts from the bucket head, matches the
> first occurrence again, and repeatedly yields the same second duplicate
> without making forward progress.
> 

good point, This code can be modified to skip duplicate keys. For BPF
codepaths duplicate keys are not possible, though.

>> +		}
>> +		if (!found)
>> +			return ERR_PTR(-ENOENT);
> 
> [Severity: High]
> Is it safe to return ERR_PTR(-ENOENT) if prev_key is concurrently deleted?
> 
> Unlike standard rhashtable iterators, this abruptly aborts the iteration if the
> element is removed, missing the remainder of the map.
> 
> Additionally, generic callers utilizing standard loop termination checks
> (e.g., while (obj)) could inadvertently dereference this error pointer and
> cause a kernel panic.

users should use IS_ERR_OR_NULL() instead of just checking for NULL.

> 
>> +		b++;
>> +	}
>> +
>> +	for (; b < tbl->size; b++)
>> +		rht_for_each_rcu(he, tbl, b)
>> +			return he;
>> +	return NULL;
>> +}
> 
> [ ... ]
> 
>> +static inline void *rhashtable_next_key(
>> +	struct rhashtable *ht, const void *prev_key,
>> +	const struct rhashtable_params params)
>> +	__must_hold_shared(RCU)
>> +{
>> +	struct bucket_table *tbl;
>> +	struct rhash_head *he;
>> +
>> +	tbl = rht_dereference_rcu(ht->tbl, ht);
>> +	do {
>> +		he = __rhashtable_next_in_table(ht, tbl, prev_key, params);
>> +		if (!IS_ERR_OR_NULL(he))
>> +			return rht_obj(ht, he);
>> +		if (!he)
>> +			prev_key = NULL;
>> +		/* Ensure we see any new future_tbl attached during a rehash. */
>> +		smp_rmb();
>> +		tbl = rht_dereference_rcu(tbl->future_tbl, ht);
> 
> [Severity: Critical]
> Does this logic completely skip unmigrated elements if prev_key has already been
> migrated during a concurrent rehash?
> 
> If prev_key was migrated, __rhashtable_next_in_table() will fail to find it in
> the old table and return ERR_PTR(-ENOENT).
> 
> The code then falls back to searching future_tbl where it finds the key and
> continues. However, by jumping to future_tbl, the iterator permanently abandons
> the old table (ht->tbl) and will never visit its remaining unmigrated buckets,
> causing significant data omission.
> 

Yes, this is a limitation we have to deal with when using rhashtable.

>> +	} while (tbl);
>> +	return he; /* NULL or -ENOENT */
>> +}
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key()
  2026-05-28 18:26   ` sashiko-bot
@ 2026-05-29 16:53     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-29 16:53 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf



On 5/28/26 7:26 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Medium] The selftest strictly enforces that exactly 'n' elements are visited, which violates the rhashtable_next_key() API contract and leads to spurious test failures if concurrent rehashing occurs.
> --
> 
> commit 764349a8abc231043942a78a0c94a8a3422ff178
> Author: Mykyta Yatsenko <yatsenko@meta.com>
> 
> rhashtable: Add selftest for rhashtable_next_key()
> 
> This commit adds a selftest for rhashtable_next_key() to verify that walking
> with a NULL prev_key visits all elements, and that providing a non-existing
> prev_key returns ERR_PTR(-ENOENT).
> 
>> diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
>> index 0b33559a910b0..c7a0817b0be59 100644
>> --- a/lib/test_rhashtable.c
>> +++ b/lib/test_rhashtable.c
> [ ... ]
>> @@ -679,6 +679,76 @@ static int threadfunc(void *data)
>>  	return err;
>>  }
>>  
>> +static int __init test_rhashtable_next_key(void)
>> +{
>> +	struct rhashtable_params params = test_rht_params;
>> +	struct test_obj_val key_missing = { .id = 99999, .tid = 0 };
>> +	struct test_obj_val *prev_key = NULL;
>> +	struct rhashtable ht;
>> +	struct test_obj *objs, *cur;
>> +	int i, count = 0, err;
>> +	int visited_keys[8] = { 0 };
>> +	const int n = ARRAY_SIZE(visited_keys);
>> +
>> +	err = rhashtable_init(&ht, &params);
>> +	if (err)
>> +		return err;
>> +
>> +	objs = kcalloc(n, sizeof(*objs), GFP_KERNEL);
>> +	if (!objs) {
>> +		rhashtable_destroy(&ht);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	for (i = 0; i < n; i++) {
>> +		objs[i].value.id = i;
>> +		err = rhashtable_insert_fast(&ht, &objs[i].node, params);
>> +		if (err)
>> +			goto out;
>> +	}
>> +
>> +	rcu_read_lock();
>> +
>> +	/* NULL prev_key: walk from the beginning, expect all n elements. */
>> +	while ((cur = rhashtable_next_key(&ht, prev_key, params))) {
>> +		if (IS_ERR(cur)) {
>> +			err = -EINVAL;
>> +			goto unlock;
>> +		}
>> +		count++;
>> +		prev_key = &cur->value;
>> +		visited_keys[cur->value.id] = 1;
>> +		if (count > n)
>> +			break;
>> +	}
>> +
>> +	if (count != n) {
>> +		err = -EINVAL;
>> +		goto unlock;
>> +	}
>> +
>> +	for (i = 0; i < n; i++) {
>> +		if (!visited_keys[i]) {
>> +			err = -EINVAL;
>> +			goto unlock;
>> +		}
>> +	}
> 
> [Severity: Medium]
> Will this strict check cause spurious test failures during concurrent rehashing?
> 
> The rhashtable_next_key() API documentation explicitly warns that its walk is
> highly unstable during concurrent rehashing, stating that callers must
> tolerate skipped and duplicated elements.
> 
> This test uses test_rht_params, which inherits its nelem_hint from the
> user-provided size module parameter. If the test is executed with a small
> initial size (e.g., size=4), inserting 8 elements will cross the 75%
> threshold and trigger a background rehash by scheduling ht->run_work.
> 
> Because the test iterates over the keys immediately without waiting for this
> background work to finish (e.g., via flush_work(&ht.run_work) or by locally
> overriding nelem_hint to avoid resizing), the traversal can run concurrently
> with the rehash. This could cause rhashtable_next_key() to legitimately
> return duplicates or skip elements, causing the test to spuriously fail.
> 

the simplest solution should be to provide nelem_hint for the rhashtable.
Also the number of the elements is small, so rehash is not really expected.

> [ ... ]
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions
  2026-05-28 18:42   ` bot+bpf-ci
@ 2026-05-29 17:08     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-29 17:08 UTC (permalink / raw)
  To: bot+bpf-ci, bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	memxor, herbert
  Cc: yatsenko, martin.lau, yonghong.song, clm, ihor.solodrai



On 5/28/26 7:42 PM, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 3dd9b4924ae4..8944c9f070ca 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
> 
> [ ... ]
> 
>> +static void rhtab_free_elem(void *ptr, void *arg)
>> +{
>> +	struct bpf_rhtab *rhtab = arg;
>> +	struct rhtab_elem *elem = ptr;
>> +
>> +	bpf_map_free_internal_structs(&rhtab->map, rhtab_elem_value(elem, rhtab->map.key_size));
>> +	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
>> +}
> 
> bpf_map_free_internal_structs only handles timers, workqueues, and
> task_work (not kptrs, list_heads, rb_roots, or refcounted objects). In
> contrast, htab's htab_elem_free calls check_and_free_fields which
> eventually invokes bpf_obj_free_fields for the full field list. As a
> result, when this restriction is later lifted, rhtab_free_elem would
> leak kptrs and skip list/rbtree drains during map teardown.
> 
> The issue is latent at this commit because syscall.c map_check_btf
> rejects all those field types for RHASH maps (map->record is NULL), but
> the follow-up commit that enables the field types must rewrite this
> entire function and add a memalloc destructor with bpf_obj_free_fields.
> 

I'll remove bpf_map_free_internal_structs() from this commit.

> [ ... ]
> 
>> +static int rhtab_delete_elem(struct bpf_rhtab *rhtab, struct rhtab_elem *elem, void *copy,
>> +			     u64 flags)
>> +{
>> +	int err;
>> +
>> +	/*
>> +	 * disable_instrumentation() mitigates the deadlock for programs running in NMI context.
>> +	 * rhashtable locks bucket with local_irq_save(). Only NMI programs may reenter
>> +	 * rhashtable code, bpf_disable_instrumentation() disables programs running in NMI, except
>> +	 * raw tracepoints, which we don't have in rhashtable.
>> +	 */
>> +	bpf_disable_instrumentation();
>> +	err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params);
>> +	bpf_enable_instrumentation();
> 
> Does bpf_disable_instrumentation prevent the softirq deadlock scenario?
> 
> The v4 bot review noted that if a syscall holds the bucket lock and
> is interrupted by hardware IRQ that schedules softirq (e.g., XDP or TC
> program), the softirq would deadlock trying to acquire the same bucket
> lock. bpf_disable_instrumentation only prevents NMI-based tracing
> programs, not softirq programs triggered by hardware IRQs.

bucket lock does local_irq_save(), so XDP or TC programs 
cannot run during the critical section, only NMI can, which is 
handled by the bpf_disable_instrumentation().

> 
> Alexei Starovoitov suggested in v1 that update/delete operations should
> check 'if (irqs_disabled()) return -EBUSY' to prevent this class of
> deadlock. That check is not present in this version.
> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
>> +
>> +	if (err)
>> +		return err;
>> +
>> +	if (copy) {
>> +		rhtab_read_elem_value(&rhtab->map, copy, elem, flags);
>> +		check_and_init_map_value(&rhtab->map, copy);
>> +	}
>> +
>> +	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
>> +	return 0;
>> +}
> 
> The element is freed with bpf_mem_cache_free_rcu without calling
> bpf_obj_free_fields or check_and_free_fields on the value. Compare with
> htab's free_htab_elem which calls check_and_free_fields before freeing.
> 
> Once special fields are enabled for RHASH maps (by a later commit),
> deleting an element with kptrs, list_heads, or rb_roots would leak those
> resources. The later commit that enables special fields adds
> rhtab_check_and_free_fields here.

When special fields are enabled, they are handled by this code as well.

> 
> [ ... ]
> 
>> +static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
>> +				      u64 map_flags)
>> +{
>> +	void *old_val = rhtab_elem_value(elem, map->key_size);
>> +
>> +	if (map_flags & BPF_NOEXIST)
>> +		return -EEXIST;
>> +
>> +	if (map_flags & BPF_F_LOCK)
>> +		copy_map_value_locked(map, old_val, value, false);
>> +	else
>> +		copy_map_value(map, old_val, value);
>> +	return 0;
>> +}
> 
> copy_map_value deliberately skips special-field offsets so it doesn't
> clobber kptr/timer/list_head pointers. After this in-place update
> returns, the element's value contains the new scalar bytes but still
> holds the OLD kptrs, timers, and work-queue handles at the special
> offsets.
> 
> Arraymap and htab both invoke bpf_obj_free_fields or cancel timers on
> the old value after the scalar copy to release the per-element
> resources. The follow-up commit that enables special fields adds
> rhtab_check_and_free_fields after the copy to match this pattern.
> 

In this commit special fields are not yet supported. There is a separate
commit implementing it.

> [ ... ]
> 
>> +static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags)
>> +{
>> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
>> +	struct rhtab_elem *elem, *tmp;
>> +
>> +	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
>> +		return -EINVAL;
>> +
>> +	if ((map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK))
>> +		return -EINVAL;
> 
> Similar deadlock question as rhtab_delete_elem: does
> bpf_disable_instrumentation (used later in this function) prevent the
> softirq deadlock where a syscall holds the bucket lock and is interrupted
> by hardware IRQ that schedules softirq?
> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
> [ ... ]
> 
>> +	elem = bpf_mem_cache_alloc(&rhtab->ma);
>> +	if (!elem)
>> +		return -ENOMEM;
>> +
>> +	memcpy(elem->data, key, map->key_size);
>> +	copy_map_value(map, rhtab_elem_value(elem, map->key_size), value);
> 
> bpf_mem_cache_alloc returns memory that is not zeroed at special-field
> offsets (the cache may recycle freed objects). copy_map_value
> deliberately skips those offsets. The freshly inserted element's value
> therefore has uninitialized bytes at every special-field offset (timer
> pointer, kptr, list_head, rb_root, etc.).
> 
> The canonical pattern is to call check_and_init_map_value after
> copy_map_value to zero-initialize or call the field-specific init for
> every special-field offset. htab does this in alloc_htab_elem and
> pcpu_init_value. The follow-up commit adds check_and_init_map_value
> here.

Right, the missing pieces for special fields support are added later.

> 
> [ ... ]
> 
>> +
>> +BTF_ID_LIST_SINGLE(rhtab_map_btf_ids, struct, bpf_rhtab)
>> +const struct bpf_map_ops rhtab_map_ops = {
>> +	.map_meta_equal = bpf_map_meta_equal,
>> +	.map_alloc_check = rhtab_map_alloc_check,
>> +	.map_alloc = rhtab_map_alloc,
>> +	.map_free = rhtab_map_free,
>> +	.map_get_next_key = rhtab_map_get_next_key,
>> +	.map_release_uref = rhtab_map_free_internal_structs,
>> +	.map_lookup_elem = rhtab_map_lookup_elem,
>> +	.map_lookup_and_delete_elem = rhtab_map_lookup_and_delete_elem,
>> +	.map_update_elem = rhtab_map_update_elem,
>> +	.map_delete_elem = rhtab_map_delete_elem,
>> +	.map_gen_lookup = rhtab_map_gen_lookup,
>> +	.map_btf_id = &rhtab_map_btf_ids[0],
>> +};
> 
> rhtab_map_ops does not define a .map_mem_usage callback. At this commit,
> kernel/bpf/syscall.c __map_create enforces:
> 
>   if (!ops->map_mem_usage)
>     return -EINVAL;
> 
> This check sits before map_alloc. As a result, every attempt to create a
> BPF_MAP_TYPE_RHASH map fails with -EINVAL at the bpf(BPF_MAP_CREATE)
> syscall. No code path inside rhtab_map_alloc / rhtab_map_update_elem /
> etc. is reachable from user space.
> 
> The commit's stated runtime semantics (use rhashtable_lookup_likely for
> lookups, in-place updates) are untestable because no map of this type
> can exist. The bpf_map_show_fdinfo path also relies on
> map->ops->map_mem_usage(map) without a NULL check, so even if the create
> check were bypassed, /proc/[pid]/fdinfo/[map_fd] would NULL-deref.
> 
> The follow-up commit 'bpf: Implement iteration ops for resizable
> hashtab' adds rhtab_map_mem_usage and wires .map_mem_usage =
> rhtab_map_mem_usage in rhtab_map_ops. Each patch must compile and
> introduce a working feature when applied to a kernel that bisects to it.
> This commit is the first to expose BPF_MAP_TYPE_RHASH to user space (it
> touches uapi bpf.h and the syscall.c switch), so this missing callback
> breaks bisection.

I'll move map_mem_usage to this commit. 

> 
> [ ... ]
> 
>> +static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
> 
> rhtab_map_get_next_key always returns -EOPNOTSUPP, meaning userspace
> cannot iterate keys of a BPF_MAP_TYPE_RHASH map via the standard syscall
> API, and tools like bpftool dump cannot list rhtab map contents.
> 
> The series previously added rhashtable_next_key API which provides
> best-effort iteration, but it was never wired up.

Implemented in the next commit.

> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
> [ ... ]
> 
> Multiple map_ops are registered as functional callbacks but are stubs
> that were removed in v5. However, in v5 many callbacks that should be
> present are missing entirely (map_seq_show_elem, map_for_each_callback,
> map_mem_usage, BATCH_OPS, iter_seq_info).
> 
> While returning -EOPNOTSUPP is graceful (no crash), registering ops
> pointers that fail rather than relying on the dispatcher's NULL check
> defeats the verifier's compile-time validation: the verifier will accept
> BPF programs that call bpf_for_each_map_elem on an rhtab map, only to
> fail at runtime. Should these callbacks be NULL instead?
> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
> [ ... ]
> 
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 8d4ea700aac6..12c07310016a 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
> 
> [ ... syscall.c changes for map_extra and map_lookup_and_delete_elem allowlist ... ]
> 
> The commit message states: 'Updates modify values in place under RCU
> rather than allocating a new element... This trades read consistency for
> performance: concurrent readers may see partial updates. Users requiring
> consistent reads should use BPF_F_LOCK.'
> 
> The rhtab implementation in kernel/bpf/hashtab.c is wired for BPF_F_LOCK:
> rhtab_map_update_elem checks btf_record_has_field(map->record,
> BPF_SPIN_LOCK), and rhtab_read_elem_value / rhtab_map_update_existing
> call copy_map_value_locked when BPF_F_LOCK is set.
> 
> However, this patch does not add BPF_MAP_TYPE_RHASH to the BPF_SPIN_LOCK
> allow-list in map_check_btf at kernel/bpf/syscall.c (the case
> BPF_SPIN_LOCK block that lists HASH, ARRAY, CGROUP_STORAGE, SK_STORAGE,
> INODE_STORAGE, TASK_STORAGE, CGRP_STORAGE).
> 
> When a user attempts to create an RHASH map whose value type contains a
> struct bpf_spin_lock (the only way to enable BPF_F_LOCK on a map),
> __map_create calls map_check_btf, the BPF_SPIN_LOCK case checks the
> allow-list, does not find BPF_MAP_TYPE_RHASH, and returns -EOPNOTSUPP.
> The map cannot be created.
> 
> Therefore the BPF_F_LOCK code paths in rhtab_map_* are unreachable in
> this commit, contradicting the commit message. A user bisecting a tree
> at this commit and trying to use the documented BPF_F_LOCK feature on an
> RHASH map will get -EOPNOTSUPP at map creation time.
> 
> The follow-up commit 'bpf: Allow special fields in resizable hashtab'
> adds BPF_MAP_TYPE_RHASH to both the BPF_SPIN_LOCK and
> BPF_TIMER/BPF_WORKQUEUE/BPF_TASK_WORK switch cases in map_check_btf.
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26592533594


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab
  2026-05-28 18:28   ` sashiko-bot
@ 2026-05-29 17:53     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-05-29 17:53 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf



On 5/28/26 7:28 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
> - [Critical] Missing error pointer check in `rhtab_map_get_next_key` leads to unhandled error pointer dereferences and kernel panics.
> - [High] `bpf_each_rhash_elem` illegally calls `cond_resched()` and drops `rcu_read_lock()` from within a BPF map iterator callback, which will trigger kernel panics in atomic contexts.
> - [High] Batch lookup and `seq_file` iterators prematurely terminate when encountering concurrent deletions, breaking BPF map iteration guarantees and causing silent data omission.
> --
> 
> commit afc331365f30658242015d82e7bdd35a014f432d
> Author: Mykyta Yatsenko <yatsenko@meta.com>
> 
> bpf: Implement iteration ops for resizable hashtab
> 
> Implement rhtab_map_get_next_key(), batch lookup/lookup-and-delete,
> and the seq_file BPF iterator for BPF_MAP_TYPE_RHASH.
> 
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 8944c9f070ca8..a03fa5fa6545b 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -3021,10 +3021,363 @@ static void rhtab_map_free_internal_structs(struct bpf_map *map)
>>  }
>>  
>>  static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>> +	__must_hold_shared(RCU)
>>  {
>> -	return -EOPNOTSUPP;
>> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
>> +	struct rhtab_elem *elem;
>> +
>> +	elem = rhashtable_next_key(&rhtab->ht, key, rhtab_params);
>> +
>> +	/* if not found, return the first key */
>> +	if (PTR_ERR(elem) == -ENOENT)
>> +		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
>> +
>> +	if (!elem)
>> +		return -ENOENT;
>> +
>> +	memcpy(next_key, elem->data, map->key_size);
> 
> [Severity: Critical]
> Could this dereference an error pointer if rhashtable_next_key() returns
> ERR_PTR(-EAGAIN)? 
> 
> The first check only handles -ENOENT. If we get -EAGAIN (which can happen
> during concurrent resizing), it bypasses the PTR_ERR check, passes the
> !elem check since it isn't NULL, and then passes the error pointer directly
> into memcpy().
> 

-EAGAIN is never returned from rhashtable_next_key(), but the error check
can be made more reliable. I'll address this for the v6

>> +	return 0;
>> +}
> 
> [ ... ]
> 
>> +static long bpf_each_rhash_elem(struct bpf_map *map, bpf_callback_t callback_fn,
>> +				void *callback_ctx, u64 flags)
>> +{
>> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
>> +	void *prev_key = NULL;
> 
> [ ... ]
> 
>> +	rcu_read_lock();
>> +	/*
>> +	 * Best-effort iteration: if rhashtable is concurrently resized or
>> +	 * elements are deleted/inserted, there may be missed or duplicate
>> +	 * elements visited.
>> +	 */
>> +	while ((elem = rhashtable_next_key(&rhtab->ht, prev_key, rhtab_params))) {
>> +		if (IS_ERR(elem))
>> +			break;
>> +		num_elems++;
>> +		ret = callback_fn((u64)(long)map,
>> +				  (u64)(long)elem->data,
>> +				  (u64)(long)rhtab_elem_value(elem, map->key_size),
>> +				  (u64)(long)callback_ctx, 0);
>> +		if (ret)
>> +			break;
>> +
>> +		prev_key = elem->data;	/* valid while RCU held */
>> +
>> +		if (need_resched()) {
>> +			memcpy(key_buf, elem->data, map->key_size);
>> +			rcu_read_unlock();
>> +			cond_resched();
> 
> [Severity: High]
> Can this cond_resched() trigger "BUG: scheduling while atomic" panics?
> 
> BPF programs using the bpf_for_each_map_elem helper often run in
> non-sleepable contexts (like XDP, TC, or tracing) under an outer RCU read
> lock or with preemption disabled. 
> 
> While this drops the inner RCU lock, it doesn't exit the outer atomic
> context, so sleeping here might violate those constraints and cause system
> instability.
> 

yes, I'll remove cond_resched() from here. Dropping and re-acquiring rcu lock,
though may still be needed.

>> +			rcu_read_lock();
>> +			prev_key = key_buf;
>> +		}
>> +	}
>> +	rcu_read_unlock();
> 
> [ ... ]
> 
>> +static int __rhtab_map_lookup_and_delete_batch(struct bpf_map *map,
>> +					       const union bpf_attr *attr,
>> +					       union bpf_attr __user *uattr,
>> +					       bool do_delete)
>> +{
> 
> [ ... ]
> 
>> +	rcu_read_lock();
>> +
>> +	/*
>> +	 * Cursor stores the key of the next-to-process element (stashed by
>> +	 * the previous batch). Look it up directly so the element is included
>> +	 * here rather than skipped by next_key(). If the cursor was deleted
>> +	 * concurrently (or by the previous do_delete batch), lookup returns
>> +	 * NULL and userspace must restart from a NULL cursor.
>> +	 */
>> +	if (ubatch)
>> +		elem = rhtab_lookup_elem(map, cursor);
>> +	else
>> +		elem = rhashtable_next_key(&rhtab->ht, NULL, rhtab_params);
>> +
>> +	while (elem && !IS_ERR(elem) && total < max_count) {
> 
> [Severity: High]
> Will this prematurely terminate the batch iteration if a concurrent deletion
> occurs? 
> 
> If the cursor element was concurrently deleted, rhashtable_next_key() returns
> ERR_PTR(-ENOENT). This loop checks !IS_ERR(elem) and will break out early.
> 
> This results in the function returning -ENOENT to userspace. Under BPF UAPI
> semantics, userspace interprets -ENOENT as the end of the map, which might
> cause silent data omission instead of safely recovering or continuing.

Maybe return -EAGAIN in case of cursor deletion, so that userspace will retry.

> 
>> +		memcpy(dst_key, elem->data, key_size);
>> +		rhtab_read_elem_value(map, dst_val, elem, elem_map_flags);
>> +		check_and_init_map_value(map, dst_val);
> 
> [ ... ]
> 
>> +static void *bpf_rhash_map_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> +{
>> +	struct bpf_iter_seq_rhash_map_info *info = seq->private;
>> +	struct rhtab_elem *elem = v;
>> +	void *next;
>> +
>> +	memcpy(info->last_key, elem->data, info->map->key_size);
>> +	info->last_key_valid = true;
>> +	++*pos;
>> +
>> +	next = rhashtable_next_key(&info->rhtab->ht, info->last_key,
>> +				   rhtab_params);
>> +	if (IS_ERR(next))
>> +		return NULL;
> 
> [Severity: High]
> Does this also prematurely terminate the seq_file iteration on concurrent
> deletions? 
> 
> Similar to the batch logic above, returning NULL when next is
> ERR_PTR(-ENOENT) signals the end of the sequence to the iterator, which could
> silently truncate the iteration results.

Yes, in case cursor is deleted, the iteration finishes early.

> 
>> +	return next;
>> +}
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 05/11] bpf: Allow special fields in resizable hashtab
  2026-05-28 18:26   ` bot+bpf-ci
@ 2026-06-01 13:05     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-06-01 13:05 UTC (permalink / raw)
  To: bot+bpf-ci, bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	memxor, herbert
  Cc: yatsenko, martin.lau, yonghong.song, clm, ihor.solodrai



On 5/28/26 7:26 PM, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index a03fa5fa6545..50e9d57b2790 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
> 
> [ ... ]
> 
>> @@ -2943,6 +2965,7 @@ static int rhtab_map_lookup_and_delete_elem(struct bpf_map *map, void *key, void
>>  static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
>>  				      u64 map_flags)
>>  {
>> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
>>  	void *old_val = rhtab_elem_value(elem, map->key_size);
>>
>>  	if (map_flags & BPF_NOEXIST)
>> @@ -2952,6 +2975,17 @@ static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *el
>>  		copy_map_value_locked(map, old_val, value, false);
>>  	else
>>  		copy_map_value(map, old_val, value);
>> +
>> +	/*
>> +	 * Torn reads: a concurrent reader without BPF_F_LOCK may observe
>> +	 * the value mid-copy. Callers requiring consistent reads must use
>> +	 * BPF_F_LOCK, matching arraymap semantics.
>> +	 *
>> +	 * copy_map_value() skips special-field offsets, so old timers/
>> +	 * kptrs/etc. still sit in the slot. Cancel them after the copy
>> +	 * to match arraymap's update semantics.
>> +	 */
>> +	rhtab_check_and_free_fields(rhtab, elem);
>>  	return 0;
>>  }
> 
> In the v4 discussion on lore, a concern was raised about the ordering of
> copy_map_value() and rhtab_check_and_free_fields() operations. When
> copy_map_value() is called, it skips special field offsets, leaving the
> old timer/kptr/etc. active in the slot during the copy.
> 
> Could a concurrent BPF program access those special fields in the window
> between the copy and the free operations? Specifically, in the
> BPF_F_LOCK case where copy_map_value_locked() is used, if a concurrent
> program acquires the spinlock and accesses the value between the copy
> completion and rhtab_check_and_free_fields() call, would it still see
> the old timer/kptr that gets canceled immediately after?

Yes the concurrent BPF program may access timer, wq, task_work, etc that
is immediately canceled, this is by design, what matters is that the operations
are safe.

> 
> The comment mentions matching arraymap's update semantics and acknowledges
> torn reads for the non-locked case, but it's not clear whether the locked
> case fully prevents concurrent access to special fields that are about to
> be freed.
> 
> Reference: https://lore.kernel.org/bpf/252dc423a3bc75b3f4d936e7706085da71939a88a7ef148926c317d463d20a12@mail.kernel.org/
> 
> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26592533594


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable
  2026-05-28 18:25   ` sashiko-bot
@ 2026-06-01 16:38     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-06-01 16:38 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf



On 5/28/26 7:25 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] Trivial hash collisions in `rhtab_hashfn_long` defeat rhashtable's DoS mitigation, leading to infinite rehashing and 100% CPU livelock.
> - [High] Direct dereferencing of potentially unaligned map key pointers in the optimized hash and compare functions will cause kernel panics on strict-alignment architectures.
> --
> 
> commit 40f8c726140c6268e5f797c092818cebd3cfc282
> Author: Mykyta Yatsenko <yatsenko@meta.com>
> 
> bpf: Optimize word-sized keys for resizable hashtable
> 
> This commit specializes the lookup, update, and delete paths for word-sized
> keys in resizable hashtables. It introduces a custom XOR-fold hash function
> and single-word equality comparison to bypass the indirect jhash dispatch.
> 
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
> 
> [ ... ]
> 
>> +/* Specialize hash function and objcmp for long sized key */
>> +static __always_inline int rhtab_key_cmp_long(struct rhashtable_compare_arg *arg,
>> +					      const void *ptr)
>> +{
>> +	const unsigned long key1 = *(const unsigned long *)arg->key;
> 
> [Severity: High]
> Could dereferencing this pointer directly cause alignment faults on
> architectures that require strict alignment? Since BPF programs can pass
> unaligned pointers (such as from packet data in skb->data) directly to map
> helpers, using get_unaligned() might be necessary here to prevent kernel
> panics.
> 

The same problem exists for the htab jhash2. it looks like for the arches that
BPF cares about, this is not an issue.

>> +	const struct rhtab_elem *key2 = ptr;
>> +
>> +	return key1 != *(const unsigned long *)key2->data;
>> +}
>> +
>> +static __always_inline u32 rhtab_hashfn_long(const void *data, u32 len, u32 seed)
>> +{
>> +	u64 k = *(const unsigned long *)data;
>> +
>> +	return (u32)(k ^ (k >> 32)) ^ seed;
> 
> [Severity: High]
> Does applying the seed after folding the key make hash collisions entirely
> independent of the seed?
> 
> If a BPF program or an attacker inserts keys constructed as:
> k = (Y << 32) | (X ^ Y)
> 
> These keys will all fold to exactly X before the XOR with the seed,
> meaning f(key1) ^ seed == f(key2) ^ seed for any random seed.
> 
> If rhashtable detects a long bucket chain and triggers a background rehash
> with a new seed, won't all these colliding keys move together to the same
> new bucket? This might trap the rhashtable worker in an infinite loop of
> rehashing, leading to a 100% CPU livelock.
> 

It looks like the attack is possible in theory: attacker should be in control of
the 8-bytes that rhtab uses as a key. Not sure if we want to fix it or leave
for performance.

>> +}
>> +
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map
  2026-05-28 18:41   ` sashiko-bot
@ 2026-06-01 16:41     ` Mykyta Yatsenko
  0 siblings, 0 replies; 27+ messages in thread
From: Mykyta Yatsenko @ 2026-06-01 16:41 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf



On 5/28/26 7:41 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [Low] Missing file descriptor validation for `bpf_program__fd(prog)` after retrieving the program via `bpf_object__find_program_by_name()`.
> --
> 
> commit 85c8ece43292fc1ae20a51c42265be508a83ec61
> Author: Mykyta Yatsenko <yatsenko@meta.com>
> 
> selftests/bpf: Add basic tests for resizable hash map
>     
> Test basic map operations (lookup, update, delete) for
> BPF_MAP_TYPE_RHASH including boundary conditions like duplicate
> key insertion and deletion of nonexistent keys.
> 
>> diff --git a/tools/testing/selftests/bpf/prog_tests/rhash.c b/tools/testing/selftests/bpf/prog_tests/rhash.c
>> new file mode 100644
>> index 0000000000000..69686bf69ba57
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/rhash.c
> [ ... ]
>> +static void rhash_run(const char *prog_name)
>> +{
>> +	struct rhash *skel;
>> +	struct bpf_program *prog;
>> +	LIBBPF_OPTS(bpf_test_run_opts, opts);
>> +	int err;
>> +
>> +	skel = rhash__open();
>> +	if (!ASSERT_OK_PTR(skel, "rhash__open"))
>> +		return;
>> +
>> +	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
>> +	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
>> +		goto cleanup;
>> +	bpf_program__set_autoload(prog, true);
>> +
>> +	err = rhash__load(skel);
>> +	if (!ASSERT_OK(err, "skel_load"))
>> +		goto cleanup;
>> +
>> +	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
> 
> [Severity: Low]
> Does the file descriptor returned by bpf_program__fd(prog) need to be
> validated here? 
> 

fd should be valid after rhash__load()

> Since prog was obtained via the manual lookup API
> bpf_object__find_program_by_name(), should we verify the fd is valid (>= 0)
> using ASSERT_GE before passing it to bpf_prog_test_run_opts()?
> 
>> +	if (!ASSERT_OK(err, "prog run"))
>> +		goto cleanup;
>> +
>> +	if (!ASSERT_OK(opts.retval, "prog retval"))
>> +		goto cleanup;
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2026-06-01 16:41 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
2026-05-28 18:16   ` sashiko-bot
2026-05-29 16:43     ` Mykyta Yatsenko
2026-05-28 18:26   ` bot+bpf-ci
2026-05-28 17:47 ` [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key() Mykyta Yatsenko
2026-05-28 18:26   ` sashiko-bot
2026-05-29 16:53     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
2026-05-28 18:42   ` bot+bpf-ci
2026-05-29 17:08     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab Mykyta Yatsenko
2026-05-28 18:28   ` sashiko-bot
2026-05-29 17:53     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 05/11] bpf: Allow special fields in " Mykyta Yatsenko
2026-05-28 18:26   ` bot+bpf-ci
2026-06-01 13:05     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable Mykyta Yatsenko
2026-05-28 18:25   ` sashiko-bot
2026-06-01 16:38     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 07/11] libbpf: Support " Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map Mykyta Yatsenko
2026-05-28 18:41   ` sashiko-bot
2026-06-01 16:41     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 09/11] selftests/bpf: Add BPF iterator " Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 10/11] bpftool: Add rhash map documentation Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 11/11] selftests/bpf: Add resizable hashmap to benchmarks Mykyta Yatsenko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.