Re: [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: bot+bpf-ci@kernel.org, bpf@vger.kernel.org, ast@kernel.org,
	andrii@kernel.org, daniel@iogearbox.net, kafai@meta.com,
	kernel-team@meta.com, eddyz87@gmail.com, memxor@gmail.com,
	herbert@gondor.apana.org.au
Cc: yatsenko@meta.com, martin.lau@kernel.org,
	yonghong.song@linux.dev, clm@meta.com, ihor.solodrai@linux.dev
Subject: Re: [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions
Date: Fri, 29 May 2026 18:08:49 +0100	[thread overview]
Message-ID: <05c8bdae-83dc-4456-89dd-8e10360b5f92@gmail.com> (raw)
In-Reply-To: <fd400a6c3c1452ca221fe95e15d160f014605ff826271d4954fc5cb31c08ece9@mail.kernel.org>



On 5/28/26 7:42 PM, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 3dd9b4924ae4..8944c9f070ca 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
> 
> [ ... ]
> 
>> +static void rhtab_free_elem(void *ptr, void *arg)
>> +{
>> +	struct bpf_rhtab *rhtab = arg;
>> +	struct rhtab_elem *elem = ptr;
>> +
>> +	bpf_map_free_internal_structs(&rhtab->map, rhtab_elem_value(elem, rhtab->map.key_size));
>> +	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
>> +}
> 
> bpf_map_free_internal_structs only handles timers, workqueues, and
> task_work (not kptrs, list_heads, rb_roots, or refcounted objects). In
> contrast, htab's htab_elem_free calls check_and_free_fields which
> eventually invokes bpf_obj_free_fields for the full field list. As a
> result, when this restriction is later lifted, rhtab_free_elem would
> leak kptrs and skip list/rbtree drains during map teardown.
> 
> The issue is latent at this commit because syscall.c map_check_btf
> rejects all those field types for RHASH maps (map->record is NULL), but
> the follow-up commit that enables the field types must rewrite this
> entire function and add a memalloc destructor with bpf_obj_free_fields.
> 

I'll remove bpf_map_free_internal_structs() from this commit.

> [ ... ]
> 
>> +static int rhtab_delete_elem(struct bpf_rhtab *rhtab, struct rhtab_elem *elem, void *copy,
>> +			     u64 flags)
>> +{
>> +	int err;
>> +
>> +	/*
>> +	 * disable_instrumentation() mitigates the deadlock for programs running in NMI context.
>> +	 * rhashtable locks bucket with local_irq_save(). Only NMI programs may reenter
>> +	 * rhashtable code, bpf_disable_instrumentation() disables programs running in NMI, except
>> +	 * raw tracepoints, which we don't have in rhashtable.
>> +	 */
>> +	bpf_disable_instrumentation();
>> +	err = rhashtable_remove_fast(&rhtab->ht, &elem->node, rhtab_params);
>> +	bpf_enable_instrumentation();
> 
> Does bpf_disable_instrumentation prevent the softirq deadlock scenario?
> 
> The v4 bot review noted that if a syscall holds the bucket lock and
> is interrupted by hardware IRQ that schedules softirq (e.g., XDP or TC
> program), the softirq would deadlock trying to acquire the same bucket
> lock. bpf_disable_instrumentation only prevents NMI-based tracing
> programs, not softirq programs triggered by hardware IRQs.

bucket lock does local_irq_save(), so XDP or TC programs 
cannot run during the critical section, only NMI can, which is 
handled by the bpf_disable_instrumentation().

> 
> Alexei Starovoitov suggested in v1 that update/delete operations should
> check 'if (irqs_disabled()) return -EBUSY' to prevent this class of
> deadlock. That check is not present in this version.
> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
>> +
>> +	if (err)
>> +		return err;
>> +
>> +	if (copy) {
>> +		rhtab_read_elem_value(&rhtab->map, copy, elem, flags);
>> +		check_and_init_map_value(&rhtab->map, copy);
>> +	}
>> +
>> +	bpf_mem_cache_free_rcu(&rhtab->ma, elem);
>> +	return 0;
>> +}
> 
> The element is freed with bpf_mem_cache_free_rcu without calling
> bpf_obj_free_fields or check_and_free_fields on the value. Compare with
> htab's free_htab_elem which calls check_and_free_fields before freeing.
> 
> Once special fields are enabled for RHASH maps (by a later commit),
> deleting an element with kptrs, list_heads, or rb_roots would leak those
> resources. The later commit that enables special fields adds
> rhtab_check_and_free_fields here.

When special fields are enabled, they are handled by this code as well.

> 
> [ ... ]
> 
>> +static long rhtab_map_update_existing(struct bpf_map *map, struct rhtab_elem *elem, void *value,
>> +				      u64 map_flags)
>> +{
>> +	void *old_val = rhtab_elem_value(elem, map->key_size);
>> +
>> +	if (map_flags & BPF_NOEXIST)
>> +		return -EEXIST;
>> +
>> +	if (map_flags & BPF_F_LOCK)
>> +		copy_map_value_locked(map, old_val, value, false);
>> +	else
>> +		copy_map_value(map, old_val, value);
>> +	return 0;
>> +}
> 
> copy_map_value deliberately skips special-field offsets so it doesn't
> clobber kptr/timer/list_head pointers. After this in-place update
> returns, the element's value contains the new scalar bytes but still
> holds the OLD kptrs, timers, and work-queue handles at the special
> offsets.
> 
> Arraymap and htab both invoke bpf_obj_free_fields or cancel timers on
> the old value after the scalar copy to release the per-element
> resources. The follow-up commit that enables special fields adds
> rhtab_check_and_free_fields after the copy to match this pattern.
> 

In this commit special fields are not yet supported. There is a separate
commit implementing it.

> [ ... ]
> 
>> +static long rhtab_map_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags)
>> +{
>> +	struct bpf_rhtab *rhtab = container_of(map, struct bpf_rhtab, map);
>> +	struct rhtab_elem *elem, *tmp;
>> +
>> +	if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST))
>> +		return -EINVAL;
>> +
>> +	if ((map_flags & BPF_F_LOCK) && !btf_record_has_field(map->record, BPF_SPIN_LOCK))
>> +		return -EINVAL;
> 
> Similar deadlock question as rhtab_delete_elem: does
> bpf_disable_instrumentation (used later in this function) prevent the
> softirq deadlock where a syscall holds the bucket lock and is interrupted
> by hardware IRQ that schedules softirq?
> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
> [ ... ]
> 
>> +	elem = bpf_mem_cache_alloc(&rhtab->ma);
>> +	if (!elem)
>> +		return -ENOMEM;
>> +
>> +	memcpy(elem->data, key, map->key_size);
>> +	copy_map_value(map, rhtab_elem_value(elem, map->key_size), value);
> 
> bpf_mem_cache_alloc returns memory that is not zeroed at special-field
> offsets (the cache may recycle freed objects). copy_map_value
> deliberately skips those offsets. The freshly inserted element's value
> therefore has uninitialized bytes at every special-field offset (timer
> pointer, kptr, list_head, rb_root, etc.).
> 
> The canonical pattern is to call check_and_init_map_value after
> copy_map_value to zero-initialize or call the field-specific init for
> every special-field offset. htab does this in alloc_htab_elem and
> pcpu_init_value. The follow-up commit adds check_and_init_map_value
> here.

Right, the missing pieces for special fields support are added later.

> 
> [ ... ]
> 
>> +
>> +BTF_ID_LIST_SINGLE(rhtab_map_btf_ids, struct, bpf_rhtab)
>> +const struct bpf_map_ops rhtab_map_ops = {
>> +	.map_meta_equal = bpf_map_meta_equal,
>> +	.map_alloc_check = rhtab_map_alloc_check,
>> +	.map_alloc = rhtab_map_alloc,
>> +	.map_free = rhtab_map_free,
>> +	.map_get_next_key = rhtab_map_get_next_key,
>> +	.map_release_uref = rhtab_map_free_internal_structs,
>> +	.map_lookup_elem = rhtab_map_lookup_elem,
>> +	.map_lookup_and_delete_elem = rhtab_map_lookup_and_delete_elem,
>> +	.map_update_elem = rhtab_map_update_elem,
>> +	.map_delete_elem = rhtab_map_delete_elem,
>> +	.map_gen_lookup = rhtab_map_gen_lookup,
>> +	.map_btf_id = &rhtab_map_btf_ids[0],
>> +};
> 
> rhtab_map_ops does not define a .map_mem_usage callback. At this commit,
> kernel/bpf/syscall.c __map_create enforces:
> 
>   if (!ops->map_mem_usage)
>     return -EINVAL;
> 
> This check sits before map_alloc. As a result, every attempt to create a
> BPF_MAP_TYPE_RHASH map fails with -EINVAL at the bpf(BPF_MAP_CREATE)
> syscall. No code path inside rhtab_map_alloc / rhtab_map_update_elem /
> etc. is reachable from user space.
> 
> The commit's stated runtime semantics (use rhashtable_lookup_likely for
> lookups, in-place updates) are untestable because no map of this type
> can exist. The bpf_map_show_fdinfo path also relies on
> map->ops->map_mem_usage(map) without a NULL check, so even if the create
> check were bypassed, /proc/[pid]/fdinfo/[map_fd] would NULL-deref.
> 
> The follow-up commit 'bpf: Implement iteration ops for resizable
> hashtab' adds rhtab_map_mem_usage and wires .map_mem_usage =
> rhtab_map_mem_usage in rhtab_map_ops. Each patch must compile and
> introduce a working feature when applied to a kernel that bisects to it.
> This commit is the first to expose BPF_MAP_TYPE_RHASH to user space (it
> touches uapi bpf.h and the syscall.c switch), so this missing callback
> breaks bisection.

I'll move map_mem_usage to this commit. 

> 
> [ ... ]
> 
>> +static int rhtab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>> +{
>> +	return -EOPNOTSUPP;
>> +}
> 
> rhtab_map_get_next_key always returns -EOPNOTSUPP, meaning userspace
> cannot iterate keys of a BPF_MAP_TYPE_RHASH map via the standard syscall
> API, and tools like bpftool dump cannot list rhtab map contents.
> 
> The series previously added rhashtable_next_key API which provides
> best-effort iteration, but it was never wired up.

Implemented in the next commit.

> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
> [ ... ]
> 
> Multiple map_ops are registered as functional callbacks but are stubs
> that were removed in v5. However, in v5 many callbacks that should be
> present are missing entirely (map_seq_show_elem, map_for_each_callback,
> map_mem_usage, BATCH_OPS, iter_seq_info).
> 
> While returning -EOPNOTSUPP is graceful (no crash), registering ops
> pointers that fail rather than relying on the dispatcher's NULL check
> defeats the verifier's compile-time validation: the verifier will accept
> BPF programs that call bpf_for_each_map_elem on an rhtab map, only to
> fail at runtime. Should these callbacks be NULL instead?
> 
> Reference: https://lore.kernel.org/bpf/faddde9d9654bea549cb0acf434a4dcd06217d795a0717af028a61e453ce8ea6@mail.kernel.org/
> 
> [ ... ]
> 
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 8d4ea700aac6..12c07310016a 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
> 
> [ ... syscall.c changes for map_extra and map_lookup_and_delete_elem allowlist ... ]
> 
> The commit message states: 'Updates modify values in place under RCU
> rather than allocating a new element... This trades read consistency for
> performance: concurrent readers may see partial updates. Users requiring
> consistent reads should use BPF_F_LOCK.'
> 
> The rhtab implementation in kernel/bpf/hashtab.c is wired for BPF_F_LOCK:
> rhtab_map_update_elem checks btf_record_has_field(map->record,
> BPF_SPIN_LOCK), and rhtab_read_elem_value / rhtab_map_update_existing
> call copy_map_value_locked when BPF_F_LOCK is set.
> 
> However, this patch does not add BPF_MAP_TYPE_RHASH to the BPF_SPIN_LOCK
> allow-list in map_check_btf at kernel/bpf/syscall.c (the case
> BPF_SPIN_LOCK block that lists HASH, ARRAY, CGROUP_STORAGE, SK_STORAGE,
> INODE_STORAGE, TASK_STORAGE, CGRP_STORAGE).
> 
> When a user attempts to create an RHASH map whose value type contains a
> struct bpf_spin_lock (the only way to enable BPF_F_LOCK on a map),
> __map_create calls map_check_btf, the BPF_SPIN_LOCK case checks the
> allow-list, does not find BPF_MAP_TYPE_RHASH, and returns -EOPNOTSUPP.
> The map cannot be created.
> 
> Therefore the BPF_F_LOCK code paths in rhtab_map_* are unreachable in
> this commit, contradicting the commit message. A user bisecting a tree
> at this commit and trying to use the documented BPF_F_LOCK feature on an
> RHASH map will get -EOPNOTSUPP at map creation time.
> 
> The follow-up commit 'bpf: Allow special fields in resizable hashtab'
> adds BPF_MAP_TYPE_RHASH to both the BPF_SPIN_LOCK and
> BPF_TIMER/BPF_WORKQUEUE/BPF_TASK_WORK switch cases in map_check_btf.
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26592533594

next prev parent reply	other threads:[~2026-05-29 17:08 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-28 17:47 [PATCH bpf-next v5 00/11] bpf: Introduce resizable hash map Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 01/11] rhashtable: Add rhashtable_next_key() API Mykyta Yatsenko
2026-05-28 18:16   ` sashiko-bot
2026-05-29 16:43     ` Mykyta Yatsenko
2026-05-28 18:26   ` bot+bpf-ci
2026-05-28 17:47 ` [PATCH bpf-next v5 02/11] rhashtable: Add selftest for rhashtable_next_key() Mykyta Yatsenko
2026-05-28 18:26   ` sashiko-bot
2026-05-29 16:53     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 03/11] bpf: Implement resizable hashmap basic functions Mykyta Yatsenko
2026-05-28 18:42   ` bot+bpf-ci
2026-05-29 17:08     ` Mykyta Yatsenko [this message]
2026-05-28 17:47 ` [PATCH bpf-next v5 04/11] bpf: Implement iteration ops for resizable hashtab Mykyta Yatsenko
2026-05-28 18:28   ` sashiko-bot
2026-05-29 17:53     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 05/11] bpf: Allow special fields in " Mykyta Yatsenko
2026-05-28 18:26   ` bot+bpf-ci
2026-06-01 13:05     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 06/11] bpf: Optimize word-sized keys for resizable hashtable Mykyta Yatsenko
2026-05-28 18:25   ` sashiko-bot
2026-06-01 16:38     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 07/11] libbpf: Support " Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 08/11] selftests/bpf: Add basic tests for resizable hash map Mykyta Yatsenko
2026-05-28 18:41   ` sashiko-bot
2026-06-01 16:41     ` Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 09/11] selftests/bpf: Add BPF iterator " Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 10/11] bpftool: Add rhash map documentation Mykyta Yatsenko
2026-05-28 17:47 ` [PATCH bpf-next v5 11/11] selftests/bpf: Add resizable hashmap to benchmarks Mykyta Yatsenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=05c8bdae-83dc-4456-89dd-8e10360b5f92@gmail.com \
    --to=mykyta.yatsenko5@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bot+bpf-ci@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=clm@meta.com \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=herbert@gondor.apana.org.au \
    --cc=ihor.solodrai@linux.dev \
    --cc=kafai@meta.com \
    --cc=kernel-team@meta.com \
    --cc=martin.lau@kernel.org \
    --cc=memxor@gmail.com \
    --cc=yatsenko@meta.com \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.