Linux Kernel Selftest development
 help / color / mirror / Atom feed
* [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps
@ 2026-06-22  7:16 Gyutae Bae
  2026-06-22  7:16 ` [RFC bpf-next 1/3] bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI Gyutae Bae
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Gyutae Bae @ 2026-06-22  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
  Cc: John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, linux-kselftest, linux-kernel,
	Minsu Jeon, Siwan Kim, Jonghyeon Kim, Gyutae Bae

From: Gyutae Bae <gyutae.bae@navercorp.com>

This series adds an atomic compare-and-delete primitive to BPF hash
maps, motivated by a TOCTOU race in Cilium's conntrack GC [1]: the
batched GC snapshots CT entries, decides which expired, then deletes
them by key in a later syscall; between snapshot and delete the
datapath can refresh the same entry, so a live entry is deleted. A
userspace re-check before delete can't close it (lookup and delete are
separate, individually bucket-locked calls).

BPF_F_COMPARE lets userspace delete a key only if a chosen value region
is unchanged, with the compare and the delete done atomically under the
hash bucket lock:

    attr.flags |= BPF_F_COMPARE;
    attr.compare = <expected>;
    attr.compare_offset = <off>;
    attr.compare_size = <len>;

mismatch -> -EBUSY, absent -> -ENOENT, unsupported map -> -EOPNOTSUPP.
The compare* fields without the flag are rejected (-EINVAL) so a dropped
flag can't silently become an unconditional delete; maps whose value
carries BTF-managed fields (spin_lock/timer/kptr/...) are rejected
(-EOPNOTSUPP) since those bytes are sanitised on lookup.

Atomicity boundary (please scrutinise): the compare is atomic vs every
bucket-lock holder, but NOT vs a BPF program writing the value in place
via the pointer from bpf_map_lookup_elem() (no bucket lock). It
collapses the race window from the whole GC batch to one bucket-locked
critical section; full closure wants the compared region treated as a
synchronization variable (e.g. a monotonic revision). The selftest
models this.

Scope of this RFC: per-element compare-and-delete on BPF_MAP_TYPE_HASH
only. Deferred (will follow once the approach is agreed): batch delete +
its attr fields, a libbpf wrapper, LRU-hash and other map types, a
compare-and-swap *update*.

Open questions:
  - flag name: BPF_F_COMPARE vs something else?
  - mismatch errno: -EBUSY vs -EAGAIN?
  - new ->map_delete_elem_cmp() op vs extending ->map_delete_elem?

[1] https://github.com/cilium/cilium/issues/46298

Gyutae Bae (3):
  bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI
  bpf: implement compare-and-delete (BPF_F_COMPARE) for
    BPF_MAP_TYPE_HASH
  selftests/bpf: test BPF_F_COMPARE compare-and-delete

 include/linux/bpf.h                           |   2 +
 include/uapi/linux/bpf.h                      |   6 +-
 kernel/bpf/hashtab.c                          |  39 +++++++
 kernel/bpf/syscall.c                          |  54 ++++++++-
 tools/include/uapi/linux/bpf.h                |   6 +-
 .../selftests/bpf/prog_tests/map_cmp_delete.c | 106 ++++++++++++++++++
 6 files changed, 208 insertions(+), 5 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c


base-commit: a975094bf98ca97be9146f9d3b5681a6f9cf5ce3
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC bpf-next 1/3] bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI
  2026-06-22  7:16 [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Gyutae Bae
@ 2026-06-22  7:16 ` Gyutae Bae
  2026-06-22  7:16 ` [RFC bpf-next 2/3] bpf: implement compare-and-delete (BPF_F_COMPARE) for BPF_MAP_TYPE_HASH Gyutae Bae
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Gyutae Bae @ 2026-06-22  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
  Cc: John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, linux-kselftest, linux-kernel,
	Minsu Jeon, Siwan Kim, Jonghyeon Kim, Gyutae Bae

From: Gyutae Bae <gyutae.bae@navercorp.com>

Introduce the UAPI surface for a compare-and-delete primitive: a new map
flag BPF_F_COMPARE plus compare/compare_offset/compare_size fields let
a delete request name an expected value region.  Behaviour is wired up in
the following patch; this adds the flag and fields (kernel + tools/ copy)
only.

Co-developed-by: Minsu Jeon <minsu.jeon@navercorp.com>
Signed-off-by: Minsu Jeon <minsu.jeon@navercorp.com>
Co-developed-by: Siwan Kim <siwan.kim@navercorp.com>
Signed-off-by: Siwan Kim <siwan.kim@navercorp.com>
Co-developed-by: Jonghyeon Kim <jong-hyeon.kim@navercorp.com>
Signed-off-by: Jonghyeon Kim <jong-hyeon.kim@navercorp.com>
Signed-off-by: Gyutae Bae <gyutae.bae@navercorp.com>
---
 include/uapi/linux/bpf.h       | 6 +++++-
 tools/include/uapi/linux/bpf.h | 6 +++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 89b36de5fdbb..4705b02fb6a4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1397,7 +1397,7 @@ enum bpf_addr_space_cast {
 	BPF_ADDR_SPACE_CAST = 1,
 };
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM / BPF_MAP_DELETE_ELEM commands */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
 	BPF_NOEXIST	= 1, /* create new element if it didn't exist */
@@ -1405,6 +1405,7 @@ enum {
 	BPF_F_LOCK	= 4, /* spin_lock-ed map_lookup/map_update */
 	BPF_F_CPU	= 8, /* cpu flag for percpu maps, upper 32-bit of flags is a cpu number */
 	BPF_F_ALL_CPUS	= 16, /* update value across all CPUs for percpu maps */
+	BPF_F_COMPARE	= 32, /* compare-and-delete: delete elem only if value region == compare */
 };
 
 /* flags for BPF_MAP_CREATE command */
@@ -1586,6 +1587,9 @@ union bpf_attr {
 			__aligned_u64 next_key;
 		};
 		__u64		flags;
+		__aligned_u64	compare;	/* user ptr to expected bytes (BPF_F_COMPARE) */
+		__u32		compare_offset;	/* start offset within value */
+		__u32		compare_size;	/* bytes to compare; 0 == whole value */
 	};
 
 	struct { /* struct used by BPF_MAP_*_BATCH commands */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 89b36de5fdbb..4705b02fb6a4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1397,7 +1397,7 @@ enum bpf_addr_space_cast {
 	BPF_ADDR_SPACE_CAST = 1,
 };
 
-/* flags for BPF_MAP_UPDATE_ELEM command */
+/* flags for BPF_MAP_UPDATE_ELEM / BPF_MAP_DELETE_ELEM commands */
 enum {
 	BPF_ANY		= 0, /* create new element or update existing */
 	BPF_NOEXIST	= 1, /* create new element if it didn't exist */
@@ -1405,6 +1405,7 @@ enum {
 	BPF_F_LOCK	= 4, /* spin_lock-ed map_lookup/map_update */
 	BPF_F_CPU	= 8, /* cpu flag for percpu maps, upper 32-bit of flags is a cpu number */
 	BPF_F_ALL_CPUS	= 16, /* update value across all CPUs for percpu maps */
+	BPF_F_COMPARE	= 32, /* compare-and-delete: delete elem only if value region == compare */
 };
 
 /* flags for BPF_MAP_CREATE command */
@@ -1586,6 +1587,9 @@ union bpf_attr {
 			__aligned_u64 next_key;
 		};
 		__u64		flags;
+		__aligned_u64	compare;	/* user ptr to expected bytes (BPF_F_COMPARE) */
+		__u32		compare_offset;	/* start offset within value */
+		__u32		compare_size;	/* bytes to compare; 0 == whole value */
 	};
 
 	struct { /* struct used by BPF_MAP_*_BATCH commands */
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC bpf-next 2/3] bpf: implement compare-and-delete (BPF_F_COMPARE) for BPF_MAP_TYPE_HASH
  2026-06-22  7:16 [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Gyutae Bae
  2026-06-22  7:16 ` [RFC bpf-next 1/3] bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI Gyutae Bae
@ 2026-06-22  7:16 ` Gyutae Bae
  2026-06-22  7:16 ` [RFC bpf-next 3/3] selftests/bpf: test BPF_F_COMPARE compare-and-delete Gyutae Bae
  2026-06-22 22:32 ` [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Alexei Starovoitov
  3 siblings, 0 replies; 6+ messages in thread
From: Gyutae Bae @ 2026-06-22  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
  Cc: John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, linux-kselftest, linux-kernel,
	Minsu Jeon, Siwan Kim, Jonghyeon Kim, Gyutae Bae

From: Gyutae Bae <gyutae.bae@navercorp.com>

map_delete_elem() now honours BPF_F_COMPARE: it copies the expected
bytes, validates the compare window against value_size, and dispatches to a
new ->map_delete_elem_cmp().  The hash-map implementation, under the bucket
lock, memcmps the [off, off+size) window of the live value and unlinks only
on a match (mismatch -EBUSY, absent -ENOENT), making the compare and the
delete atomic w.r.t. every bucket-lock holder.  Only BPF_MAP_TYPE_HASH
wires up the op; every other map type lacks it and returns -EOPNOTSUPP.

Atomicity boundary: the compare is read under the bucket lock, so it is
atomic against every other bucket-lock holder (concurrent deletes,
element-replacing updates).  It is NOT mutual exclusion against a BPF
program that writes the value in place via the pointer from
bpf_map_lookup_elem(), which takes no bucket lock; against such a writer
the compare may observe a torn value within the locked section.  For a
meaningful result the compared region should be a synchronization variable
-- written with a single aligned store such as a monotonic revision, or
replaced via an update -- not a multi-field blob mutated in place.

It compares the raw stored value, so maps whose value carries BTF-managed
fields (bpf_spin_lock, bpf_timer, kptr, ...) are also rejected with
-EOPNOTSUPP: those bytes are sanitised on lookup, so a raw compare would
never match the caller's snapshot and could expose kernel-internal bytes.

BPF_MAP_DELETE_ELEM_LAST_FIELD now reaches compare_size, so the compare*
fields pass CHECK_ATTR even without BPF_F_COMPARE.  Reject that combination
with -EINVAL: a dropped flag must not silently turn a compare-and-delete
into an unconditional delete.

Co-developed-by: Minsu Jeon <minsu.jeon@navercorp.com>
Signed-off-by: Minsu Jeon <minsu.jeon@navercorp.com>
Co-developed-by: Siwan Kim <siwan.kim@navercorp.com>
Signed-off-by: Siwan Kim <siwan.kim@navercorp.com>
Co-developed-by: Jonghyeon Kim <jong-hyeon.kim@navercorp.com>
Signed-off-by: Jonghyeon Kim <jong-hyeon.kim@navercorp.com>
Signed-off-by: Gyutae Bae <gyutae.bae@navercorp.com>
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/hashtab.c | 39 ++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c | 54 +++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 92 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 7719f6528445..2e858272f7a7 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -110,6 +110,8 @@ struct bpf_map_ops {
 	void *(*map_lookup_elem)(struct bpf_map *map, void *key);
 	long (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 flags);
 	long (*map_delete_elem)(struct bpf_map *map, void *key);
+	long (*map_delete_elem_cmp)(struct bpf_map *map, void *key,
+				    const void *compare, u32 off, u32 size);
 	long (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
 	long (*map_pop_elem)(struct bpf_map *map, void *value);
 	long (*map_peek_elem)(struct bpf_map *map, void *value);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 9f394e1aa2e8..8fc3ab938b9e 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1534,6 +1534,44 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
 	return ret;
 }
 
+static long htab_map_delete_elem_cmp(struct bpf_map *map, void *key,
+				     const void *compare, u32 off, u32 size)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct hlist_nulls_head *head;
+	struct bucket *b;
+	struct htab_elem *l;
+	unsigned long flags;
+	u32 hash, key_size;
+	int ret;
+
+	WARN_ON_ONCE(!bpf_rcu_lock_held());
+
+	key_size = map->key_size;
+
+	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	b = __select_bucket(htab, hash);
+	head = &b->head;
+
+	ret = htab_lock_bucket(b, &flags);
+	if (ret)
+		return ret;
+
+	l = lookup_elem_raw(head, hash, key, key_size);
+	if (!l)
+		ret = -ENOENT;
+	else if (memcmp(htab_elem_value(l, key_size) + off, compare, size) != 0)
+		ret = -EBUSY;
+	else
+		hlist_nulls_del_rcu(&l->hash_node);
+
+	htab_unlock_bucket(b, flags);
+
+	if (!ret)
+		free_htab_elem(htab, l);	/* ret==0 implies we unlinked l */
+	return ret;
+}
+
 static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
@@ -2366,6 +2404,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_lookup_and_delete_elem = htab_map_lookup_and_delete_elem,
 	.map_update_elem = htab_map_update_elem,
 	.map_delete_elem = htab_map_delete_elem,
+	.map_delete_elem_cmp = htab_map_delete_elem_cmp,
 	.map_gen_lookup = htab_map_gen_lookup,
 	.map_seq_show_elem = htab_map_seq_show_elem,
 	.map_set_for_each_callback_args = map_set_for_each_callback_args,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b44106c8ea75..42fc432f82e1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1892,18 +1892,30 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 	return err;
 }
 
-#define BPF_MAP_DELETE_ELEM_LAST_FIELD key
+#define BPF_MAP_DELETE_ELEM_LAST_FIELD compare_size
 
 static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 {
 	bpfptr_t ukey = make_bpfptr(attr->key, uattr.is_kernel);
 	struct bpf_map *map;
-	void *key;
+	void *key, *compare = NULL;
+	u32 off = 0, size = 0;
 	int err;
 
 	if (CHECK_ATTR(BPF_MAP_DELETE_ELEM))
 		return -EINVAL;
 
+	if (attr->flags & ~BPF_F_COMPARE)
+		return -EINVAL;
+
+	/* The compare* fields are meaningful only with BPF_F_COMPARE.  Reject them
+	 * when the flag is absent so a dropped BPF_F_COMPARE cannot silently turn a
+	 * compare-and-delete into an unconditional delete.
+	 */
+	if (!(attr->flags & BPF_F_COMPARE) &&
+	    (attr->compare || attr->compare_offset || attr->compare_size))
+		return -EINVAL;
+
 	CLASS(fd, f)(attr->map_fd);
 	map = __bpf_map_get(f);
 	if (IS_ERR(map))
@@ -1920,6 +1932,38 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
+	if (attr->flags & BPF_F_COMPARE) {
+		bpfptr_t ucmp = make_bpfptr(attr->compare, uattr.is_kernel);
+
+		off = attr->compare_offset;
+		size = attr->compare_size ?: map->value_size;
+		/* off + size must fit in value_size, overflow-safe */
+		if (size > map->value_size || off > map->value_size - size) {
+			err = -EINVAL;
+			goto out;
+		}
+		if (!map->ops->map_delete_elem_cmp) {
+			err = -EOPNOTSUPP;
+			goto out;
+		}
+		/* Compare-and-delete reads the raw stored value.  Maps whose value carries
+		 * BTF-managed fields (bpf_spin_lock, bpf_timer, kptr, ...) have
+		 * those bytes sanitised on lookup, so a raw compare would never
+		 * match the caller's snapshot and could expose kernel-internal
+		 * bytes.  Reject such maps for now.
+		 */
+		if (!IS_ERR_OR_NULL(map->record)) {
+			err = -EOPNOTSUPP;
+			goto out;
+		}
+		compare = kvmemdup_bpfptr(ucmp, size);
+		if (IS_ERR(compare)) {
+			err = PTR_ERR(compare);
+			compare = NULL;
+			goto out;
+		}
+	}
+
 	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
@@ -1932,12 +1976,16 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 
 	bpf_disable_instrumentation();
 	rcu_read_lock();
-	err = map->ops->map_delete_elem(map, key);
+	if (compare)
+		err = map->ops->map_delete_elem_cmp(map, key, compare, off, size);
+	else
+		err = map->ops->map_delete_elem(map, key);
 	rcu_read_unlock();
 	bpf_enable_instrumentation();
 	if (!err)
 		maybe_wait_bpf_programs(map);
 out:
+	kvfree(compare);
 	kvfree(key);
 err_put:
 	bpf_map_write_active_dec(map);
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC bpf-next 3/3] selftests/bpf: test BPF_F_COMPARE compare-and-delete
  2026-06-22  7:16 [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Gyutae Bae
  2026-06-22  7:16 ` [RFC bpf-next 1/3] bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI Gyutae Bae
  2026-06-22  7:16 ` [RFC bpf-next 2/3] bpf: implement compare-and-delete (BPF_F_COMPARE) for BPF_MAP_TYPE_HASH Gyutae Bae
@ 2026-06-22  7:16 ` Gyutae Bae
  2026-06-22 22:32 ` [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Alexei Starovoitov
  3 siblings, 0 replies; 6+ messages in thread
From: Gyutae Bae @ 2026-06-22  7:16 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, bpf
  Cc: John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, linux-kselftest, linux-kernel,
	Minsu Jeon, Siwan Kim, Jonghyeon Kim, Gyutae Bae

From: Gyutae Bae <gyutae.bae@navercorp.com>

Add a test for compare-and-delete on a hash map. There is no libbpf
wrapper yet, so it issues the request through the raw bpf() syscall and
compares against a revision field stored in the value.

A matching expected value deletes the element, a stale one leaves it in
place and returns -EBUSY, and deleting an already-removed key returns
-ENOENT. A compare window that reaches past value_size is rejected with
-EINVAL, as are compare fields passed without BPF_F_COMPARE, so a dropped
flag cannot silently turn into an unconditional delete. Running the same
call on an array map, which has no compare-and-delete handler, returns
-EOPNOTSUPP.

Co-developed-by: Minsu Jeon <minsu.jeon@navercorp.com>
Signed-off-by: Minsu Jeon <minsu.jeon@navercorp.com>
Co-developed-by: Siwan Kim <siwan.kim@navercorp.com>
Signed-off-by: Siwan Kim <siwan.kim@navercorp.com>
Co-developed-by: Jonghyeon Kim <jong-hyeon.kim@navercorp.com>
Signed-off-by: Jonghyeon Kim <jong-hyeon.kim@navercorp.com>
Signed-off-by: Gyutae Bae <gyutae.bae@navercorp.com>
---
 .../selftests/bpf/prog_tests/map_cmp_delete.c | 106 ++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c

diff --git a/tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c b/tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c
new file mode 100644
index 000000000000..6d6df13adcc4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/map_cmp_delete.c
@@ -0,0 +1,106 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2026 NAVER Corp.
+ *
+ * Author: Gyutae Bae <gyutae.bae@navercorp.com>
+ */
+
+#include <test_progs.h>
+#include <sys/syscall.h>
+
+struct val {
+	__u64 revision;	/* synchronization field we compare on */
+	__u64 payload;
+};
+
+/* libbpf has no wrapper that passes the compare-and-delete fields, so
+ * issue bpf() directly. 'flags' is a parameter so tests can also
+ * exercise the no-BPF_F_COMPARE case.
+ */
+static int cmp_delete_flags(int fd, const void *key, const void *compare,
+			    __u32 off, __u32 size, __u64 flags)
+{
+	union bpf_attr attr;
+	int ret;
+
+	memset(&attr, 0, sizeof(attr));
+	attr.map_fd = fd;
+	attr.key = (__u64)(unsigned long)key;
+	attr.flags = flags;
+	attr.compare = (__u64)(unsigned long)compare;
+	attr.compare_offset = off;
+	attr.compare_size = size;
+	ret = syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+	return ret < 0 ? -errno : ret;
+}
+
+static int cmp_delete(int fd, const void *key, const void *compare,
+		      __u32 off, __u32 size)
+{
+	return cmp_delete_flags(fd, key, compare, off, size, BPF_F_COMPARE);
+}
+
+static int hash_map(void)
+{
+	return bpf_map_create(BPF_MAP_TYPE_HASH, "cmp_hash",
+			      sizeof(__u32), sizeof(struct val), 64, NULL);
+}
+
+static void test_contract(void)
+{
+	const __u32 off = offsetof(struct val, revision);
+	const __u32 sz = sizeof(((struct val *)0)->revision);
+	struct val v = { .revision = 10, .payload = 0xabc };
+	struct val snap = v;
+	__u32 k = 1;
+	int fd = hash_map();
+
+	if (!ASSERT_GE(fd, 0, "map_create"))
+		return;
+	ASSERT_OK(bpf_map_update_elem(fd, &k, &v, BPF_ANY), "insert");
+
+	/* mismatch -> -EBUSY, entry preserved */
+	snap.revision = 9;
+	ASSERT_EQ(cmp_delete(fd, &k, &snap, off, sz), -EBUSY, "mismatch_ebusy");
+	ASSERT_OK(bpf_map_lookup_elem(fd, &k, &v), "still_present");
+
+	/* compare fields set but BPF_F_COMPARE omitted -> -EINVAL, entry preserved
+	 * (a dropped flag must not silently become an unconditional delete)
+	 */
+	ASSERT_EQ(cmp_delete_flags(fd, &k, &snap, off, sz, 0), -EINVAL,
+		  "cmp_fields_no_flag_einval");
+	ASSERT_OK(bpf_map_lookup_elem(fd, &k, &v), "preserved_no_flag");
+
+	/* match -> 0, entry gone */
+	snap.revision = 10;
+	ASSERT_OK(cmp_delete(fd, &k, &snap, off, sz), "match_deletes");
+	ASSERT_EQ(bpf_map_lookup_elem(fd, &k, &v), -ENOENT, "gone");
+
+	/* absent -> -ENOENT */
+	ASSERT_EQ(cmp_delete(fd, &k, &snap, off, sz), -ENOENT, "absent_enoent");
+
+	/* bad window -> -EINVAL */
+	ASSERT_EQ(cmp_delete(fd, &k, &snap, 0, sizeof(struct val) + 8), -EINVAL,
+		  "oversize_einval");
+
+	/* unsupported map type (no map_delete_elem_cmp op) -> -EOPNOTSUPP */
+	{
+		int afd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "cmp_arr",
+					 sizeof(__u32), sizeof(struct val), 4, NULL);
+		__u32 ak = 0;
+
+		if (ASSERT_GE(afd, 0, "array_create")) {
+			ASSERT_EQ(cmp_delete(afd, &ak, &snap, off, sz),
+				  -EOPNOTSUPP, "array_eopnotsupp");
+			close(afd);
+		}
+	}
+	close(fd);
+}
+
+void test_map_cmp_delete(void)
+{
+	if (test__start_subtest("contract"))
+		test_contract();
+}
-- 
2.39.5 (Apple Git-154)


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps
  2026-06-22  7:16 [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Gyutae Bae
                   ` (2 preceding siblings ...)
  2026-06-22  7:16 ` [RFC bpf-next 3/3] selftests/bpf: test BPF_F_COMPARE compare-and-delete Gyutae Bae
@ 2026-06-22 22:32 ` Alexei Starovoitov
  2026-06-23  2:58   ` Gyutae Bae
  3 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2026-06-22 22:32 UTC (permalink / raw)
  To: Gyutae Bae, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	bpf
  Cc: John Fastabend, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
	Emil Tsalapatis, Shuah Khan, linux-kselftest, linux-kernel,
	Minsu Jeon, Siwan Kim, Jonghyeon Kim, Gyutae Bae

On Mon Jun 22, 2026 at 12:16 AM PDT, Gyutae Bae wrote:
> From: Gyutae Bae <gyutae.bae@navercorp.com>
>
> This series adds an atomic compare-and-delete primitive to BPF hash
> maps, motivated by a TOCTOU race in Cilium's conntrack GC [1]: the
> batched GC snapshots CT entries, decides which expired, then deletes
> them by key in a later syscall; between snapshot and delete the
> datapath can refresh the same entry, so a live entry is deleted. A
> userspace re-check before delete can't close it (lookup and delete are
> separate, individually bucket-locked calls).
>
> BPF_F_COMPARE lets userspace delete a key only if a chosen value region
> is unchanged, with the compare and the delete done atomically under the
> hash bucket lock:
>
>     attr.flags |= BPF_F_COMPARE;
>     attr.compare = <expected>;
>     attr.compare_offset = <off>;
>     attr.compare_size = <len>;
>
> mismatch -> -EBUSY, absent -> -ENOENT, unsupported map -> -EOPNOTSUPP.
> The compare* fields without the flag are rejected (-EINVAL) so a dropped
> flag can't silently become an unconditional delete; maps whose value
> carries BTF-managed fields (spin_lock/timer/kptr/...) are rejected
> (-EOPNOTSUPP) since those bytes are sanitised on lookup.
>
> Atomicity boundary (please scrutinise): the compare is atomic vs every
> bucket-lock holder, but NOT vs a BPF program writing the value in place
> via the pointer from bpf_map_lookup_elem() (no bucket lock). It
> collapses the race window from the whole GC batch to one bucket-locked
> critical section; full closure wants the compared region treated as a
> synchronization variable (e.g. a monotonic revision). The selftest
> models this.
>
> Scope of this RFC: per-element compare-and-delete on BPF_MAP_TYPE_HASH
> only. Deferred (will follow once the approach is agreed): batch delete +
> its attr fields, a libbpf wrapper, LRU-hash and other map types, a
> compare-and-swap *update*.
>
> Open questions:
>   - flag name: BPF_F_COMPARE vs something else?
>   - mismatch errno: -EBUSY vs -EAGAIN?
>   - new ->map_delete_elem_cmp() op vs extending ->map_delete_elem?

Sorry, this is no go.
There is bpf_spin_lock that use can use to synchronize access
between bpf progs and user space.
lookup_and_delete with BPF_F_LOCK uses the same lock.
Or add another syscall program that is triggered from user space
that operates on the same map.
Or convert everything to arena and use whatever algorithm you prefer.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps
  2026-06-22 22:32 ` [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Alexei Starovoitov
@ 2026-06-23  2:58   ` Gyutae Bae
  0 siblings, 0 replies; 6+ messages in thread
From: Gyutae Bae @ 2026-06-23  2:58 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, bpf
  Cc: Gyutae Bae, John Fastabend, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan,
	linux-kselftest, linux-kernel, Minsu Jeon, Siwan Kim,
	Jonghyeon Kim, Gyutae Bae

On Mon Jun 22, 2026 at 3:32 PM PDT, Alexei Starovoitov wrote:
> Sorry, this is no go.
> There is bpf_spin_lock that use can use to synchronize access
> between bpf progs and user space.
> lookup_and_delete with BPF_F_LOCK uses the same lock.
> Or add another syscall program that is triggered from user space
> that operates on the same map.
> Or convert everything to arena and use whatever algorithm you prefer.

Thanks for taking the time to look at this, and for the pointers.

That makes sense. Using bpf_spin_lock to synchronize the datapath and
user space, and driving the conditional delete from a SYSCALL program
(or moving the state into arena), stays within the existing
programmability model instead of adding new UAPI. At the same time,
it still lets us close the conntrack GC race we were after.

I'll explore those directions rather than a new map primitive.
Appreciate the pointers.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-06-23  2:58 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-22  7:16 [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Gyutae Bae
2026-06-22  7:16 ` [RFC bpf-next 1/3] bpf: add BPF_F_COMPARE flag and compare fields to map elem UAPI Gyutae Bae
2026-06-22  7:16 ` [RFC bpf-next 2/3] bpf: implement compare-and-delete (BPF_F_COMPARE) for BPF_MAP_TYPE_HASH Gyutae Bae
2026-06-22  7:16 ` [RFC bpf-next 3/3] selftests/bpf: test BPF_F_COMPARE compare-and-delete Gyutae Bae
2026-06-22 22:32 ` [RFC bpf-next 0/3] bpf: compare-and-delete (BPF_F_COMPARE) for hash maps Alexei Starovoitov
2026-06-23  2:58   ` Gyutae Bae

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox