[PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq
@ 2026-01-07 17:49 Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

This series reworks implementation of BPF timer and workqueue APIs.
The goal is to make both timers and wq non-blocking, enabling their use
in NMI context.
Today this code relies on a bpf_spin_lock embedded in the map element to
serialize:
 * init of the async object,
 * setting/changing the callback and bpf_prog
 * starting/cancelling the timer/work
 * tearing down when the map element is deleted or the map’s user ref is
 dropped

The basic design approach in this series:
 * Use irq_work to offload all blocking work from NMI
 * Introduce refcount to guarantee lifetime of the bpf_async_cb structs
 deferred to potentially multiple irq_work callbacks
 * Keep objects under RCU protection to make sure they are not freed
 while kfuncs/helpers access them (We can't use refcnt for this, as
 refcnt itself is part of the bpf_async_cb struct)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Changes in v3:
- Major rework
- Introduce mpmc_cell, allowing concurrent writes and reads
- Implement irq_work deferring
- Adding selftests
- Introduces bpf_timer_cancel_async kfunc
- Link to v2: https://lore.kernel.org/r/20251105-timer_nolock-v2-0-32698db08bfa@meta.com

Changes in v2:
- Move refcnt initialization and put (from cancel_and_free())
from patch 5 into the patch 4, so that patch 4 has more clear and full
implementation and use of refcnt
- Link to v1: https://lore.kernel.org/r/20251031-timer_nolock-v1-0-b064ae403bfb@meta.com

---
Mykyta Yatsenko (10):
      bpf: Refactor __bpf_async_set_callback()
      bpf: Factor out timer deletion helper
      bpf: Simplify bpf_timer_cancel()
      bpf: Add lock-free cell for NMI-safe async operations
      bpf: Enable bpf timer and workqueue use in NMI
      bpf: Add verifier support for bpf_timer argument in kfuncs
      bpf: Introduce bpf_timer_cancel_async() kfunc
      selftests/bpf: Refactor timer selftests
      selftests/bpf: Add stress test for timer async cancel
      selftests/bpf: Verify bpf_timer_cancel_async works

 kernel/bpf/Makefile                            |   2 +-
 kernel/bpf/helpers.c                           | 409 +++++++++++++++----------
 kernel/bpf/mpmc_cell.c                         |  62 ++++
 kernel/bpf/mpmc_cell.h                         | 112 +++++++
 kernel/bpf/verifier.c                          |  59 +++-
 tools/testing/selftests/bpf/prog_tests/timer.c |  92 +++++-
 tools/testing/selftests/bpf/progs/timer.c      |  37 ++-
 7 files changed, 588 insertions(+), 185 deletions(-)
---
base-commit: ea180ffbd27ce5abf2a06329fe1fc8d20dc9becf
change-id: 20251028-timer_nolock-457f5b9daace

Best regards,
-- 
Mykyta Yatsenko <yatsenko@meta.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
                     ` (2 more replies)
  2026-01-07 17:49 ` [PATCH RFC v3 02/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
                   ` (8 subsequent siblings)
  9 siblings, 3 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Refactor __bpf_async_set_callback() getting rid of locks. The idea of the
algorithm is to store both callback_fn and prog in struct bpf_async_cb
and verify that both pointers are stored, if any pointer does not
match (because of the concurrent update), retry until complete match.
On each iteration, increment refcnt of the prog that is going to
be set and decrement the one that is evicted, ensuring that get/put are
balanced, as each iteration has both inc/dec.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 61 ++++++++++++++++++----------------------------------
 1 file changed, 21 insertions(+), 40 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9eaa4185e0a79b903c6fc2ccb310f521a4b14a1d..954bd61310a6ad3a0d540c1b1ebe8c35a9c0119c 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
 };
 
 static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
-				    struct bpf_prog_aux *aux, unsigned int flags,
-				    enum bpf_async_type type)
+				    struct bpf_prog *prog)
 {
-	struct bpf_prog *prev, *prog = aux->prog;
-	struct bpf_async_cb *cb;
-	int ret = 0;
+	struct bpf_prog *prev;
+	struct bpf_async_cb *cb = async->cb;
 
-	if (in_nmi())
-		return -EOPNOTSUPP;
-	__bpf_spin_lock_irqsave(&async->lock);
-	cb = async->cb;
-	if (!cb) {
-		ret = -EINVAL;
-		goto out;
-	}
-	if (!atomic64_read(&cb->map->usercnt)) {
-		/* maps with timers must be either held by user space
-		 * or pinned in bpffs. Otherwise timer might still be
-		 * running even when bpf prog is detached and user space
-		 * is gone, since map_release_uref won't ever be called.
-		 */
-		ret = -EPERM;
-		goto out;
-	}
-	prev = cb->prog;
-	if (prev != prog) {
-		/* Bump prog refcnt once. Every bpf_timer_set_callback()
-		 * can pick different callback_fn-s within the same prog.
-		 */
-		prog = bpf_prog_inc_not_zero(prog);
-		if (IS_ERR(prog)) {
-			ret = PTR_ERR(prog);
-			goto out;
+	if (!cb)
+		return -EPERM;
+
+	do {
+		if (prog) {
+			prog = bpf_prog_inc_not_zero(prog);
+			if (IS_ERR(prog))
+				return PTR_ERR(prog);
 		}
+
+		prev = xchg(&cb->prog, prog);
+		rcu_assign_pointer(cb->callback_fn, callback_fn);
+
 		if (prev)
-			/* Drop prev prog refcnt when swapping with new prog */
 			bpf_prog_put(prev);
-		cb->prog = prog;
-	}
-	rcu_assign_pointer(cb->callback_fn, callback_fn);
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return ret;
+
+	} while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
+
+	return 0;
 }
 
 BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
 	   struct bpf_prog_aux *, aux)
 {
-	return __bpf_async_set_callback(timer, callback_fn, aux, 0, BPF_ASYNC_TYPE_TIMER);
+	return __bpf_async_set_callback(timer, callback_fn, aux->prog);
 }
 
 static const struct bpf_func_proto bpf_timer_set_callback_proto = {
@@ -3131,7 +3112,7 @@ __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
 	if (flags)
 		return -EINVAL;
 
-	return __bpf_async_set_callback(async, callback_fn, aux, flags, BPF_ASYNC_TYPE_WQ);
+	return __bpf_async_set_callback(async, callback_fn, aux->prog);
 }
 
 __bpf_kfunc void bpf_preempt_disable(void)

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 02/10] bpf: Factor out timer deletion helper
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-12  6:51   ` Kumar Kartikeya Dwivedi
  2026-01-07 17:49 ` [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Move the timer deletion logic into a dedicated bpf_timer_delete()
helper so it can be reused by later patches.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 954bd61310a6ad3a0d540c1b1ebe8c35a9c0119c..ff3c1e1160db748991f2a71e6a44727fc29424d5 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1539,18 +1539,10 @@ static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *a
 	return cb;
 }
 
-/* This function is called by map_delete/update_elem for individual element and
- * by ops->map_release_uref when the user space reference to a map reaches zero.
- */
-void bpf_timer_cancel_and_free(void *val)
+static void bpf_timer_delete(struct bpf_hrtimer *t)
 {
-	struct bpf_hrtimer *t;
-
-	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
-
-	if (!t)
-		return;
-	/* We check that bpf_map_delete/update_elem() was called from timer
+	/*
+	 * We check that bpf_map_delete/update_elem() was called from timer
 	 * callback_fn. In such case we don't call hrtimer_cancel() (since it
 	 * will deadlock) and don't call hrtimer_try_to_cancel() (since it will
 	 * just return -1). Though callback_fn is still running on this cpu it's
@@ -1599,6 +1591,21 @@ void bpf_timer_cancel_and_free(void *val)
 	}
 }
 
+/*
+ * This function is called by map_delete/update_elem for individual element and
+ * by ops->map_release_uref when the user space reference to a map reaches zero.
+ */
+void bpf_timer_cancel_and_free(void *val)
+{
+	struct bpf_hrtimer *t;
+
+	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
+	if (!t)
+		return;
+
+	bpf_timer_delete(t);
+}
+
 /* This function is called by map_delete/update_elem for individual element and
  * by ops->map_release_uref when the user space reference to a map reaches zero.
  */

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel()
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 02/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
  2026-01-07 17:49 ` [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations Mykyta Yatsenko
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Remove lock from the bpf_timer_cancel() helper. The lock does not
protect from concurrent modification of the bpf_async_cb data fields as
those are modified in the callback without locking.
Use guard(rcu)() instead of pair of explicit lock()/unlock().

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index ff3c1e1160db748991f2a71e6a44727fc29424d5..dc8ed948321e6c535d2cc2e8f9fbdd0636cdcabf 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1446,7 +1446,7 @@ static void drop_prog_refcnt(struct bpf_async_cb *async)
 	}
 }
 
-BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
+BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, async)
 {
 	struct bpf_hrtimer *t, *cur_t;
 	bool inc = false;
@@ -1454,13 +1454,12 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 
 	if (in_nmi())
 		return -EOPNOTSUPP;
-	rcu_read_lock();
-	__bpf_spin_lock_irqsave(&timer->lock);
-	t = timer->timer;
-	if (!t) {
-		ret = -EINVAL;
-		goto out;
-	}
+
+	guard(rcu)();
+
+	t = async->timer;
+	if (!t)
+		return -EINVAL;
 
 	cur_t = this_cpu_read(hrtimer_running);
 	if (cur_t == t) {
@@ -1468,8 +1467,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 		 * its own timer the hrtimer_cancel() will deadlock
 		 * since it waits for callback_fn to finish.
 		 */
-		ret = -EDEADLK;
-		goto out;
+		return -EDEADLK;
 	}
 
 	/* Only account in-flight cancellations when invoked from a timer
@@ -1492,20 +1490,19 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 		 * cancelling and waiting for it synchronously, since it might
 		 * do the same. Bail!
 		 */
-		ret = -EDEADLK;
-		goto out;
+		atomic_dec(&t->cancelling);
+		return -EDEADLK;
 	}
+
 drop:
-	drop_prog_refcnt(&t->cb);
-out:
-	__bpf_spin_unlock_irqrestore(&timer->lock);
+	__bpf_async_set_callback(async, NULL, NULL);
 	/* Cancel the timer and wait for associated callback to finish
 	 * if it was running.
 	 */
-	ret = ret ?: hrtimer_cancel(&t->timer);
+	ret = hrtimer_cancel(&t->timer);
+
 	if (inc)
 		atomic_dec(&t->cancelling);
-	rcu_read_unlock();
 	return ret;
 }
 

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (2 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-07 18:08   ` bot+bpf-ci
                     ` (2 more replies)
  2026-01-07 17:49 ` [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
                   ` (5 subsequent siblings)
  9 siblings, 3 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Introduce mpmc_cell, a lock-free cell primitive designed to support
concurrent writes to struct in NMI context (only one writer advances),
allowing readers to consume consistent snapshot.

Implementation details:
 Double buffering allows writers run concurrently with readers (read
 from one cell, write to another)

 The implementation uses a sequence-number-based protocol to enable
 exclusive writes.
  * Bit 0 of seq indicates an active writer
  * Bits 1+ form a generation counter
  * (seq & 2) >> 1 selects the read cell, write cell is opposite
  * Writers atomically set bit 0, write to the inactive cell, then
    increment seq to publish
  * Readers snapshot seq, read from the active cell, then validate
    that seq hasn't changed

mpmc_cell expects users to pre-allocate double buffers.

Key properties:
 * Writers never block (fail if lost the race to another writer)
 * Readers never block writers (double buffering), but may require
 retries if write updates the snapshot concurrently.

This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
operations (like hrtimer_start()) to irq_work effectively allowing BPF
programs to initiate timers and workqueues from NMI context.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/Makefile    |   2 +-
 kernel/bpf/mpmc_cell.c |  62 +++++++++++++++++++++++++++
 kernel/bpf/mpmc_cell.h | 112 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 175 insertions(+), 1 deletion(-)

diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 79cf22860a99ba31a9daf08a29de0f3a162ba89f..753fa63e0c24dc0a332d86c2c424894300f2d611 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
 endif
 CFLAGS_core.o += -Wno-override-init $(cflags-nogcse-yy)
 
-obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o liveness.o
+obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o liveness.o mpmc_cell.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o bpf_insn_array.o
diff --git a/kernel/bpf/mpmc_cell.c b/kernel/bpf/mpmc_cell.c
new file mode 100644
index 0000000000000000000000000000000000000000..ca91b4308c8b552bc81cfefa2d975290a64b596d
--- /dev/null
+++ b/kernel/bpf/mpmc_cell.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include "mpmc_cell.h"
+
+static u32 read_cell_idx(struct bpf_mpmc_cell_ctl *ctl, u32 seq)
+{
+	return (seq & 2) >> 1;
+}
+
+void bpf_mpmc_cell_init(struct bpf_mpmc_cell_ctl *ctl, void *cell1, void *cell2)
+{
+	atomic_set(&ctl->seq, 0);
+	ctl->cell[0] = cell1;
+	ctl->cell[1] = cell2;
+}
+
+void *bpf_mpmc_cell_read_begin(struct bpf_mpmc_cell_ctl *ctl, u32 *seq)
+{
+	*seq = atomic_read_acquire(&ctl->seq);
+	/* Mask out acive writer bit */
+	*seq &= ~1;
+
+	return ctl->cell[read_cell_idx(ctl, *seq)];
+}
+
+int bpf_mpmc_cell_read_end(struct bpf_mpmc_cell_ctl *ctl, u32 seq)
+{
+	u32 new_seq;
+
+	/* Ensure cell reads complete before checking seq */
+	smp_rmb();
+
+	new_seq = atomic_read_acquire(&ctl->seq);
+	new_seq &= ~1; /* Ignore active write bit */
+	/* Check if seq changed between begin and end, if it did, new snapshot is available */
+	if (new_seq != seq)
+		return -EAGAIN;
+
+	return 0;
+}
+
+void *bpf_mpmc_cell_write_begin(struct bpf_mpmc_cell_ctl *ctl)
+{
+	u32 seq;
+
+	/*
+	 * Try to set the lowest bit, on success, writer owns cell exclusively,
+	 * other writers fail
+	 */
+	seq = atomic_fetch_or_acquire(1, &ctl->seq);
+	if (seq & 1) /* Check if another writer is active */
+		return NULL;
+
+	/* Write to opposite to read buffer */
+	return ctl->cell[read_cell_idx(ctl, seq) ^ 1];
+}
+
+void bpf_mpmc_cell_write_commit(struct bpf_mpmc_cell_ctl *ctl)
+{
+	atomic_fetch_add_release(1, &ctl->seq);
+}
diff --git a/kernel/bpf/mpmc_cell.h b/kernel/bpf/mpmc_cell.h
new file mode 100644
index 0000000000000000000000000000000000000000..8b57226927a6c51460fae3113b94d8631173da63
--- /dev/null
+++ b/kernel/bpf/mpmc_cell.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#ifndef __BPF_MPMC_CELL_H__
+#define __BPF_MPMC_CELL_H__
+#include <linux/smp.h>
+
+/**
+ * DOC: BPF MPMC Cell
+ *
+ * Multi-producer, multi-consumer lock-free double buffer.
+ * Designed for writers producing data in NMI context where locking is not possible.
+ *
+ * Writers never block or wait, but may fail (return NULL) if another writer is active
+ * (assume these writers are overridden)
+ * Readers never block writers. Readers may need to retry if a write
+ * completes during the read window (return -EAGAIN)
+ *
+ * User should provide two allocated cells.
+ *
+ * Typical usage:
+ *
+ *   // Writer (from NMI or any context):
+ *   cell = bpf_mpmc_cell_write_begin(ctl);
+ *   if (!IS_ERR(cell)) {
+ *       memcpy(cell, data, size);
+ *       bpf_mpmc_cell_write_commit(ctl);
+ *   }
+ *
+ *   // Reader (from irq_work or similar):
+ *   cell = bpf_mpmc_cell_read_begin(ctl, &seq);
+ *   memcpy(local, cell, size);
+ *   ret = bpf_mpmc_cell_read_end(ctl, seq);
+ *   if (ret == 0)
+ *       process(local);  // success, we own this snapshot
+ *   else if (ret == -EAGAIN)
+ *       retry;           // snapshot changed or lost race
+ */
+
+/**
+ * struct bpf_mpmc_cell_ctl - control structure for mpmc cell
+ * @seq: sequence number (odd = write active, seq/2 = generation)
+ * @cell: pointers to two allocated cells to support double buffering
+ *
+ */
+struct bpf_mpmc_cell_ctl {
+	atomic_t seq;
+	void *cell[2];
+};
+
+/**
+ * bpf_mpmc_cell_init() - initialize mpmc cell control structure
+ * @ctl: pointer to control structure to initialize
+ * @cell1: pointer to an allocated cell
+ * @cell2: pointer to another same sized cell
+ *
+ * Must be called before any read/write operations.
+ * Caller must allocate two same sized cells (buffers, structs) and pass
+ * them to this function, those two cells are used for double-buffering,
+ * supporting concurrent reads/writes: readers use one cell, writers another.
+ *
+ * Context: Any context.
+ * Return: void.
+ */
+void bpf_mpmc_cell_init(struct bpf_mpmc_cell_ctl *ctl, void *cell1, void *cell2);
+
+/**
+ * bpf_mpmc_cell_read_begin() - begin a read operation
+ * @ctl: pointer to control structure
+ * @seq: output parameter, sequence number for this read
+ *
+ * Returns: pointer to the current read cell. Caller must copy data
+ * out and then call bpf_mpmc_cell_read_end() to validate.
+ */
+void *bpf_mpmc_cell_read_begin(struct bpf_mpmc_cell_ctl *ctl, u32 *seq);
+
+/**
+ * bpf_mpmc_cell_read_end() - validate read operation.
+ * @ctl: pointer to control structure
+ * @seq: sequence number from matching bpf_mpmc_cell_read_begin()
+ *
+ * Validates that the snapshot read between bpf_mpmc_cell_read_begin()
+ * and bpf_mpmc_cell_read_end() is consistent.
+ *
+ * Return:
+ *   0        - success, snapshot is consistent
+ *   -EAGAIN  - snapshot invalidated (another writer completed)
+ */
+int bpf_mpmc_cell_read_end(struct bpf_mpmc_cell_ctl *ctl, u32 seq);
+
+/**
+ * bpf_mpmc_cell_write_begin() - begin a write operation
+ * @ctl: pointer to control structure
+ *
+ * Attempts to acquire exclusive writer access. Only one writer can be
+ * active at a time. On success, caller must write data and call
+ * bpf_mpmc_cell_write_commit(). There is no write abort mechanism.
+ *
+ * Return: Pointer to the write cell, or NULL if another writer is
+ * active.
+ */
+void *bpf_mpmc_cell_write_begin(struct bpf_mpmc_cell_ctl *ctl);
+
+/**
+ * bpf_mpmc_cell_write_commit() - complete a write operation
+ * @ctl: pointer to control structure
+ *
+ * Publishes the written data, making it visible to readers.
+ * Must be called after successful bpf_mpmc_cell_write_begin().
+ */
+void bpf_mpmc_cell_write_commit(struct bpf_mpmc_cell_ctl *ctl);
+
+#endif

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (3 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
                     ` (2 more replies)
  2026-01-07 17:49 ` [PATCH RFC v3 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
                   ` (4 subsequent siblings)
  9 siblings, 3 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Refactor bpf timer and workqueue helpers to allow calling them from NMI
context by making all operations lock-free and deferring NMI-unsafe
work to irq_work.

Previously, bpf_timer_start(), and bpf_wq_start()
could not be called from NMI context because they acquired
bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
patch removes these limitations.

Key changes:
 * Remove bpf_spin_lock from struct bpf_async_kern. Replace locked
   operations with atomic cmpxchg() for initialization and xchg() for
   cancel and free.
 * Add per-async irq_work to defer NMI-unsafe operations (hrtimer_start,
   hrtimer_try_to_cancel, schedule_work) from NMI to softirq context.
 * Use the lock-free mpmc_cell (added in the previous commit) to pass
   operation commands (start/cancel/free) along with their parameters
   (nsec, mode) from NMI-safe callers to the irq_work handler.
 * Add reference counting to bpf_async_cb to ensure the object stays
   alive until all scheduled irq_work completes and the timer/work
   callback finishes.
 * Move bpf_prog_put() to RCU callback to handle races between
   set_callback() and cancel_and_free().

This enables BPF programs attached to NMI-context hooks (perf
events) to use timers and workqueues for deferred processing.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 288 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 191 insertions(+), 97 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index dc8ed948321e6c535d2cc2e8f9fbdd0636cdcabf..b90b005a17e1de9c0c62056a665d124b883c6320 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -29,6 +29,7 @@
 #include <linux/task_work.h>
 #include <linux/irq_work.h>
 #include <linux/buildid.h>
+#include <mpmc_cell.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -1095,6 +1096,23 @@ static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
 	return (void *)value - round_up(map->key_size, 8);
 }
 
+enum bpf_async_type {
+	BPF_ASYNC_TYPE_TIMER = 0,
+	BPF_ASYNC_TYPE_WQ,
+};
+
+enum bpf_async_op {
+	BPF_ASYNC_START,
+	BPF_ASYNC_CANCEL,
+	BPF_ASYNC_CANCEL_AND_FREE,
+};
+
+struct bpf_async_cmd {
+	u64 nsec;
+	u32 mode;
+	u32 op;
+};
+
 struct bpf_async_cb {
 	struct bpf_map *map;
 	struct bpf_prog *prog;
@@ -1105,6 +1123,12 @@ struct bpf_async_cb {
 		struct work_struct delete_work;
 	};
 	u64 flags;
+	struct irq_work worker;
+	struct bpf_mpmc_cell_ctl ctl;
+	struct bpf_async_cmd cmd[2];
+	atomic_t last_seq;
+	refcount_t refcnt;
+	enum bpf_async_type type;
 };
 
 /* BPF map elements can contain 'struct bpf_timer'.
@@ -1142,18 +1166,8 @@ struct bpf_async_kern {
 		struct bpf_hrtimer *timer;
 		struct bpf_work *work;
 	};
-	/* bpf_spin_lock is used here instead of spinlock_t to make
-	 * sure that it always fits into space reserved by struct bpf_timer
-	 * regardless of LOCKDEP and spinlock debug flags.
-	 */
-	struct bpf_spin_lock lock;
 } __attribute__((aligned(8)));
 
-enum bpf_async_type {
-	BPF_ASYNC_TYPE_TIMER = 0,
-	BPF_ASYNC_TYPE_WQ,
-};
-
 static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
 
 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
@@ -1219,6 +1233,13 @@ static void bpf_async_cb_rcu_free(struct rcu_head *rcu)
 {
 	struct bpf_async_cb *cb = container_of(rcu, struct bpf_async_cb, rcu);
 
+	/*
+	 * Drop the last reference to prog only after RCU GP, as set_callback()
+	 * may race with cancel_and_free()
+	 */
+	if (cb->prog)
+		bpf_prog_put(cb->prog);
+
 	kfree_nolock(cb);
 }
 
@@ -1246,18 +1267,17 @@ static void bpf_timer_delete_work(struct work_struct *work)
 	call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
 }
 
+static void __bpf_async_cancel_and_free(struct bpf_async_kern *async);
+static void bpf_async_irq_worker(struct irq_work *work);
+
 static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
 			    enum bpf_async_type type)
 {
-	struct bpf_async_cb *cb;
+	struct bpf_async_cb *cb, *old_cb;
 	struct bpf_hrtimer *t;
 	struct bpf_work *w;
 	clockid_t clockid;
 	size_t size;
-	int ret = 0;
-
-	if (in_nmi())
-		return -EOPNOTSUPP;
 
 	switch (type) {
 	case BPF_ASYNC_TYPE_TIMER:
@@ -1270,18 +1290,13 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		return -EINVAL;
 	}
 
-	__bpf_spin_lock_irqsave(&async->lock);
 	t = async->timer;
-	if (t) {
-		ret = -EBUSY;
-		goto out;
-	}
+	if (t)
+		return -EBUSY;
 
 	cb = bpf_map_kmalloc_nolock(map, size, 0, map->numa_node);
-	if (!cb) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!cb)
+		return -ENOMEM;
 
 	switch (type) {
 	case BPF_ASYNC_TYPE_TIMER:
@@ -1304,9 +1319,19 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 	cb->map = map;
 	cb->prog = NULL;
 	cb->flags = flags;
+	cb->worker = IRQ_WORK_INIT(bpf_async_irq_worker);
+	bpf_mpmc_cell_init(&cb->ctl, &cb->cmd[0], &cb->cmd[1]);
+	refcount_set(&cb->refcnt, 1); /* map's reference */
+	atomic_set(&cb->last_seq, 0);
+	cb->type = type;
 	rcu_assign_pointer(cb->callback_fn, NULL);
 
-	WRITE_ONCE(async->cb, cb);
+	old_cb = cmpxchg(&async->cb, NULL, cb);
+	if (old_cb) {
+		/* Lost the race to initialize this bpf_async_kern, drop the allocated object */
+		kfree_nolock(cb);
+		return -EBUSY;
+	}
 	/* Guarantee the order between async->cb and map->usercnt. So
 	 * when there are concurrent uref release and bpf timer init, either
 	 * bpf_timer_cancel_and_free() called by uref release reads a no-NULL
@@ -1317,13 +1342,11 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		/* maps with timers must be either held by user space
 		 * or pinned in bpffs.
 		 */
-		WRITE_ONCE(async->cb, NULL);
-		kfree_nolock(cb);
-		ret = -EPERM;
+		__bpf_async_cancel_and_free(async);
+		return -EPERM;
 	}
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return ret;
+
+	return 0;
 }
 
 BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
@@ -1354,6 +1377,61 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+static int bpf_async_schedule_op(struct bpf_async_cb *cb, u32 op, u64 nsec, u32 timer_mode)
+{
+	struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
+	struct bpf_async_cmd *cmd;
+
+	cmd = bpf_mpmc_cell_write_begin(ctl);
+	if (!cmd)
+		return -EBUSY;
+
+	cmd->nsec = nsec;
+	cmd->mode = timer_mode;
+	cmd->op = op;
+
+	bpf_mpmc_cell_write_commit(ctl);
+
+	if (!refcount_inc_not_zero(&cb->refcnt))
+		return -EBUSY;
+
+	irq_work_queue(&cb->worker);
+
+	return 0;
+}
+
+static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
+			     u64 *nsec, u32 *flags)
+{
+	struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
+	struct bpf_async_cmd *cmd;
+	u32 seq, last_seq;
+
+	do {
+		last_seq = atomic_read_acquire(&cb->last_seq);
+		cmd = bpf_mpmc_cell_read_begin(ctl, &seq);
+
+		/* Return -EBUSY if current seq is consumed by another reader */
+		if (seq == last_seq)
+			return -EBUSY;
+
+		*nsec = cmd->nsec;
+		*flags = cmd->mode;
+		*op = cmd->op;
+
+	/*
+	 * Retry read on one of the two conditions:
+	 *  1. Some writer produced new snapshot while we were reading. Our snapshot may have been
+	 *     modified, and not consistent.
+	 *  2. Another reader consumed some snapshot. We need to validate that this snapshot is not
+	 *     consumed. This prevents duplicate op processing.
+	 */
+	} while (bpf_mpmc_cell_read_end(ctl, seq) == -EAGAIN ||
+		 atomic_cmpxchg_release(&cb->last_seq, last_seq, seq) != last_seq);
+
+	return 0;
+}
+
 static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
 				    struct bpf_prog *prog)
 {
@@ -1395,22 +1473,19 @@ static const struct bpf_func_proto bpf_timer_set_callback_proto = {
 	.arg2_type	= ARG_PTR_TO_FUNC,
 };
 
-BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
+BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
 {
 	struct bpf_hrtimer *t;
-	int ret = 0;
-	enum hrtimer_mode mode;
+	u32 mode;
 
-	if (in_nmi())
-		return -EOPNOTSUPP;
 	if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
 		return -EINVAL;
-	__bpf_spin_lock_irqsave(&timer->lock);
-	t = timer->timer;
-	if (!t || !t->cb.prog) {
-		ret = -EINVAL;
-		goto out;
-	}
+
+	guard(rcu)();
+
+	t = async->timer;
+	if (!t || !t->cb.prog)
+		return -EINVAL;
 
 	if (flags & BPF_F_TIMER_ABS)
 		mode = HRTIMER_MODE_ABS_SOFT;
@@ -1420,10 +1495,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla
 	if (flags & BPF_F_TIMER_CPU_PIN)
 		mode |= HRTIMER_MODE_PINNED;
 
-	hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
-out:
-	__bpf_spin_unlock_irqrestore(&timer->lock);
-	return ret;
+	return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
 }
 
 static const struct bpf_func_proto bpf_timer_start_proto = {
@@ -1435,17 +1507,6 @@ static const struct bpf_func_proto bpf_timer_start_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-static void drop_prog_refcnt(struct bpf_async_cb *async)
-{
-	struct bpf_prog *prog = async->prog;
-
-	if (prog) {
-		bpf_prog_put(prog);
-		async->prog = NULL;
-		rcu_assign_pointer(async->callback_fn, NULL);
-	}
-}
-
 BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, async)
 {
 	struct bpf_hrtimer *t, *cur_t;
@@ -1513,27 +1574,16 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
 	.arg1_type	= ARG_PTR_TO_TIMER,
 };
 
-static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
+static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
 {
 	struct bpf_async_cb *cb;
 
-	/* Performance optimization: read async->cb without lock first. */
-	if (!READ_ONCE(async->cb))
-		return NULL;
-
-	__bpf_spin_lock_irqsave(&async->lock);
-	/* re-read it under lock */
-	cb = async->cb;
+	cb = xchg(&async->cb, NULL);
 	if (!cb)
-		goto out;
-	drop_prog_refcnt(cb);
-	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
-	 * this timer, since it won't be initialized.
-	 */
-	WRITE_ONCE(async->cb, NULL);
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return cb;
+		return;
+
+	/* Consume map's refcnt */
+	irq_work_queue(&cb->worker);
 }
 
 static void bpf_timer_delete(struct bpf_hrtimer *t)
@@ -1588,19 +1638,76 @@ static void bpf_timer_delete(struct bpf_hrtimer *t)
 	}
 }
 
+static void bpf_async_process_op(struct bpf_async_cb *cb, u32 op,
+				 u64 timer_nsec, u32 timer_mode)
+{
+	switch (cb->type) {
+	case BPF_ASYNC_TYPE_TIMER: {
+		struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
+
+		switch (op) {
+		case BPF_ASYNC_START:
+			hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
+			break;
+		case BPF_ASYNC_CANCEL:
+			hrtimer_try_to_cancel(&t->timer);
+			break;
+		case BPF_ASYNC_CANCEL_AND_FREE:
+			bpf_timer_delete(t);
+			break;
+		default:
+			break;
+		}
+		break;
+	}
+	case BPF_ASYNC_TYPE_WQ: {
+		struct bpf_work *w = container_of(cb, struct bpf_work, cb);
+
+		switch (op) {
+		case BPF_ASYNC_START:
+			schedule_work(&w->work);
+			break;
+		case BPF_ASYNC_CANCEL_AND_FREE:
+			/*
+			 * Trigger cancel of the sleepable work, but *do not* wait for
+			 * it to finish.
+			 * kfree will be called once the work has finished.
+			 */
+			schedule_work(&w->delete_work);
+			break;
+		default:
+			break;
+		}
+		break;
+	}
+	}
+}
+
+static void bpf_async_irq_worker(struct irq_work *work)
+{
+	struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
+	u32 op, timer_mode;
+	u64 nsec;
+	int err;
+
+	err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
+	if (err)
+		goto out;
+
+	bpf_async_process_op(cb, op, nsec, timer_mode);
+
+out:
+	if (refcount_dec_and_test(&cb->refcnt))
+		bpf_async_process_op(cb, BPF_ASYNC_CANCEL_AND_FREE, 0, 0);
+}
+
 /*
  * This function is called by map_delete/update_elem for individual element and
  * by ops->map_release_uref when the user space reference to a map reaches zero.
  */
 void bpf_timer_cancel_and_free(void *val)
 {
-	struct bpf_hrtimer *t;
-
-	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
-	if (!t)
-		return;
-
-	bpf_timer_delete(t);
+	__bpf_async_cancel_and_free(val);
 }
 
 /* This function is called by map_delete/update_elem for individual element and
@@ -1608,19 +1715,7 @@ void bpf_timer_cancel_and_free(void *val)
  */
 void bpf_wq_cancel_and_free(void *val)
 {
-	struct bpf_work *work;
-
-	BTF_TYPE_EMIT(struct bpf_wq);
-
-	work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
-	if (!work)
-		return;
-	/* Trigger cancel of the sleepable work, but *do not* wait for
-	 * it to finish if it was running as we might not be in a
-	 * sleepable context.
-	 * kfree will be called once the work has finished.
-	 */
-	schedule_work(&work->delete_work);
+	__bpf_async_cancel_and_free(val);
 }
 
 BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
@@ -3093,15 +3188,14 @@ __bpf_kfunc int bpf_wq_start(struct bpf_wq *wq, unsigned int flags)
 	struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
 	struct bpf_work *w;
 
-	if (in_nmi())
-		return -EOPNOTSUPP;
 	if (flags)
 		return -EINVAL;
 	w = READ_ONCE(async->work);
 	if (!w || !READ_ONCE(w->cb.prog))
 		return -EINVAL;
 
-	schedule_work(&w->work);
+	bpf_async_schedule_op(&w->cb, BPF_ASYNC_START, 0, 0);
+
 	return 0;
 }
 

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (4 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-09 22:19   ` Andrii Nakryiko
  2026-01-07 17:49 ` [PATCH RFC v3 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Extend the verifier to recognize struct bpf_timer as a valid kfunc
argument type. Previously, bpf_timer was only supported in BPF helpers.

This prepares for adding timer-related kfuncs in subsequent patches.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/verifier.c | 59 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9394b0de2ef0085690b0a0052f82cd48d8722e89..f3acd16ccabc81a64cf565ea092419dda6ae3e71 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8569,17 +8569,15 @@ static int check_map_field_pointer(struct bpf_verifier_env *env, u32 regno,
 }
 
 static int process_timer_func(struct bpf_verifier_env *env, int regno,
-			      struct bpf_call_arg_meta *meta)
+			      struct bpf_map *map)
 {
-	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
-	struct bpf_map *map = reg->map_ptr;
 	int err;
 
 	err = check_map_field_pointer(env, regno, BPF_TIMER);
 	if (err)
 		return err;
 
-	if (meta->map_ptr) {
+	if (map) {
 		verifier_bug(env, "Two map pointers in a timer helper");
 		return -EFAULT;
 	}
@@ -8587,8 +8585,36 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 		verbose(env, "bpf_timer cannot be used for PREEMPT_RT.\n");
 		return -EOPNOTSUPP;
 	}
+	return 0;
+}
+
+static int process_timer_helper(struct bpf_verifier_env *env, int regno,
+				struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	int err;
+
+	err = process_timer_func(env, regno, meta->map_ptr);
+	if (err)
+		return err;
+
 	meta->map_uid = reg->map_uid;
-	meta->map_ptr = map;
+	meta->map_ptr = reg->map_ptr;
+	return 0;
+}
+
+static int process_timer_kfunc(struct bpf_verifier_env *env, int regno,
+			       struct bpf_kfunc_call_arg_meta *meta)
+{
+	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
+	int err;
+
+	err = process_timer_func(env, regno, meta->map.ptr);
+	if (err)
+		return err;
+
+	meta->map.uid = reg->map_uid;
+	meta->map.ptr = reg->map_ptr;
 	return 0;
 }
 
@@ -9911,7 +9937,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 		}
 		break;
 	case ARG_PTR_TO_TIMER:
-		err = process_timer_func(env, regno, meta);
+		err = process_timer_helper(env, regno, meta);
 		if (err)
 			return err;
 		break;
@@ -12164,6 +12190,7 @@ enum {
 	KF_ARG_WORKQUEUE_ID,
 	KF_ARG_RES_SPIN_LOCK_ID,
 	KF_ARG_TASK_WORK_ID,
+	KF_ARG_TIMER_ID,
 };
 
 BTF_ID_LIST(kf_arg_btf_ids)
@@ -12175,6 +12202,7 @@ BTF_ID(struct, bpf_rb_node)
 BTF_ID(struct, bpf_wq)
 BTF_ID(struct, bpf_res_spin_lock)
 BTF_ID(struct, bpf_task_work)
+BTF_ID(struct, bpf_timer)
 
 static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
 				    const struct btf_param *arg, int type)
@@ -12218,6 +12246,11 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
 }
 
+static bool is_kfunc_arg_timer(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TIMER_ID);
+}
+
 static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
 {
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
@@ -12312,6 +12345,7 @@ enum kfunc_ptr_arg_type {
 	KF_ARG_PTR_TO_NULL,
 	KF_ARG_PTR_TO_CONST_STR,
 	KF_ARG_PTR_TO_MAP,
+	KF_ARG_PTR_TO_TIMER,
 	KF_ARG_PTR_TO_WORKQUEUE,
 	KF_ARG_PTR_TO_IRQ_FLAG,
 	KF_ARG_PTR_TO_RES_SPIN_LOCK,
@@ -12555,6 +12589,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_wq(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_WORKQUEUE;
 
+	if (is_kfunc_arg_timer(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_TIMER;
+
 	if (is_kfunc_arg_task_work(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_TASK_WORK;
 
@@ -13334,6 +13371,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
 		case KF_ARG_PTR_TO_CONST_STR:
 		case KF_ARG_PTR_TO_WORKQUEUE:
+		case KF_ARG_PTR_TO_TIMER:
 		case KF_ARG_PTR_TO_TASK_WORK:
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 		case KF_ARG_PTR_TO_RES_SPIN_LOCK:
@@ -13633,6 +13671,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			if (ret < 0)
 				return ret;
 			break;
+		case KF_ARG_PTR_TO_TIMER:
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				verbose(env, "arg#%d doesn't point to a map value\n", i);
+				return -EINVAL;
+			}
+			ret = process_timer_kfunc(env, regno, meta);
+			if (ret < 0)
+				return ret;
+			break;
 		case KF_ARG_PTR_TO_TASK_WORK:
 			if (reg->type != PTR_TO_MAP_VALUE) {
 				verbose(env, "arg#%d doesn't point to a map value\n", i);

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (5 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-09 22:19   ` Andrii Nakryiko
  2026-01-07 17:49 ` [PATCH RFC v3 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

introducing bpf timer cancel kfunc that attempts canceling timer
asynchronously, hence, supports working in NMI context.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index b90b005a17e1de9c0c62056a665d124b883c6320..1f593df04f326c509398f501907265ec6dae60e9 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -4439,6 +4439,19 @@ __bpf_kfunc int bpf_dynptr_file_discard(struct bpf_dynptr *dynptr)
 	return 0;
 }
 
+__bpf_kfunc int bpf_timer_cancel_async(struct bpf_timer *timer)
+{
+	struct bpf_async_cb *cb;
+	struct bpf_async_kern *async = (void *)timer;
+
+	guard(rcu)();
+	cb = async->cb;
+	if (!cb)
+		return -EINVAL;
+
+	return bpf_async_schedule_op(cb, BPF_ASYNC_CANCEL, 0, 0);
+}
+
 __bpf_kfunc_end_defs();
 
 static void bpf_task_work_cancel_scheduled(struct irq_work *irq_work)
@@ -4620,6 +4633,7 @@ BTF_ID_FLAGS(func, bpf_task_work_schedule_signal_impl)
 BTF_ID_FLAGS(func, bpf_task_work_schedule_resume_impl)
 BTF_ID_FLAGS(func, bpf_dynptr_from_file)
 BTF_ID_FLAGS(func, bpf_dynptr_file_discard)
+BTF_ID_FLAGS(func, bpf_timer_cancel_async)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 08/10] selftests/bpf: Refactor timer selftests
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (6 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
  9 siblings, 0 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Refactor timer selftests, extracting stress test into a separate test.
This makes it easier to debug test failures and allows to extend.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/timer.c | 55 +++++++++++++++++---------
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index 34f9ccce260293755980bcd6fcece491964f7929..4d853d1bd2a71b3d0f1ba0daa7a699945b4457fe 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -22,13 +22,35 @@ static void *spin_lock_thread(void *arg)
 	pthread_exit(arg);
 }
 
-static int timer(struct timer *timer_skel)
+
+static int timer_stress(struct timer *timer_skel)
 {
-	int i, err, prog_fd;
+	int i, err = 1, prog_fd;
 	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	pthread_t thread_id[NUM_THR];
 	void *ret;
 
+	prog_fd = bpf_program__fd(timer_skel->progs.race);
+	for (i = 0; i < NUM_THR; i++) {
+		err = pthread_create(&thread_id[i], NULL,
+				     &spin_lock_thread, &prog_fd);
+		if (!ASSERT_OK(err, "pthread_create"))
+			break;
+	}
+
+	while (i) {
+		err = pthread_join(thread_id[--i], &ret);
+		if (ASSERT_OK(err, "pthread_join"))
+			ASSERT_EQ(ret, (void *)&prog_fd, "pthread_join");
+	}
+	return err;
+}
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
 	err = timer__attach(timer_skel);
 	if (!ASSERT_OK(err, "timer_attach"))
 		return err;
@@ -63,25 +85,10 @@ static int timer(struct timer *timer_skel)
 	/* check that code paths completed */
 	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
 
-	prog_fd = bpf_program__fd(timer_skel->progs.race);
-	for (i = 0; i < NUM_THR; i++) {
-		err = pthread_create(&thread_id[i], NULL,
-				     &spin_lock_thread, &prog_fd);
-		if (!ASSERT_OK(err, "pthread_create"))
-			break;
-	}
-
-	while (i) {
-		err = pthread_join(thread_id[--i], &ret);
-		if (ASSERT_OK(err, "pthread_join"))
-			ASSERT_EQ(ret, (void *)&prog_fd, "pthread_join");
-	}
-
 	return 0;
 }
 
-/* TODO: use pid filtering */
-void serial_test_timer(void)
+static void test_timer(int (*timer_test_fn)(struct timer *timer_skel))
 {
 	struct timer *timer_skel = NULL;
 	int err;
@@ -94,13 +101,23 @@ void serial_test_timer(void)
 	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
 		return;
 
-	err = timer(timer_skel);
+	err = timer_test_fn(timer_skel);
 	ASSERT_OK(err, "timer");
 	timer__destroy(timer_skel);
+}
+
+void serial_test_timer(void)
+{
+	test_timer(timer);
 
 	RUN_TESTS(timer_failure);
 }
 
+void serial_test_timer_stress(void)
+{
+	test_timer(timer_stress);
+}
+
 void test_timer_interrupt(void)
 {
 	struct timer_interrupt *skel = NULL;

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 09/10] selftests/bpf: Add stress test for timer async cancel
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (7 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-07 17:49 ` [PATCH RFC v3 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
  9 siblings, 0 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Extend BPF timer selftest to run stress test for async cancel.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/timer.c | 18 +++++++++++++++++-
 tools/testing/selftests/bpf/progs/timer.c      | 14 +++++++++++---
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index 4d853d1bd2a71b3d0f1ba0daa7a699945b4457fe..a157a2a699e638c9f21712b1e7194fc4b6382e71 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -23,13 +23,14 @@ static void *spin_lock_thread(void *arg)
 }
 
 
-static int timer_stress(struct timer *timer_skel)
+static int timer_stress_runner(struct timer *timer_skel, bool async_cancel)
 {
 	int i, err = 1, prog_fd;
 	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	pthread_t thread_id[NUM_THR];
 	void *ret;
 
+	timer_skel->bss->async_cancel = async_cancel;
 	prog_fd = bpf_program__fd(timer_skel->progs.race);
 	for (i = 0; i < NUM_THR; i++) {
 		err = pthread_create(&thread_id[i], NULL,
@@ -46,6 +47,16 @@ static int timer_stress(struct timer *timer_skel)
 	return err;
 }
 
+static int timer_stress(struct timer *timer_skel)
+{
+	return timer_stress_runner(timer_skel, false);
+}
+
+static int timer_stress_async_cancel(struct timer *timer_skel)
+{
+	return timer_stress_runner(timer_skel, true);
+}
+
 static int timer(struct timer *timer_skel)
 {
 	int err, prog_fd;
@@ -118,6 +129,11 @@ void serial_test_timer_stress(void)
 	test_timer(timer_stress);
 }
 
+void serial_test_timer_stress_async_cancel(void)
+{
+	test_timer(timer_stress_async_cancel);
+}
+
 void test_timer_interrupt(void)
 {
 	struct timer_interrupt *skel = NULL;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index 4c677c001258a4c05cd570ec52363d49d8eea169..a81413514e4b07ef745f27eade71454234e731e8 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -1,13 +1,17 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2021 Facebook */
-#include <linux/bpf.h>
-#include <time.h>
+
+#include <vmlinux.h>
 #include <stdbool.h>
 #include <errno.h>
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
 
+#define CLOCK_MONOTONIC 1
+#define CLOCK_BOOTTIME 7
+
 char _license[] SEC("license") = "GPL";
+
 struct hmap_elem {
 	int counter;
 	struct bpf_timer timer;
@@ -63,6 +67,7 @@ __u64 callback_check = 52;
 __u64 callback2_check = 52;
 __u64 pinned_callback_check;
 __s32 pinned_cpu;
+bool async_cancel = 0;
 
 #define ARRAY 1
 #define HTAB 2
@@ -419,7 +424,10 @@ int race(void *ctx)
 
 	bpf_timer_set_callback(timer, race_timer_callback);
 	bpf_timer_start(timer, 0, 0);
-	bpf_timer_cancel(timer);
+	if (async_cancel)
+		bpf_timer_cancel_async(timer);
+	else
+		bpf_timer_cancel(timer);
 
 	return 0;
 }

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v3 10/10] selftests/bpf: Verify bpf_timer_cancel_async works
  2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (8 preceding siblings ...)
  2026-01-07 17:49 ` [PATCH RFC v3 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
@ 2026-01-07 17:49 ` Mykyta Yatsenko
  2026-01-09 22:19   ` Andrii Nakryiko
  9 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 17:49 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Add test that verifies that bpf_timer_cancel_async works: can cancel
callback successfully.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/timer.c | 25 +++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/timer.c      | 23 +++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index a157a2a699e638c9f21712b1e7194fc4b6382e71..2b932d4dfd436fd322bd07169f492e20e4ec7624 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -99,6 +99,26 @@ static int timer(struct timer *timer_skel)
 	return 0;
 }
 
+static int timer_cancel_async(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test_async_cancel_succeed);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(topts.retval, 0, "test_run");
+
+	usleep(500);
+	/* check that there were no errors in timer execution */
+	ASSERT_EQ(timer_skel->bss->err, 0, "err");
+
+	/* check that code paths completed */
+	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
+
+	return 0;
+}
+
 static void test_timer(int (*timer_test_fn)(struct timer *timer_skel))
 {
 	struct timer *timer_skel = NULL;
@@ -134,6 +154,11 @@ void serial_test_timer_stress_async_cancel(void)
 	test_timer(timer_stress_async_cancel);
 }
 
+void serial_test_timer_async_cancel(void)
+{
+	test_timer(timer_cancel_async);
+}
+
 void test_timer_interrupt(void)
 {
 	struct timer_interrupt *skel = NULL;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index a81413514e4b07ef745f27eade71454234e731e8..4b4ca781e7cdcf78015359cbd8f8d8ff591d6036 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -169,6 +169,29 @@ int BPF_PROG2(test1, int, a)
 	return 0;
 }
 
+static int timer_error(void *map, int *key, struct bpf_timer *timer)
+{
+	err = 42;
+	return 0;
+}
+
+SEC("syscall")
+int test_async_cancel_succeed(void *ctx)
+{
+	struct bpf_timer *arr_timer;
+	int array_key = ARRAY;
+
+	arr_timer = bpf_map_lookup_elem(&array, &array_key);
+	if (!arr_timer)
+		return 0;
+	bpf_timer_init(arr_timer, &array, CLOCK_MONOTONIC);
+	bpf_timer_set_callback(arr_timer, timer_error);
+	bpf_timer_start(arr_timer, 100000 /* 100us */, 0);
+	bpf_timer_cancel_async(arr_timer);
+	ok = 7;
+	return 0;
+}
+
 /* callback for prealloc and non-prealloca hashtab timers */
 static int timer_cb2(void *map, int *key, struct hmap_elem *val)
 {

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-07 17:49 ` [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations Mykyta Yatsenko
@ 2026-01-07 18:08   ` bot+bpf-ci
  2026-01-07 18:30   ` Kumar Kartikeya Dwivedi
  2026-01-09 22:19   ` Andrii Nakryiko
  2 siblings, 0 replies; 40+ messages in thread
From: bot+bpf-ci @ 2026-01-07 18:08 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, eddyz87
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1163 bytes --]

> diff --git a/kernel/bpf/mpmc_cell.h b/kernel/bpf/mpmc_cell.h
> new file mode 100644
> index 000000000000..8b57226927a6
> --- /dev/null
> +++ b/kernel/bpf/mpmc_cell.h
> @@ -0,0 +1,112 @@

[ ... ]

> + * Typical usage:
> + *
> + *   // Writer (from NMI or any context):
> + *   cell = bpf_mpmc_cell_write_begin(ctl);
> + *   if (!IS_ERR(cell)) {
> + *       memcpy(cell, data, size);
> + *       bpf_mpmc_cell_write_commit(ctl);
> + *   }

Should this check be "if (cell)" instead of "if (!IS_ERR(cell))"?

bpf_mpmc_cell_write_begin() returns NULL when another writer is active,
not an ERR_PTR. Since IS_ERR(NULL) returns false, "!IS_ERR(cell)" will
evaluate to true when cell is NULL, causing the memcpy to execute with
a NULL pointer.

The function documentation correctly states "Return: Pointer to the
write cell, or NULL if another writer is active", but the example code
here uses the wrong check pattern.

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-07 17:49 ` [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
@ 2026-01-07 18:22   ` bot+bpf-ci
  2026-01-09 22:19   ` Andrii Nakryiko
  2026-01-12  8:10   ` Kumar Kartikeya Dwivedi
  2 siblings, 0 replies; 40+ messages in thread
From: bot+bpf-ci @ 2026-01-07 18:22 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, eddyz87
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2551 bytes --]

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index dc8ed948321e..b90b005a17e1 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[ ... ]

> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, u32 op, u64 nsec, u32 timer_mode)
> +{
> +	struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
> +	struct bpf_async_cmd *cmd;
> +
> +	cmd = bpf_mpmc_cell_write_begin(ctl);
> +	if (!cmd)
> +		return -EBUSY;
> +
> +	cmd->nsec = nsec;
> +	cmd->mode = timer_mode;
> +	cmd->op = op;
> +
> +	bpf_mpmc_cell_write_commit(ctl);
> +
> +	if (!refcount_inc_not_zero(&cb->refcnt))
> +		return -EBUSY;
> +
> +	irq_work_queue(&cb->worker);
> +
> +	return 0;
> +}

[ ... ]

> -BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
> +BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
>  {
>  	struct bpf_hrtimer *t;
> -	int ret = 0;
> -	enum hrtimer_mode mode;
> +	u32 mode;
>
> -	if (in_nmi())
> -		return -EOPNOTSUPP;
>  	if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
>  		return -EINVAL;
>
> +	guard(rcu)();
> +
> +	t = async->timer;
> +	if (!t || !t->cb.prog)
> +		return -EINVAL;

[ ... ]

> -	hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> -out:
> -	__bpf_spin_unlock_irqrestore(&timer->lock);
> -	return ret;
> +	return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
>  }

[ ... ]

> @@ -3093,15 +3188,14 @@ __bpf_kfunc int bpf_wq_start(struct bpf_wq *wq, unsigned int flags)
>  	struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
>  	struct bpf_work *w;
>
> -	if (in_nmi())
> -		return -EOPNOTSUPP;
>  	if (flags)
>  		return -EINVAL;
>  	w = READ_ONCE(async->work);
>  	if (!w || !READ_ONCE(w->cb.prog))
>  		return -EINVAL;
>
> -	schedule_work(&w->work);
> +	bpf_async_schedule_op(&w->cb, BPF_ASYNC_START, 0, 0);
> +
>  	return 0;
>  }

Should this return the result of bpf_async_schedule_op() instead of
unconditionally returning 0? The new bpf_async_schedule_op() can fail
with -EBUSY (when the mpmc cell write fails or when refcount_inc_not_zero
fails), but this error is silently dropped.

For comparison, bpf_timer_start() does return the result of
bpf_async_schedule_op(). This inconsistency could cause bpf_wq_start()
callers to believe work was successfully scheduled when it was not.


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel()
  2026-01-07 17:49 ` [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
@ 2026-01-07 18:22   ` bot+bpf-ci
  2026-01-09 22:19     ` Andrii Nakryiko
  0 siblings, 1 reply; 40+ messages in thread
From: bot+bpf-ci @ 2026-01-07 18:22 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, eddyz87
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2433 bytes --]

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index ff3c1e1160db..dc8ed948321e 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[ ... ]

> @@ -1454,13 +1454,12 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
>
>  	if (in_nmi())
>  		return -EOPNOTSUPP;
> -	rcu_read_lock();
> -	__bpf_spin_lock_irqsave(&timer->lock);
> -	t = timer->timer;
> -	if (!t) {
> -		ret = -EINVAL;
> -		goto out;
> -	}
> +
> +	guard(rcu)();
> +
> +	t = async->timer;
> +	if (!t)
> +		return -EINVAL;

[ ... ]

> @@ -1492,20 +1490,19 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
>  		 * cancelling and waiting for it synchronously, since it might
>  		 * do the same. Bail!
>  		 */
> -		ret = -EDEADLK;
> -		goto out;
> +		atomic_dec(&t->cancelling);
> +		return -EDEADLK;
>  	}
> +
>  drop:
> -	drop_prog_refcnt(&t->cb);
> -out:
> -	__bpf_spin_unlock_irqrestore(&timer->lock);
> +	__bpf_async_set_callback(async, NULL, NULL);

With the lock removed here, can bpf_timer_cancel() race with
__bpf_async_cancel_and_free()? The concern is that both paths can
attempt to drop the prog reference:

  bpf_timer_cancel():
    __bpf_async_set_callback(async, NULL, NULL)
      -> prev = xchg(&cb->prog, NULL)
      -> bpf_prog_put(prev)

  __bpf_async_cancel_and_free():
    -> drop_prog_refcnt(cb)
      -> prog = cb->prog  (plain read)
      -> bpf_prog_put(prog)

The xchg in __bpf_async_set_callback() is atomic, but drop_prog_refcnt()
uses a plain read of cb->prog. If drop_prog_refcnt() reads cb->prog
before the xchg sets it to NULL, both paths would call bpf_prog_put() on
the same program, leading to a double put of the reference count.

Note: This appears to be addressed later in the series by commit
0035985a0a23 ("bpf: Enable bpf timer and workqueue use in NMI") which
moves bpf_prog_put() to the RCU callback, but the race window exists in
this intermediate commit.

>  	/* Cancel the timer and wait for associated callback to finish
>  	 * if it was running.
>  	 */
> -	ret = ret ?: hrtimer_cancel(&t->timer);
> +	ret = hrtimer_cancel(&t->timer);
> +
>  	if (inc)
>  		atomic_dec(&t->cancelling);
> -	rcu_read_unlock();
>  	return ret;
>  }


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
@ 2026-01-07 18:22   ` bot+bpf-ci
  2026-01-09 22:18     ` Andrii Nakryiko
  2026-01-09 22:18   ` Andrii Nakryiko
  2026-01-12  6:47   ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 40+ messages in thread
From: bot+bpf-ci @ 2026-01-07 18:22 UTC (permalink / raw)
  To: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, eddyz87
  Cc: yatsenko, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song,
	clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3832 bytes --]

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 9eaa4185e0a7..954bd61310a6 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
>  };
>
>  static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
> -				    struct bpf_prog_aux *aux, unsigned int flags,
> -				    enum bpf_async_type type)
> +				    struct bpf_prog *prog)
>  {
> -	struct bpf_prog *prev, *prog = aux->prog;
> -	struct bpf_async_cb *cb;
> -	int ret = 0;
> +	struct bpf_prog *prev;
> +	struct bpf_async_cb *cb = async->cb;
>
> -	if (in_nmi())
> -		return -EOPNOTSUPP;
> -	__bpf_spin_lock_irqsave(&async->lock);
> -	cb = async->cb;
> -	if (!cb) {
> -		ret = -EINVAL;
> -		goto out;
> -	}
> +	if (!cb)
> +		return -EPERM;

A small concern: the error code changed from -EINVAL to -EPERM. The verifier
comment at kernel/bpf/verifier.c mentions that non-initialized timers
"will return -EINVAL". Is the error code change intentional, and should
the verifier comment be updated?

> -	if (!atomic64_read(&cb->map->usercnt)) {
> -		/* maps with timers must be either held by user space
> -		 * or pinned in bpffs. Otherwise timer might still be
> -		 * running even when bpf prog is detached and user space
> -		 * is gone, since map_release_uref won't ever be called.
> -		 */
> -		ret = -EPERM;
> -		goto out;
> -	}

The usercnt check is removed here. The check still exists in
__bpf_async_init(), so a timer cannot be initialized when usercnt is zero.
But can there be issues if set_callback is called after usercnt drops to
zero on an already-initialized timer? The old code would fail with -EPERM,
while the new code proceeds.

> -	prev = cb->prog;
> -	if (prev != prog) {
> -		/* Bump prog refcnt once. Every bpf_timer_set_callback()
> -		 * can pick different callback_fn-s within the same prog.
> -		 */
> -		prog = bpf_prog_inc_not_zero(prog);
> -		if (IS_ERR(prog)) {
> -			ret = PTR_ERR(prog);
> -			goto out;
> +	do {
> +		if (prog) {
> +			prog = bpf_prog_inc_not_zero(prog);
> +			if (IS_ERR(prog))
> +				return PTR_ERR(prog);
>  		}
> +
> +		prev = xchg(&cb->prog, prog);
> +		rcu_assign_pointer(cb->callback_fn, callback_fn);

Is there a potential race between these two operations? Consider two
concurrent callers (with programs P and Q) racing:

1. Thread A: xchg stores P into cb->prog
2. Thread B: xchg stores Q into cb->prog, gets P back
3. Thread B: rcu_assign_pointer stores FB (Q's callback)
4. Thread B: bpf_prog_put(P), B's check passes, B exits
5. Thread A: rcu_assign_pointer stores FA (P's callback), overwrites FB
6. Thread A: check fails (cb->prog is Q, not P), retries

After step 5, cb->prog holds Q (with Q's reference) but cb->callback_fn
points to FA (from P). If P's refcount drops to zero elsewhere and P is
freed, would the timer callback FA become a dangling pointer?

The bpf_timer_cb() function reads cb->callback_fn and calls it directly
without verifying it corresponds to cb->prog. The cb->prog field holds
the reference keeping the program alive, but if the callback pointer is
from a different program that may have been freed, this could be a problem.

> +
>  		if (prev)
> -			/* Drop prev prog refcnt when swapping with new prog */
>  			bpf_prog_put(prev);
> -		cb->prog = prog;
> -	}
> -	rcu_assign_pointer(cb->callback_fn, callback_fn);
> -out:
> -	__bpf_spin_unlock_irqrestore(&async->lock);
> -	return ret;
> +
> +	} while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
> +
> +	return 0;
>  }

[ ... ]

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-07 17:49 ` [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations Mykyta Yatsenko
  2026-01-07 18:08   ` bot+bpf-ci
@ 2026-01-07 18:30   ` Kumar Kartikeya Dwivedi
  2026-01-07 19:05     ` Mykyta Yatsenko
  2026-01-09 22:19   ` Andrii Nakryiko
  2 siblings, 1 reply; 40+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-01-07 18:30 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Introduce mpmc_cell, a lock-free cell primitive designed to support
> concurrent writes to struct in NMI context (only one writer advances),
> allowing readers to consume consistent snapshot.
>
> Implementation details:
>  Double buffering allows writers run concurrently with readers (read
>  from one cell, write to another)
>
>  The implementation uses a sequence-number-based protocol to enable
>  exclusive writes.
>   * Bit 0 of seq indicates an active writer
>   * Bits 1+ form a generation counter
>   * (seq & 2) >> 1 selects the read cell, write cell is opposite
>   * Writers atomically set bit 0, write to the inactive cell, then
>     increment seq to publish
>   * Readers snapshot seq, read from the active cell, then validate
>     that seq hasn't changed
>
> mpmc_cell expects users to pre-allocate double buffers.
>
> Key properties:
>  * Writers never block (fail if lost the race to another writer)
>  * Readers never block writers (double buffering), but may require
>  retries if write updates the snapshot concurrently.
>
> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
> operations (like hrtimer_start()) to irq_work effectively allowing BPF
> programs to initiate timers and workqueues from NMI context.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

We already have a dual-versioned concurrency control primitive in the
kernel (seqcount_latch_t). I would just use that instead of
reinventing it here. I don't see much of a difference except writer
serialization, which can be done on top of it. If it was already
considered and discarded for some reason, please add that reason to
the commit message.

>  [...]
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-07 18:30   ` Kumar Kartikeya Dwivedi
@ 2026-01-07 19:05     ` Mykyta Yatsenko
  2026-01-09  1:18       ` Andrii Nakryiko
  0 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-07 19:05 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On 1/7/26 18:30, Kumar Kartikeya Dwivedi wrote:
> On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Introduce mpmc_cell, a lock-free cell primitive designed to support
>> concurrent writes to struct in NMI context (only one writer advances),
>> allowing readers to consume consistent snapshot.
>>
>> Implementation details:
>>   Double buffering allows writers run concurrently with readers (read
>>   from one cell, write to another)
>>
>>   The implementation uses a sequence-number-based protocol to enable
>>   exclusive writes.
>>    * Bit 0 of seq indicates an active writer
>>    * Bits 1+ form a generation counter
>>    * (seq & 2) >> 1 selects the read cell, write cell is opposite
>>    * Writers atomically set bit 0, write to the inactive cell, then
>>      increment seq to publish
>>    * Readers snapshot seq, read from the active cell, then validate
>>      that seq hasn't changed
>>
>> mpmc_cell expects users to pre-allocate double buffers.
>>
>> Key properties:
>>   * Writers never block (fail if lost the race to another writer)
>>   * Readers never block writers (double buffering), but may require
>>   retries if write updates the snapshot concurrently.
>>
>> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
>> operations (like hrtimer_start()) to irq_work effectively allowing BPF
>> programs to initiate timers and workqueues from NMI context.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
> We already have a dual-versioned concurrency control primitive in the
> kernel (seqcount_latch_t). I would just use that instead of
> reinventing it here. I don't see much of a difference except writer
> serialization, which can be done on top of it. If it was already
> considered and discarded for some reason, please add that reason to
> the commit message.
yes, multiple concurrent  writers support would is the main difference
between seqcount_latch_t and my implementation. I'll switch to
seqcount_latch_t and external synchronization for writers.
>>   [...]
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-07 19:05     ` Mykyta Yatsenko
@ 2026-01-09  1:18       ` Andrii Nakryiko
  2026-01-09 18:22         ` Mykyta Yatsenko
  0 siblings, 1 reply; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09  1:18 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: Kumar Kartikeya Dwivedi, bpf, ast, andrii, daniel, kafai,
	kernel-team, eddyz87, Mykyta Yatsenko

On Wed, Jan 7, 2026 at 11:05 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 1/7/26 18:30, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
> >> From: Mykyta Yatsenko <yatsenko@meta.com>
> >>
> >> Introduce mpmc_cell, a lock-free cell primitive designed to support
> >> concurrent writes to struct in NMI context (only one writer advances),
> >> allowing readers to consume consistent snapshot.
> >>
> >> Implementation details:
> >>   Double buffering allows writers run concurrently with readers (read
> >>   from one cell, write to another)
> >>
> >>   The implementation uses a sequence-number-based protocol to enable
> >>   exclusive writes.
> >>    * Bit 0 of seq indicates an active writer
> >>    * Bits 1+ form a generation counter
> >>    * (seq & 2) >> 1 selects the read cell, write cell is opposite
> >>    * Writers atomically set bit 0, write to the inactive cell, then
> >>      increment seq to publish
> >>    * Readers snapshot seq, read from the active cell, then validate
> >>      that seq hasn't changed
> >>
> >> mpmc_cell expects users to pre-allocate double buffers.
> >>
> >> Key properties:
> >>   * Writers never block (fail if lost the race to another writer)
> >>   * Readers never block writers (double buffering), but may require
> >>   retries if write updates the snapshot concurrently.
> >>
> >> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
> >> operations (like hrtimer_start()) to irq_work effectively allowing BPF
> >> programs to initiate timers and workqueues from NMI context.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> >> ---
> > We already have a dual-versioned concurrency control primitive in the
> > kernel (seqcount_latch_t). I would just use that instead of
> > reinventing it here. I don't see much of a difference except writer
> > serialization, which can be done on top of it. If it was already
> > considered and discarded for some reason, please add that reason to
> > the commit message.
> yes, multiple concurrent  writers support would is the main difference
> between seqcount_latch_t and my implementation. I'll switch to
> seqcount_latch_t and external synchronization for writers.

One advantage of custom primitive is that we don't need a second
atomic counter for writers. Here we combine the reader latch counter
(it's just scaled 2x for custom implementation) and "writer is active"
bit (even/odd counter).

With potentially millions of timer activations per second for some
extreme cases, would performance be enough reason to have custom
"seqlock latch"? (I'm not sure myself, trying to get opinions)

> >>   [...]
> >>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-09  1:18       ` Andrii Nakryiko
@ 2026-01-09 18:22         ` Mykyta Yatsenko
  2026-01-09 18:47           ` Andrii Nakryiko
  0 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-09 18:22 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Kumar Kartikeya Dwivedi, bpf, ast, andrii, daniel, kafai,
	kernel-team, eddyz87, Mykyta Yatsenko

On 1/9/26 01:18, Andrii Nakryiko wrote:
> On Wed, Jan 7, 2026 at 11:05 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> On 1/7/26 18:30, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>>>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>>>
>>>> Introduce mpmc_cell, a lock-free cell primitive designed to support
>>>> concurrent writes to struct in NMI context (only one writer advances),
>>>> allowing readers to consume consistent snapshot.
>>>>
>>>> Implementation details:
>>>>    Double buffering allows writers run concurrently with readers (read
>>>>    from one cell, write to another)
>>>>
>>>>    The implementation uses a sequence-number-based protocol to enable
>>>>    exclusive writes.
>>>>     * Bit 0 of seq indicates an active writer
>>>>     * Bits 1+ form a generation counter
>>>>     * (seq & 2) >> 1 selects the read cell, write cell is opposite
>>>>     * Writers atomically set bit 0, write to the inactive cell, then
>>>>       increment seq to publish
>>>>     * Readers snapshot seq, read from the active cell, then validate
>>>>       that seq hasn't changed
>>>>
>>>> mpmc_cell expects users to pre-allocate double buffers.
>>>>
>>>> Key properties:
>>>>    * Writers never block (fail if lost the race to another writer)
>>>>    * Readers never block writers (double buffering), but may require
>>>>    retries if write updates the snapshot concurrently.
>>>>
>>>> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
>>>> operations (like hrtimer_start()) to irq_work effectively allowing BPF
>>>> programs to initiate timers and workqueues from NMI context.
>>>>
>>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>>>> ---
>>> We already have a dual-versioned concurrency control primitive in the
>>> kernel (seqcount_latch_t). I would just use that instead of
>>> reinventing it here. I don't see much of a difference except writer
>>> serialization, which can be done on top of it. If it was already
>>> considered and discarded for some reason, please add that reason to
>>> the commit message.
>> yes, multiple concurrent  writers support would is the main difference
>> between seqcount_latch_t and my implementation. I'll switch to
>> seqcount_latch_t and external synchronization for writers.
> One advantage of custom primitive is that we don't need a second
> atomic counter for writers. Here we combine the reader latch counter
> (it's just scaled 2x for custom implementation) and "writer is active"
> bit (even/odd counter).
>
> With potentially millions of timer activations per second for some
> extreme cases, would performance be enough reason to have custom
> "seqlock latch"? (I'm not sure myself, trying to get opinions)
>
Actually seqcount_latch_t variant may be faster (correct me if I'm wrong),
because mpmc_cell requires 2 lock prefixed instructions to enter the write
critical section and seqcount_latch_t just 1.

mpmc_cell:

if (1&atomic_fetch_or_acquire(1, &ctl->seq)) // first lock prefixed insn
return;
...
       atomic_fetch_add_release(1, &ctl->seq);     // second lock 
prefixed insn

seqcount_latch_t based:

     if (atomic_cmpxchg(&ctl->lock, 0, 1))        // first lock prefixed 
insn
         return;
write_seqcount_latch_begin(&ctl->seq);       // inc with barriers
     ...
     write_seqcount_latch(&ctl->seq);            // inc with barriers
     atomic_set(&ctl->lock, 0);            // plain mov on x86_64

Does it look right?
>>>>    [...]
>>>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-09 18:22         ` Mykyta Yatsenko
@ 2026-01-09 18:47           ` Andrii Nakryiko
  2026-01-09 23:51             ` Alexei Starovoitov
  0 siblings, 1 reply; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 18:47 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: Kumar Kartikeya Dwivedi, bpf, ast, andrii, daniel, kafai,
	kernel-team, eddyz87, Mykyta Yatsenko

On Fri, Jan 9, 2026 at 10:22 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 1/9/26 01:18, Andrii Nakryiko wrote:
> > On Wed, Jan 7, 2026 at 11:05 AM Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> >> On 1/7/26 18:30, Kumar Kartikeya Dwivedi wrote:
> >>> On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
> >>>> From: Mykyta Yatsenko <yatsenko@meta.com>
> >>>>
> >>>> Introduce mpmc_cell, a lock-free cell primitive designed to support
> >>>> concurrent writes to struct in NMI context (only one writer advances),
> >>>> allowing readers to consume consistent snapshot.
> >>>>
> >>>> Implementation details:
> >>>>    Double buffering allows writers run concurrently with readers (read
> >>>>    from one cell, write to another)
> >>>>
> >>>>    The implementation uses a sequence-number-based protocol to enable
> >>>>    exclusive writes.
> >>>>     * Bit 0 of seq indicates an active writer
> >>>>     * Bits 1+ form a generation counter
> >>>>     * (seq & 2) >> 1 selects the read cell, write cell is opposite
> >>>>     * Writers atomically set bit 0, write to the inactive cell, then
> >>>>       increment seq to publish
> >>>>     * Readers snapshot seq, read from the active cell, then validate
> >>>>       that seq hasn't changed
> >>>>
> >>>> mpmc_cell expects users to pre-allocate double buffers.
> >>>>
> >>>> Key properties:
> >>>>    * Writers never block (fail if lost the race to another writer)
> >>>>    * Readers never block writers (double buffering), but may require
> >>>>    retries if write updates the snapshot concurrently.
> >>>>
> >>>> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
> >>>> operations (like hrtimer_start()) to irq_work effectively allowing BPF
> >>>> programs to initiate timers and workqueues from NMI context.
> >>>>
> >>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> >>>> ---
> >>> We already have a dual-versioned concurrency control primitive in the
> >>> kernel (seqcount_latch_t). I would just use that instead of
> >>> reinventing it here. I don't see much of a difference except writer
> >>> serialization, which can be done on top of it. If it was already
> >>> considered and discarded for some reason, please add that reason to
> >>> the commit message.
> >> yes, multiple concurrent  writers support would is the main difference
> >> between seqcount_latch_t and my implementation. I'll switch to
> >> seqcount_latch_t and external synchronization for writers.
> > One advantage of custom primitive is that we don't need a second
> > atomic counter for writers. Here we combine the reader latch counter
> > (it's just scaled 2x for custom implementation) and "writer is active"
> > bit (even/odd counter).
> >
> > With potentially millions of timer activations per second for some
> > extreme cases, would performance be enough reason to have custom
> > "seqlock latch"? (I'm not sure myself, trying to get opinions)
> >
> Actually seqcount_latch_t variant may be faster (correct me if I'm wrong),
> because mpmc_cell requires 2 lock prefixed instructions to enter the write
> critical section and seqcount_latch_t just 1.
>
> mpmc_cell:
>
> if (1&atomic_fetch_or_acquire(1, &ctl->seq)) // first lock prefixed insn
> return;
> ...
>        atomic_fetch_add_release(1, &ctl->seq);     // second lock
> prefixed insn
>
> seqcount_latch_t based:
>
>      if (atomic_cmpxchg(&ctl->lock, 0, 1))        // first lock prefixed
> insn
>          return;
> write_seqcount_latch_begin(&ctl->seq);       // inc with barriers
>      ...
>      write_seqcount_latch(&ctl->seq);            // inc with barriers
>      atomic_set(&ctl->lock, 0);            // plain mov on x86_64
>
> Does it look right?

So I think you are overpivoting on atomic vs non-atomic differences
here: when uncontended, atomic is almost as fast as non-atomic
(difference is irrelevant). But under contention, those four writes
due to separate latch is four cache line bounces (potentially) between
competing CPUs, while with custom sequence logic it's just two.

I'm not dead set on one approach or the other, but I don't think that
latch adds that match value. But if Kumar or others insist on latch, I
don't mind, in the end both will work.

> >>>>    [...]
> >>>>
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-07 18:22   ` bot+bpf-ci
@ 2026-01-09 22:18     ` Andrii Nakryiko
  2026-01-13 13:58       ` Mykyta Yatsenko
  2026-01-13 16:14       ` Mykyta Yatsenko
  0 siblings, 2 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:18 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, eddyz87, yatsenko, martin.lau, yonghong.song, clm,
	ihor.solodrai

On Wed, Jan 7, 2026 at 10:22 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 9eaa4185e0a7..954bd61310a6 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
> >  };
> >
> >  static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
> > -                                 struct bpf_prog_aux *aux, unsigned int flags,
> > -                                 enum bpf_async_type type)
> > +                                 struct bpf_prog *prog)
> >  {
> > -     struct bpf_prog *prev, *prog = aux->prog;
> > -     struct bpf_async_cb *cb;
> > -     int ret = 0;
> > +     struct bpf_prog *prev;
> > +     struct bpf_async_cb *cb = async->cb;
> >
> > -     if (in_nmi())
> > -             return -EOPNOTSUPP;
> > -     __bpf_spin_lock_irqsave(&async->lock);
> > -     cb = async->cb;
> > -     if (!cb) {
> > -             ret = -EINVAL;
> > -             goto out;
> > -     }
> > +     if (!cb)
> > +             return -EPERM;
>
> A small concern: the error code changed from -EINVAL to -EPERM. The verifier
> comment at kernel/bpf/verifier.c mentions that non-initialized timers
> "will return -EINVAL". Is the error code change intentional, and should
> the verifier comment be updated?
>

good AI, I asked the same

> > -     if (!atomic64_read(&cb->map->usercnt)) {
> > -             /* maps with timers must be either held by user space
> > -              * or pinned in bpffs. Otherwise timer might still be
> > -              * running even when bpf prog is detached and user space
> > -              * is gone, since map_release_uref won't ever be called.
> > -              */
> > -             ret = -EPERM;
> > -             goto out;
> > -     }
>
> The usercnt check is removed here. The check still exists in
> __bpf_async_init(), so a timer cannot be initialized when usercnt is zero.
> But can there be issues if set_callback is called after usercnt drops to
> zero on an already-initialized timer? The old code would fail with -EPERM,
> while the new code proceeds.
>

Wasn't this discussed earlier, Mykyta? Do you remember the conclusion?

> > -     prev = cb->prog;
> > -     if (prev != prog) {
> > -             /* Bump prog refcnt once. Every bpf_timer_set_callback()
> > -              * can pick different callback_fn-s within the same prog.
> > -              */
> > -             prog = bpf_prog_inc_not_zero(prog);
> > -             if (IS_ERR(prog)) {
> > -                     ret = PTR_ERR(prog);
> > -                     goto out;
> > +     do {
> > +             if (prog) {
> > +                     prog = bpf_prog_inc_not_zero(prog);
> > +                     if (IS_ERR(prog))
> > +                             return PTR_ERR(prog);
> >               }
> > +
> > +             prev = xchg(&cb->prog, prog);
> > +             rcu_assign_pointer(cb->callback_fn, callback_fn);
>
> Is there a potential race between these two operations? Consider two
> concurrent callers (with programs P and Q) racing:
>
> 1. Thread A: xchg stores P into cb->prog
> 2. Thread B: xchg stores Q into cb->prog, gets P back
> 3. Thread B: rcu_assign_pointer stores FB (Q's callback)
> 4. Thread B: bpf_prog_put(P), B's check passes, B exits
> 5. Thread A: rcu_assign_pointer stores FA (P's callback), overwrites FB
> 6. Thread A: check fails (cb->prog is Q, not P), retries
>
> After step 5, cb->prog holds Q (with Q's reference) but cb->callback_fn
> points to FA (from P). If P's refcount drops to zero elsewhere and P is
> freed, would the timer callback FA become a dangling pointer?

AI is not completely wrong here, IMO. No, there is no use-after-free
just yet because program is RCU protected, so we don't have dangling
pointer just yet.

But. That bpf_prog_inc_not_zero() on retry will fail if P's program
refcount dropped to zero already. And then once we exit, we'll have
Q+FA combo, which is not good.

So I think we need to do one bpf_prog_inc_not_zero() outside of the
loop once and keep that extra ref until we exit. And then instead of
bpf_prog_inc_not_zero(prog) inside the loop, we do unconditional
bpf_prog_inc(), which we *know* has to succeed. Less error handling
inside the loop as well.

Thoughts?


>
> The bpf_timer_cb() function reads cb->callback_fn and calls it directly
> without verifying it corresponds to cb->prog. The cb->prog field holds
> the reference keeping the program alive, but if the callback pointer is
> from a different program that may have been freed, this could be a problem.
>
> > +
> >               if (prev)
> > -                     /* Drop prev prog refcnt when swapping with new prog */
> >                       bpf_prog_put(prev);
> > -             cb->prog = prog;
> > -     }
> > -     rcu_assign_pointer(cb->callback_fn, callback_fn);
> > -out:
> > -     __bpf_spin_unlock_irqrestore(&async->lock);
> > -     return ret;
> > +
> > +     } while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
> > +
> > +     return 0;
> >  }
>
> [ ... ]
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
@ 2026-01-09 22:18   ` Andrii Nakryiko
  2026-01-12  6:47   ` Kumar Kartikeya Dwivedi
  2 siblings, 0 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:18 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Refactor __bpf_async_set_callback() getting rid of locks. The idea of the
> algorithm is to store both callback_fn and prog in struct bpf_async_cb
> and verify that both pointers are stored, if any pointer does not
> match (because of the concurrent update), retry until complete match.
> On each iteration, increment refcnt of the prog that is going to
> be set and decrement the one that is evicted, ensuring that get/put are
> balanced, as each iteration has both inc/dec.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 61 ++++++++++++++++++----------------------------------
>  1 file changed, 21 insertions(+), 40 deletions(-)
>

LGTM

Acked-by: Andrii Nakryiko <andrii@kernel.org>

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel()
  2026-01-07 18:22   ` bot+bpf-ci
@ 2026-01-09 22:19     ` Andrii Nakryiko
  2026-01-12  7:29       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:19 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai, kernel-team,
	memxor, eddyz87, yatsenko, martin.lau, yonghong.song, clm,
	ihor.solodrai

On Wed, Jan 7, 2026 at 10:22 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index ff3c1e1160db..dc8ed948321e 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
>
> [ ... ]
>
> > @@ -1454,13 +1454,12 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
> >
> >       if (in_nmi())
> >               return -EOPNOTSUPP;
> > -     rcu_read_lock();
> > -     __bpf_spin_lock_irqsave(&timer->lock);
> > -     t = timer->timer;
> > -     if (!t) {
> > -             ret = -EINVAL;
> > -             goto out;
> > -     }
> > +
> > +     guard(rcu)();
> > +
> > +     t = async->timer;

this should be READ_ONCE() now, no?

> > +     if (!t)
> > +             return -EINVAL;
>
> [ ... ]
>
> > @@ -1492,20 +1490,19 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
> >                * cancelling and waiting for it synchronously, since it might
> >                * do the same. Bail!
> >                */
> > -             ret = -EDEADLK;
> > -             goto out;
> > +             atomic_dec(&t->cancelling);
> > +             return -EDEADLK;
> >       }
> > +
> >  drop:
> > -     drop_prog_refcnt(&t->cb);
> > -out:
> > -     __bpf_spin_unlock_irqrestore(&timer->lock);
> > +     __bpf_async_set_callback(async, NULL, NULL);
>
> With the lock removed here, can bpf_timer_cancel() race with
> __bpf_async_cancel_and_free()? The concern is that both paths can
> attempt to drop the prog reference:
>
>   bpf_timer_cancel():
>     __bpf_async_set_callback(async, NULL, NULL)
>       -> prev = xchg(&cb->prog, NULL)
>       -> bpf_prog_put(prev)
>
>   __bpf_async_cancel_and_free():
>     -> drop_prog_refcnt(cb)
>       -> prog = cb->prog  (plain read)
>       -> bpf_prog_put(prog)
>
> The xchg in __bpf_async_set_callback() is atomic, but drop_prog_refcnt()
> uses a plain read of cb->prog. If drop_prog_refcnt() reads cb->prog
> before the xchg sets it to NULL, both paths would call bpf_prog_put() on
> the same program, leading to a double put of the reference count.
>
> Note: This appears to be addressed later in the series by commit
> 0035985a0a23 ("bpf: Enable bpf timer and workqueue use in NMI") which
> moves bpf_prog_put() to the RCU callback, but the race window exists in
> this intermediate commit.

fair point, do we need to do this early on?


>
> >       /* Cancel the timer and wait for associated callback to finish
> >        * if it was running.
> >        */
> > -     ret = ret ?: hrtimer_cancel(&t->timer);
> > +     ret = hrtimer_cancel(&t->timer);
> > +
> >       if (inc)
> >               atomic_dec(&t->cancelling);
> > -     rcu_read_unlock();
> >       return ret;
> >  }
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-07 17:49 ` [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations Mykyta Yatsenko
  2026-01-07 18:08   ` bot+bpf-ci
  2026-01-07 18:30   ` Kumar Kartikeya Dwivedi
@ 2026-01-09 22:19   ` Andrii Nakryiko
  2 siblings, 0 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:19 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Introduce mpmc_cell, a lock-free cell primitive designed to support
> concurrent writes to struct in NMI context (only one writer advances),
> allowing readers to consume consistent snapshot.
>
> Implementation details:
>  Double buffering allows writers run concurrently with readers (read
>  from one cell, write to another)
>
>  The implementation uses a sequence-number-based protocol to enable
>  exclusive writes.
>   * Bit 0 of seq indicates an active writer
>   * Bits 1+ form a generation counter
>   * (seq & 2) >> 1 selects the read cell, write cell is opposite
>   * Writers atomically set bit 0, write to the inactive cell, then
>     increment seq to publish
>   * Readers snapshot seq, read from the active cell, then validate
>     that seq hasn't changed
>
> mpmc_cell expects users to pre-allocate double buffers.
>
> Key properties:
>  * Writers never block (fail if lost the race to another writer)
>  * Readers never block writers (double buffering), but may require
>  retries if write updates the snapshot concurrently.
>
> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
> operations (like hrtimer_start()) to irq_work effectively allowing BPF
> programs to initiate timers and workqueues from NMI context.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/Makefile    |   2 +-
>  kernel/bpf/mpmc_cell.c |  62 +++++++++++++++++++++++++++
>  kernel/bpf/mpmc_cell.h | 112 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 175 insertions(+), 1 deletion(-)
>

LGTM overall (though I didn't think much about all the acquire/release
semantics)

> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 79cf22860a99ba31a9daf08a29de0f3a162ba89f..753fa63e0c24dc0a332d86c2c424894300f2d611 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse
>  endif
>  CFLAGS_core.o += -Wno-override-init $(cflags-nogcse-yy)
>
> -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o liveness.o
> +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o liveness.o mpmc_cell.o
>  obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
>  obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o bpf_insn_array.o
> diff --git a/kernel/bpf/mpmc_cell.c b/kernel/bpf/mpmc_cell.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..ca91b4308c8b552bc81cfefa2d975290a64b596d
> --- /dev/null
> +++ b/kernel/bpf/mpmc_cell.c
> @@ -0,0 +1,62 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +
> +#include "mpmc_cell.h"
> +
> +static u32 read_cell_idx(struct bpf_mpmc_cell_ctl *ctl, u32 seq)
> +{
> +       return (seq & 2) >> 1;

unconventional, why not `(seq >> 1) & 1`?

> +}
> +
> +void bpf_mpmc_cell_init(struct bpf_mpmc_cell_ctl *ctl, void *cell1, void *cell2)
> +{
> +       atomic_set(&ctl->seq, 0);
> +       ctl->cell[0] = cell1;
> +       ctl->cell[1] = cell2;
> +}
> +
> +void *bpf_mpmc_cell_read_begin(struct bpf_mpmc_cell_ctl *ctl, u32 *seq)
> +{
> +       *seq = atomic_read_acquire(&ctl->seq);
> +       /* Mask out acive writer bit */

typo: active

> +       *seq &= ~1;

instead of this bit manipulation, why not return logical epoch to
user, i.e., (seq >> 1)

> +
> +       return ctl->cell[read_cell_idx(ctl, *seq)];
> +}
> +
> +int bpf_mpmc_cell_read_end(struct bpf_mpmc_cell_ctl *ctl, u32 seq)
> +{
> +       u32 new_seq;
> +
> +       /* Ensure cell reads complete before checking seq */
> +       smp_rmb();
> +
> +       new_seq = atomic_read_acquire(&ctl->seq);
> +       new_seq &= ~1; /* Ignore active write bit */

as above, new_seq = seq >> 1 (this is "epoch", lower bit is for
internal writer coordination, that's implementation detail, not part
of external contract)

> +       /* Check if seq changed between begin and end, if it did, new snapshot is available */
> +       if (new_seq != seq)
> +               return -EAGAIN;
> +
> +       return 0;
> +}
> +
> +void *bpf_mpmc_cell_write_begin(struct bpf_mpmc_cell_ctl *ctl)
> +{
> +       u32 seq;
> +
> +       /*
> +        * Try to set the lowest bit, on success, writer owns cell exclusively,
> +        * other writers fail
> +        */
> +       seq = atomic_fetch_or_acquire(1, &ctl->seq);
> +       if (seq & 1) /* Check if another writer is active */
> +               return NULL;
> +
> +       /* Write to opposite to read buffer */
> +       return ctl->cell[read_cell_idx(ctl, seq) ^ 1];
> +}
> +
> +void bpf_mpmc_cell_write_commit(struct bpf_mpmc_cell_ctl *ctl)
> +{
> +       atomic_fetch_add_release(1, &ctl->seq);
> +}

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-07 17:49 ` [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
@ 2026-01-09 22:19   ` Andrii Nakryiko
  2026-01-14 14:53     ` Mykyta Yatsenko
  2026-01-12  8:10   ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:19 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Refactor bpf timer and workqueue helpers to allow calling them from NMI
> context by making all operations lock-free and deferring NMI-unsafe
> work to irq_work.
>
> Previously, bpf_timer_start(), and bpf_wq_start()
> could not be called from NMI context because they acquired
> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
> patch removes these limitations.
>
> Key changes:
>  * Remove bpf_spin_lock from struct bpf_async_kern. Replace locked
>    operations with atomic cmpxchg() for initialization and xchg() for
>    cancel and free.
>  * Add per-async irq_work to defer NMI-unsafe operations (hrtimer_start,
>    hrtimer_try_to_cancel, schedule_work) from NMI to softirq context.
>  * Use the lock-free mpmc_cell (added in the previous commit) to pass
>    operation commands (start/cancel/free) along with their parameters
>    (nsec, mode) from NMI-safe callers to the irq_work handler.
>  * Add reference counting to bpf_async_cb to ensure the object stays
>    alive until all scheduled irq_work completes and the timer/work
>    callback finishes.
>  * Move bpf_prog_put() to RCU callback to handle races between
>    set_callback() and cancel_and_free().
>
> This enables BPF programs attached to NMI-context hooks (perf
> events) to use timers and workqueues for deferred processing.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 288 ++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 191 insertions(+), 97 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index dc8ed948321e6c535d2cc2e8f9fbdd0636cdcabf..b90b005a17e1de9c0c62056a665d124b883c6320 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -29,6 +29,7 @@
>  #include <linux/task_work.h>
>  #include <linux/irq_work.h>
>  #include <linux/buildid.h>
> +#include <mpmc_cell.h>
>
>  #include "../../lib/kstrtox.h"
>
> @@ -1095,6 +1096,23 @@ static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
>         return (void *)value - round_up(map->key_size, 8);
>  }
>
> +enum bpf_async_type {
> +       BPF_ASYNC_TYPE_TIMER = 0,
> +       BPF_ASYNC_TYPE_WQ,
> +};
> +
> +enum bpf_async_op {
> +       BPF_ASYNC_START,
> +       BPF_ASYNC_CANCEL,
> +       BPF_ASYNC_CANCEL_AND_FREE,
> +};
> +
> +struct bpf_async_cmd {
> +       u64 nsec;
> +       u32 mode;
> +       u32 op;
> +};
> +
>  struct bpf_async_cb {
>         struct bpf_map *map;
>         struct bpf_prog *prog;
> @@ -1105,6 +1123,12 @@ struct bpf_async_cb {
>                 struct work_struct delete_work;
>         };
>         u64 flags;
> +       struct irq_work worker;
> +       struct bpf_mpmc_cell_ctl ctl;

nit: I'd call it more meaningful "cmd_cell"

> +       struct bpf_async_cmd cmd[2];
> +       atomic_t last_seq;
> +       refcount_t refcnt;
> +       enum bpf_async_type type;
>  };
>
>  /* BPF map elements can contain 'struct bpf_timer'.
> @@ -1142,18 +1166,8 @@ struct bpf_async_kern {
>                 struct bpf_hrtimer *timer;
>                 struct bpf_work *work;
>         };
> -       /* bpf_spin_lock is used here instead of spinlock_t to make
> -        * sure that it always fits into space reserved by struct bpf_timer
> -        * regardless of LOCKDEP and spinlock debug flags.
> -        */
> -       struct bpf_spin_lock lock;

we have to leave dummy placeholder instead of preserve bpf_async_kern
size for ABI compatibility

>  } __attribute__((aligned(8)));
>
> -enum bpf_async_type {
> -       BPF_ASYNC_TYPE_TIMER = 0,
> -       BPF_ASYNC_TYPE_WQ,
> -};
> -
>  static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
>
>  static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
> @@ -1219,6 +1233,13 @@ static void bpf_async_cb_rcu_free(struct rcu_head *rcu)
>  {
>         struct bpf_async_cb *cb = container_of(rcu, struct bpf_async_cb, rcu);
>
> +       /*
> +        * Drop the last reference to prog only after RCU GP, as set_callback()
> +        * may race with cancel_and_free()
> +        */
> +       if (cb->prog)
> +               bpf_prog_put(cb->prog);
> +
>         kfree_nolock(cb);
>  }
>
> @@ -1246,18 +1267,17 @@ static void bpf_timer_delete_work(struct work_struct *work)
>         call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
>  }
>
> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async);
> +static void bpf_async_irq_worker(struct irq_work *work);
> +
>  static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
>                             enum bpf_async_type type)
>  {
> -       struct bpf_async_cb *cb;
> +       struct bpf_async_cb *cb, *old_cb;
>         struct bpf_hrtimer *t;
>         struct bpf_work *w;
>         clockid_t clockid;
>         size_t size;
> -       int ret = 0;
> -
> -       if (in_nmi())
> -               return -EOPNOTSUPP;
>
>         switch (type) {
>         case BPF_ASYNC_TYPE_TIMER:
> @@ -1270,18 +1290,13 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>                 return -EINVAL;
>         }
>
> -       __bpf_spin_lock_irqsave(&async->lock);
>         t = async->timer;

READ_ONCE()?

> -       if (t) {
> -               ret = -EBUSY;
> -               goto out;
> -       }
> +       if (t)
> +               return -EBUSY;
>
>         cb = bpf_map_kmalloc_nolock(map, size, 0, map->numa_node);
> -       if (!cb) {
> -               ret = -ENOMEM;
> -               goto out;
> -       }
> +       if (!cb)
> +               return -ENOMEM;
>
>         switch (type) {
>         case BPF_ASYNC_TYPE_TIMER:
> @@ -1304,9 +1319,19 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>         cb->map = map;
>         cb->prog = NULL;
>         cb->flags = flags;
> +       cb->worker = IRQ_WORK_INIT(bpf_async_irq_worker);
> +       bpf_mpmc_cell_init(&cb->ctl, &cb->cmd[0], &cb->cmd[1]);
> +       refcount_set(&cb->refcnt, 1); /* map's reference */
> +       atomic_set(&cb->last_seq, 0);
> +       cb->type = type;
>         rcu_assign_pointer(cb->callback_fn, NULL);
>
> -       WRITE_ONCE(async->cb, cb);
> +       old_cb = cmpxchg(&async->cb, NULL, cb);
> +       if (old_cb) {
> +               /* Lost the race to initialize this bpf_async_kern, drop the allocated object */
> +               kfree_nolock(cb);
> +               return -EBUSY;
> +       }
>         /* Guarantee the order between async->cb and map->usercnt. So
>          * when there are concurrent uref release and bpf timer init, either
>          * bpf_timer_cancel_and_free() called by uref release reads a no-NULL
> @@ -1317,13 +1342,11 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>                 /* maps with timers must be either held by user space
>                  * or pinned in bpffs.
>                  */
> -               WRITE_ONCE(async->cb, NULL);
> -               kfree_nolock(cb);
> -               ret = -EPERM;
> +               __bpf_async_cancel_and_free(async);
> +               return -EPERM;
>         }
> -out:
> -       __bpf_spin_unlock_irqrestore(&async->lock);
> -       return ret;
> +
> +       return 0;
>  }
>
>  BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
> @@ -1354,6 +1377,61 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
>         .arg3_type      = ARG_ANYTHING,
>  };
>
> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, u32 op, u64 nsec, u32 timer_mode)
> +{
> +       struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
> +       struct bpf_async_cmd *cmd;
> +
> +       cmd = bpf_mpmc_cell_write_begin(ctl);
> +       if (!cmd)
> +               return -EBUSY;
> +
> +       cmd->nsec = nsec;
> +       cmd->mode = timer_mode;
> +       cmd->op = op;
> +
> +       bpf_mpmc_cell_write_commit(ctl);
> +
> +       if (!refcount_inc_not_zero(&cb->refcnt))
> +               return -EBUSY;
> +
> +       irq_work_queue(&cb->worker);

if not in NMI and not irq-disabled mode, we should be able to call
bpf_async_irq_worker() directly here and execute action synchronously
without irq_work execution. Add TODO so we don't forget to implement
that before patch set lands?

> +
> +       return 0;
> +}
> +
> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
> +                            u64 *nsec, u32 *flags)
> +{
> +       struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
> +       struct bpf_async_cmd *cmd;
> +       u32 seq, last_seq;
> +
> +       do {
> +               last_seq = atomic_read_acquire(&cb->last_seq);
> +               cmd = bpf_mpmc_cell_read_begin(ctl, &seq);
> +
> +               /* Return -EBUSY if current seq is consumed by another reader */
> +               if (seq == last_seq)
> +                       return -EBUSY;
> +
> +               *nsec = cmd->nsec;
> +               *flags = cmd->mode;
> +               *op = cmd->op;
> +
> +       /*
> +        * Retry read on one of the two conditions:
> +        *  1. Some writer produced new snapshot while we were reading. Our snapshot may have been
> +        *     modified, and not consistent.
> +        *  2. Another reader consumed some snapshot. We need to validate that this snapshot is not
> +        *     consumed. This prevents duplicate op processing.
> +        */
> +       } while (bpf_mpmc_cell_read_end(ctl, seq) == -EAGAIN ||

can read_end return any other error? If yes, how do we handle? If not,
why hard-code -EAGAIN here, maybe just `< 0` check?

> +                atomic_cmpxchg_release(&cb->last_seq, last_seq, seq) != last_seq);

nit: repeat condition is complicated enough (and requires multi-line
weirdly indented comment), so I'd do:

while (true) {
   ....

    if (bpf_mpmc_cell_read_end(ctl, seq) < 0)
        continue;
    if (atomic_cmpxchg() == last_seq) /* success*/
        break;


(or invert cmpxchg + continue, and then stand-alone break)

> +
> +       return 0;
> +}
> +
>  static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
>                                     struct bpf_prog *prog)
>  {
> @@ -1395,22 +1473,19 @@ static const struct bpf_func_proto bpf_timer_set_callback_proto = {
>         .arg2_type      = ARG_PTR_TO_FUNC,
>  };
>
> -BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
> +BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
>  {
>         struct bpf_hrtimer *t;
> -       int ret = 0;
> -       enum hrtimer_mode mode;
> +       u32 mode;
>
> -       if (in_nmi())
> -               return -EOPNOTSUPP;
>         if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
>                 return -EINVAL;
> -       __bpf_spin_lock_irqsave(&timer->lock);
> -       t = timer->timer;
> -       if (!t || !t->cb.prog) {
> -               ret = -EINVAL;
> -               goto out;
> -       }
> +
> +       guard(rcu)();
> +
> +       t = async->timer;

READ_ONCE()

> +       if (!t || !t->cb.prog)
> +               return -EINVAL;
>
>         if (flags & BPF_F_TIMER_ABS)
>                 mode = HRTIMER_MODE_ABS_SOFT;
> @@ -1420,10 +1495,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla
>         if (flags & BPF_F_TIMER_CPU_PIN)
>                 mode |= HRTIMER_MODE_PINNED;
>
> -       hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> -out:
> -       __bpf_spin_unlock_irqrestore(&timer->lock);
> -       return ret;
> +       return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
>  }
>
>  static const struct bpf_func_proto bpf_timer_start_proto = {
> @@ -1435,17 +1507,6 @@ static const struct bpf_func_proto bpf_timer_start_proto = {
>         .arg3_type      = ARG_ANYTHING,
>  };
>
> -static void drop_prog_refcnt(struct bpf_async_cb *async)
> -{
> -       struct bpf_prog *prog = async->prog;
> -
> -       if (prog) {
> -               bpf_prog_put(prog);
> -               async->prog = NULL;
> -               rcu_assign_pointer(async->callback_fn, NULL);
> -       }
> -}
> -
>  BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, async)
>  {
>         struct bpf_hrtimer *t, *cur_t;
> @@ -1513,27 +1574,16 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
>         .arg1_type      = ARG_PTR_TO_TIMER,
>  };
>
> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>  {
>         struct bpf_async_cb *cb;
>
> -       /* Performance optimization: read async->cb without lock first. */
> -       if (!READ_ONCE(async->cb))
> -               return NULL;
> -
> -       __bpf_spin_lock_irqsave(&async->lock);
> -       /* re-read it under lock */
> -       cb = async->cb;
> +       cb = xchg(&async->cb, NULL);
>         if (!cb)
> -               goto out;
> -       drop_prog_refcnt(cb);
> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> -        * this timer, since it won't be initialized.
> -        */
> -       WRITE_ONCE(async->cb, NULL);
> -out:
> -       __bpf_spin_unlock_irqrestore(&async->lock);
> -       return cb;
> +               return;
> +
> +       /* Consume map's refcnt */
> +       irq_work_queue(&cb->worker);

hm... this is subtle (and maybe broken?) irq_work_queue() can be
ignored here, if there is already another one scheduled, so I think
your clever idea with CANCEL_AND_FREE being done based on this
refcount drop is flawed...

CANCEL_AND_FREE has to succeed, so it's out-of-bounds signal that
shouldn't be going through that command cell, yes. But can't we just
have a simple one-way bool that will be set to true here (+ memory
barriers, maybe), and then irq_work_queue() scheduled. If there is irq
work is scheduled, it will inevitable will see this flag (even if it's
not our callback), and if not, then irq_work_queue() will successfully
schedule callback which will also clean up.


also, same about TODO for irq_work_queue() avoidance


>  }
>
>  static void bpf_timer_delete(struct bpf_hrtimer *t)
> @@ -1588,19 +1638,76 @@ static void bpf_timer_delete(struct bpf_hrtimer *t)
>         }
>  }
>
> +static void bpf_async_process_op(struct bpf_async_cb *cb, u32 op,
> +                                u64 timer_nsec, u32 timer_mode)
> +{
> +       switch (cb->type) {
> +       case BPF_ASYNC_TYPE_TIMER: {
> +               struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
> +
> +               switch (op) {
> +               case BPF_ASYNC_START:
> +                       hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
> +                       break;
> +               case BPF_ASYNC_CANCEL:
> +                       hrtimer_try_to_cancel(&t->timer);
> +                       break;
> +               case BPF_ASYNC_CANCEL_AND_FREE:
> +                       bpf_timer_delete(t);
> +                       break;
> +               default:
> +                       break;
> +               }
> +               break;
> +       }
> +       case BPF_ASYNC_TYPE_WQ: {
> +               struct bpf_work *w = container_of(cb, struct bpf_work, cb);
> +
> +               switch (op) {
> +               case BPF_ASYNC_START:
> +                       schedule_work(&w->work);
> +                       break;
> +               case BPF_ASYNC_CANCEL_AND_FREE:
> +                       /*
> +                        * Trigger cancel of the sleepable work, but *do not* wait for
> +                        * it to finish.
> +                        * kfree will be called once the work has finished.
> +                        */
> +                       schedule_work(&w->delete_work);
> +                       break;
> +               default:
> +                       break;
> +               }
> +               break;
> +       }
> +       }
> +}
> +
> +static void bpf_async_irq_worker(struct irq_work *work)
> +{
> +       struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
> +       u32 op, timer_mode;
> +       u64 nsec;
> +       int err;
> +
> +       err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
> +       if (err)
> +               goto out;
> +
> +       bpf_async_process_op(cb, op, nsec, timer_mode);
> +
> +out:
> +       if (refcount_dec_and_test(&cb->refcnt))
> +               bpf_async_process_op(cb, BPF_ASYNC_CANCEL_AND_FREE, 0, 0);
> +}
> +
>  /*
>   * This function is called by map_delete/update_elem for individual element and
>   * by ops->map_release_uref when the user space reference to a map reaches zero.
>   */
>  void bpf_timer_cancel_and_free(void *val)
>  {
> -       struct bpf_hrtimer *t;
> -
> -       t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
> -       if (!t)
> -               return;
> -
> -       bpf_timer_delete(t);
> +       __bpf_async_cancel_and_free(val);
>  }
>
>  /* This function is called by map_delete/update_elem for individual element and
> @@ -1608,19 +1715,7 @@ void bpf_timer_cancel_and_free(void *val)
>   */
>  void bpf_wq_cancel_and_free(void *val)
>  {
> -       struct bpf_work *work;
> -
> -       BTF_TYPE_EMIT(struct bpf_wq);
> -
> -       work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
> -       if (!work)
> -               return;
> -       /* Trigger cancel of the sleepable work, but *do not* wait for
> -        * it to finish if it was running as we might not be in a
> -        * sleepable context.
> -        * kfree will be called once the work has finished.
> -        */
> -       schedule_work(&work->delete_work);
> +       __bpf_async_cancel_and_free(val);
>  }
>
>  BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
> @@ -3093,15 +3188,14 @@ __bpf_kfunc int bpf_wq_start(struct bpf_wq *wq, unsigned int flags)
>         struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
>         struct bpf_work *w;
>
> -       if (in_nmi())
> -               return -EOPNOTSUPP;
>         if (flags)
>                 return -EINVAL;
>         w = READ_ONCE(async->work);
>         if (!w || !READ_ONCE(w->cb.prog))
>                 return -EINVAL;
>
> -       schedule_work(&w->work);
> +       bpf_async_schedule_op(&w->cb, BPF_ASYNC_START, 0, 0);
> +
>         return 0;
>  }
>
>
> --
> 2.52.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs
  2026-01-07 17:49 ` [PATCH RFC v3 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
@ 2026-01-09 22:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:19 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Extend the verifier to recognize struct bpf_timer as a valid kfunc
> argument type. Previously, bpf_timer was only supported in BPF helpers.
>
> This prepares for adding timer-related kfuncs in subsequent patches.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/verifier.c | 59 +++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 53 insertions(+), 6 deletions(-)
>

Looks reasonable.

Acked-by: Andrii Nakryiko <andrii@kernel.org>


[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc
  2026-01-07 17:49 ` [PATCH RFC v3 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
@ 2026-01-09 22:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:19 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> introducing bpf timer cancel kfunc that attempts canceling timer
> asynchronously, hence, supports working in NMI context.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index b90b005a17e1de9c0c62056a665d124b883c6320..1f593df04f326c509398f501907265ec6dae60e9 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -4439,6 +4439,19 @@ __bpf_kfunc int bpf_dynptr_file_discard(struct bpf_dynptr *dynptr)
>         return 0;
>  }
>
> +__bpf_kfunc int bpf_timer_cancel_async(struct bpf_timer *timer)
> +{
> +       struct bpf_async_cb *cb;
> +       struct bpf_async_kern *async = (void *)timer;
> +
> +       guard(rcu)();
> +       cb = async->cb;

READ_ONCE(), this cb can be xchg'ed into NULL, and we can't allow
compiler to re-read the pointer


> +       if (!cb)
> +               return -EINVAL;
> +
> +       return bpf_async_schedule_op(cb, BPF_ASYNC_CANCEL, 0, 0);
> +}
> +
>  __bpf_kfunc_end_defs();
>
>  static void bpf_task_work_cancel_scheduled(struct irq_work *irq_work)
> @@ -4620,6 +4633,7 @@ BTF_ID_FLAGS(func, bpf_task_work_schedule_signal_impl)
>  BTF_ID_FLAGS(func, bpf_task_work_schedule_resume_impl)
>  BTF_ID_FLAGS(func, bpf_dynptr_from_file)
>  BTF_ID_FLAGS(func, bpf_dynptr_file_discard)
> +BTF_ID_FLAGS(func, bpf_timer_cancel_async)
>  BTF_KFUNCS_END(common_btf_ids)
>
>  static const struct btf_kfunc_id_set common_kfunc_set = {
>
> --
> 2.52.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 10/10] selftests/bpf: Verify bpf_timer_cancel_async works
  2026-01-07 17:49 ` [PATCH RFC v3 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
@ 2026-01-09 22:19   ` Andrii Nakryiko
  0 siblings, 0 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-09 22:19 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Add test that verifies that bpf_timer_cancel_async works: can cancel
> callback successfully.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  tools/testing/selftests/bpf/prog_tests/timer.c | 25 +++++++++++++++++++++++++
>  tools/testing/selftests/bpf/progs/timer.c      | 23 +++++++++++++++++++++++
>  2 files changed, 48 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
> index a157a2a699e638c9f21712b1e7194fc4b6382e71..2b932d4dfd436fd322bd07169f492e20e4ec7624 100644
> --- a/tools/testing/selftests/bpf/prog_tests/timer.c
> +++ b/tools/testing/selftests/bpf/prog_tests/timer.c
> @@ -99,6 +99,26 @@ static int timer(struct timer *timer_skel)
>         return 0;
>  }
>
> +static int timer_cancel_async(struct timer *timer_skel)
> +{
> +       int err, prog_fd;
> +       LIBBPF_OPTS(bpf_test_run_opts, topts);
> +
> +       prog_fd = bpf_program__fd(timer_skel->progs.test_async_cancel_succeed);
> +       err = bpf_prog_test_run_opts(prog_fd, &topts);
> +       ASSERT_OK(err, "test_run");
> +       ASSERT_EQ(topts.retval, 0, "test_run");
> +
> +       usleep(500);
> +       /* check that there were no errors in timer execution */
> +       ASSERT_EQ(timer_skel->bss->err, 0, "err");
> +
> +       /* check that code paths completed */
> +       ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
> +
> +       return 0;
> +}
> +
>  static void test_timer(int (*timer_test_fn)(struct timer *timer_skel))
>  {
>         struct timer *timer_skel = NULL;
> @@ -134,6 +154,11 @@ void serial_test_timer_stress_async_cancel(void)
>         test_timer(timer_stress_async_cancel);
>  }
>
> +void serial_test_timer_async_cancel(void)
> +{
> +       test_timer(timer_cancel_async);
> +}
> +
>  void test_timer_interrupt(void)
>  {
>         struct timer_interrupt *skel = NULL;
> diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
> index a81413514e4b07ef745f27eade71454234e731e8..4b4ca781e7cdcf78015359cbd8f8d8ff591d6036 100644
> --- a/tools/testing/selftests/bpf/progs/timer.c
> +++ b/tools/testing/selftests/bpf/progs/timer.c
> @@ -169,6 +169,29 @@ int BPF_PROG2(test1, int, a)
>         return 0;
>  }
>
> +static int timer_error(void *map, int *key, struct bpf_timer *timer)
> +{
> +       err = 42;
> +       return 0;
> +}
> +
> +SEC("syscall")
> +int test_async_cancel_succeed(void *ctx)
> +{
> +       struct bpf_timer *arr_timer;
> +       int array_key = ARRAY;
> +
> +       arr_timer = bpf_map_lookup_elem(&array, &array_key);
> +       if (!arr_timer)
> +               return 0;
> +       bpf_timer_init(arr_timer, &array, CLOCK_MONOTONIC);
> +       bpf_timer_set_callback(arr_timer, timer_error);
> +       bpf_timer_start(arr_timer, 100000 /* 100us */, 0);

flaky timing-dependent test... not sure how to write a better one, but
this one is not great :(


> +       bpf_timer_cancel_async(arr_timer);
> +       ok = 7;
> +       return 0;
> +}
> +
>  /* callback for prealloc and non-prealloca hashtab timers */
>  static int timer_cb2(void *map, int *key, struct hmap_elem *val)
>  {
>
> --
> 2.52.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-09 18:47           ` Andrii Nakryiko
@ 2026-01-09 23:51             ` Alexei Starovoitov
  2026-01-10  0:03               ` Andrii Nakryiko
  0 siblings, 1 reply; 40+ messages in thread
From: Alexei Starovoitov @ 2026-01-09 23:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Mykyta Yatsenko, Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard,
	Mykyta Yatsenko

On Fri, Jan 9, 2026 at 10:47 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jan 9, 2026 at 10:22 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
> >
> > On 1/9/26 01:18, Andrii Nakryiko wrote:
> > > On Wed, Jan 7, 2026 at 11:05 AM Mykyta Yatsenko
> > > <mykyta.yatsenko5@gmail.com> wrote:
> > >> On 1/7/26 18:30, Kumar Kartikeya Dwivedi wrote:
> > >>> On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
> > >>>> From: Mykyta Yatsenko <yatsenko@meta.com>
> > >>>>
> > >>>> Introduce mpmc_cell, a lock-free cell primitive designed to support
> > >>>> concurrent writes to struct in NMI context (only one writer advances),
> > >>>> allowing readers to consume consistent snapshot.
> > >>>>
> > >>>> Implementation details:
> > >>>>    Double buffering allows writers run concurrently with readers (read
> > >>>>    from one cell, write to another)
> > >>>>
> > >>>>    The implementation uses a sequence-number-based protocol to enable
> > >>>>    exclusive writes.
> > >>>>     * Bit 0 of seq indicates an active writer
> > >>>>     * Bits 1+ form a generation counter
> > >>>>     * (seq & 2) >> 1 selects the read cell, write cell is opposite
> > >>>>     * Writers atomically set bit 0, write to the inactive cell, then
> > >>>>       increment seq to publish
> > >>>>     * Readers snapshot seq, read from the active cell, then validate
> > >>>>       that seq hasn't changed
> > >>>>
> > >>>> mpmc_cell expects users to pre-allocate double buffers.
> > >>>>
> > >>>> Key properties:
> > >>>>    * Writers never block (fail if lost the race to another writer)
> > >>>>    * Readers never block writers (double buffering), but may require
> > >>>>    retries if write updates the snapshot concurrently.
> > >>>>
> > >>>> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
> > >>>> operations (like hrtimer_start()) to irq_work effectively allowing BPF
> > >>>> programs to initiate timers and workqueues from NMI context.
> > >>>>
> > >>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > >>>> ---
> > >>> We already have a dual-versioned concurrency control primitive in the
> > >>> kernel (seqcount_latch_t). I would just use that instead of
> > >>> reinventing it here. I don't see much of a difference except writer
> > >>> serialization, which can be done on top of it. If it was already
> > >>> considered and discarded for some reason, please add that reason to
> > >>> the commit message.
> > >> yes, multiple concurrent  writers support would is the main difference
> > >> between seqcount_latch_t and my implementation. I'll switch to
> > >> seqcount_latch_t and external synchronization for writers.
> > > One advantage of custom primitive is that we don't need a second
> > > atomic counter for writers. Here we combine the reader latch counter
> > > (it's just scaled 2x for custom implementation) and "writer is active"
> > > bit (even/odd counter).
> > >
> > > With potentially millions of timer activations per second for some
> > > extreme cases, would performance be enough reason to have custom
> > > "seqlock latch"? (I'm not sure myself, trying to get opinions)
> > >
> > Actually seqcount_latch_t variant may be faster (correct me if I'm wrong),
> > because mpmc_cell requires 2 lock prefixed instructions to enter the write
> > critical section and seqcount_latch_t just 1.
> >
> > mpmc_cell:
> >
> > if (1&atomic_fetch_or_acquire(1, &ctl->seq)) // first lock prefixed insn
> > return;
> > ...
> >        atomic_fetch_add_release(1, &ctl->seq);     // second lock
> > prefixed insn
> >
> > seqcount_latch_t based:
> >
> >      if (atomic_cmpxchg(&ctl->lock, 0, 1))        // first lock prefixed
> > insn
> >          return;
> > write_seqcount_latch_begin(&ctl->seq);       // inc with barriers
> >      ...
> >      write_seqcount_latch(&ctl->seq);            // inc with barriers
> >      atomic_set(&ctl->lock, 0);            // plain mov on x86_64
> >
> > Does it look right?
>
> So I think you are overpivoting on atomic vs non-atomic differences
> here: when uncontended, atomic is almost as fast as non-atomic
> (difference is irrelevant). But under contention, those four writes
> due to separate latch is four cache line bounces (potentially) between
> competing CPUs, while with custom sequence logic it's just two.
>
> I'm not dead set on one approach or the other, but I don't think that
> latch adds that match value. But if Kumar or others insist on latch, I
> don't mind, in the end both will work.

I think it's worth writing a micro benchmark for it.
Intuitively I feel seqcount_latch_t approach will be faster
and it has all the kcsan annotation around it,
while this new primitive still needs work to teach kcsan
about it.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations
  2026-01-09 23:51             ` Alexei Starovoitov
@ 2026-01-10  0:03               ` Andrii Nakryiko
  0 siblings, 0 replies; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-10  0:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mykyta Yatsenko, Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin Lau, Kernel Team, Eduard,
	Mykyta Yatsenko

On Fri, Jan 9, 2026 at 3:51 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jan 9, 2026 at 10:47 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jan 9, 2026 at 10:22 AM Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> > >
> > > On 1/9/26 01:18, Andrii Nakryiko wrote:
> > > > On Wed, Jan 7, 2026 at 11:05 AM Mykyta Yatsenko
> > > > <mykyta.yatsenko5@gmail.com> wrote:
> > > >> On 1/7/26 18:30, Kumar Kartikeya Dwivedi wrote:
> > > >>> On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
> > > >>>> From: Mykyta Yatsenko <yatsenko@meta.com>
> > > >>>>
> > > >>>> Introduce mpmc_cell, a lock-free cell primitive designed to support
> > > >>>> concurrent writes to struct in NMI context (only one writer advances),
> > > >>>> allowing readers to consume consistent snapshot.
> > > >>>>
> > > >>>> Implementation details:
> > > >>>>    Double buffering allows writers run concurrently with readers (read
> > > >>>>    from one cell, write to another)
> > > >>>>
> > > >>>>    The implementation uses a sequence-number-based protocol to enable
> > > >>>>    exclusive writes.
> > > >>>>     * Bit 0 of seq indicates an active writer
> > > >>>>     * Bits 1+ form a generation counter
> > > >>>>     * (seq & 2) >> 1 selects the read cell, write cell is opposite
> > > >>>>     * Writers atomically set bit 0, write to the inactive cell, then
> > > >>>>       increment seq to publish
> > > >>>>     * Readers snapshot seq, read from the active cell, then validate
> > > >>>>       that seq hasn't changed
> > > >>>>
> > > >>>> mpmc_cell expects users to pre-allocate double buffers.
> > > >>>>
> > > >>>> Key properties:
> > > >>>>    * Writers never block (fail if lost the race to another writer)
> > > >>>>    * Readers never block writers (double buffering), but may require
> > > >>>>    retries if write updates the snapshot concurrently.
> > > >>>>
> > > >>>> This will be used by BPF timer and workqueue helpers to defer NMI-unsafe
> > > >>>> operations (like hrtimer_start()) to irq_work effectively allowing BPF
> > > >>>> programs to initiate timers and workqueues from NMI context.
> > > >>>>
> > > >>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > > >>>> ---
> > > >>> We already have a dual-versioned concurrency control primitive in the
> > > >>> kernel (seqcount_latch_t). I would just use that instead of
> > > >>> reinventing it here. I don't see much of a difference except writer
> > > >>> serialization, which can be done on top of it. If it was already
> > > >>> considered and discarded for some reason, please add that reason to
> > > >>> the commit message.
> > > >> yes, multiple concurrent  writers support would is the main difference
> > > >> between seqcount_latch_t and my implementation. I'll switch to
> > > >> seqcount_latch_t and external synchronization for writers.
> > > > One advantage of custom primitive is that we don't need a second
> > > > atomic counter for writers. Here we combine the reader latch counter
> > > > (it's just scaled 2x for custom implementation) and "writer is active"
> > > > bit (even/odd counter).
> > > >
> > > > With potentially millions of timer activations per second for some
> > > > extreme cases, would performance be enough reason to have custom
> > > > "seqlock latch"? (I'm not sure myself, trying to get opinions)
> > > >
> > > Actually seqcount_latch_t variant may be faster (correct me if I'm wrong),
> > > because mpmc_cell requires 2 lock prefixed instructions to enter the write
> > > critical section and seqcount_latch_t just 1.
> > >
> > > mpmc_cell:
> > >
> > > if (1&atomic_fetch_or_acquire(1, &ctl->seq)) // first lock prefixed insn
> > > return;
> > > ...
> > >        atomic_fetch_add_release(1, &ctl->seq);     // second lock
> > > prefixed insn
> > >
> > > seqcount_latch_t based:
> > >
> > >      if (atomic_cmpxchg(&ctl->lock, 0, 1))        // first lock prefixed
> > > insn
> > >          return;
> > > write_seqcount_latch_begin(&ctl->seq);       // inc with barriers
> > >      ...
> > >      write_seqcount_latch(&ctl->seq);            // inc with barriers
> > >      atomic_set(&ctl->lock, 0);            // plain mov on x86_64
> > >
> > > Does it look right?
> >
> > So I think you are overpivoting on atomic vs non-atomic differences
> > here: when uncontended, atomic is almost as fast as non-atomic
> > (difference is irrelevant). But under contention, those four writes
> > due to separate latch is four cache line bounces (potentially) between
> > competing CPUs, while with custom sequence logic it's just two.
> >
> > I'm not dead set on one approach or the other, but I don't think that
> > latch adds that match value. But if Kumar or others insist on latch, I
> > don't mind, in the end both will work.
>
> I think it's worth writing a micro benchmark for it.
> Intuitively I feel seqcount_latch_t approach will be faster
> and it has all the kcsan annotation around it,
> while this new primitive still needs work to teach kcsan
> about it.

Let's just stick to latch, no need to have more unnecessary work. As I
mentioned to Mykyta offline, latch's advantage (regardless of
performance implications) is that someone already thought and
validated all the memory barriers, so let's take advantage of that and
concentrate on other important aspects here.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
  2026-01-09 22:18   ` Andrii Nakryiko
@ 2026-01-12  6:47   ` Kumar Kartikeya Dwivedi
  2026-01-12  8:12     ` Kumar Kartikeya Dwivedi
  2 siblings, 1 reply; 40+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-01-12  6:47 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Refactor __bpf_async_set_callback() getting rid of locks. The idea of the
> algorithm is to store both callback_fn and prog in struct bpf_async_cb
> and verify that both pointers are stored, if any pointer does not
> match (because of the concurrent update), retry until complete match.
> On each iteration, increment refcnt of the prog that is going to
> be set and decrement the one that is evicted, ensuring that get/put are
> balanced, as each iteration has both inc/dec.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 61 ++++++++++++++++++----------------------------------
>  1 file changed, 21 insertions(+), 40 deletions(-)
>
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 9eaa4185e0a79b903c6fc2ccb310f521a4b14a1d..954bd61310a6ad3a0d540c1b1ebe8c35a9c0119c 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
>  };
>
>  static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
> -                                   struct bpf_prog_aux *aux, unsigned int flags,
> -                                   enum bpf_async_type type)
> +                                   struct bpf_prog *prog)
>  {
> -       struct bpf_prog *prev, *prog = aux->prog;
> -       struct bpf_async_cb *cb;
> -       int ret = 0;
> +       struct bpf_prog *prev;
> +       struct bpf_async_cb *cb = async->cb;
>
> -       if (in_nmi())
> -               return -EOPNOTSUPP;

I'm just nitpicking, but it may be cleaner to drop this hunk in the
commit which marks both NMI-safe, and keep this commit topical to
making this function lockless and ensuring nothing breaks.

> -       __bpf_spin_lock_irqsave(&async->lock);
> -       cb = async->cb;
> -       if (!cb) {
> -               ret = -EINVAL;
> -               goto out;
> -       }
> -       if (!atomic64_read(&cb->map->usercnt)) {
> -               /* maps with timers must be either held by user space
> -                * or pinned in bpffs. Otherwise timer might still be
> -                * running even when bpf prog is detached and user space
> -                * is gone, since map_release_uref won't ever be called.
> -                */
> -               ret = -EPERM;
> -               goto out;
> -       }
> -       prev = cb->prog;
> -       if (prev != prog) {
> -               /* Bump prog refcnt once. Every bpf_timer_set_callback()
> -                * can pick different callback_fn-s within the same prog.
> -                */
> -               prog = bpf_prog_inc_not_zero(prog);
> -               if (IS_ERR(prog)) {
> -                       ret = PTR_ERR(prog);
> -                       goto out;
> +       if (!cb)
> +               return -EPERM;
> +
> +       do {
> +               if (prog) {
> +                       prog = bpf_prog_inc_not_zero(prog);
> +                       if (IS_ERR(prog))
> +                               return PTR_ERR(prog);
>                 }
> +
> +               prev = xchg(&cb->prog, prog);
> +               rcu_assign_pointer(cb->callback_fn, callback_fn);
> +
>                 if (prev)
> -                       /* Drop prev prog refcnt when swapping with new prog */
>                         bpf_prog_put(prev);

I was wondering whether bpf_prog_put needs adjustment when done in NMI context.
At this point I forgot what my conclusion was when looking at it for
task_work, but there are two things that may require care:
1) whether in_hardirq() || irqs_disabled() is enough to detect NMI
context too. This seems to be fine on x86, but I am not sure it will
be the same across architectures.
2) whether schedule_work() can deadlock when invoked from NMI while
its locks are held in some lower context.

Once this function is opened up to be called in NMI context you can
have a reentrant async_set_callback on the same CPU. This will require
both dec_and_test() to drop to 0 on the same CPU somehow. Thus, we may
need to defer to irq_work for bpf_prog_put. queue_work() et. al. have
local_irq_save() so hardirq/task reentrancy should be prevented for them
(they are already supposed to work fine in such cases), even after we
lose our own irqsave through the spin lock.



> -               cb->prog = prog;
> -       }
> -       rcu_assign_pointer(cb->callback_fn, callback_fn);
> -out:
> -       __bpf_spin_unlock_irqrestore(&async->lock);
> -       return ret;
> +
> +       } while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
> +
> +       return 0;
>  }
>
>  BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
>            struct bpf_prog_aux *, aux)
>  {
> -       return __bpf_async_set_callback(timer, callback_fn, aux, 0, BPF_ASYNC_TYPE_TIMER);
> +       return __bpf_async_set_callback(timer, callback_fn, aux->prog);
>  }
>
>  static const struct bpf_func_proto bpf_timer_set_callback_proto = {
> @@ -3131,7 +3112,7 @@ __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
>         if (flags)
>                 return -EINVAL;
>
> -       return __bpf_async_set_callback(async, callback_fn, aux, flags, BPF_ASYNC_TYPE_WQ);
> +       return __bpf_async_set_callback(async, callback_fn, aux->prog);
>  }
>
>  __bpf_kfunc void bpf_preempt_disable(void)
>
> --
> 2.52.0
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 02/10] bpf: Factor out timer deletion helper
  2026-01-07 17:49 ` [PATCH RFC v3 02/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
@ 2026-01-12  6:51   ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 40+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-01-12  6:51 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Move the timer deletion logic into a dedicated bpf_timer_delete()
> helper so it can be reused by later patches.
>
> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel()
  2026-01-09 22:19     ` Andrii Nakryiko
@ 2026-01-12  7:29       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 40+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-01-12  7:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bot+bpf-ci, mykyta.yatsenko5, bpf, ast, andrii, daniel, kafai,
	kernel-team, eddyz87, yatsenko, martin.lau, yonghong.song, clm,
	ihor.solodrai

On Fri, 9 Jan 2026 at 23:19, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jan 7, 2026 at 10:22 AM <bot+bpf-ci@kernel.org> wrote:
> >
> > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > > index ff3c1e1160db..dc8ed948321e 100644
> > > --- a/kernel/bpf/helpers.c
> > > +++ b/kernel/bpf/helpers.c
> >
> > [ ... ]
> >
> > > @@ -1454,13 +1454,12 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
> > >
> > >       if (in_nmi())
> > >               return -EOPNOTSUPP;
> > > -     rcu_read_lock();
> > > -     __bpf_spin_lock_irqsave(&timer->lock);
> > > -     t = timer->timer;
> > > -     if (!t) {
> > > -             ret = -EINVAL;
> > > -             goto out;
> > > -     }
> > > +
> > > +     guard(rcu)();
> > > +
> > > +     t = async->timer;
>
> this should be READ_ONCE() now, no?
>
> > > +     if (!t)
> > > +             return -EINVAL;
> >
> > [ ... ]
> >
> > > @@ -1492,20 +1490,19 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
> > >                * cancelling and waiting for it synchronously, since it might
> > >                * do the same. Bail!
> > >                */
> > > -             ret = -EDEADLK;
> > > -             goto out;
> > > +             atomic_dec(&t->cancelling);
> > > +             return -EDEADLK;
> > >       }
> > > +
> > >  drop:
> > > -     drop_prog_refcnt(&t->cb);
> > > -out:
> > > -     __bpf_spin_unlock_irqrestore(&timer->lock);
> > > +     __bpf_async_set_callback(async, NULL, NULL);
> >
> > With the lock removed here, can bpf_timer_cancel() race with
> > __bpf_async_cancel_and_free()? The concern is that both paths can
> > attempt to drop the prog reference:
> >
> >   bpf_timer_cancel():
> >     __bpf_async_set_callback(async, NULL, NULL)
> >       -> prev = xchg(&cb->prog, NULL)
> >       -> bpf_prog_put(prev)
> >
> >   __bpf_async_cancel_and_free():
> >     -> drop_prog_refcnt(cb)
> >       -> prog = cb->prog  (plain read)
> >       -> bpf_prog_put(prog)
> >
> > The xchg in __bpf_async_set_callback() is atomic, but drop_prog_refcnt()
> > uses a plain read of cb->prog. If drop_prog_refcnt() reads cb->prog
> > before the xchg sets it to NULL, both paths would call bpf_prog_put() on
> > the same program, leading to a double put of the reference count.
> >
> > Note: This appears to be addressed later in the series by commit
> > 0035985a0a23 ("bpf: Enable bpf timer and workqueue use in NMI") which
> > moves bpf_prog_put() to the RCU callback, but the race window exists in
> > this intermediate commit.
>
> fair point, do we need to do this early on?
>

It might also make sense to just fold this patch into the later one.

>
> >
> > >       /* Cancel the timer and wait for associated callback to finish
> > >        * if it was running.
> > >        */
> > > -     ret = ret ?: hrtimer_cancel(&t->timer);
> > > +     ret = hrtimer_cancel(&t->timer);
> > > +
> > >       if (inc)
> > >               atomic_dec(&t->cancelling);
> > > -     rcu_read_unlock();
> > >       return ret;
> > >  }
> >
> >
> > ---
> > AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> >
> > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-07 17:49 ` [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
  2026-01-07 18:22   ` bot+bpf-ci
  2026-01-09 22:19   ` Andrii Nakryiko
@ 2026-01-12  8:10   ` Kumar Kartikeya Dwivedi
  2 siblings, 0 replies; 40+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-01-12  8:10 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Refactor bpf timer and workqueue helpers to allow calling them from NMI
> context by making all operations lock-free and deferring NMI-unsafe
> work to irq_work.
>
> Previously, bpf_timer_start(), and bpf_wq_start()
> could not be called from NMI context because they acquired
> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
> patch removes these limitations.
>
> Key changes:
>  * Remove bpf_spin_lock from struct bpf_async_kern. Replace locked
>    operations with atomic cmpxchg() for initialization and xchg() for
>    cancel and free.
>  * Add per-async irq_work to defer NMI-unsafe operations (hrtimer_start,
>    hrtimer_try_to_cancel, schedule_work) from NMI to softirq context.
>  * Use the lock-free mpmc_cell (added in the previous commit) to pass
>    operation commands (start/cancel/free) along with their parameters
>    (nsec, mode) from NMI-safe callers to the irq_work handler.
>  * Add reference counting to bpf_async_cb to ensure the object stays
>    alive until all scheduled irq_work completes and the timer/work
>    callback finishes.
>  * Move bpf_prog_put() to RCU callback to handle races between
>    set_callback() and cancel_and_free().
>
> This enables BPF programs attached to NMI-context hooks (perf
> events) to use timers and workqueues for deferred processing.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  [...]
>
> -BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
> +BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
>  {
>         struct bpf_hrtimer *t;
> -       int ret = 0;
> -       enum hrtimer_mode mode;
> +       u32 mode;
>
> -       if (in_nmi())
> -               return -EOPNOTSUPP;
>         if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
>                 return -EINVAL;
> -       __bpf_spin_lock_irqsave(&timer->lock);
> -       t = timer->timer;
> -       if (!t || !t->cb.prog) {
> -               ret = -EINVAL;
> -               goto out;
> -       }
> +
> +       guard(rcu)();
> +
> +       t = async->timer;
> +       if (!t || !t->cb.prog)

Read of t->cb.prog also needs READ_ONCE().

> +               return -EINVAL;
>
>         if (flags & BPF_F_TIMER_ABS)
>                 mode = HRTIMER_MODE_ABS_SOFT;
> @@ -1420,10 +1495,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla
>         if (flags & BPF_F_TIMER_CPU_PIN)
>                 mode |= HRTIMER_MODE_PINNED;
>
> -       hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> -out:
> -       __bpf_spin_unlock_irqrestore(&timer->lock);
> -       return ret;
> +       return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
>  }
>
>  [...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-12  6:47   ` Kumar Kartikeya Dwivedi
@ 2026-01-12  8:12     ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 40+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-01-12  8:12 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87,
	Mykyta Yatsenko

On Mon, 12 Jan 2026 at 07:47, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Wed, 7 Jan 2026 at 18:49, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
> >
> > From: Mykyta Yatsenko <yatsenko@meta.com>
> >
> > Refactor __bpf_async_set_callback() getting rid of locks. The idea of the
> > algorithm is to store both callback_fn and prog in struct bpf_async_cb
> > and verify that both pointers are stored, if any pointer does not
> > match (because of the concurrent update), retry until complete match.
> > On each iteration, increment refcnt of the prog that is going to
> > be set and decrement the one that is evicted, ensuring that get/put are
> > balanced, as each iteration has both inc/dec.
> >
> > Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> > ---
> >  kernel/bpf/helpers.c | 61 ++++++++++++++++++----------------------------------
> >  1 file changed, 21 insertions(+), 40 deletions(-)
> >
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index 9eaa4185e0a79b903c6fc2ccb310f521a4b14a1d..954bd61310a6ad3a0d540c1b1ebe8c35a9c0119c 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
> >  };
> >
> >  static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
> > -                                   struct bpf_prog_aux *aux, unsigned int flags,
> > -                                   enum bpf_async_type type)
> > +                                   struct bpf_prog *prog)
> >  {
> > -       struct bpf_prog *prev, *prog = aux->prog;
> > -       struct bpf_async_cb *cb;
> > -       int ret = 0;
> > +       struct bpf_prog *prev;
> > +       struct bpf_async_cb *cb = async->cb;
> >
> > -       if (in_nmi())
> > -               return -EOPNOTSUPP;
>
> I'm just nitpicking, but it may be cleaner to drop this hunk in the
> commit which marks both NMI-safe, and keep this commit topical to
> making this function lockless and ensuring nothing breaks.
>
> > -       __bpf_spin_lock_irqsave(&async->lock);
> > -       cb = async->cb;
> > -       if (!cb) {
> > -               ret = -EINVAL;
> > -               goto out;
> > -       }
> > -       if (!atomic64_read(&cb->map->usercnt)) {
> > -               /* maps with timers must be either held by user space
> > -                * or pinned in bpffs. Otherwise timer might still be
> > -                * running even when bpf prog is detached and user space
> > -                * is gone, since map_release_uref won't ever be called.
> > -                */
> > -               ret = -EPERM;
> > -               goto out;
> > -       }
> > -       prev = cb->prog;
> > -       if (prev != prog) {
> > -               /* Bump prog refcnt once. Every bpf_timer_set_callback()
> > -                * can pick different callback_fn-s within the same prog.
> > -                */
> > -               prog = bpf_prog_inc_not_zero(prog);
> > -               if (IS_ERR(prog)) {
> > -                       ret = PTR_ERR(prog);
> > -                       goto out;
> > +       if (!cb)
> > +               return -EPERM;
> > +
> > +       do {
> > +               if (prog) {
> > +                       prog = bpf_prog_inc_not_zero(prog);
> > +                       if (IS_ERR(prog))
> > +                               return PTR_ERR(prog);
> >                 }
> > +
> > +               prev = xchg(&cb->prog, prog);
> > +               rcu_assign_pointer(cb->callback_fn, callback_fn);
> > +
> >                 if (prev)
> > -                       /* Drop prev prog refcnt when swapping with new prog */
> >                         bpf_prog_put(prev);
>
> I was wondering whether bpf_prog_put needs adjustment when done in NMI context.
> At this point I forgot what my conclusion was when looking at it for
> task_work, but there are two things that may require care:
> 1) whether in_hardirq() || irqs_disabled() is enough to detect NMI
> context too. This seems to be fine on x86, but I am not sure it will
> be the same across architectures.
> 2) whether schedule_work() can deadlock when invoked from NMI while
> its locks are held in some lower context.
>
> Once this function is opened up to be called in NMI context you can
> have a reentrant async_set_callback on the same CPU. This will require
> both dec_and_test() to drop to 0 on the same CPU somehow. Thus, we may
> need to defer to irq_work for bpf_prog_put. queue_work() et. al. have
> local_irq_save() so hardirq/task reentrancy should be prevented for them
> (they are already supposed to work fine in such cases), even after we
> lose our own irqsave through the spin lock.
>
>

I saw that you handled this for other cases in the later patch, but I
guess it's missing for async_set_callback.

>
> > -               cb->prog = prog;
> > -       }
> > -       rcu_assign_pointer(cb->callback_fn, callback_fn);
> > -out:
> > -       __bpf_spin_unlock_irqrestore(&async->lock);
> > -       return ret;
> > +
> > +       } while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
> > +
> > +       return 0;
> >  }
> >
> >  BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
> >            struct bpf_prog_aux *, aux)
> >  {
> > -       return __bpf_async_set_callback(timer, callback_fn, aux, 0, BPF_ASYNC_TYPE_TIMER);
> > +       return __bpf_async_set_callback(timer, callback_fn, aux->prog);
> >  }
> >
> >  static const struct bpf_func_proto bpf_timer_set_callback_proto = {
> > @@ -3131,7 +3112,7 @@ __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
> >         if (flags)
> >                 return -EINVAL;
> >
> > -       return __bpf_async_set_callback(async, callback_fn, aux, flags, BPF_ASYNC_TYPE_WQ);
> > +       return __bpf_async_set_callback(async, callback_fn, aux->prog);
> >  }
> >
> >  __bpf_kfunc void bpf_preempt_disable(void)
> >
> > --
> > 2.52.0
> >

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-09 22:18     ` Andrii Nakryiko
@ 2026-01-13 13:58       ` Mykyta Yatsenko
  2026-01-13 16:14       ` Mykyta Yatsenko
  1 sibling, 0 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-13 13:58 UTC (permalink / raw)
  To: Andrii Nakryiko, bot+bpf-ci
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	yatsenko, martin.lau, yonghong.song, clm, ihor.solodrai

On 1/9/26 22:18, Andrii Nakryiko wrote:
> On Wed, Jan 7, 2026 at 10:22 AM <bot+bpf-ci@kernel.org> wrote:
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index 9eaa4185e0a7..954bd61310a6 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
>>>   };
>>>
>>>   static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
>>> -                                 struct bpf_prog_aux *aux, unsigned int flags,
>>> -                                 enum bpf_async_type type)
>>> +                                 struct bpf_prog *prog)
>>>   {
>>> -     struct bpf_prog *prev, *prog = aux->prog;
>>> -     struct bpf_async_cb *cb;
>>> -     int ret = 0;
>>> +     struct bpf_prog *prev;
>>> +     struct bpf_async_cb *cb = async->cb;
>>>
>>> -     if (in_nmi())
>>> -             return -EOPNOTSUPP;
>>> -     __bpf_spin_lock_irqsave(&async->lock);
>>> -     cb = async->cb;
>>> -     if (!cb) {
>>> -             ret = -EINVAL;
>>> -             goto out;
>>> -     }
>>> +     if (!cb)
>>> +             return -EPERM;
>> A small concern: the error code changed from -EINVAL to -EPERM. The verifier
>> comment at kernel/bpf/verifier.c mentions that non-initialized timers
>> "will return -EINVAL". Is the error code change intentional, and should
>> the verifier comment be updated?
>>
> good AI, I asked the same
>
>>> -     if (!atomic64_read(&cb->map->usercnt)) {
>>> -             /* maps with timers must be either held by user space
>>> -              * or pinned in bpffs. Otherwise timer might still be
>>> -              * running even when bpf prog is detached and user space
>>> -              * is gone, since map_release_uref won't ever be called.
>>> -              */
>>> -             ret = -EPERM;
>>> -             goto out;
>>> -     }
>> The usercnt check is removed here. The check still exists in
>> __bpf_async_init(), so a timer cannot be initialized when usercnt is zero.
>> But can there be issues if set_callback is called after usercnt drops to
>> zero on an already-initialized timer? The old code would fail with -EPERM,
>> while the new code proceeds.
>>
> Wasn't this discussed earlier, Mykyta? Do you remember the conclusion?
>
>>> -     prev = cb->prog;
>>> -     if (prev != prog) {
>>> -             /* Bump prog refcnt once. Every bpf_timer_set_callback()
>>> -              * can pick different callback_fn-s within the same prog.
>>> -              */
>>> -             prog = bpf_prog_inc_not_zero(prog);
>>> -             if (IS_ERR(prog)) {
>>> -                     ret = PTR_ERR(prog);
>>> -                     goto out;
>>> +     do {
>>> +             if (prog) {
>>> +                     prog = bpf_prog_inc_not_zero(prog);
>>> +                     if (IS_ERR(prog))
>>> +                             return PTR_ERR(prog);
>>>                }
>>> +
>>> +             prev = xchg(&cb->prog, prog);
>>> +             rcu_assign_pointer(cb->callback_fn, callback_fn);
>> Is there a potential race between these two operations? Consider two
>> concurrent callers (with programs P and Q) racing:
>>
>> 1. Thread A: xchg stores P into cb->prog
>> 2. Thread B: xchg stores Q into cb->prog, gets P back
>> 3. Thread B: rcu_assign_pointer stores FB (Q's callback)
>> 4. Thread B: bpf_prog_put(P), B's check passes, B exits
>> 5. Thread A: rcu_assign_pointer stores FA (P's callback), overwrites FB
>> 6. Thread A: check fails (cb->prog is Q, not P), retries
>>
>> After step 5, cb->prog holds Q (with Q's reference) but cb->callback_fn
>> points to FA (from P). If P's refcount drops to zero elsewhere and P is
>> freed, would the timer callback FA become a dangling pointer?
> AI is not completely wrong here, IMO. No, there is no use-after-free
> just yet because program is RCU protected, so we don't have dangling
> pointer just yet.
>
> But. That bpf_prog_inc_not_zero() on retry will fail if P's program
> refcount dropped to zero already. And then once we exit, we'll have
> Q+FA combo, which is not good.
>
> So I think we need to do one bpf_prog_inc_not_zero() outside of the
> loop once and keep that extra ref until we exit. And then instead of
> bpf_prog_inc_not_zero(prog) inside the loop, we do unconditional
> bpf_prog_inc(), which we *know* has to succeed. Less error handling
> inside the loop as well.
>
> Thoughts?
Sounds right, alternatively, we may try to reset both prog and callback
to NULL if prog's refcnt can't be incremented.
>
>> The bpf_timer_cb() function reads cb->callback_fn and calls it directly
>> without verifying it corresponds to cb->prog. The cb->prog field holds
>> the reference keeping the program alive, but if the callback pointer is
>> from a different program that may have been freed, this could be a problem.
>>
>>> +
>>>                if (prev)
>>> -                     /* Drop prev prog refcnt when swapping with new prog */
>>>                        bpf_prog_put(prev);
>>> -             cb->prog = prog;
>>> -     }
>>> -     rcu_assign_pointer(cb->callback_fn, callback_fn);
>>> -out:
>>> -     __bpf_spin_unlock_irqrestore(&async->lock);
>>> -     return ret;
>>> +
>>> +     } while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
>>> +
>>> +     return 0;
>>>   }
>> [ ... ]
>>
>>
>> ---
>> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
>> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>>
>> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback()
  2026-01-09 22:18     ` Andrii Nakryiko
  2026-01-13 13:58       ` Mykyta Yatsenko
@ 2026-01-13 16:14       ` Mykyta Yatsenko
  1 sibling, 0 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-13 16:14 UTC (permalink / raw)
  To: Andrii Nakryiko, bot+bpf-ci
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	yatsenko, martin.lau, yonghong.song, clm, ihor.solodrai

On 1/9/26 22:18, Andrii Nakryiko wrote:
> On Wed, Jan 7, 2026 at 10:22 AM <bot+bpf-ci@kernel.org> wrote:
>>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>>> index 9eaa4185e0a7..954bd61310a6 100644
>>> --- a/kernel/bpf/helpers.c
>>> +++ b/kernel/bpf/helpers.c
>>> @@ -1355,55 +1355,36 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
>>>   };
>>>
>>>   static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
>>> -                                 struct bpf_prog_aux *aux, unsigned int flags,
>>> -                                 enum bpf_async_type type)
>>> +                                 struct bpf_prog *prog)
>>>   {
>>> -     struct bpf_prog *prev, *prog = aux->prog;
>>> -     struct bpf_async_cb *cb;
>>> -     int ret = 0;
>>> +     struct bpf_prog *prev;
>>> +     struct bpf_async_cb *cb = async->cb;
>>>
>>> -     if (in_nmi())
>>> -             return -EOPNOTSUPP;
>>> -     __bpf_spin_lock_irqsave(&async->lock);
>>> -     cb = async->cb;
>>> -     if (!cb) {
>>> -             ret = -EINVAL;
>>> -             goto out;
>>> -     }
>>> +     if (!cb)
>>> +             return -EPERM;
>> A small concern: the error code changed from -EINVAL to -EPERM. The verifier
>> comment at kernel/bpf/verifier.c mentions that non-initialized timers
>> "will return -EINVAL". Is the error code change intentional, and should
>> the verifier comment be updated?
>>
> good AI, I asked the same
>
>>> -     if (!atomic64_read(&cb->map->usercnt)) {
>>> -             /* maps with timers must be either held by user space
>>> -              * or pinned in bpffs. Otherwise timer might still be
>>> -              * running even when bpf prog is detached and user space
>>> -              * is gone, since map_release_uref won't ever be called.
>>> -              */
>>> -             ret = -EPERM;
>>> -             goto out;
>>> -     }
>> The usercnt check is removed here. The check still exists in
>> __bpf_async_init(), so a timer cannot be initialized when usercnt is zero.
>> But can there be issues if set_callback is called after usercnt drops to
>> zero on an already-initialized timer? The old code would fail with -EPERM,
>> while the new code proceeds.
>>
> Wasn't this discussed earlier, Mykyta? Do you remember the conclusion?
I think what I missed is to put rcu lock in this function, because the
last prog reference is put in the rcu callback (the change this patchset 
introduces).
This way we make sure that whatever is set here will be put if map's 
usercount goes to 0.
>
>>> -     prev = cb->prog;
>>> -     if (prev != prog) {
>>> -             /* Bump prog refcnt once. Every bpf_timer_set_callback()
>>> -              * can pick different callback_fn-s within the same prog.
>>> -              */
>>> -             prog = bpf_prog_inc_not_zero(prog);
>>> -             if (IS_ERR(prog)) {
>>> -                     ret = PTR_ERR(prog);
>>> -                     goto out;
>>> +     do {
>>> +             if (prog) {
>>> +                     prog = bpf_prog_inc_not_zero(prog);
>>> +                     if (IS_ERR(prog))
>>> +                             return PTR_ERR(prog);
>>>                }
>>> +
>>> +             prev = xchg(&cb->prog, prog);
>>> +             rcu_assign_pointer(cb->callback_fn, callback_fn);
>> Is there a potential race between these two operations? Consider two
>> concurrent callers (with programs P and Q) racing:
>>
>> 1. Thread A: xchg stores P into cb->prog
>> 2. Thread B: xchg stores Q into cb->prog, gets P back
>> 3. Thread B: rcu_assign_pointer stores FB (Q's callback)
>> 4. Thread B: bpf_prog_put(P), B's check passes, B exits
>> 5. Thread A: rcu_assign_pointer stores FA (P's callback), overwrites FB
>> 6. Thread A: check fails (cb->prog is Q, not P), retries
>>
>> After step 5, cb->prog holds Q (with Q's reference) but cb->callback_fn
>> points to FA (from P). If P's refcount drops to zero elsewhere and P is
>> freed, would the timer callback FA become a dangling pointer?
> AI is not completely wrong here, IMO. No, there is no use-after-free
> just yet because program is RCU protected, so we don't have dangling
> pointer just yet.
>
> But. That bpf_prog_inc_not_zero() on retry will fail if P's program
> refcount dropped to zero already. And then once we exit, we'll have
> Q+FA combo, which is not good.
>
> So I think we need to do one bpf_prog_inc_not_zero() outside of the
> loop once and keep that extra ref until we exit. And then instead of
> bpf_prog_inc_not_zero(prog) inside the loop, we do unconditional
> bpf_prog_inc(), which we *know* has to succeed. Less error handling
> inside the loop as well.
>
> Thoughts?
>
>
>> The bpf_timer_cb() function reads cb->callback_fn and calls it directly
>> without verifying it corresponds to cb->prog. The cb->prog field holds
>> the reference keeping the program alive, but if the callback pointer is
>> from a different program that may have been freed, this could be a problem.
>>
>>> +
>>>                if (prev)
>>> -                     /* Drop prev prog refcnt when swapping with new prog */
>>>                        bpf_prog_put(prev);
>>> -             cb->prog = prog;
>>> -     }
>>> -     rcu_assign_pointer(cb->callback_fn, callback_fn);
>>> -out:
>>> -     __bpf_spin_unlock_irqrestore(&async->lock);
>>> -     return ret;
>>> +
>>> +     } while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
>>> +
>>> +     return 0;
>>>   }
>> [ ... ]
>>
>>
>> ---
>> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
>> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>>
>> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/20791345842


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-09 22:19   ` Andrii Nakryiko
@ 2026-01-14 14:53     ` Mykyta Yatsenko
  2026-01-15 18:39       ` Andrii Nakryiko
  0 siblings, 1 reply; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-14 14:53 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On 1/9/26 22:19, Andrii Nakryiko wrote:
> On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Refactor bpf timer and workqueue helpers to allow calling them from NMI
>> context by making all operations lock-free and deferring NMI-unsafe
>> work to irq_work.
>>
>> Previously, bpf_timer_start(), and bpf_wq_start()
>> could not be called from NMI context because they acquired
>> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
>> patch removes these limitations.
>>
>> Key changes:
>>   * Remove bpf_spin_lock from struct bpf_async_kern. Replace locked
>>     operations with atomic cmpxchg() for initialization and xchg() for
>>     cancel and free.
>>   * Add per-async irq_work to defer NMI-unsafe operations (hrtimer_start,
>>     hrtimer_try_to_cancel, schedule_work) from NMI to softirq context.
>>   * Use the lock-free mpmc_cell (added in the previous commit) to pass
>>     operation commands (start/cancel/free) along with their parameters
>>     (nsec, mode) from NMI-safe callers to the irq_work handler.
>>   * Add reference counting to bpf_async_cb to ensure the object stays
>>     alive until all scheduled irq_work completes and the timer/work
>>     callback finishes.
>>   * Move bpf_prog_put() to RCU callback to handle races between
>>     set_callback() and cancel_and_free().
>>
>> This enables BPF programs attached to NMI-context hooks (perf
>> events) to use timers and workqueues for deferred processing.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
>>   kernel/bpf/helpers.c | 288 ++++++++++++++++++++++++++++++++++-----------------
>>   1 file changed, 191 insertions(+), 97 deletions(-)
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index dc8ed948321e6c535d2cc2e8f9fbdd0636cdcabf..b90b005a17e1de9c0c62056a665d124b883c6320 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -29,6 +29,7 @@
>>   #include <linux/task_work.h>
>>   #include <linux/irq_work.h>
>>   #include <linux/buildid.h>
>> +#include <mpmc_cell.h>
>>
>>   #include "../../lib/kstrtox.h"
>>
>> @@ -1095,6 +1096,23 @@ static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
>>          return (void *)value - round_up(map->key_size, 8);
>>   }
>>
>> +enum bpf_async_type {
>> +       BPF_ASYNC_TYPE_TIMER = 0,
>> +       BPF_ASYNC_TYPE_WQ,
>> +};
>> +
>> +enum bpf_async_op {
>> +       BPF_ASYNC_START,
>> +       BPF_ASYNC_CANCEL,
>> +       BPF_ASYNC_CANCEL_AND_FREE,
>> +};
>> +
>> +struct bpf_async_cmd {
>> +       u64 nsec;
>> +       u32 mode;
>> +       u32 op;
>> +};
>> +
>>   struct bpf_async_cb {
>>          struct bpf_map *map;
>>          struct bpf_prog *prog;
>> @@ -1105,6 +1123,12 @@ struct bpf_async_cb {
>>                  struct work_struct delete_work;
>>          };
>>          u64 flags;
>> +       struct irq_work worker;
>> +       struct bpf_mpmc_cell_ctl ctl;
> nit: I'd call it more meaningful "cmd_cell"
>
>> +       struct bpf_async_cmd cmd[2];
>> +       atomic_t last_seq;
>> +       refcount_t refcnt;
>> +       enum bpf_async_type type;
>>   };
>>
>>   /* BPF map elements can contain 'struct bpf_timer'.
>> @@ -1142,18 +1166,8 @@ struct bpf_async_kern {
>>                  struct bpf_hrtimer *timer;
>>                  struct bpf_work *work;
>>          };
>> -       /* bpf_spin_lock is used here instead of spinlock_t to make
>> -        * sure that it always fits into space reserved by struct bpf_timer
>> -        * regardless of LOCKDEP and spinlock debug flags.
>> -        */
>> -       struct bpf_spin_lock lock;
> we have to leave dummy placeholder instead of preserve bpf_async_kern
> size for ABI compatibility
>
>>   } __attribute__((aligned(8)));
>>
>> -enum bpf_async_type {
>> -       BPF_ASYNC_TYPE_TIMER = 0,
>> -       BPF_ASYNC_TYPE_WQ,
>> -};
>> -
>>   static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
>>
>>   static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
>> @@ -1219,6 +1233,13 @@ static void bpf_async_cb_rcu_free(struct rcu_head *rcu)
>>   {
>>          struct bpf_async_cb *cb = container_of(rcu, struct bpf_async_cb, rcu);
>>
>> +       /*
>> +        * Drop the last reference to prog only after RCU GP, as set_callback()
>> +        * may race with cancel_and_free()
>> +        */
>> +       if (cb->prog)
>> +               bpf_prog_put(cb->prog);
>> +
>>          kfree_nolock(cb);
>>   }
>>
>> @@ -1246,18 +1267,17 @@ static void bpf_timer_delete_work(struct work_struct *work)
>>          call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
>>   }
>>
>> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async);
>> +static void bpf_async_irq_worker(struct irq_work *work);
>> +
>>   static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
>>                              enum bpf_async_type type)
>>   {
>> -       struct bpf_async_cb *cb;
>> +       struct bpf_async_cb *cb, *old_cb;
>>          struct bpf_hrtimer *t;
>>          struct bpf_work *w;
>>          clockid_t clockid;
>>          size_t size;
>> -       int ret = 0;
>> -
>> -       if (in_nmi())
>> -               return -EOPNOTSUPP;
>>
>>          switch (type) {
>>          case BPF_ASYNC_TYPE_TIMER:
>> @@ -1270,18 +1290,13 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>>                  return -EINVAL;
>>          }
>>
>> -       __bpf_spin_lock_irqsave(&async->lock);
>>          t = async->timer;
> READ_ONCE()?
>
>> -       if (t) {
>> -               ret = -EBUSY;
>> -               goto out;
>> -       }
>> +       if (t)
>> +               return -EBUSY;
>>
>>          cb = bpf_map_kmalloc_nolock(map, size, 0, map->numa_node);
>> -       if (!cb) {
>> -               ret = -ENOMEM;
>> -               goto out;
>> -       }
>> +       if (!cb)
>> +               return -ENOMEM;
>>
>>          switch (type) {
>>          case BPF_ASYNC_TYPE_TIMER:
>> @@ -1304,9 +1319,19 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>>          cb->map = map;
>>          cb->prog = NULL;
>>          cb->flags = flags;
>> +       cb->worker = IRQ_WORK_INIT(bpf_async_irq_worker);
>> +       bpf_mpmc_cell_init(&cb->ctl, &cb->cmd[0], &cb->cmd[1]);
>> +       refcount_set(&cb->refcnt, 1); /* map's reference */
>> +       atomic_set(&cb->last_seq, 0);
>> +       cb->type = type;
>>          rcu_assign_pointer(cb->callback_fn, NULL);
>>
>> -       WRITE_ONCE(async->cb, cb);
>> +       old_cb = cmpxchg(&async->cb, NULL, cb);
>> +       if (old_cb) {
>> +               /* Lost the race to initialize this bpf_async_kern, drop the allocated object */
>> +               kfree_nolock(cb);
>> +               return -EBUSY;
>> +       }
>>          /* Guarantee the order between async->cb and map->usercnt. So
>>           * when there are concurrent uref release and bpf timer init, either
>>           * bpf_timer_cancel_and_free() called by uref release reads a no-NULL
>> @@ -1317,13 +1342,11 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
>>                  /* maps with timers must be either held by user space
>>                   * or pinned in bpffs.
>>                   */
>> -               WRITE_ONCE(async->cb, NULL);
>> -               kfree_nolock(cb);
>> -               ret = -EPERM;
>> +               __bpf_async_cancel_and_free(async);
>> +               return -EPERM;
>>          }
>> -out:
>> -       __bpf_spin_unlock_irqrestore(&async->lock);
>> -       return ret;
>> +
>> +       return 0;
>>   }
>>
>>   BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
>> @@ -1354,6 +1377,61 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
>>          .arg3_type      = ARG_ANYTHING,
>>   };
>>
>> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, u32 op, u64 nsec, u32 timer_mode)
>> +{
>> +       struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
>> +       struct bpf_async_cmd *cmd;
>> +
>> +       cmd = bpf_mpmc_cell_write_begin(ctl);
>> +       if (!cmd)
>> +               return -EBUSY;
>> +
>> +       cmd->nsec = nsec;
>> +       cmd->mode = timer_mode;
>> +       cmd->op = op;
>> +
>> +       bpf_mpmc_cell_write_commit(ctl);
>> +
>> +       if (!refcount_inc_not_zero(&cb->refcnt))
>> +               return -EBUSY;
>> +
>> +       irq_work_queue(&cb->worker);
> if not in NMI and not irq-disabled mode, we should be able to call
> bpf_async_irq_worker() directly here and execute action synchronously
> without irq_work execution. Add TODO so we don't forget to implement
> that before patch set lands?
>
>> +
>> +       return 0;
>> +}
>> +
>> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
>> +                            u64 *nsec, u32 *flags)
>> +{
>> +       struct bpf_mpmc_cell_ctl *ctl = &cb->ctl;
>> +       struct bpf_async_cmd *cmd;
>> +       u32 seq, last_seq;
>> +
>> +       do {
>> +               last_seq = atomic_read_acquire(&cb->last_seq);
>> +               cmd = bpf_mpmc_cell_read_begin(ctl, &seq);
>> +
>> +               /* Return -EBUSY if current seq is consumed by another reader */
>> +               if (seq == last_seq)
>> +                       return -EBUSY;
>> +
>> +               *nsec = cmd->nsec;
>> +               *flags = cmd->mode;
>> +               *op = cmd->op;
>> +
>> +       /*
>> +        * Retry read on one of the two conditions:
>> +        *  1. Some writer produced new snapshot while we were reading. Our snapshot may have been
>> +        *     modified, and not consistent.
>> +        *  2. Another reader consumed some snapshot. We need to validate that this snapshot is not
>> +        *     consumed. This prevents duplicate op processing.
>> +        */
>> +       } while (bpf_mpmc_cell_read_end(ctl, seq) == -EAGAIN ||
> can read_end return any other error? If yes, how do we handle? If not,
> why hard-code -EAGAIN here, maybe just `< 0` check?
>
>> +                atomic_cmpxchg_release(&cb->last_seq, last_seq, seq) != last_seq);
> nit: repeat condition is complicated enough (and requires multi-line
> weirdly indented comment), so I'd do:
>
> while (true) {
>     ....
>
>      if (bpf_mpmc_cell_read_end(ctl, seq) < 0)
>          continue;
>      if (atomic_cmpxchg() == last_seq) /* success*/
>          break;
>
>
> (or invert cmpxchg + continue, and then stand-alone break)
>
>> +
>> +       return 0;
>> +}
>> +
>>   static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
>>                                      struct bpf_prog *prog)
>>   {
>> @@ -1395,22 +1473,19 @@ static const struct bpf_func_proto bpf_timer_set_callback_proto = {
>>          .arg2_type      = ARG_PTR_TO_FUNC,
>>   };
>>
>> -BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
>> +BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
>>   {
>>          struct bpf_hrtimer *t;
>> -       int ret = 0;
>> -       enum hrtimer_mode mode;
>> +       u32 mode;
>>
>> -       if (in_nmi())
>> -               return -EOPNOTSUPP;
>>          if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
>>                  return -EINVAL;
>> -       __bpf_spin_lock_irqsave(&timer->lock);
>> -       t = timer->timer;
>> -       if (!t || !t->cb.prog) {
>> -               ret = -EINVAL;
>> -               goto out;
>> -       }
>> +
>> +       guard(rcu)();
>> +
>> +       t = async->timer;
> READ_ONCE()
>
>> +       if (!t || !t->cb.prog)
>> +               return -EINVAL;
>>
>>          if (flags & BPF_F_TIMER_ABS)
>>                  mode = HRTIMER_MODE_ABS_SOFT;
>> @@ -1420,10 +1495,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla
>>          if (flags & BPF_F_TIMER_CPU_PIN)
>>                  mode |= HRTIMER_MODE_PINNED;
>>
>> -       hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
>> -out:
>> -       __bpf_spin_unlock_irqrestore(&timer->lock);
>> -       return ret;
>> +       return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
>>   }
>>
>>   static const struct bpf_func_proto bpf_timer_start_proto = {
>> @@ -1435,17 +1507,6 @@ static const struct bpf_func_proto bpf_timer_start_proto = {
>>          .arg3_type      = ARG_ANYTHING,
>>   };
>>
>> -static void drop_prog_refcnt(struct bpf_async_cb *async)
>> -{
>> -       struct bpf_prog *prog = async->prog;
>> -
>> -       if (prog) {
>> -               bpf_prog_put(prog);
>> -               async->prog = NULL;
>> -               rcu_assign_pointer(async->callback_fn, NULL);
>> -       }
>> -}
>> -
>>   BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, async)
>>   {
>>          struct bpf_hrtimer *t, *cur_t;
>> @@ -1513,27 +1574,16 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
>>          .arg1_type      = ARG_PTR_TO_TIMER,
>>   };
>>
>> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
>> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>>   {
>>          struct bpf_async_cb *cb;
>>
>> -       /* Performance optimization: read async->cb without lock first. */
>> -       if (!READ_ONCE(async->cb))
>> -               return NULL;
>> -
>> -       __bpf_spin_lock_irqsave(&async->lock);
>> -       /* re-read it under lock */
>> -       cb = async->cb;
>> +       cb = xchg(&async->cb, NULL);
>>          if (!cb)
>> -               goto out;
>> -       drop_prog_refcnt(cb);
>> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>> -        * this timer, since it won't be initialized.
>> -        */
>> -       WRITE_ONCE(async->cb, NULL);
>> -out:
>> -       __bpf_spin_unlock_irqrestore(&async->lock);
>> -       return cb;
>> +               return;
>> +
>> +       /* Consume map's refcnt */
>> +       irq_work_queue(&cb->worker);
> hm... this is subtle (and maybe broken?) irq_work_queue() can be
> ignored here, if there is already another one scheduled, so I think
> your clever idea with CANCEL_AND_FREE being done based on this
> refcount drop is flawed...
>
> CANCEL_AND_FREE has to succeed, so it's out-of-bounds signal that
> shouldn't be going through that command cell, yes. But can't we just
> have a simple one-way bool that will be set to true here (+ memory
> barriers, maybe), and then irq_work_queue() scheduled. If there is irq
> work is scheduled, it will inevitable will see this flag (even if it's
> not our callback), and if not, then irq_work_queue() will successfully
> schedule callback which will also clean up.
Thanks for pointing to this issue.
I don't think we can solve it with a simple bool, because cleanup
should only happen once, and only after refcnt is 0, otherwise we need 
to go
back to states to indicate cleanup initiation and cleanup entering (to 
implement
mutual exclusion of the irq_work callbacks trying to run cleanup).
We can still solve this with the refcnt: Drop reference, if refcnt is not 0,
we successfully released map reference and one of the scheduled irq_work 
callbacks
will cleanup, otherwise we took the last reference, all we need is to 
schedule new irq_work
(which can't fail)

     if (!refcount_dec_and_test(&cb->refcnt))
         return;

     /* We took the last reference, need to schedule cleanup */
     refcount_set(&cb->refcnt, 1);
     irq_work_queue(&cb->worker);

>
> also, same about TODO for irq_work_queue() avoidance
>
>
>>   }
>>
>>   static void bpf_timer_delete(struct bpf_hrtimer *t)
>> @@ -1588,19 +1638,76 @@ static void bpf_timer_delete(struct bpf_hrtimer *t)
>>          }
>>   }
>>
>> +static void bpf_async_process_op(struct bpf_async_cb *cb, u32 op,
>> +                                u64 timer_nsec, u32 timer_mode)
>> +{
>> +       switch (cb->type) {
>> +       case BPF_ASYNC_TYPE_TIMER: {
>> +               struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
>> +
>> +               switch (op) {
>> +               case BPF_ASYNC_START:
>> +                       hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
>> +                       break;
>> +               case BPF_ASYNC_CANCEL:
>> +                       hrtimer_try_to_cancel(&t->timer);
>> +                       break;
>> +               case BPF_ASYNC_CANCEL_AND_FREE:
>> +                       bpf_timer_delete(t);
>> +                       break;
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       }
>> +       case BPF_ASYNC_TYPE_WQ: {
>> +               struct bpf_work *w = container_of(cb, struct bpf_work, cb);
>> +
>> +               switch (op) {
>> +               case BPF_ASYNC_START:
>> +                       schedule_work(&w->work);
>> +                       break;
>> +               case BPF_ASYNC_CANCEL_AND_FREE:
>> +                       /*
>> +                        * Trigger cancel of the sleepable work, but *do not* wait for
>> +                        * it to finish.
>> +                        * kfree will be called once the work has finished.
>> +                        */
>> +                       schedule_work(&w->delete_work);
>> +                       break;
>> +               default:
>> +                       break;
>> +               }
>> +               break;
>> +       }
>> +       }
>> +}
>> +
>> +static void bpf_async_irq_worker(struct irq_work *work)
>> +{
>> +       struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
>> +       u32 op, timer_mode;
>> +       u64 nsec;
>> +       int err;
>> +
>> +       err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
>> +       if (err)
>> +               goto out;
>> +
>> +       bpf_async_process_op(cb, op, nsec, timer_mode);
>> +
>> +out:
>> +       if (refcount_dec_and_test(&cb->refcnt))
>> +               bpf_async_process_op(cb, BPF_ASYNC_CANCEL_AND_FREE, 0, 0);
>> +}
>> +
>>   /*
>>    * This function is called by map_delete/update_elem for individual element and
>>    * by ops->map_release_uref when the user space reference to a map reaches zero.
>>    */
>>   void bpf_timer_cancel_and_free(void *val)
>>   {
>> -       struct bpf_hrtimer *t;
>> -
>> -       t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
>> -       if (!t)
>> -               return;
>> -
>> -       bpf_timer_delete(t);
>> +       __bpf_async_cancel_and_free(val);
>>   }
>>
>>   /* This function is called by map_delete/update_elem for individual element and
>> @@ -1608,19 +1715,7 @@ void bpf_timer_cancel_and_free(void *val)
>>    */
>>   void bpf_wq_cancel_and_free(void *val)
>>   {
>> -       struct bpf_work *work;
>> -
>> -       BTF_TYPE_EMIT(struct bpf_wq);
>> -
>> -       work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
>> -       if (!work)
>> -               return;
>> -       /* Trigger cancel of the sleepable work, but *do not* wait for
>> -        * it to finish if it was running as we might not be in a
>> -        * sleepable context.
>> -        * kfree will be called once the work has finished.
>> -        */
>> -       schedule_work(&work->delete_work);
>> +       __bpf_async_cancel_and_free(val);
>>   }
>>
>>   BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
>> @@ -3093,15 +3188,14 @@ __bpf_kfunc int bpf_wq_start(struct bpf_wq *wq, unsigned int flags)
>>          struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
>>          struct bpf_work *w;
>>
>> -       if (in_nmi())
>> -               return -EOPNOTSUPP;
>>          if (flags)
>>                  return -EINVAL;
>>          w = READ_ONCE(async->work);
>>          if (!w || !READ_ONCE(w->cb.prog))
>>                  return -EINVAL;
>>
>> -       schedule_work(&w->work);
>> +       bpf_async_schedule_op(&w->cb, BPF_ASYNC_START, 0, 0);
>> +
>>          return 0;
>>   }
>>
>>
>> --
>> 2.52.0
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-14 14:53     ` Mykyta Yatsenko
@ 2026-01-15 18:39       ` Andrii Nakryiko
  2026-01-15 18:52         ` Mykyta Yatsenko
  0 siblings, 1 reply; 40+ messages in thread
From: Andrii Nakryiko @ 2026-01-15 18:39 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Wed, Jan 14, 2026 at 6:53 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 1/9/26 22:19, Andrii Nakryiko wrote:
> > On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> >> From: Mykyta Yatsenko <yatsenko@meta.com>
> >>
> >> Refactor bpf timer and workqueue helpers to allow calling them from NMI
> >> context by making all operations lock-free and deferring NMI-unsafe
> >> work to irq_work.
> >>
> >> Previously, bpf_timer_start(), and bpf_wq_start()
> >> could not be called from NMI context because they acquired
> >> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
> >> patch removes these limitations.
> >>
> >> Key changes:
> >>   * Remove bpf_spin_lock from struct bpf_async_kern. Replace locked
> >>     operations with atomic cmpxchg() for initialization and xchg() for
> >>     cancel and free.
> >>   * Add per-async irq_work to defer NMI-unsafe operations (hrtimer_start,
> >>     hrtimer_try_to_cancel, schedule_work) from NMI to softirq context.
> >>   * Use the lock-free mpmc_cell (added in the previous commit) to pass
> >>     operation commands (start/cancel/free) along with their parameters
> >>     (nsec, mode) from NMI-safe callers to the irq_work handler.
> >>   * Add reference counting to bpf_async_cb to ensure the object stays
> >>     alive until all scheduled irq_work completes and the timer/work
> >>     callback finishes.
> >>   * Move bpf_prog_put() to RCU callback to handle races between
> >>     set_callback() and cancel_and_free().
> >>
> >> This enables BPF programs attached to NMI-context hooks (perf
> >> events) to use timers and workqueues for deferred processing.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> >> ---
> >>   kernel/bpf/helpers.c | 288 ++++++++++++++++++++++++++++++++++-----------------
> >>   1 file changed, 191 insertions(+), 97 deletions(-)
> >>

please trim irrelevant parts to shorten distractions in email

[...]

> >> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
> >> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
> >>   {
> >>          struct bpf_async_cb *cb;
> >>
> >> -       /* Performance optimization: read async->cb without lock first. */
> >> -       if (!READ_ONCE(async->cb))
> >> -               return NULL;
> >> -
> >> -       __bpf_spin_lock_irqsave(&async->lock);
> >> -       /* re-read it under lock */
> >> -       cb = async->cb;
> >> +       cb = xchg(&async->cb, NULL);
> >>          if (!cb)
> >> -               goto out;
> >> -       drop_prog_refcnt(cb);
> >> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> >> -        * this timer, since it won't be initialized.
> >> -        */
> >> -       WRITE_ONCE(async->cb, NULL);
> >> -out:
> >> -       __bpf_spin_unlock_irqrestore(&async->lock);
> >> -       return cb;
> >> +               return;
> >> +
> >> +       /* Consume map's refcnt */
> >> +       irq_work_queue(&cb->worker);
> > hm... this is subtle (and maybe broken?) irq_work_queue() can be
> > ignored here, if there is already another one scheduled, so I think
> > your clever idea with CANCEL_AND_FREE being done based on this
> > refcount drop is flawed...
> >
> > CANCEL_AND_FREE has to succeed, so it's out-of-bounds signal that
> > shouldn't be going through that command cell, yes. But can't we just
> > have a simple one-way bool that will be set to true here (+ memory
> > barriers, maybe), and then irq_work_queue() scheduled. If there is irq
> > work is scheduled, it will inevitable will see this flag (even if it's
> > not our callback), and if not, then irq_work_queue() will successfully
> > schedule callback which will also clean up.
> Thanks for pointing to this issue.
> I don't think we can solve it with a simple bool, because cleanup
> should only happen once, and only after refcnt is 0, otherwise we need
> to go
> back to states to indicate cleanup initiation and cleanup entering (to
> implement
> mutual exclusion of the irq_work callbacks trying to run cleanup).

I meant bool as a command indicator of sorts, which is just part of
the solution. We do have reader's mutual exclusion with last_seq,
don't we? And the idea was that once the winning reader notices that
shutdown was requested, they can perform it and make sure that
subsequent readers (if there are any scheduled) would do nothing.
E.g., by setting last_seq to some special large value to indicate "we
are done, no more commands will be accepted".

But let me go and read the latest version and refresh what's going on
there in my mind.

> We can still solve this with the refcnt: Drop reference, if refcnt is not 0,
> we successfully released map reference and one of the scheduled irq_work
> callbacks
> will cleanup, otherwise we took the last reference, all we need is to
> schedule new irq_work
> (which can't fail)
>
>      if (!refcount_dec_and_test(&cb->refcnt))
>          return;
>
>      /* We took the last reference, need to schedule cleanup */
>      refcount_set(&cb->refcnt, 1);
>      irq_work_queue(&cb->worker);
>

[...]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-15 18:39       ` Andrii Nakryiko
@ 2026-01-15 18:52         ` Mykyta Yatsenko
  0 siblings, 0 replies; 40+ messages in thread
From: Mykyta Yatsenko @ 2026-01-15 18:52 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On 1/15/26 18:39, Andrii Nakryiko wrote:
> On Wed, Jan 14, 2026 at 6:53 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> On 1/9/26 22:19, Andrii Nakryiko wrote:
>>> On Wed, Jan 7, 2026 at 9:49 AM Mykyta Yatsenko
>>> <mykyta.yatsenko5@gmail.com> wrote:
>>>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>>>
>>>> Refactor bpf timer and workqueue helpers to allow calling them from NMI
>>>> context by making all operations lock-free and deferring NMI-unsafe
>>>> work to irq_work.
>>>>
>>>> Previously, bpf_timer_start(), and bpf_wq_start()
>>>> could not be called from NMI context because they acquired
>>>> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
>>>> patch removes these limitations.
>>>>
>>>> Key changes:
>>>>    * Remove bpf_spin_lock from struct bpf_async_kern. Replace locked
>>>>      operations with atomic cmpxchg() for initialization and xchg() for
>>>>      cancel and free.
>>>>    * Add per-async irq_work to defer NMI-unsafe operations (hrtimer_start,
>>>>      hrtimer_try_to_cancel, schedule_work) from NMI to softirq context.
>>>>    * Use the lock-free mpmc_cell (added in the previous commit) to pass
>>>>      operation commands (start/cancel/free) along with their parameters
>>>>      (nsec, mode) from NMI-safe callers to the irq_work handler.
>>>>    * Add reference counting to bpf_async_cb to ensure the object stays
>>>>      alive until all scheduled irq_work completes and the timer/work
>>>>      callback finishes.
>>>>    * Move bpf_prog_put() to RCU callback to handle races between
>>>>      set_callback() and cancel_and_free().
>>>>
>>>> This enables BPF programs attached to NMI-context hooks (perf
>>>> events) to use timers and workqueues for deferred processing.
>>>>
>>>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>>>> ---
>>>>    kernel/bpf/helpers.c | 288 ++++++++++++++++++++++++++++++++++-----------------
>>>>    1 file changed, 191 insertions(+), 97 deletions(-)
>>>>
> please trim irrelevant parts to shorten distractions in email
>
> [...]
>
>>>> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
>>>> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>>>>    {
>>>>           struct bpf_async_cb *cb;
>>>>
>>>> -       /* Performance optimization: read async->cb without lock first. */
>>>> -       if (!READ_ONCE(async->cb))
>>>> -               return NULL;
>>>> -
>>>> -       __bpf_spin_lock_irqsave(&async->lock);
>>>> -       /* re-read it under lock */
>>>> -       cb = async->cb;
>>>> +       cb = xchg(&async->cb, NULL);
>>>>           if (!cb)
>>>> -               goto out;
>>>> -       drop_prog_refcnt(cb);
>>>> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>>>> -        * this timer, since it won't be initialized.
>>>> -        */
>>>> -       WRITE_ONCE(async->cb, NULL);
>>>> -out:
>>>> -       __bpf_spin_unlock_irqrestore(&async->lock);
>>>> -       return cb;
>>>> +               return;
>>>> +
>>>> +       /* Consume map's refcnt */
>>>> +       irq_work_queue(&cb->worker);
>>> hm... this is subtle (and maybe broken?) irq_work_queue() can be
>>> ignored here, if there is already another one scheduled, so I think
>>> your clever idea with CANCEL_AND_FREE being done based on this
>>> refcount drop is flawed...
>>>
>>> CANCEL_AND_FREE has to succeed, so it's out-of-bounds signal that
>>> shouldn't be going through that command cell, yes. But can't we just
>>> have a simple one-way bool that will be set to true here (+ memory
>>> barriers, maybe), and then irq_work_queue() scheduled. If there is irq
>>> work is scheduled, it will inevitable will see this flag (even if it's
>>> not our callback), and if not, then irq_work_queue() will successfully
>>> schedule callback which will also clean up.
>> Thanks for pointing to this issue.
>> I don't think we can solve it with a simple bool, because cleanup
>> should only happen once, and only after refcnt is 0, otherwise we need
>> to go
>> back to states to indicate cleanup initiation and cleanup entering (to
>> implement
>> mutual exclusion of the irq_work callbacks trying to run cleanup).
> I meant bool as a command indicator of sorts, which is just part of
> the solution. We do have reader's mutual exclusion with last_seq,
> don't we? And the idea was that once the winning reader notices that
> shutdown was requested, they can perform it and make sure that
> subsequent readers (if there are any scheduled) would do nothing.
> E.g., by setting last_seq to some special large value to indicate "we
> are done, no more commands will be accepted".
Winning reader that runs cleanup can't tell other pending readers to not
do anything, because after cleanup we can't guarantee that the bpf_async_cb
is not freed, other readers have to access it, that's why we have refcnt 
that
is the mechanism to make sure the last reader does cleanup.
>
> But let me go and read the latest version and refresh what's going on
> there in my mind.
>
>> We can still solve this with the refcnt: Drop reference, if refcnt is not 0,
>> we successfully released map reference and one of the scheduled irq_work
>> callbacks
>> will cleanup, otherwise we took the last reference, all we need is to
>> schedule new irq_work
>> (which can't fail)
>>
>>       if (!refcount_dec_and_test(&cb->refcnt))
>>           return;
>>
>>       /* We took the last reference, need to schedule cleanup */
>>       refcount_set(&cb->refcnt, 1);
>>       irq_work_queue(&cb->worker);
>>
> [...]


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2026-01-15 18:52 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-07 17:49 [PATCH RFC v3 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
2026-01-07 17:49 ` [PATCH RFC v3 01/10] bpf: Refactor __bpf_async_set_callback() Mykyta Yatsenko
2026-01-07 18:22   ` bot+bpf-ci
2026-01-09 22:18     ` Andrii Nakryiko
2026-01-13 13:58       ` Mykyta Yatsenko
2026-01-13 16:14       ` Mykyta Yatsenko
2026-01-09 22:18   ` Andrii Nakryiko
2026-01-12  6:47   ` Kumar Kartikeya Dwivedi
2026-01-12  8:12     ` Kumar Kartikeya Dwivedi
2026-01-07 17:49 ` [PATCH RFC v3 02/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
2026-01-12  6:51   ` Kumar Kartikeya Dwivedi
2026-01-07 17:49 ` [PATCH RFC v3 03/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
2026-01-07 18:22   ` bot+bpf-ci
2026-01-09 22:19     ` Andrii Nakryiko
2026-01-12  7:29       ` Kumar Kartikeya Dwivedi
2026-01-07 17:49 ` [PATCH RFC v3 04/10] bpf: Add lock-free cell for NMI-safe async operations Mykyta Yatsenko
2026-01-07 18:08   ` bot+bpf-ci
2026-01-07 18:30   ` Kumar Kartikeya Dwivedi
2026-01-07 19:05     ` Mykyta Yatsenko
2026-01-09  1:18       ` Andrii Nakryiko
2026-01-09 18:22         ` Mykyta Yatsenko
2026-01-09 18:47           ` Andrii Nakryiko
2026-01-09 23:51             ` Alexei Starovoitov
2026-01-10  0:03               ` Andrii Nakryiko
2026-01-09 22:19   ` Andrii Nakryiko
2026-01-07 17:49 ` [PATCH RFC v3 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
2026-01-07 18:22   ` bot+bpf-ci
2026-01-09 22:19   ` Andrii Nakryiko
2026-01-14 14:53     ` Mykyta Yatsenko
2026-01-15 18:39       ` Andrii Nakryiko
2026-01-15 18:52         ` Mykyta Yatsenko
2026-01-12  8:10   ` Kumar Kartikeya Dwivedi
2026-01-07 17:49 ` [PATCH RFC v3 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
2026-01-09 22:19   ` Andrii Nakryiko
2026-01-07 17:49 ` [PATCH RFC v3 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
2026-01-09 22:19   ` Andrii Nakryiko
2026-01-07 17:49 ` [PATCH RFC v3 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
2026-01-07 17:49 ` [PATCH RFC v3 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
2026-01-07 17:49 ` [PATCH RFC v3 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
2026-01-09 22:19   ` Andrii Nakryiko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.