BPF List
 help / color / mirror / Atom feed
* [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq
@ 2026-01-20 15:59 Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
                   ` (10 more replies)
  0 siblings, 11 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

This series reworks the implementation of the BPF timer and workqueue
APIs. The goal is to make both timers and workqueues non-blocking,
enabling their use in NMI context.
Today this code relies on a bpf_spin_lock embedded in the map element to
serialize:
 * init of the async object,
 * setting/changing the callback and bpf_prog,
 * starting/cancelling the timer/work,
 * tearing down when the map element is deleted or the map's user ref is
   dropped.

The basic design approach in this series (a minimal sketch of the
pattern follows this list):
 * Use irq_work to offload all blocking work out of NMI context
 * Introduce a refcount to guarantee the lifetime of bpf_async_cb structs
   whose work is deferred to potentially multiple irq_work callbacks
 * Keep objects under RCU protection to make sure they are not freed
   while kfuncs/helpers access them (we can't use the refcnt for this, as
   the refcnt itself is part of the bpf_async_cb struct)
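
As a quick orientation, the sketch below shows the shape of that
deferral pattern. It is a minimal, illustrative sketch only (names and
structures are not the ones used in the series; the real implementation
is in patch 5):

  /* Sketch: defer NMI-unsafe work via irq_work, pin the object with a
   * refcount, and free it only after an RCU grace period.
   */
  struct async_obj {
      struct irq_work worker;  /* queued from NMI, runs later in IRQ context */
      refcount_t refcnt;       /* keeps the object alive for pending irq_work */
      struct rcu_head rcu;
  };

  static void async_free_rcu(struct rcu_head *rcu)
  {
      kfree(container_of(rcu, struct async_obj, rcu));
  }

  static void async_put(struct async_obj *obj)
  {
      if (refcount_dec_and_test(&obj->refcnt))
          call_rcu(&obj->rcu, async_free_rcu); /* readers may still be in an RCU section */
  }

  static void async_irq_worker(struct irq_work *work)
  {
      struct async_obj *obj = container_of(work, struct async_obj, worker);

      /* NMI-unsafe work (hrtimer_start(), schedule_work(), ...) goes here */
      async_put(obj); /* drop the reference taken in async_schedule() */
  }

  /* NMI-safe entry point: take a reference, then defer the real work. */
  static int async_schedule(struct async_obj *obj)
  {
      if (!refcount_inc_not_zero(&obj->refcnt))
          return -ENOENT;
      if (!irq_work_queue(&obj->worker))   /* already queued: drop our reference */
          async_put(obj);
      return 0;
  }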

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
Changes in v6:
- Reworked destruction and refcnt use:
  - On cancel_and_free(), set last_seq to the BPF_ASYNC_DESTROY value and
    drop the map's reference
  - In the irq_work callback, atomically switch DESTROY to DESTROYED and
    cancel the timer/wq
  - Free bpf_async_cb when the refcnt drops to 0.
- Link to v5: https://lore.kernel.org/r/20260115-timer_nolock-v5-0-15e3aef2703d@meta.com

Changes in v5:
- Extracted the lock-free algorithm for updating cb->prog and
cb->callback_fn into a new function, bpf_async_update_prog_callback();
a new commit introduces this function and uses it in
__bpf_async_set_callback(), bpf_timer_cancel() and
bpf_async_cancel_and_free().
This allows moving the change into a separate commit without breaking
correctness.
- Handle NULL prog in bpf_async_update_prog_callback().
- Link to v4: https://lore.kernel.org/r/20260114-timer_nolock-v4-0-fa6355f51fa7@meta.com

Changes in v4:
- Handle irq_work_queue() failures in both the schedule and
cancel_and_free paths: introduced bpf_async_refcnt_dec_cleanup(), which
decrements the refcnt and makes sure that, when the last reference is
put, at least one irq_work is scheduled to execute the final cleanup.
- Added an extra refcnt inc/dec in set_callback() plus an RCU lock to
make sure cleanup is not running at the same time as set_callback().
- Added READ_ONCE where it was needed.
- Squashed the 'bpf: Refactor __bpf_async_set_callback()' commit into
'bpf: Add lock-free cell for NMI-safe async operations'.
- Removed mpmc_cell, using seqcount_latch_t instead.
- Link to v3: https://lore.kernel.org/r/20260107-timer_nolock-v3-0-740d3ec3e5f9@meta.com

Changes in v3:
- Major rework
- Introduced mpmc_cell, allowing concurrent writes and reads
- Implemented irq_work deferring
- Added selftests
- Introduced the bpf_timer_cancel_async kfunc
- Link to v2: https://lore.kernel.org/r/20251105-timer_nolock-v2-0-32698db08bfa@meta.com

Changes in v2:
- Moved refcnt initialization and put (from cancel_and_free())
from patch 5 into patch 4, so that patch 4 has a clearer and more
complete implementation and use of the refcnt
- Link to v1: https://lore.kernel.org/r/20251031-timer_nolock-v1-0-b064ae403bfb@meta.com

---
Mykyta Yatsenko (10):
      bpf: Factor out timer deletion helper
      bpf: Remove unnecessary arguments from bpf_async_set_callback()
      bpf: Introduce lock-free bpf_async_update_prog_callback()
      bpf: Simplify bpf_timer_cancel()
      bpf: Enable bpf timer and workqueue use in NMI
      bpf: Add verifier support for bpf_timer argument in kfuncs
      bpf: Introduce bpf_timer_cancel_async() kfunc
      selftests/bpf: Refactor timer selftests
      selftests/bpf: Add stress test for timer async cancel
      selftests/bpf: Verify bpf_timer_cancel_async works

 kernel/bpf/helpers.c                           | 526 ++++++++++++++-----------
 kernel/bpf/verifier.c                          |  59 ++-
 tools/testing/selftests/bpf/prog_tests/timer.c |  92 ++++-
 tools/testing/selftests/bpf/progs/timer.c      |  37 +-
 4 files changed, 464 insertions(+), 250 deletions(-)
---
base-commit: efad162f5a840ae178e7761c176c49f433c7bb68
change-id: 20251028-timer_nolock-457f5b9daace

Best regards,
-- 
Mykyta Yatsenko <yatsenko@meta.com>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback() Mykyta Yatsenko
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Move the timer deletion logic into a dedicated bpf_timer_delete()
helper so it can be reused by later patches.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9eaa4185e0a79b903c6fc2ccb310f521a4b14a1d..cbacddc7101a82b2f72278034bba4188829fecd6 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1558,18 +1558,10 @@ static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *a
 	return cb;
 }
 
-/* This function is called by map_delete/update_elem for individual element and
- * by ops->map_release_uref when the user space reference to a map reaches zero.
- */
-void bpf_timer_cancel_and_free(void *val)
+static void bpf_timer_delete(struct bpf_hrtimer *t)
 {
-	struct bpf_hrtimer *t;
-
-	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
-
-	if (!t)
-		return;
-	/* We check that bpf_map_delete/update_elem() was called from timer
+	/*
+	 * We check that bpf_map_delete/update_elem() was called from timer
 	 * callback_fn. In such case we don't call hrtimer_cancel() (since it
 	 * will deadlock) and don't call hrtimer_try_to_cancel() (since it will
 	 * just return -1). Though callback_fn is still running on this cpu it's
@@ -1618,6 +1610,21 @@ void bpf_timer_cancel_and_free(void *val)
 	}
 }
 
+/*
+ * This function is called by map_delete/update_elem for individual element and
+ * by ops->map_release_uref when the user space reference to a map reaches zero.
+ */
+void bpf_timer_cancel_and_free(void *val)
+{
+	struct bpf_hrtimer *t;
+
+	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
+	if (!t)
+		return;
+
+	bpf_timer_delete(t);
+}
+
 /* This function is called by map_delete/update_elem for individual element and
  * by ops->map_release_uref when the user space reference to a map reaches zero.
  */

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback()
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 03/10] bpf: Introduce lock-free bpf_async_update_prog_callback() Mykyta Yatsenko
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Remove unused arguments from __bpf_async_set_callback().

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index cbacddc7101a82b2f72278034bba4188829fecd6..962b7f1b81b05d663b79218d9d7eaa73679ce94f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1355,10 +1355,9 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
 };
 
 static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
-				    struct bpf_prog_aux *aux, unsigned int flags,
-				    enum bpf_async_type type)
+				    struct bpf_prog *prog)
 {
-	struct bpf_prog *prev, *prog = aux->prog;
+	struct bpf_prog *prev;
 	struct bpf_async_cb *cb;
 	int ret = 0;
 
@@ -1403,7 +1402,7 @@ static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback
 BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
 	   struct bpf_prog_aux *, aux)
 {
-	return __bpf_async_set_callback(timer, callback_fn, aux, 0, BPF_ASYNC_TYPE_TIMER);
+	return __bpf_async_set_callback(timer, callback_fn, aux->prog);
 }
 
 static const struct bpf_func_proto bpf_timer_set_callback_proto = {
@@ -3138,7 +3137,7 @@ __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,
 	if (flags)
 		return -EINVAL;
 
-	return __bpf_async_set_callback(async, callback_fn, aux, flags, BPF_ASYNC_TYPE_WQ);
+	return __bpf_async_set_callback(async, callback_fn, aux->prog);
 }
 
 __bpf_kfunc void bpf_preempt_disable(void)

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 03/10] bpf: Introduce lock-free bpf_async_update_prog_callback()
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback() Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 04/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Introduce bpf_async_update_prog_callback(): a lock-free update of
cb->prog and cb->callback_fn. This function allows updating the prog and
callback_fn fields of struct bpf_async_cb without holding a lock.
For now it is used under the lock from __bpf_async_set_callback(); that
lock will be removed in the next patches.

Lock-free algorithm:
 * Acquire a guard reference on prog to prevent it from being freed
   during the retry loop.
 * Retry loop:
    1. Each iteration acquires a new prog reference and stores it
       in cb->prog via xchg. The previous prog is released.
    2. The loop condition checks if both cb->prog and cb->callback_fn
       match what we just wrote. If either differs, a concurrent writer
       overwrote our value, and we must retry.
    3. When we retry, our previously-stored prog was already released by
       the concurrent writer or will be released by us after
       overwriting.
 * Release guard reference.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 67 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 37 insertions(+), 30 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 962b7f1b81b05d663b79218d9d7eaa73679ce94f..66424bc5b86137599990957ad2300110b4977df9 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1354,10 +1354,43 @@ static const struct bpf_func_proto bpf_timer_init_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+static int bpf_async_update_prog_callback(struct bpf_async_cb *cb, void *callback_fn,
+					  struct bpf_prog *prog)
+{
+	struct bpf_prog *prev;
+
+	/* Acquire a guard reference on prog to prevent it from being freed during the loop */
+	if (prog) {
+		prog = bpf_prog_inc_not_zero(prog);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	do {
+		if (prog)
+			prog = bpf_prog_inc_not_zero(prog);
+		prev = xchg(&cb->prog, prog);
+		rcu_assign_pointer(cb->callback_fn, callback_fn);
+
+		/*
+		 * Release previous prog, make sure that if other CPU is contending,
+		 * to set bpf_prog, references are not leaked as each iteration acquires and
+		 * releases one reference.
+		 */
+		if (prev)
+			bpf_prog_put(prev);
+
+	} while (READ_ONCE(cb->prog) != prog || READ_ONCE(cb->callback_fn) != callback_fn);
+
+	if (prog)
+		bpf_prog_put(prog);
+
+	return 0;
+}
+
 static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
 				    struct bpf_prog *prog)
 {
-	struct bpf_prog *prev;
 	struct bpf_async_cb *cb;
 	int ret = 0;
 
@@ -1378,22 +1411,7 @@ static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback
 		ret = -EPERM;
 		goto out;
 	}
-	prev = cb->prog;
-	if (prev != prog) {
-		/* Bump prog refcnt once. Every bpf_timer_set_callback()
-		 * can pick different callback_fn-s within the same prog.
-		 */
-		prog = bpf_prog_inc_not_zero(prog);
-		if (IS_ERR(prog)) {
-			ret = PTR_ERR(prog);
-			goto out;
-		}
-		if (prev)
-			/* Drop prev prog refcnt when swapping with new prog */
-			bpf_prog_put(prev);
-		cb->prog = prog;
-	}
-	rcu_assign_pointer(cb->callback_fn, callback_fn);
+	ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
 out:
 	__bpf_spin_unlock_irqrestore(&async->lock);
 	return ret;
@@ -1453,17 +1471,6 @@ static const struct bpf_func_proto bpf_timer_start_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-static void drop_prog_refcnt(struct bpf_async_cb *async)
-{
-	struct bpf_prog *prog = async->prog;
-
-	if (prog) {
-		bpf_prog_put(prog);
-		async->prog = NULL;
-		rcu_assign_pointer(async->callback_fn, NULL);
-	}
-}
-
 BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 {
 	struct bpf_hrtimer *t, *cur_t;
@@ -1514,7 +1521,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 		goto out;
 	}
 drop:
-	drop_prog_refcnt(&t->cb);
+	bpf_async_update_prog_callback(&t->cb, NULL, NULL);
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	/* Cancel the timer and wait for associated callback to finish
@@ -1547,7 +1554,7 @@ static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *a
 	cb = async->cb;
 	if (!cb)
 		goto out;
-	drop_prog_refcnt(cb);
+	bpf_async_update_prog_callback(cb, NULL, NULL);
 	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
 	 * this timer, since it won't be initialized.
 	 */

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 04/10] bpf: Simplify bpf_timer_cancel()
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (2 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 03/10] bpf: Introduce lock-free bpf_async_update_prog_callback() Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Remove the lock from the bpf_timer_cancel() helper. The lock does not
protect against concurrent modification of the bpf_async_cb data fields,
as those are modified in the callback without locking.

Use guard(rcu)() instead of a pair of explicit lock()/unlock() calls.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 66424bc5b86137599990957ad2300110b4977df9..61ba4f6b741cc05b4a7a73a0322a23874bfd8e83 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1471,7 +1471,7 @@ static const struct bpf_func_proto bpf_timer_start_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
+BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, async)
 {
 	struct bpf_hrtimer *t, *cur_t;
 	bool inc = false;
@@ -1479,13 +1479,12 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 
 	if (in_nmi())
 		return -EOPNOTSUPP;
-	rcu_read_lock();
-	__bpf_spin_lock_irqsave(&timer->lock);
-	t = timer->timer;
-	if (!t) {
-		ret = -EINVAL;
-		goto out;
-	}
+
+	guard(rcu)();
+
+	t = READ_ONCE(async->timer);
+	if (!t)
+		return -EINVAL;
 
 	cur_t = this_cpu_read(hrtimer_running);
 	if (cur_t == t) {
@@ -1493,8 +1492,7 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 		 * its own timer the hrtimer_cancel() will deadlock
 		 * since it waits for callback_fn to finish.
 		 */
-		ret = -EDEADLK;
-		goto out;
+		return -EDEADLK;
 	}
 
 	/* Only account in-flight cancellations when invoked from a timer
@@ -1517,20 +1515,17 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, timer)
 		 * cancelling and waiting for it synchronously, since it might
 		 * do the same. Bail!
 		 */
-		ret = -EDEADLK;
-		goto out;
+		atomic_dec(&t->cancelling);
+		return -EDEADLK;
 	}
 drop:
 	bpf_async_update_prog_callback(&t->cb, NULL, NULL);
-out:
-	__bpf_spin_unlock_irqrestore(&timer->lock);
 	/* Cancel the timer and wait for associated callback to finish
 	 * if it was running.
 	 */
-	ret = ret ?: hrtimer_cancel(&t->timer);
+	ret = hrtimer_cancel(&t->timer);
 	if (inc)
 		atomic_dec(&t->cancelling);
-	rcu_read_unlock();
 	return ret;
 }
 

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (3 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 04/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 18:31   ` Andrii Nakryiko
  2026-01-20 15:59 ` [PATCH bpf-next v6 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Refactor bpf timer and workqueue helpers to allow calling them from NMI
context by making all operations lock-free and deferring NMI-unsafe
work to irq_work.

Previously, bpf_timer_start() and bpf_wq_start() could not be called
from NMI context because they acquired a bpf_spin_lock and called
hrtimer/schedule_work APIs directly. This patch removes these
limitations.

Key changes:
 * Remove bpf_spin_lock from struct bpf_async_kern.
 * Initialize/Destroy via setting/unsetting bpf_async_cb pointer
   atomically.
 * Add per-bpf_async_cb irq_work to defer NMI-unsafe
   operations (hrtimer_start, hrtimer_try_to_cancel, schedule_work) from
   NMI to softirq context.
 * Use the lock-free seqcount_latch_t to pass operation commands
   (start/cancel/free) and parameters from NMI-safe callers to the
   irq_work handler (see the sketch at the end of this description).
 * Add reference counting to bpf_async_cb to ensure the object stays
   alive until all scheduled irq_work completes.
 * Move bpf_prog_put() to RCU callback to handle races between
   set_callback() and cancel_and_free().
 * Modify cancel_and_free() path:
   * Detach bpf_async_cb.
   * Signal destruction to irq_work side via setting last_seq to
     BPF_ASYNC_DESTROY.
   * On receiving BPF_ASYNC_DESTROY, cancel timer/wq.
 * Free bpf_async_cb on refcnt reaching 0, waiting for both the RCU and
   RCU Tasks Trace grace periods before freeing it. Removed unnecessary
   RCU locks, as kfuncs/helpers always assume RCU or RCU Tasks Trace
   protection.

This enables BPF programs attached to NMI-context hooks (perf
events) to use timers and workqueues for deferred processing.
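
To make the hand-off easier to review, here is a distilled,
non-authoritative sketch of the latch-based command channel (the names
follow this patch; writer serialization via cb->writer, error handling
and the DESTROY/DESTROYED terminal states are intentionally omitted):

  /* Writer side (any context, including NMI): publish the latest command. */
  static void cmd_publish(struct bpf_async_cb *cb, u64 nsec, u32 mode,
                          enum bpf_async_op op)
  {
      write_seqcount_latch_begin(&cb->latch);
      cb->cmd[0] = (struct bpf_async_cmd){ .nsec = nsec, .mode = mode, .op = op };
      write_seqcount_latch(&cb->latch);
      cb->cmd[1] = (struct bpf_async_cmd){ .nsec = nsec, .mode = mode, .op = op };
      write_seqcount_latch_end(&cb->latch);
  }

  /* Reader side (irq_work): snapshot the command and claim it exclusively. */
  static bool cmd_consume(struct bpf_async_cb *cb, struct bpf_async_cmd *out)
  {
      s64 last = atomic64_read_acquire(&cb->last_seq);
      u32 seq = raw_read_seqcount_latch(&cb->latch);

      if (seq == last)
          return false;   /* this snapshot was already consumed */
      *out = cb->cmd[seq & 1];
      if (raw_read_seqcount_latch_retry(&cb->latch, seq))
          return false;   /* torn read, the caller retries */
      /* Commit the sequence number so no other irq_work replays this command. */
      return atomic64_try_cmpxchg_release(&cb->last_seq, &last, seq);
  }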

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 423 +++++++++++++++++++++++++++++----------------------
 1 file changed, 240 insertions(+), 183 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 61ba4f6b741cc05b4a7a73a0322a23874bfd8e83..297723d3f146a6e2f2e3e2dbf249506ae35bf3a2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -29,6 +29,7 @@
 #include <linux/task_work.h>
 #include <linux/irq_work.h>
 #include <linux/buildid.h>
+#include <linux/seqlock.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -1095,16 +1096,42 @@ static void *map_key_from_value(struct bpf_map *map, void *value, u32 *arr_idx)
 	return (void *)value - round_up(map->key_size, 8);
 }
 
+enum bpf_async_type {
+	BPF_ASYNC_TYPE_TIMER = 0,
+	BPF_ASYNC_TYPE_WQ,
+};
+
+enum bpf_async_op {
+	BPF_ASYNC_START,
+	BPF_ASYNC_CANCEL,
+	BPF_ASYNC_CANCEL_AND_FREE,
+};
+
+enum bpf_async_seq_state {
+	BPF_ASYNC_DESTROY = (u64)U32_MAX + 1,
+	BPF_ASYNC_DESTROYED = (u64)U32_MAX + 2,
+};
+
+struct bpf_async_cmd {
+	u64 nsec;
+	u32 mode;
+	enum bpf_async_op op;
+};
+
 struct bpf_async_cb {
 	struct bpf_map *map;
 	struct bpf_prog *prog;
 	void __rcu *callback_fn;
 	void *value;
-	union {
-		struct rcu_head rcu;
-		struct work_struct delete_work;
-	};
+	struct rcu_head rcu;
 	u64 flags;
+	struct irq_work worker;
+	atomic_t writer;
+	seqcount_latch_t latch;
+	struct bpf_async_cmd cmd[2];
+	atomic64_t last_seq;
+	refcount_t refcnt;
+	enum bpf_async_type type;
 };
 
 /* BPF map elements can contain 'struct bpf_timer'.
@@ -1132,7 +1159,6 @@ struct bpf_hrtimer {
 struct bpf_work {
 	struct bpf_async_cb cb;
 	struct work_struct work;
-	struct work_struct delete_work;
 };
 
 /* the actual struct hidden inside uapi struct bpf_timer and bpf_wq */
@@ -1142,18 +1168,9 @@ struct bpf_async_kern {
 		struct bpf_hrtimer *timer;
 		struct bpf_work *work;
 	};
-	/* bpf_spin_lock is used here instead of spinlock_t to make
-	 * sure that it always fits into space reserved by struct bpf_timer
-	 * regardless of LOCKDEP and spinlock debug flags.
-	 */
-	struct bpf_spin_lock lock;
+	u32 __pad; /* Left for binary compatibility, previously stored spinlock */
 } __attribute__((aligned(8)));
 
-enum bpf_async_type {
-	BPF_ASYNC_TYPE_TIMER = 0,
-	BPF_ASYNC_TYPE_WQ,
-};
-
 static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
 
 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
@@ -1219,45 +1236,53 @@ static void bpf_async_cb_rcu_free(struct rcu_head *rcu)
 {
 	struct bpf_async_cb *cb = container_of(rcu, struct bpf_async_cb, rcu);
 
+	/*
+	 * Drop the last reference to prog only after RCU GP, as set_callback()
+	 * may race with cancel_and_free()
+	 */
+	if (cb->prog)
+		bpf_prog_put(cb->prog);
+
 	kfree_nolock(cb);
 }
 
-static void bpf_wq_delete_work(struct work_struct *work)
+/* Callback from call_rcu_tasks_trace, chains to call_rcu for final free */
+static void bpf_async_cb_rcu_tasks_trace_free(struct rcu_head *rcu)
 {
-	struct bpf_work *w = container_of(work, struct bpf_work, delete_work);
-
-	cancel_work_sync(&w->work);
-
-	call_rcu(&w->cb.rcu, bpf_async_cb_rcu_free);
+	/*
+	 * If RCU Tasks Trace grace period implies RCU grace period,
+	 * there is no need to invoke call_rcu().
+	 */
+	if (rcu_trace_implies_rcu_gp())
+		bpf_async_cb_rcu_free(rcu);
+	else
+		call_rcu(rcu, bpf_async_cb_rcu_free);
 }
 
-static void bpf_timer_delete_work(struct work_struct *work)
+/*
+ * Decrement refcount and if it reaches zero, schedule deferred cleanup
+ * through call_rcu_tasks_trace() -> call_rcu() -> bpf_prog_put()/kfree()
+ */
+static void bpf_async_refcount_put(struct bpf_async_cb *cb)
 {
-	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, cb.delete_work);
+	if (!refcount_dec_and_test(&cb->refcnt))
+		return;
 
-	/* Cancel the timer and wait for callback to complete if it was running.
-	 * If hrtimer_cancel() can be safely called it's safe to call
-	 * call_rcu() right after for both preallocated and non-preallocated
-	 * maps.  The async->cb = NULL was already done and no code path can see
-	 * address 't' anymore. Timer if armed for existing bpf_hrtimer before
-	 * bpf_timer_cancel_and_free will have been cancelled.
-	 */
-	hrtimer_cancel(&t->timer);
-	call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
+	/* Both timer and wq callbacks run under RCU lock, UAF should not be possible there */
+	call_rcu_tasks_trace(&cb->rcu, bpf_async_cb_rcu_tasks_trace_free);
 }
 
+static void __bpf_async_cancel_and_free(struct bpf_async_kern *async);
+static void bpf_async_irq_worker(struct irq_work *work);
+
 static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
 			    enum bpf_async_type type)
 {
-	struct bpf_async_cb *cb;
+	struct bpf_async_cb *cb, *old_cb;
 	struct bpf_hrtimer *t;
 	struct bpf_work *w;
 	clockid_t clockid;
 	size_t size;
-	int ret = 0;
-
-	if (in_nmi())
-		return -EOPNOTSUPP;
 
 	switch (type) {
 	case BPF_ASYNC_TYPE_TIMER:
@@ -1270,18 +1295,13 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		return -EINVAL;
 	}
 
-	__bpf_spin_lock_irqsave(&async->lock);
-	t = async->timer;
-	if (t) {
-		ret = -EBUSY;
-		goto out;
-	}
+	old_cb = READ_ONCE(async->cb);
+	if (old_cb)
+		return -EBUSY;
 
 	cb = bpf_map_kmalloc_nolock(map, size, 0, map->numa_node);
-	if (!cb) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!cb)
+		return -ENOMEM;
 
 	switch (type) {
 	case BPF_ASYNC_TYPE_TIMER:
@@ -1289,7 +1309,6 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		t = (struct bpf_hrtimer *)cb;
 
 		atomic_set(&t->cancelling, 0);
-		INIT_WORK(&t->cb.delete_work, bpf_timer_delete_work);
 		hrtimer_setup(&t->timer, bpf_timer_cb, clockid, HRTIMER_MODE_REL_SOFT);
 		cb->value = (void *)async - map->record->timer_off;
 		break;
@@ -1297,16 +1316,26 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		w = (struct bpf_work *)cb;
 
 		INIT_WORK(&w->work, bpf_wq_work);
-		INIT_WORK(&w->delete_work, bpf_wq_delete_work);
 		cb->value = (void *)async - map->record->wq_off;
 		break;
 	}
 	cb->map = map;
 	cb->prog = NULL;
 	cb->flags = flags;
+	cb->worker = IRQ_WORK_INIT(bpf_async_irq_worker);
+	seqcount_latch_init(&cb->latch);
+	atomic_set(&cb->writer, 0);
+	refcount_set(&cb->refcnt, 1); /* map's reference */
+	atomic64_set(&cb->last_seq, 0);
+	cb->type = type;
 	rcu_assign_pointer(cb->callback_fn, NULL);
 
-	WRITE_ONCE(async->cb, cb);
+	old_cb = cmpxchg(&async->cb, NULL, cb);
+	if (old_cb) {
+		/* Lost the race to initialize this bpf_async_kern, drop the allocated object */
+		kfree_nolock(cb);
+		return -EBUSY;
+	}
 	/* Guarantee the order between async->cb and map->usercnt. So
 	 * when there are concurrent uref release and bpf timer init, either
 	 * bpf_timer_cancel_and_free() called by uref release reads a no-NULL
@@ -1317,13 +1346,11 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		/* maps with timers must be either held by user space
 		 * or pinned in bpffs.
 		 */
-		WRITE_ONCE(async->cb, NULL);
-		kfree_nolock(cb);
-		ret = -EPERM;
+		__bpf_async_cancel_and_free(async);
+		return -EPERM;
 	}
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return ret;
+
+	return 0;
 }
 
 BPF_CALL_3(bpf_timer_init, struct bpf_async_kern *, timer, struct bpf_map *, map,
@@ -1388,33 +1415,97 @@ static int bpf_async_update_prog_callback(struct bpf_async_cb *cb, void *callbac
 	return 0;
 }
 
+static int bpf_async_schedule_op(struct bpf_async_cb *cb, enum bpf_async_op op,
+				 u64 nsec, u32 timer_mode)
+{
+	/* Acquire active writer */
+	if (atomic_cmpxchg_acquire(&cb->writer, 0, 1))
+		return -EBUSY;
+
+	write_seqcount_latch_begin(&cb->latch);
+	cb->cmd[0].nsec = nsec;
+	cb->cmd[0].mode = timer_mode;
+	cb->cmd[0].op = op;
+	write_seqcount_latch(&cb->latch);
+	cb->cmd[1].nsec = nsec;
+	cb->cmd[1].mode = timer_mode;
+	cb->cmd[1].op = op;
+	write_seqcount_latch_end(&cb->latch);
+
+	atomic_set_release(&cb->writer, 0);
+
+	if (!refcount_inc_not_zero(&cb->refcnt))
+		return -EBUSY;
+
+	if (!in_nmi()) {
+		bpf_async_irq_worker(&cb->worker);
+		return 0;
+	}
+
+	if (!irq_work_queue(&cb->worker))
+		bpf_async_refcount_put(cb);
+
+	return 0;
+}
+
+static int bpf_async_handle_terminal_seq(struct bpf_async_cb *cb, s64 last_seq,
+					 enum bpf_async_op *op)
+{
+	s64 expected = BPF_ASYNC_DESTROY;
+
+	if (last_seq != BPF_ASYNC_DESTROY)
+		return -EBUSY;
+
+	if (!atomic64_try_cmpxchg_release(&cb->last_seq, &expected, BPF_ASYNC_DESTROYED))
+		return -EBUSY; /* Someone else set it to DESTROYED, bail */
+
+	*op = BPF_ASYNC_CANCEL;
+	return 0;
+}
+
+static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
+			     u64 *nsec, u32 *flags)
+{
+	u32 seq, idx;
+	s64 last_seq;
+
+	while (true) {
+		last_seq = atomic64_read_acquire(&cb->last_seq);
+		if (last_seq > U32_MAX) /* Check if terminal seq num has been set */
+			return bpf_async_handle_terminal_seq(cb, last_seq, op);
+
+		seq = raw_read_seqcount_latch(&cb->latch);
+
+		/* Return -EBUSY if current seq is consumed by another reader */
+		if (seq == last_seq)
+			return -EBUSY;
+
+		idx = seq & 1;
+		*nsec = cb->cmd[idx].nsec;
+		*flags = cb->cmd[idx].mode;
+		*op = cb->cmd[idx].op;
+
+		if (raw_read_seqcount_latch_retry(&cb->latch, seq))
+			continue;
+
+		/* Commit read sequence number, own snapshot exclusively */
+		if (atomic64_try_cmpxchg_release(&cb->last_seq, &last_seq, seq))
+			break;
+	}
+
+	return 0;
+}
+
 static int __bpf_async_set_callback(struct bpf_async_kern *async, void *callback_fn,
 				    struct bpf_prog *prog)
 {
 	struct bpf_async_cb *cb;
-	int ret = 0;
 
-	if (in_nmi())
-		return -EOPNOTSUPP;
-	__bpf_spin_lock_irqsave(&async->lock);
-	cb = async->cb;
-	if (!cb) {
-		ret = -EINVAL;
-		goto out;
-	}
-	if (!atomic64_read(&cb->map->usercnt)) {
-		/* maps with timers must be either held by user space
-		 * or pinned in bpffs. Otherwise timer might still be
-		 * running even when bpf prog is detached and user space
-		 * is gone, since map_release_uref won't ever be called.
-		 */
-		ret = -EPERM;
-		goto out;
-	}
-	ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return ret;
+	cb = READ_ONCE(async->cb);
+	if (!cb)
+		return -EINVAL;
+
+	return bpf_async_update_prog_callback(cb, callback_fn, prog);
 }
 
 BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
@@ -1431,22 +1522,17 @@ static const struct bpf_func_proto bpf_timer_set_callback_proto = {
 	.arg2_type	= ARG_PTR_TO_FUNC,
 };
 
-BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, flags)
+BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, async, u64, nsecs, u64, flags)
 {
 	struct bpf_hrtimer *t;
-	int ret = 0;
-	enum hrtimer_mode mode;
+	u32 mode;
 
-	if (in_nmi())
-		return -EOPNOTSUPP;
 	if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
 		return -EINVAL;
-	__bpf_spin_lock_irqsave(&timer->lock);
-	t = timer->timer;
-	if (!t || !t->cb.prog) {
-		ret = -EINVAL;
-		goto out;
-	}
+
+	t = READ_ONCE(async->timer);
+	if (!t || !READ_ONCE(t->cb.prog))
+		return -EINVAL;
 
 	if (flags & BPF_F_TIMER_ABS)
 		mode = HRTIMER_MODE_ABS_SOFT;
@@ -1456,10 +1542,7 @@ BPF_CALL_3(bpf_timer_start, struct bpf_async_kern *, timer, u64, nsecs, u64, fla
 	if (flags & BPF_F_TIMER_CPU_PIN)
 		mode |= HRTIMER_MODE_PINNED;
 
-	hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
-out:
-	__bpf_spin_unlock_irqrestore(&timer->lock);
-	return ret;
+	return bpf_async_schedule_op(&t->cb, BPF_ASYNC_START, nsecs, mode);
 }
 
 static const struct bpf_func_proto bpf_timer_start_proto = {
@@ -1480,8 +1563,6 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_async_kern *, async)
 	if (in_nmi())
 		return -EOPNOTSUPP;
 
-	guard(rcu)();
-
 	t = READ_ONCE(async->timer);
 	if (!t)
 		return -EINVAL;
@@ -1536,79 +1617,75 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
 	.arg1_type	= ARG_PTR_TO_TIMER,
 };
 
-static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
+static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
 {
 	struct bpf_async_cb *cb;
 
-	/* Performance optimization: read async->cb without lock first. */
-	if (!READ_ONCE(async->cb))
-		return NULL;
-
-	__bpf_spin_lock_irqsave(&async->lock);
-	/* re-read it under lock */
-	cb = async->cb;
+	cb = xchg(&async->cb, NULL);
 	if (!cb)
-		goto out;
-	bpf_async_update_prog_callback(cb, NULL, NULL);
-	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
-	 * this timer, since it won't be initialized.
-	 */
-	WRITE_ONCE(async->cb, NULL);
-out:
-	__bpf_spin_unlock_irqrestore(&async->lock);
-	return cb;
+		return;
+
+	atomic64_set(&cb->last_seq, BPF_ASYNC_DESTROY);
+	/* Pass map's reference to irq_work callback */
+	if (!irq_work_queue(&cb->worker))
+		bpf_async_refcount_put(cb);
 }
 
-static void bpf_timer_delete(struct bpf_hrtimer *t)
+static void bpf_async_process_op(struct bpf_async_cb *cb, u32 op,
+				 u64 timer_nsec, u32 timer_mode)
 {
-	/*
-	 * We check that bpf_map_delete/update_elem() was called from timer
-	 * callback_fn. In such case we don't call hrtimer_cancel() (since it
-	 * will deadlock) and don't call hrtimer_try_to_cancel() (since it will
-	 * just return -1). Though callback_fn is still running on this cpu it's
-	 * safe to do kfree(t) because bpf_timer_cb() read everything it needed
-	 * from 't'. The bpf subprog callback_fn won't be able to access 't',
-	 * since async->cb = NULL was already done. The timer will be
-	 * effectively cancelled because bpf_timer_cb() will return
-	 * HRTIMER_NORESTART.
-	 *
-	 * However, it is possible the timer callback_fn calling us armed the
-	 * timer _before_ calling us, such that failing to cancel it here will
-	 * cause it to possibly use struct hrtimer after freeing bpf_hrtimer.
-	 * Therefore, we _need_ to cancel any outstanding timers before we do
-	 * call_rcu, even though no more timers can be armed.
-	 *
-	 * Moreover, we need to schedule work even if timer does not belong to
-	 * the calling callback_fn, as on two different CPUs, we can end up in a
-	 * situation where both sides run in parallel, try to cancel one
-	 * another, and we end up waiting on both sides in hrtimer_cancel
-	 * without making forward progress, since timer1 depends on time2
-	 * callback to finish, and vice versa.
-	 *
-	 *  CPU 1 (timer1_cb)			CPU 2 (timer2_cb)
-	 *  bpf_timer_cancel_and_free(timer2)	bpf_timer_cancel_and_free(timer1)
-	 *
-	 * To avoid these issues, punt to workqueue context when we are in a
-	 * timer callback.
-	 */
-	if (this_cpu_read(hrtimer_running)) {
-		queue_work(system_dfl_wq, &t->cb.delete_work);
-		return;
-	}
+	switch (cb->type) {
+	case BPF_ASYNC_TYPE_TIMER: {
+		struct bpf_hrtimer *t = container_of(cb, struct bpf_hrtimer, cb);
 
-	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-		/* If the timer is running on other CPU, also use a kworker to
-		 * wait for the completion of the timer instead of trying to
-		 * acquire a sleepable lock in hrtimer_cancel() to wait for its
-		 * completion.
-		 */
-		if (hrtimer_try_to_cancel(&t->timer) >= 0)
-			call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
-		else
-			queue_work(system_dfl_wq, &t->cb.delete_work);
-	} else {
-		bpf_timer_delete_work(&t->cb.delete_work);
+		switch (op) {
+		case BPF_ASYNC_START:
+			hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
+			break;
+		case BPF_ASYNC_CANCEL:
+			hrtimer_try_to_cancel(&t->timer);
+			break;
+		default:
+			break;
+		}
+		break;
+	}
+	case BPF_ASYNC_TYPE_WQ: {
+		struct bpf_work *w = container_of(cb, struct bpf_work, cb);
+
+		switch (op) {
+		case BPF_ASYNC_START:
+			schedule_work(&w->work);
+			break;
+		case BPF_ASYNC_CANCEL:
+			/* Use non-blocking cancel, safe in irq_work context.
+			 * RCU grace period ensures callback completes before free.
+			 */
+			cancel_work(&w->work);
+			break;
+		default:
+			break;
+		}
+		break;
 	}
+	}
+}
+
+static void bpf_async_irq_worker(struct irq_work *work)
+{
+	struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
+	u32 op, timer_mode;
+	u64 nsec;
+	int err;
+
+	err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
+	if (err)
+		goto out;
+
+	bpf_async_process_op(cb, op, nsec, timer_mode);
+
+out:
+	bpf_async_refcount_put(cb);
 }
 
 /*
@@ -1617,13 +1694,7 @@ static void bpf_timer_delete(struct bpf_hrtimer *t)
  */
 void bpf_timer_cancel_and_free(void *val)
 {
-	struct bpf_hrtimer *t;
-
-	t = (struct bpf_hrtimer *)__bpf_async_cancel_and_free(val);
-	if (!t)
-		return;
-
-	bpf_timer_delete(t);
+	__bpf_async_cancel_and_free(val);
 }
 
 /* This function is called by map_delete/update_elem for individual element and
@@ -1631,19 +1702,7 @@ void bpf_timer_cancel_and_free(void *val)
  */
 void bpf_wq_cancel_and_free(void *val)
 {
-	struct bpf_work *work;
-
-	BTF_TYPE_EMIT(struct bpf_wq);
-
-	work = (struct bpf_work *)__bpf_async_cancel_and_free(val);
-	if (!work)
-		return;
-	/* Trigger cancel of the sleepable work, but *do not* wait for
-	 * it to finish if it was running as we might not be in a
-	 * sleepable context.
-	 * kfree will be called once the work has finished.
-	 */
-	schedule_work(&work->delete_work);
+	__bpf_async_cancel_and_free(val);
 }
 
 BPF_CALL_2(bpf_kptr_xchg, void *, dst, void *, ptr)
@@ -3116,16 +3175,14 @@ __bpf_kfunc int bpf_wq_start(struct bpf_wq *wq, unsigned int flags)
 	struct bpf_async_kern *async = (struct bpf_async_kern *)wq;
 	struct bpf_work *w;
 
-	if (in_nmi())
-		return -EOPNOTSUPP;
 	if (flags)
 		return -EINVAL;
+
 	w = READ_ONCE(async->work);
 	if (!w || !READ_ONCE(w->cb.prog))
 		return -EINVAL;
 
-	schedule_work(&w->work);
-	return 0;
+	return bpf_async_schedule_op(&w->cb, BPF_ASYNC_START, 0, 0);
 }
 
 __bpf_kfunc int bpf_wq_set_callback_impl(struct bpf_wq *wq,

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (4 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Extend the verifier to recognize struct bpf_timer as a valid kfunc
argument type. Previously, bpf_timer was only supported in BPF helpers.

This prepares for adding timer-related kfuncs in subsequent patches.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/verifier.c | 59 +++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9de0ec0c3ed998b0b135d85d64fe07ee0f8df6d5..ae189a6af1dbcbf2da30f70dd9b848c4f90bb5cf 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8566,17 +8566,15 @@ static int check_map_field_pointer(struct bpf_verifier_env *env, u32 regno,
 }
 
 static int process_timer_func(struct bpf_verifier_env *env, int regno,
-			      struct bpf_call_arg_meta *meta)
+			      struct bpf_map *map)
 {
-	struct bpf_reg_state *reg = reg_state(env, regno);
-	struct bpf_map *map = reg->map_ptr;
 	int err;
 
 	err = check_map_field_pointer(env, regno, BPF_TIMER);
 	if (err)
 		return err;
 
-	if (meta->map_ptr) {
+	if (map) {
 		verifier_bug(env, "Two map pointers in a timer helper");
 		return -EFAULT;
 	}
@@ -8584,8 +8582,36 @@ static int process_timer_func(struct bpf_verifier_env *env, int regno,
 		verbose(env, "bpf_timer cannot be used for PREEMPT_RT.\n");
 		return -EOPNOTSUPP;
 	}
+	return 0;
+}
+
+static int process_timer_helper(struct bpf_verifier_env *env, int regno,
+				struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno);
+	int err;
+
+	err = process_timer_func(env, regno, meta->map_ptr);
+	if (err)
+		return err;
+
 	meta->map_uid = reg->map_uid;
-	meta->map_ptr = map;
+	meta->map_ptr = reg->map_ptr;
+	return 0;
+}
+
+static int process_timer_kfunc(struct bpf_verifier_env *env, int regno,
+			       struct bpf_kfunc_call_arg_meta *meta)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno);
+	int err;
+
+	err = process_timer_func(env, regno, meta->map.ptr);
+	if (err)
+		return err;
+
+	meta->map.uid = reg->map_uid;
+	meta->map.ptr = reg->map_ptr;
 	return 0;
 }
 
@@ -9913,7 +9939,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 		}
 		break;
 	case ARG_PTR_TO_TIMER:
-		err = process_timer_func(env, regno, meta);
+		err = process_timer_helper(env, regno, meta);
 		if (err)
 			return err;
 		break;
@@ -12166,6 +12192,7 @@ enum {
 	KF_ARG_WORKQUEUE_ID,
 	KF_ARG_RES_SPIN_LOCK_ID,
 	KF_ARG_TASK_WORK_ID,
+	KF_ARG_TIMER_ID,
 };
 
 BTF_ID_LIST(kf_arg_btf_ids)
@@ -12177,6 +12204,7 @@ BTF_ID(struct, bpf_rb_node)
 BTF_ID(struct, bpf_wq)
 BTF_ID(struct, bpf_res_spin_lock)
 BTF_ID(struct, bpf_task_work)
+BTF_ID(struct, bpf_timer)
 
 static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
 				    const struct btf_param *arg, int type)
@@ -12220,6 +12248,11 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
 }
 
+static bool is_kfunc_arg_timer(const struct btf *btf, const struct btf_param *arg)
+{
+	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_TIMER_ID);
+}
+
 static bool is_kfunc_arg_wq(const struct btf *btf, const struct btf_param *arg)
 {
 	return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_WORKQUEUE_ID);
@@ -12314,6 +12347,7 @@ enum kfunc_ptr_arg_type {
 	KF_ARG_PTR_TO_NULL,
 	KF_ARG_PTR_TO_CONST_STR,
 	KF_ARG_PTR_TO_MAP,
+	KF_ARG_PTR_TO_TIMER,
 	KF_ARG_PTR_TO_WORKQUEUE,
 	KF_ARG_PTR_TO_IRQ_FLAG,
 	KF_ARG_PTR_TO_RES_SPIN_LOCK,
@@ -12559,6 +12593,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
 	if (is_kfunc_arg_wq(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_WORKQUEUE;
 
+	if (is_kfunc_arg_timer(meta->btf, &args[argno]))
+		return KF_ARG_PTR_TO_TIMER;
+
 	if (is_kfunc_arg_task_work(meta->btf, &args[argno]))
 		return KF_ARG_PTR_TO_TASK_WORK;
 
@@ -13345,6 +13382,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 		case KF_ARG_PTR_TO_REFCOUNTED_KPTR:
 		case KF_ARG_PTR_TO_CONST_STR:
 		case KF_ARG_PTR_TO_WORKQUEUE:
+		case KF_ARG_PTR_TO_TIMER:
 		case KF_ARG_PTR_TO_TASK_WORK:
 		case KF_ARG_PTR_TO_IRQ_FLAG:
 		case KF_ARG_PTR_TO_RES_SPIN_LOCK:
@@ -13644,6 +13682,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 			if (ret < 0)
 				return ret;
 			break;
+		case KF_ARG_PTR_TO_TIMER:
+			if (reg->type != PTR_TO_MAP_VALUE) {
+				verbose(env, "arg#%d doesn't point to a map value\n", i);
+				return -EINVAL;
+			}
+			ret = process_timer_kfunc(env, regno, meta);
+			if (ret < 0)
+				return ret;
+			break;
 		case KF_ARG_PTR_TO_TASK_WORK:
 			if (reg->type != PTR_TO_MAP_VALUE) {
 				verbose(env, "arg#%d doesn't point to a map value\n", i);

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (5 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Introduce a bpf_timer_cancel_async() kfunc that attempts to cancel a
timer asynchronously and therefore can be used in NMI context. A sketch
of the intended BPF-side usage follows.
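
For illustration only, a BPF-side caller could look like the sketch
below (a hypothetical perf_event program and map layout; the real tests
live in the selftest patches later in this series):

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  char _license[] SEC("license") = "GPL";

  struct elem {
      struct bpf_timer t;
  };

  struct {
      __uint(type, BPF_MAP_TYPE_ARRAY);
      __uint(max_entries, 1);
      __type(key, int);
      __type(value, struct elem);
  } timers SEC(".maps");

  extern int bpf_timer_cancel_async(struct bpf_timer *timer) __ksym;

  SEC("perf_event")
  int cancel_from_nmi(void *ctx)
  {
      int key = 0;
      struct elem *e = bpf_map_lookup_elem(&timers, &key);

      if (!e)
          return 0;
      /* In NMI context bpf_timer_cancel() still returns -EOPNOTSUPP;
       * the async variant only queues the cancellation and returns.
       */
      bpf_timer_cancel_async(&e->t);
      return 0;
  }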

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 kernel/bpf/helpers.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 297723d3f146a6e2f2e3e2dbf249506ae35bf3a2..2acab81599b6d9f4dc6c5ba636b9f9bd46be9d81 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -4425,6 +4425,18 @@ __bpf_kfunc int bpf_dynptr_file_discard(struct bpf_dynptr *dynptr)
 	return 0;
 }
 
+__bpf_kfunc int bpf_timer_cancel_async(struct bpf_timer *timer)
+{
+	struct bpf_async_cb *cb;
+	struct bpf_async_kern *async = (void *)timer;
+
+	cb = READ_ONCE(async->cb);
+	if (!cb)
+		return -EINVAL;
+
+	return bpf_async_schedule_op(cb, BPF_ASYNC_CANCEL, 0, 0);
+}
+
 __bpf_kfunc_end_defs();
 
 static void bpf_task_work_cancel_scheduled(struct irq_work *irq_work)
@@ -4606,6 +4618,7 @@ BTF_ID_FLAGS(func, bpf_task_work_schedule_signal_impl)
 BTF_ID_FLAGS(func, bpf_task_work_schedule_resume_impl)
 BTF_ID_FLAGS(func, bpf_dynptr_from_file)
 BTF_ID_FLAGS(func, bpf_dynptr_file_discard)
+BTF_ID_FLAGS(func, bpf_timer_cancel_async)
 BTF_KFUNCS_END(common_btf_ids)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 08/10] selftests/bpf: Refactor timer selftests
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (6 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Refactor timer selftests, extracting the stress test into a separate
test. This makes test failures easier to debug and the tests easier to
extend.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/timer.c | 55 +++++++++++++++++---------
 1 file changed, 36 insertions(+), 19 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index 34f9ccce260293755980bcd6fcece491964f7929..4d853d1bd2a71b3d0f1ba0daa7a699945b4457fe 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -22,13 +22,35 @@ static void *spin_lock_thread(void *arg)
 	pthread_exit(arg);
 }
 
-static int timer(struct timer *timer_skel)
+
+static int timer_stress(struct timer *timer_skel)
 {
-	int i, err, prog_fd;
+	int i, err = 1, prog_fd;
 	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	pthread_t thread_id[NUM_THR];
 	void *ret;
 
+	prog_fd = bpf_program__fd(timer_skel->progs.race);
+	for (i = 0; i < NUM_THR; i++) {
+		err = pthread_create(&thread_id[i], NULL,
+				     &spin_lock_thread, &prog_fd);
+		if (!ASSERT_OK(err, "pthread_create"))
+			break;
+	}
+
+	while (i) {
+		err = pthread_join(thread_id[--i], &ret);
+		if (ASSERT_OK(err, "pthread_join"))
+			ASSERT_EQ(ret, (void *)&prog_fd, "pthread_join");
+	}
+	return err;
+}
+
+static int timer(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
 	err = timer__attach(timer_skel);
 	if (!ASSERT_OK(err, "timer_attach"))
 		return err;
@@ -63,25 +85,10 @@ static int timer(struct timer *timer_skel)
 	/* check that code paths completed */
 	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
 
-	prog_fd = bpf_program__fd(timer_skel->progs.race);
-	for (i = 0; i < NUM_THR; i++) {
-		err = pthread_create(&thread_id[i], NULL,
-				     &spin_lock_thread, &prog_fd);
-		if (!ASSERT_OK(err, "pthread_create"))
-			break;
-	}
-
-	while (i) {
-		err = pthread_join(thread_id[--i], &ret);
-		if (ASSERT_OK(err, "pthread_join"))
-			ASSERT_EQ(ret, (void *)&prog_fd, "pthread_join");
-	}
-
 	return 0;
 }
 
-/* TODO: use pid filtering */
-void serial_test_timer(void)
+static void test_timer(int (*timer_test_fn)(struct timer *timer_skel))
 {
 	struct timer *timer_skel = NULL;
 	int err;
@@ -94,13 +101,23 @@ void serial_test_timer(void)
 	if (!ASSERT_OK_PTR(timer_skel, "timer_skel_load"))
 		return;
 
-	err = timer(timer_skel);
+	err = timer_test_fn(timer_skel);
 	ASSERT_OK(err, "timer");
 	timer__destroy(timer_skel);
+}
+
+void serial_test_timer(void)
+{
+	test_timer(timer);
 
 	RUN_TESTS(timer_failure);
 }
 
+void serial_test_timer_stress(void)
+{
+	test_timer(timer_stress);
+}
+
 void test_timer_interrupt(void)
 {
 	struct timer_interrupt *skel = NULL;

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 09/10] selftests/bpf: Add stress test for timer async cancel
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (7 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-20 15:59 ` [PATCH bpf-next v6 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
  2026-01-21  2:30 ` [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq patchwork-bot+netdevbpf
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Extend the BPF timer selftest to run a stress test for async cancel.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/timer.c | 18 +++++++++++++++++-
 tools/testing/selftests/bpf/progs/timer.c      | 14 +++++++++++---
 2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index 4d853d1bd2a71b3d0f1ba0daa7a699945b4457fe..a157a2a699e638c9f21712b1e7194fc4b6382e71 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -23,13 +23,14 @@ static void *spin_lock_thread(void *arg)
 }
 
 
-static int timer_stress(struct timer *timer_skel)
+static int timer_stress_runner(struct timer *timer_skel, bool async_cancel)
 {
 	int i, err = 1, prog_fd;
 	LIBBPF_OPTS(bpf_test_run_opts, topts);
 	pthread_t thread_id[NUM_THR];
 	void *ret;
 
+	timer_skel->bss->async_cancel = async_cancel;
 	prog_fd = bpf_program__fd(timer_skel->progs.race);
 	for (i = 0; i < NUM_THR; i++) {
 		err = pthread_create(&thread_id[i], NULL,
@@ -46,6 +47,16 @@ static int timer_stress(struct timer *timer_skel)
 	return err;
 }
 
+static int timer_stress(struct timer *timer_skel)
+{
+	return timer_stress_runner(timer_skel, false);
+}
+
+static int timer_stress_async_cancel(struct timer *timer_skel)
+{
+	return timer_stress_runner(timer_skel, true);
+}
+
 static int timer(struct timer *timer_skel)
 {
 	int err, prog_fd;
@@ -118,6 +129,11 @@ void serial_test_timer_stress(void)
 	test_timer(timer_stress);
 }
 
+void serial_test_timer_stress_async_cancel(void)
+{
+	test_timer(timer_stress_async_cancel);
+}
+
 void test_timer_interrupt(void)
 {
 	struct timer_interrupt *skel = NULL;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index 4c677c001258a4c05cd570ec52363d49d8eea169..a81413514e4b07ef745f27eade71454234e731e8 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -1,13 +1,17 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2021 Facebook */
-#include <linux/bpf.h>
-#include <time.h>
+
+#include <vmlinux.h>
 #include <stdbool.h>
 #include <errno.h>
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
 
+#define CLOCK_MONOTONIC 1
+#define CLOCK_BOOTTIME 7
+
 char _license[] SEC("license") = "GPL";
+
 struct hmap_elem {
 	int counter;
 	struct bpf_timer timer;
@@ -63,6 +67,7 @@ __u64 callback_check = 52;
 __u64 callback2_check = 52;
 __u64 pinned_callback_check;
 __s32 pinned_cpu;
+bool async_cancel = 0;
 
 #define ARRAY 1
 #define HTAB 2
@@ -419,7 +424,10 @@ int race(void *ctx)
 
 	bpf_timer_set_callback(timer, race_timer_callback);
 	bpf_timer_start(timer, 0, 0);
-	bpf_timer_cancel(timer);
+	if (async_cancel)
+		bpf_timer_cancel_async(timer);
+	else
+		bpf_timer_cancel(timer);
 
 	return 0;
 }

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH bpf-next v6 10/10] selftests/bpf: Verify bpf_timer_cancel_async works
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (8 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
@ 2026-01-20 15:59 ` Mykyta Yatsenko
  2026-01-21  2:30 ` [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq patchwork-bot+netdevbpf
  10 siblings, 0 replies; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 15:59 UTC (permalink / raw)
  To: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87
  Cc: Mykyta Yatsenko

From: Mykyta Yatsenko <yatsenko@meta.com>

Add a test that verifies that bpf_timer_cancel_async() works: it can
successfully cancel a pending callback.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
 tools/testing/selftests/bpf/prog_tests/timer.c | 25 +++++++++++++++++++++++++
 tools/testing/selftests/bpf/progs/timer.c      | 23 +++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/timer.c b/tools/testing/selftests/bpf/prog_tests/timer.c
index a157a2a699e638c9f21712b1e7194fc4b6382e71..2b932d4dfd436fd322bd07169f492e20e4ec7624 100644
--- a/tools/testing/selftests/bpf/prog_tests/timer.c
+++ b/tools/testing/selftests/bpf/prog_tests/timer.c
@@ -99,6 +99,26 @@ static int timer(struct timer *timer_skel)
 	return 0;
 }
 
+static int timer_cancel_async(struct timer *timer_skel)
+{
+	int err, prog_fd;
+	LIBBPF_OPTS(bpf_test_run_opts, topts);
+
+	prog_fd = bpf_program__fd(timer_skel->progs.test_async_cancel_succeed);
+	err = bpf_prog_test_run_opts(prog_fd, &topts);
+	ASSERT_OK(err, "test_run");
+	ASSERT_EQ(topts.retval, 0, "test_run");
+
+	usleep(500);
+	/* check that there were no errors in timer execution */
+	ASSERT_EQ(timer_skel->bss->err, 0, "err");
+
+	/* check that code paths completed */
+	ASSERT_EQ(timer_skel->bss->ok, 1 | 2 | 4, "ok");
+
+	return 0;
+}
+
 static void test_timer(int (*timer_test_fn)(struct timer *timer_skel))
 {
 	struct timer *timer_skel = NULL;
@@ -134,6 +154,11 @@ void serial_test_timer_stress_async_cancel(void)
 	test_timer(timer_stress_async_cancel);
 }
 
+void serial_test_timer_async_cancel(void)
+{
+	test_timer(timer_cancel_async);
+}
+
 void test_timer_interrupt(void)
 {
 	struct timer_interrupt *skel = NULL;
diff --git a/tools/testing/selftests/bpf/progs/timer.c b/tools/testing/selftests/bpf/progs/timer.c
index a81413514e4b07ef745f27eade71454234e731e8..4b4ca781e7cdcf78015359cbd8f8d8ff591d6036 100644
--- a/tools/testing/selftests/bpf/progs/timer.c
+++ b/tools/testing/selftests/bpf/progs/timer.c
@@ -169,6 +169,29 @@ int BPF_PROG2(test1, int, a)
 	return 0;
 }
 
+static int timer_error(void *map, int *key, struct bpf_timer *timer)
+{
+	err = 42;
+	return 0;
+}
+
+SEC("syscall")
+int test_async_cancel_succeed(void *ctx)
+{
+	struct bpf_timer *arr_timer;
+	int array_key = ARRAY;
+
+	arr_timer = bpf_map_lookup_elem(&array, &array_key);
+	if (!arr_timer)
+		return 0;
+	bpf_timer_init(arr_timer, &array, CLOCK_MONOTONIC);
+	bpf_timer_set_callback(arr_timer, timer_error);
+	bpf_timer_start(arr_timer, 100000 /* 100us */, 0);
+	bpf_timer_cancel_async(arr_timer);
+	ok = 7;
+	return 0;
+}
+
 /* callback for prealloc and non-prealloca hashtab timers */
 static int timer_cb2(void *map, int *key, struct hmap_elem *val)
 {

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-20 15:59 ` [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
@ 2026-01-20 18:31   ` Andrii Nakryiko
  2026-01-20 21:17     ` Mykyta Yatsenko
  0 siblings, 1 reply; 15+ messages in thread
From: Andrii Nakryiko @ 2026-01-20 18:31 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Tue, Jan 20, 2026 at 7:59 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> From: Mykyta Yatsenko <yatsenko@meta.com>
>
> Refactor bpf timer and workqueue helpers to allow calling them from NMI
> context by making all operations lock-free and deferring NMI-unsafe
> work to irq_work.
>
> Previously, bpf_timer_start(), and bpf_wq_start()
> could not be called from NMI context because they acquired
> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
> patch removes these limitations.
>
> Key changes:
>  * Remove bpf_spin_lock from struct bpf_async_kern.
>  * Initialize/Destroy via setting/unsetting bpf_async_cb pointer
>    atomically.
>  * Add per-bpf_async_cb irq_work to defer NMI-unsafe
>    operations (hrtimer_start, hrtimer_try_to_cancel, schedule_work) from
>    NMI to softirq context.
>  * Use the lock-free seqcount_latch_t to pass operation
>    commands (start/cancel/free) and parameters
>    from NMI-safe callers to the irq_work handler.
>  * Add reference counting to bpf_async_cb to ensure the object stays
>    alive until all scheduled irq_work completes.
>  * Move bpf_prog_put() to RCU callback to handle races between
>    set_callback() and cancel_and_free().
>  * Modify cancel_and_free() path:
>    * Detach bpf_async_cb.
>    * Signal destruction to irq_work side via setting last_seq to
>      BPF_ASYNC_DESTROY.
>    * On receiving BPF_ASYNC_DESTROY, cancel timer/wq.
>  * Free bpf_async_cb on refcnt reaching 0, wait for both rcu and rcu
>    task trace grace periods before freeing the bpf_async_cb. Removed
>    unnecessary rcu locks, as kfunc/helper always assumes rcu or rcu
>    task trace lock.
>
> This enables BPF programs attached to NMI-context hooks (perf
> events) to use timers and workqueues for deferred processing.
>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 423 +++++++++++++++++++++++++++++----------------------
>  1 file changed, 240 insertions(+), 183 deletions(-)

I think we'll need another revision, but all the things I pointed out
below do not change the logic in any big way, so it would be great for
people that were following along but waited for the overall
implementation to stabilize to actually check the implementation now.
I think it's basically ready, hopefully the next revision will land,
so let's do another round of thorough reviews.


[...]

> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, enum bpf_async_op op,
> +                                u64 nsec, u32 timer_mode)
> +{
> +       /* Acquire active writer */
> +       if (atomic_cmpxchg_acquire(&cb->writer, 0, 1))
> +               return -EBUSY;
> +
> +       write_seqcount_latch_begin(&cb->latch);
> +       cb->cmd[0].nsec = nsec;
> +       cb->cmd[0].mode = timer_mode;
> +       cb->cmd[0].op = op;
> +       write_seqcount_latch(&cb->latch);
> +       cb->cmd[1].nsec = nsec;
> +       cb->cmd[1].mode = timer_mode;
> +       cb->cmd[1].op = op;
> +       write_seqcount_latch_end(&cb->latch);
> +
> +       atomic_set_release(&cb->writer, 0);
> +
> +       if (!refcount_inc_not_zero(&cb->refcnt))
> +               return -EBUSY;

let's not do a refcount bump unless we are going to schedule irq_work,
so move this after the !in_nmi() check, and instead of
bpf_async_irq_worker(), call only the part of it that does the work but
doesn't put the refcount. This will speed up the common non-NMI case.
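
Something along these lines (rough, untested sketch; bpf_async_run_pending_op()
is a hypothetical name for the read_op + process_op part of
bpf_async_irq_worker(), see also the sketch further down):

if (!in_nmi()) {
    /* run the latched command inline, no extra reference needed */
    bpf_async_run_pending_op(cb);
    return 0;
}

/* in NMI: hand a reference over to the irq_work callback */
if (!refcount_inc_not_zero(&cb->refcnt))
    return -EBUSY;
if (!irq_work_queue(&cb->worker))
    bpf_async_refcount_put(cb);
return 0;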

[...]

> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
> +                            u64 *nsec, u32 *flags)
> +{
> +       u32 seq, idx;
> +       s64 last_seq;
> +
> +       while (true) {
> +               last_seq = atomic64_read_acquire(&cb->last_seq);
> +               if (last_seq > U32_MAX) /* Check if terminal seq num has been set */

unlikely() seems to be warranted here to optimize (a little bit) a common case

> +                       return bpf_async_handle_terminal_seq(cb, last_seq, op);

why this helper function, it's called once. It's actually a
distraction, IMO, that this logic is split out. Can you please inline
it here?

> +
> +               seq = raw_read_seqcount_latch(&cb->latch);
> +
> +               /* Return -EBUSY if current seq is consumed by another reader */
> +               if (seq == last_seq)
> +                       return -EBUSY;

I'd drop this check, below atomic64_try_cmpxchg() should handle this
unlikely situation anyways, no?

> +
> +               idx = seq & 1;
> +               *nsec = cb->cmd[idx].nsec;

[...]

> -       ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
> -out:
> -       __bpf_spin_unlock_irqrestore(&async->lock);
> -       return ret;
> +       cb = READ_ONCE(async->cb);
> +       if (!cb)
> +               return -EINVAL;
> +
> +       return bpf_async_update_prog_callback(cb, callback_fn, prog);

nit: if you end up with another revision, consider swapping callback
and prog arguments order, it doesn't make sense for callback (that
belongs to that prog) to be specified before the prog. Super minor,
but reads very backwards.

>  }
>
>  BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,

[...]

> @@ -1536,79 +1617,75 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
>         .arg1_type      = ARG_PTR_TO_TIMER,
>  };
>
> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>  {
>         struct bpf_async_cb *cb;
>
> -       /* Performance optimization: read async->cb without lock first. */
> -       if (!READ_ONCE(async->cb))
> -               return NULL;
> -

I don't feel strongly about this, but I guess we could leave this
performance optimization in place without changing anything else, so
why not?

> -       __bpf_spin_lock_irqsave(&async->lock);
> -       /* re-read it under lock */
> -       cb = async->cb;
> +       cb = xchg(&async->cb, NULL);
>         if (!cb)
> -               goto out;
> -       bpf_async_update_prog_callback(cb, NULL, NULL);
> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> -        * this timer, since it won't be initialized.
> -        */
> -       WRITE_ONCE(async->cb, NULL);
> -out:
> -       __bpf_spin_unlock_irqrestore(&async->lock);
> -       return cb;
> +               return;
> +
> +       atomic64_set(&cb->last_seq, BPF_ASYNC_DESTROY);
> +       /* Pass map's reference to irq_work callback */

I'd expand here, because it's subtle but important. Just to make it
very clear: we *attempt* to pass our initial map's reference to
irq_work callback, but if we fail to schedule it (meaning it is
already scheduled by someone else having their own ref), we have to
put that ref here.

It's important to have a noticeable comment here because refcounting
here looks asymmetrical (and thus broken at first sight), but it's
actually not.

But also, can't we do the same !in_nmi() optimization here as with
schedule_op above?
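
For example (wording sketch only, the code itself stays as is):

/* Attempt to pass the map's initial reference on cb to the irq_work
 * callback. If irq_work_queue() fails, the work is already scheduled
 * by someone else who holds their own reference, so we have to put
 * the map's reference here ourselves.
 */
if (!irq_work_queue(&cb->worker))
    bpf_async_refcount_put(cb);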

> +       if (!irq_work_queue(&cb->worker))
> +               bpf_async_refcount_put(cb);
>  }
>

[...]

> -       if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -               /* If the timer is running on other CPU, also use a kworker to
> -                * wait for the completion of the timer instead of trying to
> -                * acquire a sleepable lock in hrtimer_cancel() to wait for its
> -                * completion.
> -                */
> -               if (hrtimer_try_to_cancel(&t->timer) >= 0)
> -                       call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
> -               else
> -                       queue_work(system_dfl_wq, &t->cb.delete_work);
> -       } else {
> -               bpf_timer_delete_work(&t->cb.delete_work);
> +               switch (op) {
> +               case BPF_ASYNC_START:
> +                       hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
> +                       break;
> +               case BPF_ASYNC_CANCEL:
> +                       hrtimer_try_to_cancel(&t->timer);

I think it's important to point out (in patch description) that we now
have asynchronous timer cancellation on cancel_and_free, and thus we
don't need all that hrtimer_running logic that you removed above (and
a huge comment accompanying it).

> +                       break;
> +               default:

Drop BPF_ASYNC_CANCEL_AND_FREE which we don't use anymore, and then
you won't need default (and compiler will complain if we add new
operation that isn't handled explicitly).

If that doesn't work for some reason, at least WARN_ON_ONCE() here.
This shouldn't ever happen, but if it does during development, we
better know about that.
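
i.e. at a minimum something like (sketch):

default:
    WARN_ON_ONCE(1); /* unhandled bpf_async_op, should never happen */
    break;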

> +                       break;
> +               }
> +               break;
> +       }
> +       case BPF_ASYNC_TYPE_WQ: {
> +               struct bpf_work *w = container_of(cb, struct bpf_work, cb);
> +
> +               switch (op) {
> +               case BPF_ASYNC_START:
> +                       schedule_work(&w->work);
> +                       break;
> +               case BPF_ASYNC_CANCEL:
> +                       /* Use non-blocking cancel, safe in irq_work context.
> +                        * RCU grace period ensures callback completes before free.
> +                        */
> +                       cancel_work(&w->work);
> +                       break;
> +               default:

same as above

> +                       break;
> +               }
> +               break;
>         }
> +       }
> +}
> +
> +static void bpf_async_irq_worker(struct irq_work *work)
> +{
> +       struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
> +       u32 op, timer_mode;
> +       u64 nsec;
> +       int err;
> +
> +       err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
> +       if (err)
> +               goto out;
> +
> +       bpf_async_process_op(cb, op, nsec, timer_mode);

if you split read_op+process_op combo into a separate function, then
you won't need refcount manipulations outside of in_nmi context. Let's
do it.
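
Rough sketch of what I mean (untested, the helper name is made up):

static void bpf_async_run_pending_op(struct bpf_async_cb *cb)
{
    enum bpf_async_op op;
    u32 timer_mode;
    u64 nsec;

    /* read the latched command and, if one is pending, execute it */
    if (bpf_async_read_op(cb, &op, &nsec, &timer_mode))
        return;
    bpf_async_process_op(cb, op, nsec, timer_mode);
}

static void bpf_async_irq_worker(struct irq_work *work)
{
    struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);

    bpf_async_run_pending_op(cb);
    bpf_async_refcount_put(cb); /* drop the ref taken when queueing */
}

Then the !in_nmi() fast path in bpf_async_schedule_op() can call
bpf_async_run_pending_op() directly and never touch the refcount.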

> +
> +out:
> +       bpf_async_refcount_put(cb);
>  }
>
>  /*

[...]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-20 18:31   ` Andrii Nakryiko
@ 2026-01-20 21:17     ` Mykyta Yatsenko
  2026-01-21  0:26       ` Andrii Nakryiko
  0 siblings, 1 reply; 15+ messages in thread
From: Mykyta Yatsenko @ 2026-01-20 21:17 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On 1/20/26 18:31, Andrii Nakryiko wrote:
> On Tue, Jan 20, 2026 at 7:59 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Refactor bpf timer and workqueue helpers to allow calling them from NMI
>> context by making all operations lock-free and deferring NMI-unsafe
>> work to irq_work.
>>
>> Previously, bpf_timer_start(), and bpf_wq_start()
>> could not be called from NMI context because they acquired
>> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
>> patch removes these limitations.
>>
>> Key changes:
>>   * Remove bpf_spin_lock from struct bpf_async_kern.
>>   * Initialize/Destroy via setting/unsetting bpf_async_cb pointer
>>     atomically.
>>   * Add per-bpf_async_cb irq_work to defer NMI-unsafe
>>     operations (hrtimer_start, hrtimer_try_to_cancel, schedule_work) from
>>     NMI to softirq context.
>>   * Use the lock-free seqcount_latch_t to pass operation
>>     commands (start/cancel/free) and parameters
>>     from NMI-safe callers to the irq_work handler.
>>   * Add reference counting to bpf_async_cb to ensure the object stays
>>     alive until all scheduled irq_work completes.
>>   * Move bpf_prog_put() to RCU callback to handle races between
>>     set_callback() and cancel_and_free().
>>   * Modify cancel_and_free() path:
>>     * Detach bpf_async_cb.
>>     * Signal destruction to irq_work side via setting last_seq to
>>       BPF_ASYNC_DESTROY.
>>     * On receiving BPF_ASYNC_DESTROY, cancel timer/wq.
>>   * Free bpf_async_cb on refcnt reaching 0, wait for both rcu and rcu
>>     task trace grace periods before freeing the bpf_async_cb. Removed
>>     unnecessary rcu locks, as kfunc/helper always assumes rcu or rcu
>>     task trace lock.
>>
>> This enables BPF programs attached to NMI-context hooks (perf
>> events) to use timers and workqueues for deferred processing.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
>>   kernel/bpf/helpers.c | 423 +++++++++++++++++++++++++++++----------------------
>>   1 file changed, 240 insertions(+), 183 deletions(-)
> I think we'll need another revision, but all the things I pointed out
> below do not change the logic in any big way, so it would be great for
> people that were following along but waited for the overall
> implementation to stabilize to actually check the implementation now.
> I think it's basically ready, hopefully the next revision will land,
> so let's do another round of thorough reviews.
>
>
> [...]
>
>> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, enum bpf_async_op op,
>> +                                u64 nsec, u32 timer_mode)
>> +{
>> +       /* Acquire active writer */
>> +       if (atomic_cmpxchg_acquire(&cb->writer, 0, 1))
>> +               return -EBUSY;
>> +
>> +       write_seqcount_latch_begin(&cb->latch);
>> +       cb->cmd[0].nsec = nsec;
>> +       cb->cmd[0].mode = timer_mode;
>> +       cb->cmd[0].op = op;
>> +       write_seqcount_latch(&cb->latch);
>> +       cb->cmd[1].nsec = nsec;
>> +       cb->cmd[1].mode = timer_mode;
>> +       cb->cmd[1].op = op;
>> +       write_seqcount_latch_end(&cb->latch);
>> +
>> +       atomic_set_release(&cb->writer, 0);
>> +
>> +       if (!refcount_inc_not_zero(&cb->refcnt))
>> +               return -EBUSY;
> let's not do a refcount bump unless we are going to schedule irq_work,
> so move this after the !in_nmi() check, and instead of
> bpf_async_irq_worker(), call only the part of it that does the work but
> doesn't put the refcount. This will speed up the common non-NMI case.
Thanks for reminding me about this one.
>
> [...]
>
>> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
>> +                            u64 *nsec, u32 *flags)
>> +{
>> +       u32 seq, idx;
>> +       s64 last_seq;
>> +
>> +       while (true) {
>> +               last_seq = atomic64_read_acquire(&cb->last_seq);
>> +               if (last_seq > U32_MAX) /* Check if terminal seq num has been set */
> unlikely() seems to be warranted here to optimize (a little bit) a common case
>
>> +                       return bpf_async_handle_terminal_seq(cb, last_seq, op);
> why this helper function, it's called once. It's actually a
> distraction, IMO, that this logic is split out. Can you please inline
> it here?
It looked ugly inlined; I saw value in attaching the word "terminal" to
the functionality.
I'll send it inlined in the next version.
>
>> +
>> +               seq = raw_read_seqcount_latch(&cb->latch);
>> +
>> +               /* Return -EBUSY if current seq is consumed by another reader */
>> +               if (seq == last_seq)
>> +                       return -EBUSY;
> I'd drop this check, below atomic64_try_cmpxchg() should handle this
> unlikely situation anyways, no?
I don't think cmpxchg handles this: if seq == last_seq, cmpxchg still
succeeds; it only fails if cb->last_seq != last_seq.
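
In other words, the intent of that check is (comment sketch only):

/* Each latched command must be consumed at most once. If the current
 * seq was already claimed (seq == last_seq), another reader has
 * consumed it, so bail out instead of re-running the same command.
 */
if (seq == last_seq)
    return -EBUSY;
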
>
>> +
>> +               idx = seq & 1;
>> +               *nsec = cb->cmd[idx].nsec;
> [...]
>
>> -       ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
>> -out:
>> -       __bpf_spin_unlock_irqrestore(&async->lock);
>> -       return ret;
>> +       cb = READ_ONCE(async->cb);
>> +       if (!cb)
>> +               return -EINVAL;
>> +
>> +       return bpf_async_update_prog_callback(cb, callback_fn, prog);
> nit: if you end up with another revision, consider swapping callback
> and prog arguments order, it doesn't make sense for callback (that
> belongs to that prog) to be specified before the prog. Super minor,
> but reads very backwards.
>
>>   }
>>
>>   BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
> [...]
>
>> @@ -1536,79 +1617,75 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
>>          .arg1_type      = ARG_PTR_TO_TIMER,
>>   };
>>
>> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
>> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>>   {
>>          struct bpf_async_cb *cb;
>>
>> -       /* Performance optimization: read async->cb without lock first. */
>> -       if (!READ_ONCE(async->cb))
>> -               return NULL;
>> -
> I don't feel strongly about this, but I guess we could leave this
> performance optimization in place without changing anything else, so
> why not?
>
>> -       __bpf_spin_lock_irqsave(&async->lock);
>> -       /* re-read it under lock */
>> -       cb = async->cb;
>> +       cb = xchg(&async->cb, NULL);
>>          if (!cb)
>> -               goto out;
>> -       bpf_async_update_prog_callback(cb, NULL, NULL);
>> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>> -        * this timer, since it won't be initialized.
>> -        */
>> -       WRITE_ONCE(async->cb, NULL);
>> -out:
>> -       __bpf_spin_unlock_irqrestore(&async->lock);
>> -       return cb;
>> +               return;
>> +
>> +       atomic64_set(&cb->last_seq, BPF_ASYNC_DESTROY);
>> +       /* Pass map's reference to irq_work callback */
> I'd expand here, because it's subtle but important. Just to make it
> very clear: we *attempt* to pass our initial map's reference to
> irq_work callback, but if we fail to schedule it (meaning it is
> already scheduled by someone else having their own ref), we have to
> put that ref here.
>
> It's important to have a noticeable comment here because refcounting
> here looks asymmetrical (and thus broken at first sight), but it's
> actually not.
>
> But also, can't we do the same !in_nmi() optimization here as with
> schedule_op above?
>
>> +       if (!irq_work_queue(&cb->worker))
>> +               bpf_async_refcount_put(cb);
>>   }
>>
> [...]
>
>> -       if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> -               /* If the timer is running on other CPU, also use a kworker to
>> -                * wait for the completion of the timer instead of trying to
>> -                * acquire a sleepable lock in hrtimer_cancel() to wait for its
>> -                * completion.
>> -                */
>> -               if (hrtimer_try_to_cancel(&t->timer) >= 0)
>> -                       call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
>> -               else
>> -                       queue_work(system_dfl_wq, &t->cb.delete_work);
>> -       } else {
>> -               bpf_timer_delete_work(&t->cb.delete_work);
>> +               switch (op) {
>> +               case BPF_ASYNC_START:
>> +                       hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
>> +                       break;
>> +               case BPF_ASYNC_CANCEL:
>> +                       hrtimer_try_to_cancel(&t->timer);
> I think it's important to point out (in patch description) that we now
> have asynchronous timer cancellation on cancel_and_free, and thus we
> don't need all that hrtimer_running logic that you removed above (and
> a huge comment accompanying it).
>
>> +                       break;
>> +               default:
> Drop BPF_ASYNC_CANCEL_AND_FREE which we don't use anymore, and then
> you won't need default (and compiler will complain if we add new
> operation that isn't handled explicitly).
>
> If that doesn't work for some reason, at least WARN_ON_ONCE() here.
> This shouldn't ever happen, but if it does during development, we
> better know about that.
>
>> +                       break;
>> +               }
>> +               break;
>> +       }
>> +       case BPF_ASYNC_TYPE_WQ: {
>> +               struct bpf_work *w = container_of(cb, struct bpf_work, cb);
>> +
>> +               switch (op) {
>> +               case BPF_ASYNC_START:
>> +                       schedule_work(&w->work);
>> +                       break;
>> +               case BPF_ASYNC_CANCEL:
>> +                       /* Use non-blocking cancel, safe in irq_work context.
>> +                        * RCU grace period ensures callback completes before free.
>> +                        */
>> +                       cancel_work(&w->work);
>> +                       break;
>> +               default:
> same as above
>
>> +                       break;
>> +               }
>> +               break;
>>          }
>> +       }
>> +}
>> +
>> +static void bpf_async_irq_worker(struct irq_work *work)
>> +{
>> +       struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
>> +       u32 op, timer_mode;
>> +       u64 nsec;
>> +       int err;
>> +
>> +       err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
>> +       if (err)
>> +               goto out;
>> +
>> +       bpf_async_process_op(cb, op, nsec, timer_mode);
> if you split read_op+process_op combo into a separate function, then
> you won't need refcount manipulations outside of in_nmi context. Let's
> do it.
Thanks!
>> +
>> +out:
>> +       bpf_async_refcount_put(cb);
>>   }
>>
>>   /*
> [...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI
  2026-01-20 21:17     ` Mykyta Yatsenko
@ 2026-01-21  0:26       ` Andrii Nakryiko
  0 siblings, 0 replies; 15+ messages in thread
From: Andrii Nakryiko @ 2026-01-21  0:26 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	Mykyta Yatsenko

On Tue, Jan 20, 2026 at 1:17 PM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 1/20/26 18:31, Andrii Nakryiko wrote:
> > On Tue, Jan 20, 2026 at 7:59 AM Mykyta Yatsenko
> > <mykyta.yatsenko5@gmail.com> wrote:
> >> From: Mykyta Yatsenko <yatsenko@meta.com>
> >>
> >> Refactor bpf timer and workqueue helpers to allow calling them from NMI
> >> context by making all operations lock-free and deferring NMI-unsafe
> >> work to irq_work.
> >>
> >> Previously, bpf_timer_start(), and bpf_wq_start()
> >> could not be called from NMI context because they acquired
> >> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
> >> patch removes these limitations.
> >>
> >> Key changes:
> >>   * Remove bpf_spin_lock from struct bpf_async_kern.
> >>   * Initialize/Destroy via setting/unsetting bpf_async_cb pointer
> >>     atomically.
> >>   * Add per-bpf_async_cb irq_work to defer NMI-unsafe
> >>     operations (hrtimer_start, hrtimer_try_to_cancel, schedule_work) from
> >>     NMI to softirq context.
> >>   * Use the lock-free seqcount_latch_t to pass operation
> >>     commands (start/cancel/free) and parameters
> >>     from NMI-safe callers to the irq_work handler.
> >>   * Add reference counting to bpf_async_cb to ensure the object stays
> >>     alive until all scheduled irq_work completes.
> >>   * Move bpf_prog_put() to RCU callback to handle races between
> >>     set_callback() and cancel_and_free().
> >>   * Modify cancel_and_free() path:
> >>     * Detach bpf_async_cb.
> >>     * Signal destruction to irq_work side via setting last_seq to
> >>       BPF_ASYNC_DESTROY.
> >>     * On receiving BPF_ASYNC_DESTROY, cancel timer/wq.
> >>   * Free bpf_async_cb on refcnt reaching 0, wait for both rcu and rcu
> >>     task trace grace periods before freeing the bpf_async_cb. Removed
> >>     unnecessary rcu locks, as kfunc/helper always assumes rcu or rcu
> >>     task trace lock.
> >>
> >> This enables BPF programs attached to NMI-context hooks (perf
> >> events) to use timers and workqueues for deferred processing.
> >>
> >> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> >> ---
> >>   kernel/bpf/helpers.c | 423 +++++++++++++++++++++++++++++----------------------
> >>   1 file changed, 240 insertions(+), 183 deletions(-)
> > I think we'll need another revision, but all the things I pointed out
> > below do not change the logic in any big way, so it would be great for
> > people that were following along but waited for the overall
> > implementation to stabilize to actually check the implementation now.
> > I think it's basically ready, hopefully the next revision will land,
> > so let's do another round of thorough reviews.
> >
> >

[...]

> > [...]
> >
> >> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
> >> +                            u64 *nsec, u32 *flags)
> >> +{
> >> +       u32 seq, idx;
> >> +       s64 last_seq;
> >> +
> >> +       while (true) {
> >> +               last_seq = atomic64_read_acquire(&cb->last_seq);
> >> +               if (last_seq > U32_MAX) /* Check if terminal seq num has been set */
> > unlikely() seems to be warranted here to optimize (a little bit) a common case
> >
> >> +                       return bpf_async_handle_terminal_seq(cb, last_seq, op);
> > why this helper function, it's called once. It's actually a
> > distraction, IMO, that this logic is split out. Can you please inline
> > it here?
> It looked ugly inlined; I saw value in attaching the word "terminal" to
> the functionality.
> I'll send it inlined in the next version.

I don't know, BPF_ASYNC_DESTROY -> BPF_ASYNC_DESTROYED has the same
terminal vibe :)

As for ugly, I don't know, let's see...

if (unlikely(last_seq >= BPF_ASYNC_DESTROY)) {
    u64 expected = BPF_ASYNC_DESTROY;

    if (!try_cmpxchg_release(&cb->last_seq, &expected, BPF_ASYNC_DESTROYED))
        return -EBUSY;

    *op = BPF_ASYNC_CANCEL;
    return 0;
}


Looks fine to me, tbh.

And note, you don't need to check last_seq ahead of time; this is a rare
event, so there is no need to optimize with this speculative read check
(unlike the cancel_and_free one I mentioned in my last reply: that one,
across millions of map elements, might add up to something).


> >
> >> +
> >> +               seq = raw_read_seqcount_latch(&cb->latch);
> >> +
> >> +               /* Return -EBUSY if current seq is consumed by another reader */
> >> +               if (seq == last_seq)
> >> +                       return -EBUSY;
> > I'd drop this check, below atomic64_try_cmpxchg() should handle this
> > unlikely situation anyways, no?
> I don't think cmpxchg handles this: if seq == last_seq, cmpxchg still
> succeeds; it only fails if cb->last_seq != last_seq.

doh, of course, what was I thinking; sorry, stupid suggestion on my part

> >
> >> +
> >> +               idx = seq & 1;
> >> +               *nsec = cb->cmd[idx].nsec;
> > [...]
> >
> >> -       ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
> >> -out:
> >> -       __bpf_spin_unlock_irqrestore(&async->lock);
> >> -       return ret;
> >> +       cb = READ_ONCE(async->cb);
> >> +       if (!cb)
> >> +               return -EINVAL;
> >> +
> >> +       return bpf_async_update_prog_callback(cb, callback_fn, prog);
> > nit: if you end up with another revision, consider swapping callback
> > and prog arguments order, it doesn't make sense for callback (that
> > belongs to that prog) to be specified before the prog. Super minor,
> > but reads very backwards.
> >
> >>   }
> >>
> >>   BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
> > [...]
> >
> >> @@ -1536,79 +1617,75 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
> >>          .arg1_type      = ARG_PTR_TO_TIMER,
> >>   };
> >>
> >> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
> >> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
> >>   {
> >>          struct bpf_async_cb *cb;
> >>
> >> -       /* Performance optimization: read async->cb without lock first. */
> >> -       if (!READ_ONCE(async->cb))
> >> -               return NULL;
> >> -
> > I don't feel strongly about this, but I guess we could leave this
> > performance optimization in place without changing anything else, so
> > why not?

It's useful for reviewers and readers to see acknowledgement of
requests like this, as it's not always clear if this was missed or you
are just going to apply the suggestion in the next revision. So please
consider replying even if you agree.

> >
> >> -       __bpf_spin_lock_irqsave(&async->lock);
> >> -       /* re-read it under lock */
> >> -       cb = async->cb;
> >> +       cb = xchg(&async->cb, NULL);
> >>          if (!cb)
> >> -               goto out;
> >> -       bpf_async_update_prog_callback(cb, NULL, NULL);
> >> -       /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> >> -        * this timer, since it won't be initialized.
> >> -        */
> >> -       WRITE_ONCE(async->cb, NULL);
> >> -out:
> >> -       __bpf_spin_unlock_irqrestore(&async->lock);
> >> -       return cb;
> >> +               return;
> >> +
> >> +       atomic64_set(&cb->last_seq, BPF_ASYNC_DESTROY);
> >> +       /* Pass map's reference to irq_work callback */
> > I'd expand here, because it's subtle but important. Just to make it
> > very clear: we *attempt* to pass our initial map's reference to
> > irq_work callback, but if we fail to schedule it (meaning it is
> > already scheduled by someone else having their own ref), we have to
> > put that ref here.
> >
> > It's important to have a noticeable comment here because refcounting
> > here looks asymmetrical (and thus broken at first sight), but it's
> > actually not.
> >
> > But also, can't we do the same !in_nmi() optimization here as with
> > schedule_op above?
> >

same point as above (and a few more below): I assume you agree, but it
would be nice to ack this explicitly (this is a general point for
email-based reviews)

> >> +       if (!irq_work_queue(&cb->worker))
> >> +               bpf_async_refcount_put(cb);
> >>   }
> >>

[...]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq
  2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
                   ` (9 preceding siblings ...)
  2026-01-20 15:59 ` [PATCH bpf-next v6 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
@ 2026-01-21  2:30 ` patchwork-bot+netdevbpf
  10 siblings, 0 replies; 15+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-01-21  2:30 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, ast, andrii, daniel, kafai, kernel-team, memxor, eddyz87,
	yatsenko

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Tue, 20 Jan 2026 15:59:09 +0000 you wrote:
> This series reworks implementation of BPF timer and workqueue APIs.
> The goal is to make both timers and wq non-blocking, enabling their use
> in NMI context.
> Today this code relies on a bpf_spin_lock embedded in the map element to
> serialize:
>  * init of the async object,
>  * setting/changing the callback and bpf_prog
>  * starting/cancelling the timer/work
>  * tearing down when the map element is deleted or the map’s user ref is
>  dropped
> 
> [...]

Here is the summary with links:
  - [bpf-next,v6,01/10] bpf: Factor out timer deletion helper
    https://git.kernel.org/bpf/bpf-next/c/c1f2c449de27
  - [bpf-next,v6,02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback()
    https://git.kernel.org/bpf/bpf-next/c/57d31e72dbdd
  - [bpf-next,v6,03/10] bpf: Introduce lock-free bpf_async_update_prog_callback()
    https://git.kernel.org/bpf/bpf-next/c/8bb1e32b3fac
  - [bpf-next,v6,04/10] bpf: Simplify bpf_timer_cancel()
    https://git.kernel.org/bpf/bpf-next/c/83c9030cdc45
  - [bpf-next,v6,05/10] bpf: Enable bpf timer and workqueue use in NMI
    (no matching commit)
  - [bpf-next,v6,06/10] bpf: Add verifier support for bpf_timer argument in kfuncs
    (no matching commit)
  - [bpf-next,v6,07/10] bpf: Introduce bpf_timer_cancel_async() kfunc
    (no matching commit)
  - [bpf-next,v6,08/10] selftests/bpf: Refactor timer selftests
    (no matching commit)
  - [bpf-next,v6,09/10] selftests/bpf: Add stress test for timer async cancel
    (no matching commit)
  - [bpf-next,v6,10/10] selftests/bpf: Verify bpf_timer_cancel_async works
    (no matching commit)

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2026-01-21  2:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 03/10] bpf: Introduce lock-free bpf_async_update_prog_callback() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 04/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
2026-01-20 18:31   ` Andrii Nakryiko
2026-01-20 21:17     ` Mykyta Yatsenko
2026-01-21  0:26       ` Andrii Nakryiko
2026-01-20 15:59 ` [PATCH bpf-next v6 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
2026-01-21  2:30 ` [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox