From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
daniel@iogearbox.net, kafai@meta.com, kernel-team@meta.com,
memxor@gmail.com, eddyz87@gmail.com,
Mykyta Yatsenko <yatsenko@meta.com>
Subject: Re: [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI
Date: Tue, 20 Jan 2026 21:17:23 +0000
Message-ID: <2ef459bf-6a7a-4249-b400-aab75f03f44a@gmail.com>
In-Reply-To: <CAEf4Bza1r4db3srwPYymSDC=ynhbZ-vsVJKshy8wGfnRhKkCmQ@mail.gmail.com>
On 1/20/26 18:31, Andrii Nakryiko wrote:
> On Tue, Jan 20, 2026 at 7:59 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Refactor bpf timer and workqueue helpers to allow calling them from NMI
>> context by making all operations lock-free and deferring NMI-unsafe
>> work to irq_work.
>>
>> Previously, bpf_timer_start() and bpf_wq_start()
>> could not be called from NMI context because they acquired
>> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
>> patch removes these limitations.
>>
>> Key changes:
>> * Remove bpf_spin_lock from struct bpf_async_kern.
>> * Initialize/Destroy via setting/unsetting bpf_async_cb pointer
>> atomically.
>> * Add per-bpf_async_cb irq_work to defer NMI-unsafe
>> operations (hrtimer_start, hrtimer_try_to_cancel, schedule_work) from
>> NMI to softirq context.
>> * Use the lock-free seqcount_latch_t to pass operation
>> commands (start/cancel/free) and parameters
>> from NMI-safe callers to the irq_work handler.
>> * Add reference counting to bpf_async_cb to ensure the object stays
>> alive until all scheduled irq_work completes.
>> * Move bpf_prog_put() to RCU callback to handle races between
>> set_callback() and cancel_and_free().
>> * Modify cancel_and_free() path:
>> * Detach bpf_async_cb.
>> * Signal destruction to irq_work side via setting last_seq to
>> BPF_ASYNC_DESTROY.
>> * On receiving BPF_ASYNC_DESTROY, cancel timer/wq.
>> * Free bpf_async_cb on refcnt reaching 0; wait for both rcu and rcu
>> task trace grace periods before freeing the bpf_async_cb. Removed
>> unnecessary rcu locks, as kfuncs/helpers always assume the rcu or rcu
>> task trace lock is held.
>>
>> This enables BPF programs attached to NMI-context hooks (perf
>> events) to use timers and workqueues for deferred processing.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
>> kernel/bpf/helpers.c | 423 +++++++++++++++++++++++++++++----------------------
>> 1 file changed, 240 insertions(+), 183 deletions(-)
> I think we'll need another revision, but none of the things I pointed
> out below change the logic in any big way, so it would be great for
> people who were following along but waited for the overall
> implementation to stabilize to check it now.
> I think it's basically ready and hopefully the next revision will
> land, so let's do another round of thorough reviews.
>
>
> [...]
>
>> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, enum bpf_async_op op,
>> + u64 nsec, u32 timer_mode)
>> +{
>> + /* Acquire active writer */
>> + if (atomic_cmpxchg_acquire(&cb->writer, 0, 1))
>> + return -EBUSY;
>> +
>> + write_seqcount_latch_begin(&cb->latch);
>> + cb->cmd[0].nsec = nsec;
>> + cb->cmd[0].mode = timer_mode;
>> + cb->cmd[0].op = op;
>> + write_seqcount_latch(&cb->latch);
>> + cb->cmd[1].nsec = nsec;
>> + cb->cmd[1].mode = timer_mode;
>> + cb->cmd[1].op = op;
>> + write_seqcount_latch_end(&cb->latch);
>> +
>> + atomic_set_release(&cb->writer, 0);
>> +
>> + if (!refcount_inc_not_zero(&cb->refcnt))
>> + return -EBUSY;
> let's not do refcount bump unless we are going to schedule irq_work,
> so move this after !in_nmi() check, and instead of
> bpf_async_irq_worker() call only part of it that does the work, but
> doesn't put the refcount. This will speed up common non-NMI case.
Thanks for reminding about this one.
>
> [...]
>
>> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
>> + u64 *nsec, u32 *flags)
>> +{
>> + u32 seq, idx;
>> + s64 last_seq;
>> +
>> + while (true) {
>> + last_seq = atomic64_read_acquire(&cb->last_seq);
>> + if (last_seq > U32_MAX) /* Check if terminal seq num has been set */
> unlikely() seems to be warranted here to optimize (a little bit) a common case
>
>> + return bpf_async_handle_terminal_seq(cb, last_seq, op);
> why this helper function, it's called once. It's actually a
> distraction, IMO, that this logic is split out. Can you please inline
> it here?
It looked ugly inlined, but I saw value in attaching the word "terminal"
to this functionality.
I'll send it inlined in the next version.
>
>> +
>> + seq = raw_read_seqcount_latch(&cb->latch);
>> +
>> + /* Return -EBUSY if current seq is consumed by another reader */
>> + if (seq == last_seq)
>> + return -EBUSY;
> I'd drop this check, below atomic64_try_cmpxchg() should handle this
> unlikely situation anyways, no?
I don't think cmpxchg handles this: if seq == last_seq, the cmpxchg
still succeeds; it only fails if cb->last_seq != last_seq.
>
>> +
>> + idx = seq & 1;
>> + *nsec = cb->cmd[idx].nsec;
> [...]
>
>> - ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
>> -out:
>> - __bpf_spin_unlock_irqrestore(&async->lock);
>> - return ret;
>> + cb = READ_ONCE(async->cb);
>> + if (!cb)
>> + return -EINVAL;
>> +
>> + return bpf_async_update_prog_callback(cb, callback_fn, prog);
> nit: if you end up with another revision, consider swapping callback
> and prog arguments order, it doesn't make sense for callback (that
> belongs to that prog) to be specified before the prog. Super minor,
> but reads very backwards.
>
>> }
>>
>> BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
> [...]
>
>> @@ -1536,79 +1617,75 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
>> .arg1_type = ARG_PTR_TO_TIMER,
>> };
>>
>> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
>> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>> {
>> struct bpf_async_cb *cb;
>>
>> - /* Performance optimization: read async->cb without lock first. */
>> - if (!READ_ONCE(async->cb))
>> - return NULL;
>> -
> I don't feel strongly about this, but I guess we could leave this
> performance optimization in place without changing anything else, so
> why not?
>
>> - __bpf_spin_lock_irqsave(&async->lock);
>> - /* re-read it under lock */
>> - cb = async->cb;
>> + cb = xchg(&async->cb, NULL);
>> if (!cb)
>> - goto out;
>> - bpf_async_update_prog_callback(cb, NULL, NULL);
>> - /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>> - * this timer, since it won't be initialized.
>> - */
>> - WRITE_ONCE(async->cb, NULL);
>> -out:
>> - __bpf_spin_unlock_irqrestore(&async->lock);
>> - return cb;
>> + return;
>> +
>> + atomic64_set(&cb->last_seq, BPF_ASYNC_DESTROY);
>> + /* Pass map's reference to irq_work callback */
> I'd expand here, because it's subtle but important. Just to make it
> very clear: we *attempt* to pass our initial map's reference to
> irq_work callback, but if we fail to schedule it (meaning it is
> already scheduled by someone else having their own ref), we have to
> put that ref here.
>
> It's important to have a noticeable comment here because refcounting
> here looks asymmetrical (and thus broken at first sight), but it's
> actually not.
>
> But also, can't we do the same !in_nmi() optimization here as with
> schedule_op above?
>
>> + if (!irq_work_queue(&cb->worker))
>> + bpf_async_refcount_put(cb);
>> }
>>
> [...]
>
>> - if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> - /* If the timer is running on other CPU, also use a kworker to
>> - * wait for the completion of the timer instead of trying to
>> - * acquire a sleepable lock in hrtimer_cancel() to wait for its
>> - * completion.
>> - */
>> - if (hrtimer_try_to_cancel(&t->timer) >= 0)
>> - call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
>> - else
>> - queue_work(system_dfl_wq, &t->cb.delete_work);
>> - } else {
>> - bpf_timer_delete_work(&t->cb.delete_work);
>> + switch (op) {
>> + case BPF_ASYNC_START:
>> + hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
>> + break;
>> + case BPF_ASYNC_CANCEL:
>> + hrtimer_try_to_cancel(&t->timer);
> I think it's important to point out (in patch description) that we now
> have asynchronous timer cancellation on cancel_and_free, and thus we
> don't need all that hrtimer_running logic that you removed above (and
> a huge comment accompanying it).
>
>> + break;
>> + default:
> Drop BPF_ASYNC_CANCEL_AND_FREE which we don't use anymore, and then
> you won't need default (and compiler will complain if we add new
> operation that isn't handled explicitly).
>
> If that doesn't work for some reason, at least WARN_ON_ONCE() here.
> This shouldn't ever happen, but if it does during development, we
> better know about that.
>
>> + break;
>> + }
>> + break;
>> + }
>> + case BPF_ASYNC_TYPE_WQ: {
>> + struct bpf_work *w = container_of(cb, struct bpf_work, cb);
>> +
>> + switch (op) {
>> + case BPF_ASYNC_START:
>> + schedule_work(&w->work);
>> + break;
>> + case BPF_ASYNC_CANCEL:
>> + /* Use non-blocking cancel, safe in irq_work context.
>> + * RCU grace period ensures callback completes before free.
>> + */
>> + cancel_work(&w->work);
>> + break;
>> + default:
> same as above
>
>> + break;
>> + }
>> + break;
>> }
>> + }
>> +}
>> +
>> +static void bpf_async_irq_worker(struct irq_work *work)
>> +{
>> + struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
>> + u32 op, timer_mode;
>> + u64 nsec;
>> + int err;
>> +
>> + err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
>> + if (err)
>> + goto out;
>> +
>> + bpf_async_process_op(cb, op, nsec, timer_mode);
> if you split read_op+process_op combo into a separate function, then
> you won't need refcount manipulations outside of in_nmi context. Let's
> do it.
Thanks!
>> +
>> +out:
>> + bpf_async_refcount_put(cb);
>> }
>>
>> /*
> [...]
Thread overview: 15+ messages
2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 03/10] bpf: Introduce lock-free bpf_async_update_prog_callback() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 04/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
2026-01-20 18:31 ` Andrii Nakryiko
2026-01-20 21:17 ` Mykyta Yatsenko [this message]
2026-01-21 0:26 ` Andrii Nakryiko
2026-01-20 15:59 ` [PATCH bpf-next v6 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
2026-01-21 2:30 ` [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq patchwork-bot+netdevbpf