From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
daniel@iogearbox.net, kafai@meta.com, kernel-team@meta.com,
memxor@gmail.com, eddyz87@gmail.com,
Mykyta Yatsenko <yatsenko@meta.com>
Subject: Re: [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI
Date: Tue, 20 Jan 2026 21:17:23 +0000
Message-ID: <2ef459bf-6a7a-4249-b400-aab75f03f44a@gmail.com>
In-Reply-To: <CAEf4Bza1r4db3srwPYymSDC=ynhbZ-vsVJKshy8wGfnRhKkCmQ@mail.gmail.com>
On 1/20/26 18:31, Andrii Nakryiko wrote:
> On Tue, Jan 20, 2026 at 7:59 AM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Refactor bpf timer and workqueue helpers to allow calling them from NMI
>> context by making all operations lock-free and deferring NMI-unsafe
>> work to irq_work.
>>
>> Previously, bpf_timer_start() and bpf_wq_start()
>> could not be called from NMI context because they acquired
>> bpf_spin_lock and called hrtimer/schedule_work APIs directly. This
>> patch removes these limitations.
>>
>> Key changes:
>> * Remove bpf_spin_lock from struct bpf_async_kern.
>> * Initialize/Destroy via setting/unsetting bpf_async_cb pointer
>> atomically.
>> * Add per-bpf_async_cb irq_work to defer NMI-unsafe
>> operations (hrtimer_start, hrtimer_try_to_cancel, schedule_work) from
>> NMI to softirq context.
>> * Use the lock-free seqcount_latch_t to pass operation
>> commands (start/cancel/free) and parameters
>> from NMI-safe callers to the irq_work handler.
>> * Add reference counting to bpf_async_cb to ensure the object stays
>> alive until all scheduled irq_work completes.
>> * Move bpf_prog_put() to RCU callback to handle races between
>> set_callback() and cancel_and_free().
>> * Modify cancel_and_free() path:
>> * Detach bpf_async_cb.
>> * Signal destruction to irq_work side via setting last_seq to
>> BPF_ASYNC_DESTROY.
>> * On receiving BPF_ASYNC_DESTROY, cancel timer/wq.
>> * Free bpf_async_cb on refcnt reaching 0; wait for both rcu and rcu
>> task trace grace periods before freeing the bpf_async_cb. Removed
>> unnecessary rcu locks, as kfuncs/helpers always assume the rcu or rcu
>> task trace lock is held.
>>
>> This enables BPF programs attached to NMI-context hooks (perf
>> events) to use timers and workqueues for deferred processing.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
>> kernel/bpf/helpers.c | 423 +++++++++++++++++++++++++++++----------------------
>> 1 file changed, 240 insertions(+), 183 deletions(-)
> I think we'll need another revision, but none of the things I pointed
> out below change the logic in any big way, so it would be great for
> people who were following along but waited for the overall
> implementation to stabilize to check it now.
> I think it's basically ready and hopefully the next revision will
> land, so let's do another round of thorough reviews.
>
>
> [...]
>
>> +static int bpf_async_schedule_op(struct bpf_async_cb *cb, enum bpf_async_op op,
>> + u64 nsec, u32 timer_mode)
>> +{
>> + /* Acquire active writer */
>> + if (atomic_cmpxchg_acquire(&cb->writer, 0, 1))
>> + return -EBUSY;
>> +
>> + write_seqcount_latch_begin(&cb->latch);
>> + cb->cmd[0].nsec = nsec;
>> + cb->cmd[0].mode = timer_mode;
>> + cb->cmd[0].op = op;
>> + write_seqcount_latch(&cb->latch);
>> + cb->cmd[1].nsec = nsec;
>> + cb->cmd[1].mode = timer_mode;
>> + cb->cmd[1].op = op;
>> + write_seqcount_latch_end(&cb->latch);
>> +
>> + atomic_set_release(&cb->writer, 0);
>> +
>> + if (!refcount_inc_not_zero(&cb->refcnt))
>> + return -EBUSY;
> let's not do refcount bump unless we are going to schedule irq_work,
> so move this after !in_nmi() check, and instead of
> bpf_async_irq_worker() call only part of it that does the work, but
> doesn't put the refcount. This will speed up common non-NMI case.
Thanks for reminding about this one.
>
> [...]
>
>> +static int bpf_async_read_op(struct bpf_async_cb *cb, enum bpf_async_op *op,
>> + u64 *nsec, u32 *flags)
>> +{
>> + u32 seq, idx;
>> + s64 last_seq;
>> +
>> + while (true) {
>> + last_seq = atomic64_read_acquire(&cb->last_seq);
>> + if (last_seq > U32_MAX) /* Check if terminal seq num has been set */
> unlikely() seems to be warranted here to optimize (a little bit) a common case
>
>> + return bpf_async_handle_terminal_seq(cb, last_seq, op);
> why this helper function, it's called once. It's actually a
> distraction, IMO, that this logic is split out. Can you please inline
> it here?
It looked ugly inlined, but I saw value in attaching the word "terminal"
to this functionality.
I'll send it inlined in the next version.
>
>> +
>> + seq = raw_read_seqcount_latch(&cb->latch);
>> +
>> + /* Return -EBUSY if current seq is consumed by another reader */
>> + if (seq == last_seq)
>> + return -EBUSY;
> I'd drop this check, below atomic64_try_cmpxchg() should handle this
> unlikely situation anyways, no?
I don't think cmpxchg handles this: if seq == last_seq, the cmpxchg
still succeeds; it only fails if cb->last_seq != last_seq.
>
>> +
>> + idx = seq & 1;
>> + *nsec = cb->cmd[idx].nsec;
> [...]
>
>> - ret = bpf_async_update_prog_callback(cb, callback_fn, prog);
>> -out:
>> - __bpf_spin_unlock_irqrestore(&async->lock);
>> - return ret;
>> + cb = READ_ONCE(async->cb);
>> + if (!cb)
>> + return -EINVAL;
>> +
>> + return bpf_async_update_prog_callback(cb, callback_fn, prog);
> nit: if you end up with another revision, consider swapping callback
> and prog arguments order, it doesn't make sense for callback (that
> belongs to that prog) to be specified before the prog. Super minor,
> but reads very backwards.
>
>> }
>>
>> BPF_CALL_3(bpf_timer_set_callback, struct bpf_async_kern *, timer, void *, callback_fn,
> [...]
>
>> @@ -1536,79 +1617,75 @@ static const struct bpf_func_proto bpf_timer_cancel_proto = {
>> .arg1_type = ARG_PTR_TO_TIMER,
>> };
>>
>> -static struct bpf_async_cb *__bpf_async_cancel_and_free(struct bpf_async_kern *async)
>> +static void __bpf_async_cancel_and_free(struct bpf_async_kern *async)
>> {
>> struct bpf_async_cb *cb;
>>
>> - /* Performance optimization: read async->cb without lock first. */
>> - if (!READ_ONCE(async->cb))
>> - return NULL;
>> -
> I don't feel strongly about this, but I guess we could leave this
> performance optimization in place without changing anything else, so
> why not?
>
>> - __bpf_spin_lock_irqsave(&async->lock);
>> - /* re-read it under lock */
>> - cb = async->cb;
>> + cb = xchg(&async->cb, NULL);
>> if (!cb)
>> - goto out;
>> - bpf_async_update_prog_callback(cb, NULL, NULL);
>> - /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
>> - * this timer, since it won't be initialized.
>> - */
>> - WRITE_ONCE(async->cb, NULL);
>> -out:
>> - __bpf_spin_unlock_irqrestore(&async->lock);
>> - return cb;
>> + return;
>> +
>> + atomic64_set(&cb->last_seq, BPF_ASYNC_DESTROY);
>> + /* Pass map's reference to irq_work callback */
> I'd expand here, because it's subtle but important. Just to make it
> very clear: we *attempt* to pass our initial map's reference to
> irq_work callback, but if we fail to schedule it (meaning it is
> already scheduled by someone else having their own ref), we have to
> put that ref here.
>
> It's important to have a noticeable comment here because refcounting
> here looks asymmetrical (and thus broken at first sight), but it's
> actually not.
>
> But also, can't we do the same !in_nmi() optimization here as with
> schedule_op above?
>
>> + if (!irq_work_queue(&cb->worker))
>> + bpf_async_refcount_put(cb);
>> }
>>
> [...]
>
>> - if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
>> - /* If the timer is running on other CPU, also use a kworker to
>> - * wait for the completion of the timer instead of trying to
>> - * acquire a sleepable lock in hrtimer_cancel() to wait for its
>> - * completion.
>> - */
>> - if (hrtimer_try_to_cancel(&t->timer) >= 0)
>> - call_rcu(&t->cb.rcu, bpf_async_cb_rcu_free);
>> - else
>> - queue_work(system_dfl_wq, &t->cb.delete_work);
>> - } else {
>> - bpf_timer_delete_work(&t->cb.delete_work);
>> + switch (op) {
>> + case BPF_ASYNC_START:
>> + hrtimer_start(&t->timer, ns_to_ktime(timer_nsec), timer_mode);
>> + break;
>> + case BPF_ASYNC_CANCEL:
>> + hrtimer_try_to_cancel(&t->timer);
> I think it's important to point out (in patch description) that we now
> have asynchronous timer cancellation on cancel_and_free, and thus we
> don't need all that hrtimer_running logic that you removed above (and
> a huge comment accompanying it).
>
>> + break;
>> + default:
> Drop BPF_ASYNC_CANCEL_AND_FREE which we don't use anymore, and then
> you won't need default (and compiler will complain if we add new
> operation that isn't handled explicitly).
>
> If that doesn't work for some reason, at least WARN_ON_ONCE() here.
> This shouldn't ever happen, but if it does during development, we
> better know about that.
>
>> + break;
>> + }
>> + break;
>> + }
>> + case BPF_ASYNC_TYPE_WQ: {
>> + struct bpf_work *w = container_of(cb, struct bpf_work, cb);
>> +
>> + switch (op) {
>> + case BPF_ASYNC_START:
>> + schedule_work(&w->work);
>> + break;
>> + case BPF_ASYNC_CANCEL:
>> + /* Use non-blocking cancel, safe in irq_work context.
>> + * RCU grace period ensures callback completes before free.
>> + */
>> + cancel_work(&w->work);
>> + break;
>> + default:
> same as above
>
>> + break;
>> + }
>> + break;
>> }
>> + }
>> +}
>> +
>> +static void bpf_async_irq_worker(struct irq_work *work)
>> +{
>> + struct bpf_async_cb *cb = container_of(work, struct bpf_async_cb, worker);
>> + u32 op, timer_mode;
>> + u64 nsec;
>> + int err;
>> +
>> + err = bpf_async_read_op(cb, &op, &nsec, &timer_mode);
>> + if (err)
>> + goto out;
>> +
>> + bpf_async_process_op(cb, op, nsec, timer_mode);
> if you split read_op+process_op combo into a separate function, then
> you won't need refcount manipulations outside of in_nmi context. Let's
> do it.
Thanks!
>> +
>> +out:
>> + bpf_async_refcount_put(cb);
>> }
>>
>> /*
> [...]
Thread overview: 15+ messages
2026-01-20 15:59 [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 01/10] bpf: Factor out timer deletion helper Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 02/10] bpf: Remove unnecessary arguments from bpf_async_set_callback() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 03/10] bpf: Introduce lock-free bpf_async_update_prog_callback() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 04/10] bpf: Simplify bpf_timer_cancel() Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 05/10] bpf: Enable bpf timer and workqueue use in NMI Mykyta Yatsenko
2026-01-20 18:31 ` Andrii Nakryiko
2026-01-20 21:17 ` Mykyta Yatsenko [this message]
2026-01-21 0:26 ` Andrii Nakryiko
2026-01-20 15:59 ` [PATCH bpf-next v6 06/10] bpf: Add verifier support for bpf_timer argument in kfuncs Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 07/10] bpf: Introduce bpf_timer_cancel_async() kfunc Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 08/10] selftests/bpf: Refactor timer selftests Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 09/10] selftests/bpf: Add stress test for timer async cancel Mykyta Yatsenko
2026-01-20 15:59 ` [PATCH bpf-next v6 10/10] selftests/bpf: Verify bpf_timer_cancel_async works Mykyta Yatsenko
2026-01-21 2:30 ` [PATCH bpf-next v6 00/10] bpf: Avoid locks in bpf_timer and bpf_wq patchwork-bot+netdevbpf