Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs

From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org,
	daniel@iogearbox.net, kafai@meta.com, kernel-team@meta.com,
	eddyz87@gmail.com, Mykyta Yatsenko <yatsenko@meta.com>
Subject: Re: [PATCH bpf-next v2 3/4] bpf: task work scheduling kfuncs
Date: Tue, 19 Aug 2025 19:13:50 +0100
Message-ID: <7a40bdcc-3905-4fa2-beac-c7612becabb7@gmail.com>
In-Reply-To: <CAP01T751FPiuZv5yBMeHSAmFmywc7L3iY=jYLb992YOp_94pRQ@mail.gmail.com>

On 8/19/25 15:18, Kumar Kartikeya Dwivedi wrote:
> On Fri, 15 Aug 2025 at 21:22, Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>> From: Mykyta Yatsenko <yatsenko@meta.com>
>>
>> Implementation of the bpf_task_work_schedule kfuncs.
>>
>> Main components:
>>   * struct bpf_task_work_context – Metadata and state management per task
>> work.
>>   * enum bpf_task_work_state – A state machine to serialize work
>>   scheduling and execution.
>>   * bpf_task_work_schedule() – The central helper that initiates
>> scheduling.
>>   * bpf_task_work_acquire() - Attempts to take ownership of the context,
>>   pointed by passed struct bpf_task_work, allocates new context if none
>>   exists yet.
>>   * bpf_task_work_callback() – Invoked when the actual task_work runs.
>>   * bpf_task_work_irq() – An intermediate step (runs in softirq context)
>> to enqueue task work.
>>   * bpf_task_work_cancel_and_free() – Cleanup for deleted BPF map entries.
> Can you elaborate on why the bouncing through irq_work context is necessary?
> I think we should have this info in the commit log.
> Is it to avoid deadlocks with task_work locks and/or task->pi_lock?
Yes, mainly to avoid taking locks in NMI context.
>
>> Flow of successful task work scheduling
>>   1) bpf_task_work_schedule_* is called from BPF code.
>>   2) Transition state from STANDBY to PENDING, marks context is owned by
>>   this task work scheduler
>>   3) irq_work_queue() schedules bpf_task_work_irq().
>>   4) Transition state from PENDING to SCHEDULING.
>>   4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>>   transitions to SCHEDULED.
>>   5) Task work calls bpf_task_work_callback(), which transition state to
>>   RUNNING.
>>   6) BPF callback is executed
>>   7) Context is cleaned up, refcounts released, context state set back to
>>   STANDBY.
>>
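
For reference, the successful flow above can be modelled in userspace with
C11 atomics. This is only a sketch of the transitions, not the kernel code:
the tw_* names are made up for illustration, task_work_add() is assumed to
always succeed, and tw_cmpxchg() just mimics the return convention of the
kernel's cmpxchg().

```c
#include <errno.h>
#include <stdatomic.h>

/* Userspace model only; state names mirror enum bpf_task_work_state. */
enum tw_state {
	TW_STANDBY, TW_PENDING, TW_SCHEDULING, TW_SCHEDULED, TW_RUNNING, TW_FREED,
};

struct tw_ctx {
	_Atomic int state;
};

/* Like kernel cmpxchg(): returns the value seen before the attempt. */
int tw_cmpxchg(struct tw_ctx *ctx, int old, int new)
{
	atomic_compare_exchange_strong(&ctx->state, &old, new);
	return old;
}

/* STANDBY -> PENDING: only one scheduler can take ownership. */
int tw_schedule(struct tw_ctx *ctx)
{
	if (tw_cmpxchg(ctx, TW_STANDBY, TW_PENDING) != TW_STANDBY)
		return -EBUSY;
	return 0;
}

/* PENDING -> SCHEDULING -> SCHEDULED (task_work_add() assumed to succeed). */
int tw_irq(struct tw_ctx *ctx)
{
	if (tw_cmpxchg(ctx, TW_PENDING, TW_SCHEDULING) != TW_PENDING)
		return -1;
	if (tw_cmpxchg(ctx, TW_SCHEDULING, TW_SCHEDULED) != TW_SCHEDULING)
		return -1;
	return 0;
}

/* SCHEDULING|SCHEDULED -> RUNNING, then back to STANDBY after the callback. */
int tw_callback(struct tw_ctx *ctx)
{
	int state = tw_cmpxchg(ctx, TW_SCHEDULING, TW_RUNNING);

	if (state == TW_SCHEDULED)
		state = tw_cmpxchg(ctx, TW_SCHEDULED, TW_RUNNING);
	if (state == TW_FREED)
		return -1;
	/* the BPF callback would run here */
	tw_cmpxchg(ctx, TW_RUNNING, TW_STANDBY);
	return 0;
}
```

A second schedule attempt on the same context gets -EBUSY until the callback
has put the state back to STANDBY.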
>> bpf_task_work_context handling
>> The context pointer is stored in bpf_task_work ctx field (u64) but
>> treated as an __rcu pointer via casts.
>> bpf_task_work_acquire() publishes new bpf_task_work_context by cmpxchg
>> with RCU initializer.
>> Read under the RCU lock only in bpf_task_work_acquire() when ownership
>> is contended.
>> Upon deleting map value, bpf_task_work_cancel_and_free() is detaching
>> context pointer from struct bpf_task_work and releases resources
>> if scheduler does not own the context or can be canceled (state ==
>> STANDBY or state == SCHEDULED and callback canceled). If task work
>> scheduler owns the context, its state is set to FREED and scheduler is
>> expected to cleanup on the next state transition.
>>
>> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
>> ---
> This is much better now, with clear ownership between free path and
> scheduling path, I mostly have a few more comments on the current
> implementation, plus one potential bug.
>
> However, the more time I spend on this, the more I feel we should
> unify all this with the two other bpf async work execution mechanisms
> (timers and wq), and simplify and deduplicate a lot of this under the
> serialized async->lock. I know NMI execution is probably critical for
> this primitive, but we can replace async->lock with rqspinlock to
> address that, so that it becomes safe to serialize in any context.
> Apart from that, I don't see anything that would negate reworking all
> this as a case of BPF_TASK_WORK for bpf_async_kern, modulo internal
> task_work locks that have trouble with NMI execution (see later
> comments).
>
> I also feel like it would be cleaner if we split the API into 3 steps:
> init(), set_callback(), schedule() like the other cases, I don't see
> why we necessarily need to diverge, and it simplifies some of the
> logic in schedule().
> Once every state update is protected by a lock, all of the state
> transitions are done automatically and a lot of the extra races are
> eliminated.
>
> I think we should discuss whether this was considered and why you
> discarded this approach, otherwise the code is pretty complex, with
> little upside.
> Maybe I'm missing something obvious and you'd know more since you've
> thought about all this longer.
As for the API, I think having one function for scheduling a callback is
cleaner than having three that are always called in the same order anyway.
Most of the complexity comes from synchronization, not logic, so not having
to repeat the same synchronization in init(), set_callback() and schedule()
seems like a benefit to me.
Let me check whether using rqspinlock is going to make things simpler. We
still need states, at least to know whether cancellation is possible and to
flag deletion to the scheduler, but using a lock will make the code easier
to understand.
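
Roughly what I have in mind for the lock-based variant, as a userspace
sketch. Everything here is an assumption for illustration: a toy
test-and-set spinlock stands in for rqspinlock, the tw_* helpers are
hypothetical, the state set is reduced, and scheduled work is assumed to be
always cancellable.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

enum tw_state { TW_STANDBY, TW_SCHEDULED, TW_RUNNING, TW_FREED };

struct tw_ctx {
	atomic_flag lock;	/* toy spinlock, stand-in for rqspinlock */
	enum tw_state state;	/* plain field, only touched under the lock */
};

void tw_init(struct tw_ctx *ctx)
{
	atomic_flag_clear(&ctx->lock);
	ctx->state = TW_STANDBY;
}

void tw_lock(struct tw_ctx *ctx)
{
	while (atomic_flag_test_and_set_explicit(&ctx->lock, memory_order_acquire))
		;
}

void tw_unlock(struct tw_ctx *ctx)
{
	atomic_flag_clear_explicit(&ctx->lock, memory_order_release);
}

/* Schedule only from STANDBY; anything else means the context is in use. */
int tw_schedule(struct tw_ctx *ctx)
{
	int err = -EBUSY;

	tw_lock(ctx);
	if (ctx->state == TW_STANDBY) {
		ctx->state = TW_SCHEDULED;
		err = 0;
	}
	tw_unlock(ctx);
	return err;
}

/* Task work starts running: returns false if it was freed meanwhile. */
bool tw_start(struct tw_ctx *ctx)
{
	bool ok;

	tw_lock(ctx);
	ok = ctx->state == TW_SCHEDULED;
	if (ok)
		ctx->state = TW_RUNNING;
	tw_unlock(ctx);
	return ok;
}

/* Map value deleted: returns true if the caller may free the context now. */
bool tw_cancel_and_free(struct tw_ctx *ctx)
{
	bool can_free;

	tw_lock(ctx);
	can_free = ctx->state != TW_RUNNING;
	ctx->state = TW_FREED;	/* flag deletion to the running side */
	tw_unlock(ctx);
	return can_free;
}

/* Callback finished: returns true if it has to free the context itself. */
bool tw_finish(struct tw_ctx *ctx)
{
	bool must_free;

	tw_lock(ctx);
	must_free = ctx->state == TW_FREED;
	if (!must_free)
		ctx->state = TW_STANDBY;
	tw_unlock(ctx);
	return must_free;
}
```

The states do not go away, but every transition becomes a plain read and
write under the lock instead of a cmpxchg with its own failure handling.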
>
>>   kernel/bpf/helpers.c | 270 +++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 260 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index d2f88a9bc47b..346ae8fd3ada 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -25,6 +25,8 @@
>>   #include <linux/kasan.h>
>>   #include <linux/bpf_verifier.h>
>>   #include <linux/uaccess.h>
>> +#include <linux/task_work.h>
>> +#include <linux/irq_work.h>
>>
>>   #include "../../lib/kstrtox.h"
>>
>> @@ -3701,6 +3703,226 @@ __bpf_kfunc int bpf_strstr(const char *s1__ign, const char *s2__ign)
>>
>>   typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *value);
>>
>> +enum bpf_task_work_state {
>> +       /* bpf_task_work is ready to be used */
>> +       BPF_TW_STANDBY = 0,
>> +       /* irq work scheduling in progress */
>> +       BPF_TW_PENDING,
>> +       /* task work scheduling in progress */
>> +       BPF_TW_SCHEDULING,
>> +       /* task work is scheduled successfully */
>> +       BPF_TW_SCHEDULED,
>> +       /* callback is running */
>> +       BPF_TW_RUNNING,
>> +       /* associated BPF map value is deleted */
>> +       BPF_TW_FREED,
>> +};
>> +
>> +struct bpf_task_work_context {
>> +       /* the map and map value associated with this context */
>> +       struct bpf_map *map;
>> +       void *map_val;
>> +       /* bpf_prog that schedules task work */
>> +       struct bpf_prog *prog;
>> +       /* task for which callback is scheduled */
>> +       struct task_struct *task;
>> +       enum task_work_notify_mode mode;
>> +       enum bpf_task_work_state state;
>> +       bpf_task_work_callback_t callback_fn;
>> +       struct callback_head work;
>> +       struct irq_work irq_work;
>> +       struct rcu_head rcu;
>> +} __aligned(8);
>> +
>> +static struct bpf_task_work_context *bpf_task_work_context_alloc(void)
>> +{
>> +       struct bpf_task_work_context *ctx;
>> +
>> +       ctx = bpf_mem_alloc(&bpf_global_ma, sizeof(struct bpf_task_work_context));
>> +       if (ctx)
>> +               memset(ctx, 0, sizeof(*ctx));
>> +       return ctx;
>> +}
>> +
>> +static void bpf_task_work_context_free(struct rcu_head *rcu)
>> +{
>> +       struct bpf_task_work_context *ctx = container_of(rcu, struct bpf_task_work_context, rcu);
>> +       /* bpf_mem_free expects migration to be disabled */
>> +       migrate_disable();
>> +       bpf_mem_free(&bpf_global_ma, ctx);
>> +       migrate_enable();
>> +}
>> +
>> +static bool task_work_match(struct callback_head *head, void *data)
>> +{
>> +       struct bpf_task_work_context *ctx = container_of(head, struct bpf_task_work_context, work);
>> +
>> +       return ctx == data;
>> +}
>> +
>> +static void bpf_task_work_context_reset(struct bpf_task_work_context *ctx)
>> +{
>> +       bpf_prog_put(ctx->prog);
>> +       bpf_task_release(ctx->task);
>> +}
>> +
>> +static void bpf_task_work_callback(struct callback_head *cb)
>> +{
>> +       enum bpf_task_work_state state;
>> +       struct bpf_task_work_context *ctx;
>> +       u32 idx;
>> +       void *key;
>> +
>> +       ctx = container_of(cb, struct bpf_task_work_context, work);
>> +
>> +       /*
>> +        * Read lock is needed to protect map key and value access below, it has to be done before
>> +        * the state transition
>> +        */
>> +       rcu_read_lock_trace();
>> +       /*
>> +        * This callback may start running before bpf_task_work_irq() switched the state to
>> +        * SCHEDULED so handle both transition variants SCHEDULING|SCHEDULED -> RUNNING.
>> +        */
>> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_RUNNING);
>> +       if (state == BPF_TW_SCHEDULED)
> ... and let's say we have concurrent cancel_and_free here, we mark
> state BPF_TW_FREED.
>
>> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULED, BPF_TW_RUNNING);
>> +       if (state == BPF_TW_FREED) {
> ... and notice it here now ...
>
>> +               rcu_read_unlock_trace();
>> +               bpf_task_work_context_reset(ctx);
>> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
> ... then I presume this is ok, because the cancel of tw in
> cancel_and_free will fail?
Yes, if cancellation succeeds, the callback will not be called.
If it fails, cancel_and_free does nothing except change the state,
and the callback does the cleanup.
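
To spell out the interlock, here is a small userspace model of the two
paths. It is a sketch under assumptions, not the patch itself: the tw_*
names, the queued flag (standing in for the entry on the task's task_work
list) and the frees counter are hypothetical, and tw_cancel() only imitates
a cancel that succeeds while the work is still queued.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum tw_state {
	TW_STANDBY, TW_PENDING, TW_SCHEDULING, TW_SCHEDULED, TW_RUNNING, TW_FREED,
};

struct tw_ctx {
	_Atomic int state;
	bool queued;	/* models the entry on the task's task_work list */
	int frees;	/* how many times the context was released */
};

/* Like kernel cmpxchg(): returns the value seen before the attempt. */
int tw_cmpxchg(struct tw_ctx *ctx, int old, int new)
{
	atomic_compare_exchange_strong(&ctx->state, &old, new);
	return old;
}

/* Imitates a cancel: succeeds only while the work is still queued. */
bool tw_cancel(struct tw_ctx *ctx)
{
	bool was_queued = ctx->queued;

	ctx->queued = false;
	return was_queued;
}

/* The task has taken the work off its list and is about to run it. */
void tw_dequeue(struct tw_ctx *ctx)
{
	ctx->queued = false;
}

/* Free path: always flag FREED, free only if nobody else will. */
void tw_cancel_and_free(struct tw_ctx *ctx)
{
	int state = atomic_exchange(&ctx->state, TW_FREED);

	if (state == TW_STANDBY || (state == TW_SCHEDULED && tw_cancel(ctx)))
		ctx->frees++;
	/* otherwise the scheduling side sees FREED and cleans up */
}

/* Callback path: frees the context if the free path got there first. */
void tw_callback(struct tw_ctx *ctx)
{
	int state = tw_cmpxchg(ctx, TW_SCHEDULING, TW_RUNNING);

	if (state == TW_SCHEDULED)
		state = tw_cmpxchg(ctx, TW_SCHEDULED, TW_RUNNING);
	if (state == TW_FREED) {
		ctx->frees++;
		return;
	}
	/* the BPF callback would run here */
	if (tw_cmpxchg(ctx, TW_RUNNING, TW_STANDBY) == TW_FREED)
		ctx->frees++;
}
```

Whichever side loses the race on the state leaves the cleanup to the other
one, so in both orderings the context is released exactly once.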
> Maybe add a comment here that it's interlocked with the free path.
>
>> +               return;
>> +       }
>> +
>> +       key = (void *)map_key_from_value(ctx->map, ctx->map_val, &idx);
>> +       migrate_disable();
>> +       ctx->callback_fn(ctx->map, key, ctx->map_val);
>> +       migrate_enable();
>> +       rcu_read_unlock_trace();
>> +       /* State is running or freed, either way reset. */
>> +       bpf_task_work_context_reset(ctx);
>> +       state = cmpxchg(&ctx->state, BPF_TW_RUNNING, BPF_TW_STANDBY);
>> +       if (state == BPF_TW_FREED)
>> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +}
>> +
>> +static void bpf_task_work_irq(struct irq_work *irq_work)
>> +{
>> +       struct bpf_task_work_context *ctx;
>> +       enum bpf_task_work_state state;
>> +       int err;
>> +
>> +       ctx = container_of(irq_work, struct bpf_task_work_context, irq_work);
>> +
>> +       state = cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING);
>> +       if (state == BPF_TW_FREED)
>> +               goto free_context;
>> +
>> +       err = task_work_add(ctx->task, &ctx->work, ctx->mode);
>> +       if (err) {
>> +               bpf_task_work_context_reset(ctx);
>> +               state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
>> +               if (state == BPF_TW_FREED)
>> +                       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +               return;
>> +       }
>> +       state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
>> +       if (state == BPF_TW_FREED && task_work_cancel_match(ctx->task, task_work_match, ctx))
>> +               goto free_context; /* successful cancellation, release and free ctx */
>> +       return;
>> +
>> +free_context:
>> +       bpf_task_work_context_reset(ctx);
>> +       call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +}
>> +
>> +static struct bpf_task_work_context *bpf_task_work_context_acquire(struct bpf_task_work *tw,
>> +                                                                  struct bpf_map *map)
>> +{
>> +       struct bpf_task_work_context *ctx, *old_ctx;
>> +       enum bpf_task_work_state state;
>> +       struct bpf_task_work_context __force __rcu **ppc =
>> +               (struct bpf_task_work_context __force __rcu **)&tw->ctx;
>> +
>> +       /* ctx pointer is RCU protected */
>> +       rcu_read_lock_trace();
>> +       ctx = rcu_dereference(*ppc);
>> +       if (!ctx) {
>> +               ctx = bpf_task_work_context_alloc();
>> +               if (!ctx) {
>> +                       rcu_read_unlock_trace();
>> +                       return ERR_PTR(-ENOMEM);
>> +               }
>> +               old_ctx = unrcu_pointer(cmpxchg(ppc, NULL, RCU_INITIALIZER(ctx)));
>> +               /*
>> +                * If ctx is set by another CPU, release allocated memory.
>> +                * Do not fail, though, attempt stealing the work
>> +                */
>> +               if (old_ctx) {
>> +                       bpf_mem_free(&bpf_global_ma, ctx);
>> +                       ctx = old_ctx;
>> +               }
>> +       }
>> +       state = cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING);
>> +       /*
>> +        * We can unlock RCU, because task work scheduler (this codepath)
>> +        * now owns the ctx or returning an error
>> +        */
>> +       rcu_read_unlock_trace();
>> +       if (state != BPF_TW_STANDBY)
>> +               return ERR_PTR(-EBUSY);
>> +       return ctx;
>> +}
>> +
>> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
>> +                                 struct bpf_map *map, bpf_task_work_callback_t callback_fn,
>> +                                 struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
>> +{
>> +       struct bpf_prog *prog;
>> +       struct bpf_task_work_context *ctx = NULL;
>> +       int err;
>> +
>> +       BTF_TYPE_EMIT(struct bpf_task_work);
>> +
>> +       prog = bpf_prog_inc_not_zero(aux->prog);
>> +       if (IS_ERR(prog))
>> +               return -EBADF;
>> +
>> +       if (!atomic64_read(&map->usercnt)) {
>> +               err = -EBADF;
>> +               goto release_prog;
>> +       }
> Please add a comment on why lack of ordering between load of usercnt
> and load of tw->ctx is safe, in presence of a parallel usercnt
> dec_and_test and ctx xchg.
> See __bpf_async_init for similar race.
I think I see what you mean, let me double check this.
>
>> +       task = bpf_task_acquire(task);
>> +       if (!task) {
>> +               err = -EPERM;
>> +               goto release_prog;
>> +       }
>> +       ctx = bpf_task_work_context_acquire(tw, map);
>> +       if (IS_ERR(ctx)) {
>> +               err = PTR_ERR(ctx);
>> +               goto release_all;
>> +       }
>> +
>> +       ctx->task = task;
>> +       ctx->callback_fn = callback_fn;
>> +       ctx->prog = prog;
>> +       ctx->mode = mode;
>> +       ctx->map = map;
>> +       ctx->map_val = (void *)tw - map->record->task_work_off;
>> +       init_irq_work(&ctx->irq_work, bpf_task_work_irq);
>> +       init_task_work(&ctx->work, bpf_task_work_callback);
>> +
>> +       irq_work_queue(&ctx->irq_work);
>> +
>> +       return 0;
>> +
>> +release_all:
>> +       bpf_task_release(task);
>> +release_prog:
>> +       bpf_prog_put(prog);
>> +       return err;
>> +}
>> +
>>   /**
>>    * bpf_task_work_schedule_signal - Schedule BPF callback using task_work_add with TWA_SIGNAL mode
>>    * @task: Task struct for which callback should be scheduled
>> @@ -3711,13 +3933,11 @@ typedef void (*bpf_task_work_callback_t)(struct bpf_map *map, void *key, void *v
>>    *
>>    * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>>    */
>> -__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
>> -                                             struct bpf_task_work *tw,
>> +__bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task, struct bpf_task_work *tw,
>>                                                struct bpf_map *map__map,
>> -                                             bpf_task_work_callback_t callback,
>> -                                             void *aux__prog)
>> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>>   {
>> -       return 0;
>> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, TWA_SIGNAL);
>>   }
>>
>>   /**
>> @@ -3731,19 +3951,47 @@ __bpf_kfunc int bpf_task_work_schedule_signal(struct task_struct *task,
>>    *
>>    * Return: 0 if task work has been scheduled successfully, negative error code otherwise
>>    */
>> -__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task,
>> -                                             struct bpf_task_work *tw,
>> +__bpf_kfunc int bpf_task_work_schedule_resume(struct task_struct *task, struct bpf_task_work *tw,
>>                                                struct bpf_map *map__map,
>> -                                             bpf_task_work_callback_t callback,
>> -                                             void *aux__prog)
>> +                                             bpf_task_work_callback_t callback, void *aux__prog)
>>   {
>> -       return 0;
>> +       enum task_work_notify_mode mode;
>> +
>> +       mode = task == current && in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME;
>> +       return bpf_task_work_schedule(task, tw, map__map, callback, aux__prog, mode);
>>   }
>>
>>   __bpf_kfunc_end_defs();
>>
>>   void bpf_task_work_cancel_and_free(void *val)
>>   {
>> +       struct bpf_task_work *tw = val;
>> +       struct bpf_task_work_context *ctx;
>> +       enum bpf_task_work_state state;
>> +
>> +       /* No need do rcu_read_lock as no other codepath can reset this pointer */
>> +       ctx = unrcu_pointer(xchg((struct bpf_task_work_context __force __rcu **)&tw->ctx, NULL));
>> +       if (!ctx)
>> +               return;
>> +       state = xchg(&ctx->state, BPF_TW_FREED);
>> +
>> +       switch (state) {
>> +       case BPF_TW_SCHEDULED:
>> +               /* If we can't cancel task work, rely on task work callback to free the context */
>> +               if (!task_work_cancel_match(ctx->task, task_work_match, ctx))
> This part looks broken to me.
> You are calling this path
> (update->obj_free_fields->cancel_and_free->cancel_and_match) in
> possibly NMI context.
> Which means we can deadlock if we hit the NMI context prog in the
> middle of task->pi_lock critical section.
> That's taken in task_work functions
> The task_work_cancel_match takes the pi_lock.
Good point, thanks. I think this could be solved in 2 ways:
  * Don't cancel, rely on callback dropping the work
  * Cancel in another irq_work
I'll probably go with the second one.
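
A rough userspace model of the second option: the free path only pushes a
request onto a lock-free list, and the cancel itself happens later from a
context where taking pi_lock is fine. The cancel_* names are hypothetical,
and setting the cancelled flag is a placeholder for where the real code
would call task_work_cancel_match().

```c
#include <stdatomic.h>
#include <stddef.h>

struct cancel_req {
	struct cancel_req *next;
	int cancelled;	/* placeholder for the effect of the real cancel */
};

struct cancel_list {
	_Atomic(struct cancel_req *) head;
};

void cancel_list_init(struct cancel_list *l)
{
	atomic_init(&l->head, (struct cancel_req *)NULL);
}

/* Usable where locks are off limits: a CAS loop, no lock is taken. */
void cancel_defer(struct cancel_list *l, struct cancel_req *req)
{
	struct cancel_req *old = atomic_load(&l->head);

	do {
		req->next = old;
	} while (!atomic_compare_exchange_weak(&l->head, &old, req));
}

/*
 * Runs later, in a context where locking is allowed. Detaches the whole
 * list at once and handles each request; returns how many were handled.
 */
int cancel_drain(struct cancel_list *l)
{
	struct cancel_req *req = atomic_exchange(&l->head, (struct cancel_req *)NULL);
	int n = 0;

	while (req) {
		struct cancel_req *next = req->next;

		req->cancelled = 1;	/* the real cancel would happen here */
		n++;
		req = next;
	}
	return n;
}
```

The deferring side never blocks, so it should be safe from any context; the
cost is that the cancel happens some time after the map value is gone.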
>
>> +                       break;
>> +               bpf_task_work_context_reset(ctx);
>> +               fallthrough;
>> +       case BPF_TW_STANDBY:
>> +               call_rcu_tasks_trace(&ctx->rcu, bpf_task_work_context_free);
>> +               break;
>> +       /* In all below cases scheduling logic should detect context state change and cleanup */
>> +       case BPF_TW_SCHEDULING:
>> +       case BPF_TW_PENDING:
>> +       case BPF_TW_RUNNING:
>> +       default:
>> +               break;
>> +       }
>>   }
>>
>>   BTF_KFUNCS_START(generic_btf_ids)
>> @@ -3769,6 +4017,8 @@ BTF_ID_FLAGS(func, bpf_rbtree_first, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_root, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_left, KF_RET_NULL)
>>   BTF_ID_FLAGS(func, bpf_rbtree_right, KF_RET_NULL)
>> +BTF_ID_FLAGS(func, bpf_task_work_schedule_signal, KF_TRUSTED_ARGS)
>> +BTF_ID_FLAGS(func, bpf_task_work_schedule_resume, KF_TRUSTED_ARGS)
>>
>>   #ifdef CONFIG_CGROUPS
>>   BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
>> --
>> 2.50.1
>>
