Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Eduard Zingerman <eddyz87@gmail.com>
To: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>,
	bpf@vger.kernel.org,  ast@kernel.org, andrii@kernel.org,
	daniel@iogearbox.net, kafai@meta.com,  kernel-team@meta.com,
	memxor@gmail.com
Cc: Mykyta Yatsenko <yatsenko@meta.com>
Subject: Re: [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs
Date: Sat, 06 Sep 2025 13:22:47 -0700	[thread overview]
Message-ID: <aa9dcf55f1ed26c140f83fdde8312304efb80099.camel@gmail.com> (raw)
In-Reply-To: <20250905164508.1489482-7-mykyta.yatsenko5@gmail.com>

On Fri, 2025-09-05 at 17:45 +0100, Mykyta Yatsenko wrote:

[...]

> A small state machine and refcounting scheme ensures safe reuse and
> teardown:
>  STANDBY -> PENDING -> SCHEDULING -> SCHEDULED -> RUNNING -> STANDBY

Nit: state machine is actually a bit more complex:

  digraph G {
    scheduling  -> running    [label="callback 1"];
    scheduled   -> running    [label="callback 2"];
    running     -> standby    [label="callback 3"];
    pending     -> scheduling [label="irq 1"];
    scheduling  -> standby    [label="irq 2"];
    scheduling  -> scheduled  [label="irq 3"];
    standby     -> pending    [label="acquire_ctx"];
  
    freed      -> freed [label="cancel_and_free"];
    pending    -> freed [label="cancel_and_free"];
    running    -> freed [label="cancel_and_free"];
    scheduled  -> freed [label="cancel_and_free"];
    scheduling -> freed [label="cancel_and_free"];
    standby    -> freed [label="cancel_and_free"];
  }

[...]

> Flow of successful task work scheduling
>  1) bpf_task_work_schedule_* is called from BPF code.
>  2) Transition state from STANDBY to PENDING, marks context is owned by
>  this task work scheduler
>  3) irq_work_queue() schedules bpf_task_work_irq().
>  4) Transition state from PENDING to SCHEDULING.
>  4) bpf_task_work_irq() attempts task_work_add(). If successful, state
>  transitions to SCHEDULED.

Nit: "4" repeated two times.

>  5) Task work calls bpf_task_work_callback(), which transition state to
>  RUNNING.
>  6) BPF callback is executed
>  7) Context is cleaned up, refcounts released, context state set back to
>  STANDBY.
> 
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
>  kernel/bpf/helpers.c | 319 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 317 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 109cb249e88c..418a0a211699 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c

[...]

> +static void bpf_task_work_cancel(struct bpf_task_work_ctx *ctx)
> +{
> +	/*
> +	 * Scheduled task_work callback holds ctx ref, so if we successfully
> +	 * cancelled, we put that ref on callback's behalf. If we couldn't
> +	 * cancel, callback is inevitably run or has already completed
> +	 * running, and it would have taken care of its ctx ref itself.
> +	 */
> +	if (task_work_cancel_match(ctx->task, task_work_match, ctx))

Will `task_work_cancel(ctx->task, ctx->work)` do the same thing here?

> +		bpf_task_work_ctx_put(ctx);
> +}

[...]

> +static void bpf_task_work_irq(struct irq_work *irq_work)
> +{
> +	struct bpf_task_work_ctx *ctx = container_of(irq_work, struct bpf_task_work_ctx, irq_work);
> +	enum bpf_task_work_state state;
> +	int err;
> +
> +	guard(rcu_tasks_trace)();
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_PENDING, BPF_TW_SCHEDULING) != BPF_TW_PENDING) {
> +		bpf_task_work_ctx_put(ctx);
> +		return;
> +	}

Why are separate PENDING and SCHEDULING states needed?
Both indicate that the task had not been yet but is ready to be
submitted to task_work_add(). So, on a first glance it seems that
merging the two won't change the behaviour, what do I miss?

> +	err = task_work_add(ctx->task, &ctx->work, ctx->mode);
> +	if (err) {
> +		bpf_task_work_ctx_reset(ctx);
> +		/*
> +		 * try to switch back to STANDBY for another task_work reuse, but we might have
> +		 * gone to FREED already, which is fine as we already cleaned up after ourselves
> +		 */
> +		(void)cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_STANDBY);
> +
> +		/* we don't have RCU protection, so put after switching state */
> +		bpf_task_work_ctx_put(ctx);
> +	}
> +
> +	/*
> +	 * It's technically possible for just scheduled task_work callback to
> +	 * complete running by now, going SCHEDULING -> RUNNING and then
> +	 * dropping its ctx refcount. Instead of capturing extra ref just to
> +	 * protected below ctx->state access, we rely on RCU protection to
> +	 * perform below SCHEDULING -> SCHEDULED attempt.
> +	 */
> +	state = cmpxchg(&ctx->state, BPF_TW_SCHEDULING, BPF_TW_SCHEDULED);
> +	if (state == BPF_TW_FREED)
> +		bpf_task_work_cancel(ctx); /* clean up if we switched into FREED state */
> +}

[...]

> +static struct bpf_task_work_ctx *bpf_task_work_acquire_ctx(struct bpf_task_work *tw,
> +							   struct bpf_map *map)
> +{
> +	struct bpf_task_work_ctx *ctx;
> +
> +	/* early check to avoid any work, we'll double check at the end again */
> +	if (!atomic64_read(&map->usercnt))
> +		return ERR_PTR(-EBUSY);
> +
> +	ctx = bpf_task_work_fetch_ctx(tw, map);
> +	if (IS_ERR(ctx))
> +		return ctx;
> +
> +	/* try to get ref for task_work callback to hold */
> +	if (!bpf_task_work_ctx_tryget(ctx))
> +		return ERR_PTR(-EBUSY);
> +
> +	if (cmpxchg(&ctx->state, BPF_TW_STANDBY, BPF_TW_PENDING) != BPF_TW_STANDBY) {
> +		/* lost acquiring race or map_release_uref() stole it from us, put ref and bail */
> +		bpf_task_work_ctx_put(ctx);
> +		return ERR_PTR(-EBUSY);
> +	}
> +
> +	/*
> +	 * Double check that map->usercnt wasn't dropped while we were
> +	 * preparing context, and if it was, we need to clean up as if
> +	 * map_release_uref() was called; bpf_task_work_cancel_and_free()
> +	 * is safe to be called twice on the same task work
> +	 */
> +	if (!atomic64_read(&map->usercnt)) {
> +		/* drop ref we just got for task_work callback itself */
> +		bpf_task_work_ctx_put(ctx);
> +		/* transfer map's ref into cancel_and_free() */
> +		bpf_task_work_cancel_and_free(tw);
> +		return ERR_PTR(-EBUSY);
> +	}

I don't understand how the above check is useful.
Is map->usercnt protected from being changed during execution of
bpf_task_work_schedule()?
There are two such checks in this function, so apparently it is not.
Then what's the point of checking usercnt value if it can be
immediately changed after the check?

> +
> +	return ctx;
> +}
> +
> +static int bpf_task_work_schedule(struct task_struct *task, struct bpf_task_work *tw,
> +				  struct bpf_map *map, bpf_task_work_callback_t callback_fn,
> +				  struct bpf_prog_aux *aux, enum task_work_notify_mode mode)
> +{
> +	struct bpf_prog *prog;
> +	struct bpf_task_work_ctx *ctx;
> +	int err;
> +
> +	BTF_TYPE_EMIT(struct bpf_task_work);
> +
> +	prog = bpf_prog_inc_not_zero(aux->prog);
> +	if (IS_ERR(prog))
> +		return -EBADF;
> +	task = bpf_task_acquire(task);
> +	if (!task) {
> +		err = -EPERM;

Nit: Why -EPERM? bpf_task_acquire() returns NULL if task->rcu_users
     is zero, does not seem to be permission related.

> +		goto release_prog;
> +	}
> +
> +	ctx = bpf_task_work_acquire_ctx(tw, map);
> +	if (IS_ERR(ctx)) {
> +		err = PTR_ERR(ctx);
> +		goto release_all;
> +	}
> +
> +	ctx->task = task;
> +	ctx->callback_fn = callback_fn;
> +	ctx->prog = prog;
> +	ctx->mode = mode;
> +	ctx->map = map;
> +	ctx->map_val = (void *)tw - map->record->task_work_off;
> +	init_task_work(&ctx->work, bpf_task_work_callback);
> +	init_irq_work(&ctx->irq_work, bpf_task_work_irq);
> +
> +	irq_work_queue(&ctx->irq_work);
> +	return 0;
> +
> +release_all:
> +	bpf_task_release(task);
> +release_prog:
> +	bpf_prog_put(prog);
> +	return err;
> +}
> +

[...]

next prev parent reply	other threads:[~2025-09-06 20:22 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-05 16:44 [PATCH bpf-next v3 0/7] bpf: Introduce deferred task context execution Mykyta Yatsenko
2025-09-05 16:44 ` [PATCH bpf-next v3 1/7] bpf: refactor special field-type detection Mykyta Yatsenko
2025-09-05 19:36   ` Eduard Zingerman
2025-09-05 21:29   ` Andrii Nakryiko
2025-09-05 16:45 ` [PATCH bpf-next v3 2/7] bpf: extract generic helper from process_timer_func() Mykyta Yatsenko
2025-09-05 21:15   ` Eduard Zingerman
2025-09-05 21:28   ` Eduard Zingerman
2025-09-05 21:31     ` Andrii Nakryiko
2025-09-05 21:32       ` Eduard Zingerman
2025-09-05 21:29   ` Andrii Nakryiko
2025-09-05 16:45 ` [PATCH bpf-next v3 3/7] bpf: htab: extract helper for freeing special structs Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-05 21:31   ` Eduard Zingerman
2025-09-05 16:45 ` [PATCH bpf-next v3 4/7] bpf: bpf task work plumbing Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-05 23:09   ` Eduard Zingerman
2025-09-15 15:59     ` Mykyta Yatsenko
2025-09-15 20:12       ` Andrii Nakryiko
2025-09-15 20:20         ` Mykyta Yatsenko
2025-09-15 20:28           ` Andrii Nakryiko
2025-09-05 16:45 ` [PATCH bpf-next v3 5/7] bpf: extract map key pointer calculation Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-05 23:19   ` Eduard Zingerman
2025-09-08 13:39     ` Mykyta Yatsenko
2025-09-08 17:18       ` Eduard Zingerman
2025-09-05 16:45 ` [PATCH bpf-next v3 6/7] bpf: task work scheduling kfuncs Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-06 20:22   ` Eduard Zingerman [this message]
2025-09-08 13:13     ` Mykyta Yatsenko
2025-09-08 17:38       ` Eduard Zingerman
2025-09-09  3:42         ` Andrii Nakryiko
2025-09-09  4:15           ` Eduard Zingerman
2025-09-09  3:33       ` Andrii Nakryiko
2025-09-09  4:05         ` Eduard Zingerman
2025-09-10 14:14           ` Andrii Nakryiko
2025-09-09 17:49   ` Chris Mason
2025-09-09 18:59     ` Mykyta Yatsenko
2025-09-05 16:45 ` [PATCH bpf-next v3 7/7] selftests/bpf: BPF task work scheduling tests Mykyta Yatsenko
2025-09-05 21:31   ` Andrii Nakryiko
2025-09-08  7:43   ` Eduard Zingerman
2025-09-08 13:21     ` Mykyta Yatsenko
2025-09-08 18:23       ` Eduard Zingerman
2025-09-09  3:44         ` Andrii Nakryiko
2025-09-08 18:23   ` Eduard Zingerman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aa9dcf55f1ed26c140f83fdde8312304efb80099.camel@gmail.com \
    --to=eddyz87@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kafai@meta.com \
    --cc=kernel-team@meta.com \
    --cc=memxor@gmail.com \
    --cc=mykyta.yatsenko5@gmail.com \
    --cc=yatsenko@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.