From: Andrea Righi <arighi@nvidia.com>
To: Emil Tsalapatis <etsal@meta.com>
Cc: tj@kernel.org, void@manifault.com, changwoo@igalia.com,
sched-ext@lists.linux.dev, Jakub Kicinski <kuba@kernel.org>,
"Emil Tsalapatis (Meta)" <emil@etsalapatis.com>
Subject: Re: [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch
Date: Thu, 9 Oct 2025 21:36:38 +0200 [thread overview]
Message-ID: <aOgOxtHCeyRT_7jn@gpd4> (raw)
In-Reply-To: <20251009173620.1882642-1-etsal@meta.com>
Hi Emil,
On Thu, Oct 09, 2025 at 10:36:20AM -0700, Emil Tsalapatis wrote:
> The sched_ext code calls queue_balance_callback() during .balance(),
I think we call queue_balance_callback() from .enqueue_task(), right?
> to defer operations that drop multiple locks until we can unpin them.
> The is call assumes that the rq lock is held until the callbacks are
s/is//
> invoked, and the pending callbacks will not be visible to any other
> threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().
>
> However, balance_one() may actually drop the lock during a BPF dispatch
> call. Another thread may win the race to get the rq lock and see the
> pending callback. To avoid this, sched_ext must only queue the callback
> after the dispatch calls have completed.
>
> CPU 0 CPU 1 CPU 2
>
> scx_balance()
> rq_unpin_lock()
> scx_balance_one()
> |= IN_BALANCE scx_enqueue()
> ops.dispatch()
> rq_unlock()
> rq_lock()
> queue_balance_callback()
> rq_unlock()
> [WARN] rq_pin_lock()
> rq_lock()
> &= ~IN_BALANCE
> rq_repin_lock()
>
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 28 ++++++++++++++++++++++++++--
> kernel/sched/sched.h | 1 +
> 2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5f957cff5d17..35dfce06297a 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -780,13 +780,23 @@ static void schedule_deferred(struct rq *rq)
> if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
> return;
>
> + /* Don't do anything if there already is a deferred operation. */
> + if (rq->scx.flags & SCX_RQ_BAL_PENDING)
> + return;
> +
> /*
> * If in balance, the balance callbacks will be called before rq lock is
> * released. Schedule one.
> + *
> + *
> + * We can't directly insert the callback into the
> + * rq's list: The call can drop its lock and make the pending balance
> + * callback visible to unrelated code paths that call rq_pin_lock().
> + *
> + * Just let balance_one() know that it must do it itself.
> */
> if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
> - queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> - deferred_bal_cb_workfn);
> + rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
> return;
> }
>
> @@ -2003,6 +2013,18 @@ static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
> dspc->cursor = 0;
> }
>
> +static inline void maybe_queue_balance_callback(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> +
> + if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
> + queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> + deferred_bal_cb_workfn);
> + }
> +
> + rq->scx.flags &= SCX_RQ_BAL_CB_PENDING;
Hm... this looks wrong. I think you want to clear SCX_RQ_BAL_CB_PENDING, so
it should be:
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
And while at it, just to better reflect the logic:
if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
deferred_bal_cb_workfn);
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
}
> +}
> +
> static int balance_one(struct rq *rq, struct task_struct *prev)
> {
> struct scx_sched *sch = scx_root;
> @@ -2150,6 +2172,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
> #endif
> rq_repin_lock(rq, rf);
>
> + maybe_queue_balance_callback(rq);
> +
> return ret;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..8f3935785fc5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -757,6 +757,7 @@ enum scx_rq_flags {
> SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
> SCX_RQ_BYPASSING = 1 << 4,
> SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
> + SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */
>
> SCX_RQ_IN_WAKEUP = 1 << 16,
> SCX_RQ_IN_BALANCE = 1 << 17,
> --
> 2.47.3
>
Thanks,
-Andrea
next prev parent reply other threads:[~2025-10-09 19:36 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-09 17:36 [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch Emil Tsalapatis
2025-10-09 19:36 ` Andrea Righi [this message]
2025-10-09 21:53 ` Emil Tsalapatis
2025-10-09 21:57 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aOgOxtHCeyRT_7jn@gpd4 \
--to=arighi@nvidia.com \
--cc=changwoo@igalia.com \
--cc=emil@etsalapatis.com \
--cc=etsal@meta.com \
--cc=kuba@kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.