* [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch
@ 2025-10-09 17:36 Emil Tsalapatis
2025-10-09 19:36 ` Andrea Righi
0 siblings, 1 reply; 4+ messages in thread
From: Emil Tsalapatis @ 2025-10-09 17:36 UTC (permalink / raw)
To: tj
Cc: void, arighi, changwoo, sched-ext, Emil Tsalapatis,
Jakub Kicinski, Emil Tsalapatis (Meta)
The sched_ext code calls queue_balance_callback() during .balance(),
to defer operations that drop multiple locks until we can unpin them.
The is call assumes that the rq lock is held until the callbacks are
invoked, and the pending callbacks will not be visible to any other
threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().
However, balance_one() may actually drop the lock during a BPF dispatch
call. Another thread may win the race to get the rq lock and see the
pending callback. To avoid this, sched_ext must only queue the callback
after the dispatch calls have completed.
CPU 0 CPU 1 CPU 2
scx_balance()
rq_unpin_lock()
scx_balance_one()
|= IN_BALANCE scx_enqueue()
ops.dispatch()
rq_unlock()
rq_lock()
queue_balance_callback()
rq_unlock()
[WARN] rq_pin_lock()
rq_lock()
&= ~IN_BALANCE
rq_repin_lock()
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
---
kernel/sched/ext.c | 28 ++++++++++++++++++++++++++--
kernel/sched/sched.h | 1 +
2 files changed, 27 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5f957cff5d17..35dfce06297a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -780,13 +780,23 @@ static void schedule_deferred(struct rq *rq)
if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
return;
+ /* Don't do anything if there already is a deferred operation. */
+ if (rq->scx.flags & SCX_RQ_BAL_PENDING)
+ return;
+
/*
* If in balance, the balance callbacks will be called before rq lock is
* released. Schedule one.
+ *
+ *
+ * We can't directly insert the callback into the
+ * rq's list: The call can drop its lock and make the pending balance
+ * callback visible to unrelated code paths that call rq_pin_lock().
+ *
+ * Just let balance_one() know that it must do it itself.
*/
if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
- queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
- deferred_bal_cb_workfn);
+ rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
return;
}
@@ -2003,6 +2013,18 @@ static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
dspc->cursor = 0;
}
+static inline void maybe_queue_balance_callback(struct rq *rq)
+{
+ lockdep_assert_rq_held(rq);
+
+ if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
+ queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
+ deferred_bal_cb_workfn);
+ }
+
+ rq->scx.flags &= SCX_RQ_BAL_CB_PENDING;
+}
+
static int balance_one(struct rq *rq, struct task_struct *prev)
{
struct scx_sched *sch = scx_root;
@@ -2150,6 +2172,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
#endif
rq_repin_lock(rq, rf);
+ maybe_queue_balance_callback(rq);
+
return ret;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..8f3935785fc5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -757,6 +757,7 @@ enum scx_rq_flags {
SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
SCX_RQ_BYPASSING = 1 << 4,
SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
+ SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */
SCX_RQ_IN_WAKEUP = 1 << 16,
SCX_RQ_IN_BALANCE = 1 << 17,
--
2.47.3
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch
2025-10-09 17:36 [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch Emil Tsalapatis
@ 2025-10-09 19:36 ` Andrea Righi
2025-10-09 21:53 ` Emil Tsalapatis
0 siblings, 1 reply; 4+ messages in thread
From: Andrea Righi @ 2025-10-09 19:36 UTC (permalink / raw)
To: Emil Tsalapatis
Cc: tj, void, changwoo, sched-ext, Jakub Kicinski,
Emil Tsalapatis (Meta)
Hi Emil,
On Thu, Oct 09, 2025 at 10:36:20AM -0700, Emil Tsalapatis wrote:
> The sched_ext code calls queue_balance_callback() during .balance(),
I think we call queue_balance_callback() from .enqueue_task(), right?
> to defer operations that drop multiple locks until we can unpin them.
> The is call assumes that the rq lock is held until the callbacks are
s/is//
> invoked, and the pending callbacks will not be visible to any other
> threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().
>
> However, balance_one() may actually drop the lock during a BPF dispatch
> call. Another thread may win the race to get the rq lock and see the
> pending callback. To avoid this, sched_ext must only queue the callback
> after the dispatch calls have completed.
>
> CPU 0 CPU 1 CPU 2
>
> scx_balance()
> rq_unpin_lock()
> scx_balance_one()
> |= IN_BALANCE scx_enqueue()
> ops.dispatch()
> rq_unlock()
> rq_lock()
> queue_balance_callback()
> rq_unlock()
> [WARN] rq_pin_lock()
> rq_lock()
> &= ~IN_BALANCE
> rq_repin_lock()
>
> Reported-by: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
> ---
> kernel/sched/ext.c | 28 ++++++++++++++++++++++++++--
> kernel/sched/sched.h | 1 +
> 2 files changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5f957cff5d17..35dfce06297a 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -780,13 +780,23 @@ static void schedule_deferred(struct rq *rq)
> if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
> return;
>
> + /* Don't do anything if there already is a deferred operation. */
> + if (rq->scx.flags & SCX_RQ_BAL_PENDING)
> + return;
> +
> /*
> * If in balance, the balance callbacks will be called before rq lock is
> * released. Schedule one.
> + *
> + *
> + * We can't directly insert the callback into the
> + * rq's list: The call can drop its lock and make the pending balance
> + * callback visible to unrelated code paths that call rq_pin_lock().
> + *
> + * Just let balance_one() know that it must do it itself.
> */
> if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
> - queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> - deferred_bal_cb_workfn);
> + rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
> return;
> }
>
> @@ -2003,6 +2013,18 @@ static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
> dspc->cursor = 0;
> }
>
> +static inline void maybe_queue_balance_callback(struct rq *rq)
> +{
> + lockdep_assert_rq_held(rq);
> +
> + if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
> + queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> + deferred_bal_cb_workfn);
> + }
> +
> + rq->scx.flags &= SCX_RQ_BAL_CB_PENDING;
Hm... this looks wrong. I think you want to clear SCX_RQ_BAL_CB_PENDING, so
it should be:
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
And while at it, just to better reflect the logic:
if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
deferred_bal_cb_workfn);
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
}
> +}
> +
> static int balance_one(struct rq *rq, struct task_struct *prev)
> {
> struct scx_sched *sch = scx_root;
> @@ -2150,6 +2172,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
> #endif
> rq_repin_lock(rq, rf);
>
> + maybe_queue_balance_callback(rq);
> +
> return ret;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..8f3935785fc5 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -757,6 +757,7 @@ enum scx_rq_flags {
> SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
> SCX_RQ_BYPASSING = 1 << 4,
> SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
> + SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */
>
> SCX_RQ_IN_WAKEUP = 1 << 16,
> SCX_RQ_IN_BALANCE = 1 << 17,
> --
> 2.47.3
>
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch
2025-10-09 19:36 ` Andrea Righi
@ 2025-10-09 21:53 ` Emil Tsalapatis
2025-10-09 21:57 ` Andrea Righi
0 siblings, 1 reply; 4+ messages in thread
From: Emil Tsalapatis @ 2025-10-09 21:53 UTC (permalink / raw)
To: Andrea Righi
Cc: Emil Tsalapatis, tj, void, changwoo, sched-ext, Jakub Kicinski
Hi Andrea,
Thank you for the feedback, answering inline:
On Thu, Oct 9, 2025 at 3:36 PM Andrea Righi <arighi@nvidia.com> wrote:
>
> Hi Emil,
>
> On Thu, Oct 09, 2025 at 10:36:20AM -0700, Emil Tsalapatis wrote:
> > The sched_ext code calls queue_balance_callback() during .balance(),
>
> I think we call queue_balance_callback() from .enqueue_task(), right?
>
Ack, the text messes up the description of the actual race that is shown in
the diagram below.
> > to defer operations that drop multiple locks until we can unpin them.
> > The is call assumes that the rq lock is held until the callbacks are
>
> s/is//
>
Ack, will fix.
> > invoked, and the pending callbacks will not be visible to any other
> > threads. This is enforced by a WARN_ON_ONCE() in rq_pin_lock().
> >
> > However, balance_one() may actually drop the lock during a BPF dispatch
> > call. Another thread may win the race to get the rq lock and see the
> > pending callback. To avoid this, sched_ext must only queue the callback
> > after the dispatch calls have completed.
> >
> > CPU 0 CPU 1 CPU 2
> >
> > scx_balance()
> > rq_unpin_lock()
> > scx_balance_one()
> > |= IN_BALANCE scx_enqueue()
> > ops.dispatch()
> > rq_unlock()
> > rq_lock()
> > queue_balance_callback()
> > rq_unlock()
> > [WARN] rq_pin_lock()
> > rq_lock()
> > &= ~IN_BALANCE
> > rq_repin_lock()
> >
> > Reported-by: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Emil Tsalapatis (Meta) <emil@etsalapatis.com>
> > ---
> > kernel/sched/ext.c | 28 ++++++++++++++++++++++++++--
> > kernel/sched/sched.h | 1 +
> > 2 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> > index 5f957cff5d17..35dfce06297a 100644
> > --- a/kernel/sched/ext.c
> > +++ b/kernel/sched/ext.c
> > @@ -780,13 +780,23 @@ static void schedule_deferred(struct rq *rq)
> > if (rq->scx.flags & SCX_RQ_IN_WAKEUP)
> > return;
> >
> > + /* Don't do anything if there already is a deferred operation. */
> > + if (rq->scx.flags & SCX_RQ_BAL_PENDING)
> > + return;
> > +
> > /*
> > * If in balance, the balance callbacks will be called before rq lock is
> > * released. Schedule one.
> > + *
> > + *
> > + * We can't directly insert the callback into the
> > + * rq's list: The call can drop its lock and make the pending balance
> > + * callback visible to unrelated code paths that call rq_pin_lock().
> > + *
> > + * Just let balance_one() know that it must do it itself.
> > */
> > if (rq->scx.flags & SCX_RQ_IN_BALANCE) {
> > - queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> > - deferred_bal_cb_workfn);
> > + rq->scx.flags |= SCX_RQ_BAL_CB_PENDING;
> > return;
> > }
> >
> > @@ -2003,6 +2013,18 @@ static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
> > dspc->cursor = 0;
> > }
> >
> > +static inline void maybe_queue_balance_callback(struct rq *rq)
> > +{
> > + lockdep_assert_rq_held(rq);
> > +
> > + if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
> > + queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> > + deferred_bal_cb_workfn);
> > + }
> > +
> > + rq->scx.flags &= SCX_RQ_BAL_CB_PENDING;
>
> Hm... this looks wrong. I think you want to clear SCX_RQ_BAL_CB_PENDING, so
> it should be:
>
> rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
>
Ack, this is a rebase error from a previous version. I've retested
with the above line fixed
and everything still works as expected:
a) The rq_pin_lock warning goes away
b) The ratio of ddsp_process_deferred_locals() done using
deferred_irq_work() vs
queue_balance_callback() does not change
> And while at it, just to better reflect the logic:
>
> if (rq->scx.flags & SCX_RQ_BAL_CB_PENDING) {
> queue_balance_callback(rq, &rq->scx.deferred_bal_cb,
> deferred_bal_cb_workfn);
> rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
> }
>
How about the following to avoid nesting:
if (!(rq->scx.flags & SCX_RQ_BAL_CB_PENDING))
return;
queue_balance_callback(rq, &rq->scx.deferred_bal_cb, deferred_bal_cb_workfn);
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
> > +}
> > +
> > static int balance_one(struct rq *rq, struct task_struct *prev)
> > {
> > struct scx_sched *sch = scx_root;
> > @@ -2150,6 +2172,8 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
> > #endif
> > rq_repin_lock(rq, rf);
> >
> > + maybe_queue_balance_callback(rq);
> > +
> > return ret;
> > }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index be9745d104f7..8f3935785fc5 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -757,6 +757,7 @@ enum scx_rq_flags {
> > SCX_RQ_BAL_KEEP = 1 << 3, /* balance decided to keep current */
> > SCX_RQ_BYPASSING = 1 << 4,
> > SCX_RQ_CLK_VALID = 1 << 5, /* RQ clock is fresh and valid */
> > + SCX_RQ_BAL_CB_PENDING = 1 << 6, /* must queue a cb after dispatching */
> >
> > SCX_RQ_IN_WAKEUP = 1 << 16,
> > SCX_RQ_IN_BALANCE = 1 << 17,
> > --
> > 2.47.3
> >
>
> Thanks,
> -Andrea
Thanks,
Emil
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch
2025-10-09 21:53 ` Emil Tsalapatis
@ 2025-10-09 21:57 ` Andrea Righi
0 siblings, 0 replies; 4+ messages in thread
From: Andrea Righi @ 2025-10-09 21:57 UTC (permalink / raw)
To: Emil Tsalapatis
Cc: Emil Tsalapatis, tj, void, changwoo, sched-ext, Jakub Kicinski
On Thu, Oct 09, 2025 at 05:53:41PM -0400, Emil Tsalapatis wrote:
...
> How about the following to avoid nesting:
>
> if (!(rq->scx.flags & SCX_RQ_BAL_CB_PENDING))
> return;
>
> queue_balance_callback(rq, &rq->scx.deferred_bal_cb, deferred_bal_cb_workfn);
> rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
>
Looks good. With all the fixes you can add my:
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-10-09 21:58 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-09 17:36 [PATCH] sched_ext: defer queue_balance_callback() until after ops.dispatch Emil Tsalapatis
2025-10-09 19:36 ` Andrea Righi
2025-10-09 21:53 ` Emil Tsalapatis
2025-10-09 21:57 ` Andrea Righi
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.