* [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
@ 2025-03-25 14:00 Andrea Righi
2025-03-27 0:24 ` Tejun Heo
0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-25 14:00 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: Joel Fernandes, linux-kernel

scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:
[ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[ 51.713836] Call trace:
[ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c
[ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0
[ 51.713845] bpf_scx_reg+0x18/0x30
[ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0
[ 51.713849] __sys_bpf+0x1934/0x22a0
Fix by properly acquiring the rq lock.
Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3d20228052217..53b97b32dfa43 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
 
 	if (ops_cpu_valid(cpu, NULL)) {
 		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		bool rq_unlocked;
+
+		preempt_disable();
+		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
+		if (rq_unlocked) {
+			rq_lock_irqsave(rq, &rf);
+			update_rq_clock(rq);
+		}
 
 		rq->scx.cpuperf_target = perf;
+		cpufreq_update_util(rq, 0);
 
-		rcu_read_lock_sched_notrace();
-		cpufreq_update_util(cpu_rq(cpu), 0);
-		rcu_read_unlock_sched_notrace();
+		if (rq_unlocked)
+			rq_unlock_irqrestore(rq, &rf);
+		preempt_enable();
 	}
 }
--
2.49.0
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-25 14:00 [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() Andrea Righi
@ 2025-03-27  0:24 ` Tejun Heo
  2025-03-27  7:56   ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-03-27 0:24 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

Hello, Andrea.

On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
>
>  	if (ops_cpu_valid(cpu, NULL)) {
>  		struct rq *rq = cpu_rq(cpu);
> +		struct rq_flags rf;
> +		bool rq_unlocked;
> +
> +		preempt_disable();
> +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> +		if (rq_unlocked) {
> +			rq_lock_irqsave(rq, &rf);

I don't think this is correct:

- This is double-locking regardless of the locking order and thus can lead
  to ABBA deadlocks.

- There's no guarantee that the locked rq is this_rq(). e.g. In the wakeup
  path, the locked rq is on the CPU that the wakeup is targeting, not
  this_rq().

Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
locked or not. We might as well pass it the currently locked rq and remember
that in a percpu variable, so that scx_bpf_*() can always tell whether and
which CPU is rq-locked currently. If unlocked, we can grab the rq lock. If
the target CPU is not the locked one, we can either fail the operation (and
trigger an ops error) or bounce it to an irq work.

Thanks.

--
tejun
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27  0:24 ` Tejun Heo
@ 2025-03-27  7:56   ` Andrea Righi
  2025-03-27  9:53     ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 7:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Wed, Mar 26, 2025 at 02:24:16PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> > @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
> >
> >  	if (ops_cpu_valid(cpu, NULL)) {
> >  		struct rq *rq = cpu_rq(cpu);
> > +		struct rq_flags rf;
> > +		bool rq_unlocked;
> > +
> > +		preempt_disable();
> > +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> > +		if (rq_unlocked) {
> > +			rq_lock_irqsave(rq, &rf);
>
> I don't think this is correct:
>
> - This is double-locking regardless of the locking order and thus can lead
>   to ABBA deadlocks.
>
> - There's no guarantee that the locked rq is this_rq(). e.g. In the wakeup
>   path, the locked rq is on the CPU that the wakeup is targeting, not
>   this_rq().
>
> Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
> locked or not. We might as well pass it the currently locked rq and remember
> that in a percpu variable, so that scx_bpf_*() can always tell whether and
> which CPU is rq-locked currently. If unlocked, we can grab the rq lock. If
> the target CPU is not the locked one, we can either fail the operation (and
> trigger an ops error) or bounce it to an irq work.

Hm... that's right, it looks like this requires a bit more work than
expected, but saving the currently locked rq might be helpful also for
other kfuncs. I'll take a look at this.

Thanks!
-Andrea
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27  7:56 ` Andrea Righi
@ 2025-03-27  9:53   ` Andrea Righi
  2025-03-27 17:09     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 9:53 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Thu, Mar 27, 2025 at 08:56:12AM +0100, Andrea Righi wrote:
> On Wed, Mar 26, 2025 at 02:24:16PM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> >
> > On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> > > @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
> > >
> > >  	if (ops_cpu_valid(cpu, NULL)) {
> > >  		struct rq *rq = cpu_rq(cpu);
> > > +		struct rq_flags rf;
> > > +		bool rq_unlocked;
> > > +
> > > +		preempt_disable();
> > > +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> > > +		if (rq_unlocked) {
> > > +			rq_lock_irqsave(rq, &rf);
> >
> > I don't think this is correct:
> >
> > - This is double-locking regardless of the locking order and thus can lead
> >   to ABBA deadlocks.
> >
> > - There's no guarantee that the locked rq is this_rq(). e.g. In the wakeup
> >   path, the locked rq is on the CPU that the wakeup is targeting, not
> >   this_rq().
> >
> > Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
> > locked or not. We might as well pass it the currently locked rq and remember
> > that in a percpu variable, so that scx_bpf_*() can always tell whether and
> > which CPU is rq-locked currently. If unlocked, we can grab the rq lock. If
> > the target CPU is not the locked one, we can either fail the operation (and
> > trigger an ops error) or bounce it to an irq work.
>
> Hm... that's right, it looks like this requires a bit more work than
> expected, but saving the currently locked rq might be helpful also for
> other kfuncs. I'll take a look at this.

What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
all the other cases we ignore locking if rq == this_rq()? If we need to
operate on a different rq than the current one we could either defer the
work or just trigger an ops error. Something like:

	if (scx_kf_allowed_if_unlocked()) {
		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);
	} else if (rq != this_rq()) {
		/* defer work or ops error */
		return;
	}

	lockdep_assert_rq_held(rq);
	rq->scx.cpuperf_target = perf;
	cpufreq_update_util(rq, 0);

	if (scx_kf_allowed_if_unlocked())
		rq_unlock_irqrestore(rq, &rf);

AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
ops.running(), ops.tick() or ops.init(), so even with the ops error we
should cover all the existing cases.

The only unsupported scenario is calling scx_bpf_cpuperf_set() from
ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
later to handle that if needed.

Thanks,
-Andrea
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27  9:53 ` Andrea Righi
@ 2025-03-27 17:09   ` Tejun Heo
  2025-03-27 17:15     ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-03-27 17:09 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

Hello,

On Thu, Mar 27, 2025 at 10:53:39AM +0100, Andrea Righi wrote:
...
> > Hm... that's right, it looks like this requires a bit more work than
> > expected, but saving the currently locked rq might be helpful also for
> > other kfuncs. I'll take a look at this.
>
> What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
> all the other cases we ignore locking if rq == this_rq()? If we need to
> operate on a different rq than the current one we could either defer the
> work or just trigger an ops error. Something like:
>
> 	if (scx_kf_allowed_if_unlocked()) {
> 		rq_lock_irqsave(rq, &rf);
> 		update_rq_clock(rq);
> 	} else if (rq != this_rq()) {
> 		/* defer work or ops error */
> 		return;
> 	}
>
> 	lockdep_assert_rq_held(rq);
> 	rq->scx.cpuperf_target = perf;
> 	cpufreq_update_util(rq, 0);
>
> 	if (scx_kf_allowed_if_unlocked())
> 		rq_unlock_irqrestore(rq, &rf);
>
> AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
> ops.running(), ops.tick() or ops.init(), so even with the ops error we
> should cover all the existing cases.
>
> The only unsupported scenario is calling scx_bpf_cpuperf_set() from
> ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
> later to handle that if needed.

balance_one() can be called from a sibling CPU when core sched is enabled,
so ttwu isn't the only path where the this_rq() test wouldn't work. Even if
we plug all the existing holes and make it work, it feels a bit too fragile
to me. It's something which can easily break inadvertently and cause subtle
failures.

If we don't want to do locked rq tracking, we can always use
schedule_deferred() when any rq is locked too. That's a bit more expensive
tho.

Thanks.

--
tejun
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27 17:09 ` Tejun Heo
@ 2025-03-27 17:15   ` Andrea Righi
  2025-03-27 17:19     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 17:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Thu, Mar 27, 2025 at 07:09:25AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 27, 2025 at 10:53:39AM +0100, Andrea Righi wrote:
> ...
> > What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
> > all the other cases we ignore locking if rq == this_rq()? If we need to
> > operate on a different rq than the current one we could either defer the
> > work or just trigger an ops error. Something like:
> >
> > 	if (scx_kf_allowed_if_unlocked()) {
> > 		rq_lock_irqsave(rq, &rf);
> > 		update_rq_clock(rq);
> > 	} else if (rq != this_rq()) {
> > 		/* defer work or ops error */
> > 		return;
> > 	}
> >
> > 	lockdep_assert_rq_held(rq);
> > 	rq->scx.cpuperf_target = perf;
> > 	cpufreq_update_util(rq, 0);
> >
> > 	if (scx_kf_allowed_if_unlocked())
> > 		rq_unlock_irqrestore(rq, &rf);
> >
> > AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
> > ops.running(), ops.tick() or ops.init(), so even with the ops error we
> > should cover all the existing cases.
> >
> > The only unsupported scenario is calling scx_bpf_cpuperf_set() from
> > ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
> > later to handle that if needed.
>
> balance_one() can be called from a sibling CPU when core sched is enabled,
> so ttwu isn't the only path where the this_rq() test wouldn't work. Even if
> we plug all the existing holes and make it work, it feels a bit too fragile
> to me. It's something which can easily break inadvertently and cause subtle
> failures.
>
> If we don't want to do locked rq tracking, we can always use
> schedule_deferred() when any rq is locked too. That's a bit more expensive
> tho.

Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
all the scx callbacks, just to address this issue. Perhaps
schedule_deferred() is a reasonable compromise and we can limit the
overhead just to scx_bpf_cpuperf_set().

-Andrea
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27 17:15 ` Andrea Righi
@ 2025-03-27 17:19   ` Tejun Heo
  2025-03-27 17:27     ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-03-27 17:19 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

Hello,

On Thu, Mar 27, 2025 at 06:15:09PM +0100, Andrea Righi wrote:
> > If we don't want to do locked rq tracking, we can always use
> > schedule_deferred() when any rq is locked too. That's a bit more expensive
> > tho.
>
> Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
> all the scx callbacks, just to address this issue.

All operations are already wrapped with SCX_CALL_OP() and updating per-cpu
state (kf flags). It's unlikely that another percpu variable update is going
to be noticeable.

Thanks.

--
tejun
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27 17:19 ` Tejun Heo
@ 2025-03-27 17:27   ` Andrea Righi
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 17:27 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Thu, Mar 27, 2025 at 07:19:03AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 27, 2025 at 06:15:09PM +0100, Andrea Righi wrote:
> > > If we don't want to do locked rq tracking, we can always use
> > > schedule_deferred() when any rq is locked too. That's a bit more expensive
> > > tho.
> >
> > Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
> > all the scx callbacks, just to address this issue.
>
> All operations are already wrapped with SCX_CALL_OP() and updating per-cpu
> state (kf flags). It's unlikely that another percpu variable update is going
> to be noticeable.

Ack, I'll explore the locked rq tracking way then.

Thanks,
-Andrea