* [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
@ 2025-03-25 14:00 Andrea Righi
2025-03-27 0:24 ` Tejun Heo
0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-25 14:00 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: Joel Fernandes, linux-kernel

scx_bpf_cpuperf_set() can be used to set a performance target level on
any CPU. However, it doesn't correctly acquire the corresponding rq
lock, which may lead to unsafe behavior and trigger the following
warning, due to the lockdep_assert_rq_held() check:
[ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0
...
[ 51.713836] Call trace:
[ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P)
[ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440
[ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c
[ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0
[ 51.713845] bpf_scx_reg+0x18/0x30
[ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0
[ 51.713849] __sys_bpf+0x1934/0x22a0
Fix by properly acquiring the rq lock.
Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3d20228052217..53b97b32dfa43 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
 
 	if (ops_cpu_valid(cpu, NULL)) {
 		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		bool rq_unlocked;
+
+		preempt_disable();
+		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
+		if (rq_unlocked) {
+			rq_lock_irqsave(rq, &rf);
+			update_rq_clock(rq);
+		}
 
 		rq->scx.cpuperf_target = perf;
+		cpufreq_update_util(rq, 0);
 
-		rcu_read_lock_sched_notrace();
-		cpufreq_update_util(cpu_rq(cpu), 0);
-		rcu_read_unlock_sched_notrace();
+		if (rq_unlocked)
+			rq_unlock_irqrestore(rq, &rf);
+		preempt_enable();
 	}
 }
--
2.49.0
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-25 14:00 [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() Andrea Righi
@ 2025-03-27  0:24 ` Tejun Heo
  2025-03-27  7:56   ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-03-27 0:24 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

Hello, Andrea.

On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
>
>  	if (ops_cpu_valid(cpu, NULL)) {
>  		struct rq *rq = cpu_rq(cpu);
> +		struct rq_flags rf;
> +		bool rq_unlocked;
> +
> +		preempt_disable();
> +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> +		if (rq_unlocked) {
> +			rq_lock_irqsave(rq, &rf);

I don't think this is correct:

- This is double-locking regardless of the locking order and thus can lead
  to ABBA deadlocks.

- There's no guarantee that the locked rq is this_rq(). e.g. In the wakeup
  path, the locked rq is on the CPU that the wakeup is targeting, not
  this_rq().

Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
locked or not. We might as well pass it the currently locked rq and remember
that in a percpu variable, so that scx_bpf_*() can always tell whether and
which CPU is rq-locked currently. If unlocked, we can grab the rq lock. If
the target CPU is not the locked one, we can either fail the operation (and
trigger an ops error) or bounce it to an irq work.

Thanks.

--
tejun
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27  0:24 ` Tejun Heo
@ 2025-03-27  7:56   ` Andrea Righi
  2025-03-27  9:53     ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 7:56 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Wed, Mar 26, 2025 at 02:24:16PM -1000, Tejun Heo wrote:
> Hello, Andrea.
>
> On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> > @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
> >
> >  	if (ops_cpu_valid(cpu, NULL)) {
> >  		struct rq *rq = cpu_rq(cpu);
> > +		struct rq_flags rf;
> > +		bool rq_unlocked;
> > +
> > +		preempt_disable();
> > +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> > +		if (rq_unlocked) {
> > +			rq_lock_irqsave(rq, &rf);
>
> I don't think this is correct:
>
> - This is double-locking regardless of the locking order and thus can lead
>   to ABBA deadlocks.
>
> - There's no guarantee that the locked rq is this_rq(). e.g. In the wakeup
>   path, the locked rq is on the CPU that the wakeup is targeting, not
>   this_rq().
>
> Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
> locked or not. We might as well pass it the currently locked rq and remember
> that in a percpu variable, so that scx_bpf_*() can always tell whether and
> which CPU is rq-locked currently. If unlocked, we can grab the rq lock. If
> the target CPU is not the locked one, we can either fail the operation (and
> trigger an ops error) or bounce it to an irq work.

Hm... that's right, it looks like this requires a bit more work than
expected, but saving the currently locked rq might be helpful also for
other kfuncs. I'll take a look at this.

Thanks!
-Andrea
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27  7:56 ` Andrea Righi
@ 2025-03-27  9:53   ` Andrea Righi
  2025-03-27 17:09     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 9:53 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Thu, Mar 27, 2025 at 08:56:12AM +0100, Andrea Righi wrote:
> On Wed, Mar 26, 2025 at 02:24:16PM -1000, Tejun Heo wrote:
> > Hello, Andrea.
> >
> > On Tue, Mar 25, 2025 at 03:00:21PM +0100, Andrea Righi wrote:
> > > @@ -7114,12 +7114,22 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf)
> > >
> > >  	if (ops_cpu_valid(cpu, NULL)) {
> > >  		struct rq *rq = cpu_rq(cpu);
> > > +		struct rq_flags rf;
> > > +		bool rq_unlocked;
> > > +
> > > +		preempt_disable();
> > > +		rq_unlocked = (rq != this_rq()) || scx_kf_allowed_if_unlocked();
> > > +		if (rq_unlocked) {
> > > +			rq_lock_irqsave(rq, &rf);
> >
> > I don't think this is correct:
> >
> > - This is double-locking regardless of the locking order and thus can lead
> >   to ABBA deadlocks.
> >
> > - There's no guarantee that the locked rq is this_rq(). e.g. In the wakeup
> >   path, the locked rq is on the CPU that the wakeup is targeting, not
> >   this_rq().
> >
> > Hmm... this is a bit tricky. SCX_CALL_OP*() always knows whether the rq is
> > locked or not. We might as well pass it the currently locked rq and remember
> > that in a percpu variable, so that scx_bpf_*() can always tell whether and
> > which CPU is rq-locked currently. If unlocked, we can grab the rq lock. If
> > the target CPU is not the locked one, we can either fail the operation (and
> > trigger an ops error) or bounce it to an irq work.
>
> Hm... that's right, it looks like this requires a bit more work than
> expected, but saving the currently locked rq might be helpful also for
> other kfuncs. I'll take a look at this.

What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
all the other cases we ignore locking if rq == this_rq()? If we need to
operate on a different rq than the current one we could either defer the
work or just trigger an ops error. Something like:

	if (scx_kf_allowed_if_unlocked()) {
		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);
	} else if (rq != this_rq()) {
		/* defer work or ops error */
		return;
	}

	lockdep_assert_rq_held(rq);
	rq->scx.cpuperf_target = perf;
	cpufreq_update_util(rq, 0);

	if (scx_kf_allowed_if_unlocked())
		rq_unlock_irqrestore(rq, &rf);

AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
ops.running(), ops.tick() or ops.init(), so even with the ops error we
should cover all the existing cases.

The only unsupported scenario is calling scx_bpf_cpuperf_set() from
ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
later to handle that if needed.

Thanks,
-Andrea
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27  9:53 ` Andrea Righi
@ 2025-03-27 17:09   ` Tejun Heo
  2025-03-27 17:15     ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-03-27 17:09 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

Hello,

On Thu, Mar 27, 2025 at 10:53:39AM +0100, Andrea Righi wrote:
...
> > Hm... that's right, it looks like this requires a bit more work than
> > expected, but saving the currently locked rq might be helpful also for
> > other kfuncs. I'll take a look at this.
>
> What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
> all the other cases we ignore locking if rq == this_rq()? If we need to
> operate on a different rq than the current one we could either defer the
> work or just trigger an ops error. Something like:
>
> 	if (scx_kf_allowed_if_unlocked()) {
> 		rq_lock_irqsave(rq, &rf);
> 		update_rq_clock(rq);
> 	} else if (rq != this_rq()) {
> 		/* defer work or ops error */
> 		return;
> 	}
>
> 	lockdep_assert_rq_held(rq);
> 	rq->scx.cpuperf_target = perf;
> 	cpufreq_update_util(rq, 0);
>
> 	if (scx_kf_allowed_if_unlocked())
> 		rq_unlock_irqrestore(rq, &rf);
>
> AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
> ops.running(), ops.tick() or ops.init(), so even with the ops error we
> should cover all the existing cases.
>
> The only unsupported scenario is calling scx_bpf_cpuperf_set() from
> ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
> later to handle that if needed.

balance_one() can be called from a sibling CPU when core sched is enabled,
so ttwu isn't the only path where the this_rq() test wouldn't work. Even if
we plug all the existing holes and make it work, it feels a bit too fragile
to me. It's something which can easily break inadvertently and cause subtle
failures.

If we don't want to do locked rq tracking, we can always use
schedule_deferred() when any rq is locked too. That's a bit more expensive
tho.

Thanks.

--
tejun
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27 17:09 ` Tejun Heo
@ 2025-03-27 17:15   ` Andrea Righi
  2025-03-27 17:19     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 17:15 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Thu, Mar 27, 2025 at 07:09:25AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 27, 2025 at 10:53:39AM +0100, Andrea Righi wrote:
> ...
> > What if we lock the rq in the scx_kf_allowed_if_unlocked() case, and for
> > all the other cases we ignore locking if rq == this_rq()? If we need to
> > operate on a different rq than the current one we could either defer the
> > work or just trigger an ops error. Something like:
> >
> > 	if (scx_kf_allowed_if_unlocked()) {
> > 		rq_lock_irqsave(rq, &rf);
> > 		update_rq_clock(rq);
> > 	} else if (rq != this_rq()) {
> > 		/* defer work or ops error */
> > 		return;
> > 	}
> >
> > 	lockdep_assert_rq_held(rq);
> > 	rq->scx.cpuperf_target = perf;
> > 	cpufreq_update_util(rq, 0);
> >
> > 	if (scx_kf_allowed_if_unlocked())
> > 		rq_unlock_irqrestore(rq, &rf);
> >
> > AFAICS all the current scx schedulers call scx_bpf_cpuperf_set() from
> > ops.running(), ops.tick() or ops.init(), so even with the ops error we
> > should cover all the existing cases.
> >
> > The only unsupported scenario is calling scx_bpf_cpuperf_set() from
> > ops.enqueue() / ops.select_cpu(), but maybe we could add the deferred work
> > later to handle that if needed.
>
> balance_one() can be called from a sibling CPU when core sched is enabled,
> so ttwu isn't the only path where the this_rq() test wouldn't work. Even if
> we plug all the existing holes and make it work, it feels a bit too fragile
> to me. It's something which can easily break inadvertently and cause subtle
> failures.
>
> If we don't want to do locked rq tracking, we can always use
> schedule_deferred() when any rq is locked too. That's a bit more expensive
> tho.

Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
all the scx callbacks, just to address this issue. Perhaps
schedule_deferred() is a reasonable compromise and we can limit the
overhead just to scx_bpf_cpuperf_set().

-Andrea
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27 17:15 ` Andrea Righi
@ 2025-03-27 17:19   ` Tejun Heo
  2025-03-27 17:27     ` Andrea Righi
  0 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2025-03-27 17:19 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

Hello,

On Thu, Mar 27, 2025 at 06:15:09PM +0100, Andrea Righi wrote:
> > If we don't want to do locked rq tracking, we can always use
> > schedule_deferred() when any rq is locked too. That's a bit more expensive
> > tho.
>
> Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
> all the scx callbacks, just to address this issue.

All operations are already wrapped with SCX_CALL_OP() and updating per-cpu
state (kf flags). It's unlikely that another percpu variable update is going
to be noticeable.

Thanks.

--
tejun
* Re: [PATCH] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set()
  2025-03-27 17:19 ` Tejun Heo
@ 2025-03-27 17:27   ` Andrea Righi
  0 siblings, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2025-03-27 17:27 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, Joel Fernandes, linux-kernel

On Thu, Mar 27, 2025 at 07:19:03AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 27, 2025 at 06:15:09PM +0100, Andrea Righi wrote:
> > > If we don't want to do locked rq tracking, we can always use
> > > schedule_deferred() when any rq is locked too. That's a bit more expensive
> > > tho.
> >
> > Yeah, I'm a bit worried that locked rq tracking might introduce overhead to
> > all the scx callbacks, just to address this issue.
>
> All operations are already wrapped with SCX_CALL_OP() and updating per-cpu
> state (kf flags). It's unlikely that another percpu variable update is going
> to be noticeable.

Ack, I'll explore the locked rq tracking way then.

Thanks,
-Andrea