* [PATCH 0/2] sched_ext: Introduce rq lock tracking @ 2025-04-19 12:24 Andrea Righi 2025-04-19 12:24 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi 2025-04-19 12:24 ` [PATCH 2/2] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() Andrea Righi 0 siblings, 2 replies; 18+ messages in thread From: Andrea Righi @ 2025-04-19 12:24 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel Add rq lock tracking to sched_ext ops callbacks to enable scx_bpf_*() kfuncs to detect whether a specific rq is currently locked and safely operate on it. If no rq is locked, the target rq lock can be acquired. If a different rq is already locked, the operation can either fail (triggering an ops error), be deferred using IRQ work, or be managed appropriately to avoid deadlocks. This patchset is also available in the following git branch: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-rq-lock-tracking diffstat follows: Andrea Righi (2): sched_ext: Track currently locked rq sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() include/linux/sched/ext.h | 1 + kernel/sched/ext.c | 153 ++++++++++++++++++++++++++++++---------------- kernel/sched/ext_idle.c | 2 +- 3 files changed, 101 insertions(+), 55 deletions(-) ^ permalink raw reply [flat|nested] 18+ messages in thread
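The pattern the cover letter describes can be sketched as follows. This is an illustration only, not code from the series: the kfunc name scx_bpf_touch_rq() is hypothetical, while scx_locked_rq(), cpu_rq(), rq_lock_irqsave()/rq_unlock_irqrestore(), update_rq_clock() and scx_error() all appear in the patches below.

__bpf_kfunc void scx_bpf_touch_rq(s32 cpu)
{
	struct rq *rq = cpu_rq(cpu), *locked_rq = scx_locked_rq();
	struct rq_flags rf;

	if (!locked_rq) {
		/* No rq is held by the current callback: lock the target. */
		rq_lock_irqsave(rq, &rf);
		update_rq_clock(rq);
		/* ... operate on @rq ... */
		rq_unlock_irqrestore(rq, &rf);
		return;
	}

	if (rq == locked_rq) {
		/* The callback already holds the target rq: use it as is. */
		/* ... operate on @rq ... */
		return;
	}

	/*
	 * A different rq is locked: fail with an ops error (as patch 2
	 * does), defer the work via IRQ work, or handle it in some other
	 * deadlock-free way.
	 */
	scx_error("cannot access CPU %d from this callback", cpu);
}

Patch 2 below applies this same structure to scx_bpf_cpuperf_set().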
* [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-19 12:24 [PATCH 0/2] sched_ext: Introduce rq lock tracking Andrea Righi @ 2025-04-19 12:24 ` Andrea Righi 2025-04-19 17:34 ` Tejun Heo 2025-04-19 12:24 ` [PATCH 2/2] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() Andrea Righi 1 sibling, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-04-19 12:24 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- include/linux/sched/ext.h | 1 + kernel/sched/ext.c | 126 +++++++++++++++++++++++--------------- kernel/sched/ext_idle.c | 2 +- 3 files changed, 78 insertions(+), 51 deletions(-) diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h index f7545430a5482..9de57f2a284fd 100644 --- a/include/linux/sched/ext.h +++ b/include/linux/sched/ext.h @@ -149,6 +149,7 @@ struct sched_ext_entity { s32 selected_cpu; u32 kf_mask; /* see scx_kf_mask above */ struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ + struct rq *locked_rq; /* currently locked rq */ atomic_long_t ops_state; struct list_head runnable_node; /* rq->scx.runnable_list */ diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index bb0873411d798..a14a5c3bc38ac 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1116,8 +1116,32 @@ static void scx_kf_disallow(u32 mask) current->scx.kf_mask &= ~mask; } -#define SCX_CALL_OP(mask, op, args...) \ +static inline void update_locked_rq(struct rq *rq) +{ + /* + * Check whether @rq is actually locked. This can help expose bugs + * or incorrect assumptions about the context in which a kfunc or + * callback is executed. + */ + if (rq) + lockdep_assert_rq_held(rq); + current->scx.locked_rq = rq; + barrier(); +} + +/* + * Return the rq currently locked from an scx callback, or NULL if no rq is + * locked. + */ +static inline struct rq *scx_locked_rq(void) +{ + barrier(); + return current->scx.locked_rq; +} + +#define SCX_CALL_OP(mask, rq, op, args...) \ do { \ + update_locked_rq(rq); \ if (mask) { \ scx_kf_allow(mask); \ scx_ops.op(args); \ @@ -1127,9 +1151,11 @@ do { \ } \ } while (0) -#define SCX_CALL_OP_RET(mask, op, args...) \ +#define SCX_CALL_OP_RET(mask, rq, op, args...) \ ({ \ __typeof__(scx_ops.op(args)) __ret; \ + \ + update_locked_rq(rq); \ if (mask) { \ scx_kf_allow(mask); \ __ret = scx_ops.op(args); \ @@ -1151,31 +1177,31 @@ do { \ * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on * the specific task. */ -#define SCX_CALL_OP_TASK(mask, op, task, args...) \ +#define SCX_CALL_OP_TASK(mask, rq, op, task, args...) 
\ do { \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - SCX_CALL_OP(mask, op, task, ##args); \ + SCX_CALL_OP(mask, rq, op, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ } while (0) -#define SCX_CALL_OP_TASK_RET(mask, op, task, args...) \ +#define SCX_CALL_OP_TASK_RET(mask, rq, op, task, args...) \ ({ \ __typeof__(scx_ops.op(task, ##args)) __ret; \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - __ret = SCX_CALL_OP_RET(mask, op, task, ##args); \ + __ret = SCX_CALL_OP_RET(mask, rq, op, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ __ret; \ }) -#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...) \ +#define SCX_CALL_OP_2TASKS_RET(mask, rq, op, task0, task1, args...) \ ({ \ __typeof__(scx_ops.op(task0, task1, ##args)) __ret; \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task0; \ current->scx.kf_tasks[1] = task1; \ - __ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args); \ + __ret = SCX_CALL_OP_RET(mask, rq, op, task0, task1, ##args); \ current->scx.kf_tasks[0] = NULL; \ current->scx.kf_tasks[1] = NULL; \ __ret; \ @@ -2174,7 +2200,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, rq, enqueue, p, enq_flags); *ddsp_taskp = NULL; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) @@ -2269,7 +2295,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags add_nr_running(rq, 1); if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, runnable, p, enq_flags); if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p); @@ -2283,7 +2309,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags __scx_add_event(SCX_EV_SELECT_CPU_FALLBACK, 1); } -static void ops_dequeue(struct task_struct *p, u64 deq_flags) +static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) { unsigned long opss; @@ -2304,7 +2330,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags) BUG(); case SCX_OPSS_QUEUED: if (SCX_HAS_OP(dequeue)) - SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, dequeue, p, deq_flags); if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -2337,7 +2363,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags return true; } - ops_dequeue(p, deq_flags); + ops_dequeue(rq, p, deq_flags); /* * A currently running task which is going off @rq first gets dequeued @@ -2353,11 +2379,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags */ if (SCX_HAS_OP(stopping) && task_current(rq, p)) { update_curr_scx(rq); - SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, stopping, p, false); } if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, quiescent, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; @@ -2377,7 +2403,7 @@ static void yield_task_scx(struct rq *rq) struct task_struct *p = rq->curr; if (SCX_HAS_OP(yield)) - SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL); + SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, rq, yield, p, NULL); else p->scx.slice = 0; } @@ -2387,7 +2413,7 @@ static 
bool yield_to_task_scx(struct rq *rq, struct task_struct *to) struct task_struct *from = rq->curr; if (SCX_HAS_OP(yield)) - return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to); + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, rq, yield, from, to); else return false; } @@ -2945,7 +2971,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) * emitted in switch_class(). */ if (SCX_HAS_OP(cpu_acquire)) - SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL); + SCX_CALL_OP(SCX_KF_REST, rq, cpu_acquire, cpu_of(rq), NULL); rq->scx.cpu_released = false; } @@ -2990,7 +3016,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) do { dspc->nr_tasks = 0; - SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq), + SCX_CALL_OP(SCX_KF_DISPATCH, rq, dispatch, cpu_of(rq), prev_on_scx ? prev : NULL); flush_dispatch_buf(rq); @@ -3104,7 +3130,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) * Core-sched might decide to execute @p before it is * dispatched. Call ops_dequeue() to notify the BPF scheduler. */ - ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC); + ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC); dispatch_dequeue(rq, p); } @@ -3112,7 +3138,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(SCX_KF_REST, running, p); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, running, p); clr_task_runnable(p, true); @@ -3194,7 +3220,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) }; SCX_CALL_OP(SCX_KF_CPU_RELEASE, - cpu_release, cpu_of(rq), &args); + rq, cpu_release, cpu_of(rq), &args); } rq->scx.cpu_released = true; } @@ -3207,7 +3233,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, stopping, p, true); if (p->scx.flags & SCX_TASK_QUEUED) { set_task_runnable(rq, p); @@ -3345,7 +3371,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, * verifier. */ if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a))) - return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, NULL, core_sched_before, (struct task_struct *)a, (struct task_struct *)b); else @@ -3381,7 +3407,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, + cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, NULL, select_cpu, p, prev_cpu, wake_flags); p->scx.selected_cpu = cpu; *ddsp_taskp = NULL; @@ -3426,7 +3452,7 @@ static void set_cpus_allowed_scx(struct task_struct *p, * designation pointless. Cast it away when calling the operation. 
*/ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, + SCX_CALL_OP_TASK(SCX_KF_REST, NULL, set_cpumask, p, (struct cpumask *)p->cpus_ptr); } @@ -3440,9 +3466,9 @@ static void handle_hotplug(struct rq *rq, bool online) scx_idle_update_selcpu_topology(&scx_ops); if (online && SCX_HAS_OP(cpu_online)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, rq, cpu_online, cpu); else if (!online && SCX_HAS_OP(cpu_offline)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, rq, cpu_offline, cpu); else scx_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, "cpu %d going %s, exiting scheduler", cpu, @@ -3545,7 +3571,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) curr->scx.slice = 0; touch_core_sched(rq, curr); } else if (SCX_HAS_OP(tick)) { - SCX_CALL_OP_TASK(SCX_KF_REST, tick, curr); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, tick, curr); } if (!curr->scx.slice) @@ -3622,7 +3648,7 @@ static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork .fork = fork, }; - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, init_task, p, &args); if (unlikely(ret)) { ret = ops_sanitize_err("init_task", ret); return ret; @@ -3679,11 +3705,11 @@ static void scx_enable_task(struct task_struct *p) p->scx.weight = sched_weight_to_cgroup(weight); if (SCX_HAS_OP(enable)) - SCX_CALL_OP_TASK(SCX_KF_REST, enable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), enable, p); scx_set_task_state(p, SCX_TASK_ENABLED); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), set_weight, p, p->scx.weight); } static void scx_disable_task(struct task_struct *p) @@ -3692,7 +3718,7 @@ static void scx_disable_task(struct task_struct *p) WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED); if (SCX_HAS_OP(disable)) - SCX_CALL_OP_TASK(SCX_KF_REST, disable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), disable, p); scx_set_task_state(p, SCX_TASK_READY); } @@ -3721,7 +3747,7 @@ static void scx_exit_task(struct task_struct *p) } if (SCX_HAS_OP(exit_task)) - SCX_CALL_OP_TASK(SCX_KF_REST, exit_task, p, &args); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), exit_task, p, &args); scx_set_task_state(p, SCX_TASK_NONE); } @@ -3830,7 +3856,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, set_weight, p, p->scx.weight); } static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio) @@ -3846,7 +3872,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) * different scheduler class. Keep the BPF scheduler up-to-date. 
*/ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, + SCX_CALL_OP_TASK(SCX_KF_REST, rq, set_cpumask, p, (struct cpumask *)p->cpus_ptr); } @@ -3908,7 +3934,7 @@ int scx_tg_online(struct task_group *tg) struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, cgroup_init, tg->css.cgroup, &args); if (ret) ret = ops_sanitize_err("cgroup_init", ret); @@ -3930,7 +3956,7 @@ void scx_tg_offline(struct task_group *tg) percpu_down_read(&scx_cgroup_rwsem); if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_exit, tg->css.cgroup); tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); percpu_up_read(&scx_cgroup_rwsem); @@ -3963,7 +3989,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) continue; if (SCX_HAS_OP(cgroup_prep_move)) { - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, cgroup_prep_move, p, from, css->cgroup); if (ret) goto err; @@ -3977,7 +4003,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) err: cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -3996,7 +4022,7 @@ void scx_cgroup_move_task(struct task_struct *p) * cgrp_moving_from set. */ if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) - SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p, + SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, NULL, cgroup_move, p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); p->scx.cgrp_moving_from = NULL; } @@ -4016,7 +4042,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -4030,7 +4056,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight) if (scx_cgroup_enabled && tg->scx_weight != weight) { if (SCX_HAS_OP(cgroup_set_weight)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight, + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_set_weight, tg_cgrp(tg), weight); tg->scx_weight = weight; } @@ -4219,7 +4245,7 @@ static void scx_cgroup_exit(void) continue; rcu_read_unlock(); - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_exit, css->cgroup); rcu_read_lock(); css_put(css); @@ -4256,7 +4282,7 @@ static int scx_cgroup_init(void) continue; rcu_read_unlock(); - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, cgroup_init, css->cgroup, &args); if (ret) { css_put(css); @@ -4749,7 +4775,7 @@ static void scx_disable_workfn(struct kthread_work *work) } if (scx_ops.exit) - SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, exit, ei); cancel_delayed_work_sync(&scx_watchdog_work); @@ -4955,7 +4981,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, if (SCX_HAS_OP(dump_task)) { ops_dump_init(s, " "); - SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p); + SCX_CALL_OP(SCX_KF_REST, NULL, dump_task, 
dctx, p); ops_dump_exit(); } @@ -5002,7 +5028,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) if (SCX_HAS_OP(dump)) { ops_dump_init(&s, ""); - SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, dump, &dctx); ops_dump_exit(); } @@ -5059,7 +5085,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) used = seq_buf_used(&ns); if (SCX_HAS_OP(dump_cpu)) { ops_dump_init(&ns, " "); - SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle); + SCX_CALL_OP(SCX_KF_REST, NULL, dump_cpu, &dctx, cpu, idle); ops_dump_exit(); } @@ -5315,7 +5341,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) scx_idle_enable(ops); if (scx_ops.init) { - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, init); if (ret) { ret = ops_sanitize_err("init", ret); cpus_read_unlock(); diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index 023ae6df5e8ca..9213989e72b4e 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -745,7 +745,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify) * managed by put_prev_task_idle()/set_next_task_idle(). */ if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq)) - SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle); + SCX_CALL_OP(SCX_KF_REST, rq, update_idle, cpu_of(rq), idle); /* * Update the idle masks: -- 2.49.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
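The cover letter also mentions that, when a different rq is already locked, an operation can instead be deferred using IRQ work. None of the patches in this thread implement that path; the following is only a sketch of what such a deferral could look like. Every scx_deferred_*/scx_defer_to_cpu name is an assumption made up for illustration; only the irq_work API itself is real.

#include <linux/irq_work.h>

static void scx_deferred_fn(struct irq_work *work)
{
	/*
	 * Runs asynchronously in IRQ context on the target CPU, outside the
	 * caller's rq lock, so the local rq can be locked here without
	 * nesting two rq locks.
	 */
}

static DEFINE_PER_CPU(struct irq_work, scx_deferred_work) =
	IRQ_WORK_INIT_HARD(scx_deferred_fn);

static void scx_defer_to_cpu(s32 cpu)
{
	irq_work_queue_on(per_cpu_ptr(&scx_deferred_work, cpu), cpu);
}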
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-19 12:24 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi @ 2025-04-19 17:34 ` Tejun Heo 2025-04-19 20:10 ` Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Tejun Heo @ 2025-04-19 17:34 UTC (permalink / raw) To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel Hello, Andrea. On Sat, Apr 19, 2025 at 02:24:30PM +0200, Andrea Righi wrote: > @@ -149,6 +149,7 @@ struct sched_ext_entity { > s32 selected_cpu; > u32 kf_mask; /* see scx_kf_mask above */ > struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ > + struct rq *locked_rq; /* currently locked rq */ Can this be a percpu variable? While rq is locked, current can't switch out anyway and that way we don't have to increase the size of task. Note that kf_tasks[] are different in that some ops may, at least theoretically, sleep. > +static inline void update_locked_rq(struct rq *rq) > +{ > + /* > + * Check whether @rq is actually locked. This can help expose bugs > + * or incorrect assumptions about the context in which a kfunc or > + * callback is executed. > + */ > + if (rq) > + lockdep_assert_rq_held(rq); > + current->scx.locked_rq = rq; > + barrier(); As these conditions are program-order checks on the local CPU, I don't think any barrier is necessary. Thanks. -- tejun ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-19 17:34 ` Tejun Heo @ 2025-04-19 20:10 ` Andrea Righi 2025-04-19 20:30 ` Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-04-19 20:10 UTC (permalink / raw) To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel On Sat, Apr 19, 2025 at 07:34:16AM -1000, Tejun Heo wrote: > Hello, Andrea. > > On Sat, Apr 19, 2025 at 02:24:30PM +0200, Andrea Righi wrote: > > @@ -149,6 +149,7 @@ struct sched_ext_entity { > > s32 selected_cpu; > > u32 kf_mask; /* see scx_kf_mask above */ > > struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ > > + struct rq *locked_rq; /* currently locked rq */ > > Can this be a percpu variable? While rq is locked, current can't switch out > anyway and that way we don't have to increase the size of task. Note that > kf_tasks[] are different in that some ops may, at least theoretically, > sleep. Yeah, I was debating between using a percpu variable or storing it in current. I went with current just to stay consistent with kf_tasks. But you're right about not to increasing the size of the task, and as you pointed out, we can’t switch if the rq is locked, so a percpu variable should work. I’ll update that in v2. > > > +static inline void update_locked_rq(struct rq *rq) > > +{ > > + /* > > + * Check whether @rq is actually locked. This can help expose bugs > > + * or incorrect assumptions about the context in which a kfunc or > > + * callback is executed. > > + */ > > + if (rq) > > + lockdep_assert_rq_held(rq); > > + current->scx.locked_rq = rq; > > + barrier(); > > As these conditions are program-order checks on the local CPU, I don't think > any barrier is necessary. Right, these are local CPU access only, I'll drop the barrier. Thanks, -Andrea ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-19 20:10 ` Andrea Righi @ 2025-04-19 20:30 ` Andrea Righi 2025-04-19 21:27 ` Andrea Righi 2025-04-20 17:44 ` Tejun Heo 0 siblings, 2 replies; 18+ messages in thread From: Andrea Righi @ 2025-04-19 20:30 UTC (permalink / raw) To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel On Sat, Apr 19, 2025 at 10:10:13PM +0200, Andrea Righi wrote: > On Sat, Apr 19, 2025 at 07:34:16AM -1000, Tejun Heo wrote: > > Hello, Andrea. > > > > On Sat, Apr 19, 2025 at 02:24:30PM +0200, Andrea Righi wrote: > > > @@ -149,6 +149,7 @@ struct sched_ext_entity { > > > s32 selected_cpu; > > > u32 kf_mask; /* see scx_kf_mask above */ > > > struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ > > > + struct rq *locked_rq; /* currently locked rq */ > > > > Can this be a percpu variable? While rq is locked, current can't switch out > > anyway and that way we don't have to increase the size of task. Note that > > kf_tasks[] are different in that some ops may, at least theoretically, > > sleep. > > Yeah, I was debating between using a percpu variable or storing it in > current. I went with current just to stay consistent with kf_tasks. > > But you're right about not to increasing the size of the task, and as you > pointed out, we can’t switch if the rq is locked, so a percpu variable > should work. I’ll update that in v2. Hm... actually thinking more about this, a problem with the percpu variable is that, if no rq is locked, we could move to a different CPU and end up reading the wrong rq_locked via scx_locked_rq(). I don't think we want to preempt_disable/enable all the callbacks just to fix this... Maybe storing in current is a safer choice? Thanks, -Andrea ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-19 20:30 ` Andrea Righi @ 2025-04-19 21:27 ` Andrea Righi 2025-04-20 17:44 ` Tejun Heo 1 sibling, 0 replies; 18+ messages in thread From: Andrea Righi @ 2025-04-19 21:27 UTC (permalink / raw) To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel On Sat, Apr 19, 2025 at 10:30:37PM +0200, Andrea Righi wrote: > On Sat, Apr 19, 2025 at 10:10:13PM +0200, Andrea Righi wrote: > > On Sat, Apr 19, 2025 at 07:34:16AM -1000, Tejun Heo wrote: > > > Hello, Andrea. > > > > > > On Sat, Apr 19, 2025 at 02:24:30PM +0200, Andrea Righi wrote: > > > > @@ -149,6 +149,7 @@ struct sched_ext_entity { > > > > s32 selected_cpu; > > > > u32 kf_mask; /* see scx_kf_mask above */ > > > > struct task_struct *kf_tasks[2]; /* see SCX_CALL_OP_TASK() */ > > > > + struct rq *locked_rq; /* currently locked rq */ > > > > > > Can this be a percpu variable? While rq is locked, current can't switch out > > > anyway and that way we don't have to increase the size of task. Note that > > > kf_tasks[] are different in that some ops may, at least theoretically, > > > sleep. > > > > Yeah, I was debating between using a percpu variable or storing it in > > current. I went with current just to stay consistent with kf_tasks. > > > > But you're right about not to increasing the size of the task, and as you > > pointed out, we can’t switch if the rq is locked, so a percpu variable > > should work. I’ll update that in v2. > > Hm... actually thinking more about this, a problem with the percpu variable > is that, if no rq is locked, we could move to a different CPU and end up > reading the wrong rq_locked via scx_locked_rq(). I don't think we want to > preempt_disable/enable all the callbacks just to fix this... Maybe storing > in current is a safer choice? And if we don't want to increase the size of sched_ext_entity, we could store the cpu of the currently locked rq, right before "disallow", like: struct sched_ext_entity { struct scx_dispatch_q * dsq; /* 0 8 */ struct scx_dsq_list_node dsq_list; /* 8 24 */ struct rb_node dsq_priq __attribute__((__aligned__(8))); /* 32 24 */ u32 dsq_seq; /* 56 4 */ u32 dsq_flags; /* 60 4 */ /* --- cacheline 1 boundary (64 bytes) --- */ u32 flags; /* 64 4 */ u32 weight; /* 68 4 */ s32 sticky_cpu; /* 72 4 */ s32 holding_cpu; /* 76 4 */ s32 selected_cpu; /* 80 4 */ u32 kf_mask; /* 84 4 */ struct task_struct * kf_tasks[2]; /* 88 16 */ atomic_long_t ops_state; /* 104 8 */ struct list_head runnable_node; /* 112 16 */ /* --- cacheline 2 boundary (128 bytes) --- */ long unsigned int runnable_at; /* 128 8 */ u64 core_sched_at; /* 136 8 */ u64 ddsp_dsq_id; /* 144 8 */ u64 ddsp_enq_flags; /* 152 8 */ u64 slice; /* 160 8 */ u64 dsq_vtime; /* 168 8 */ int locked_cpu; /* 176 4 */ bool disallow; /* 180 1 */ /* XXX 3 bytes hole, try to pack */ struct cgroup * cgrp_moving_from; /* 184 8 */ /* --- cacheline 3 boundary (192 bytes) --- */ struct list_head tasks_node; /* 192 16 */ /* size: 208, cachelines: 4, members: 24 */ /* sum members: 205, holes: 1, sum holes: 3 */ /* forced alignments: 1 */ /* last cacheline: 16 bytes */ } __attribute__((__aligned__(8))); (before the hole was 7 bytes) Then use cpu_rq()/cpu_of() to resolve that to/from the corresponding rq. -Andrea ^ permalink raw reply [flat|nested] 18+ messages in thread
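Concretely, that alternative (not the one v2 ends up taking; the percpu approach below is used instead) would make the helpers from the patch look roughly like the sketch below, with a negative locked_cpu meaning "no rq locked". The field name locked_cpu comes from the pahole output above; the rest is illustrative.

static inline void update_locked_rq(struct rq *rq)
{
	if (rq)
		lockdep_assert_rq_held(rq);
	current->scx.locked_cpu = rq ? cpu_of(rq) : -1;
}

static inline struct rq *scx_locked_rq(void)
{
	int cpu = current->scx.locked_cpu;

	return cpu >= 0 ? cpu_rq(cpu) : NULL;
}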
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-19 20:30 ` Andrea Righi 2025-04-19 21:27 ` Andrea Righi @ 2025-04-20 17:44 ` Tejun Heo 2025-04-20 18:34 ` Andrea Righi 1 sibling, 1 reply; 18+ messages in thread From: Tejun Heo @ 2025-04-20 17:44 UTC (permalink / raw) To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel Hello, On Sat, Apr 19, 2025 at 10:30:33PM +0200, Andrea Righi wrote: > Hm... actually thinking more about this, a problem with the percpu variable > is that, if no rq is locked, we could move to a different CPU and end up > reading the wrong rq_locked via scx_locked_rq(). I don't think we want to > preempt_disable/enable all the callbacks just to fix this... Maybe storing > in current is a safer choice? Hmm... I have a hard time imagining a timeline of events which would lead to the wrong conclusion. The percpu variable is set only while an rq is locked and cleared before the rq lock is released and thus can only be read as non-NULL while the rq is locked by that CPU. Maybe I'm missing something. Can you illustrate a timeline of events which would lead to the wrong conclusion? Thanks. -- tejun ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-20 17:44 ` Tejun Heo @ 2025-04-20 18:34 ` Andrea Righi 0 siblings, 0 replies; 18+ messages in thread From: Andrea Righi @ 2025-04-20 18:34 UTC (permalink / raw) To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel On Sun, Apr 20, 2025 at 07:44:12AM -1000, Tejun Heo wrote: > Hello, > > On Sat, Apr 19, 2025 at 10:30:33PM +0200, Andrea Righi wrote: > > Hm... actually thinking more about this, a problem with the percpu variable > > is that, if no rq is locked, we could move to a different CPU and end up > > reading the wrong rq_locked via scx_locked_rq(). I don't think we want to > > preempt_disable/enable all the callbacks just to fix this... Maybe storing > > in current is a safer choice? > > Hmm... I have a hard time imagining a timeline of events which would lead to > the wrong conclusion. The percpu variable is set only while an rq is locked > and cleared before the rq lock is released and thus can only be read as > non-NULL while the rq is locked by that CPU. Maybe I'm missing something. > Can you illustrate a timeline of events which would lead to the wrong > conclusion? Oh ok, I was only thinking of setting the percpu variable when we call SCX_CALL_OP*(), but if we clear it after the scx op returns, then it should work. If no rq is locked and we bounce to a different CPU, we'd still read NULL, so it should be always correct. Alright, I'll send a v2 with this logic and the percpu variable. Thanks, -Andrea ^ permalink raw reply [flat|nested] 18+ messages in thread
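The invariant this exchange converges on: the percpu variable is only ever non-NULL while the current CPU holds that rq's lock, because it is written under the lock and cleared before SCX_CALL_OP*() returns; a callback invoked with no rq locked never sets it, so even if the task migrates, the new CPU's copy still reads NULL. As an illustration only (no such helper exists in the series), the invariant could be asserted as:

static inline void scx_assert_locked_rq(void)
{
	struct rq *rq = scx_locked_rq();

	if (rq)
		lockdep_assert_rq_held(rq);
}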
* [PATCH 2/2] sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() 2025-04-19 12:24 [PATCH 0/2] sched_ext: Introduce rq lock tracking Andrea Righi 2025-04-19 12:24 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi @ 2025-04-19 12:24 ` Andrea Righi 1 sibling, 0 replies; 18+ messages in thread From: Andrea Righi @ 2025-04-19 12:24 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel scx_bpf_cpuperf_set() can be used to set a performance target level on any CPU. However, it doesn't correctly acquire the corresponding rq lock, which may lead to unsafe behavior and trigger the following warning, due to the lockdep_assert_rq_held() check: [ 51.713737] WARNING: CPU: 3 PID: 3899 at kernel/sched/sched.h:1512 scx_bpf_cpuperf_set+0x1a0/0x1e0 ... [ 51.713836] Call trace: [ 51.713837] scx_bpf_cpuperf_set+0x1a0/0x1e0 (P) [ 51.713839] bpf_prog_62d35beb9301601f_bpfland_init+0x168/0x440 [ 51.713841] bpf__sched_ext_ops_init+0x54/0x8c [ 51.713843] scx_ops_enable.constprop.0+0x2c0/0x10f0 [ 51.713845] bpf_scx_reg+0x18/0x30 [ 51.713847] bpf_struct_ops_link_create+0x154/0x1b0 [ 51.713849] __sys_bpf+0x1934/0x22a0 Fix by properly acquiring the rq lock when possible or raising an error if we try to operate on a CPU that is not the one currently locked. Fixes: d86adb4fc0655 ("sched_ext: Add cpuperf support") Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index a14a5c3bc38ac..eddd84731c019 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -7078,13 +7078,32 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf) } if (ops_cpu_valid(cpu, NULL)) { - struct rq *rq = cpu_rq(cpu); + struct rq *rq = cpu_rq(cpu), *locked_rq = scx_locked_rq(); + struct rq_flags rf; + + /* + * When called with an rq lock held, restrict the operation + * to the corresponding CPU to prevent ABBA deadlocks. + */ + if (locked_rq && rq != locked_rq) { + scx_error("Invalid target CPU %d", cpu); + return; + } + + /* + * If no rq lock is held, allow to operate on any CPU by + * acquiring the corresponding rq lock. + */ + if (!locked_rq) { + rq_lock_irqsave(rq, &rf); + update_rq_clock(rq); + } rq->scx.cpuperf_target = perf; + cpufreq_update_util(rq, 0); - rcu_read_lock_sched_notrace(); - cpufreq_update_util(cpu_rq(cpu), 0); - rcu_read_unlock_sched_notrace(); + if (!locked_rq) + rq_unlock_irqrestore(rq, &rf); } } -- 2.49.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
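From the BPF scheduler side, the effect of this fix can be sketched as below. The scheduler is hypothetical (the ops and scheduler names are made up, and nr_cpu_ids is assumed to be filled in by the userspace loader, as the in-tree examples do); it only illustrates the two calling contexts: ops.init(), where no rq is locked and the kfunc now takes the target rq lock itself, and ops.tick(), which runs with the local CPU's rq already locked.

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

const volatile u32 nr_cpu_ids = 1;	/* assumption: set by the loader */

s32 BPF_STRUCT_OPS_SLEEPABLE(cpuperf_example_init)
{
	s32 cpu;

	/* No rq is locked here: the kfunc acquires each target rq lock. */
	bpf_for(cpu, 0, nr_cpu_ids)
		scx_bpf_cpuperf_set(cpu, SCX_CPUPERF_ONE);
	return 0;
}

void BPF_STRUCT_OPS(cpuperf_example_tick, struct task_struct *p)
{
	/*
	 * The local CPU's rq is locked here: targeting this CPU reuses the
	 * held lock, while targeting another CPU now triggers an ops error
	 * instead of risking an ABBA deadlock.
	 */
	scx_bpf_cpuperf_set(bpf_get_smp_processor_id(), SCX_CPUPERF_ONE);
}

SEC(".struct_ops.link")
struct sched_ext_ops cpuperf_example_ops = {
	.init	= (void *)cpuperf_example_init,
	.tick	= (void *)cpuperf_example_tick,
	.name	= "cpuperf_example",
};

The warning quoted in the commit message above comes from exactly the first case: bpfland calling the kfunc from ops.init(), where no rq lock is held.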
* [PATCH v2 0/2] sched_ext: Introduce rq lock tracking @ 2025-04-20 19:30 Andrea Righi 2025-04-20 19:30 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-04-20 19:30 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel Add rq lock tracking to sched_ext ops callbacks to enable scx_bpf_*() kfuncs to detect whether a specific rq is currently locked and safely operate on it. If no rq is locked, the target rq lock can be acquired. If a different rq is already locked, the operation can either fail (triggering an ops error), be deferred using IRQ work, or be managed appropriately to avoid deadlocks. This patchset is also available in the following git branch: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-rq-lock-tracking Changes in v2: - store currently locked rq in a percpu variable - link to v1: https://lore.kernel.org/all/20250419123536.154469-1-arighi@nvidia.com Andrea Righi (2): sched_ext: Track currently locked rq sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() kernel/sched/ext.c | 163 ++++++++++++++++++++++++++++++++---------------- kernel/sched/ext_idle.c | 2 +- 2 files changed, 110 insertions(+), 55 deletions(-) ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-20 19:30 [PATCH v2 0/2] sched_ext: Introduce rq lock tracking Andrea Righi @ 2025-04-20 19:30 ` Andrea Righi 2025-04-21 19:03 ` Tejun Heo 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-04-20 19:30 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 136 +++++++++++++++++++++++++--------------- kernel/sched/ext_idle.c | 2 +- 2 files changed, 87 insertions(+), 51 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index bb0873411d798..51dad94f1952d 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1116,8 +1116,38 @@ static void scx_kf_disallow(u32 mask) current->scx.kf_mask &= ~mask; } -#define SCX_CALL_OP(mask, op, args...) \ +/* + * Track the rq currently locked. + * + * This allows kfuncs to safely operate on rq from any scx ops callback, + * knowing which rq is already locked. + */ +static DEFINE_PER_CPU(struct rq *, locked_rq); + +static inline void update_locked_rq(struct rq *rq) +{ + /* + * Check whether @rq is actually locked. This can help expose bugs + * or incorrect assumptions about the context in which a kfunc or + * callback is executed. + */ + if (rq) + lockdep_assert_rq_held(rq); + __this_cpu_write(locked_rq, rq); +} + +/* + * Return the rq currently locked from an scx callback, or NULL if no rq is + * locked. + */ +static inline struct rq *scx_locked_rq(void) +{ + return __this_cpu_read(locked_rq); +} + +#define SCX_CALL_OP(mask, rq, op, args...) \ do { \ + update_locked_rq(rq); \ if (mask) { \ scx_kf_allow(mask); \ scx_ops.op(args); \ @@ -1125,11 +1155,15 @@ do { \ } else { \ scx_ops.op(args); \ } \ + if (rq) \ + update_locked_rq(NULL); \ } while (0) -#define SCX_CALL_OP_RET(mask, op, args...) \ +#define SCX_CALL_OP_RET(mask, rq, op, args...) \ ({ \ __typeof__(scx_ops.op(args)) __ret; \ + \ + update_locked_rq(rq); \ if (mask) { \ scx_kf_allow(mask); \ __ret = scx_ops.op(args); \ @@ -1137,6 +1171,8 @@ do { \ } else { \ __ret = scx_ops.op(args); \ } \ + if (rq) \ + update_locked_rq(NULL); \ __ret; \ }) @@ -1151,31 +1187,31 @@ do { \ * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on * the specific task. */ -#define SCX_CALL_OP_TASK(mask, op, task, args...) \ +#define SCX_CALL_OP_TASK(mask, rq, op, task, args...) \ do { \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - SCX_CALL_OP(mask, op, task, ##args); \ + SCX_CALL_OP(mask, rq, op, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ } while (0) -#define SCX_CALL_OP_TASK_RET(mask, op, task, args...) \ +#define SCX_CALL_OP_TASK_RET(mask, rq, op, task, args...) 
\ ({ \ __typeof__(scx_ops.op(task, ##args)) __ret; \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - __ret = SCX_CALL_OP_RET(mask, op, task, ##args); \ + __ret = SCX_CALL_OP_RET(mask, rq, op, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ __ret; \ }) -#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...) \ +#define SCX_CALL_OP_2TASKS_RET(mask, rq, op, task0, task1, args...) \ ({ \ __typeof__(scx_ops.op(task0, task1, ##args)) __ret; \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task0; \ current->scx.kf_tasks[1] = task1; \ - __ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args); \ + __ret = SCX_CALL_OP_RET(mask, rq, op, task0, task1, ##args); \ current->scx.kf_tasks[0] = NULL; \ current->scx.kf_tasks[1] = NULL; \ __ret; \ @@ -2174,7 +2210,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, rq, enqueue, p, enq_flags); *ddsp_taskp = NULL; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) @@ -2269,7 +2305,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags add_nr_running(rq, 1); if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, runnable, p, enq_flags); if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p); @@ -2283,7 +2319,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags __scx_add_event(SCX_EV_SELECT_CPU_FALLBACK, 1); } -static void ops_dequeue(struct task_struct *p, u64 deq_flags) +static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) { unsigned long opss; @@ -2304,7 +2340,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags) BUG(); case SCX_OPSS_QUEUED: if (SCX_HAS_OP(dequeue)) - SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, dequeue, p, deq_flags); if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -2337,7 +2373,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags return true; } - ops_dequeue(p, deq_flags); + ops_dequeue(rq, p, deq_flags); /* * A currently running task which is going off @rq first gets dequeued @@ -2353,11 +2389,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags */ if (SCX_HAS_OP(stopping) && task_current(rq, p)) { update_curr_scx(rq); - SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, stopping, p, false); } if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, quiescent, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; @@ -2377,7 +2413,7 @@ static void yield_task_scx(struct rq *rq) struct task_struct *p = rq->curr; if (SCX_HAS_OP(yield)) - SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL); + SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, rq, yield, p, NULL); else p->scx.slice = 0; } @@ -2387,7 +2423,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) struct task_struct *from = rq->curr; if (SCX_HAS_OP(yield)) - return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to); + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, rq, yield, from, to); else return false; } @@ -2945,7 +2981,7 @@ static int balance_one(struct rq *rq, 
struct task_struct *prev) * emitted in switch_class(). */ if (SCX_HAS_OP(cpu_acquire)) - SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL); + SCX_CALL_OP(SCX_KF_REST, rq, cpu_acquire, cpu_of(rq), NULL); rq->scx.cpu_released = false; } @@ -2990,7 +3026,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) do { dspc->nr_tasks = 0; - SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq), + SCX_CALL_OP(SCX_KF_DISPATCH, rq, dispatch, cpu_of(rq), prev_on_scx ? prev : NULL); flush_dispatch_buf(rq); @@ -3104,7 +3140,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) * Core-sched might decide to execute @p before it is * dispatched. Call ops_dequeue() to notify the BPF scheduler. */ - ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC); + ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC); dispatch_dequeue(rq, p); } @@ -3112,7 +3148,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(SCX_KF_REST, running, p); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, running, p); clr_task_runnable(p, true); @@ -3194,7 +3230,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) }; SCX_CALL_OP(SCX_KF_CPU_RELEASE, - cpu_release, cpu_of(rq), &args); + rq, cpu_release, cpu_of(rq), &args); } rq->scx.cpu_released = true; } @@ -3207,7 +3243,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, stopping, p, true); if (p->scx.flags & SCX_TASK_QUEUED) { set_task_runnable(rq, p); @@ -3345,7 +3381,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, * verifier. */ if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a))) - return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, NULL, core_sched_before, (struct task_struct *)a, (struct task_struct *)b); else @@ -3381,7 +3417,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, + cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, NULL, select_cpu, p, prev_cpu, wake_flags); p->scx.selected_cpu = cpu; *ddsp_taskp = NULL; @@ -3426,7 +3462,7 @@ static void set_cpus_allowed_scx(struct task_struct *p, * designation pointless. Cast it away when calling the operation. 
*/ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, + SCX_CALL_OP_TASK(SCX_KF_REST, NULL, set_cpumask, p, (struct cpumask *)p->cpus_ptr); } @@ -3440,9 +3476,9 @@ static void handle_hotplug(struct rq *rq, bool online) scx_idle_update_selcpu_topology(&scx_ops); if (online && SCX_HAS_OP(cpu_online)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, rq, cpu_online, cpu); else if (!online && SCX_HAS_OP(cpu_offline)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, rq, cpu_offline, cpu); else scx_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, "cpu %d going %s, exiting scheduler", cpu, @@ -3545,7 +3581,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) curr->scx.slice = 0; touch_core_sched(rq, curr); } else if (SCX_HAS_OP(tick)) { - SCX_CALL_OP_TASK(SCX_KF_REST, tick, curr); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, tick, curr); } if (!curr->scx.slice) @@ -3622,7 +3658,7 @@ static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork .fork = fork, }; - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, init_task, p, &args); if (unlikely(ret)) { ret = ops_sanitize_err("init_task", ret); return ret; @@ -3679,11 +3715,11 @@ static void scx_enable_task(struct task_struct *p) p->scx.weight = sched_weight_to_cgroup(weight); if (SCX_HAS_OP(enable)) - SCX_CALL_OP_TASK(SCX_KF_REST, enable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), enable, p); scx_set_task_state(p, SCX_TASK_ENABLED); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), set_weight, p, p->scx.weight); } static void scx_disable_task(struct task_struct *p) @@ -3692,7 +3728,7 @@ static void scx_disable_task(struct task_struct *p) WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED); if (SCX_HAS_OP(disable)) - SCX_CALL_OP_TASK(SCX_KF_REST, disable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), disable, p); scx_set_task_state(p, SCX_TASK_READY); } @@ -3721,7 +3757,7 @@ static void scx_exit_task(struct task_struct *p) } if (SCX_HAS_OP(exit_task)) - SCX_CALL_OP_TASK(SCX_KF_REST, exit_task, p, &args); + SCX_CALL_OP_TASK(SCX_KF_REST, task_rq(p), exit_task, p, &args); scx_set_task_state(p, SCX_TASK_NONE); } @@ -3830,7 +3866,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, rq, set_weight, p, p->scx.weight); } static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio) @@ -3846,7 +3882,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) * different scheduler class. Keep the BPF scheduler up-to-date. 
*/ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, + SCX_CALL_OP_TASK(SCX_KF_REST, rq, set_cpumask, p, (struct cpumask *)p->cpus_ptr); } @@ -3908,7 +3944,7 @@ int scx_tg_online(struct task_group *tg) struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, cgroup_init, tg->css.cgroup, &args); if (ret) ret = ops_sanitize_err("cgroup_init", ret); @@ -3930,7 +3966,7 @@ void scx_tg_offline(struct task_group *tg) percpu_down_read(&scx_cgroup_rwsem); if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_exit, tg->css.cgroup); tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); percpu_up_read(&scx_cgroup_rwsem); @@ -3963,7 +3999,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) continue; if (SCX_HAS_OP(cgroup_prep_move)) { - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, cgroup_prep_move, p, from, css->cgroup); if (ret) goto err; @@ -3977,7 +4013,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) err: cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -3996,7 +4032,7 @@ void scx_cgroup_move_task(struct task_struct *p) * cgrp_moving_from set. */ if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) - SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p, + SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, NULL, cgroup_move, p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); p->scx.cgrp_moving_from = NULL; } @@ -4016,7 +4052,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_cancel_move, p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -4030,7 +4066,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight) if (scx_cgroup_enabled && tg->scx_weight != weight) { if (SCX_HAS_OP(cgroup_set_weight)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight, + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_set_weight, tg_cgrp(tg), weight); tg->scx_weight = weight; } @@ -4219,7 +4255,7 @@ static void scx_cgroup_exit(void) continue; rcu_read_unlock(); - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, cgroup_exit, css->cgroup); rcu_read_lock(); css_put(css); @@ -4256,7 +4292,7 @@ static int scx_cgroup_init(void) continue; rcu_read_unlock(); - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, cgroup_init, css->cgroup, &args); if (ret) { css_put(css); @@ -4749,7 +4785,7 @@ static void scx_disable_workfn(struct kthread_work *work) } if (scx_ops.exit) - SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, exit, ei); cancel_delayed_work_sync(&scx_watchdog_work); @@ -4955,7 +4991,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, if (SCX_HAS_OP(dump_task)) { ops_dump_init(s, " "); - SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p); + SCX_CALL_OP(SCX_KF_REST, NULL, dump_task, 
dctx, p); ops_dump_exit(); } @@ -5002,7 +5038,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) if (SCX_HAS_OP(dump)) { ops_dump_init(&s, ""); - SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx); + SCX_CALL_OP(SCX_KF_UNLOCKED, NULL, dump, &dctx); ops_dump_exit(); } @@ -5059,7 +5095,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) used = seq_buf_used(&ns); if (SCX_HAS_OP(dump_cpu)) { ops_dump_init(&ns, " "); - SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle); + SCX_CALL_OP(SCX_KF_REST, NULL, dump_cpu, &dctx, cpu, idle); ops_dump_exit(); } @@ -5315,7 +5351,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) scx_idle_enable(ops); if (scx_ops.init) { - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, NULL, init); if (ret) { ret = ops_sanitize_err("init", ret); cpus_read_unlock(); diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index 023ae6df5e8ca..9213989e72b4e 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -745,7 +745,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify) * managed by put_prev_task_idle()/set_next_task_idle(). */ if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq)) - SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle); + SCX_CALL_OP(SCX_KF_REST, rq, update_idle, cpu_of(rq), idle); /* * Update the idle masks: -- 2.49.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-20 19:30 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi @ 2025-04-21 19:03 ` Tejun Heo 2025-04-22 6:27 ` Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Tejun Heo @ 2025-04-21 19:03 UTC (permalink / raw) To: Andrea Righi; +Cc: David Vernet, Changwoo Min, linux-kernel Hello, On Sun, Apr 20, 2025 at 09:30:21PM +0200, Andrea Righi wrote: ... > +static inline struct rq *scx_locked_rq(void) > +{ > + return __this_cpu_read(locked_rq); > +} > + > +#define SCX_CALL_OP(mask, rq, op, args...) \ > do { \ > + update_locked_rq(rq); \ Minor but why not if (rq) update_locked_rq(rq); here too to be symmetric? > if (mask) { \ > scx_kf_allow(mask); \ > scx_ops.op(args); \ > @@ -1125,11 +1155,15 @@ do { \ > } else { \ > scx_ops.op(args); \ > } \ > + if (rq) \ > + update_locked_rq(NULL); \ Or alternatively, drop `if (rq)` from both places. That's simpler and given that all the hot paths are called with rq locked, that may be *minutely* faster. > @@ -2174,7 +2210,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > WARN_ON_ONCE(*ddsp_taskp); > *ddsp_taskp = p; > > - SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); > + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, rq, enqueue, p, enq_flags); Let's do SCX_CALL_OP_TASK(SCX_FK_ENQUEUE, enqueue, rq, p, enq_flags) so that the static parts of the invocation are grouped together and we usually have @rq and @p next to each other when they're used as parameters. Thanks. -- tejun ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-21 19:03 ` Tejun Heo @ 2025-04-22 6:27 ` Andrea Righi 0 siblings, 0 replies; 18+ messages in thread From: Andrea Righi @ 2025-04-22 6:27 UTC (permalink / raw) To: Tejun Heo; +Cc: David Vernet, Changwoo Min, linux-kernel On Mon, Apr 21, 2025 at 09:03:08AM -1000, Tejun Heo wrote: > Hello, > > On Sun, Apr 20, 2025 at 09:30:21PM +0200, Andrea Righi wrote: > ... > > +static inline struct rq *scx_locked_rq(void) > > +{ > > + return __this_cpu_read(locked_rq); > > +} > > + > > +#define SCX_CALL_OP(mask, rq, op, args...) \ > > do { \ > > + update_locked_rq(rq); \ > > Minor but why not > > if (rq) > update_locked_rq(rq); > > here too to be symmetric? > > > if (mask) { \ > > scx_kf_allow(mask); \ > > scx_ops.op(args); \ > > @@ -1125,11 +1155,15 @@ do { \ > > } else { \ > > scx_ops.op(args); \ > > } \ > > + if (rq) \ > > + update_locked_rq(NULL); \ > > Or alternatively, drop `if (rq)` from both places. That's simpler and given > that all the hot paths are called with rq locked, that may be *minutely* > faster. Ack, let's not complicate the code unnecessarily, that's a negligible optimization (I tried to remove that `if (rq)` and I don't see any measurable difference, as expected). > > > @@ -2174,7 +2210,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, > > WARN_ON_ONCE(*ddsp_taskp); > > *ddsp_taskp = p; > > > > - SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); > > + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, rq, enqueue, p, enq_flags); > > Let's do SCX_CALL_OP_TASK(SCX_FK_ENQUEUE, enqueue, rq, p, enq_flags) so that > the static parts of the invocation are grouped together and we usually have > @rq and @p next to each other when they're used as parameters. Ack. Will change this as well and send a v3. Thanks, -Andrea ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v3 0/2] sched_ext: Introduce rq lock tracking @ 2025-04-22 8:26 Andrea Righi 2025-04-22 8:26 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-04-22 8:26 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel Add rq lock tracking to sched_ext ops callbacks to enable scx_bpf_*() kfuncs to detect whether a specific rq is currently locked and safely operate on it. If no rq is locked, the target rq lock can be acquired. If a different rq is already locked, the operation can either fail (triggering an ops error), be deferred using IRQ work, or be managed appropriately to avoid deadlocks. This patchset is also available in the following git branch: git://git.kernel.org/pub/scm/linux/kernel/git/arighi/linux.git scx-rq-lock-tracking Changes in v3: - remove unnecessary rq != NULL optimization check in SCX_CALL_OP() - reorder SCX_CALL_OP*() arguments to group static args together - link to v2: https://lore.kernel.org/all/20250420193106.42533-1-arighi@nvidia.com/ Changes in v2: - store currently locked rq in a percpu variable - link to v1: https://lore.kernel.org/all/20250419123536.154469-1-arighi@nvidia.com Andrea Righi (2): sched_ext: Track currently locked rq sched_ext: Fix missing rq lock in scx_bpf_cpuperf_set() kernel/sched/ext.c | 179 +++++++++++++++++++++++++++++++----------------- kernel/sched/ext_idle.c | 2 +- 2 files changed, 118 insertions(+), 63 deletions(-) ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-22 8:26 [PATCH v3 0/2] sched_ext: Introduce rq lock tracking Andrea Righi @ 2025-04-22 8:26 ` Andrea Righi 2025-07-15 9:13 ` Breno Leitao 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-04-22 8:26 UTC (permalink / raw) To: Tejun Heo, David Vernet, Changwoo Min; +Cc: linux-kernel Some kfuncs provided by sched_ext may need to operate on a struct rq, but they can be invoked from various contexts, specifically, different scx callbacks. While some of these callbacks are invoked with a particular rq already locked, others are not. This makes it impossible for a kfunc to reliably determine whether it's safe to access a given rq, triggering potential bugs or unsafe behaviors, see for example [1]. To address this, track the currently locked rq whenever a sched_ext callback is invoked via SCX_CALL_OP*(). This allows kfuncs that need to operate on an arbitrary rq to retrieve the currently locked one and apply the appropriate action as needed. [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> --- kernel/sched/ext.c | 152 +++++++++++++++++++++++++--------------- kernel/sched/ext_idle.c | 2 +- 2 files changed, 95 insertions(+), 59 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index bb0873411d798..3365b447cbdb8 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1116,8 +1116,38 @@ static void scx_kf_disallow(u32 mask) current->scx.kf_mask &= ~mask; } -#define SCX_CALL_OP(mask, op, args...) \ +/* + * Track the rq currently locked. + * + * This allows kfuncs to safely operate on rq from any scx ops callback, + * knowing which rq is already locked. + */ +static DEFINE_PER_CPU(struct rq *, locked_rq); + +static inline void update_locked_rq(struct rq *rq) +{ + /* + * Check whether @rq is actually locked. This can help expose bugs + * or incorrect assumptions about the context in which a kfunc or + * callback is executed. + */ + if (rq) + lockdep_assert_rq_held(rq); + __this_cpu_write(locked_rq, rq); +} + +/* + * Return the rq currently locked from an scx callback, or NULL if no rq is + * locked. + */ +static inline struct rq *scx_locked_rq(void) +{ + return __this_cpu_read(locked_rq); +} + +#define SCX_CALL_OP(mask, op, rq, args...) \ do { \ + update_locked_rq(rq); \ if (mask) { \ scx_kf_allow(mask); \ scx_ops.op(args); \ @@ -1125,11 +1155,14 @@ do { \ } else { \ scx_ops.op(args); \ } \ + update_locked_rq(NULL); \ } while (0) -#define SCX_CALL_OP_RET(mask, op, args...) \ +#define SCX_CALL_OP_RET(mask, op, rq, args...) \ ({ \ __typeof__(scx_ops.op(args)) __ret; \ + \ + update_locked_rq(rq); \ if (mask) { \ scx_kf_allow(mask); \ __ret = scx_ops.op(args); \ @@ -1137,6 +1170,7 @@ do { \ } else { \ __ret = scx_ops.op(args); \ } \ + update_locked_rq(NULL); \ __ret; \ }) @@ -1151,31 +1185,31 @@ do { \ * scx_kf_allowed_on_arg_tasks() to test whether the invocation is allowed on * the specific task. */ -#define SCX_CALL_OP_TASK(mask, op, task, args...) \ +#define SCX_CALL_OP_TASK(mask, op, rq, task, args...) \ do { \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - SCX_CALL_OP(mask, op, task, ##args); \ + SCX_CALL_OP(mask, op, rq, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ } while (0) -#define SCX_CALL_OP_TASK_RET(mask, op, task, args...) \ +#define SCX_CALL_OP_TASK_RET(mask, op, rq, task, args...) 
\ ({ \ __typeof__(scx_ops.op(task, ##args)) __ret; \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task; \ - __ret = SCX_CALL_OP_RET(mask, op, task, ##args); \ + __ret = SCX_CALL_OP_RET(mask, op, rq, task, ##args); \ current->scx.kf_tasks[0] = NULL; \ __ret; \ }) -#define SCX_CALL_OP_2TASKS_RET(mask, op, task0, task1, args...) \ +#define SCX_CALL_OP_2TASKS_RET(mask, op, rq, task0, task1, args...) \ ({ \ __typeof__(scx_ops.op(task0, task1, ##args)) __ret; \ BUILD_BUG_ON((mask) & ~__SCX_KF_TERMINAL); \ current->scx.kf_tasks[0] = task0; \ current->scx.kf_tasks[1] = task1; \ - __ret = SCX_CALL_OP_RET(mask, op, task0, task1, ##args); \ + __ret = SCX_CALL_OP_RET(mask, op, rq, task0, task1, ##args); \ current->scx.kf_tasks[0] = NULL; \ current->scx.kf_tasks[1] = NULL; \ __ret; \ @@ -2174,7 +2208,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags, WARN_ON_ONCE(*ddsp_taskp); *ddsp_taskp = p; - SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_ENQUEUE, enqueue, rq, p, enq_flags); *ddsp_taskp = NULL; if (p->scx.ddsp_dsq_id != SCX_DSQ_INVALID) @@ -2269,7 +2303,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags add_nr_running(rq, 1); if (SCX_HAS_OP(runnable) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(SCX_KF_REST, runnable, p, enq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, runnable, rq, p, enq_flags); if (enq_flags & SCX_ENQ_WAKEUP) touch_core_sched(rq, p); @@ -2283,7 +2317,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags __scx_add_event(SCX_EV_SELECT_CPU_FALLBACK, 1); } -static void ops_dequeue(struct task_struct *p, u64 deq_flags) +static void ops_dequeue(struct rq *rq, struct task_struct *p, u64 deq_flags) { unsigned long opss; @@ -2304,7 +2338,7 @@ static void ops_dequeue(struct task_struct *p, u64 deq_flags) BUG(); case SCX_OPSS_QUEUED: if (SCX_HAS_OP(dequeue)) - SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, dequeue, rq, p, deq_flags); if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss, SCX_OPSS_NONE)) @@ -2337,7 +2371,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags return true; } - ops_dequeue(p, deq_flags); + ops_dequeue(rq, p, deq_flags); /* * A currently running task which is going off @rq first gets dequeued @@ -2353,11 +2387,11 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags */ if (SCX_HAS_OP(stopping) && task_current(rq, p)) { update_curr_scx(rq); - SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, false); + SCX_CALL_OP_TASK(SCX_KF_REST, stopping, rq, p, false); } if (SCX_HAS_OP(quiescent) && !task_on_rq_migrating(p)) - SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, p, deq_flags); + SCX_CALL_OP_TASK(SCX_KF_REST, quiescent, rq, p, deq_flags); if (deq_flags & SCX_DEQ_SLEEP) p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP; @@ -2377,7 +2411,7 @@ static void yield_task_scx(struct rq *rq) struct task_struct *p = rq->curr; if (SCX_HAS_OP(yield)) - SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, p, NULL); + SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, rq, p, NULL); else p->scx.slice = 0; } @@ -2387,7 +2421,7 @@ static bool yield_to_task_scx(struct rq *rq, struct task_struct *to) struct task_struct *from = rq->curr; if (SCX_HAS_OP(yield)) - return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, from, to); + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, yield, rq, from, to); else return false; } @@ -2945,7 +2979,7 @@ static int balance_one(struct rq *rq, 
struct task_struct *prev) * emitted in switch_class(). */ if (SCX_HAS_OP(cpu_acquire)) - SCX_CALL_OP(SCX_KF_REST, cpu_acquire, cpu_of(rq), NULL); + SCX_CALL_OP(SCX_KF_REST, cpu_acquire, rq, cpu_of(rq), NULL); rq->scx.cpu_released = false; } @@ -2990,7 +3024,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev) do { dspc->nr_tasks = 0; - SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq), + SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, rq, cpu_of(rq), prev_on_scx ? prev : NULL); flush_dispatch_buf(rq); @@ -3104,7 +3138,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) * Core-sched might decide to execute @p before it is * dispatched. Call ops_dequeue() to notify the BPF scheduler. */ - ops_dequeue(p, SCX_DEQ_CORE_SCHED_EXEC); + ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC); dispatch_dequeue(rq, p); } @@ -3112,7 +3146,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first) /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(running) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(SCX_KF_REST, running, p); + SCX_CALL_OP_TASK(SCX_KF_REST, running, rq, p); clr_task_runnable(p, true); @@ -3193,8 +3227,7 @@ static void switch_class(struct rq *rq, struct task_struct *next) .task = next, }; - SCX_CALL_OP(SCX_KF_CPU_RELEASE, - cpu_release, cpu_of(rq), &args); + SCX_CALL_OP(SCX_KF_CPU_RELEASE, cpu_release, rq, cpu_of(rq), &args); } rq->scx.cpu_released = true; } @@ -3207,7 +3240,7 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p, /* see dequeue_task_scx() on why we skip when !QUEUED */ if (SCX_HAS_OP(stopping) && (p->scx.flags & SCX_TASK_QUEUED)) - SCX_CALL_OP_TASK(SCX_KF_REST, stopping, p, true); + SCX_CALL_OP_TASK(SCX_KF_REST, stopping, rq, p, true); if (p->scx.flags & SCX_TASK_QUEUED) { set_task_runnable(rq, p); @@ -3345,7 +3378,7 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b, * verifier. */ if (SCX_HAS_OP(core_sched_before) && !scx_rq_bypassing(task_rq(a))) - return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, + return SCX_CALL_OP_2TASKS_RET(SCX_KF_REST, core_sched_before, NULL, (struct task_struct *)a, (struct task_struct *)b); else @@ -3382,7 +3415,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag *ddsp_taskp = p; cpu = SCX_CALL_OP_TASK_RET(SCX_KF_ENQUEUE | SCX_KF_SELECT_CPU, - select_cpu, p, prev_cpu, wake_flags); + select_cpu, NULL, p, prev_cpu, wake_flags); p->scx.selected_cpu = cpu; *ddsp_taskp = NULL; if (ops_cpu_valid(cpu, "from ops.select_cpu()")) @@ -3426,8 +3459,8 @@ static void set_cpus_allowed_scx(struct task_struct *p, * designation pointless. Cast it away when calling the operation. 
*/ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, - (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, NULL, + p, (struct cpumask *)p->cpus_ptr); } static void handle_hotplug(struct rq *rq, bool online) @@ -3440,9 +3473,9 @@ static void handle_hotplug(struct rq *rq, bool online) scx_idle_update_selcpu_topology(&scx_ops); if (online && SCX_HAS_OP(cpu_online)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_online, rq, cpu); else if (!online && SCX_HAS_OP(cpu_offline)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, cpu); + SCX_CALL_OP(SCX_KF_UNLOCKED, cpu_offline, rq, cpu); else scx_exit(SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG, "cpu %d going %s, exiting scheduler", cpu, @@ -3545,7 +3578,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued) curr->scx.slice = 0; touch_core_sched(rq, curr); } else if (SCX_HAS_OP(tick)) { - SCX_CALL_OP_TASK(SCX_KF_REST, tick, curr); + SCX_CALL_OP_TASK(SCX_KF_REST, tick, rq, curr); } if (!curr->scx.slice) @@ -3622,7 +3655,7 @@ static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork .fork = fork, }; - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, p, &args); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init_task, NULL, p, &args); if (unlikely(ret)) { ret = ops_sanitize_err("init_task", ret); return ret; @@ -3663,9 +3696,10 @@ static int scx_init_task(struct task_struct *p, struct task_group *tg, bool fork static void scx_enable_task(struct task_struct *p) { + struct rq *rq = task_rq(p); u32 weight; - lockdep_assert_rq_held(task_rq(p)); + lockdep_assert_rq_held(rq); /* * Set the weight before calling ops.enable() so that the scheduler @@ -3679,20 +3713,22 @@ static void scx_enable_task(struct task_struct *p) p->scx.weight = sched_weight_to_cgroup(weight); if (SCX_HAS_OP(enable)) - SCX_CALL_OP_TASK(SCX_KF_REST, enable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, enable, rq, p); scx_set_task_state(p, SCX_TASK_ENABLED); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, rq, p, p->scx.weight); } static void scx_disable_task(struct task_struct *p) { - lockdep_assert_rq_held(task_rq(p)); + struct rq *rq = task_rq(p); + + lockdep_assert_rq_held(rq); WARN_ON_ONCE(scx_get_task_state(p) != SCX_TASK_ENABLED); if (SCX_HAS_OP(disable)) - SCX_CALL_OP_TASK(SCX_KF_REST, disable, p); + SCX_CALL_OP_TASK(SCX_KF_REST, disable, rq, p); scx_set_task_state(p, SCX_TASK_READY); } @@ -3721,7 +3757,7 @@ static void scx_exit_task(struct task_struct *p) } if (SCX_HAS_OP(exit_task)) - SCX_CALL_OP_TASK(SCX_KF_REST, exit_task, p, &args); + SCX_CALL_OP_TASK(SCX_KF_REST, exit_task, task_rq(p), p, &args); scx_set_task_state(p, SCX_TASK_NONE); } @@ -3830,7 +3866,7 @@ static void reweight_task_scx(struct rq *rq, struct task_struct *p, p->scx.weight = sched_weight_to_cgroup(scale_load_down(lw->weight)); if (SCX_HAS_OP(set_weight)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, p, p->scx.weight); + SCX_CALL_OP_TASK(SCX_KF_REST, set_weight, rq, p, p->scx.weight); } static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio) @@ -3846,8 +3882,8 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p) * different scheduler class. Keep the BPF scheduler up-to-date. 
*/ if (SCX_HAS_OP(set_cpumask)) - SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, p, - (struct cpumask *)p->cpus_ptr); + SCX_CALL_OP_TASK(SCX_KF_REST, set_cpumask, rq, + p, (struct cpumask *)p->cpus_ptr); } static void switched_from_scx(struct rq *rq, struct task_struct *p) @@ -3908,7 +3944,7 @@ int scx_tg_online(struct task_group *tg) struct scx_cgroup_init_args args = { .weight = tg->scx_weight }; - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, NULL, tg->css.cgroup, &args); if (ret) ret = ops_sanitize_err("cgroup_init", ret); @@ -3930,7 +3966,7 @@ void scx_tg_offline(struct task_group *tg) percpu_down_read(&scx_cgroup_rwsem); if (SCX_HAS_OP(cgroup_exit) && (tg->scx_flags & SCX_TG_INITED)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, tg->css.cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, NULL, tg->css.cgroup); tg->scx_flags &= ~(SCX_TG_ONLINE | SCX_TG_INITED); percpu_up_read(&scx_cgroup_rwsem); @@ -3963,7 +3999,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) continue; if (SCX_HAS_OP(cgroup_prep_move)) { - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_prep_move, NULL, p, from, css->cgroup); if (ret) goto err; @@ -3977,8 +4013,8 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset) err: cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, - p->scx.cgrp_moving_from, css->cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, + p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } @@ -3996,8 +4032,8 @@ void scx_cgroup_move_task(struct task_struct *p) * cgrp_moving_from set. */ if (SCX_HAS_OP(cgroup_move) && !WARN_ON_ONCE(!p->scx.cgrp_moving_from)) - SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, p, - p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); + SCX_CALL_OP_TASK(SCX_KF_UNLOCKED, cgroup_move, NULL, + p, p->scx.cgrp_moving_from, tg_cgrp(task_group(p))); p->scx.cgrp_moving_from = NULL; } @@ -4016,8 +4052,8 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset) cgroup_taskset_for_each(p, css, tset) { if (SCX_HAS_OP(cgroup_cancel_move) && p->scx.cgrp_moving_from) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, p, - p->scx.cgrp_moving_from, css->cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_cancel_move, NULL, + p, p->scx.cgrp_moving_from, css->cgroup); p->scx.cgrp_moving_from = NULL; } out_unlock: @@ -4030,7 +4066,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight) if (scx_cgroup_enabled && tg->scx_weight != weight) { if (SCX_HAS_OP(cgroup_set_weight)) - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight, + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight, NULL, tg_cgrp(tg), weight); tg->scx_weight = weight; } @@ -4219,7 +4255,7 @@ static void scx_cgroup_exit(void) continue; rcu_read_unlock(); - SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, css->cgroup); + SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_exit, NULL, css->cgroup); rcu_read_lock(); css_put(css); @@ -4256,7 +4292,7 @@ static int scx_cgroup_init(void) continue; rcu_read_unlock(); - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init, NULL, css->cgroup, &args); if (ret) { css_put(css); @@ -4749,7 +4785,7 @@ static void scx_disable_workfn(struct kthread_work *work) } if (scx_ops.exit) - SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei); + SCX_CALL_OP(SCX_KF_UNLOCKED, exit, NULL, ei); 
cancel_delayed_work_sync(&scx_watchdog_work); @@ -4955,7 +4991,7 @@ static void scx_dump_task(struct seq_buf *s, struct scx_dump_ctx *dctx, if (SCX_HAS_OP(dump_task)) { ops_dump_init(s, " "); - SCX_CALL_OP(SCX_KF_REST, dump_task, dctx, p); + SCX_CALL_OP(SCX_KF_REST, dump_task, NULL, dctx, p); ops_dump_exit(); } @@ -5002,7 +5038,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) if (SCX_HAS_OP(dump)) { ops_dump_init(&s, ""); - SCX_CALL_OP(SCX_KF_UNLOCKED, dump, &dctx); + SCX_CALL_OP(SCX_KF_UNLOCKED, dump, NULL, &dctx); ops_dump_exit(); } @@ -5059,7 +5095,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) used = seq_buf_used(&ns); if (SCX_HAS_OP(dump_cpu)) { ops_dump_init(&ns, " "); - SCX_CALL_OP(SCX_KF_REST, dump_cpu, &dctx, cpu, idle); + SCX_CALL_OP(SCX_KF_REST, dump_cpu, NULL, &dctx, cpu, idle); ops_dump_exit(); } @@ -5315,7 +5351,7 @@ static int scx_enable(struct sched_ext_ops *ops, struct bpf_link *link) scx_idle_enable(ops); if (scx_ops.init) { - ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init); + ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init, NULL); if (ret) { ret = ops_sanitize_err("init", ret); cpus_read_unlock(); diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c index 023ae6df5e8ca..35aa309c95846 100644 --- a/kernel/sched/ext_idle.c +++ b/kernel/sched/ext_idle.c @@ -745,7 +745,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify) * managed by put_prev_task_idle()/set_next_task_idle(). */ if (SCX_HAS_OP(update_idle) && do_notify && !scx_rq_bypassing(rq)) - SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle); + SCX_CALL_OP(SCX_KF_REST, update_idle, rq, cpu_of(rq), idle); /* * Update the idle masks: -- 2.49.0 ^ permalink raw reply related [flat|nested] 18+ messages in thread
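Summarizing the call sites above, a descriptive comment along these lines could sit next to the SCX_CALL_OP*() macros to document what scx_locked_rq() returns to kfuncs invoked from each callback (derived purely from the rq argument each hunk passes):

	/*
	 * ops.enqueue/dequeue/runnable/running/stopping/quiescent/tick -> the locked rq
	 * ops.dispatch, ops.cpu_acquire/cpu_release                    -> the locked rq
	 * ops.cpu_online/cpu_offline                                   -> the hotplug CPU's rq
	 * ops.select_cpu, ops.core_sched_before                        -> NULL (no rq tracked)
	 * ops.init, ops.exit, ops.init_task, cgroup and dump callbacks -> NULL
	 * ops.set_cpumask                                              -> rq from switching_to_scx(),
	 *                                                                 NULL from set_cpus_allowed_scx()
	 */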
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-04-22 8:26 ` [PATCH 1/2] sched_ext: Track currently locked rq Andrea Righi @ 2025-07-15 9:13 ` Breno Leitao 2025-07-15 17:20 ` Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Breno Leitao @ 2025-07-15 9:13 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel Hello Andrea, On Tue, Apr 22, 2025 at 10:26:32AM +0200, Andrea Righi wrote: > Some kfuncs provided by sched_ext may need to operate on a struct rq, > but they can be invoked from various contexts, specifically, different > scx callbacks. > > While some of these callbacks are invoked with a particular rq already > locked, others are not. This makes it impossible for a kfunc to reliably > determine whether it's safe to access a given rq, triggering potential > bugs or unsafe behaviors, see for example [1]. > > To address this, track the currently locked rq whenever a sched_ext > callback is invoked via SCX_CALL_OP*(). > > This allows kfuncs that need to operate on an arbitrary rq to retrieve > the currently locked one and apply the appropriate action as needed. > > [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ > > Suggested-by: Tejun Heo <tj@kernel.org> > Signed-off-by: Andrea Righi <arighi@nvidia.com> > --- > kernel/sched/ext.c | 152 +++++++++++++++++++++++++--------------- > kernel/sched/ext_idle.c | 2 +- > 2 files changed, 95 insertions(+), 59 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index bb0873411d798..3365b447cbdb8 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -1116,8 +1116,38 @@ static void scx_kf_disallow(u32 mask) > current->scx.kf_mask &= ~mask; > } > > -#define SCX_CALL_OP(mask, op, args...) \ > +/* > + * Track the rq currently locked. > + * > + * This allows kfuncs to safely operate on rq from any scx ops callback, > + * knowing which rq is already locked. > + */ > +static DEFINE_PER_CPU(struct rq *, locked_rq); > + > +static inline void update_locked_rq(struct rq *rq) > +{ > + /* > + * Check whether @rq is actually locked. This can help expose bugs > + * or incorrect assumptions about the context in which a kfunc or > + * callback is executed. > + */ > + if (rq) Why do you only need to lock when rq is not NULL? > + lockdep_assert_rq_held(rq); > + __this_cpu_write(locked_rq, rq); This is hitting the following BUG() on some of my debug kernels: BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 I have lockdep enabled, and I don't see the assert above. I am wondering if rq is locked but preemption continues to be enabled (!?) Also, I am wondering if the following patch would be useful. diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index b498d867ba210..c458dd7928e92 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1256,8 +1256,10 @@ static inline void update_locked_rq(struct rq *rq) * or incorrect assumptions about the context in which a kfunc or * callback is executed. 
*/ - if (rq) + if (rq) { lockdep_assert_rq_held(rq); + } + WARN_ON_ONCE(preemptible()); __this_cpu_write(locked_rq, rq); } This is the full stack for the BUG: above: BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 caller is bpf_scx_reg (kernel/sched/ext.c:1261 kernel/sched/ext.c:5558 kernel/sched/ext.c:5879) Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [N]=TEST Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM20 02/01/2023 Sched_ext: layered (enabling) Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:122) check_preemption_disabled (lib/smp_processor_id.c:?) bpf_scx_reg (kernel/sched/ext.c:1261 kernel/sched/ext.c:5558 kernel/sched/ext.c:5879) ? bpf_struct_ops_link_create (kernel/bpf/bpf_struct_ops.c:?) ? rcu_is_watching (./include/linux/context_tracking.h:128 kernel/rcu/tree.c:745) ? trace_contention_end (./include/trace/events/lock.h:122) ? __mutex_lock (./arch/x86/include/asm/preempt.h:104 kernel/locking/mutex.c:612 kernel/locking/mutex.c:747) ? __raw_spin_lock_init (./include/linux/lockdep.h:135 ./include/linux/lockdep.h:142 kernel/locking/spinlock_debug.c:25) bpf_struct_ops_link_create (kernel/bpf/bpf_struct_ops.c:1367) __sys_bpf (kernel/bpf/syscall.c:5907) ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) __x64_sys_bpf (kernel/bpf/syscall.c:5943 kernel/bpf/syscall.c:5941 kernel/bpf/syscall.c:5941) do_syscall_64 (arch/x86/entry/syscall_64.c:?) ? exc_page_fault (arch/x86/mm/fault.c:1536) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-07-15 9:13 ` Breno Leitao @ 2025-07-15 17:20 ` Andrea Righi 2025-07-16 10:47 ` Breno Leitao 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-07-15 17:20 UTC (permalink / raw) To: Breno Leitao; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel Hi Breno, On Tue, Jul 15, 2025 at 02:13:03AM -0700, Breno Leitao wrote: > Hello Andrea, > > On Tue, Apr 22, 2025 at 10:26:32AM +0200, Andrea Righi wrote: > > Some kfuncs provided by sched_ext may need to operate on a struct rq, > > but they can be invoked from various contexts, specifically, different > > scx callbacks. > > > > While some of these callbacks are invoked with a particular rq already > > locked, others are not. This makes it impossible for a kfunc to reliably > > determine whether it's safe to access a given rq, triggering potential > > bugs or unsafe behaviors, see for example [1]. > > > > To address this, track the currently locked rq whenever a sched_ext > > callback is invoked via SCX_CALL_OP*(). > > > > This allows kfuncs that need to operate on an arbitrary rq to retrieve > > the currently locked one and apply the appropriate action as needed. > > > > [1] https://lore.kernel.org/lkml/20250325140021.73570-1-arighi@nvidia.com/ > > > > Suggested-by: Tejun Heo <tj@kernel.org> > > Signed-off-by: Andrea Righi <arighi@nvidia.com> > > --- > > kernel/sched/ext.c | 152 +++++++++++++++++++++++++--------------- > > kernel/sched/ext_idle.c | 2 +- > > 2 files changed, 95 insertions(+), 59 deletions(-) > > > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > > index bb0873411d798..3365b447cbdb8 100644 > > --- a/kernel/sched/ext.c > > +++ b/kernel/sched/ext.c > > @@ -1116,8 +1116,38 @@ static void scx_kf_disallow(u32 mask) > > current->scx.kf_mask &= ~mask; > > } > > > > -#define SCX_CALL_OP(mask, op, args...) \ > > +/* > > + * Track the rq currently locked. > > + * > > + * This allows kfuncs to safely operate on rq from any scx ops callback, > > + * knowing which rq is already locked. > > + */ > > +static DEFINE_PER_CPU(struct rq *, locked_rq); > > + > > +static inline void update_locked_rq(struct rq *rq) > > +{ > > + /* > > + * Check whether @rq is actually locked. This can help expose bugs > > + * or incorrect assumptions about the context in which a kfunc or > > + * callback is executed. > > + */ > > + if (rq) > > Why do you only need to lock when rq is not NULL? We're not actually locking here, but we're checking if the rq that we pass to update_locked_rq() is actually locked, since that's our assumption here. The idea is to distinguish the sched_ext callbacks that are invoked with a rq locked and those that are invoked from an unlocked context (and in this case rq == NULL). > > > + lockdep_assert_rq_held(rq); > > + __this_cpu_write(locked_rq, rq); > > This is hitting the following BUG() on some of my debug kernels: > > BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 > > I have lockdep enabled, and I don't see the assert above. I am wondering > if rq is locked but preemption continues to be enabled (!?) Interesting. And it makes sense, because we may have callbacks called from a preemptible context (especially when rq == NULL). I think we can just put a preempt_disable() / preempt_enable() around __this_cpu_write(). If we jump to another CPU during the callback it's fine, since we would track the rq state on the other CPU with its own local variable. 
And if we were able to jump there, it means that preemption was disabled as well. > > Also, I am wondering if the following patch would be useful. > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index b498d867ba210..c458dd7928e92 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -1256,8 +1256,10 @@ static inline void update_locked_rq(struct rq *rq) > * or incorrect assumptions about the context in which a kfunc or > * callback is executed. > */ > - if (rq) > + if (rq) { > lockdep_assert_rq_held(rq); > + } > + WARN_ON_ONCE(preemptible()); > __this_cpu_write(locked_rq, rq); > } We could do that as well and fix individual use cases, but I think adding preempt_disable/enable() would be better. We do have callbacks that can be called from a preemptible context. Thanks, -Andrea > > This is the full stack for the BUG: above: > > BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 > caller is bpf_scx_reg (kernel/sched/ext.c:1261 kernel/sched/ext.c:5558 kernel/sched/ext.c:5879) > Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [N]=TEST > Hardware name: Wiwynn Twin Lakes MP/Twin Lakes Passive MP, BIOS YMM20 02/01/2023 > Sched_ext: layered (enabling) > Call Trace: > <TASK> > dump_stack_lvl (lib/dump_stack.c:122) > check_preemption_disabled (lib/smp_processor_id.c:?) > bpf_scx_reg (kernel/sched/ext.c:1261 kernel/sched/ext.c:5558 kernel/sched/ext.c:5879) > ? bpf_struct_ops_link_create (kernel/bpf/bpf_struct_ops.c:?) > ? rcu_is_watching (./include/linux/context_tracking.h:128 kernel/rcu/tree.c:745) > ? trace_contention_end (./include/trace/events/lock.h:122) > ? __mutex_lock (./arch/x86/include/asm/preempt.h:104 kernel/locking/mutex.c:612 kernel/locking/mutex.c:747) > ? __raw_spin_lock_init (./include/linux/lockdep.h:135 ./include/linux/lockdep.h:142 kernel/locking/spinlock_debug.c:25) > bpf_struct_ops_link_create (kernel/bpf/bpf_struct_ops.c:1367) > __sys_bpf (kernel/bpf/syscall.c:5907) > ? entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) > __x64_sys_bpf (kernel/bpf/syscall.c:5943 kernel/bpf/syscall.c:5941 kernel/bpf/syscall.c:5941) > do_syscall_64 (arch/x86/entry/syscall_64.c:?) > ? exc_page_fault (arch/x86/mm/fault.c:1536) > entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-07-15 17:20 ` Andrea Righi @ 2025-07-16 10:47 ` Breno Leitao 2025-07-16 12:40 ` Andrea Righi 0 siblings, 1 reply; 18+ messages in thread From: Breno Leitao @ 2025-07-16 10:47 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel Hello Andrea, On Tue, Jul 15, 2025 at 07:20:28PM +0200, Andrea Righi wrote: > > On Tue, Apr 22, 2025 at 10:26:32AM +0200, Andrea Righi wrote: > > > > > + lockdep_assert_rq_held(rq); > > > + __this_cpu_write(locked_rq, rq); > > > > This is hitting the following BUG() on some of my debug kernels: > > > > BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 > > > > I have lockdep enabled, and I don't see the assert above. I am wondering > > if rq is locked but preemption continues to be enabled (!?) > > Interesting. And it makes sense, because we may have callbacks called from > a preemptible context (especially when rq == NULL). > > I think we can just put a preempt_disable() / preempt_enable() around > __this_cpu_write(). If we jump to another CPU during the callback it's > fine, since we would track the rq state on the other CPU with its own local > variable. And if we were able to jump there, it means that preemption was > disabled as well. First of all thanks for the suggestion! What about a patch like the following: commit 9ed31e914181ec8f2d0b4484c42b00b6794661b9 Author: Breno Leitao <leitao@debian.org> Date: Wed Jul 16 03:10:59 2025 -0700 sched/ext: Suppress warning in __this_cpu_write() by disabling preemption __this_cpu_write() emits a warning if used with preemption enabled. Function update_locked_rq() might be called with preemption enabled, which causes the following warning: BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 Disable preemption around the __this_cpu_write() call in update_locked_rq() to suppress the warning, without affecting behavior. If preemption triggers a jump to another CPU during the callback it's fine, since we would track the rq state on the other CPU with its own local variable. Suggested-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Breno Leitao <leitao@debian.org> Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index b498d867ba210..24fcbd7331f73 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -1258,7 +1258,14 @@ static inline void update_locked_rq(struct rq *rq) */ if (rq) lockdep_assert_rq_held(rq); + /* + * __this_cpu_write() emits a warning when used with preemption enabled. + * While there's no functional issue if the callback runs on another + * CPU, we disable preemption here solely to suppress that warning. + */ + preempt_disable(); __this_cpu_write(locked_rq, rq); + preempt_enable(); } /* ^ permalink raw reply related [flat|nested] 18+ messages in thread
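For comparison, an alternative that would avoid the explicit preempt_disable()/preempt_enable() pair is the non-underscored per-CPU accessor, which is safe to call with preemption enabled and does not trip the DEBUG_PREEMPT check — a sketch only, not what this thread settles on:

	static inline void update_locked_rq(struct rq *rq)
	{
		if (rq)
			lockdep_assert_rq_held(rq);
		/* this_cpu_write() may be used with preemption enabled. */
		this_cpu_write(locked_rq, rq);
	}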
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-07-16 10:47 ` Breno Leitao @ 2025-07-16 12:40 ` Andrea Righi 2025-07-16 12:43 ` Breno Leitao 0 siblings, 1 reply; 18+ messages in thread From: Andrea Righi @ 2025-07-16 12:40 UTC (permalink / raw) To: Breno Leitao; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel Hi Breno, On Wed, Jul 16, 2025 at 03:47:38AM -0700, Breno Leitao wrote: > Hello Andrea, > > On Tue, Jul 15, 2025 at 07:20:28PM +0200, Andrea Righi wrote: > > > On Tue, Apr 22, 2025 at 10:26:32AM +0200, Andrea Righi wrote: > > > > > > > > + lockdep_assert_rq_held(rq); > > > > + __this_cpu_write(locked_rq, rq); > > > > > > This is hitting the following BUG() on some of my debug kernels: > > > > > > BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 > > > > > > I have lockdep enabled, and I don't see the assert above. I am wondering > > > if rq is locked but preemption continues to be enabled (!?) > > > > Interesting. And it makes sense, because we may have callbacks called from > > a preemptible context (especially when rq == NULL). > > > > I think we can just put a preempt_disable() / preempt_enable() around > > __this_cpu_write(). If we jump to another CPU during the callback it's > > fine, since we would track the rq state on the other CPU with its own local > > variable. And if we were able to jump there, it means that preemption was > > disabled as well. > > First of all thanks for the suggestion! > > What about a patch like the following: Looks good to me, feel free to add my: Acked-by: Andrea Righi <arighi@nvidia.com> Thanks, -Andrea > > commit 9ed31e914181ec8f2d0b4484c42b00b6794661b9 > Author: Breno Leitao <leitao@debian.org> > Date: Wed Jul 16 03:10:59 2025 -0700 > > sched/ext: Suppress warning in __this_cpu_write() by disabling preemption > > __this_cpu_write() emits a warning if used with preemption enabled. > > Function update_locked_rq() might be called with preemption enabled, > which causes the following warning: > > BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 > > Disable preemption around the __this_cpu_write() call in > update_locked_rq() to suppress the warning, without affecting behavior. > > If preemption triggers a jump to another CPU during the callback it's > fine, since we would track the rq state on the other CPU with its own > local variable. > > Suggested-by: Andrea Righi <arighi@nvidia.com> > Signed-off-by: Breno Leitao <leitao@debian.org> > Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index b498d867ba210..24fcbd7331f73 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -1258,7 +1258,14 @@ static inline void update_locked_rq(struct rq *rq) > */ > if (rq) > lockdep_assert_rq_held(rq); > + /* > + * __this_cpu_write() emits a warning when used with preemption enabled. > + * While there's no functional issue if the callback runs on another > + * CPU, we disable preemption here solely to suppress that warning. > + */ > + preempt_disable(); > __this_cpu_write(locked_rq, rq); > + preempt_enable(); > } > > /* ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH 1/2] sched_ext: Track currently locked rq 2025-07-16 12:40 ` Andrea Righi @ 2025-07-16 12:43 ` Breno Leitao 0 siblings, 0 replies; 18+ messages in thread From: Breno Leitao @ 2025-07-16 12:43 UTC (permalink / raw) To: Andrea Righi; +Cc: Tejun Heo, David Vernet, Changwoo Min, linux-kernel On Wed, Jul 16, 2025 at 02:40:14PM +0200, Andrea Righi wrote: > Hi Breno, > > On Wed, Jul 16, 2025 at 03:47:38AM -0700, Breno Leitao wrote: > > Hello Andrea, > > > > On Tue, Jul 15, 2025 at 07:20:28PM +0200, Andrea Righi wrote: > > > > On Tue, Apr 22, 2025 at 10:26:32AM +0200, Andrea Righi wrote: > > > > > > > > > > > + lockdep_assert_rq_held(rq); > > > > > + __this_cpu_write(locked_rq, rq); > > > > > > > > This is hitting the following BUG() on some of my debug kernels: > > > > > > > > BUG: using __this_cpu_write() in preemptible [00000000] code: scx_layered_6-9/68770 > > > > > > > > I have lockdep enabled, and I don't see the assert above. I am wondering > > > > if rq is locked but preemption continues to be enabled (!?) > > > > > > Interesting. And it makes sense, because we may have callbacks called from > > > a preemptible context (especially when rq == NULL). > > > > > > I think we can just put a preempt_disable() / preempt_enable() around > > > __this_cpu_write(). If we jump to another CPU during the callback it's > > > fine, since we would track the rq state on the other CPU with its own local > > > variable. And if we were able to jump there, it means that preemption was > > > disabled as well. > > > > First of all thanks for the suggestion! > > > > What about a patch like the following: > > Looks good to me, feel free to add my: > > Acked-by: Andrea Righi <arighi@nvidia.com> Thanks. I will send it to the mailing list them. --breno ^ permalink raw reply [flat|nested] 18+ messages in thread