* [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix locking enable/disable path bugs including a locking order one
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad
Aboorva Devarajan reported an issue where sched_ext init code can
occasionally deadlock when scheduler loading races with CPU hotplug. The
deadlock scenario is as follows:
  scx_ops_enable()                            hotplug

                                              percpu_down_write(&cpu_hotplug_lock)
  percpu_down_write(&scx_fork_rwsem)
  block on cpu_hotplug_lock
                                              kthread_create() waits for kthreadd
                                              kthreadd blocks on scx_fork_rwsem
Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.
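To make the shape of the cycle concrete, here is a minimal userspace analogy
(a sketch, not kernel code: rwlock A stands in for scx_fork_rwsem, rwlock B
for cpu_hotplug_lock, and the helper thread for kthreadd; the sleeps only
force the bad interleaving). Built with "cc -pthread", it hangs by design:

  #include <pthread.h>
  #include <unistd.h>

  static pthread_rwlock_t A = PTHREAD_RWLOCK_INITIALIZER; /* ~scx_fork_rwsem */
  static pthread_rwlock_t B = PTHREAD_RWLOCK_INITIALIZER; /* ~cpu_hotplug_lock */

  static void *helper(void *arg)          /* ~kthreadd */
  {
          pthread_rwlock_rdlock(&A);      /* blocks: main holds A for write */
          pthread_rwlock_unlock(&A);
          return NULL;
  }

  static void *hotplug(void *arg)
  {
          pthread_t h;

          pthread_rwlock_wrlock(&B);      /* hotplug in progress */
          usleep(200 * 1000);             /* let main grab A and block on B */
          pthread_create(&h, NULL, helper, NULL);
          pthread_join(h, NULL);          /* waits for helper, which waits on A */
          pthread_rwlock_unlock(&B);
          return NULL;
  }

  int main(void)
  {
          pthread_t hp;

          pthread_create(&hp, NULL, hotplug, NULL);
          usleep(100 * 1000);             /* ensure hotplug holds B first */
          pthread_rwlock_wrlock(&A);      /* enable path takes "fork rwsem" */
          pthread_rwlock_rdlock(&B);      /* ...then blocks on "hotplug lock" */
          pthread_join(hp, NULL);         /* never reached */
          return 0;
  }

As in the kernel case, no single thread takes A and B in the reverse order;
the reversal hides behind the helper thread.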
This is primarily caused by SCX enable/disable paths grabbing big locks
together. This patchset updates the enable/disable paths to decouple the
locks. In the process, it also fixes several subtle bugs in the enable path.
This patchset contains the following patches:
0001-sched_ext-Relocate-check_hotplug_seq-call-in-scx_ops.patch
0002-sched_ext-Remove-SCX_OPS_PREPPING.patch
0003-sched_ext-Initialize-in-bypass-mode.patch
0004-sched_ext-Fix-SCX_TASK_INIT-SCX_TASK_READY-transitio.patch
0005-sched_ext-Enable-scx_ops_init_task-separately.patch
0006-sched_ext-Add-scx_cgroup_enabled-to-gate-cgroup-oper.patch
0007-sched_ext-Decouple-locks-in-scx_ops_disable_workfn.patch
0008-sched_ext-Decouple-locks-in-scx_ops_enable.patch
0001-0002 are prep patches.
0003 removes a race window in the enable path that can cause stalls and
prepares for further locking updates.
0004-0005 remove race windows in the enable path that can cause invalid task
state transitions.
0006 fixes a bug in cgroup enable path which can skip invocation of
ops.cgroup_exit() and prepares for further locking updates.
0007-0008 decouple the big locks and fix the deadlock.
This patchset can also be found in the following git branch:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-enable-locking-fix
diffstat follows. Thanks.
kernel/sched/ext.c | 199 ++++++++++++++++++++++++++++++++------------------------------------
1 file changed, 94 insertions(+), 105 deletions(-)
--
tejun
* [PATCH 1/8] sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
check_hotplug_seq() is used to detect CPU hotplug events which occurred while
the BPF scheduler is being loaded so that initialization can be retried if
hotplug events take place before the CPU hotplug callbacks are online. As
such, the best place to call it is in the same cpus_read_lock() section that
enables the CPU hotplug ops. Currently, however, it is called in the next
cpus_read_lock() block in scx_ops_enable(). The side effect of this placement
is a small window in which hotplug sequence detection can trigger
unnecessarily, which isn't critical.
Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
the hotplug operation enablement to close the window and get the invocation
out of the way for planned locking updates.
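For context, the check follows the usual sequence-count pattern; a
hypothetical sketch with illustrative names (not the kernel implementation,
which reports the race through the error path instead of a return code):

  /* bumped by the hotplug callbacks; compared against the value the BPF
   * scheduler recorded when it was loaded */
  static unsigned long hotplug_seq;

  /* only meaningful while hotplug is excluded, i.e. with cpus_read_lock()
   * held - otherwise a hotplug event can slip in right after the check */
  static int check_hotplug_seq_sketch(unsigned long seq_at_load)
  {
          if (seq_at_load != READ_ONCE(hotplug_seq))
                  return -EBUSY;  /* raced with hotplug, abort and retry */
          return 0;
  }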
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c09e3dc38c34..95ed822fa563 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5005,6 +5005,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
if (((void (**)(void))ops)[i])
static_branch_enable_cpuslocked(&scx_has_op[i]);
+ check_hotplug_seq(ops);
cpus_read_unlock();
ret = validate_ops(ops);
@@ -5053,8 +5054,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
cpus_read_lock();
scx_cgroup_lock();
- check_hotplug_seq(ops);
-
for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++)
if (((void (**)(void))ops)[i])
static_branch_enable_cpuslocked(&scx_has_op[i]);
--
2.46.0
* [PATCH 2/8] sched_ext: Remove SCX_OPS_PREPPING
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
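For reference, the enable-state machine left after this patch, sketched (a
condensation of the enum and atomic helpers in ext.c):

  enum scx_ops_enable_state {
          SCX_OPS_ENABLING,       /* scx_ops_enable() in progress */
          SCX_OPS_ENABLED,        /* fully enabled */
          SCX_OPS_DISABLING,      /* scx_ops_disable_workfn() in progress */
          SCX_OPS_DISABLED,       /* fully disabled */
  };

  static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);

  /* cmpxchg-style transition; fails if e.g. the disable path has already
   * moved the state because scx_bpf_error() fired during enabling */
  static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to,
                                          enum scx_ops_enable_state from)
  {
          int from_v = from;

          return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to);
  }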
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 29 +++++++----------------------
1 file changed, 7 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 95ed822fa563..2422b2abee6e 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -778,7 +778,6 @@ enum scx_tg_flags {
};
enum scx_ops_enable_state {
- SCX_OPS_PREPPING,
SCX_OPS_ENABLING,
SCX_OPS_ENABLED,
SCX_OPS_DISABLING,
@@ -786,7 +785,6 @@ enum scx_ops_enable_state {
};
static const char *scx_ops_enable_state_str[] = {
- [SCX_OPS_PREPPING] = "prepping",
[SCX_OPS_ENABLING] = "enabling",
[SCX_OPS_ENABLED] = "enabled",
[SCX_OPS_DISABLING] = "disabling",
@@ -4971,12 +4969,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
}
/*
- * Set scx_ops, transition to PREPPING and clear exit info to arm the
+ * Set scx_ops, transition to ENABLING and clear exit info to arm the
* disable path. Failure triggers full disabling from here on.
*/
scx_ops = *ops;
- WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) !=
+ WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_ENABLING) !=
SCX_OPS_DISABLED);
atomic_set(&scx_exit_kind, SCX_EXIT_NONE);
@@ -5129,23 +5127,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
*/
preempt_disable();
- /*
- * From here on, the disable path must assume that tasks have ops
- * enabled and need to be recovered.
- *
- * Transition to ENABLING fails iff the BPF scheduler has already
- * triggered scx_bpf_error(). Returning an error code here would lose
- * the recorded error information. Exit indicating success so that the
- * error is notified through ops.exit() with all the details.
- */
- if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) {
- preempt_enable();
- spin_unlock_irq(&scx_tasks_lock);
- WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE);
- ret = 0;
- goto err_disable_unlock_all;
- }
-
/*
* We're fully committed and can't fail. The PREPPED -> ENABLED
* transitions here are synchronized against sched_ext_free() through
@@ -5176,7 +5157,11 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
cpus_read_unlock();
percpu_up_write(&scx_fork_rwsem);
- /* see above ENABLING transition for the explanation on exiting with 0 */
+ /*
+ * Returning an error code here would lose the recorded error
+ * information. Exit indicating success so that the error is notified
+ * through ops.exit() with all the details.
+ */
if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
WARN_ON_ONCE(atomic_read(&scx_exit_kind) == SCX_EXIT_NONE);
ret = 0;
--
2.46.0
* [PATCH 3/8] sched_ext: Initialize in bypass mode
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
scx_ops_enable() used preempt_disable() around the task iteration loop that
switches tasks into SCX to guarantee forward progress of the task which is
running scx_ops_enable(). However, in the gap between setting
__scx_ops_enabled and preempt_disable(), an external entity can put tasks,
including the enabling one, into SCX prematurely, which can lead to
malfunctions including stalls.
The bypass mode can wrap the entire enabling operation and guarantee forward
progress no matter what the BPF scheduler does. Use the bypass mode instead
to guarantee forward progress while enabling.
While at it, release and regrab scx_tasks_lock between the two task iteration
loops in scx_ops_enable() for clarity, as there is no reason to keep holding
the lock between them.
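For readers new to the mechanism, the bypass toggle is depth-counted so that
nested users compose; a condensed sketch (the real scx_ops_bypass()
additionally forces tasks onto a simple per-rq dispatch path):

  static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0);

  static void scx_ops_bypass_sketch(bool bypass)
  {
          if (bypass) {
                  if (atomic_inc_return(&scx_ops_bypass_depth) != 1)
                          return;         /* already bypassing */
                  /* from here on, forward progress must not depend on
                   * the BPF scheduler's behavior */
          } else {
                  if (atomic_dec_return(&scx_ops_bypass_depth) != 0)
                          return;         /* an outer user still bypassing */
                  /* resume normal BPF scheduler operation */
          }
  }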
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2422b2abee6e..1b4b6439cd1a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5030,6 +5030,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
queue_delayed_work(system_unbound_wq, &scx_watchdog_work,
scx_watchdog_timeout / 2);
+ /*
+ * Once __scx_ops_enabled is set, %current can be switched to SCX
+ * anytime. This can lead to stalls as some BPF schedulers (e.g.
+ * userspace scheduling) may not function correctly before all tasks are
+ * switched. Init in bypass mode to guarantee forward progress.
+ */
+ scx_ops_bypass(true);
+
/*
* Lock out forks, cgroup on/offlining and moves before opening the
* floodgate so that they don't wander into the operations prematurely.
@@ -5089,7 +5097,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
* disabled.
*/
spin_lock_irq(&scx_tasks_lock);
-
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
/*
@@ -5118,22 +5125,19 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
spin_lock_irq(&scx_tasks_lock);
}
scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
/*
- * All tasks are prepped but are still ops-disabled. Ensure that
- * %current can't be scheduled out and switch everyone.
- * preempt_disable() is necessary because we can't guarantee that
- * %current won't be starved if scheduled out while switching.
+ * All tasks are prepped but the tasks are not enabled. Switch everyone.
*/
- preempt_disable();
+ WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
/*
* We're fully committed and can't fail. The PREPPED -> ENABLED
* transitions here are synchronized against sched_ext_free() through
* scx_tasks_lock.
*/
- WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
-
+ spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
const struct sched_class *old_class = p->sched_class;
@@ -5150,12 +5154,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
check_class_changed(task_rq(p), p, old_class, p->prio);
}
scx_task_iter_exit(&sti);
-
spin_unlock_irq(&scx_tasks_lock);
- preempt_enable();
+
scx_cgroup_unlock();
cpus_read_unlock();
percpu_up_write(&scx_fork_rwsem);
+ scx_ops_bypass(false);
/*
* Returning an error code here would lose the recorded error
@@ -5196,6 +5200,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
err_disable_unlock_all:
scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
+ scx_ops_bypass(false);
err_disable_unlock_cpus:
cpus_read_unlock();
err_disable:
--
2.46.0
* [PATCH 4/8] sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the second switches the eligible ones
into SCX. The first loop left the tasks in the SCX_TASK_INIT state and the
second loop then switched them into READY before switching each task into
SCX.
The distinction between INIT and READY is only meaningful in the fork path
where it's used to tell whether the task finished forking so that we can
tell ops.exit_task() accordingly. Leaving tasks in the INIT state between
the two loops is inconsistent with the fork path and incorrect. The
following can be triggered by running a program which keeps toggling a task
between SCHED_OTHER and SCHED_EXT while a BPF scheduler is being enabled:
sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: qmap (enabling+all)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by transitioning to READY in the first loop right after
scx_ops_init_task() succeeds.
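For reference, the per-task states sketched with the values that appear in
the warning above (the rejected transition 1 -> 3 is INIT -> ENABLED,
skipping READY):

  enum scx_task_state {
          SCX_TASK_NONE,          /* 0: sched_ext not involved */
          SCX_TASK_INIT,          /* 1: ops.init_task() called, fork may
                                   *    still fail afterwards */
          SCX_TASK_READY,         /* 2: fully initialized, may enter SCX */
          SCX_TASK_ENABLED,       /* 3: fully initialized and on SCX */
  };

In the enable direction the state must advance one step at a time; skipping
READY is what trips the warning.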
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
---
kernel/sched/ext.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 1b4b6439cd1a..fe86231e021d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5121,6 +5121,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
goto err_disable_unlock_all;
}
+ scx_set_task_state(p, SCX_TASK_READY);
+
put_task_struct(p);
spin_lock_irq(&scx_tasks_lock);
}
@@ -5133,7 +5135,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
/*
- * We're fully committed and can't fail. The PREPPED -> ENABLED
+ * We're fully committed and can't fail. The task READY -> ENABLED
* transitions here are synchronized against sched_ext_free() through
* scx_tasks_lock.
*/
@@ -5145,7 +5147,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
- scx_set_task_state(p, SCX_TASK_READY);
__setscheduler_prio(p, p->prio);
check_class_changing(task_rq(p), p, old_class);
--
2.46.0
* [PATCH 5/8] sched_ext: Enable scx_ops_init_task() separately
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by the scx_enabled() test and thus __scx_ops_enabled had to be
turned on before the first scx_ops_init_task() loop in scx_ops_enable().
However, if an external entity causes a sched_class switch before the loop
is complete, tasks which are not yet initialized can be switched to SCX.
The following can be reproduced by running a program which keeps toggling a
process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: simple (enabling)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by gating scx_ops_init_task() separately using
scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
finished with scx_ops_init_task().
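The resulting ordering, condensed (a sketch of the flow after this patch,
not verbatim code):

  /* fork path may now init tasks, but nothing can switch to SCX yet */
  scx_ops_init_task_enabled = true;

  /* first loop: scx_ops_init_task() + SCX_TASK_READY for every task */

  /* only now, with every task initialized, can tasks enter SCX */
  static_branch_enable_cpuslocked(&__scx_ops_enabled);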
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fe86231e021d..feb7c620f9c6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -852,6 +852,7 @@ DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
static atomic_t scx_ops_bypass_depth = ATOMIC_INIT(0);
+static bool scx_ops_init_task_enabled;
static bool scx_switching_all;
DEFINE_STATIC_KEY_FALSE(__scx_switched_all);
@@ -3548,7 +3549,7 @@ int scx_fork(struct task_struct *p)
{
percpu_rwsem_assert_held(&scx_fork_rwsem);
- if (scx_enabled())
+ if (scx_ops_init_task_enabled)
return scx_ops_init_task(p, task_group(p), true);
else
return 0;
@@ -3556,7 +3557,7 @@ int scx_fork(struct task_struct *p)
void scx_post_fork(struct task_struct *p)
{
- if (scx_enabled()) {
+ if (scx_ops_init_task_enabled) {
scx_set_task_state(p, SCX_TASK_READY);
/*
@@ -4436,6 +4437,8 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
cpus_read_lock();
scx_cgroup_lock();
+ scx_ops_init_task_enabled = false;
+
spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
/*
@@ -5087,7 +5090,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
if (ret)
goto err_disable_unlock_all;
- static_branch_enable_cpuslocked(&__scx_ops_enabled);
+ WARN_ON_ONCE(scx_ops_init_task_enabled);
+ scx_ops_init_task_enabled = true;
/*
* Enable ops for every task. Fork is excluded by scx_fork_rwsem
@@ -5130,9 +5134,11 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
spin_unlock_irq(&scx_tasks_lock);
/*
- * All tasks are prepped but the tasks are not enabled. Switch everyone.
+ * All tasks are READY. It's safe to turn on scx_enabled() and switch
+ * all eligible tasks.
*/
WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
+ static_branch_enable_cpuslocked(&__scx_ops_enabled);
/*
* We're fully committed and can't fail. The task READY -> ENABLED
--
2.46.0
* [PATCH 6/8] sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
doesn't set SCX_TG_INITED, which means that ops.cgroup_exit(), even if
implemented, won't be called from scx_tg_offline(). This is because
SCX_HAS_OP(cgroup_init) is used to test both whether SCX cgroup operations
are enabled and whether ops.cgroup_init() exists.
Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
operations and use SCX_HAS_OP(cgroup_init) only to test whether
ops.cgroup_init() exists. Make all cgroup operations consistently use
scx_cgroup_enabled to test whether cgroup operations are enabled.
scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
locking updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index feb7c620f9c6..06dc1011312d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3689,6 +3689,7 @@ bool scx_can_stop_tick(struct rq *rq)
#ifdef CONFIG_EXT_GROUP_SCHED
DEFINE_STATIC_PERCPU_RWSEM(scx_cgroup_rwsem);
+static bool scx_cgroup_enabled;
static bool cgroup_warned_missing_weight;
static bool cgroup_warned_missing_idle;
@@ -3708,8 +3709,7 @@ static void scx_cgroup_warn_missing_weight(struct task_group *tg)
static void scx_cgroup_warn_missing_idle(struct task_group *tg)
{
- if (scx_ops_enable_state() == SCX_OPS_DISABLED ||
- cgroup_warned_missing_idle)
+ if (!scx_cgroup_enabled || cgroup_warned_missing_idle)
return;
if (!tg->idle)
@@ -3730,15 +3730,18 @@ int scx_tg_online(struct task_group *tg)
scx_cgroup_warn_missing_weight(tg);
- if (SCX_HAS_OP(cgroup_init)) {
- struct scx_cgroup_init_args args = { .weight = tg->scx_weight };
+ if (scx_cgroup_enabled) {
+ if (SCX_HAS_OP(cgroup_init)) {
+ struct scx_cgroup_init_args args =
+ { .weight = tg->scx_weight };
- ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
- tg->css.cgroup, &args);
- if (!ret)
+ ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, cgroup_init,
+ tg->css.cgroup, &args);
+ if (ret)
+ ret = ops_sanitize_err("cgroup_init", ret);
+ }
+ if (ret == 0)
tg->scx_flags |= SCX_TG_ONLINE | SCX_TG_INITED;
- else
- ret = ops_sanitize_err("cgroup_init", ret);
} else {
tg->scx_flags |= SCX_TG_ONLINE;
}
@@ -3769,7 +3772,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
/* released in scx_finish/cancel_attach() */
percpu_down_read(&scx_cgroup_rwsem);
- if (!scx_enabled())
+ if (!scx_cgroup_enabled)
return 0;
cgroup_taskset_for_each(p, css, tset) {
@@ -3812,7 +3815,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
void scx_move_task(struct task_struct *p)
{
- if (!scx_enabled())
+ if (!scx_cgroup_enabled)
return;
/*
@@ -3848,7 +3851,7 @@ void scx_cgroup_cancel_attach(struct cgroup_taskset *tset)
struct cgroup_subsys_state *css;
struct task_struct *p;
- if (!scx_enabled())
+ if (!scx_cgroup_enabled)
goto out_unlock;
cgroup_taskset_for_each(p, css, tset) {
@@ -3865,7 +3868,7 @@ void scx_group_set_weight(struct task_group *tg, unsigned long weight)
{
percpu_down_read(&scx_cgroup_rwsem);
- if (tg->scx_weight != weight) {
+ if (scx_cgroup_enabled && tg->scx_weight != weight) {
if (SCX_HAS_OP(cgroup_set_weight))
SCX_CALL_OP(SCX_KF_UNLOCKED, cgroup_set_weight,
tg_cgrp(tg), weight);
@@ -4037,6 +4040,9 @@ static void scx_cgroup_exit(void)
percpu_rwsem_assert_held(&scx_cgroup_rwsem);
+ WARN_ON_ONCE(!scx_cgroup_enabled);
+ scx_cgroup_enabled = false;
+
/*
* scx_tg_on/offline() are excluded through scx_cgroup_rwsem. If we walk
* cgroups and exit all the inited ones, all online cgroups are exited.
@@ -4112,6 +4118,9 @@ static int scx_cgroup_init(void)
}
rcu_read_unlock();
+ WARN_ON_ONCE(scx_cgroup_enabled);
+ scx_cgroup_enabled = true;
+
return 0;
}
--
2.46.0
* [PATCH 7/8] sched_ext: Decouple locks in scx_ops_disable_workfn()
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together, which is prone to
locking order problems. With the preceding scx_cgroup_enabled change, we can
decouple them:
- As cgroup disabling no longer requires modifying a static_key which
requires cpus_read_lock(), no need to grab cpus_read_lock() before
grabbing scx_cgroup_rwsem.
- cgroup can now be independently disabled before tasks are moved back to
the fair class.
Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
now unnecessary cpus_read_lock() and move static_key operations out of
scx_fork_rwsem. This decouples all three locks in the disable path.
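The resulting disable-path ordering, condensed (a sketch; error handling and
the iteration bodies elided):

  scx_cgroup_lock();
  scx_cgroup_exit();              /* cgroup support shut down first, on
                                   * its own, before any task is moved */
  scx_cgroup_unlock();

  percpu_down_write(&scx_fork_rwsem);
  /* switch all tasks back to the fair class, exit ops on each */
  percpu_up_write(&scx_fork_rwsem);

  /* static_branch_disable() takes cpus_read_lock() internally; it no
   * longer nests inside the other two locks */
  static_branch_disable(&__scx_ops_enabled);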
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
---
kernel/sched/ext.c | 37 +++++++++++++++++--------------------
1 file changed, 17 insertions(+), 20 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 06dc1011312d..18f95072866f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4439,21 +4439,23 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
WRITE_ONCE(scx_switching_all, false);
/*
- * Avoid racing against fork and cgroup changes. See scx_ops_enable()
- * for explanation on the locking order.
+ * Shut down cgroup support before tasks so that the cgroup attach path
+ * doesn't race against scx_ops_exit_task().
*/
- percpu_down_write(&scx_fork_rwsem);
- cpus_read_lock();
scx_cgroup_lock();
+ scx_cgroup_exit();
+ scx_cgroup_unlock();
- scx_ops_init_task_enabled = false;
-
- spin_lock_irq(&scx_tasks_lock);
- scx_task_iter_init(&sti);
/*
* The BPF scheduler is going away. All tasks including %TASK_DEAD ones
* must be switched out and exited synchronously.
*/
+ percpu_down_write(&scx_fork_rwsem);
+
+ scx_ops_init_task_enabled = false;
+
+ spin_lock_irq(&scx_tasks_lock);
+ scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
const struct sched_class *old_class = p->sched_class;
struct sched_enq_and_set_ctx ctx;
@@ -4471,23 +4473,18 @@ static void scx_ops_disable_workfn(struct kthread_work *work)
}
scx_task_iter_exit(&sti);
spin_unlock_irq(&scx_tasks_lock);
+ percpu_up_write(&scx_fork_rwsem);
/* no task is on scx, turn off all the switches and flush in-progress calls */
- static_branch_disable_cpuslocked(&__scx_ops_enabled);
+ static_branch_disable(&__scx_ops_enabled);
for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
- static_branch_disable_cpuslocked(&scx_has_op[i]);
- static_branch_disable_cpuslocked(&scx_ops_enq_last);
- static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
- static_branch_disable_cpuslocked(&scx_ops_cpu_preempt);
- static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ static_branch_disable(&scx_has_op[i]);
+ static_branch_disable(&scx_ops_enq_last);
+ static_branch_disable(&scx_ops_enq_exiting);
+ static_branch_disable(&scx_ops_cpu_preempt);
+ static_branch_disable(&scx_builtin_idle_enabled);
synchronize_rcu();
- scx_cgroup_exit();
-
- scx_cgroup_unlock();
- cpus_read_unlock();
- percpu_up_write(&scx_fork_rwsem);
-
if (ei->kind >= SCX_EXIT_ERROR) {
pr_err("sched_ext: BPF scheduler \"%s\" disabled (%s)\n",
scx_ops.name, ei->reason);
--
2.46.0
* [PATCH 8/8] sched_ext: Decouple locks in scx_ops_enable()
From: Tejun Heo @ 2024-09-23 18:59 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad, Tejun Heo
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together, which is prone to
locking order problems.
For example, there is currently a possible deadlock involving scx_fork_rwsem
and cpus_read_lock. cpus_read_lock has to nest inside scx_fork_rwsem due to
locking orders that exist in other subsystems. However, there exists a
dependency in the other direction during hotplug if hotplug needs to fork a
new task, which happens in some cases. This leads to the following deadlock:
  scx_ops_enable()                            hotplug

                                              percpu_down_write(&cpu_hotplug_lock)
  percpu_down_write(&scx_fork_rwsem)
  block on cpu_hotplug_lock
                                              kthread_create() waits for kthreadd
                                              kthreadd blocks on scx_fork_rwsem
Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.
With the preceding scx_cgroup_enabled change, this can be solved by
decoupling cpus_read_lock, which is needed for static_key manipulations,
from the other two locks.
- Move the first block of static_key manipulations outside of scx_fork_rwsem
and scx_cgroup_rwsem. This is now safe with the preceding
scx_cgroup_enabled change.
- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
  blocks so that the __scx_ops_enabled static_key enabling is outside the
  two rwsems (see the condensed sketch below).
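The resulting enable-path ordering, condensed (a sketch; error handling and
the loop bodies elided):

  scx_ops_bypass(true);

  /* static branches are flipped with no other lock held;
   * static_branch_enable() handles cpus_read_lock() internally */

  percpu_down_write(&scx_fork_rwsem);
  scx_cgroup_lock();
  scx_cgroup_init();
  /* first loop: scx_ops_init_task() + SCX_TASK_READY for every task */
  scx_cgroup_unlock();
  percpu_up_write(&scx_fork_rwsem);

  static_branch_enable(&__scx_ops_enabled);

  percpu_down_write(&scx_fork_rwsem);
  /* second loop: switch eligible tasks to SCX */
  percpu_up_write(&scx_fork_rwsem);

  scx_ops_bypass(false);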
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
---
kernel/sched/ext.c | 67 +++++++++++++++++++---------------------------
1 file changed, 27 insertions(+), 40 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 18f95072866f..5fe0dfd97340 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5004,7 +5004,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
ret = SCX_CALL_OP_RET(SCX_KF_UNLOCKED, init);
if (ret) {
ret = ops_sanitize_err("init", ret);
- goto err_disable_unlock_cpus;
+ cpus_read_unlock();
+ goto err_disable;
}
}
@@ -5047,54 +5048,30 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
*/
scx_ops_bypass(true);
- /*
- * Lock out forks, cgroup on/offlining and moves before opening the
- * floodgate so that they don't wander into the operations prematurely.
- *
- * We don't need to keep the CPUs stable but static_branch_*() requires
- * cpus_read_lock() and scx_cgroup_rwsem must nest inside
- * cpu_hotplug_lock because of the following dependency chain:
- *
- * cpu_hotplug_lock --> cgroup_threadgroup_rwsem --> scx_cgroup_rwsem
- *
- * So, we need to do cpus_read_lock() before scx_cgroup_lock() and use
- * static_branch_*_cpuslocked().
- *
- * Note that cpu_hotplug_lock must nest inside scx_fork_rwsem due to the
- * following dependency chain:
- *
- * scx_fork_rwsem --> pernet_ops_rwsem --> cpu_hotplug_lock
- */
- percpu_down_write(&scx_fork_rwsem);
- cpus_read_lock();
- scx_cgroup_lock();
-
for (i = SCX_OPI_NORMAL_BEGIN; i < SCX_OPI_NORMAL_END; i++)
if (((void (**)(void))ops)[i])
- static_branch_enable_cpuslocked(&scx_has_op[i]);
+ static_branch_enable(&scx_has_op[i]);
if (ops->flags & SCX_OPS_ENQ_LAST)
- static_branch_enable_cpuslocked(&scx_ops_enq_last);
+ static_branch_enable(&scx_ops_enq_last);
if (ops->flags & SCX_OPS_ENQ_EXITING)
- static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
+ static_branch_enable(&scx_ops_enq_exiting);
if (scx_ops.cpu_acquire || scx_ops.cpu_release)
- static_branch_enable_cpuslocked(&scx_ops_cpu_preempt);
+ static_branch_enable(&scx_ops_cpu_preempt);
if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
reset_idle_masks();
- static_branch_enable_cpuslocked(&scx_builtin_idle_enabled);
+ static_branch_enable(&scx_builtin_idle_enabled);
} else {
- static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
+ static_branch_disable(&scx_builtin_idle_enabled);
}
/*
- * All cgroups should be initialized before letting in tasks. cgroup
- * on/offlining and task migrations are already locked out.
+ * Lock out forks, cgroup on/offlining and moves before opening the
+ * floodgate so that they don't wander into the operations prematurely.
*/
- ret = scx_cgroup_init();
- if (ret)
- goto err_disable_unlock_all;
+ percpu_down_write(&scx_fork_rwsem);
WARN_ON_ONCE(scx_ops_init_task_enabled);
scx_ops_init_task_enabled = true;
@@ -5105,7 +5082,18 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
* leaving as sched_ext_free() can handle both prepped and enabled
* tasks. Prep all tasks first and then enable them with preemption
* disabled.
+ *
+ * All cgroups should be initialized before scx_ops_init_task() so that
+ * the BPF scheduler can reliably track each task's cgroup membership
+ * from scx_ops_init_task(). Lock out cgroup on/offlining and task
+ * migrations while tasks are being initialized so that
+ * scx_cgroup_can_attach() never sees uninitialized tasks.
*/
+ scx_cgroup_lock();
+ ret = scx_cgroup_init();
+ if (ret)
+ goto err_disable_unlock_all;
+
spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
@@ -5138,19 +5126,22 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
}
scx_task_iter_exit(&sti);
spin_unlock_irq(&scx_tasks_lock);
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
/*
* All tasks are READY. It's safe to turn on scx_enabled() and switch
* all eligible tasks.
*/
WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
- static_branch_enable_cpuslocked(&__scx_ops_enabled);
+ static_branch_enable(&__scx_ops_enabled);
/*
* We're fully committed and can't fail. The task READY -> ENABLED
* transitions here are synchronized against sched_ext_free() through
* scx_tasks_lock.
*/
+ percpu_down_write(&scx_fork_rwsem);
spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
@@ -5168,10 +5159,8 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
}
scx_task_iter_exit(&sti);
spin_unlock_irq(&scx_tasks_lock);
-
- scx_cgroup_unlock();
- cpus_read_unlock();
percpu_up_write(&scx_fork_rwsem);
+
scx_ops_bypass(false);
/*
@@ -5214,8 +5203,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
scx_ops_bypass(false);
-err_disable_unlock_cpus:
- cpus_read_unlock();
err_disable:
mutex_unlock(&scx_ops_enable_mutex);
/* must be fully disabled before returning */
--
2.46.0
* Re: [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix locking enable/disable path bugs including a locking order one
From: Tejun Heo @ 2024-09-27 20:03 UTC
To: void; +Cc: kernel-team, linux-kernel, sched-ext, aboorvad
On Mon, Sep 23, 2024 at 08:59:27AM -1000, Tejun Heo wrote:
> Aboorva Devarajan reported an issue where sched_ext init code can
> occasionally deadlock when scheduler loading races with CPU hotplug. The
> deadlock scenario is as follows:
>
>   scx_ops_enable()                            hotplug
>
>                                               percpu_down_write(&cpu_hotplug_lock)
>   percpu_down_write(&scx_fork_rwsem)
>   block on cpu_hotplug_lock
>                                               kthread_create() waits for kthreadd
>                                               kthreadd blocks on scx_fork_rwsem
>
> Note that this doesn't trigger lockdep because the hotplug side dependency
> bounces through kthreadd.
>
> This is primarily caused by SCX enable/disable paths grabbing big locks
> together. This patchset updates the enable/disable paths to decouple the
> locks. In the process, it also fixes several subtle bugs in the enable path.
>
> This patchset contains the following patches:
>
> 0001-sched_ext-Relocate-check_hotplug_seq-call-in-scx_ops.patch
> 0002-sched_ext-Remove-SCX_OPS_PREPPING.patch
> 0003-sched_ext-Initialize-in-bypass-mode.patch
> 0004-sched_ext-Fix-SCX_TASK_INIT-SCX_TASK_READY-transitio.patch
> 0005-sched_ext-Enable-scx_ops_init_task-separately.patch
> 0006-sched_ext-Add-scx_cgroup_enabled-to-gate-cgroup-oper.patch
> 0007-sched_ext-Decouple-locks-in-scx_ops_disable_workfn.patch
> 0008-sched_ext-Decouple-locks-in-scx_ops_enable.patch
Applied to sched_ext/for-6.12-fixes.
Thanks.
--
tejun