From: Tejun Heo <tj@kernel.org>
To: void@manifault.com
Cc: kernel-team@meta.com, linux-kernel@vger.kernel.org,
sched-ext@meta.com, aboorvad@linux.ibm.com,
Tejun Heo <tj@kernel.org>
Subject: [PATCH 3/8] sched_ext: Initialize in bypass mode
Date: Mon, 23 Sep 2024 08:59:30 -1000 [thread overview]
Message-ID: <20240923190020.1446325-4-tj@kernel.org> (raw)
In-Reply-To: <20240923190020.1446325-1-tj@kernel.org>
scx_ops_enable() used preempt_disable() around the task iteration loop to
switch tasks into SCX to guarantee forward progress of the task which is
running scx_ops_enable(). However, in the gap between setting
__scx_ops_enabled and preeempt_disable(), an external entity can put tasks
including the enabling one into SCX prematurely, which can lead to
malfunctions including stalls.
The bypass mode can wrap the entire enabling operation and guarantee forward
progress no matter what the BPF scheduler does. Use the bypass mode instead
to guarantee forward progress while enabling.
While at it, release and regrab scx_tasks_lock between the two task
iteration locks in scx_ops_enable() for clarity as there is no reason to
keep holding the lock between them.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2422b2abee6e..1b4b6439cd1a 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -5030,6 +5030,14 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
queue_delayed_work(system_unbound_wq, &scx_watchdog_work,
scx_watchdog_timeout / 2);
+ /*
+ * Once __scx_ops_enabled is set, %current can be switched to SCX
+ * anytime. This can lead to stalls as some BPF schedulers (e.g.
+ * userspace scheduling) may not function correctly before all tasks are
+ * switched. Init in bypass mode to guarantee forward progress.
+ */
+ scx_ops_bypass(true);
+
/*
* Lock out forks, cgroup on/offlining and moves before opening the
* floodgate so that they don't wander into the operations prematurely.
@@ -5089,7 +5097,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
* disabled.
*/
spin_lock_irq(&scx_tasks_lock);
-
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
/*
@@ -5118,22 +5125,19 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
spin_lock_irq(&scx_tasks_lock);
}
scx_task_iter_exit(&sti);
+ spin_unlock_irq(&scx_tasks_lock);
/*
- * All tasks are prepped but are still ops-disabled. Ensure that
- * %current can't be scheduled out and switch everyone.
- * preempt_disable() is necessary because we can't guarantee that
- * %current won't be starved if scheduled out while switching.
+ * All tasks are prepped but the tasks are not enabled. Switch everyone.
*/
- preempt_disable();
+ WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
/*
* We're fully committed and can't fail. The PREPPED -> ENABLED
* transitions here are synchronized against sched_ext_free() through
* scx_tasks_lock.
*/
- WRITE_ONCE(scx_switching_all, !(ops->flags & SCX_OPS_SWITCH_PARTIAL));
-
+ spin_lock_irq(&scx_tasks_lock);
scx_task_iter_init(&sti);
while ((p = scx_task_iter_next_locked(&sti))) {
const struct sched_class *old_class = p->sched_class;
@@ -5150,12 +5154,12 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
check_class_changed(task_rq(p), p, old_class, p->prio);
}
scx_task_iter_exit(&sti);
-
spin_unlock_irq(&scx_tasks_lock);
- preempt_enable();
+
scx_cgroup_unlock();
cpus_read_unlock();
percpu_up_write(&scx_fork_rwsem);
+ scx_ops_bypass(false);
/*
* Returning an error code here would lose the recorded error
@@ -5196,6 +5200,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
err_disable_unlock_all:
scx_cgroup_unlock();
percpu_up_write(&scx_fork_rwsem);
+ scx_ops_bypass(false);
err_disable_unlock_cpus:
cpus_read_unlock();
err_disable:
--
2.46.0
next prev parent reply other threads:[~2024-09-23 19:00 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-09-23 18:59 [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix locking enable/disable path bugs includling locking order one Tejun Heo
2024-09-23 18:59 ` [PATCH 1/8] sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable() Tejun Heo
2024-09-23 18:59 ` [PATCH 2/8] sched_ext: Remove SCX_OPS_PREPPING Tejun Heo
2024-09-23 18:59 ` Tejun Heo [this message]
2024-09-23 18:59 ` [PATCH 4/8] sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable() Tejun Heo
2024-09-23 18:59 ` [PATCH 5/8] sched_ext: Enable scx_ops_init_task() separately Tejun Heo
2024-09-23 18:59 ` [PATCH 6/8] sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online() Tejun Heo
2024-09-23 18:59 ` [PATCH 7/8] sched_ext: Decouple locks in scx_ops_disable_workfn() Tejun Heo
2024-09-23 18:59 ` [PATCH 8/8] sched_ext: Decouple locks in scx_ops_enable() Tejun Heo
2024-09-27 20:03 ` [PATCHSET sched_ext/for-6.12-fixes] sched_ext: Fix locking enable/disable path bugs includling locking order one Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240923190020.1446325-4-tj@kernel.org \
--to=tj@kernel.org \
--cc=aboorvad@linux.ibm.com \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@meta.com \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox