From: Tejun Heo <tj@kernel.org>
To: torvalds@linux-foundation.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, ast@kernel.org,
	daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org,
	joshdon@google.com, brho@google.com, pjt@google.com,
	derkling@google.com, haoluo@google.com, dvernet@meta.com,
	dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com
Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	kernel-team@meta.com, Tejun Heo <tj@kernel.org>
Subject: [PATCH 20/34] sched_ext: Make watchdog handle ops.dispatch() looping stall
Date: Mon, 10 Jul 2023 15:13:38 -1000
Message-ID: <20230711011412.100319-21-tj@kernel.org>
In-Reply-To: <20230711011412.100319-1-tj@kernel.org>

The dispatch path retries if the local DSQ is still empty after
ops.dispatch() either dispatched or consumed a task. The retry is both
necessary and convenient. It is necessary because the dispatch path can
lose tasks to dequeue while the rq lock is released to migrate tasks
across CPUs. It is convenient because ops.dispatch() only needs to make
some forward progress on each iteration, which simplifies its
implementation.

However, this makes it possible for ops.dispatch() to stall CPUs by
repeatedly dispatching ineligible tasks. If all CPUs are stalled that way,
neither the watchdog nor the sysrq handler can run and the system can't be
saved. Let's address the issue by breaking out of the dispatch loop after
SCX_DSP_MAX_LOOPS (32) iterations.

It is unlikely but not impossible for ops.dispatch() to legitimately go over
the iteration limit. We want to come back to the dispatch path in such cases,
as not doing so risks stalling the CPU by idling with runnable tasks
pending. As the previous task is still current in balance_scx(), calling
resched_curr() there doesn't do anything - the request will just get
cleared. Let's instead use scx_bpf_kick_cpu(), which will trigger a
reschedule after switching to the next task, which will likely be the idle
task.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
---
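For readers who want the control flow in isolation, below is a minimal
userspace sketch of the bounded retry loop described above. It is not the
kernel code: struct sim_cpu, sim_balance() and both dispatch callbacks are
hypothetical stand-ins, and the kick_pending flag only models the deferred
scx_bpf_kick_cpu() call. Only the limit of 32 is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define DSP_MAX_LOOPS 32	/* mirrors SCX_DSP_MAX_LOOPS in the patch */

/* Hypothetical stand-in for the per-CPU state the dispatch path looks at. */
struct sim_cpu {
	int nr_eligible;	/* tasks that can actually run on this CPU */
	int nr_dispatch_calls;	/* how often the dispatch callback was invoked */
	bool kick_pending;	/* models the deferred scx_bpf_kick_cpu() */
};

/* Hypothetical ops.dispatch(): returns true if it dispatched something. */
typedef bool (*dispatch_fn)(struct sim_cpu *cpu);

/* Misbehaving callback: always reports progress, never yields a local task. */
static bool dispatch_ineligible(struct sim_cpu *cpu)
{
	cpu->nr_dispatch_calls++;
	return true;
}

/* Well-behaved callback: produces a runnable task on its third invocation. */
static bool dispatch_after_three(struct sim_cpu *cpu)
{
	if (++cpu->nr_dispatch_calls == 3)
		cpu->nr_eligible = 1;
	return true;
}

/* Returns 1 if a runnable task was found, 0 if the loop gave up. */
static int sim_balance(struct sim_cpu *cpu, dispatch_fn dispatch)
{
	int nr_loops = DSP_MAX_LOOPS;
	bool dispatched;

	assert(cpu && dispatch);

	do {
		dispatched = dispatch(cpu);
		if (cpu->nr_eligible > 0)
			return 1;

		/*
		 * Give up after DSP_MAX_LOOPS iterations and request a
		 * deferred kick so the dispatch path is re-entered later.
		 */
		if (!--nr_loops) {
			cpu->kick_pending = true;
			break;
		}
	} while (dispatched);

	return 0;
}
```

Calling sim_balance() with dispatch_ineligible on a zero-initialized
struct sim_cpu stops after 32 callback invocations with kick_pending set.
With dispatch_after_three it returns 1 on the third invocation and no kick
is requested.
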
 kernel/sched/ext.c             | 17 +++++++++++++++++
 tools/sched_ext/scx_qmap.bpf.c | 17 +++++++++++++++++
 tools/sched_ext/scx_qmap.c     |  8 ++++++--
 3 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9e8f9f9fcb3d..48e27d59e621 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9,6 +9,7 @@
 enum scx_internal_consts {
 	SCX_NR_ONLINE_OPS	= SCX_OP_IDX(init),
 	SCX_DSP_DFL_MAX_BATCH	= 32,
+	SCX_DSP_MAX_LOOPS	= 32,
 	SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
 };
 
@@ -167,6 +168,7 @@ static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);
 
 void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
 		      u64 enq_flags);
+void scx_bpf_kick_cpu(s32 cpu, u64 flags);
 
 struct scx_task_iter {
 	struct sched_ext_entity		cursor;
@@ -1286,6 +1288,7 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
 	struct scx_rq *scx_rq = &rq->scx;
 	struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
 	bool prev_on_scx = prev->sched_class == &ext_sched_class;
+	int nr_loops = SCX_DSP_MAX_LOOPS;
 
 	lockdep_assert_rq_held(rq);
 
@@ -1340,6 +1343,20 @@ static int balance_scx(struct rq *rq, struct task_struct *prev,
 			return 1;
 		if (consume_dispatch_q(rq, rf, &scx_dsq_global))
 			return 1;
+
+		/*
+		 * ops.dispatch() can trap us in this loop by repeatedly
+		 * dispatching ineligible tasks. Break out once in a while to
+		 * allow the watchdog to run. As IRQ can't be enabled in
+		 * balance(), we want to complete this scheduling cycle and then
+		 * start a new one. IOW, we want to call resched_curr() on the
+		 * next, most likely idle, task, not the current one. Use
+		 * scx_bpf_kick_cpu() for deferred kicking.
+		 */
+		if (unlikely(!--nr_loops)) {
+			scx_bpf_kick_cpu(cpu_of(rq), 0);
+			break;
+		}
 	} while (dspc->nr_tasks);
 
 	return 0;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index da43f962ab4e..1c3a7d050e32 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -28,6 +28,7 @@ const volatile u64 slice_ns = SCX_SLICE_DFL;
 const volatile bool switch_partial;
 const volatile u32 stall_user_nth;
 const volatile u32 stall_kernel_nth;
+const volatile u32 dsp_inf_loop_after;
 const volatile s32 disallow_tgid;
 
 u32 test_error_cnt;
@@ -187,6 +188,22 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 	s32 pid;
 	int i;
 
+	if (dsp_inf_loop_after && nr_dispatched > dsp_inf_loop_after) {
+		struct task_struct *p;
+
+		/*
+		 * PID 2 should be kthreadd which should mostly be idle and off
+		 * the scheduler. Let's keep dispatching it to force the kernel
+		 * to call this function over and over again.
+		 */
+		p = bpf_task_from_pid(2);
+		if (p) {
+			scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
+			bpf_task_release(p);
+			return;
+		}
+	}
+
 	if (!idx || !cnt) {
 		scx_bpf_error("failed to lookup idx[%p], cnt[%p]", idx, cnt);
 		return;
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 3444e3597b19..805ac453698f 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -20,12 +20,13 @@ const char help_fmt[] =
 "\n"
 "See the top-level comment in .bpf.c for more details.\n"
 "\n"
-"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-d PID] [-p]\n"
+"Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-d PID] [-p]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
 "  -t COUNT      Stall every COUNT'th user thread\n"
 "  -T COUNT      Stall every COUNT'th kernel thread\n"
+"  -l COUNT      Trigger dispatch infinite looping after COUNT dispatches\n"
 "  -d PID        Disallow a process from switching into SCHED_EXT (-1 for self)\n"
 "  -p            Switch only tasks on SCHED_EXT policy intead of all\n"
 "  -h            Display this help and exit\n";
@@ -51,7 +52,7 @@ int main(int argc, char **argv)
 	skel = scx_qmap__open();
 	assert(skel);
 
-	while ((opt = getopt(argc, argv, "s:e:t:T:d:ph")) != -1) {
+	while ((opt = getopt(argc, argv, "s:e:t:T:l:d:ph")) != -1) {
 		switch (opt) {
 		case 's':
 			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -65,6 +66,9 @@ int main(int argc, char **argv)
 		case 'T':
 			skel->rodata->stall_kernel_nth = strtoul(optarg, NULL, 0);
 			break;
+		case 'l':
+			skel->rodata->dsp_inf_loop_after = strtoul(optarg, NULL, 0);
+			break;
 		case 'd':
 			skel->rodata->disallow_tgid = strtol(optarg, NULL, 0);
 			if (skel->rodata->disallow_tgid < 0)
-- 
2.41.0

