public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock
@ 2026-03-16 10:02 Christian Loehle
  2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 10:02 UTC (permalink / raw)
  To: sched-ext
  Cc: linux-kernel, linux-kselftest, tj, void, arighi, changwoo, mingo,
	peterz, shuah, dietmar.eggemann, Christian Loehle

When using SCX_KICK_WAIT I noticed that it lacks any deadlock
prevention and will deadlock with no chance of sched_ext even ejecting
the BPF scheduler.
The BPF scheduler cannot impose any reasonable synchronisation itself,
short of strictly partitioning which CPUs are allowed to SCX_KICK_WAIT
which other CPUs.
Since SCX_KICK_WAIT seems to be used quite rarely, just synchronize
all SCX_KICK_WAIT globally and don't try to be clever about cycle
detection.
Also add a testcase that reproduces the issue.

Christian Loehle (2):
  sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  sched_ext/selftests: Add SCX_KICK_WAIT cycle tests

 kernel/sched/ext.c                            |  45 +++-
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/wait_kick_cycle.bpf.c |  70 ++++++
 .../selftests/sched_ext/wait_kick_cycle.c     | 223 ++++++++++++++++++
 4 files changed, 337 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 10:02 [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock Christian Loehle
@ 2026-03-16 10:02 ` Christian Loehle
  2026-03-16 10:49   ` Andrea Righi
  2026-03-16 17:46   ` Tejun Heo
  2026-03-16 10:02 ` [PATCH 2/2] sched_ext/selftests: Add SCX_KICK_WAIT cycle tests Christian Loehle
  2026-03-29  0:20 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Tejun Heo
  2 siblings, 2 replies; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 10:02 UTC (permalink / raw)
  To: sched-ext
  Cc: linux-kernel, linux-kselftest, tj, void, arighi, changwoo, mingo,
	peterz, shuah, dietmar.eggemann, Christian Loehle

SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
smp_cond_load_acquire() until the target CPU's current SCX task has been
context-switched out (its kick_sync counter advanced).

If multiple CPUs each issue SCX_KICK_WAIT targeting one another
concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
simultaneously.  Because each victim CPU is spinning in hardirq/irq_work
context, it cannot reschedule, so no kick_sync counter ever advances and
the system deadlocks.

Fix this by serializing access to the wait loop behind a global raw
spinlock (scx_kick_wait_lock).  Only one CPU at a time may execute the
wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
acquire the lock records itself in scx_kick_wait_pending and returns.
When the active waiter finishes and releases the lock, it replays the
pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
no wait request is silently dropped.

This is deliberately a coarse serialization: multiple simultaneous wait
operations now run sequentially, increasing latency.  In exchange,
deadlocks are impossible regardless of the cycle length (A->B->C->...->A).

Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
bits left by a CPU that deferred just as the scheduler exited are reset
before the next scheduler instance loads.

Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26a6ac2f8826..b63ae13d0486 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -89,6 +89,19 @@ struct scx_kick_syncs {
 
 static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
 
+/*
+ * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
+ * Callers failing to acquire @scx_kick_wait_lock defer by recording
+ * themselves in @scx_kick_wait_pending and are retriggered when the active
+ * waiter completes.
+ *
+ * Lock ordering: @scx_kick_wait_lock is always acquired before
+ * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
+ */
+static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
+static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
+static cpumask_t scx_kick_wait_pending;
+
 /*
  * Direct dispatch marker.
  *
@@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
 		if (to_free)
 			kvfree_rcu(to_free, rcu);
 	}
+
+	/*
+	 * Clear any CPUs that were waiting for the lock when the scheduler
+	 * exited.  Their irq_work has already returned so no in-flight
+	 * waiter can observe the stale bits on the next enable.
+	 */
+	cpumask_clear(&scx_kick_wait_pending);
 }
 
 static void scx_disable_workfn(struct kthread_work *work)
@@ -5647,8 +5667,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 	struct rq *this_rq = this_rq();
 	struct scx_rq *this_scx = &this_rq->scx;
 	struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
-	bool should_wait = false;
+	bool should_wait = !cpumask_empty(this_scx->cpus_to_wait);
 	unsigned long *ksyncs;
+	s32 this_cpu = cpu_of(this_rq);
 	s32 cpu;
 
 	if (unlikely(!ksyncs_pcpu)) {
@@ -5672,6 +5693,17 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 	if (!should_wait)
 		return;
 
+	if (!raw_spin_trylock(&scx_kick_wait_lock)) {
+		raw_spin_lock(&scx_kick_wait_pending_lock);
+		cpumask_set_cpu(this_cpu, &scx_kick_wait_pending);
+		raw_spin_unlock(&scx_kick_wait_pending_lock);
+		return;
+	}
+
+	raw_spin_lock(&scx_kick_wait_pending_lock);
+	cpumask_clear_cpu(this_cpu, &scx_kick_wait_pending);
+	raw_spin_unlock(&scx_kick_wait_pending_lock);
+
 	for_each_cpu(cpu, this_scx->cpus_to_wait) {
 		unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync;
 
@@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		 * task is picked subsequently. The latter is necessary to break
 		 * the wait when $cpu is taken by a higher sched class.
 		 */
-		if (cpu != cpu_of(this_rq))
+		if (cpu != this_cpu)
 			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
 
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
 	}
+
+	raw_spin_unlock(&scx_kick_wait_lock);
+
+	raw_spin_lock(&scx_kick_wait_pending_lock);
+	for_each_cpu(cpu, &scx_kick_wait_pending) {
+		cpumask_clear_cpu(cpu, &scx_kick_wait_pending);
+		irq_work_queue(&cpu_rq(cpu)->scx.kick_cpus_irq_work);
+	}
+	raw_spin_unlock(&scx_kick_wait_pending_lock);
 }
 
 /**
-- 
2.34.1



* [PATCH 2/2] sched_ext/selftests: Add SCX_KICK_WAIT cycle tests
  2026-03-16 10:02 [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock Christian Loehle
  2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
@ 2026-03-16 10:02 ` Christian Loehle
  2026-03-29  0:20 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Tejun Heo
  2 siblings, 0 replies; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 10:02 UTC (permalink / raw)
  To: sched-ext
  Cc: linux-kernel, linux-kselftest, tj, void, arighi, changwoo, mingo,
	peterz, shuah, dietmar.eggemann, Christian Loehle

Add wait_kick_cycle, a test that stresses a three-CPU SCX_KICK_WAIT
cycle to verify that sched_ext prevents the resulting deadlock.

Note: this test hangs on unfixed kernels.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/wait_kick_cycle.bpf.c |  70 ++++++
 .../selftests/sched_ext/wait_kick_cycle.c     | 223 ++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.c

diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index 006300ac6dff..0b5b527265f7 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -188,6 +188,7 @@ auto-test-targets :=			\
 	rt_stall			\
 	test_example			\
 	total_bw			\
+	wait_kick_cycle			\
 
 testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets)))
 
diff --git a/tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c b/tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c
new file mode 100644
index 000000000000..c53cda86ec75
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Christian Loehle <christian.loehle@arm.com>
+ *
+ * Stress concurrent SCX_KICK_WAIT calls to validate forward progress.
+ *
+ * Three CPUs are designated from userspace. Every enqueue from one of the
+ * three CPUs kicks the next CPU in the ring with SCX_KICK_WAIT, creating a
+ * persistent A -> B -> C -> A wait cycle pressure.
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+const volatile s32 test_cpu_a;
+const volatile s32 test_cpu_b;
+const volatile s32 test_cpu_c;
+
+u64 nr_enqueues;
+u64 nr_wait_kicks;
+
+UEI_DEFINE(uei);
+
+static s32 target_cpu(s32 cpu)
+{
+	if (cpu == test_cpu_a)
+		return test_cpu_b;
+	if (cpu == test_cpu_b)
+		return test_cpu_c;
+	if (cpu == test_cpu_c)
+		return test_cpu_a;
+	return -1;
+}
+
+void BPF_STRUCT_OPS(wait_kick_cycle_enqueue, struct task_struct *p, u64 enq_flags)
+{
+	s32 this_cpu = bpf_get_smp_processor_id();
+	s32 tgt;
+
+	__sync_fetch_and_add(&nr_enqueues, 1);
+
+	if (p->flags & PF_KTHREAD) {
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_INF,
+				   enq_flags | SCX_ENQ_PREEMPT);
+		return;
+	}
+
+	scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
+
+	tgt = target_cpu(this_cpu);
+	if (tgt < 0 || tgt == this_cpu)
+		return;
+
+	__sync_fetch_and_add(&nr_wait_kicks, 1);
+	scx_bpf_kick_cpu(tgt, SCX_KICK_WAIT);
+}
+
+void BPF_STRUCT_OPS(wait_kick_cycle_exit, struct scx_exit_info *ei)
+{
+	UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops wait_kick_cycle_ops = {
+	.enqueue		= wait_kick_cycle_enqueue,
+	.exit			= wait_kick_cycle_exit,
+	.name			= "wait_kick_cycle",
+	.timeout_ms		= 1000U,
+};
diff --git a/tools/testing/selftests/sched_ext/wait_kick_cycle.c b/tools/testing/selftests/sched_ext/wait_kick_cycle.c
new file mode 100644
index 000000000000..3889e7a9a0a7
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/wait_kick_cycle.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2026 Christian Loehle <christian.loehle@arm.com>
+ */
+#define _GNU_SOURCE
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+#include <bpf/bpf.h>
+#include <errno.h>
+#include <pthread.h>
+#include <sched.h>
+#include <scx/common.h>
+#include <stdint.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "scx_test.h"
+#include "wait_kick_cycle.bpf.skel.h"
+
+/*
+ * Multiple workers per test CPU. Packing several runnable threads onto each
+ * CPU causes frequent context switching and back-to-back enqueue() calls, which
+ * maximizes the chance that all three test CPUs fire enqueue() concurrently
+ * and enter the SCX_KICK_WAIT cycle simultaneously.
+ */
+#define WORKERS_PER_CPU	4
+#define NR_TEST_CPUS	3
+#define NR_WORKERS	(NR_TEST_CPUS * WORKERS_PER_CPU)
+
+struct worker_ctx {
+	pthread_t tid;
+	int cpu;
+	volatile bool stop;
+	volatile __u64 iters;
+	bool started;
+};
+
+static int pick_test_cpus(int *cpu_a, int *cpu_b, int *cpu_c)
+{
+	cpu_set_t mask;
+	int cpus[4];
+	int nr = 0;
+	int cpu;
+
+	if (sched_getaffinity(0, sizeof(mask), &mask))
+		return -errno;
+
+	for (cpu = 0; cpu < CPU_SETSIZE && nr < ARRAY_SIZE(cpus); cpu++) {
+		if (!CPU_ISSET(cpu, &mask))
+			continue;
+		cpus[nr++] = cpu;
+	}
+
+	if (nr < 3)
+		return -EOPNOTSUPP;
+
+	/* Leave one CPU unused when possible so one CPU remains uncongested. */
+	if (nr >= 4) {
+		*cpu_a = cpus[1];
+		*cpu_b = cpus[2];
+		*cpu_c = cpus[3];
+	} else {
+		*cpu_a = cpus[0];
+		*cpu_b = cpus[1];
+		*cpu_c = cpus[2];
+	}
+	return 0;
+}
+
+static void *worker_fn(void *arg)
+{
+	struct worker_ctx *worker = arg;
+	cpu_set_t mask;
+
+	CPU_ZERO(&mask);
+	CPU_SET(worker->cpu, &mask);
+
+	if (sched_setaffinity(0, sizeof(mask), &mask))
+		return (void *)(uintptr_t)errno;
+
+	/*
+	 * Tight yield loop — no sleep.  Keeping the CPU continuously busy
+	 * with rapid context switches ensures enqueue() fires at the highest
+	 * possible rate on each test CPU.
+	 */
+	while (!worker->stop) {
+		sched_yield();
+		worker->iters++;
+	}
+
+	return NULL;
+}
+
+static int join_worker(struct worker_ctx *worker)
+{
+	void *ret;
+	struct timespec ts;
+	int err;
+
+	if (!worker->started)
+		return 0;
+
+	if (clock_gettime(CLOCK_REALTIME, &ts))
+		return -errno;
+
+	ts.tv_sec += 2;
+	err = pthread_timedjoin_np(worker->tid, &ret, &ts);
+	if (err == ETIMEDOUT)
+		pthread_detach(worker->tid);
+	if (err)
+		return -err;
+
+	if ((uintptr_t)ret)
+		return -(int)(uintptr_t)ret;
+
+	return 0;
+}
+
+static enum scx_test_status setup(void **ctx)
+{
+	struct wait_kick_cycle *skel;
+
+	skel = wait_kick_cycle__open();
+	SCX_FAIL_IF(!skel, "Failed to open skel");
+	SCX_ENUM_INIT(skel);
+
+	*ctx = skel;
+	return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+	struct wait_kick_cycle *skel = ctx;
+	struct worker_ctx workers[NR_WORKERS] = {};
+	struct bpf_link *link = NULL;
+	enum scx_test_status status = SCX_TEST_PASS;
+	int test_cpus[NR_TEST_CPUS] = { -1, -1, -1 };
+	int ret;
+	int i;
+
+	ret = pick_test_cpus(&test_cpus[0], &test_cpus[1], &test_cpus[2]);
+	if (ret == -EOPNOTSUPP)
+		return SCX_TEST_SKIP;
+	if (ret) {
+		SCX_ERR("Failed to pick test cpus (%d)", ret);
+		return SCX_TEST_FAIL;
+	}
+
+	skel->rodata->test_cpu_a = test_cpus[0];
+	skel->rodata->test_cpu_b = test_cpus[1];
+	skel->rodata->test_cpu_c = test_cpus[2];
+
+	if (wait_kick_cycle__load(skel)) {
+		SCX_ERR("Failed to load skel");
+		return SCX_TEST_FAIL;
+	}
+
+	link = bpf_map__attach_struct_ops(skel->maps.wait_kick_cycle_ops);
+	if (!link) {
+		SCX_ERR("Failed to attach scheduler");
+		return SCX_TEST_FAIL;
+	}
+
+	/* WORKERS_PER_CPU threads per test CPU, all in tight yield loops. */
+	for (i = 0; i < NR_WORKERS; i++)
+		workers[i].cpu = test_cpus[i / WORKERS_PER_CPU];
+
+	for (i = 0; i < NR_WORKERS; i++) {
+		ret = pthread_create(&workers[i].tid, NULL, worker_fn, &workers[i]);
+		if (ret) {
+			SCX_ERR("Failed to create worker thread %d (%d)", i, ret);
+			status = SCX_TEST_FAIL;
+			goto out;
+		}
+		workers[i].started = true;
+	}
+
+	sleep(3);
+
+	if (skel->data->uei.kind != EXIT_KIND(SCX_EXIT_NONE)) {
+		SCX_ERR("Scheduler exited unexpectedly (kind=%llu code=%lld)",
+			(unsigned long long)skel->data->uei.kind,
+			(long long)skel->data->uei.exit_code);
+		status = SCX_TEST_FAIL;
+	}
+
+out:
+	for (i = 0; i < NR_WORKERS; i++)
+		workers[i].stop = true;
+
+	for (i = 0; i < NR_WORKERS; i++) {
+		ret = join_worker(&workers[i]);
+		if (ret && status == SCX_TEST_PASS) {
+			SCX_ERR("Failed to join worker thread %d (%d)", i, ret);
+			status = SCX_TEST_FAIL;
+		}
+	}
+
+	if (link)
+		bpf_link__destroy(link);
+
+	return status;
+}
+
+static void cleanup(void *ctx)
+{
+	struct wait_kick_cycle *skel = ctx;
+
+	wait_kick_cycle__destroy(skel);
+}
+
+struct scx_test wait_kick_cycle = {
+	.name = "wait_kick_cycle",
+	.description = "Verify SCX_KICK_WAIT forward progress under a 3-CPU wait cycle",
+	.setup = setup,
+	.run = run,
+	.cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&wait_kick_cycle)
-- 
2.34.1



* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
@ 2026-03-16 10:49   ` Andrea Righi
  2026-03-16 11:12     ` Christian Loehle
  2026-03-16 17:46   ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: Andrea Righi @ 2026-03-16 10:49 UTC (permalink / raw)
  To: Christian Loehle
  Cc: sched-ext, linux-kernel, linux-kselftest, tj, void, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

Hi Christian,

On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
> smp_cond_load_acquire() until the target CPU's current SCX task has been
> context-switched out (its kick_sync counter advanced).
> 
> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
> simultaneously.  Because each victim CPU is spinning in hardirq/irq_work
> context, it cannot reschedule, so no kick_sync counter ever advances and
> the system deadlocks.
> 
> Fix this by serializing access to the wait loop behind a global raw
> spinlock (scx_kick_wait_lock).  Only one CPU at a time may execute the
> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
> acquire the lock records itself in scx_kick_wait_pending and returns.
> When the active waiter finishes and releases the lock, it replays the
> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
> no wait request is silently dropped.
> 
> This is deliberately a coarse serialization: multiple simultaneous wait
> operations now run sequentially, increasing latency.  In exchange,
> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
> 
> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
> bits left by a CPU that deferred just as the scheduler exited are reset
> before the next scheduler instance loads.
> 
> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> ---
>  kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 26a6ac2f8826..b63ae13d0486 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
>  
>  static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
>  
> +/*
> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
> + * themselves in @scx_kick_wait_pending and are retriggered when the active
> + * waiter completes.
> + *
> + * Lock ordering: @scx_kick_wait_lock is always acquired before
> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
> + */
> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
> +static cpumask_t scx_kick_wait_pending;
> +
>  /*
>   * Direct dispatch marker.
>   *
> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
>  		if (to_free)
>  			kvfree_rcu(to_free, rcu);
>  	}
> +
> +	/*
> +	 * Clear any CPUs that were waiting for the lock when the scheduler
> +	 * exited.  Their irq_work has already returned so no in-flight
> +	 * waiter can observe the stale bits on the next enable.
> +	 */
> +	cpumask_clear(&scx_kick_wait_pending);

Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
Probably it's not that relevant at this point, but I'd keep the locking for
correctness.

Thanks,
-Andrea

>  }
>  
>  static void scx_disable_workfn(struct kthread_work *work)
> @@ -5647,8 +5667,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  	struct rq *this_rq = this_rq();
>  	struct scx_rq *this_scx = &this_rq->scx;
>  	struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
> -	bool should_wait = false;
> +	bool should_wait = !cpumask_empty(this_scx->cpus_to_wait);
>  	unsigned long *ksyncs;
> +	s32 this_cpu = cpu_of(this_rq);
>  	s32 cpu;
>  
>  	if (unlikely(!ksyncs_pcpu)) {
> @@ -5672,6 +5693,17 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  	if (!should_wait)
>  		return;
>  
> +	if (!raw_spin_trylock(&scx_kick_wait_lock)) {
> +		raw_spin_lock(&scx_kick_wait_pending_lock);
> +		cpumask_set_cpu(this_cpu, &scx_kick_wait_pending);
> +		raw_spin_unlock(&scx_kick_wait_pending_lock);
> +		return;
> +	}
> +
> +	raw_spin_lock(&scx_kick_wait_pending_lock);
> +	cpumask_clear_cpu(this_cpu, &scx_kick_wait_pending);
> +	raw_spin_unlock(&scx_kick_wait_pending_lock);
> +
>  	for_each_cpu(cpu, this_scx->cpus_to_wait) {
>  		unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync;
>  
> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  		 * task is picked subsequently. The latter is necessary to break
>  		 * the wait when $cpu is taken by a higher sched class.
>  		 */
> -		if (cpu != cpu_of(this_rq))
> +		if (cpu != this_cpu)
>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>  
>  		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
>  	}
> +
> +	raw_spin_unlock(&scx_kick_wait_lock);
> +
> +	raw_spin_lock(&scx_kick_wait_pending_lock);
> +	for_each_cpu(cpu, &scx_kick_wait_pending) {
> +		cpumask_clear_cpu(cpu, &scx_kick_wait_pending);
> +		irq_work_queue(&cpu_rq(cpu)->scx.kick_cpus_irq_work);
> +	}
> +	raw_spin_unlock(&scx_kick_wait_pending_lock);
>  }
>  
>  /**
> -- 
> 2.34.1
> 


* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 10:49   ` Andrea Righi
@ 2026-03-16 11:12     ` Christian Loehle
  2026-03-16 14:42       ` Andrea Righi
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 11:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: sched-ext, linux-kernel, linux-kselftest, tj, void, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

On 3/16/26 10:49, Andrea Righi wrote:
> Hi Christian,
> 
> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
>> smp_cond_load_acquire() until the target CPU's current SCX task has been
>> context-switched out (its kick_sync counter advanced).
>>
>> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
>> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
>> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
>> simultaneously.  Because each victim CPU is spinning in hardirq/irq_work
>> context, it cannot reschedule, so no kick_sync counter ever advances and
>> the system deadlocks.
>>
>> Fix this by serializing access to the wait loop behind a global raw
>> spinlock (scx_kick_wait_lock).  Only one CPU at a time may execute the
>> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
>> acquire the lock records itself in scx_kick_wait_pending and returns.
>> When the active waiter finishes and releases the lock, it replays the
>> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
>> no wait request is silently dropped.
>>
>> This is deliberately a coarse serialization: multiple simultaneous wait
>> operations now run sequentially, increasing latency.  In exchange,
>> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
>>
>> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
>> bits left by a CPU that deferred just as the scheduler exited are reset
>> before the next scheduler instance loads.
>>
>> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
>> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
>> ---
>>  kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 43 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
>> index 26a6ac2f8826..b63ae13d0486 100644
>> --- a/kernel/sched/ext.c
>> +++ b/kernel/sched/ext.c
>> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
>>  
>>  static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
>>  
>> +/*
>> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
>> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
>> + * themselves in @scx_kick_wait_pending and are retriggered when the active
>> + * waiter completes.
>> + *
>> + * Lock ordering: @scx_kick_wait_lock is always acquired before
>> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
>> + */
>> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
>> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
>> +static cpumask_t scx_kick_wait_pending;
>> +
>>  /*
>>   * Direct dispatch marker.
>>   *
>> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
>>  		if (to_free)
>>  			kvfree_rcu(to_free, rcu);
>>  	}
>> +
>> +	/*
>> +	 * Clear any CPUs that were waiting for the lock when the scheduler
>> +	 * exited.  Their irq_work has already returned so no in-flight
>> +	 * waiter can observe the stale bits on the next enable.
>> +	 */
>> +	cpumask_clear(&scx_kick_wait_pending);
> 
> Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
> sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
> Probably it's not that relevant at this point, but I'd keep the locking for
> correctness.

Of course, thanks. Noted for v2!

Are you fine with the approach, i.e. hitting it with the sledge hammer of global
serialization?
I have something more complex in mind too, but yeah, we'd need to at least either
let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs and introduce
a whole lot of infra, which seems a bit overkill for an apparently barely used
interface and also would be nasty to backport.


* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 11:12     ` Christian Loehle
@ 2026-03-16 14:42       ` Andrea Righi
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Righi @ 2026-03-16 14:42 UTC (permalink / raw)
  To: Christian Loehle
  Cc: sched-ext, linux-kernel, linux-kselftest, tj, void, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

On Mon, Mar 16, 2026 at 11:12:43AM +0000, Christian Loehle wrote:
> On 3/16/26 10:49, Andrea Righi wrote:
> > Hi Christian,
> > 
> > On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> >> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
> >> smp_cond_load_acquire() until the target CPU's current SCX task has been
> >> context-switched out (its kick_sync counter advanced).
> >>
> >> If multiple CPUs each issue SCX_KICK_WAIT targeting one another
> >> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
> >> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire()
> >> simultaneously.  Because each victim CPU is spinning in hardirq/irq_work
> >> context, it cannot reschedule, so no kick_sync counter ever advances and
> >> the system deadlocks.
> >>
> >> Fix this by serializing access to the wait loop behind a global raw
> >> spinlock (scx_kick_wait_lock).  Only one CPU at a time may execute the
> >> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
> >> acquire the lock records itself in scx_kick_wait_pending and returns.
> >> When the active waiter finishes and releases the lock, it replays the
> >> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
> >> no wait request is silently dropped.
> >>
> >> This is deliberately a coarse serialization: multiple simultaneous wait
> >> operations now run sequentially, increasing latency.  In exchange,
> >> deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
> >>
> >> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
> >> bits left by a CPU that deferred just as the scheduler exited are reset
> >> before the next scheduler instance loads.
> >>
> >> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
> >> Signed-off-by: Christian Loehle <christian.loehle@arm.com>
> >> ---
> >>  kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
> >>  1 file changed, 43 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> >> index 26a6ac2f8826..b63ae13d0486 100644
> >> --- a/kernel/sched/ext.c
> >> +++ b/kernel/sched/ext.c
> >> @@ -89,6 +89,19 @@ struct scx_kick_syncs {
> >>  
> >>  static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
> >>  
> >> +/*
> >> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
> >> + * Callers failing to acquire @scx_kick_wait_lock defer by recording
> >> + * themselves in @scx_kick_wait_pending and are retriggered when the active
> >> + * waiter completes.
> >> + *
> >> + * Lock ordering: @scx_kick_wait_lock is always acquired before
> >> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
> >> + */
> >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
> >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
> >> +static cpumask_t scx_kick_wait_pending;
> >> +
> >>  /*
> >>   * Direct dispatch marker.
> >>   *
> >> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
> >>  		if (to_free)
> >>  			kvfree_rcu(to_free, rcu);
> >>  	}
> >> +
> >> +	/*
> >> +	 * Clear any CPUs that were waiting for the lock when the scheduler
> >> +	 * exited.  Their irq_work has already returned so no in-flight
> >> +	 * waiter can observe the stale bits on the next enable.
> >> +	 */
> >> +	cpumask_clear(&scx_kick_wait_pending);
> > 
> > Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make
> > sure we're not racing with cpumask_set_cpu()/cpumask_clear_cpu()?
> > Probably it's not that relevant at this point, but I'd keep the locking for
> > correctness.
> 
> Of course, thanks. Noted for v2!
> 
> Are you fine with the approach, i.e. hitting it with the sledge hammer of global
> serialization?
> I have something more complex in mind too, but yeah, we'd need to at least either
> let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs and introduce
> a whole lot of infra, which seems a bit overkill for an apparently barely used
> interface and also would be nasty to backport.

Yes, the current approach looks reasonable to me, I think the potential
latency increase (assuming there's any noticeable increase) is totally
acceptable in order to fix the deadlock.

Thanks,
-Andrea


* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
  2026-03-16 10:49   ` Andrea Righi
@ 2026-03-16 17:46   ` Tejun Heo
  2026-03-16 22:26     ` Christian Loehle
  1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2026-03-16 17:46 UTC (permalink / raw)
  To: Christian Loehle
  Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

Hello,

On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  		 * task is picked subsequently. The latter is necessary to break
>  		 * the wait when $cpu is taken by a higher sched class.
>  		 */
> -		if (cpu != cpu_of(this_rq))
> +		if (cpu != this_cpu)
>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);

Given that irq_work is executed at the end of IRQ handling, we can just
reschedule the irq work when the condition is not met (or separate that out
into its own irq_work). That way, I think we can avoid the global lock.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 17:46   ` Tejun Heo
@ 2026-03-16 22:26     ` Christian Loehle
  2026-03-17  8:23       ` Christian Loehle
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 22:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

On 3/16/26 17:46, Tejun Heo wrote:
> Hello,
> 
> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>>  		 * task is picked subsequently. The latter is necessary to break
>>  		 * the wait when $cpu is taken by a higher sched class.
>>  		 */
>> -		if (cpu != cpu_of(this_rq))
>> +		if (cpu != this_cpu)
>>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
> 
> Given that irq_work is executed at the end of IRQ handling, we can just
> reschedule the irq work when the condition is not met (or separate that out
> into its own irq_work). That way, I think we can avoid the global lock.
> 
I'll go poke at it some more, but I think it's not guaranteed that B actually
advances kick_sync if A keeps kicking. At least not if the handling is in HARD irqwork?
Or what would the separated out irq work do differently?


* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 22:26     ` Christian Loehle
@ 2026-03-17  8:23       ` Christian Loehle
  2026-03-17  9:15         ` Christian Loehle
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Loehle @ 2026-03-17  8:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

On 3/16/26 22:26, Christian Loehle wrote:
> On 3/16/26 17:46, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>>>  		 * task is picked subsequently. The latter is necessary to break
>>>  		 * the wait when $cpu is taken by a higher sched class.
>>>  		 */
>>> -		if (cpu != cpu_of(this_rq))
>>> +		if (cpu != this_cpu)
>>>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>>
>> Given that irq_work is executed at the end of IRQ handling, we can just
>> reschedule the irq work when the condition is not met (or separate that out
>> into its own irq_work). That way, I think we can avoid the global lock.
>>
> I'll go poke at it some more, but I think it's not guaranteed that B actually
> advances kick_sync if A keeps kicking. At least not if the handling is in HARD irqwork?
> Or what would the separated out irq work do differently?

So in my particular example I issue the SCX_KICK_WAIT in ops.enqueue(), which is fair, but
I don't think we can delay that call until we've advanced our local kick_sync, and if we
don't, we end up in the deadlock even if we e.g. separate out the retry (and make it lazy),
because then the local CPU can continuously issue new kicks (which still have to be
handled by the non-retry path) without advancing its own kick_sync.
The closest thing to that I can get working is separating out the SCX_KICK_WAIT handling
entirely and making it lazy. In practice, though, that would likely make the SCX_KICK_WAIT
latency a lot higher than with the global lock. Is that what you had in mind, or am I
missing something here?


* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-17  8:23       ` Christian Loehle
@ 2026-03-17  9:15         ` Christian Loehle
  0 siblings, 0 replies; 11+ messages in thread
From: Christian Loehle @ 2026-03-17  9:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
	mingo, peterz, shuah, dietmar.eggemann

[-- Attachment #1: Type: text/plain, Size: 1849 bytes --]

On 3/17/26 08:23, Christian Loehle wrote:
> On 3/16/26 22:26, Christian Loehle wrote:
>> On 3/16/26 17:46, Tejun Heo wrote:
>>> Hello,
>>>
>>> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>>>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>>>>  		 * task is picked subsequently. The latter is necessary to break
>>>>  		 * the wait when $cpu is taken by a higher sched class.
>>>>  		 */
>>>> -		if (cpu != cpu_of(this_rq))
>>>> +		if (cpu != this_cpu)
>>>>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
>>>
>>> Given that irq_work is executed at the end of IRQ handling, we can just
>>> reschedule the irq work when the condition is not met (or separate that out
>>> into its own irq_work). That way, I think we can avoid the global lock.
>>>
>> I'll go poke at it some more, but I think it's not guaranteed that B actually
>> advances kick_sync if A keeps kicking. At least not if the handling is in HARD irqwork?
>> Or what would the separated out irq work do differently?
> 
> So in my particular example I issue the SCX_KICK_WAIT in ops.enqueue(), which is fair, but
> I don't think we can delay that call until we've advanced our local kick_sync, and if we
> don't, we end up in the deadlock even if we e.g. separate out the retry (and make it lazy),
> because then the local CPU can continuously issue new kicks (which still have to be
> handled by the non-retry path) without advancing its own kick_sync.
> The closest thing to that I can get working is separating out the SCX_KICK_WAIT handling
> entirely and making it lazy. In practice, though, that would likely make the SCX_KICK_WAIT
> latency a lot higher than with the global lock. Is that what you had in mind, or am I
> missing something here?

Something like the attached, for completeness.

[-- Attachment #2: 0001-sched_ext-Prevent-SCX_KICK_WAIT-deadlocks-with-lazy-.patch --]
[-- Type: text/x-patch, Size: 9875 bytes --]

From c2e882b4c54680a57bc0817b3c9819ff5f3bbd21 Mon Sep 17 00:00:00 2001
From: Christian Loehle <christian.loehle@arm.com>
Date: Tue, 17 Mar 2026 09:07:49 +0000
Subject: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlocks with lazy irq_work

SCX_KICK_WAIT waits for the target CPU's kick_sync to change after
rescheduling it. If multiple CPUs form a wait cycle concurrently,
e.g. A waits for B, B for C and C for A, and the wait runs from
hard irq_work, each CPU can end up spinning in irq context waiting
for another CPU which cannot reschedule. No kick_sync advances and
the system deadlocks.

Deferring only the retry path is not enough. SCX_KICK_WAIT can be
issued from ops.enqueue(), so a CPU can keep arming fresh waits before
it goes through another SCX scheduling cycle and advances its own
kick_sync. If the initial wait handling still runs on hard irq_work,
a wait cycle can keep rearming without making progress.

Split the kick paths instead. Keep normal and idle kicks on the
existing hard irq_work so prompt rescheduling behavior is unchanged,
but route the full SCX_KICK_WAIT handling (i.e. initial kick, kick_sync
snapshot and retries) through a dedicated lazy irq_work.
This gives wait cycles a context that can yield and be retriggered
instead of pinning all participants in hard irq context, eliminating
the deadlock without serializing unrelated kicks (e.g. through a global
spinlock).

Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/ext.c   | 131 ++++++++++++++++++++++++++++++-------------
 kernel/sched/sched.h |   2 +
 2 files changed, 94 insertions(+), 39 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26a6ac2f8826..8d133cc9c5f4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4713,6 +4713,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
 		if (!cpumask_empty(rq->scx.cpus_to_wait))
 			dump_line(&ns, "  cpus_to_wait   : %*pb",
 				  cpumask_pr_args(rq->scx.cpus_to_wait));
+		if (!cpumask_empty(rq->scx.cpus_to_wait_kick))
+			dump_line(&ns, "  cpus_to_wait_kick: %*pb",
+				  cpumask_pr_args(rq->scx.cpus_to_wait_kick));
 
 		used = seq_buf_used(&ns);
 		if (SCX_HAS_OP(sch, dump_cpu)) {
@@ -5583,12 +5586,43 @@ static bool can_skip_idle_kick(struct rq *rq)
 	return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE);
 }
 
-static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
+static void kick_one_cpu(s32 cpu, struct rq *this_rq)
+{
+	struct rq *rq = cpu_rq(cpu);
+	struct scx_rq *this_scx = &this_rq->scx;
+	const struct sched_class *cur_class;
+	unsigned long flags;
+
+	raw_spin_rq_lock_irqsave(rq, flags);
+	cur_class = rq->curr->sched_class;
+
+	/*
+	 * During CPU hotplug, a CPU may depend on kicking itself to make
+	 * forward progress. Allow kicking self regardless of online state. If
+	 * @cpu is running a higher class task, we have no control over @cpu.
+	 * Skip kicking.
+	 */
+	if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
+	    !sched_class_above(cur_class, &ext_sched_class)) {
+		if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
+			if (cur_class == &ext_sched_class)
+				rq->curr->scx.slice = 0;
+			cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
+		}
+
+		resched_curr(rq);
+	} else {
+		cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
+	}
+
+	raw_spin_rq_unlock_irqrestore(rq, flags);
+}
+
+static void kick_one_cpu_wait(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct scx_rq *this_scx = &this_rq->scx;
 	const struct sched_class *cur_class;
-	bool should_wait = false;
 	unsigned long flags;
 
 	raw_spin_rq_lock_irqsave(rq, flags);
@@ -5609,12 +5643,10 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 		}
 
 		if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
-			if (cur_class == &ext_sched_class) {
+			if (cur_class == &ext_sched_class)
 				ksyncs[cpu] = rq->scx.kick_sync;
-				should_wait = true;
-			} else {
+			else
 				cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
-			}
 		}
 
 		resched_curr(rq);
@@ -5624,8 +5656,6 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 	}
 
 	raw_spin_rq_unlock_irqrestore(rq, flags);
-
-	return should_wait;
 }
 
 static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
@@ -5646,20 +5676,10 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 {
 	struct rq *this_rq = this_rq();
 	struct scx_rq *this_scx = &this_rq->scx;
-	struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
-	bool should_wait = false;
-	unsigned long *ksyncs;
 	s32 cpu;
 
-	if (unlikely(!ksyncs_pcpu)) {
-		pr_warn_once("kick_cpus_irq_workfn() called with NULL scx_kick_syncs");
-		return;
-	}
-
-	ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs;
-
 	for_each_cpu(cpu, this_scx->cpus_to_kick) {
-		should_wait |= kick_one_cpu(cpu, this_rq, ksyncs);
+		kick_one_cpu(cpu, this_rq);
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick);
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
 	}
@@ -5668,29 +5688,51 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		kick_one_cpu_if_idle(cpu, this_rq);
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
 	}
+}
 
-	if (!should_wait)
+static void kick_wait_irq_workfn(struct irq_work *irq_work)
+{
+	struct rq *this_rq = this_rq();
+	struct scx_rq *this_scx = &this_rq->scx;
+	struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
+	unsigned long *ksyncs;
+	s32 this_cpu = cpu_of(this_rq);
+	s32 cpu;
+
+	if (unlikely(!ksyncs_pcpu)) {
+		pr_warn_once("kick_wait_irq_workfn() called with NULL scx_kick_syncs");
 		return;
+	}
+
+	ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs;
+
+	for_each_cpu(cpu, this_scx->cpus_to_wait_kick) {
+		kick_one_cpu_wait(cpu, this_rq, ksyncs);
+		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait_kick);
+	}
 
 	for_each_cpu(cpu, this_scx->cpus_to_wait) {
 		unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync;
 
 		/*
-		 * Busy-wait until the task running at the time of kicking is no
-		 * longer running. This can be used to implement e.g. core
-		 * scheduling.
+		 * Check whether the target CPU has gone through a scheduling
+		 * cycle since SCX_KICK_WAIT was issued by testing kick_sync.
+		 * If the condition is already met, or this CPU kicked itself,
+		 * clear from the wait mask. Otherwise leave it set and
+		 * re-queue below: we yield and retry without ever blocking,
+		 * which eliminates the possibility of deadlock when multiple
+		 * CPUs issue SCX_KICK_WAIT targeting each other in a cycle.
 		 *
-		 * smp_cond_load_acquire() pairs with store_releases in
-		 * pick_task_scx() and put_prev_task_scx(). The former breaks
-		 * the wait if SCX's scheduling path is entered even if the same
-		 * task is picked subsequently. The latter is necessary to break
-		 * the wait when $cpu is taken by a higher sched class.
+		 * smp_load_acquire() pairs with the smp_store_release()s in
+		 * pick_task_scx() and put_prev_task_scx().
 		 */
-		if (cpu != cpu_of(this_rq))
-			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
-
-		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+		if (cpu == this_cpu ||
+		    smp_load_acquire(wait_kick_sync) != ksyncs[cpu])
+			cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
 	}
+
+	if (!cpumask_empty(this_scx->cpus_to_wait))
+		irq_work_queue(&this_rq->scx.kick_wait_irq_work);
 }
 
 /**
@@ -5794,8 +5836,10 @@ void __init init_sched_ext_class(void)
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
+		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait_kick, GFP_KERNEL, n));
 		rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn);
 		rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn);
+		rq->scx.kick_wait_irq_work = IRQ_WORK_INIT_LAZY(kick_wait_irq_workfn);
 
 		if (cpu_online(cpu))
 			cpu_rq(cpu)->scx.flags |= SCX_RQ_ONLINE;
@@ -6543,16 +6587,25 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 			raw_spin_rq_unlock(target_rq);
 		}
 		cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle);
+		irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
 	} else {
-		cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick);
-
-		if (flags & SCX_KICK_PREEMPT)
-			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
-		if (flags & SCX_KICK_WAIT)
+		if (flags & SCX_KICK_WAIT) {
 			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait);
-	}
+			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait_kick);
+
+			if (flags & SCX_KICK_PREEMPT)
+				cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
+
+			irq_work_queue(&this_rq->scx.kick_wait_irq_work);
+		} else {
+			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick);
+
+			if (flags & SCX_KICK_PREEMPT)
+				cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
 
-	irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
+			irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
+		}
+	}
 out:
 	local_irq_restore(irq_flags);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 43bbf0693cca..010c0821d230 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -805,11 +805,13 @@ struct scx_rq {
 	cpumask_var_t		cpus_to_kick_if_idle;
 	cpumask_var_t		cpus_to_preempt;
 	cpumask_var_t		cpus_to_wait;
+	cpumask_var_t		cpus_to_wait_kick;
 	unsigned long		kick_sync;
 	local_t			reenq_local_deferred;
 	struct balance_callback	deferred_bal_cb;
 	struct irq_work		deferred_irq_work;
 	struct irq_work		kick_cpus_irq_work;
+	struct irq_work		kick_wait_irq_work;
 	struct scx_dispatch_q	bypass_dsq;
 };
 #endif /* CONFIG_SCHED_CLASS_EXT */
-- 
2.34.1



* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
  2026-03-16 10:02 [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock Christian Loehle
  2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
  2026-03-16 10:02 ` [PATCH 2/2] sched_ext/selftests: Add SCX_KICK_WAIT cycle tests Christian Loehle
@ 2026-03-29  0:20 ` Tejun Heo
  2 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2026-03-29  0:20 UTC (permalink / raw)
  To: Christian Loehle
  Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
	mingo, peterz, shuah, dietmar.eggemann, emil

Hello,

I posted an alternative fix here:

  https://lore.kernel.org/r/20260329001856.835643-1-tj@kernel.org

Instead of serializing the kicks, it defers the wait to a balance callback
which can drop the rq lock and enable IRQs, avoiding the deadlock while
preserving the concurrent kick_wait semantics.

Thanks.

--
tejun


