* [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock
@ 2026-03-16 10:02 Christian Loehle
2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 10:02 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, linux-kselftest, tj, void, arighi, changwoo, mingo,
peterz, shuah, dietmar.eggemann, Christian Loehle
When using SCX_KICK_WAIT I noticed that it lacks any deadlock
prevention and will deadlock with no chance of sched_ext even ejecting
the BPF scheduler.
The BPF scheduler cannot impose any reasonable synchronisation itself,
short of strictly partitioning which CPUs are allowed to SCX_KICK_WAIT
which other CPUs.
Since SCX_KICK_WAIT seems to be used quite rarely, just synchronize
all SCX_KICK_WAIT globally and don't try to be clever about cycle
detection.
Also add a testcase that reproduces the issue.
Christian Loehle (2):
sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
sched_ext/selftests: Add SCX_KICK_WAIT cycle tests
kernel/sched/ext.c | 45 +++-
tools/testing/selftests/sched_ext/Makefile | 1 +
.../selftests/sched_ext/wait_kick_cycle.bpf.c | 70 ++++++
.../selftests/sched_ext/wait_kick_cycle.c | 223 ++++++++++++++++++
4 files changed, 337 insertions(+), 2 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.c
--
2.34.1
^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
2026-03-16 10:02 [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock Christian Loehle
@ 2026-03-16 10:02 ` Christian Loehle
2026-03-16 10:49 ` Andrea Righi
2026-03-16 17:46 ` Tejun Heo
2026-03-16 10:02 ` [PATCH 2/2] sched_ext/selftests: Add SCX_KICK_WAIT cycle tests Christian Loehle
2026-03-29  0:20 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Tejun Heo
2 siblings, 2 replies; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 10:02 UTC (permalink / raw)
To: sched-ext
Cc: linux-kernel, linux-kselftest, tj, void, arighi, changwoo, mingo,
peterz, shuah, dietmar.eggemann, Christian Loehle

SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using
smp_cond_load_acquire() until the target CPU's current SCX task has been
context-switched out (its kick_sync counter advanced).

If multiple CPUs each issue SCX_KICK_WAIT targeting one another
concurrently, e.g. CPU A waits for CPU B, B waits for CPU C, C waits for
CPU A, all CPUs can end up wedged inside smp_cond_load_acquire()
simultaneously. Because each victim CPU is spinning in hardirq/irq_work
context, it cannot reschedule, so no kick_sync counter ever advances and
the system deadlocks.

Fix this by serializing access to the wait loop behind a global raw
spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the
wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to
acquire the lock records itself in scx_kick_wait_pending and returns.
When the active waiter finishes and releases the lock, it replays the
pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring
no wait request is silently dropped.

This is deliberately a coarse serialization: multiple simultaneous wait
operations now run sequentially, increasing latency. In exchange,
deadlocks are impossible regardless of the cycle length (A->B->C->...->A).
Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale
bits left by a CPU that deferred just as the scheduler exited are reset
before the next scheduler instance loads.

Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT")
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 26a6ac2f8826..b63ae13d0486 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -89,6 +89,19 @@ struct scx_kick_syncs {
 
 static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs);
 
+/*
+ * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles.
+ * Callers failing to acquire @scx_kick_wait_lock defer by recording
+ * themselves in @scx_kick_wait_pending and are retriggered when the active
+ * waiter completes.
+ *
+ * Lock ordering: @scx_kick_wait_lock is always acquired before
+ * @scx_kick_wait_pending_lock; the two are never taken in the opposite order.
+ */
+static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock);
+static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock);
+static cpumask_t scx_kick_wait_pending;
+
 /*
  * Direct dispatch marker.
  *
@@ -4279,6 +4292,13 @@ static void free_kick_syncs(void)
 		if (to_free)
 			kvfree_rcu(to_free, rcu);
 	}
+
+	/*
+	 * Clear any CPUs that were waiting for the lock when the scheduler
+	 * exited. Their irq_work has already returned so no in-flight
+	 * waiter can observe the stale bits on the next enable.
+	 */
+	cpumask_clear(&scx_kick_wait_pending);
 }
 
 static void scx_disable_workfn(struct kthread_work *work)
@@ -5647,8 +5667,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 	struct rq *this_rq = this_rq();
 	struct scx_rq *this_scx = &this_rq->scx;
 	struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
-	bool should_wait = false;
+	bool should_wait = !cpumask_empty(this_scx->cpus_to_wait);
 	unsigned long *ksyncs;
+	s32 this_cpu = cpu_of(this_rq);
 	s32 cpu;
 
 	if (unlikely(!ksyncs_pcpu)) {
@@ -5672,6 +5693,17 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 	if (!should_wait)
 		return;
 
+	if (!raw_spin_trylock(&scx_kick_wait_lock)) {
+		raw_spin_lock(&scx_kick_wait_pending_lock);
+		cpumask_set_cpu(this_cpu, &scx_kick_wait_pending);
+		raw_spin_unlock(&scx_kick_wait_pending_lock);
+		return;
+	}
+
+	raw_spin_lock(&scx_kick_wait_pending_lock);
+	cpumask_clear_cpu(this_cpu, &scx_kick_wait_pending);
+	raw_spin_unlock(&scx_kick_wait_pending_lock);
+
 	for_each_cpu(cpu, this_scx->cpus_to_wait) {
 		unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync;
 
@@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		 * task is picked subsequently. The latter is necessary to break
 		 * the wait when $cpu is taken by a higher sched class.
 		 */
-		if (cpu != cpu_of(this_rq))
+		if (cpu != this_cpu)
 			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
 
 		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
 	}
+
+	raw_spin_unlock(&scx_kick_wait_lock);
+
+	raw_spin_lock(&scx_kick_wait_pending_lock);
+	for_each_cpu(cpu, &scx_kick_wait_pending) {
+		cpumask_clear_cpu(cpu, &scx_kick_wait_pending);
+		irq_work_queue(&cpu_rq(cpu)->scx.kick_cpus_irq_work);
+	}
+	raw_spin_unlock(&scx_kick_wait_pending_lock);
 }
 
 /**
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization 2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle @ 2026-03-16 10:49 ` Andrea Righi 2026-03-16 11:12 ` Christian Loehle 2026-03-16 17:46 ` Tejun Heo 1 sibling, 1 reply; 11+ messages in thread From: Andrea Righi @ 2026-03-16 10:49 UTC (permalink / raw) To: Christian Loehle Cc: sched-ext, linux-kernel, linux-kselftest, tj, void, changwoo, mingo, peterz, shuah, dietmar.eggemann Hi Christian, On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote: > SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using > smp_cond_load_acquire() until the target CPU's current SCX task has been > context-switched out (its kick_sync counter advanced). > > If multiple CPUs each issue SCX_KICK_WAIT targeting one another > concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for > CPU A — all CPUs can end up wedged inside smp_cond_load_acquire() > simultaneously. Because each victim CPU is spinning in hardirq/irq_work > context, it cannot reschedule, so no kick_sync counter ever advances and > the system deadlocks. > > Fix this by serializing access to the wait loop behind a global raw > spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the > wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to > acquire the lock records itself in scx_kick_wait_pending and returns. > When the active waiter finishes and releases the lock, it replays the > pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring > no wait request is silently dropped. > > This is deliberately a coarse serialization: multiple simultaneous wait > operations now run sequentially, increasing latency. In exchange, > deadlocks are impossible regardless of the cycle length (A->B->C->...->A). 
> > Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale > bits left by a CPU that deferred just as the scheduler exited are reset > before the next scheduler instance loads. > > Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT") > Signed-off-by: Christian Loehle <christian.loehle@arm.com> > --- > kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 43 insertions(+), 2 deletions(-) > > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index 26a6ac2f8826..b63ae13d0486 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -89,6 +89,19 @@ struct scx_kick_syncs { > > static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs); > > +/* > + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles. > + * Callers failing to acquire @scx_kick_wait_lock defer by recording > + * themselves in @scx_kick_wait_pending and are retriggered when the active > + * waiter completes. > + * > + * Lock ordering: @scx_kick_wait_lock is always acquired before > + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order. > + */ > +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock); > +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock); > +static cpumask_t scx_kick_wait_pending; > + > /* > * Direct dispatch marker. > * > @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void) > if (to_free) > kvfree_rcu(to_free, rcu); > } > + > + /* > + * Clear any CPUs that were waiting for the lock when the scheduler > + * exited. Their irq_work has already returned so no in-flight > + * waiter can observe the stale bits on the next enable. > + */ > + cpumask_clear(&scx_kick_wait_pending); Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make sure we're not racing with with cpumask_set_cpu()/cpumask_clear_cpu()? Probably it's not that relevant at this point, but I'd keep the locking for correctness. 
Thanks, -Andrea > } > > static void scx_disable_workfn(struct kthread_work *work) > @@ -5647,8 +5667,9 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) > struct rq *this_rq = this_rq(); > struct scx_rq *this_scx = &this_rq->scx; > struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs); > - bool should_wait = false; > + bool should_wait = !cpumask_empty(this_scx->cpus_to_wait); > unsigned long *ksyncs; > + s32 this_cpu = cpu_of(this_rq); > s32 cpu; > > if (unlikely(!ksyncs_pcpu)) { > @@ -5672,6 +5693,17 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) > if (!should_wait) > return; > > + if (!raw_spin_trylock(&scx_kick_wait_lock)) { > + raw_spin_lock(&scx_kick_wait_pending_lock); > + cpumask_set_cpu(this_cpu, &scx_kick_wait_pending); > + raw_spin_unlock(&scx_kick_wait_pending_lock); > + return; > + } > + > + raw_spin_lock(&scx_kick_wait_pending_lock); > + cpumask_clear_cpu(this_cpu, &scx_kick_wait_pending); > + raw_spin_unlock(&scx_kick_wait_pending_lock); > + > for_each_cpu(cpu, this_scx->cpus_to_wait) { > unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync; > > @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) > * task is picked subsequently. The latter is necessary to break > * the wait when $cpu is taken by a higher sched class. > */ > - if (cpu != cpu_of(this_rq)) > + if (cpu != this_cpu) > smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]); > > cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); > } > + > + raw_spin_unlock(&scx_kick_wait_lock); > + > + raw_spin_lock(&scx_kick_wait_pending_lock); > + for_each_cpu(cpu, &scx_kick_wait_pending) { > + cpumask_clear_cpu(cpu, &scx_kick_wait_pending); > + irq_work_queue(&cpu_rq(cpu)->scx.kick_cpus_irq_work); > + } > + raw_spin_unlock(&scx_kick_wait_pending_lock); > } > > /** > -- > 2.34.1 > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization 2026-03-16 10:49 ` Andrea Righi @ 2026-03-16 11:12 ` Christian Loehle 2026-03-16 14:42 ` Andrea Righi 0 siblings, 1 reply; 11+ messages in thread From: Christian Loehle @ 2026-03-16 11:12 UTC (permalink / raw) To: Andrea Righi Cc: sched-ext, linux-kernel, linux-kselftest, tj, void, changwoo, mingo, peterz, shuah, dietmar.eggemann On 3/16/26 10:49, Andrea Righi wrote: > Hi Christian, > > On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote: >> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using >> smp_cond_load_acquire() until the target CPU's current SCX task has been >> context-switched out (its kick_sync counter advanced). >> >> If multiple CPUs each issue SCX_KICK_WAIT targeting one another >> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for >> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire() >> simultaneously. Because each victim CPU is spinning in hardirq/irq_work >> context, it cannot reschedule, so no kick_sync counter ever advances and >> the system deadlocks. >> >> Fix this by serializing access to the wait loop behind a global raw >> spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the >> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to >> acquire the lock records itself in scx_kick_wait_pending and returns. >> When the active waiter finishes and releases the lock, it replays the >> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring >> no wait request is silently dropped. >> >> This is deliberately a coarse serialization: multiple simultaneous wait >> operations now run sequentially, increasing latency. In exchange, >> deadlocks are impossible regardless of the cycle length (A->B->C->...->A). 
>> >> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale >> bits left by a CPU that deferred just as the scheduler exited are reset >> before the next scheduler instance loads. >> >> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT") >> Signed-off-by: Christian Loehle <christian.loehle@arm.com> >> --- >> kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- >> 1 file changed, 43 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c >> index 26a6ac2f8826..b63ae13d0486 100644 >> --- a/kernel/sched/ext.c >> +++ b/kernel/sched/ext.c >> @@ -89,6 +89,19 @@ struct scx_kick_syncs { >> >> static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs); >> >> +/* >> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles. >> + * Callers failing to acquire @scx_kick_wait_lock defer by recording >> + * themselves in @scx_kick_wait_pending and are retriggered when the active >> + * waiter completes. >> + * >> + * Lock ordering: @scx_kick_wait_lock is always acquired before >> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order. >> + */ >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock); >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock); >> +static cpumask_t scx_kick_wait_pending; >> + >> /* >> * Direct dispatch marker. >> * >> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void) >> if (to_free) >> kvfree_rcu(to_free, rcu); >> } >> + >> + /* >> + * Clear any CPUs that were waiting for the lock when the scheduler >> + * exited. Their irq_work has already returned so no in-flight >> + * waiter can observe the stale bits on the next enable. >> + */ >> + cpumask_clear(&scx_kick_wait_pending); > > Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make > sure we're not racing with with cpumask_set_cpu()/cpumask_clear_cpu()? > Probably it's not that relevant at this point, but I'd keep the locking for > correctness. 
Of course, thanks. Noted for v2!

Are you fine with the approach, i.e. hitting it with the sledgehammer of
global serialization?
I have something more complex in mind too, but yeah, we'd need to at least
either let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs
and introduce a whole lot of infra, which seems a bit overkill for an
apparently barely used interface and would also be nasty to backport.

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization 2026-03-16 11:12 ` Christian Loehle @ 2026-03-16 14:42 ` Andrea Righi 0 siblings, 0 replies; 11+ messages in thread From: Andrea Righi @ 2026-03-16 14:42 UTC (permalink / raw) To: Christian Loehle Cc: sched-ext, linux-kernel, linux-kselftest, tj, void, changwoo, mingo, peterz, shuah, dietmar.eggemann On Mon, Mar 16, 2026 at 11:12:43AM +0000, Christian Loehle wrote: > On 3/16/26 10:49, Andrea Righi wrote: > > Hi Christian, > > > > On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote: > >> SCX_KICK_WAIT causes kick_cpus_irq_workfn() to busy-wait using > >> smp_cond_load_acquire() until the target CPU's current SCX task has been > >> context-switched out (its kick_sync counter advanced). > >> > >> If multiple CPUs each issue SCX_KICK_WAIT targeting one another > >> concurrently — e.g. CPU A waits for CPU B, B waits for CPU C, C waits for > >> CPU A — all CPUs can end up wedged inside smp_cond_load_acquire() > >> simultaneously. Because each victim CPU is spinning in hardirq/irq_work > >> context, it cannot reschedule, so no kick_sync counter ever advances and > >> the system deadlocks. > >> > >> Fix this by serializing access to the wait loop behind a global raw > >> spinlock (scx_kick_wait_lock). Only one CPU at a time may execute the > >> wait loop; any other CPU that has SCX_KICK_WAIT work to do and fails to > >> acquire the lock records itself in scx_kick_wait_pending and returns. > >> When the active waiter finishes and releases the lock, it replays the > >> pending set by re-queuing each pending CPU's kick_cpus_irq_work, ensuring > >> no wait request is silently dropped. > >> > >> This is deliberately a coarse serialization: multiple simultaneous wait > >> operations now run sequentially, increasing latency. In exchange, > >> deadlocks are impossible regardless of the cycle length (A->B->C->...->A). 
> >> > >> Also clear scx_kick_wait_pending in free_kick_syncs() so that any stale > >> bits left by a CPU that deferred just as the scheduler exited are reset > >> before the next scheduler instance loads. > >> > >> Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT") > >> Signed-off-by: Christian Loehle <christian.loehle@arm.com> > >> --- > >> kernel/sched/ext.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- > >> 1 file changed, 43 insertions(+), 2 deletions(-) > >> > >> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > >> index 26a6ac2f8826..b63ae13d0486 100644 > >> --- a/kernel/sched/ext.c > >> +++ b/kernel/sched/ext.c > >> @@ -89,6 +89,19 @@ struct scx_kick_syncs { > >> > >> static DEFINE_PER_CPU(struct scx_kick_syncs __rcu *, scx_kick_syncs); > >> > >> +/* > >> + * Serialize %SCX_KICK_WAIT processing across CPUs to avoid wait cycles. > >> + * Callers failing to acquire @scx_kick_wait_lock defer by recording > >> + * themselves in @scx_kick_wait_pending and are retriggered when the active > >> + * waiter completes. > >> + * > >> + * Lock ordering: @scx_kick_wait_lock is always acquired before > >> + * @scx_kick_wait_pending_lock; the two are never taken in the opposite order. > >> + */ > >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_lock); > >> +static DEFINE_RAW_SPINLOCK(scx_kick_wait_pending_lock); > >> +static cpumask_t scx_kick_wait_pending; > >> + > >> /* > >> * Direct dispatch marker. > >> * > >> @@ -4279,6 +4292,13 @@ static void free_kick_syncs(void) > >> if (to_free) > >> kvfree_rcu(to_free, rcu); > >> } > >> + > >> + /* > >> + * Clear any CPUs that were waiting for the lock when the scheduler > >> + * exited. Their irq_work has already returned so no in-flight > >> + * waiter can observe the stale bits on the next enable. > >> + */ > >> + cpumask_clear(&scx_kick_wait_pending); > > > > Do we need a raw_spin_lock/unlock(&scx_kick_wait_pending_lock) here to make > > sure we're not racing with with cpumask_set_cpu()/cpumask_clear_cpu()? 
> > Probably it's not that relevant at this point, but I'd keep the locking for
> > correctness.
> 
> Of course, thanks. Noted for v2!
> 
> Are you fine with the approach, i.e. hitting it with the sledge hammer of global
> serialization?
> I have something more complex in mind too, but yeah, we'd need to at least either
> let scx_bpf_kick_cpu() fail / -ERETRY or restrict kicking/kicked CPUs and introduce
> a whole lot of infra, which seems a bit overkill for a apparently barely used
> interface and also would be nasty to backport.

Yes, the current approach looks reasonable to me, I think the potential
latency increase (assuming there's any noticeable increase) is totally
acceptable in order to fix the deadlock.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle
2026-03-16 10:49 ` Andrea Righi
@ 2026-03-16 17:46 ` Tejun Heo
2026-03-16 22:26 ` Christian Loehle
1 sibling, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2026-03-16 17:46 UTC (permalink / raw)
To: Christian Loehle
Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
mingo, peterz, shuah, dietmar.eggemann

Hello,

On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  		 * task is picked subsequently. The latter is necessary to break
>  		 * the wait when $cpu is taken by a higher sched class.
>  		 */
> -		if (cpu != cpu_of(this_rq))
> +		if (cpu != this_cpu)
>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);

Given that irq_work is executed at the end of IRQ handling, we can just
reschedule the irq work when the condition is not met (or separate that out
into its own irq_work). That way, I think we can avoid the global lock.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization
2026-03-16 17:46 ` Tejun Heo
@ 2026-03-16 22:26 ` Christian Loehle
2026-03-17 8:23 ` Christian Loehle
0 siblings, 1 reply; 11+ messages in thread
From: Christian Loehle @ 2026-03-16 22:26 UTC (permalink / raw)
To: Tejun Heo
Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo,
mingo, peterz, shuah, dietmar.eggemann

On 3/16/26 17:46, Tejun Heo wrote:
> Hello,
> 
> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote:
>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>>  		 * task is picked subsequently. The latter is necessary to break
>>  		 * the wait when $cpu is taken by a higher sched class.
>>  		 */
>> -		if (cpu != cpu_of(this_rq))
>> +		if (cpu != this_cpu)
>>  			smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]);
> 
> Given that irq_work is executed at the end of IRQ handling, we can just
> reschedule the irq work when the condition is not met (or separate that out
> into its own irq_work). That way, I think we can avoid the global lock.
> 
I'll go poke at it some more, but I think it's not guaranteed that B actually
advances kick_sync if A keeps kicking. At least not if the handling is in
HARD irqwork?
Or what would the separated out irq work do differently?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization 2026-03-16 22:26 ` Christian Loehle @ 2026-03-17 8:23 ` Christian Loehle 2026-03-17 9:15 ` Christian Loehle 0 siblings, 1 reply; 11+ messages in thread From: Christian Loehle @ 2026-03-17 8:23 UTC (permalink / raw) To: Tejun Heo Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo, mingo, peterz, shuah, dietmar.eggemann On 3/16/26 22:26, Christian Loehle wrote: > On 3/16/26 17:46, Tejun Heo wrote: >> Hello, >> >> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote: >>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) >>> * task is picked subsequently. The latter is necessary to break >>> * the wait when $cpu is taken by a higher sched class. >>> */ >>> - if (cpu != cpu_of(this_rq)) >>> + if (cpu != this_cpu) >>> smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]); >> >> Given that irq_work is executed at the end of IRQ handling, we can just >> reschedule the irq work when the condition is not met (or separate that out >> into its own irq_work). That way, I think we can avoid the global lock. >> > I'll go poke at it some more, but I think it's not guaranteed that B actually > advances kick_sync if A keeps kicking. At least not if the handling is in HARD irqwork? > Or what would the separated out irq work do differently? So in my particular example I do the SCX_KICK_WAIT in ops.enqueue(), which is fair, but I don't think we can delay calling that until we've advanced our local kick_sync and if we don't we end up in the deadlock, even if e.g. we separate out the retry (and make that lazy), because then the local CPU is able to continuously issue new kicks (which will have to be handled by the non-retry path) without advancing it's own kick_sync. The closest thing to that I can get working is separating out the SCX_KICK_WAIT entirely and make that lazy. 
In practice though that would most likely make the SCX_KICK_WAIT latency
a lot higher than with the global lock, is that what you had in mind?
Or am I missing something here?

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization 2026-03-17 8:23 ` Christian Loehle @ 2026-03-17 9:15 ` Christian Loehle 0 siblings, 0 replies; 11+ messages in thread From: Christian Loehle @ 2026-03-17 9:15 UTC (permalink / raw) To: Tejun Heo Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo, mingo, peterz, shuah, dietmar.eggemann [-- Attachment #1: Type: text/plain, Size: 1849 bytes --] On 3/17/26 08:23, Christian Loehle wrote: > On 3/16/26 22:26, Christian Loehle wrote: >> On 3/16/26 17:46, Tejun Heo wrote: >>> Hello, >>> >>> On Mon, Mar 16, 2026 at 10:02:48AM +0000, Christian Loehle wrote: >>>> @@ -5686,11 +5718,20 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) >>>> * task is picked subsequently. The latter is necessary to break >>>> * the wait when $cpu is taken by a higher sched class. >>>> */ >>>> - if (cpu != cpu_of(this_rq)) >>>> + if (cpu != this_cpu) >>>> smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]); >>> >>> Given that irq_work is executed at the end of IRQ handling, we can just >>> reschedule the irq work when the condition is not met (or separate that out >>> into its own irq_work). That way, I think we can avoid the global lock. >>> >> I'll go poke at it some more, but I think it's not guaranteed that B actually >> advances kick_sync if A keeps kicking. At least not if the handling is in HARD irqwork? >> Or what would the separated out irq work do differently? > > So in my particular example I do the SCX_KICK_WAIT in ops.enqueue(), which is fair, but > I don't think we can delay calling that until we've advanced our local kick_sync and if we > don't we end up in the deadlock, even if e.g. we separate out the retry (and make that lazy), > because then the local CPU is able to continuously issue new kicks (which will have to > be handled by the non-retry path) without advancing it's own kick_sync. 
> The closest thing to that I can get working is separating out the SCX_KICK_WAIT entirely and > make that lazy. In practice though that would realistically make the SCX_KICK_WAIT latency > most likely a lot higher than with the global lock, is that what you had in mind? > Or am I missing something here? Something like the attached, for completeness [-- Attachment #2: 0001-sched_ext-Prevent-SCX_KICK_WAIT-deadlocks-with-lazy-.patch --] [-- Type: text/x-patch, Size: 9875 bytes --] From c2e882b4c54680a57bc0817b3c9819ff5f3bbd21 Mon Sep 17 00:00:00 2001 From: Christian Loehle <christian.loehle@arm.com> Date: Tue, 17 Mar 2026 09:07:49 +0000 Subject: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlocks with lazy irq_work SCX_KICK_WAIT waits for the target CPU's kick_sync to change after rescheduling it. If multiple CPUs form a wait cycle concurrently, e.g. A waits for B, B for C and C for A, and the wait runs from hard irq_work, each CPU can end up spinning in irq context waiting for another CPU which cannot reschedule. No kick_sync advances and the system deadlocks. Deferring only the retry path is not enough. SCX_KICK_WAIT can be issued from ops.enqueue(), so a CPU can keep arming fresh waits before it goes through another SCX scheduling cycle and advances its own kick_sync. If the initial wait handling still runs on hard irq_work, a wait cycle can keep rearming without making progress. Split the kick paths instead. Keep normal and idle kicks on the existing hard irq_work so prompt rescheduling behavior is unchanged, but route the full SCX_KICK_WAIT handling (i.e. initial kick, kick_sync snapshot and retries) through a dedicated lazy irq_work. This gives wait cycles a context that can yield and be retriggered instead of pinning all participants in hard irq context, eliminating the deadlock without serializing unrelated kicks (e.g. through a global spinlock). 
Fixes: 90e55164dad4 ("sched_ext: Implement SCX_KICK_WAIT") Signed-off-by: Christian Loehle <christian.loehle@arm.com> --- kernel/sched/ext.c | 131 ++++++++++++++++++++++++++++++------------- kernel/sched/sched.h | 2 + 2 files changed, 94 insertions(+), 39 deletions(-) diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index 26a6ac2f8826..8d133cc9c5f4 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -4713,6 +4713,9 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len) if (!cpumask_empty(rq->scx.cpus_to_wait)) dump_line(&ns, " cpus_to_wait : %*pb", cpumask_pr_args(rq->scx.cpus_to_wait)); + if (!cpumask_empty(rq->scx.cpus_to_wait_kick)) + dump_line(&ns, " cpus_to_wait_kick: %*pb", + cpumask_pr_args(rq->scx.cpus_to_wait_kick)); used = seq_buf_used(&ns); if (SCX_HAS_OP(sch, dump_cpu)) { @@ -5583,12 +5586,43 @@ static bool can_skip_idle_kick(struct rq *rq) return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE); } -static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs) +static void kick_one_cpu(s32 cpu, struct rq *this_rq) +{ + struct rq *rq = cpu_rq(cpu); + struct scx_rq *this_scx = &this_rq->scx; + const struct sched_class *cur_class; + unsigned long flags; + + raw_spin_rq_lock_irqsave(rq, flags); + cur_class = rq->curr->sched_class; + + /* + * During CPU hotplug, a CPU may depend on kicking itself to make + * forward progress. Allow kicking self regardless of online state. If + * @cpu is running a higher class task, we have no control over @cpu. + * Skip kicking. 
+ */ + if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) && + !sched_class_above(cur_class, &ext_sched_class)) { + if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) { + if (cur_class == &ext_sched_class) + rq->curr->scx.slice = 0; + cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); + } + + resched_curr(rq); + } else { + cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt); + } + + raw_spin_rq_unlock_irqrestore(rq, flags); +} + +static void kick_one_cpu_wait(s32 cpu, struct rq *this_rq, unsigned long *ksyncs) { struct rq *rq = cpu_rq(cpu); struct scx_rq *this_scx = &this_rq->scx; const struct sched_class *cur_class; - bool should_wait = false; unsigned long flags; raw_spin_rq_lock_irqsave(rq, flags); @@ -5609,12 +5643,10 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs) } if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) { - if (cur_class == &ext_sched_class) { + if (cur_class == &ext_sched_class) ksyncs[cpu] = rq->scx.kick_sync; - should_wait = true; - } else { + else cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); - } } resched_curr(rq); @@ -5624,8 +5656,6 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs) } raw_spin_rq_unlock_irqrestore(rq, flags); - - return should_wait; } static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq) @@ -5646,20 +5676,10 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) { struct rq *this_rq = this_rq(); struct scx_rq *this_scx = &this_rq->scx; - struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs); - bool should_wait = false; - unsigned long *ksyncs; s32 cpu; - if (unlikely(!ksyncs_pcpu)) { - pr_warn_once("kick_cpus_irq_workfn() called with NULL scx_kick_syncs"); - return; - } - - ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs; - for_each_cpu(cpu, this_scx->cpus_to_kick) { - should_wait |= kick_one_cpu(cpu, this_rq, ksyncs); + kick_one_cpu(cpu, this_rq); cpumask_clear_cpu(cpu, this_scx->cpus_to_kick); cpumask_clear_cpu(cpu, 
this_scx->cpus_to_kick_if_idle); } @@ -5668,29 +5688,51 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work) kick_one_cpu_if_idle(cpu, this_rq); cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle); } +} - if (!should_wait) +static void kick_wait_irq_workfn(struct irq_work *irq_work) +{ + struct rq *this_rq = this_rq(); + struct scx_rq *this_scx = &this_rq->scx; + struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs); + unsigned long *ksyncs; + s32 this_cpu = cpu_of(this_rq); + s32 cpu; + + if (unlikely(!ksyncs_pcpu)) { + pr_warn_once("kick_wait_irq_workfn() called with NULL scx_kick_syncs"); return; + } + + ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs; + + for_each_cpu(cpu, this_scx->cpus_to_wait_kick) { + kick_one_cpu_wait(cpu, this_rq, ksyncs); + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait_kick); + } for_each_cpu(cpu, this_scx->cpus_to_wait) { unsigned long *wait_kick_sync = &cpu_rq(cpu)->scx.kick_sync; /* - * Busy-wait until the task running at the time of kicking is no - * longer running. This can be used to implement e.g. core - * scheduling. + * Check whether the target CPU has gone through a scheduling + * cycle since SCX_KICK_WAIT was issued by testing kick_sync. + * If the condition is already met, or this CPU kicked itself, + * clear from the wait mask. Otherwise leave it set and + * re-queue below: we yield and retry without ever blocking, + * which eliminates the possibility of deadlock when multiple + * CPUs issue SCX_KICK_WAIT targeting each other in a cycle. * - * smp_cond_load_acquire() pairs with store_releases in - * pick_task_scx() and put_prev_task_scx(). The former breaks - * the wait if SCX's scheduling path is entered even if the same - * task is picked subsequently. The latter is necessary to break - * the wait when $cpu is taken by a higher sched class. + * smp_load_acquire() pairs with the smp_store_release() in + * pick_task_scx() and put_prev_task_scx(). 
*/ - if (cpu != cpu_of(this_rq)) - smp_cond_load_acquire(wait_kick_sync, VAL != ksyncs[cpu]); - - cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); + if (cpu == this_cpu || + smp_load_acquire(wait_kick_sync) != ksyncs[cpu]) + cpumask_clear_cpu(cpu, this_scx->cpus_to_wait); } + + if (!cpumask_empty(this_scx->cpus_to_wait)) + irq_work_queue(&this_rq->scx.kick_wait_irq_work); } /** @@ -5794,8 +5836,10 @@ void __init init_sched_ext_class(void) BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n)); BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n)); BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n)); + BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait_kick, GFP_KERNEL, n)); rq->scx.deferred_irq_work = IRQ_WORK_INIT_HARD(deferred_irq_workfn); rq->scx.kick_cpus_irq_work = IRQ_WORK_INIT_HARD(kick_cpus_irq_workfn); + rq->scx.kick_wait_irq_work = IRQ_WORK_INIT_LAZY(kick_wait_irq_workfn); if (cpu_online(cpu)) cpu_rq(cpu)->scx.flags |= SCX_RQ_ONLINE; @@ -6543,16 +6587,25 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags) raw_spin_rq_unlock(target_rq); } cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle); + irq_work_queue(&this_rq->scx.kick_cpus_irq_work); } else { - cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick); - - if (flags & SCX_KICK_PREEMPT) - cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt); - if (flags & SCX_KICK_WAIT) + if (flags & SCX_KICK_WAIT) { cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait); - } + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait_kick); + + if (flags & SCX_KICK_PREEMPT) + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt); + + irq_work_queue(&this_rq->scx.kick_wait_irq_work); + } else { + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick); + + if (flags & SCX_KICK_PREEMPT) + cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt); - irq_work_queue(&this_rq->scx.kick_cpus_irq_work); + irq_work_queue(&this_rq->scx.kick_cpus_irq_work); + } + } out: 
local_irq_restore(irq_flags); } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 43bbf0693cca..010c0821d230 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -805,11 +805,13 @@ struct scx_rq { cpumask_var_t cpus_to_kick_if_idle; cpumask_var_t cpus_to_preempt; cpumask_var_t cpus_to_wait; + cpumask_var_t cpus_to_wait_kick; unsigned long kick_sync; local_t reenq_local_deferred; struct balance_callback deferred_bal_cb; struct irq_work deferred_irq_work; struct irq_work kick_cpus_irq_work; + struct irq_work kick_wait_irq_work; struct scx_dispatch_q bypass_dsq; }; #endif /* CONFIG_SCHED_CLASS_EXT */ -- 2.34.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/2] sched_ext/selftests: Add SCX_KICK_WAIT cycle tests 2026-03-16 10:02 [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock Christian Loehle 2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle @ 2026-03-16 10:02 ` Christian Loehle 2026-03-29 0:20 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Tejun Heo 2 siblings, 0 replies; 11+ messages in thread From: Christian Loehle @ 2026-03-16 10:02 UTC (permalink / raw) To: sched-ext Cc: linux-kernel, linux-kselftest, tj, void, arighi, changwoo, mingo, peterz, shuah, dietmar.eggemann, Christian Loehle Add wait_kick_cycle, a test stressing an SCX_KICK_WAIT cycle between three CPUs by calling SCX_KICK_WAIT between them to test if sched_ext prevents a deadlock. Note: hangs on unfixed kernels Signed-off-by: Christian Loehle <christian.loehle@arm.com> --- tools/testing/selftests/sched_ext/Makefile | 1 + .../selftests/sched_ext/wait_kick_cycle.bpf.c | 70 ++++++ .../selftests/sched_ext/wait_kick_cycle.c | 223 ++++++++++++++++++ 3 files changed, 294 insertions(+) create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c create mode 100644 tools/testing/selftests/sched_ext/wait_kick_cycle.c diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile index 006300ac6dff..0b5b527265f7 100644 --- a/tools/testing/selftests/sched_ext/Makefile +++ b/tools/testing/selftests/sched_ext/Makefile @@ -188,6 +188,7 @@ auto-test-targets := \ rt_stall \ test_example \ total_bw \ + wait_kick_cycle \ testcase-targets := $(addsuffix .o,$(addprefix $(SCXOBJ_DIR)/,$(auto-test-targets))) diff --git a/tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c b/tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c new file mode 100644 index 000000000000..c53cda86ec75 --- /dev/null +++ b/tools/testing/selftests/sched_ext/wait_kick_cycle.bpf.c @@ -0,0 +1,70 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * 
Copyright (c) 2026 Christian Loehle <christian.loehle@arm.com> + * + * Stress concurrent SCX_KICK_WAIT calls to validate forward progress. + * + * Three CPUs are designated from userspace. Every enqueue from one of the + * three CPUs kicks the next CPU in the ring with SCX_KICK_WAIT, creating a + * persistent A -> B -> C -> A wait cycle pressure. + */ + +#include <scx/common.bpf.h> + +char _license[] SEC("license") = "GPL"; + +const volatile s32 test_cpu_a; +const volatile s32 test_cpu_b; +const volatile s32 test_cpu_c; + +u64 nr_enqueues; +u64 nr_wait_kicks; + +UEI_DEFINE(uei); + +static s32 target_cpu(s32 cpu) +{ + if (cpu == test_cpu_a) + return test_cpu_b; + if (cpu == test_cpu_b) + return test_cpu_c; + if (cpu == test_cpu_c) + return test_cpu_a; + return -1; +} + +void BPF_STRUCT_OPS(wait_kick_cycle_enqueue, struct task_struct *p, u64 enq_flags) +{ + s32 this_cpu = bpf_get_smp_processor_id(); + s32 tgt; + + __sync_fetch_and_add(&nr_enqueues, 1); + + if (p->flags & PF_KTHREAD) { + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_INF, + enq_flags | SCX_ENQ_PREEMPT); + return; + } + + scx_bpf_dsq_insert(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags); + + tgt = target_cpu(this_cpu); + if (tgt < 0 || tgt == this_cpu) + return; + + __sync_fetch_and_add(&nr_wait_kicks, 1); + scx_bpf_kick_cpu(tgt, SCX_KICK_WAIT); +} + +void BPF_STRUCT_OPS(wait_kick_cycle_exit, struct scx_exit_info *ei) +{ + UEI_RECORD(uei, ei); +} + +SEC(".struct_ops.link") +struct sched_ext_ops wait_kick_cycle_ops = { + .enqueue = wait_kick_cycle_enqueue, + .exit = wait_kick_cycle_exit, + .name = "wait_kick_cycle", + .timeout_ms = 1000U, +}; diff --git a/tools/testing/selftests/sched_ext/wait_kick_cycle.c b/tools/testing/selftests/sched_ext/wait_kick_cycle.c new file mode 100644 index 000000000000..3889e7a9a0a7 --- /dev/null +++ b/tools/testing/selftests/sched_ext/wait_kick_cycle.c @@ -0,0 +1,223 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2026 Christian Loehle 
<christian.loehle@arm.com> + */ +#define _GNU_SOURCE + +#ifndef ARRAY_SIZE +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) +#endif + +#include <bpf/bpf.h> +#include <errno.h> +#include <pthread.h> +#include <sched.h> +#include <scx/common.h> +#include <stdint.h> +#include <string.h> +#include <time.h> +#include <unistd.h> + +#include "scx_test.h" +#include "wait_kick_cycle.bpf.skel.h" + +/* + * Multiple workers per test CPU. Packing several runnable threads onto each + * CPU causes frequent context switching and back-to-back enqueue() calls, which + * maximizes the chance that all three test CPUs fire enqueue() concurrently + * and enter the SCX_KICK_WAIT cycle simultaneously. + */ +#define WORKERS_PER_CPU 4 +#define NR_TEST_CPUS 3 +#define NR_WORKERS (NR_TEST_CPUS * WORKERS_PER_CPU) + +struct worker_ctx { + pthread_t tid; + int cpu; + volatile bool stop; + volatile __u64 iters; + bool started; +}; + +static int pick_test_cpus(int *cpu_a, int *cpu_b, int *cpu_c) +{ + cpu_set_t mask; + int cpus[4]; + int nr = 0; + int cpu; + + if (sched_getaffinity(0, sizeof(mask), &mask)) + return -errno; + + for (cpu = 0; cpu < CPU_SETSIZE && nr < ARRAY_SIZE(cpus); cpu++) { + if (!CPU_ISSET(cpu, &mask)) + continue; + cpus[nr++] = cpu; + } + + if (nr < 3) + return -EOPNOTSUPP; + + /* Leave one CPU unused when possible so one CPU remains uncongested. */ + if (nr >= 4) { + *cpu_a = cpus[1]; + *cpu_b = cpus[2]; + *cpu_c = cpus[3]; + } else { + *cpu_a = cpus[0]; + *cpu_b = cpus[1]; + *cpu_c = cpus[2]; + } + return 0; +} + +static void *worker_fn(void *arg) +{ + struct worker_ctx *worker = arg; + cpu_set_t mask; + + CPU_ZERO(&mask); + CPU_SET(worker->cpu, &mask); + + if (sched_setaffinity(0, sizeof(mask), &mask)) + return (void *)(uintptr_t)errno; + + /* + * Tight yield loop — no sleep. Keeping the CPU continuously busy + * with rapid context switches ensures enqueue() fires at the highest + * possible rate on each test CPU. 
+ */ + while (!worker->stop) { + sched_yield(); + worker->iters++; + } + + return NULL; +} + +static int join_worker(struct worker_ctx *worker) +{ + void *ret; + struct timespec ts; + int err; + + if (!worker->started) + return 0; + + if (clock_gettime(CLOCK_REALTIME, &ts)) + return -errno; + + ts.tv_sec += 2; + err = pthread_timedjoin_np(worker->tid, &ret, &ts); + if (err == ETIMEDOUT) + pthread_detach(worker->tid); + if (err) + return -err; + + if ((uintptr_t)ret) + return -(int)(uintptr_t)ret; + + return 0; +} + +static enum scx_test_status setup(void **ctx) +{ + struct wait_kick_cycle *skel; + + skel = wait_kick_cycle__open(); + SCX_FAIL_IF(!skel, "Failed to open skel"); + SCX_ENUM_INIT(skel); + + *ctx = skel; + return SCX_TEST_PASS; +} + +static enum scx_test_status run(void *ctx) +{ + struct wait_kick_cycle *skel = ctx; + struct worker_ctx workers[NR_WORKERS] = {}; + struct bpf_link *link = NULL; + enum scx_test_status status = SCX_TEST_PASS; + int test_cpus[NR_TEST_CPUS] = { -1, -1, -1 }; + int ret; + int i; + + ret = pick_test_cpus(&test_cpus[0], &test_cpus[1], &test_cpus[2]); + if (ret == -EOPNOTSUPP) + return SCX_TEST_SKIP; + if (ret) { + SCX_ERR("Failed to pick test cpus (%d)", ret); + return SCX_TEST_FAIL; + } + + skel->rodata->test_cpu_a = test_cpus[0]; + skel->rodata->test_cpu_b = test_cpus[1]; + skel->rodata->test_cpu_c = test_cpus[2]; + + if (wait_kick_cycle__load(skel)) { + SCX_ERR("Failed to load skel"); + return SCX_TEST_FAIL; + } + + link = bpf_map__attach_struct_ops(skel->maps.wait_kick_cycle_ops); + if (!link) { + SCX_ERR("Failed to attach scheduler"); + return SCX_TEST_FAIL; + } + + /* WORKERS_PER_CPU threads per test CPU, all in tight yield loops. 
*/ + for (i = 0; i < NR_WORKERS; i++) + workers[i].cpu = test_cpus[i / WORKERS_PER_CPU]; + + for (i = 0; i < NR_WORKERS; i++) { + ret = pthread_create(&workers[i].tid, NULL, worker_fn, &workers[i]); + if (ret) { + SCX_ERR("Failed to create worker thread %d (%d)", i, ret); + status = SCX_TEST_FAIL; + goto out; + } + workers[i].started = true; + } + + sleep(3); + + if (skel->data->uei.kind != EXIT_KIND(SCX_EXIT_NONE)) { + SCX_ERR("Scheduler exited unexpectedly (kind=%llu code=%lld)", + (unsigned long long)skel->data->uei.kind, + (long long)skel->data->uei.exit_code); + status = SCX_TEST_FAIL; + } + +out: + for (i = 0; i < NR_WORKERS; i++) + workers[i].stop = true; + + for (i = 0; i < NR_WORKERS; i++) { + ret = join_worker(&workers[i]); + if (ret && status == SCX_TEST_PASS) { + SCX_ERR("Failed to join worker thread %d (%d)", i, ret); + status = SCX_TEST_FAIL; + } + } + + if (link) + bpf_link__destroy(link); + + return status; +} + +static void cleanup(void *ctx) +{ + struct wait_kick_cycle *skel = ctx; + + wait_kick_cycle__destroy(skel); +} + +struct scx_test wait_kick_cycle = { + .name = "wait_kick_cycle", + .description = "Verify SCX_KICK_WAIT forward progress under a 3-CPU wait cycle", + .setup = setup, + .run = run, + .cleanup = cleanup, +}; +REGISTER_SCX_TEST(&wait_kick_cycle) -- 2.34.1 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization 2026-03-16 10:02 [PATCH 0/2] sched_ext: Fix SCX_KICK_WAIT cycle deadlock Christian Loehle 2026-03-16 10:02 ` [PATCH 1/2] sched_ext: Prevent SCX_KICK_WAIT deadlock by serialization Christian Loehle 2026-03-16 10:02 ` [PATCH 2/2] sched_ext/selftests: Add SCX_KICK_WAIT cycle tests Christian Loehle @ 2026-03-29 0:20 ` Tejun Heo 2 siblings, 0 replies; 11+ messages in thread From: Tejun Heo @ 2026-03-29 0:20 UTC (permalink / raw) To: Christian Loehle Cc: sched-ext, linux-kernel, linux-kselftest, void, arighi, changwoo, mingo, peterz, shuah, dietmar.eggemann, emil Hello, I posted an alternative fix here: https://lore.kernel.org/r/20260329001856.835643-1-tj@kernel.org Instead of serializing the kicks, it defers the wait to a balance callback which can drop the rq lock and enable IRQs, avoiding the deadlock while preserving the concurrent kick_wait semantics. Thanks. -- tejun