public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET v3 sched_ext/for-6.19] sched_ext: Improve bypass mode scalability
@ 2025-11-11 19:18 Tejun Heo
  2025-11-11 19:18 ` [PATCH 01/13] sched_ext: Use shorter slice in bypass mode Tejun Heo
                   ` (13 more replies)
  0 siblings, 14 replies; 25+ messages in thread
From: Tejun Heo @ 2025-11-11 19:18 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: Dan Schatzberg, Emil Tsalapatis, sched-ext, linux-kernel

v3: - Removed first patch.
    - Added READ_ONCE/WRITE_ONCE for scx_slice_dfl access (#1, Dan).
    - Added missing dummy scx_hardlockup() definition for !CONFIG_SCHED_CLASS_EXT
      (#9, kernel test bot).

v2: http://lkml.kernel.org/r/20251110205636.405592-1-tj@kernel.org

Hello,

This patchset improves bypass mode scalability on large systems with many
runnable tasks.

Problem 1: Per-node DSQ contention with affinitized tasks

When bypass mode is triggered, tasks are routed through fallback dispatch
queues. Originally, bypass used a single global DSQ, but this didn't scale on
NUMA machines and could lead to livelocks. It was changed to use per-node
global DSQs with a breather mechanism that injects delays during bypass mode
switching to reduce lock contention. This resolved the cross-node issues and
has worked well for most cases.

However, Dan Schatzberg found that per-node global DSQs can still livelock in
a different scenario: On systems with many CPUs and many threads pinned to
different small subsets of CPUs, each CPU often has to scan through many
tasks it cannot run to find the one task it can run. With high CPU counts,
this scanning overhead causes severe DSQ lock contention that can live-lock
the system, preventing bypass mode activation from completing at all.

The patchset addresses this by switching to per-CPU bypass DSQs to eliminate
the shared DSQ contention. However, per-CPU DSQs alone aren't enough - CPUs
can still get stuck in long iteration loops during dispatch and move
operations. The existing breather mechanism helps with lock contention but
doesn't help when CPUs are trapped in these loops. The patchset replaces the
breather with immediate exits from dispatch and move operations when
aborting. Since these operations only run during scheduler abort, there's no
need to maintain normal operation semantics, making immediate exit both
simpler and more effective.

As an additional safety net, the patchset hooks up the hardlockup detector.
The contention can be so severe that hardlockup can be the first sign of
trouble. For example, running scx_simple (which uses a single global DSQ)
with many affinitized tasks causes all CPUs to contend on the DSQ lock while
doing long scans, triggering hardlockup before other warnings appear.

Problem 2: Task concentration with per-CPU DSQs

The switch to per-CPU DSQs introduces a new failure mode. If the BPF
scheduler severely skews task placement before triggering bypass in a highly
over-saturated system, most tasks can end up concentrated on a few CPUs.
Those CPUs then accumulate queues that are too long to drain in a reasonable
time, leading to RCU stalls and hung tasks.

This is addressed by implementing a simple timer-based load balancer that
redistributes tasks across CPUs within each NUMA node.

The patchset also uses shorter time slices in bypass mode for faster forward
progress.

The patchset has been tested on a 192 CPU dual socket AMD EPYC machine with
~20k runnable tasks:

- For problem 1 (contention): 20k runnable threads in 20 cgroups affinitized
  to different CPU subsets running scx_simple. This creates the worst-case
  contention scenario where every CPU must scan through many incompatible
  tasks. The system can now reliably survive and kick out the scheduler.

- For problem 2 (concentration): scx_cpu0 (included in this series) queues
  all tasks to CPU0, creating worst-case task concentration. Without these
  changes, disabling the scheduler leads to RCU stalls and hung tasks. With
  these changes, disable completes in about a second.

This patchset contains the following 13 patches:

 0001-sched_ext-Use-shorter-slice-in-bypass-mode.patch
 0002-sched_ext-Refactor-do_enqueue_task-local-and-global-.patch
 0003-sched_ext-Use-per-CPU-DSQs-instead-of-per-node-globa.patch
 0004-sched_ext-Simplify-breather-mechanism-with-scx_abort.patch
 0005-sched_ext-Exit-dispatch-and-move-operations-immediat.patch
 0006-sched_ext-Make-scx_exit-and-scx_vexit-return-bool.patch
 0007-sched_ext-Refactor-lockup-handlers-into-handle_locku.patch
 0008-sched_ext-Make-handle_lockup-propagate-scx_verror-re.patch
 0009-sched_ext-Hook-up-hardlockup-detector.patch
 0010-sched_ext-Add-scx_cpu0-example-scheduler.patch
 0011-sched_ext-Factor-out-scx_dsq_list_node-cursor-initia.patch
 0012-sched_ext-Factor-out-abbreviated-dispatch-dequeue-in.patch
 0013-sched_ext-Implement-load-balancer-for-bypass-mode.patch

Based on sched_ext/for-6.19 (5a629ecbcdff).

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-bypass-scalability-v3

 include/linux/sched/ext.h        |  21 ++
 include/trace/events/sched_ext.h |  39 +++
 kernel/sched/ext.c               | 518 +++++++++++++++++++++++++++++----------
 kernel/sched/ext_internal.h      |   6 +
 kernel/sched/sched.h             |   1 +
 kernel/watchdog.c                |   9 +
 tools/sched_ext/Makefile         |   2 +-
 tools/sched_ext/scx_cpu0.bpf.c   |  88 +++++++
 tools/sched_ext/scx_cpu0.c       | 106 ++++++++
 9 files changed, 663 insertions(+), 127 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-11-14 21:19 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-11 19:18 [PATCHSET v3 sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo
2025-11-11 19:18 ` [PATCH 01/13] sched_ext: Use shorter slice in bypass mode Tejun Heo
2025-11-11 19:18 ` [PATCH 02/13] sched_ext: Refactor do_enqueue_task() local and global DSQ paths Tejun Heo
2025-11-11 19:18 ` [PATCH 03/13] sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass mode Tejun Heo
2025-11-11 19:18 ` [PATCH 04/13] sched_ext: Simplify breather mechanism with scx_aborting flag Tejun Heo
2025-11-11 19:18 ` [PATCH 05/13] sched_ext: Exit dispatch and move operations immediately when aborting Tejun Heo
2025-11-11 19:18 ` [PATCH 06/13] sched_ext: Make scx_exit() and scx_vexit() return bool Tejun Heo
2025-11-11 19:18 ` [PATCH 07/13] sched_ext: Refactor lockup handlers into handle_lockup() Tejun Heo
2025-11-11 19:18 ` [PATCH 08/13] sched_ext: Make handle_lockup() propagate scx_verror() result Tejun Heo
2025-11-11 19:18 ` [PATCH 09/13] sched_ext: Hook up hardlockup detector Tejun Heo
2025-11-11 19:19   ` Tejun Heo
2025-11-13 22:33   ` Doug Anderson
2025-11-14  1:25     ` Tejun Heo
2025-11-14  1:33     ` [PATCH sched_ext/for-6.19] sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docs Tejun Heo
2025-11-14  2:00       ` Emil Tsalapatis
2025-11-14  7:32       ` Andrea Righi
2025-11-14 19:24       ` Doug Anderson
2025-11-14 21:15       ` Tejun Heo
2025-11-14 21:19       ` Tejun Heo
2025-11-11 19:18 ` [PATCH 10/13] sched_ext: Add scx_cpu0 example scheduler Tejun Heo
2025-11-11 19:18 ` [PATCH 11/13] sched_ext: Factor out scx_dsq_list_node cursor initialization into INIT_DSQ_LIST_CURSOR Tejun Heo
2025-11-11 19:18 ` [PATCH 12/13] sched_ext: Factor out abbreviated dispatch dequeue into dispatch_dequeue_locked() Tejun Heo
2025-11-11 19:18 ` [PATCH 13/13] sched_ext: Implement load balancer for bypass mode Tejun Heo
2025-11-11 19:30   ` Emil Tsalapatis
2025-11-12 16:49 ` [PATCHSET v3 sched_ext/for-6.19] sched_ext: Improve bypass mode scalability Tejun Heo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox