From: Tejun Heo <tj@kernel.org>
To: torvalds@linux-foundation.org, mingo@redhat.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org,
joshdon@google.com, brho@google.com, pjt@google.com,
derkling@google.com, haoluo@google.com, dvernet@meta.com,
dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com
Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
kernel-team@meta.com, Tejun Heo <tj@kernel.org>,
Bagas Sanjaya <bagasdotme@gmail.com>
Subject: [PATCH 28/30] sched_ext: Documentation: scheduler: Document extensible scheduler class
Date: Fri, 27 Jan 2023 14:16:37 -1000
Message-ID: <20230128001639.3510083-29-tj@kernel.org>
In-Reply-To: <20230128001639.3510083-1-tj@kernel.org>

Add Documentation/scheduler/sched-ext.rst which gives a high-level overview
and pointers to the examples.

v2: Apply minor edits suggested by Bagas. Caveats section dropped as all of
them are addressed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
---
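Reader note, not part of the patch: the task-selection order described under
"Dispatch Queues" in the new document (local DSQ first, then the global DSQ,
then ops.dispatch()) can be sketched as a small userspace model. This is only
an illustration under stated assumptions: every name below (toy_dsq,
toy_pick_next, toy_example_dispatch, ...) is hypothetical, tasks are plain
integers, and nothing here is the kernel implementation or the sched_ext API.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_DSQ_CAP 16
#define TOY_NO_TASK (-1)

/* A toy FIFO standing in for a DSQ; tasks are plain integer IDs. */
struct toy_dsq {
	int tasks[TOY_DSQ_CAP];
	int head;	/* index of the oldest queued task */
	int len;	/* number of queued tasks */
};

/* Append a task at the tail; returns 0 on success, -1 if the queue is full. */
int toy_dsq_push(struct toy_dsq *q, int task)
{
	if (q->len == TOY_DSQ_CAP)
		return -1;
	q->tasks[(q->head + q->len) % TOY_DSQ_CAP] = task;
	q->len++;
	return 0;
}

/* Remove and return the oldest task, or TOY_NO_TASK if the queue is empty. */
int toy_dsq_pop(struct toy_dsq *q)
{
	int task;

	if (q->len == 0)
		return TOY_NO_TASK;
	task = q->tasks[q->head];
	q->head = (q->head + 1) % TOY_DSQ_CAP;
	q->len--;
	return task;
}

/* Hypothetical stand-in for ops.dispatch(): it may refill the local queue. */
typedef void (*toy_dispatch_fn)(struct toy_dsq *local);

/* Example callback (an assumption for the demo): always supplies task 100. */
void toy_example_dispatch(struct toy_dsq *local)
{
	toy_dsq_push(local, 100);
}

/*
 * Pick the next task for one CPU, in the order the document describes:
 * 1. run the first task on the local queue if there is one;
 * 2. otherwise consume the global queue, which moves a task to the local
 *    queue before it runs;
 * 3. otherwise call the dispatch callback and look at the local queue again.
 */
int toy_pick_next(struct toy_dsq *local, struct toy_dsq *global,
		  toy_dispatch_fn dispatch)
{
	int task = toy_dsq_pop(local);

	if (task != TOY_NO_TASK)
		return task;

	task = toy_dsq_pop(global);
	if (task != TOY_NO_TASK) {
		int rc = toy_dsq_push(local, task);	/* local is empty here */

		assert(rc == 0);
		(void)rc;
		return toy_dsq_pop(local);
	}

	if (dispatch) {
		dispatch(local);
		return toy_dsq_pop(local);
	}
	return TOY_NO_TASK;	/* nothing to run: the CPU would go idle */
}
```

With task 1 queued locally and task 2 queued globally, repeated calls to
toy_pick_next() return 1, then 2, then whatever the callback supplies. The
retry and keep-running steps from the "Scheduling Cycle" section are left out
to keep the model short.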
Documentation/scheduler/index.rst | 1 +
Documentation/scheduler/sched-ext.rst | 224 ++++++++++++++++++++++++++
include/linux/sched/ext.h | 2 +
kernel/Kconfig.preempt | 2 +
kernel/sched/ext.c | 2 +
kernel/sched/ext.h | 2 +
6 files changed, 233 insertions(+)
create mode 100644 Documentation/scheduler/sched-ext.rst
diff --git a/Documentation/scheduler/index.rst b/Documentation/scheduler/index.rst
index b430d856056a..8a27a9967284 100644
--- a/Documentation/scheduler/index.rst
+++ b/Documentation/scheduler/index.rst
@@ -18,6 +18,7 @@ Linux Scheduler
sched-nice-design
sched-rt-group
sched-stats
+ sched-ext
sched-debug
text_files
diff --git a/Documentation/scheduler/sched-ext.rst b/Documentation/scheduler/sched-ext.rst
new file mode 100644
index 000000000000..8a3626c884e7
--- /dev/null
+++ b/Documentation/scheduler/sched-ext.rst
@@ -0,0 +1,224 @@
+==========================
+Extensible Scheduler Class
+==========================
+
+sched_ext is a scheduler class whose behavior can be defined by a set of BPF
+programs - the BPF scheduler.
+
+* sched_ext exports a full scheduling interface so that any scheduling
+ algorithm can be implemented on top.
+
+* The BPF scheduler can group CPUs however it sees fit and schedule them
+ together, as tasks aren't tied to specific CPUs at the time of wakeup.
+
+* The BPF scheduler can be turned on and off dynamically at any time.
+
+* System integrity is maintained no matter what the BPF scheduler does.
+ The default scheduling behavior is restored whenever an error is detected,
+ a runnable task stalls, or the SysRq key sequence :kbd:`SysRq-S` is
+ invoked.
+
+Switching to and from sched_ext
+===============================
+
+``CONFIG_SCHED_CLASS_EXT`` is the config option to enable sched_ext and
+``tools/sched_ext`` contains the example schedulers.
+
+sched_ext is used only when the BPF scheduler is loaded and running.
+
+If a task explicitly sets its scheduling policy to ``SCHED_EXT``, it will be
+treated as ``SCHED_NORMAL`` and scheduled by CFS until the BPF scheduler is
+loaded. On load, such tasks will be switched to and scheduled by sched_ext.
+
+The BPF scheduler can choose to schedule all normal and lower class tasks by
+calling ``scx_bpf_switch_all()`` from its ``init()`` operation. In this
+case, all ``SCHED_NORMAL``, ``SCHED_BATCH``, ``SCHED_IDLE`` and
+``SCHED_EXT`` tasks are scheduled by sched_ext. In the example schedulers,
+this mode can be selected with the ``-a`` option.
+
+Terminating the sched_ext scheduler program, triggering :kbd:`SysRq-S`, or
+detection of any internal error including stalled runnable tasks aborts the
+BPF scheduler and reverts all tasks back to CFS.
+
+.. code-block:: none
+
+ # make -j16 -C tools/sched_ext
+ # tools/sched_ext/scx_example_dummy -a
+ local=0 global=3
+ local=5 global=24
+ local=9 global=44
+ local=13 global=56
+ local=17 global=72
+ ^CEXIT: BPF scheduler unregistered
+
+If ``CONFIG_SCHED_DEBUG`` is set, the current status of the BPF scheduler
+and whether a given task is on sched_ext can be determined as follows:
+
+.. code-block:: none
+
+ # cat /sys/kernel/debug/sched/ext
+ ops : dummy
+ enabled : 1
+ switching_all : 1
+ switched_all : 1
+ enable_state : enabled
+
+ # grep ext /proc/self/sched
+ ext.enabled : 1
+
+The Basics
+==========
+
+Userspace can implement an arbitrary BPF scheduler by loading a set of BPF
+programs that implement ``struct sched_ext_ops``. The only mandatory field
+is ``ops.name`` which must be a valid BPF object name. All operations are
+optional. The following modified excerpt is from
+``tools/sched_ext/scx_example_dummy.bpf.c`` showing a minimal global FIFO
+scheduler.
+
+.. code-block:: c
+
+ s32 BPF_STRUCT_OPS(dummy_init)
+ {
+ if (switch_all)
+ scx_bpf_switch_all();
+ return 0;
+ }
+
+ void BPF_STRUCT_OPS(dummy_enqueue, struct task_struct *p, u64 enq_flags)
+ {
+ if (enq_flags & SCX_ENQ_LOCAL)
+ scx_bpf_dispatch(p, SCX_DSQ_LOCAL, enq_flags);
+ else
+ scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, enq_flags);
+ }
+
+ void BPF_STRUCT_OPS(dummy_exit, struct scx_exit_info *ei)
+ {
+ exit_type = ei->type;
+ }
+
+ SEC(".struct_ops")
+ struct sched_ext_ops dummy_ops = {
+ .enqueue = (void *)dummy_enqueue,
+ .init = (void *)dummy_init,
+ .exit = (void *)dummy_exit,
+ .name = "dummy",
+ };
+
+Dispatch Queues
+---------------
+
+To match the impedance between the scheduler core and the BPF scheduler,
+sched_ext uses simple FIFOs called DSQs (dispatch queues). By default, there
+is one global FIFO (``SCX_DSQ_GLOBAL``), and one local DSQ per CPU
+(``SCX_DSQ_LOCAL``). The BPF scheduler can manage an arbitrary number of
+DSQs using ``scx_bpf_create_dsq()`` and ``scx_bpf_destroy_dsq()``.
+
+A CPU always executes a task from its local DSQ. A task is "dispatched" to a
+DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
+local DSQ.
+
+When a CPU is looking for the next task to run, if the local DSQ is not
+empty, the first task is picked. Otherwise, the CPU tries to consume the
+global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
+is invoked.
+
+Scheduling Cycle
+----------------
+
+The following briefly shows how a waking task is scheduled and executed.
+
+1. When a task is waking up, ``ops.select_cpu()`` is the first operation
+ invoked. This serves two purposes: it provides a CPU selection
+ optimization hint, and it wakes up the selected CPU if idle.
+
+ The CPU selected by ``ops.select_cpu()`` is an optimization hint and not
+ binding. The actual decision is made at the last step of scheduling.
+ However, there is a small performance gain if the CPU
+ ``ops.select_cpu()`` returns matches the CPU the task eventually runs on.
+
+ A side-effect of selecting a CPU is waking it up from idle. While a BPF
+ scheduler can wake up any CPU using the ``scx_bpf_kick_cpu()`` helper,
+ using ``ops.select_cpu()`` judiciously can be simpler and more efficient.
+
+ Note that the scheduler core will ignore an invalid CPU selection, for
+ example, if it's outside the allowed cpumask of the task.
+
+2. Once the target CPU is selected, ``ops.enqueue()`` is invoked. It can
+ make one of the following decisions:
+
+ * Immediately dispatch the task to either the global or local DSQ by
+ calling ``scx_bpf_dispatch()`` with ``SCX_DSQ_GLOBAL`` or
+ ``SCX_DSQ_LOCAL``, respectively.
+
+ * Immediately dispatch the task to a custom DSQ by calling
+ ``scx_bpf_dispatch()`` with a DSQ ID which is smaller than 2^63.
+
+ * Queue the task on the BPF side.
+
+3. When a CPU is ready to schedule, it first looks at its local DSQ. If
+ empty, it then looks at the global DSQ. If there still isn't a task to
+ run, ``ops.dispatch()`` is invoked, which can use the following two
+ functions to populate the local DSQ.
+
+ * ``scx_bpf_dispatch()`` dispatches a task to a DSQ. Any target DSQ can
+ be used - ``SCX_DSQ_LOCAL``, ``SCX_DSQ_LOCAL_ON | cpu``,
+ ``SCX_DSQ_GLOBAL`` or a custom DSQ. While ``scx_bpf_dispatch()``
+ currently can't be called with BPF locks held, this is being worked on
+ and will be supported. ``scx_bpf_dispatch()`` schedules dispatches
+ rather than performing them immediately. There can be up to
+ ``ops.dispatch_max_batch`` pending tasks.
+
+ * ``scx_bpf_consume()`` transfers a task from the specified non-local DSQ
+ to the dispatching DSQ. This function cannot be called with any BPF
+ locks held. ``scx_bpf_consume()`` flushes the pending dispatched tasks
+ before trying to consume the specified DSQ.
+
+4. After ``ops.dispatch()`` returns, if there are tasks in the local DSQ,
+ the CPU runs the first one. If empty, the following steps are taken:
+
+ * Try to consume the global DSQ. If successful, run the task.
+
+ * If ``ops.dispatch()`` has dispatched any tasks, retry #3.
+
+ * If the previous task is an SCX task and still runnable, keep executing
+ it (see ``SCX_OPS_ENQ_LAST``).
+
+ * Go idle.
+
+Note that the BPF scheduler can always choose to dispatch tasks immediately
+in ``ops.enqueue()`` as illustrated in the above dummy example. If only the
+built-in DSQs are used, there is no need to implement ``ops.dispatch()`` as
+a task is never queued on the BPF scheduler and both the local and global
+DSQs are consumed automatically.
+
+Where to Look
+=============
+
+* ``include/linux/sched/ext.h`` defines the core data structures, ops table
+ and constants.
+
+* ``kernel/sched/ext.c`` contains sched_ext core implementation and helpers.
+ The functions prefixed with ``scx_bpf_`` can be called from the BPF
+ scheduler.
+
+* ``tools/sched_ext/`` hosts example BPF scheduler implementations.
+
+ * ``scx_example_dummy[.bpf].c``: Minimal global FIFO scheduler example
+ using a custom DSQ.
+
+ * ``scx_example_qmap[.bpf].c``: A multi-level FIFO scheduler supporting
+ five levels of priority implemented with ``BPF_MAP_TYPE_QUEUE``.
+
+ABI Instability
+===============
+
+The APIs provided by sched_ext to BPF scheduler programs have no stability
+guarantees. This includes the ops table callbacks and constants defined in
+``include/linux/sched/ext.h``, as well as the ``scx_bpf_`` kfuncs defined in
+``kernel/sched/ext.c``.
+
+While we will attempt to provide a relatively stable API surface when
+possible, these APIs are subject to change without warning between kernel
+versions.
diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index d3c2701bb4b4..6b230ecdcfa4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <tj@kernel.org>
* Copyright (c) 2022 David Vernet <dvernet@meta.com>
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index e12a057ead7b..bae49b743834 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -154,3 +154,5 @@ config SCHED_CLASS_EXT
wish to implement scheduling policies. The struct_ops structure
exported by sched_ext is struct sched_ext_ops, and is conceptually
similar to struct sched_class.
+
+ See Documentation/scheduler/sched-ext.rst for more details.
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8619eb2dcbd5..828082e6e780 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <tj@kernel.org>
* Copyright (c) 2022 David Vernet <dvernet@meta.com>
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index c3df39984fc9..4252296ba464 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -1,5 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
* Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
* Copyright (c) 2022 Tejun Heo <tj@kernel.org>
* Copyright (c) 2022 David Vernet <dvernet@meta.com>
--
2.39.1