From: Andrea Righi <andrea.righi@canonical.com>
To: Tejun Heo <tj@kernel.org>
Cc: torvalds@linux-foundation.org, mingo@redhat.com,
peterz@infradead.org, juri.lelli@redhat.com,
vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
bristot@redhat.com, vschneid@redhat.com, ast@kernel.org,
daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org,
joshdon@google.com, brho@google.com, pjt@google.com,
derkling@google.com, haoluo@google.com, dvernet@meta.com,
dschatzberg@meta.com, dskarlat@cs.cmu.edu, riel@surriel.com,
changwoo@igalia.com, himadrics@inria.fr, memxor@gmail.com,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
kernel-team@meta.com
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class
Date: Mon, 13 Nov 2023 15:04:24 -0500
Message-ID: <ZVKBSIPqJnAvrE3g@gpd>
In-Reply-To: <20231111024835.2164816-13-tj@kernel.org>
On Fri, Nov 10, 2023 at 04:47:38PM -1000, Tejun Heo wrote:
> Implement a new scheduler class sched_ext (SCX), which allows scheduling
> policies to be implemented as BPF programs to achieve the following:
>
> 1. Ease of experimentation and exploration: Enabling rapid iteration of new
> scheduling policies.
>
> 2. Customization: Building application-specific schedulers which implement
> policies that are not applicable to general-purpose schedulers.
>
> 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
> policies in production environments.
>
> sched_ext leverages BPF’s struct_ops feature to define a structure which
> exports function callbacks and flags to BPF programs that wish to implement
> scheduling policies. The struct_ops structure exported by sched_ext is
> struct sched_ext_ops, and is conceptually similar to struct sched_class. The
> role of sched_ext is to map the complex sched_class callbacks to the more
> simple and ergonomic struct sched_ext_ops callbacks.
>
> For more detailed discussion on the motivations and overview, please refer
> to the cover letter.
>
> Later patches will also add several example schedulers and documentation.
>
> This patch implements the minimum core framework to enable implementation of
> BPF schedulers. Subsequent patches will gradually add functionalities
> including safety guarantee mechanisms, nohz and cgroup support.
>
> include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
> top, each operation should be self-explanatory. The following are worth
> noting:
>
> * Both "sched_ext" and its shorthand "scx" are used. If the identifier
> already has "sched" in it, "ext" is used; otherwise, "scx".
>
> * In sched_ext_ops, only .name is mandatory. Every operation is optional and
> if omitted a simple but functional default behavior is provided.
>
> * A new policy constant SCHED_EXT is added and a task can select sched_ext
> by invoking sched_setscheduler(2) with the new policy constant. However,
> if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
> and the task is scheduled by CFS. When the BPF scheduler is loaded, all
> tasks which have the SCHED_EXT policy are switched to sched_ext.
>
> * To bridge the workflow imbalance between the scheduler core and
> sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
> queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
> one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
> convenience and need not be used by a scheduler that doesn't require it.
> SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
> the next task on the CPU. The BPF scheduler can manage an arbitrary number
> of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
>
> * sched_ext guarantees system integrity no matter what the BPF scheduler
> does. To enable this, each task's ownership is tracked through
> p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
> can always recover and revert all tasks back to CFS. See p->scx.ops_state
> and scx_tasks.
>
> * A task is not tied to its rq while enqueued. This decouples CPU selection
> from queueing and allows sharing a scheduling queue across an arbitrary
> subset of CPUs. This adds some complexities as a task may need to be
> bounced between rq's right before it starts executing. See
> dispatch_to_local_dsq() and move_task_to_local_dsq().
>
> * One complication that arises from the above weak association between task
> and rq is that synchronizing with dequeue() gets complicated as dequeue()
> may happen anytime while the task is enqueued and the dispatch path might
> need to release the rq lock to transfer the task. Solving this requires a
> bit of complexity. See the logic around p->scx.sticky_cpu and
> p->scx.ops_qseq.
>
> * Both enable and disable paths are a bit complicated. The enable path
> switches all tasks without blocking to avoid issues which can arise from
> partially switched states (e.g. the switching task itself being starved).
> The disable path can't trust the BPF scheduler at all, so it also has to
> guarantee forward progress without blocking. See scx_ops_enable() and
> scx_ops_disable_workfn().
>
> * When sched_ext is disabled, static_branches are used to shut down the
> entry points from hot paths.
>
> v5: * To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
> instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
> load_acquire/store_release is now unsigned long instead of u64.
>
> * Fix the bug where bpf_scx_btf_struct_access() was allowing write
> access to arbitrary fields.
>
> * Distinguish kfuncs which can be called from any sched_ext ops and from
> anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
> sched_ext ops.
>
> * Rename "type" to "kind" in scx_exit_info to make it easier to use on
> languages in which "type" is a reserved keyword.
>
> * Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
> setup"), PF_IDLE is not set on idle tasks which haven't been online
> yet which made scx_task_iter_next_filtered() include those idle tasks
> in iterations leading to oopses. Update scx_task_iter_next_filtered()
> to directly test p->sched_class against idle_sched_class instead of
> using is_idle_task() which tests PF_IDLE.
>
> * Other updates to match upstream changes such as adding const to
> set_cpumask() param and renaming check_preempt_curr() to
> wakeup_preempt().
>
> v4: * SCHED_CHANGE_BLOCK replaced with the previous
> sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
> because upstream is adopting a different generic cleanup mechanism.
> Once that lands, the code will be adapted accordingly.
>
> * task_on_scx() used to test whether a task should be switched into SCX,
> which is confusing. Renamed to task_should_scx(). task_on_scx() now
> tests whether a task is currently on SCX.
>
> * scx_has_idle_cpus is barely used anymore and replaced with direct
> check on the idle cpumask.
>
> * SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
> fully idle cores.
>
> * ops.enable() now sees up-to-date p->scx.weight value.
>
> * ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
> schedulers expecting ->select_cpu() call.
>
> * Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
> of the scheduler.
>
> v3: * ops.set_weight() added to allow BPF schedulers to track weight changes
> without polling p->scx.weight.
>
> * move_task_to_local_dsq() was losing SCX-specific enq_flags when
> enqueueing the task on the target dsq because it goes through
> activate_task() which loses the upper 32bit of the flags. Carry the
> flags through rq->scx.extra_enq_flags.
>
> * scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
> and scx_bpf_task_cpu() now use the new KF_RCU instead of
> KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
>
> * The kfunc helper access control mechanism implemented through
> sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
> used when invoking scx_ops operations.
>
> v2: * balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
> called from put_prev_task_scx() and pick_next_task_scx() as necessary.
> To determine whether balance_scx() should be called from
> put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
> comment in put_prev_task_scx() for details.
>
> * sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
> with SCHED_CHANGE_BLOCK().
>
> * Unused all_dsqs list removed. This was a left-over from previous
> iterations.
>
> * p->scx.kf_mask is added to track and enforce which kfunc helpers are
> allowed. Also, init/exit sequences are updated to make some kfuncs
> always safe to call regardless of the current BPF scheduler state.
> Combined, this should make all the kfuncs safe.
>
> * BPF now supports sleepable struct_ops operations. Hacky workaround
> removed and operations and kfunc helpers are tagged appropriately.
>
> * BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
> and friends are added so that BPF schedulers can use the idle masks
> with the generic helpers. This replaces the hacky kfunc helpers added
> by a separate patch in V1.
>
> * CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
> enabled. This restriction will be removed by a later patch which adds
> core-sched support.
>
> * Add MAINTAINERS entries and other misc changes.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Co-authored-by: David Vernet <dvernet@meta.com>
> Acked-by: Josh Don <joshdon@google.com>
> Acked-by: Hao Luo <haoluo@google.com>
> Acked-by: Barret Rhoden <brho@google.com>
> Cc: Andrea Righi <andrea.righi@canonical.com>
...
> +#ifdef CONFIG_SCHED_DEBUG
> +static const char *scx_ops_enable_state_str[] = {
> + [SCX_OPS_PREPPING] = "prepping",
> + [SCX_OPS_ENABLING] = "enabling",
> + [SCX_OPS_ENABLED] = "enabled",
> + [SCX_OPS_DISABLING] = "disabling",
> + [SCX_OPS_DISABLED] = "disabled",
> +};
We may want to move scx_ops_enable_state_str[] outside of
CONFIG_SCHED_DEBUG, because it is used later in print_scx_info()
("sched_ext: Print sched_ext info when dumping stack"). Alternatively,
we could make print_scx_info() depend on CONFIG_SCHED_DEBUG.
-Andrea
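
A rough sketch of the first option, for illustration only. It assumes
the string table has no other dependency on CONFIG_SCHED_DEBUG. The enum
below is a stand-in for the kernel's definition, ARRAY_SIZE() is defined
locally so the snippet builds outside the kernel tree, and
scx_state_name() is a hypothetical helper that is not part of the patch:

```c
#include <stddef.h>

/* Stand-in for the kernel enum; the real values live in the patch. */
enum scx_ops_enable_state {
	SCX_OPS_PREPPING,
	SCX_OPS_ENABLING,
	SCX_OPS_ENABLED,
	SCX_OPS_DISABLING,
	SCX_OPS_DISABLED,
};

/* Local definition so the sketch compiles in userspace. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/*
 * Defined unconditionally, so callers that are built without
 * CONFIG_SCHED_DEBUG can still reference the table.
 */
static const char *scx_ops_enable_state_str[] = {
	[SCX_OPS_PREPPING]	= "prepping",
	[SCX_OPS_ENABLING]	= "enabling",
	[SCX_OPS_ENABLED]	= "enabled",
	[SCX_OPS_DISABLING]	= "disabling",
	[SCX_OPS_DISABLED]	= "disabled",
};

/*
 * Hypothetical helper (assumption, not in the patch): bounds-checked
 * lookup that falls back to "unknown" for out-of-range values.
 */
static const char *scx_state_name(unsigned int state)
{
	if (state >= ARRAY_SIZE(scx_ops_enable_state_str))
		return "unknown";
	return scx_ops_enable_state_str[state];
}

#ifdef CONFIG_SCHED_DEBUG
/* Debug-only users of the table would remain under the guard. */
#endif
```

With this layout, only the debug-only users stay under the #ifdef, while
the table itself is visible to every caller in the file.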