From: Andrea Righi <andrea.righi@linux.dev>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
Changwoo Min <changwoo@igalia.com>,
Dan Schatzberg <schatzberg.dan@gmail.com>,
Emil Tsalapatis <etsal@meta.com>,
sched-ext@lists.linux.dev, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 11/13] sched_ext: Add scx_cpu0 example scheduler
Date: Mon, 10 Nov 2025 09:36:46 +0100 [thread overview]
Message-ID: <aRGkHhAWTWdWELAY@gpd4> (raw)
In-Reply-To: <20251109183112.2412147-12-tj@kernel.org>

Hi Tejun,

On Sun, Nov 09, 2025 at 08:31:10AM -1000, Tejun Heo wrote:
> Add scx_cpu0, a simple scheduler that queues all tasks to a single DSQ and
> only dispatches them from CPU0 in FIFO order. This is useful for testing bypass
> behavior when many tasks are concentrated on a single CPU. If the load balancer
> doesn't work, bypass mode can trigger task hangs or RCU stalls as the queue is
> long and there's only one CPU working on it.
>
> Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
> Cc: Emil Tsalapatis <etsal@meta.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> tools/sched_ext/Makefile | 2 +-
> tools/sched_ext/scx_cpu0.bpf.c | 84 ++++++++++++++++++++++++++
> tools/sched_ext/scx_cpu0.c | 106 +++++++++++++++++++++++++++++++++
> 3 files changed, 191 insertions(+), 1 deletion(-)
> create mode 100644 tools/sched_ext/scx_cpu0.bpf.c
> create mode 100644 tools/sched_ext/scx_cpu0.c
>
> diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
> index d68780e2e03d..069b0bc38e55 100644
> --- a/tools/sched_ext/Makefile
> +++ b/tools/sched_ext/Makefile
> @@ -187,7 +187,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP
>
> SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
>
> -c-sched-targets = scx_simple scx_qmap scx_central scx_flatcg
> +c-sched-targets = scx_simple scx_cpu0 scx_qmap scx_central scx_flatcg
>
> $(addprefix $(BINDIR)/,$(c-sched-targets)): \
> $(BINDIR)/%: \
> diff --git a/tools/sched_ext/scx_cpu0.bpf.c b/tools/sched_ext/scx_cpu0.bpf.c
> new file mode 100644
> index 000000000000..8626bd369f60
> --- /dev/null
> +++ b/tools/sched_ext/scx_cpu0.bpf.c
> @@ -0,0 +1,84 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * A CPU0 scheduler.
> + *
> + * This scheduler queues all tasks to a shared DSQ and only dispatches them on
> + * CPU0 in FIFO order. This is useful for testing bypass behavior when many
> + * tasks are concentrated on a single CPU. If the load balancer doesn't work,
> + * bypass mode can trigger task hangs or RCU stalls as the queue is long and
> + * there's only one CPU working on it.
> + *
> + * - Statistics tracking how many tasks are queued to local and CPU0 DSQs.
> + * - Termination notification for userspace.
> + *
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
> + */
> +#include <scx/common.bpf.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +const volatile u32 nr_cpus = 32; /* !0 for veristat, set during init */
> +
> +UEI_DEFINE(uei);
> +
> +/*
> + * We create a custom DSQ with ID 0 that we dispatch to and consume from on
> + * CPU0.
> + */
> +#define DSQ_CPU0 0
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> + __uint(key_size, sizeof(u32));
> + __uint(value_size, sizeof(u64));
> + __uint(max_entries, 2); /* [local, cpu0] */
> +} stats SEC(".maps");
> +
> +static void stat_inc(u32 idx)
> +{
> + u64 *cnt_p = bpf_map_lookup_elem(&stats, &idx);
> + if (cnt_p)
> + (*cnt_p)++;
> +}
> +
> +s32 BPF_STRUCT_OPS(cpu0_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
> +{
> + return 0;
> +}
> +
> +void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> + if (p->nr_cpus_allowed < nr_cpus) {

We could be even more aggressive with DSQ_CPU0 here and only fall back to
the local DSQ when bpf_cpumask_test_cpu(0, p->cpus_ptr) fails, but this is
fine as well.

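Something like this untested sketch, purely for illustration (same helpers
and stats indices as the patch):

```c
/* Hypothetical variant of cpu0_enqueue(): fall back to the local DSQ
 * only when CPU0 is not in the task's allowed mask, instead of on any
 * restricted affinity. Untested, for illustration only. */
void BPF_STRUCT_OPS(cpu0_enqueue, struct task_struct *p, u64 enq_flags)
{
	if (!bpf_cpumask_test_cpu(0, p->cpus_ptr)) {
		stat_inc(0);	/* count local queueing */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return;
	}

	stat_inc(1);	/* count cpu0 queueing */
	scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
}
```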
> + stat_inc(0); /* count local queueing */
> + scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

And this is why I was suggesting automatically falling back to the new
global default time slice internally. In this case, do we want to preserve
the old 20ms default or automatically switch to the new one?

Apart from these minor details, which we can address later:

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
> + return;
> + }
> +
> + stat_inc(1); /* count cpu0 queueing */
> + scx_bpf_dsq_insert(p, DSQ_CPU0, SCX_SLICE_DFL, enq_flags);
> +}
> +
> +void BPF_STRUCT_OPS(cpu0_dispatch, s32 cpu, struct task_struct *prev)
> +{
> + if (cpu == 0)
> + scx_bpf_dsq_move_to_local(DSQ_CPU0);
> +}
> +
> +s32 BPF_STRUCT_OPS_SLEEPABLE(cpu0_init)
> +{
> + return scx_bpf_create_dsq(DSQ_CPU0, -1);
> +}
> +
> +void BPF_STRUCT_OPS(cpu0_exit, struct scx_exit_info *ei)
> +{
> + UEI_RECORD(uei, ei);
> +}
> +
> +SCX_OPS_DEFINE(cpu0_ops,
> + .select_cpu = (void *)cpu0_select_cpu,
> + .enqueue = (void *)cpu0_enqueue,
> + .dispatch = (void *)cpu0_dispatch,
> + .init = (void *)cpu0_init,
> + .exit = (void *)cpu0_exit,
> + .name = "cpu0");
> diff --git a/tools/sched_ext/scx_cpu0.c b/tools/sched_ext/scx_cpu0.c
> new file mode 100644
> index 000000000000..1e4fa4ab8da9
> --- /dev/null
> +++ b/tools/sched_ext/scx_cpu0.c
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2025 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2025 Tejun Heo <tj@kernel.org>
> + */
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <assert.h>
> +#include <libgen.h>
> +#include <bpf/bpf.h>
> +#include <scx/common.h>
> +#include "scx_cpu0.bpf.skel.h"
> +
> +const char help_fmt[] =
> +"A cpu0 sched_ext scheduler.\n"
> +"\n"
> +"See the top-level comment in .bpf.c for more details.\n"
> +"\n"
> +"Usage: %s [-v]\n"
> +"\n"
> +" -v Print libbpf debug messages\n"
> +" -h Display this help and exit\n";
> +
> +static bool verbose;
> +static volatile int exit_req;
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> + if (level == LIBBPF_DEBUG && !verbose)
> + return 0;
> + return vfprintf(stderr, format, args);
> +}
> +
> +static void sigint_handler(int sig)
> +{
> + exit_req = 1;
> +}
> +
> +static void read_stats(struct scx_cpu0 *skel, __u64 *stats)
> +{
> + int nr_cpus = libbpf_num_possible_cpus();
> + assert(nr_cpus > 0);
> + __u64 cnts[2][nr_cpus];
> + __u32 idx;
> +
> + memset(stats, 0, sizeof(stats[0]) * 2);
> +
> + for (idx = 0; idx < 2; idx++) {
> + int ret, cpu;
> +
> + ret = bpf_map_lookup_elem(bpf_map__fd(skel->maps.stats),
> + &idx, cnts[idx]);
> + if (ret < 0)
> + continue;
> + for (cpu = 0; cpu < nr_cpus; cpu++)
> + stats[idx] += cnts[idx][cpu];
> + }
> +}
> +
> +int main(int argc, char **argv)
> +{
> + struct scx_cpu0 *skel;
> + struct bpf_link *link;
> + __u32 opt;
> + __u64 ecode;
> +
> + libbpf_set_print(libbpf_print_fn);
> + signal(SIGINT, sigint_handler);
> + signal(SIGTERM, sigint_handler);
> +restart:
> + skel = SCX_OPS_OPEN(cpu0_ops, scx_cpu0);
> +
> + skel->rodata->nr_cpus = libbpf_num_possible_cpus();
> +
> + while ((opt = getopt(argc, argv, "vh")) != -1) {
> + switch (opt) {
> + case 'v':
> + verbose = true;
> + break;
> + default:
> + fprintf(stderr, help_fmt, basename(argv[0]));
> + return opt != 'h';
> + }
> + }
> +
> + SCX_OPS_LOAD(skel, cpu0_ops, scx_cpu0, uei);
> + link = SCX_OPS_ATTACH(skel, cpu0_ops, scx_cpu0);
> +
> + while (!exit_req && !UEI_EXITED(skel, uei)) {
> + __u64 stats[2];
> +
> + read_stats(skel, stats);
> + printf("local=%llu cpu0=%llu\n", stats[0], stats[1]);
> + fflush(stdout);
> + sleep(1);
> + }
> +
> + bpf_link__destroy(link);
> + ecode = UEI_REPORT(skel, uei);
> + scx_cpu0__destroy(skel);
> +
> + if (UEI_ECODE_RESTART(ecode))
> + goto restart;
> + return 0;
> +}
> --
> 2.51.1
>