BPF List
From: Andrea Righi <arighi@nvidia.com>
To: Qiliang Yuan <realwujing@gmail.com>
Cc: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
	Changwoo Min <changwoo@igalia.com>,
	linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
	bpf@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads
Date: Fri, 8 May 2026 09:56:54 +0200
Message-ID: <af2XRpaX_rFh_GI9@gpd4>
In-Reply-To: <20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com>

Hi Qiliang,

On Fri, May 08, 2026 at 03:51:35PM +0800, Qiliang Yuan wrote:
> Implement an AI-focused NUMA-aware scheduler that optimizes task dispatch for
> GPU-accelerated AI training. The scheduler maintains per-NUMA-node dispatch
> queues to preserve L3 cache warmth and minimize remote DRAM accesses that
> would stall GPU kernel launches waiting on CPU preprocessing.
> 
> Key features:
> - Per-NUMA-node DSQs (dispatch queues) to maintain cache locality
> - Idle fast path that bypasses DSQ for minimum latency
> - Per-task NUMA affinity tracking to remember task placement
> - Work stealing across nodes to prevent starvation during load imbalance
> 
> The BPF component (scx_ai_numa.bpf.c) implements the core scheduler
> callbacks, while the userspace loader (scx_ai_numa.c) detects NUMA
> topology, installs the BPF program, and reports per-node dispatch
> statistics every second.
> 
> This scheduler is suitable for AI training workloads where GPU command
> launches depend on rapid CPU preprocessing with minimal scheduling latency.
> 
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>

I think this would be more appropriate for inclusion in
https://github.com/sched-ext/scx.

Thanks,
-Andrea

> ---
>  tools/sched_ext/Makefile          |   2 +-
>  tools/sched_ext/scx_ai_numa.bpf.c | 200 ++++++++++++++++++++++++++++++++++++++
>  tools/sched_ext/scx_ai_numa.c     | 126 ++++++++++++++++++++++++
>  3 files changed, 327 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/sched_ext/Makefile b/tools/sched_ext/Makefile
> index 21554f0896923..a639b5bf4f542 100644
> --- a/tools/sched_ext/Makefile
> +++ b/tools/sched_ext/Makefile
> @@ -191,7 +191,7 @@ $(INCLUDE_DIR)/%.bpf.skel.h: $(SCXOBJ_DIR)/%.bpf.o $(INCLUDE_DIR)/vmlinux.h $(BP
>  
>  SCX_COMMON_DEPS := include/scx/common.h include/scx/user_exit_info.h | $(BINDIR)
>  
> -c-sched-targets = scx_simple scx_cpu0 scx_qmap scx_central scx_flatcg scx_userland scx_pair scx_sdt
> +c-sched-targets = scx_simple scx_cpu0 scx_qmap scx_central scx_flatcg scx_userland scx_pair scx_sdt scx_ai_numa
>  
>  $(addprefix $(BINDIR)/,$(c-sched-targets)): \
>  	$(BINDIR)/%: \
> diff --git a/tools/sched_ext/scx_ai_numa.bpf.c b/tools/sched_ext/scx_ai_numa.bpf.c
> new file mode 100644
> index 0000000000000..89d3b7dd3d474
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.bpf.c
> @@ -0,0 +1,200 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * scx_ai_numa - AI NUMA-aware scheduler (BPF side)
> + *
> + * Scheduling policy optimized for AI training workloads:
> + *
> + * 1. Per-NUMA-node DSQs: each NUMA node owns a dedicated dispatch queue.
> + *    Tasks are steered to the DSQ of the NUMA node they last ran on,
> + *    preserving L3 cache warmth and reducing remote DRAM accesses that
> + *    stall GPU kernel launches waiting on CPU preprocessing.
> + *
> + * 2. Idle fast path: when an idle CPU is found, bypass the per-node DSQ
> + *    and insert directly into SCX_DSQ_LOCAL for minimum latency.
> + *
> + * 3. Task NUMA affinity: per-task storage tracks the preferred NUMA node
> + *    (updated every time select_cpu() sees the task's prev_cpu).
> + *
> + * 4. Work stealing: if a node's DSQ is empty, try remote nodes in order
> + *    to prevent CPU starvation during load imbalance (e.g., bursty GPU
> + *    command submissions landing on a single NUMA node).
> + */
> +#include <scx/common.bpf.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +UEI_DEFINE(uei);
> +
> +#define MAX_NUMA_NODES 16
> +
> +/* One DSQ per NUMA node, IDs 0 .. MAX_NUMA_NODES-1 */
> +#define NUMA_DSQ(node) ((u64)(node))
> +
> +/* Per-task context: remember which NUMA node this task prefers */
> +struct task_ctx {
> +	u32 preferred_node;
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
> +	__uint(map_flags, BPF_F_NO_PREALLOC);
> +	__type(key, int);
> +	__type(value, struct task_ctx);
> +} task_ctx_stor SEC(".maps");
> +
> +/* Per-node counters (per-CPU to avoid false sharing) */
> +struct node_stat {
> +	__u64 local_dsq;	/* fast-path: direct SCX_DSQ_LOCAL insert */
> +	__u64 numa_dsq;		/* enqueued to per-node DSQ */
> +	__u64 steal;		/* dispatched from a remote node's DSQ */
> +};
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> +	__uint(key_size, sizeof(u32));
> +	__uint(value_size, sizeof(struct node_stat));
> +	__uint(max_entries, MAX_NUMA_NODES);
> +} node_stats SEC(".maps");
> +
> +/* Set by userspace after detecting the number of NUMA nodes */
> +const volatile u32 nr_nodes = 1;
> +
> +static __always_inline u32 cpu_to_node(s32 cpu)
> +{
> +	return __COMPAT_scx_bpf_cpu_node(cpu);
> +}
> +
> +static __always_inline void stat_inc_local(u32 node)
> +{
> +	struct node_stat *s = bpf_map_lookup_elem(&node_stats, &node);
> +
> +	if (s)
> +		s->local_dsq++;
> +}
> +
> +static __always_inline void stat_inc_numa(u32 node)
> +{
> +	struct node_stat *s = bpf_map_lookup_elem(&node_stats, &node);
> +
> +	if (s)
> +		s->numa_dsq++;
> +}
> +
> +static __always_inline void stat_inc_steal(u32 node)
> +{
> +	struct node_stat *s = bpf_map_lookup_elem(&node_stats, &node);
> +
> +	if (s)
> +		s->steal++;
> +}
> +
> +s32 BPF_STRUCT_OPS(ai_numa_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
> +{
> +	struct task_ctx *tctx;
> +	bool is_idle = false;
> +	u32 node;
> +	s32 cpu;
> +
> +	/* Update task's preferred NUMA node from prev_cpu */
> +	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0,
> +				     BPF_LOCAL_STORAGE_GET_F_CREATE);
> +	if (tctx) {
> +		node = cpu_to_node(prev_cpu);
> +		tctx->preferred_node = node < nr_nodes ? node : 0;
> +	}
> +
> +	/*
> +	 * Default selection tries prev_cpu first (same LLC), which preserves
> +	 * L1/L2/L3 cache across AI loop iterations without extra policy code.
> +	 */
> +	cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
> +	if (is_idle) {
> +		/* Idle CPU found: bypass DSQ for minimum latency */
> +		node = cpu_to_node(cpu);
> +		stat_inc_local(node);
> +		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> +	}
> +
> +	return cpu;
> +}
> +
> +void BPF_STRUCT_OPS(ai_numa_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_ctx *tctx;
> +	u32 node = 0;
> +
> +	/*
> +	 * Route to the task's preferred NUMA node DSQ.
> +	 * Keeping AI tasks on the same NUMA node as their GPU's host memory
> +	 * reduces cross-node DRAM traffic and PCIe DMA stalls.
> +	 */
> +	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
> +	if (tctx) {
> +		node = tctx->preferred_node;
> +		if (node >= nr_nodes)
> +			node = 0;
> +	}
> +
> +	stat_inc_numa(node);
> +	scx_bpf_dsq_insert(p, NUMA_DSQ(node), SCX_SLICE_DFL, enq_flags);
> +}
> +
> +void BPF_STRUCT_OPS(ai_numa_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	u32 my_node = cpu_to_node(cpu);
> +	u32 i;
> +
> +	/* First: consume from our own NUMA node — zero cross-node traffic */
> +	if (scx_bpf_dsq_move_to_local(NUMA_DSQ(my_node), 0))
> +		return;
> +
> +	/*
> +	 * Work steal from other nodes in order.
> +	 * Prevents CPU starvation when one GPU's launch bursts all tasks
> +	 * onto a single NUMA node while other nodes sit idle.
> +	 */
> +	for (i = 0; i < MAX_NUMA_NODES; i++) {
> +		u32 node = i;
> +
> +		if (node >= nr_nodes)
> +			break;
> +		if (node == my_node)
> +			continue;
> +		if (scx_bpf_dsq_move_to_local(NUMA_DSQ(node), 0)) {
> +			stat_inc_steal(my_node);
> +			return;
> +		}
> +	}
> +}
> +
> +s32 BPF_STRUCT_OPS_SLEEPABLE(ai_numa_init)
> +{
> +	u32 i;
> +	int ret;
> +
> +	for (i = 0; i < MAX_NUMA_NODES; i++) {
> +		if (i >= nr_nodes)
> +			break;
> +		ret = scx_bpf_create_dsq(NUMA_DSQ(i), -1);
> +		if (ret) {
> +			scx_bpf_error("failed to create DSQ for node %u: %d",
> +				      i, ret);
> +			return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +void BPF_STRUCT_OPS(ai_numa_exit, struct scx_exit_info *ei)
> +{
> +	UEI_RECORD(uei, ei);
> +}
> +
> +SCX_OPS_DEFINE(ai_numa_ops,
> +	       .select_cpu	= (void *)ai_numa_select_cpu,
> +	       .enqueue		= (void *)ai_numa_enqueue,
> +	       .dispatch	= (void *)ai_numa_dispatch,
> +	       .init		= (void *)ai_numa_init,
> +	       .exit		= (void *)ai_numa_exit,
> +	       .name		= "ai_numa");
> diff --git a/tools/sched_ext/scx_ai_numa.c b/tools/sched_ext/scx_ai_numa.c
> new file mode 100644
> index 0000000000000..58c7bb1bd6bb6
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.c
> @@ -0,0 +1,126 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * scx_ai_numa - AI NUMA-aware scheduler (userspace loader)
> + *
> + * Detects NUMA topology, configures the BPF scheduler, and prints
> + * per-node dispatch statistics every second.
> + */
> +#include <stdio.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <assert.h>
> +#include <libgen.h>
> +#include <sys/stat.h>
> +#include <bpf/bpf.h>
> +#include <scx/common.h>
> +#include "scx_ai_numa.bpf.skel.h"
> +
> +/* Must match BPF side */
> +struct node_stat {
> +	__u64 local_dsq;
> +	__u64 numa_dsq;
> +	__u64 steal;
> +};
> +
> +#define MAX_NUMA_NODES 16
> +
> +static volatile sig_atomic_t exit_req;
> +
> +static void sigint_handler(int sig)
> +{
> +	exit_req = 1;
> +}
> +
> +/* Detect NUMA node count by scanning sysfs */
> +static __u32 detect_nr_nodes(void)
> +{
> +	struct stat st;
> +	char path[64];
> +	__u32 i, count = 0;
> +
> +	for (i = 0; i < MAX_NUMA_NODES; i++) {
> +		snprintf(path, sizeof(path),
> +			 "/sys/devices/system/node/node%u", i);
> +		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
> +			count = i + 1;
> +		else
> +			break;
> +	}
> +	return count ? count : 1;
> +}
> +
> +static void print_stats(struct scx_ai_numa *skel, __u32 nr_nodes)
> +{
> +	int nr_cpus = libbpf_num_possible_cpus();
> +	int map_fd  = bpf_map__fd(skel->maps.node_stats);
> +
> +	printf("\n%-6s %14s %14s %14s\n",
> +	       "Node", "Local-DSQ", "NUMA-DSQ", "Steals");
> +	printf("------+--------------+--------------+--------------\n");
> +
> +	for (__u32 node = 0; node < nr_nodes; node++) {
> +		struct node_stat per_cpu[nr_cpus];
> +		struct node_stat total = {};
> +		__u32 key = node;
> +		int i;
> +
> +		if (bpf_map_lookup_elem(map_fd, &key, per_cpu) < 0)
> +			continue;
> +
> +		for (i = 0; i < nr_cpus; i++) {
> +			total.local_dsq += per_cpu[i].local_dsq;
> +			total.numa_dsq  += per_cpu[i].numa_dsq;
> +			total.steal     += per_cpu[i].steal;
> +		}
> +
> +		printf("%-6u %14llu %14llu %14llu\n", node,
> +		       total.local_dsq, total.numa_dsq, total.steal);
> +	}
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	struct scx_ai_numa *skel;
> +	struct bpf_link *link;
> +	__u64 ecode;
> +	__u32 nr_nodes;
> +
> +	signal(SIGINT, sigint_handler);
> +	signal(SIGTERM, sigint_handler);
> +
> +	nr_nodes = detect_nr_nodes();
> +	printf("scx_ai_numa: detected %u NUMA node(s)\n", nr_nodes);
> +
> +restart:
> +	/*
> +	 * Open the skeleton manually instead of using SCX_OPS_OPEN(), so
> +	 * the loader keeps building against kernel versions where the
> +	 * struct_ops fields that macro references are not available.
> +	 */
> +	skel = scx_ai_numa__open();
> +	SCX_BUG_ON(!skel, "Could not open scx_ai_numa");
> +	skel->struct_ops.ai_numa_ops->hotplug_seq = scx_hotplug_seq();
> +	SCX_ENUM_INIT(skel);
> +
> +	/* Pass NUMA topology to the BPF program via rodata */
> +	skel->rodata->nr_nodes = nr_nodes;
> +
> +	SCX_OPS_LOAD(skel, ai_numa_ops, scx_ai_numa, uei);
> +	link = SCX_OPS_ATTACH(skel, ai_numa_ops, scx_ai_numa);
> +
> +	printf("scx_ai_numa: running (Ctrl-C to stop)\n");
> +
> +	while (!exit_req && !UEI_EXITED(skel, uei)) {
> +		print_stats(skel, nr_nodes);
> +		fflush(stdout);
> +		sleep(1);
> +	}
> +
> +	bpf_link__destroy(link);
> +	ecode = UEI_REPORT(skel, uei);
> +	scx_ai_numa__destroy(skel);
> +
> +	if (UEI_ECODE_RESTART(ecode))
> +		goto restart;
> +	return 0;
> +}
> 
> ---
> base-commit: 8ab992f815d6736b5c7a6f5fd7bfe7bc106bb3dc
> change-id: 20260508-feat-scx_ai_example-8e1384942646
> 
> Best regards,
> -- 
> Qiliang Yuan <realwujing@gmail.com>
> 


Thread overview: 5+ messages
2026-05-08  7:51 [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads Qiliang Yuan
2026-05-08  7:56 ` Andrea Righi [this message]
2026-05-08  9:29   ` Christian Loehle
2026-05-08  9:37     ` Andrea Righi
2026-05-08 19:08 ` sashiko-bot
