From: sashiko-bot@kernel.org
To: "Qiliang Yuan" <realwujing@gmail.com>
Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads
Date: Fri, 08 May 2026 19:08:27 +0000
Message-ID: <20260508190828.8A6B2C2BCB0@smtp.kernel.org>
In-Reply-To: <20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com>

Thank you for your contribution! Sashiko AI review found 4 potential issues to consider:
- [High] Missing bounds check on `my_node` in `ai_numa_dispatch` can trigger `scx_error()` and abort the scheduler on sparse or large NUMA topologies.
- [High] Tasks pinned to a single CPU bypass `select_cpu` and always fall back to `NUMA_DSQ(0)`, wasting CPU cycles and breaking NUMA isolation.
- [High] Userspace NUMA node detection undercounts nodes on sparse topologies, leaving DSQs uncreated and ultimately aborting the scheduler.
- [Medium] Missing error check on `libbpf_num_possible_cpus()` can produce a negative variable-length array (VLA) size and crash the userspace loader.
--
commit 5065582a0672767bbb7d29e1575961ffc48ac6d0
Author: Qiliang Yuan <realwujing@gmail.com>
sched_ext: Add scx_ai_numa scheduler example for AI workloads
This commit introduces a NUMA-aware BPF scheduler tailored for AI workloads to
preserve L3 cache warmth. The scheduler implements per-NUMA-node dispatch
queues, an idle fast path, task NUMA affinity tracking, and work stealing. A
userspace loader configures the scheduler and reports dispatch statistics.
> diff --git a/tools/sched_ext/scx_ai_numa.bpf.c b/tools/sched_ext/scx_ai_numa.bpf.c
> new file mode 100644
> index 0000000000000..89d3b7dd3d474
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.bpf.c
[ ... ]
> +void BPF_STRUCT_OPS(ai_numa_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_ctx *tctx;
> +	u32 node = 0;
> +
> +	/*
> +	 * Route to the task's preferred NUMA node DSQ.
> +	 * Keeping AI tasks on the same NUMA node as their GPU's host memory
> +	 * reduces cross-node DRAM traffic and PCIe DMA stalls.
> +	 */
> +	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
If a task is pinned to a single CPU, the sched_ext core intentionally skips
ops.select_cpu() and the task's NUMA affinity (tctx) is never initialized.
Will this cause the task to always default to NUMA_DSQ(0)?
If so, CPUs on node 0 might constantly iterate over these tasks but fail to
dispatch them because task_can_run_on_remote_rq() returns false, potentially
breaking NUMA isolation until the pinned CPU work-steals from NUMA_DSQ(0).
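
A possible fallback, as an untested sketch: when task storage is absent at
enqueue time, derive the node from the CPU the task is currently on instead
of defaulting to node 0. The task_ctx field name below is a guess, since that
part of the patch is elided here:

	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
	if (tctx) {
		node = tctx->node;	/* assumed field name */
	} else {
		/*
		 * Pinned tasks skip ops.select_cpu(), so task storage may
		 * never have been initialized. Use the node of the CPU the
		 * task last ran on rather than NUMA_DSQ(0).
		 */
		node = cpu_to_node(scx_bpf_task_cpu(p));
	}

That would keep pinned tasks on a DSQ that their own node actually consumes
from.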
[ ... ]
> +void BPF_STRUCT_OPS(ai_numa_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	u32 my_node = cpu_to_node(cpu);
> +	u32 i;
> +
> +	/* First: consume from our own NUMA node — zero cross-node traffic */
> +	if (scx_bpf_dsq_move_to_local(NUMA_DSQ(my_node), 0))
Can cpu_to_node() return a node ID greater than or equal to nr_nodes (e.g. on
systems with sparse NUMA topologies or more than 16 nodes)?
If my_node is >= nr_nodes, the DSQ would not have been created by
ai_numa_init(). Could this cause the core scheduler to fail in find_user_dsq()
and trigger an scx_error() that aborts the BPF scheduler?
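
A defensive clamp at the top of ai_numa_dispatch() would avoid the abort; an
untested sketch, assuming nr_nodes is the global that userspace fills in
before attaching:

	u32 my_node = cpu_to_node(cpu);

	/*
	 * Only DSQs 0..nr_nodes-1 were created in ai_numa_init(); clamp so
	 * a sparse or oversized topology cannot reach a DSQ that does not
	 * exist and abort the scheduler via scx_error().
	 */
	if (my_node >= nr_nodes)
		my_node = 0;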
[ ... ]
> diff --git a/tools/sched_ext/scx_ai_numa.c b/tools/sched_ext/scx_ai_numa.c
> new file mode 100644
> index 0000000000000..58c7bb1bd6bb6
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.c
[ ... ]
> +/* Detect NUMA node count by scanning sysfs */
> +static __u32 detect_nr_nodes(void)
> +{
> +	struct stat st;
> +	char path[64];
> +	__u32 i, count = 0;
> +
> +	for (i = 0; i < MAX_NUMA_NODES; i++) {
> +		snprintf(path, sizeof(path),
> +			 "/sys/devices/system/node/node%u", i);
> +		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
> +			count = i + 1;
> +		else
> +			break;
If the system has sparse NUMA nodes (for example, only node 0 and node 2 are
present), will this loop prematurely break at i = 1?
This would set nr_nodes to 1, so initialization would only create NUMA_DSQ(0).
Tasks running on node 2 would then enqueue to or dispatch from the non-existent
NUMA_DSQ(2), which could trigger an scx_error() and abort the BPF scheduler.
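
One way to tolerate holes, as an untested sketch: scan the full range and
record the highest node directory present instead of stopping at the first
gap:

	for (i = 0; i < MAX_NUMA_NODES; i++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%u", i);
		/* Keep scanning past holes (e.g. node1 absent, node2 present). */
		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
			count = i + 1;
	}
	return count;

This still creates DSQs for the absent nodes in between, but an unused DSQ is
harmless, whereas a missing DSQ for a present node aborts the scheduler.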
[ ... ]
> +static void print_stats(struct scx_ai_numa *skel, __u32 nr_nodes)
> +{
> +	int nr_cpus = libbpf_num_possible_cpus();
> +	int map_fd = bpf_map__fd(skel->maps.node_stats);
> +
> +	printf("\n%-6s %14s %14s %14s\n",
> +	       "Node", "Local-DSQ", "NUMA-DSQ", "Steals");
> +	printf("------+--------------+--------------+--------------\n");
> +
> +	for (__u32 node = 0; node < nr_nodes; node++) {
> +		struct node_stat per_cpu[nr_cpus];
Does libbpf_num_possible_cpus() need an error check here?
If it fails and returns a negative error code, passing a negative size to the
variable-length array per_cpu could cause undefined behavior and crash the
userspace loader.
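
Checking the return value at the top of print_stats() would cover this; an
untested sketch:

	int nr_cpus = libbpf_num_possible_cpus();

	if (nr_cpus < 0) {
		fprintf(stderr, "failed to query possible CPUs: %d\n", nr_cpus);
		return;
	}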
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com?part=1