From: sashiko-bot@kernel.org
To: "Qiliang Yuan" <realwujing@gmail.com>
Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads
Date: Fri, 08 May 2026 19:08:27 +0000
Message-ID: <20260508190828.8A6B2C2BCB0@smtp.kernel.org>
In-Reply-To: <20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com>

Thank you for your contribution! Sashiko AI review found 4 potential issues to consider:
- [High] Missing bounds check on `my_node` in `ai_numa_dispatch` can trigger `scx_error()` and abort the scheduler on sparse or large NUMA topologies.
- [High] Tasks pinned to a single CPU bypass `select_cpu` and always fall back to `NUMA_DSQ(0)`, wasting CPU cycles and breaking NUMA isolation.
- [High] Userspace NUMA node detection undercounts nodes on sparse topologies, leaving DSQs uncreated and ultimately aborting the scheduler.
- [Medium] Missing error check on `libbpf_num_possible_cpus()` can produce a negative variable-length array (VLA) size and crash the userspace loader.
--
commit 5065582a0672767bbb7d29e1575961ffc48ac6d0
Author: Qiliang Yuan <realwujing@gmail.com>
sched_ext: Add scx_ai_numa scheduler example for AI workloads
This commit introduces a NUMA-aware BPF scheduler tailored for AI workloads to
preserve L3 cache warmth. The scheduler implements per-NUMA-node dispatch
queues, an idle fast path, task NUMA affinity tracking, and work stealing. A
userspace loader configures the scheduler and reports dispatch statistics.
> diff --git a/tools/sched_ext/scx_ai_numa.bpf.c b/tools/sched_ext/scx_ai_numa.bpf.c
> new file mode 100644
> index 0000000000000..89d3b7dd3d474
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.bpf.c
[ ... ]
> +void BPF_STRUCT_OPS(ai_numa_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_ctx *tctx;
> +	u32 node = 0;
> +
> +	/*
> +	 * Route to the task's preferred NUMA node DSQ.
> +	 * Keeping AI tasks on the same NUMA node as their GPU's host memory
> +	 * reduces cross-node DRAM traffic and PCIe DMA stalls.
> +	 */
> +	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
If a task is pinned to a single CPU, the sched_ext core intentionally skips
ops.select_cpu() and the task's NUMA affinity (tctx) is never initialized.
Will this cause the task to always default to NUMA_DSQ(0)?
If so, CPUs on node 0 might constantly iterate over these tasks but fail to
dispatch them because task_can_run_on_remote_rq() returns false, potentially
breaking NUMA isolation until the pinned CPU work-steals from NUMA_DSQ(0).
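
A possible fallback, as an untested sketch: when task storage is absent at
enqueue time, derive the node from the CPU the task is currently on instead
of defaulting to node 0. The task_ctx field name below is a guess, since that
part of the patch is elided here:

	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
	if (tctx) {
		node = tctx->node;	/* assumed field name */
	} else {
		/*
		 * Pinned tasks skip ops.select_cpu(), so task storage may
		 * never have been initialized. Use the node of the CPU the
		 * task last ran on rather than NUMA_DSQ(0).
		 */
		node = cpu_to_node(scx_bpf_task_cpu(p));
	}

That would keep pinned tasks on a DSQ that their own node actually consumes
from.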
[ ... ]
> +void BPF_STRUCT_OPS(ai_numa_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	u32 my_node = cpu_to_node(cpu);
> +	u32 i;
> +
> +	/* First: consume from our own NUMA node — zero cross-node traffic */
> +	if (scx_bpf_dsq_move_to_local(NUMA_DSQ(my_node), 0))
Can cpu_to_node() return a node ID greater than or equal to nr_nodes (e.g. on
systems with sparse NUMA topologies or more than 16 nodes)?
If my_node is >= nr_nodes, the DSQ would not have been created by
ai_numa_init(). Could this cause the core scheduler to fail in find_user_dsq()
and trigger an scx_error() that aborts the BPF scheduler?
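
A defensive clamp at the top of ai_numa_dispatch() would avoid the abort; an
untested sketch, assuming nr_nodes is the global that userspace fills in
before attaching:

	u32 my_node = cpu_to_node(cpu);

	/*
	 * Only DSQs 0..nr_nodes-1 were created in ai_numa_init(); clamp so
	 * a sparse or oversized topology cannot reach a DSQ that does not
	 * exist and abort the scheduler via scx_error().
	 */
	if (my_node >= nr_nodes)
		my_node = 0;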
[ ... ]
> diff --git a/tools/sched_ext/scx_ai_numa.c b/tools/sched_ext/scx_ai_numa.c
> new file mode 100644
> index 0000000000000..58c7bb1bd6bb6
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.c
[ ... ]
> +/* Detect NUMA node count by scanning sysfs */
> +static __u32 detect_nr_nodes(void)
> +{
> +	struct stat st;
> +	char path[64];
> +	__u32 i, count = 0;
> +
> +	for (i = 0; i < MAX_NUMA_NODES; i++) {
> +		snprintf(path, sizeof(path),
> +			 "/sys/devices/system/node/node%u", i);
> +		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
> +			count = i + 1;
> +		else
> +			break;
If the system has sparse NUMA nodes (for example, only node 0 and node 2 are
present), will this loop prematurely break at i = 1?
This would set nr_nodes to 1, so initialization would only create NUMA_DSQ(0).
Tasks running on node 2 would then enqueue to or dispatch from the non-existent
NUMA_DSQ(2), which could trigger an scx_error() and abort the BPF scheduler.
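
One way to tolerate holes, as an untested sketch: scan the full range and
record the highest node directory present instead of stopping at the first
gap:

	for (i = 0; i < MAX_NUMA_NODES; i++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%u", i);
		/* Keep scanning past holes (e.g. node1 absent, node2 present). */
		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
			count = i + 1;
	}
	return count;

This still creates DSQs for the absent nodes in between, but an unused DSQ is
harmless, whereas a missing DSQ for a present node aborts the scheduler.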
[ ... ]
> +static void print_stats(struct scx_ai_numa *skel, __u32 nr_nodes)
> +{
> +	int nr_cpus = libbpf_num_possible_cpus();
> +	int map_fd = bpf_map__fd(skel->maps.node_stats);
> +
> +	printf("\n%-6s %14s %14s %14s\n",
> +	       "Node", "Local-DSQ", "NUMA-DSQ", "Steals");
> +	printf("------+--------------+--------------+--------------\n");
> +
> +	for (__u32 node = 0; node < nr_nodes; node++) {
> +		struct node_stat per_cpu[nr_cpus];
Does libbpf_num_possible_cpus() need an error check here?
If it fails and returns a negative error code, passing a negative size to the
variable-length array per_cpu could cause undefined behavior and crash the
userspace loader.
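
Checking the return value at the top of print_stats() would cover this; an
untested sketch:

	int nr_cpus = libbpf_num_possible_cpus();

	if (nr_cpus < 0) {
		fprintf(stderr, "failed to query possible CPUs: %d\n", nr_cpus);
		return;
	}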
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com?part=1