From: sashiko-bot@kernel.org
To: "Qiliang Yuan" <realwujing@gmail.com>
Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads
Date: Fri, 08 May 2026 19:08:27 +0000
Message-ID: <20260508190828.8A6B2C2BCB0@smtp.kernel.org>
In-Reply-To: <20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com>
Thank you for your contribution! Sashiko AI review found 4 potential issues to consider:
- [High] Missing bounds check on `my_node` in `ai_numa_dispatch` triggers `scx_error()` and scheduler abort on sparse or large NUMA topologies.
- [High] Tasks pinned to a single CPU bypass `select_cpu` and are improperly routed to `NUMA_DSQ(0)`, causing CPU cycle waste and breaking NUMA isolation.
- [High] Userspace NUMA node detection logic fails on sparse topologies, degrading the scheduler and triggering a kernel abort.
- [Medium] Missing error check on `libbpf_num_possible_cpus()` leads to potential negative Variable Length Array (VLA) size and userspace crash.
--
commit 5065582a0672767bbb7d29e1575961ffc48ac6d0
Author: Qiliang Yuan <realwujing@gmail.com>
sched_ext: Add scx_ai_numa scheduler example for AI workloads
This commit introduces a NUMA-aware BPF scheduler tailored for AI workloads to
preserve L3 cache warmth. The scheduler implements per-NUMA-node dispatch
queues, an idle fast path, task NUMA affinity tracking, and work stealing. A
userspace loader configures the scheduler and reports dispatch statistics.
> diff --git a/tools/sched_ext/scx_ai_numa.bpf.c b/tools/sched_ext/scx_ai_numa.bpf.c
> new file mode 100644
> index 0000000000000..89d3b7dd3d474
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.bpf.c
[ ... ]
> +void BPF_STRUCT_OPS(ai_numa_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> + struct task_ctx *tctx;
> + u32 node = 0;
> +
> + /*
> + * Route to the task's preferred NUMA node DSQ.
> + * Keeping AI tasks on the same NUMA node as their GPU's host memory
> + * reduces cross-node DRAM traffic and PCIe DMA stalls.
> + */
> + tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
If a task is pinned to a single CPU, the sched_ext core intentionally skips
ops.select_cpu() and the task's NUMA affinity (tctx) is never initialized.
Will this cause the task to always default to NUMA_DSQ(0)?
If so, CPUs on node 0 might constantly iterate over these tasks but fail to
dispatch them because task_can_run_on_remote_rq() returns false, potentially
breaking NUMA isolation until the pinned CPU work-steals from NUMA_DSQ(0).
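One possible way to avoid the node 0 default is to fall back to the node of the CPU the task is pinned to whenever tctx was never initialized. The following is a plain userspace C model of that decision only, not BPF code from the patch. `struct task_model`, `pick_enqueue_node()`, and the -1 return value meaning "no valid per-node DSQ" are hypothetical names introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state ai_numa_enqueue() would consult. */
struct task_model {
	bool has_ctx;            /* false if select_cpu() never initialized tctx */
	uint32_t preferred_node; /* only meaningful when has_ctx is true */
	int nr_cpus_allowed;     /* 1 means the task is pinned to a single CPU */
	uint32_t cpu_node;       /* NUMA node of the CPU the task is bound to */
};

/*
 * Return the node whose DSQ should receive the task, or -1 when no
 * per-node DSQ is valid and the caller needs some other fallback queue.
 */
static int32_t pick_enqueue_node(const struct task_model *t, uint32_t nr_nodes)
{
	uint32_t node;

	/* Pinned or uninitialized tasks follow their CPU, not node 0. */
	if (t->nr_cpus_allowed == 1 || !t->has_ctx)
		node = t->cpu_node;
	else
		node = t->preferred_node;

	return node < nr_nodes ? (int32_t)node : -1;
}
```

In the BPF program the equivalent inputs would have to come from the task itself; which helpers to use for that is left open here.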
[ ... ]
> +void BPF_STRUCT_OPS(ai_numa_dispatch, s32 cpu, struct task_struct *prev)
> +{
> + u32 my_node = cpu_to_node(cpu);
> + u32 i;
> +
> + /* First: consume from our own NUMA node — zero cross-node traffic */
> + if (scx_bpf_dsq_move_to_local(NUMA_DSQ(my_node), 0))
Can cpu_to_node() return a node ID greater than or equal to nr_nodes (e.g. on
systems with sparse NUMA topologies or more than 16 nodes)?
If my_node is >= nr_nodes, the DSQ would not have been created by
ai_numa_init(). Could this cause the core scheduler to fail in find_user_dsq()
and trigger an scx_error() that aborts the BPF scheduler?
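A bounds check before touching NUMA_DSQ(my_node) would sidestep this. The sketch below models the guarded consume order in plain userspace C, assuming the dispatch path tries its own node first and then the other nodes as steal candidates. `dispatch_order()` is a hypothetical helper, and the actual stealing order in the patch may differ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill out[] with the node DSQs a CPU on my_node may consume from:
 * its own node first (only if that DSQ was created, i.e. my_node is
 * below nr_nodes), then the remaining nodes as steal candidates.
 * out[] must have room for nr_nodes entries. Returns the entry count.
 */
static size_t dispatch_order(uint32_t my_node, uint32_t nr_nodes, uint32_t *out)
{
	size_t n = 0;
	uint32_t i;

	/* Skip the local DSQ entirely when it was never created. */
	if (my_node < nr_nodes)
		out[n++] = my_node;

	for (i = 0; i < nr_nodes; i++) {
		if (i != my_node)
			out[n++] = i;
	}
	return n;
}
```

With this shape, a CPU whose node ID is out of range never references a missing DSQ and still makes progress by consuming from the DSQs that do exist.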
[ ... ]
> diff --git a/tools/sched_ext/scx_ai_numa.c b/tools/sched_ext/scx_ai_numa.c
> new file mode 100644
> index 0000000000000..58c7bb1bd6bb6
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.c
[ ... ]
> +/* Detect NUMA node count by scanning sysfs */
> +static __u32 detect_nr_nodes(void)
> +{
> + struct stat st;
> + char path[64];
> + __u32 i, count = 0;
> +
> + for (i = 0; i < MAX_NUMA_NODES; i++) {
> + snprintf(path, sizeof(path),
> + "/sys/devices/system/node/node%u", i);
> + if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
> + count = i + 1;
> + else
> + break;
If the system has sparse NUMA nodes (for example, only node 0 and node 2 are
present), will this loop prematurely break at i = 1?
This would set nr_nodes to 1, causing the initialization to only create
NUMA_DSQ(0). When tasks run on node 2, they might attempt to enqueue to or
dispatch from a non-existent NUMA_DSQ(2), which could trigger a kernel abort.
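One alternative to probing directories in sequence is to size nr_nodes from the highest node ID rather than from the first gap. The sketch below parses a kernel-style ID list such as "0,2" and returns the highest ID; the loader could then use that value plus one. It assumes the list would be read from a sysfs file such as /sys/devices/system/node/possible, and `max_id_in_list()` is a hypothetical helper name. The result would still need to be checked against MAX_NUMA_NODES.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Parse an ID list such as "0,2" or "0-1,4-7" (optionally ending in a
 * newline) and return the highest ID found, or -1 if the string is
 * empty or malformed.
 */
static long max_id_in_list(const char *s)
{
	long max = -1;

	while (*s && *s != '\n') {
		char *end;
		long lo, hi;

		if (!isdigit((unsigned char)*s))
			return -1;
		lo = strtol(s, &end, 10);
		hi = lo;
		s = end;

		/* Optional "-hi" suffix turns a single ID into a range. */
		if (*s == '-') {
			s++;
			if (!isdigit((unsigned char)*s))
				return -1;
			hi = strtol(s, &end, 10);
			s = end;
		}
		if (hi < lo)
			return -1;
		if (hi > max)
			max = hi;

		if (*s == ',')
			s++;
		else if (*s && *s != '\n')
			return -1;
	}
	return max;
}
```

For the sparse case described above, the list "0,2" gives a highest ID of 2, so nr_nodes would become 3 and NUMA_DSQ(2) would be created.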
[ ... ]
> +static void print_stats(struct scx_ai_numa *skel, __u32 nr_nodes)
> +{
> + int nr_cpus = libbpf_num_possible_cpus();
> + int map_fd = bpf_map__fd(skel->maps.node_stats);
> +
> + printf("\n%-6s %14s %14s %14s\n",
> + "Node", "Local-DSQ", "NUMA-DSQ", "Steals");
> + printf("------+--------------+--------------+--------------\n");
> +
> + for (__u32 node = 0; node < nr_nodes; node++) {
> + struct node_stat per_cpu[nr_cpus];
Does libbpf_num_possible_cpus() need an error check here?
If it fails and returns a negative error code, passing a negative size to the
variable-length array per_cpu could cause undefined behavior and crash the
userspace loader.
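A minimal sketch of one way to guard this, assuming the loader is free to use heap allocation instead of a VLA: validate the count first and allocate only when it is positive. `alloc_percpu_buf()` is a hypothetical helper, not part of libbpf or the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate a zeroed buffer with one element per possible CPU.
 * nr_cpus is expected to come from libbpf_num_possible_cpus(), which
 * returns a negative error code on failure; that case yields NULL
 * here instead of a negative-sized array.
 */
static void *alloc_percpu_buf(int nr_cpus, size_t elem_size)
{
	if (nr_cpus <= 0 || elem_size == 0)
		return NULL;
	return calloc((size_t)nr_cpus, elem_size);
}
```

print_stats() could then skip printing when the helper returns NULL, and free the buffer once after the per-node loop.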
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com?part=1
Thread overview: 5+ messages
2026-05-08 7:51 [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads Qiliang Yuan
2026-05-08 7:56 ` Andrea Righi
2026-05-08 9:29 ` Christian Loehle
2026-05-08 9:37 ` Andrea Righi
2026-05-08 19:08 ` sashiko-bot [this message]