From: sashiko-bot@kernel.org
Subject: Re: [PATCH] sched_ext: Add scx_ai_numa scheduler example for AI workloads
Reply-To: sashiko@lists.linux.dev
To: "Qiliang Yuan"
Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org
In-Reply-To: <20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com>
References: <20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com>
Date: Fri, 08 May 2026 19:08:27 +0000
Message-Id: <20260508190828.8A6B2C2BCB0@smtp.kernel.org>

Thank you for your contribution! Sashiko AI review found 4 potential issues to consider:

- [High] Missing bounds check on `my_node` in `ai_numa_dispatch` triggers `scx_error()` and a scheduler abort on sparse or large NUMA topologies.
- [High] Tasks pinned to a single CPU bypass `select_cpu` and are improperly routed to `NUMA_DSQ(0)`, wasting CPU cycles and breaking NUMA isolation.
- [High] Userspace NUMA node detection fails on sparse topologies, degrading the scheduler and aborting it.
- [Medium] Missing error check on `libbpf_num_possible_cpus()` can yield a negative variable-length array (VLA) size and crash the userspace loader.

--
commit 5065582a0672767bbb7d29e1575961ffc48ac6d0
Author: Qiliang Yuan

    sched_ext: Add scx_ai_numa scheduler example for AI workloads

    This commit introduces a NUMA-aware BPF scheduler tailored for AI
    workloads to preserve L3 cache warmth. The scheduler implements
    per-NUMA-node dispatch queues, an idle fast path, task NUMA affinity
    tracking, and work stealing. A userspace loader configures the
    scheduler and reports dispatch statistics.
> diff --git a/tools/sched_ext/scx_ai_numa.bpf.c b/tools/sched_ext/scx_ai_numa.bpf.c
> new file mode 100644
> index 0000000000000..89d3b7dd3d474
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.bpf.c

[ ... ]

> +void BPF_STRUCT_OPS(ai_numa_enqueue, struct task_struct *p, u64 enq_flags)
> +{
> +	struct task_ctx *tctx;
> +	u32 node = 0;
> +
> +	/*
> +	 * Route to the task's preferred NUMA node DSQ.
> +	 * Keeping AI tasks on the same NUMA node as their GPU's host memory
> +	 * reduces cross-node DRAM traffic and PCIe DMA stalls.
> +	 */
> +	tctx = bpf_task_storage_get(&task_ctx_stor, p, 0, 0);

If a task is pinned to a single CPU, the sched_ext core intentionally skips
ops.select_cpu() and the task's NUMA affinity (tctx) is never initialized.
Will this cause the task to always default to NUMA_DSQ(0)?

If so, CPUs on node 0 might constantly iterate over these tasks but fail to
dispatch them because task_can_run_on_remote_rq() returns false, potentially
breaking NUMA isolation until the pinned CPU work-steals from NUMA_DSQ(0).

[ ... ]

> +void BPF_STRUCT_OPS(ai_numa_dispatch, s32 cpu, struct task_struct *prev)
> +{
> +	u32 my_node = cpu_to_node(cpu);
> +	u32 i;
> +
> +	/* First: consume from our own NUMA node -- zero cross-node traffic */
> +	if (scx_bpf_dsq_move_to_local(NUMA_DSQ(my_node), 0))

Can cpu_to_node() return a node ID greater than or equal to nr_nodes (e.g. on
systems with sparse NUMA topologies or more than 16 nodes)? If my_node is
>= nr_nodes, the DSQ would not have been created by ai_numa_init(). Could
this cause the core scheduler to fail in find_user_dsq() and trigger an
scx_error() that aborts the BPF scheduler?

[ ... ]

> diff --git a/tools/sched_ext/scx_ai_numa.c b/tools/sched_ext/scx_ai_numa.c
> new file mode 100644
> index 0000000000000..58c7bb1bd6bb6
> --- /dev/null
> +++ b/tools/sched_ext/scx_ai_numa.c

[ ... ]
> +/* Detect NUMA node count by scanning sysfs */
> +static __u32 detect_nr_nodes(void)
> +{
> +	struct stat st;
> +	char path[64];
> +	__u32 i, count = 0;
> +
> +	for (i = 0; i < MAX_NUMA_NODES; i++) {
> +		snprintf(path, sizeof(path),
> +			 "/sys/devices/system/node/node%u", i);
> +		if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
> +			count = i + 1;
> +		else
> +			break;

If the system has sparse NUMA nodes (for example, only node 0 and node 2 are
present), will this loop prematurely break at i = 1? This would set nr_nodes
to 1, causing the initialization to only create NUMA_DSQ(0). When tasks run
on node 2, they might attempt to enqueue or dispatch from a non-existent
NUMA_DSQ(2), which could trigger a scheduler abort.

[ ... ]

> +static void print_stats(struct scx_ai_numa *skel, __u32 nr_nodes)
> +{
> +	int nr_cpus = libbpf_num_possible_cpus();
> +	int map_fd = bpf_map__fd(skel->maps.node_stats);
> +
> +	printf("\n%-6s %14s %14s %14s\n",
> +	       "Node", "Local-DSQ", "NUMA-DSQ", "Steals");
> +	printf("------+--------------+--------------+--------------\n");
> +
> +	for (__u32 node = 0; node < nr_nodes; node++) {
> +		struct node_stat per_cpu[nr_cpus];

Does libbpf_num_possible_cpus() need an error check here? If it fails and
returns a negative error code, passing a negative size to the variable-length
array per_cpu could cause undefined behavior and crash the userspace loader.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260508-feat-scx_ai_example-v1-1-2b498af3514d@gmail.com?part=1