From: Tejun Heo
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext@lists.linux.dev, Emil Tsalapatis, linux-kernel@vger.kernel.org, Tejun Heo
Subject: [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override()
Date: Thu, 2 Jul 2026 22:01:37 -1000
Message-ID: <20260703080159.2314350-11-tj@kernel.org>
In-Reply-To: <20260703080159.2314350-1-tj@kernel.org>
References: <20260703080159.2314350-1-tj@kernel.org>

An overridden cid mapping invalidates the auto-generated shard layout, so
the override call has to provide both. Extend scx_bpf_cid_override() with a
shard_start[] array that lists the first cid of each shard (starting at 0,
strictly increasing, last shard implicitly extends to num_possible_cpus()).
A scheduler that wants only custom shards with the auto-generated cid
mapping can read the current mapping and pass it back unchanged.

Overridden shards can span NUMA nodes, so scx_shard_node[] is rebuilt by
majority count: each shard is assigned to the node that owns the most cpus
in it.

Signed-off-by: Tejun Heo
---
 kernel/sched/ext/cid.c                   | 136 +++++++++++++++++++++--
 tools/sched_ext/include/scx/compat.bpf.h |  11 +-
 tools/sched_ext/scx_qmap.bpf.c           |  16 ++-
 tools/sched_ext/scx_qmap.c               |  34 +++++-
 tools/sched_ext/scx_qmap.h               |   1 +
 5 files changed, 174 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
index 9d75b9311978..bd0467e8a8d2 100644
--- a/kernel/sched/ext/cid.c
+++ b/kernel/sched/ext/cid.c
@@ -392,29 +392,58 @@ void scx_cpumask_to_cmask(const struct cpumask *src, struct scx_cmask *dst)
 	}
 }
 
+/*
+ * Return the index of the largest entry in @counts, or NUMA_NO_NODE if all
+ * entries are zero. Ties resolve to the lowest index.
+ */
+static s32 pick_max_node(const u32 *counts, u32 n)
+{
+	s32 best = NUMA_NO_NODE;
+	u32 best_count = 0, i;
+
+	for (i = 0; i < n; i++) {
+		if (counts[i] > best_count) {
+			best_count = counts[i];
+			best = i;
+		}
+	}
+	return best;
+}
+
 __bpf_kfunc_start_defs();
 
 /**
- * scx_bpf_cid_override - Install an explicit cpu->cid mapping
+ * scx_bpf_cid_override - Install an explicit cpu->cid mapping with shard info
  * @cpu_to_cid: array of nr_cpu_ids s32 entries (cid for each cpu)
  * @cpu_to_cid__sz: must be nr_cpu_ids * sizeof(s32) bytes
+ * @shard_start: array of first-cid-of-each-shard, strictly increasing from 0
+ * @shard_start__sz: nr_shards * sizeof(s32) bytes
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
  * May only be called from ops.init_cids() of the root scheduler. Replace the
- * topology-probed cid mapping with the caller-provided one. Each possible cpu
- * must map to a unique cid in [0, num_possible_cpus()). Topo info is cleared.
- * On invalid input, trigger scx_error() to abort the scheduler.
+ * topology-probed cid mapping and shard layout with caller-provided ones. Each
+ * possible cpu must map to a unique cid in [0, num_possible_cpus()).
+ * @shard_start must be strictly increasing with shard_start[0] == 0 and all
+ * values < num_possible_cpus(). The last shard extends to num_possible_cpus()
+ * and no shard may span more than SCX_CID_SHARD_MAX_CPUS cids. Topo info
+ * (core/LLC/node) is cleared and shard info is set from @shard_start. On
+ * invalid input, abort the scheduler.
  */
 __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
-				      const struct bpf_prog_aux *aux)
+				      const s32 *shard_start, u32 shard_start__sz,
+				      const struct bpf_prog_aux *aux)
 {
 	cpumask_var_t seen __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	u32 *node_counts __free(kfree) = NULL;
+	u32 npossible = num_possible_cpus();
 	struct scx_sched *sch;
+	u32 nr_shards;
 	bool alloced;
-	s32 cpu, cid;
+	s32 cpu, cid, si;
 
-	/* GFP_KERNEL alloc must happen before the rcu read section */
+	/* GFP_KERNEL allocs must happen before the rcu read section */
 	alloced = zalloc_cpumask_var(&seen, GFP_KERNEL);
+	node_counts = kcalloc(nr_node_ids, sizeof(*node_counts), GFP_KERNEL);
 
 	guard(rcu)();
 
@@ -422,17 +451,57 @@ __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
 	if (unlikely(!sch))
 		return;
 
-	if (!alloced) {
-		scx_error(sch, "scx_bpf_cid_override: failed to allocate cpumask");
+	if (!alloced || !node_counts) {
+		scx_error(sch, "scx_bpf_cid_override: allocation failed");
 		return;
 	}
 
 	if (cpu_to_cid__sz != nr_cpu_ids * sizeof(s32)) {
-		scx_error(sch, "scx_bpf_cid_override: expected %zu bytes, got %u",
+		scx_error(sch, "scx_bpf_cid_override: cpu_to_cid expected %zu bytes, got %u",
 			  nr_cpu_ids * sizeof(s32), cpu_to_cid__sz);
 		return;
 	}
 
+	if (!shard_start__sz || shard_start__sz % sizeof(s32)) {
+		scx_error(sch, "scx_bpf_cid_override: invalid shard_start size %u",
+			  shard_start__sz);
+		return;
+	}
+
+	nr_shards = shard_start__sz / sizeof(s32);
+
+	/* validate shard_start[]: starts at 0, strictly increasing, in range */
+	if (shard_start[0] != 0) {
+		scx_error(sch, "scx_bpf_cid_override: shard_start[0] must be 0, got %d",
+			  shard_start[0]);
+		return;
+	}
+	for (si = 1; si < nr_shards; si++) {
+		if (shard_start[si] <= shard_start[si - 1]) {
+			scx_error(sch, "scx_bpf_cid_override: shard_start not increasing at [%d]",
+				  si);
+			return;
+		}
+		if (shard_start[si] >= npossible) {
+			scx_error(sch, "scx_bpf_cid_override: shard_start[%d]=%d >= %u",
+				  si, shard_start[si], npossible);
+			return;
+		}
+		if (shard_start[si] - shard_start[si - 1] > SCX_CID_SHARD_MAX_CPUS) {
+			scx_error(sch, "scx_bpf_cid_override: shard[%d] span %d exceeds max %d",
+				  si - 1, shard_start[si] - shard_start[si - 1],
+				  SCX_CID_SHARD_MAX_CPUS);
+			return;
+		}
+	}
+	if (npossible - shard_start[nr_shards - 1] > SCX_CID_SHARD_MAX_CPUS) {
+		scx_error(sch, "scx_bpf_cid_override: shard[%d] span %d exceeds max %d",
+			  nr_shards - 1, npossible - shard_start[nr_shards - 1],
+			  SCX_CID_SHARD_MAX_CPUS);
+		return;
+	}
+
+	/* Validate first so that invalid input leaves globals untouched. */
 	for_each_possible_cpu(cpu) {
 		s32 c = cpu_to_cid[cpu];
 
@@ -442,13 +511,56 @@ __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
 			scx_error(sch, "cid %d assigned to multiple cpus", c);
 			return;
 		}
+	}
+
+	for_each_possible_cpu(cpu) {
+		s32 c = cpu_to_cid[cpu];
+
 		scx_cpu_to_cid_tbl[cpu] = c;
 		scx_cid_to_cpu_tbl[c] = cpu;
 	}
 
-	/* Invalidate stale topo info - the override carries no topology. */
-	for (cid = 0; cid < num_possible_cpus(); cid++)
+	/*
+	 * Derive scx_shard_node[] by majority count: an overridden shard may
+	 * span NUMA nodes, so assign each to the node that owns the most cpus.
+	 */
+	for (si = 0; si < nr_shards; si++) {
+		u32 end = (si + 1 < nr_shards) ? shard_start[si + 1] : npossible;
+
+		memset(node_counts, 0, nr_node_ids * sizeof(*node_counts));
+		for (cid = shard_start[si]; cid < end; cid++) {
+			s32 node = cpu_to_node(scx_cid_to_cpu_tbl[cid]);
+
+			if (numa_valid_node(node))
+				node_counts[node]++;
+		}
+		scx_shard_node[si] = pick_max_node(node_counts, nr_node_ids);
+	}
+
+	/*
+	 * Invalidate stale topo info and install shard layout from
+	 * @shard_start. Walk shards to derive shard_cid/shard_idx for each cid.
+	 */
+	si = 0;
+	for (cid = 0; cid < npossible; cid++) {
+		if (si + 1 < nr_shards && cid >= shard_start[si + 1])
+			si++;
+		scx_cid_to_shard[cid] = si;
 		scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+		scx_cid_topo[cid].shard_cid = shard_start[si];
+		scx_cid_topo[cid].shard_idx = si;
+	}
+
+	/* Rebuild scx_cid_shard_ranges[] for the new layout. */
+	memset(scx_cid_shard_ranges, 0, npossible * sizeof(*scx_cid_shard_ranges));
+	for (si = 0; si < nr_shards; si++) {
+		u32 end = (si + 1 < nr_shards) ? shard_start[si + 1] : npossible;
+
+		scx_cid_shard_ranges[si].base_cid = shard_start[si];
+		scx_cid_shard_ranges[si].nr_cids = end - shard_start[si];
+	}
+
+	scx_nr_cid_shards = nr_shards;
 }
 
 /**
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 133058578668..cf469d5ff9ca 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -122,15 +122,18 @@ static inline bool scx_bpf_sub_dispatch(u64 cgroup_id)
 }
 
 /*
- * v7.2: scx_bpf_cid_override() for explicit cpu->cid mapping. Ignore if
+ * v7.3: scx_bpf_cid_override() for explicit cid and shard mapping. Ignore if
  * missing.
  */
-void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz) __ksym __weak;
+void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
+				   const s32 *shard_start, u32 shard_start__sz) __ksym __weak;
 
-static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
+static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
+					const s32 *shard_start, u32 shard_start__sz)
 {
 	if (bpf_ksym_exists(scx_bpf_cid_override___compat))
-		return scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz);
+		scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz,
+					      shard_start, shard_start__sz);
 }
 
 /**
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 2df7c53992dc..f6cfe63425d3 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -54,18 +54,20 @@ const volatile u32 max_tasks;
 
 /*
  * Optional cid-override test harness. When cid_override_mode is non-zero,
- * qmap_init_cids() calls scx_bpf_cid_override() with the caller-supplied
- * cpu_to_cid array to exercise the kfunc's acceptance and error paths. See enum
+ * qmap_init_cids() calls scx_bpf_cid_override() with the caller-supplied arrays
+ * to exercise the kfunc's acceptance and error paths. See enum
  * qmap_cid_override for the modes.
  */
 const volatile u32 cid_override_mode;
+const volatile u32 cid_override_nr_shards;
 /*
- * Array lives in bss (writable) because scx_bpf_cid_override()'s BPF
- * verifier signature treats its len-paired pointer as read/write - rodata
+ * Arrays live in bss (writable) because scx_bpf_cid_override()'s BPF
+ * verifier signature treats its len-paired pointers as read/write - rodata
  * fails verification with "write into map forbidden". Userspace populates
- * it before SCX_OPS_LOAD, same as rodata, and nothing writes it after.
+ * them before SCX_OPS_LOAD, same as rodata, and nothing writes them after.
  */
 s32 cid_override_cpu_to_cid[SCX_QMAP_MAX_CPUS];
+s32 cid_override_shard_start[SCX_QMAP_MAX_CPUS];
 
 UEI_DEFINE(uei);
 
@@ -1082,7 +1084,9 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init_cids)
 	}
 
 	scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
-			     nr_cpu_ids * sizeof(s32));
+			     nr_cpu_ids * sizeof(s32),
+			     (const s32 *)cid_override_shard_start,
+			     cid_override_nr_shards * sizeof(s32));
 	return 0;
 }
 
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index c0b5cab579d6..9124183bffec 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -43,7 +43,7 @@ const char help_fmt[] =
 " -p Switch only tasks on SCHED_EXT policy instead of all\n"
 " -I Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n"
 " -F COUNT IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n"
-" -C MODE cid-override test (shuffle|bad-dup|bad-range)\n"
+" -C MODE cid-override test (shuffle|bad-dup|bad-range|bad-mono)\n"
 " -v Print libbpf debug messages\n"
 " -h Display this help and exit\n";
 
@@ -155,6 +155,7 @@ int main(int argc, char **argv)
 		case 'C': {
 			u32 nr_cpus = libbpf_num_possible_cpus();
 			u32 mode, i;
+			s32 shard_sz = 4;
 
 			if (!strcmp(optarg, "shuffle"))
 				mode = QMAP_CID_OVR_SHUFFLE;
@@ -162,13 +163,15 @@ int main(int argc, char **argv)
 				mode = QMAP_CID_OVR_BAD_DUP;
 			else if (!strcmp(optarg, "bad-range"))
 				mode = QMAP_CID_OVR_BAD_RANGE;
+			else if (!strcmp(optarg, "bad-mono"))
+				mode = QMAP_CID_OVR_BAD_MONO;
 			else {
 				fprintf(stderr, "unknown cid-override mode '%s'\n", optarg);
 				return 1;
 			}
 			skel->rodata->cid_override_mode = mode;
 
-			/* shuffle: reversed cpu_to_cid, bad-dup: dup cid 0, bad-range: identity */
+			/* shuffle: reversed cpu_to_cid; others: identity */
 			for (i = 0; i < nr_cpus; i++) {
 				if (mode == QMAP_CID_OVR_SHUFFLE)
 					skel->bss->cid_override_cpu_to_cid[i] = nr_cpus - 1 - i;
@@ -179,6 +182,33 @@ int main(int argc, char **argv)
 				skel->bss->cid_override_cpu_to_cid[1] = 0;
 			if (mode == QMAP_CID_OVR_BAD_RANGE)
 				skel->bss->cid_override_cpu_to_cid[0] = (s32)nr_cpus;
+
+			/*
+			 * bad-mono needs >= 3 shards to build a 0-based but
+			 * non-monotonic shard_start. Shrink the shard size so
+			 * the test runs on any machine with >= 3 cpus.
+			 */
+			if (mode == QMAP_CID_OVR_BAD_MONO) {
+				if (nr_cpus < 3) {
+					fprintf(stderr, "bad-mono needs >= 3 cpus (have %u)\n",
+						nr_cpus);
+					return 1;
+				}
+				shard_sz = nr_cpus / 3;
+			}
+
+			/* shards of shard_sz each */
+			skel->rodata->cid_override_nr_shards = (nr_cpus + shard_sz - 1) / shard_sz;
+			for (i = 0; i < skel->rodata->cid_override_nr_shards; i++)
+				skel->bss->cid_override_shard_start[i] = i * shard_sz;
+
+			if (mode == QMAP_CID_OVR_BAD_MONO) {
+				/* swap [1] and [2] to break monotonicity */
+				s32 tmp = skel->bss->cid_override_shard_start[1];
+				skel->bss->cid_override_shard_start[1] =
+					skel->bss->cid_override_shard_start[2];
+				skel->bss->cid_override_shard_start[2] = tmp;
+			}
 			break;
 		}
 		case 'v':
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index 3bcc3579839d..6c3ea1fc74ed 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -33,6 +33,7 @@ enum qmap_cid_override {
 	QMAP_CID_OVR_SHUFFLE	= 1,	/* valid reversed cpu->cid mapping */
 	QMAP_CID_OVR_BAD_DUP	= 2,	/* invalid: duplicate cid assignment */
 	QMAP_CID_OVR_BAD_RANGE	= 3,	/* invalid: out-of-range cid */
+	QMAP_CID_OVR_BAD_MONO	= 4,	/* invalid: non-monotonic shard_start */
 };
 
 struct cpu_ctx {
-- 
2.54.0
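
Illustration only, not part of the patch: the shard_start[] rules and the
majority-count node assignment described in the commit message can be tried
out in plain user-space C. The sketch below is a rough model under stated
assumptions. SHARD_MAX_CPUS, NO_NODE, MAX_NODES, validate_shard_start(),
assign_shard_nodes() and the cid_node[] input are hypothetical stand-ins
(for SCX_CID_SHARD_MAX_CPUS, NUMA_NO_NODE, nr_node_ids, the in-kernel
checks, and cpu_to_node(scx_cid_to_cpu_tbl[cid])); the value 64 is an
arbitrary placeholder, not the kernel's limit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SHARD_MAX_CPUS	64	/* placeholder for SCX_CID_SHARD_MAX_CPUS */
#define NO_NODE		(-1)	/* placeholder for NUMA_NO_NODE */
#define MAX_NODES	8	/* placeholder upper bound on NUMA nodes */

/* Index of the largest count, NO_NODE if all zero; ties pick the lowest index. */
static int32_t pick_max_node(const uint32_t *counts, uint32_t n)
{
	int32_t best = NO_NODE;
	uint32_t best_count = 0, i;

	for (i = 0; i < n; i++) {
		if (counts[i] > best_count) {
			best_count = counts[i];
			best = (int32_t)i;
		}
	}
	return best;
}

/*
 * Model of the shard_start[] checks: starts at 0, strictly increasing,
 * every value below npossible, no shard wider than SHARD_MAX_CPUS.
 * Returns 0 when valid and a negative code naming the failed rule otherwise.
 */
static int validate_shard_start(const int32_t *shard_start, uint32_t nr_shards,
				uint32_t npossible)
{
	uint32_t si;

	if (!nr_shards || shard_start[0] != 0)
		return -1;			/* empty or not 0-based */
	for (si = 1; si < nr_shards; si++) {
		if (shard_start[si] <= shard_start[si - 1])
			return -2;		/* not strictly increasing */
		if ((uint32_t)shard_start[si] >= npossible)
			return -3;		/* out of range */
		if (shard_start[si] - shard_start[si - 1] > SHARD_MAX_CPUS)
			return -4;		/* shard too wide */
	}
	if (npossible - (uint32_t)shard_start[nr_shards - 1] > SHARD_MAX_CPUS)
		return -4;			/* last shard too wide */
	return 0;
}

/*
 * Assign each shard to the node that owns the most of its cids.
 * cid_node[cid] is the NUMA node of the cpu behind @cid. The caller must
 * have validated shard_start[] first.
 */
static void assign_shard_nodes(const int32_t *shard_start, uint32_t nr_shards,
			       uint32_t npossible, const int32_t *cid_node,
			       uint32_t nr_nodes, int32_t *shard_node)
{
	uint32_t counts[MAX_NODES];
	uint32_t si, cid;

	assert(nr_nodes <= MAX_NODES);

	for (si = 0; si < nr_shards; si++) {
		/* the last shard implicitly extends to npossible */
		uint32_t end = (si + 1 < nr_shards) ?
			(uint32_t)shard_start[si + 1] : npossible;

		memset(counts, 0, sizeof(counts));
		for (cid = (uint32_t)shard_start[si]; cid < end; cid++) {
			int32_t node = cid_node[cid];

			if (node >= 0 && (uint32_t)node < nr_nodes)
				counts[node]++;
		}
		shard_node[si] = pick_max_node(counts, nr_nodes);
	}
}
```

With 8 cids split at {0, 3, 6} and per-cid nodes {0,0,1, 1,1,0, 0,1}, this
model assigns the shards to nodes 0, 1 and 0; the last shard is a 1:1 tie
that resolves to the lower node index. A swapped layout such as {0, 6, 3},
which is what the bad-mono test mode builds, fails the strictly-increasing
check.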