* [PATCH v3 0/3] sched/fair: Optimize cfs_rq and sched_entity allocation for better data locality
From: Zecheng Li @ 2025-07-01 21:02 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Xu Liu, Blake Jones, Namhyung Kim, Josh Don,
Madadi Vineeth Reddy, linux-kernel, Zecheng Li
Hi all,
This patch series improves CFS cache performance by allocating each
task group's cfs_rq and sched_entity together using the per-cpu
allocator, which allows the pointer arrays in task_group to be replaced
with a single per-cpu offset.
v3:
- Rebased on top of 6.16-rc4.
- Minor wording and comment updates.
v2:
https://lore.kernel.org/lkml/20250609193834.2556866-1-zecheng@google.com/
- Allocate cfs_rq and sched_entity together for non-root task groups,
  instead of embedding sched_entity into cfs_rq, to avoid increasing the
  size of struct rq (based on feedback from Peter Zijlstra).
v1:
https://lore.kernel.org/lkml/20250604195846.193159-1-zecheng@google.com/
Accessing cfs_rq and sched_entity instances incurs many cache misses;
this series aims to reduce them. Each task_group instance (and the root
runqueue) holds one struct cfs_rq per CPU, and each cfs_rq instance
(except the root's) has a corresponding struct sched_entity. Currently,
both cfs_rq and sched_entity instances are allocated from NUMA-local
memory using kzalloc_node(), and tg->cfs_rq and tg->se are arrays of
pointers.
Original memory layout:

  tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
  tg->se     = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);

  +----+      +-----------------+
  | tg | ---> | cfs_rq pointers |
  +----+      +-----------------+
                 |      |      |
                 v      v      v
              cfs_rq  cfs_rq  cfs_rq

  +----+      +-------------------+
  | tg | ---> | sched_entity ptrs |
  +----+      +-------------------+
                 |      |      |
                 v      v      v
                se      se     se
Layout after Optimization:

  +--------+    | CPU 0  |  | CPU 1  |  | CPU 2  |
  |   tg   |    | percpu |  | percpu |  | percpu |
  |        |       ...         ...         ...
  | percpu | -> | cfs_rq |  | cfs_rq |  | cfs_rq |
  | offset |    |   se   |  |   se   |  |   se   |
  +--------+    +--------+  +--------+  +--------+
The optimization has two parts; a simplified sketch of the resulting
layout and accessors follows this list.

1) Co-allocate cfs_rq and sched_entity for non-root task groups.
   - This speeds up loading the sched_entity that represents a cfs_rq on
     its parent runqueue. Today that access chases pointers, i.e.
     cfs_rq->tg->se[cpu]; after co-locating, the sched_entity fields can
     be loaded with a simple offset computation from the cfs_rq.

2) Allocate the combined cfs_rq/se struct with the per-cpu allocator.
   - Hot-path accesses to cfs_rq instances mostly iterate over multiple
     task_groups for the same CPU. With the per-cpu layout, these
     accesses reuse the same per-CPU base pointer, which is more likely
     to stay in the CPU cache than the per-task_group pointer arrays.
   - This also reduces the memory needed for the pointer arrays.
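As a reference, here is a minimal sketch of the combined layout and the
offset-based accessors this series introduces. It is simplified from the
patches that follow (config guards and the root-task-group NULL checks
are omitted), so treat it as illustration rather than the final code:

	/* cfs_rq and its sched_entity in a single allocation */
	struct cfs_rq_with_se {
		struct cfs_rq		cfs_rq;
		/* cfs_rq's sched_entity on the parent runqueue */
		struct sched_entity	se ____cacheline_aligned;
	};

	/*
	 * The parent-runqueue entity is now a constant offset away from
	 * the cfs_rq, replacing the cfs_rq->tg->se[cpu] pointer chain.
	 */
	static inline struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
	{
		struct cfs_rq_with_se *combined =
			container_of(cfs_rq, struct cfs_rq_with_se, cfs_rq);

		return &combined->se;
	}

	/*
	 * With the per-cpu allocator, task_group keeps one __percpu base
	 * pointer; walking many task_groups on the same CPU reuses the
	 * same per-CPU offset instead of loading per-group pointer arrays.
	 */
	static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
	{
		return per_cpu_ptr(tg->cfs_rq, cpu);
	}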
To measure the impact of the patch series, we construct a tree-shaped
cgroup hierarchy, with “width” and “depth” parameters controlling the
number of children per node and the depth of the tree. Each leaf cgroup
runs a schbench workload and is given a CPU quota equal to 80% of the
total CPU capacity divided by the number of leaf cgroups (in other
words, the aggregate target CPU load is 80%), which exercises the
throttling functions. The bandwidth control period is set to 10 ms. We
ran the benchmark on Intel and AMD machines, each with hundreds of
hardware threads, on kernel 6.15.
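For reference, the cgroup setup can be approximated as below. This is a
rough sketch, not our actual test harness: it assumes cgroup v2 is
mounted at /sys/fs/cgroup with the cpu controller enabled on the
ancestors, and the group names, CPU count and helper names are made up
for illustration.

	/* Build a width^depth cgroup tree and set each leaf's cpu.max. */
	#include <stdio.h>
	#include <sys/stat.h>
	#include <sys/types.h>

	#define PERIOD_US	10000	/* 10 ms bandwidth period */

	static void write_cpu_max(const char *cg, long quota_us)
	{
		char path[512];
		FILE *f;

		snprintf(path, sizeof(path), "%s/cpu.max", cg);
		f = fopen(path, "w");
		if (f) {
			fprintf(f, "%ld %d\n", quota_us, PERIOD_US);
			fclose(f);
		}
	}

	/* Create "width" children per node down to "depth" levels. */
	static void make_tree(const char *parent, int width, int depth,
			      long leaf_quota_us)
	{
		char child[512];
		int i;

		if (depth == 0) {
			/*
			 * Leaf: set the quota; schbench is then started in
			 * this group by writing its PID to cgroup.procs.
			 */
			write_cpu_max(parent, leaf_quota_us);
			return;
		}
		for (i = 0; i < width; i++) {
			snprintf(child, sizeof(child), "%s/g%d", parent, i);
			mkdir(child, 0755);
			make_tree(child, width, depth - 1, leaf_quota_us);
		}
	}

	int main(void)
	{
		int width = 10, depth = 3, ncpus = 200;	/* illustrative */
		long leaves = 1, quota;
		int d;

		for (d = 0; d < depth; d++)
			leaves *= width;	/* e.g. 10^3 = 1000 leaves */
		/* 80% of total CPU capacity split across all leaf cgroups */
		quota = (long)(0.8 * ncpus * PERIOD_US / leaves);
		mkdir("/sys/fs/cgroup/bench", 0755);
		make_tree("/sys/fs/cgroup/bench", width, depth, quota);
		return 0;
	}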
| Kernel LLC Misses | depth 3 width 10    | depth 5 width 4     |
+-------------------+---------------------+---------------------+
| AMD-orig          | [2218.98, 2241.89]M | [2599.80, 2645.16]M |
| AMD-opt           | [1957.62, 1981.55]M | [2380.47, 2431.86]M |
| Change            | -11.69%             | -8.248%             |
| Intel-orig        | [1580.53, 1604.90]M | [2125.37, 2208.68]M |
| Intel-opt         | [1066.94, 1100.19]M | [1543.77, 1570.83]M |
| Change            | -31.96%             | -28.13%             |
There is also a 25% improvement in kernel IPC on the AMD system; on
Intel, the IPC improvement is 3% despite the larger reduction in LLC
misses.
Other workloads without CPU share limits, also running in a cgroup
hierarchy with O(1000) instances, show no obvious regression
(sysbench and hackbench: lower is better; ebizzy: higher is better):
workload  | base                  | opt                   | metric
----------+-----------------------+-----------------------+------------
sysbench  | 63.55, [63.04, 64.05] | 64.36, [62.97, 65.75] | avg latency
hackbench | 36.95, [35.45, 38.45] | 37.12, [35.81, 38.44] | time
ebizzy    | 610.7, [569.8, 651.6] | 613.5, [592.1, 635.0] | record/s
Zecheng Li (3):
sched/fair: Co-locate cfs_rq and sched_entity
sched/fair: Remove task_group->se pointer array
sched/fair: Allocate both cfs_rq and sched_entity with per-cpu
kernel/sched/core.c | 40 +++++++-------------
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 87 ++++++++++++++++----------------------------
kernel/sched/sched.h | 48 ++++++++++++++++++++----
4 files changed, 87 insertions(+), 90 deletions(-)
base-commit: 66701750d5565c574af42bef0b789ce0203e3071
--
2.50.0
* [PATCH v3 1/3] sched/fair: Co-locate cfs_rq and sched_entity
From: Zecheng Li @ 2025-07-01 21:02 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Xu Liu, Blake Jones, Namhyung Kim, Josh Don,
Madadi Vineeth Reddy, linux-kernel, Zecheng Li
Improve data locality and reduce pointer chasing by allocating struct
cfs_rq and struct sched_entity together for non-root task groups. This
is achieved by introducing a new combined struct cfs_rq_with_se that
holds both objects in a single allocation.
This patch:
- Defines the new struct cfs_rq_with_se.
- Modifies alloc_fair_sched_group() and free_fair_sched_group() to
allocate and free the new struct as a single unit.
- Modifies the per-CPU pointers in task_group->se and task_group->cfs_rq
to point to the members in the new combined structure.
Signed-off-by: Zecheng Li <zecheng@google.com>
---
kernel/sched/fair.c | 23 ++++++++++-------------
kernel/sched/sched.h | 8 ++++++++
2 files changed, 18 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a14da5396fb..3a1b55b74203 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13356,10 +13356,11 @@ void free_fair_sched_group(struct task_group *tg)
int i;
for_each_possible_cpu(i) {
- if (tg->cfs_rq)
- kfree(tg->cfs_rq[i]);
- if (tg->se)
- kfree(tg->se[i]);
+ if (tg->cfs_rq && tg->cfs_rq[i]) {
+ struct cfs_rq_with_se *combined =
+ container_of(tg->cfs_rq[i], struct cfs_rq_with_se, cfs_rq);
+ kfree(combined);
+ }
}
kfree(tg->cfs_rq);
@@ -13368,6 +13369,7 @@ void free_fair_sched_group(struct task_group *tg)
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
+ struct cfs_rq_with_se *combined;
struct sched_entity *se;
struct cfs_rq *cfs_rq;
int i;
@@ -13384,16 +13386,13 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
for_each_possible_cpu(i) {
- cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
+ combined = kzalloc_node(sizeof(struct cfs_rq_with_se),
GFP_KERNEL, cpu_to_node(i));
- if (!cfs_rq)
+ if (!combined)
goto err;
- se = kzalloc_node(sizeof(struct sched_entity_stats),
- GFP_KERNEL, cpu_to_node(i));
- if (!se)
- goto err_free_rq;
-
+ cfs_rq = &combined->cfs_rq;
+ se = &combined->se;
init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
init_entity_runnable_average(se);
@@ -13401,8 +13400,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
return 1;
-err_free_rq:
- kfree(cfs_rq);
err:
return 0;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 475bb5998295..6f32a76d38c2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -740,6 +740,14 @@ struct cfs_rq {
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
+#ifdef CONFIG_FAIR_GROUP_SCHED
+struct cfs_rq_with_se {
+ struct cfs_rq cfs_rq;
+ /* cfs_rq's sched_entity on parent runqueue */
+ struct sched_entity se ____cacheline_aligned;
+};
+#endif
+
#ifdef CONFIG_SCHED_CLASS_EXT
/* scx_rq->flags, protected by the rq lock */
enum scx_rq_flags {
--
2.50.0
* [PATCH v3 2/3] sched/fair: Remove task_group->se pointer array
From: Zecheng Li @ 2025-07-01 21:02 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Xu Liu, Blake Jones, Namhyung Kim, Josh Don,
Madadi Vineeth Reddy, linux-kernel, Zecheng Li
Now that struct sched_entity is co-located with struct cfs_rq for
non-root task groups, the task_group->se pointer array is redundant. The
associated sched_entity can be loaded directly from the cfs_rq.
This patch performs the access conversion with the following helpers:

- is_root_task_group(tg): checks whether a task group is the root task
  group by comparing its address with the global root_task_group
  variable.
- tg_se(tg, cpu): retrieves the CPU's cfs_rq and returns the address of
  the co-located se. It returns NULL for the root task group, matching
  the previous behavior of tg->se[cpu]. All accesses through the
  tg->se[cpu] pointer array are replaced with calls to this accessor.
- cfs_rq_se(cfs_rq): simplifies access paths such as cfs_rq->tg->se[...]
  by returning the co-located sched_entity. It performs the same
  root-task-group check to keep the same behavior.

Since tg_se() is not on very hot code paths, and the branch is a
register comparison with an immediate value (`&root_task_group`), the
performance impact is expected to be negligible.
Signed-off-by: Zecheng Li <zecheng@google.com>
---
kernel/sched/core.c | 7 ++-----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 27 ++++++++++-----------------
kernel/sched/sched.h | 29 ++++++++++++++++++++++++-----
4 files changed, 37 insertions(+), 28 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8988d38d46a3..2efa7e9590c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8526,7 +8526,7 @@ void __init sched_init(void)
wait_bit_init();
#ifdef CONFIG_FAIR_GROUP_SCHED
- ptr += 2 * nr_cpu_ids * sizeof(void **);
+ ptr += nr_cpu_ids * sizeof(void **);
#endif
#ifdef CONFIG_RT_GROUP_SCHED
ptr += 2 * nr_cpu_ids * sizeof(void **);
@@ -8535,9 +8535,6 @@ void __init sched_init(void)
ptr = (unsigned long)kzalloc(ptr, GFP_NOWAIT);
#ifdef CONFIG_FAIR_GROUP_SCHED
- root_task_group.se = (struct sched_entity **)ptr;
- ptr += nr_cpu_ids * sizeof(void **);
-
root_task_group.cfs_rq = (struct cfs_rq **)ptr;
ptr += nr_cpu_ids * sizeof(void **);
@@ -9729,7 +9726,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
int i;
for_each_possible_cpu(i) {
- stats = __schedstats_from_se(tg->se[i]);
+ stats = __schedstats_from_se(tg_se(tg, i));
ws += schedstat_val(stats->wait_sum);
}
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 9d71baf08075..c2868367e17e 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -657,7 +657,7 @@ void dirty_sched_domain_sysctl(int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group *tg)
{
- struct sched_entity *se = tg->se[cpu];
+ struct sched_entity *se = tg_se(tg, cpu);
#define P(F) SEQ_printf(m, " .%-30s: %lld\n", #F, (long long)F)
#define P_SCHEDSTAT(F) SEQ_printf(m, " .%-30s: %lld\n", \
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a1b55b74203..244b20222eb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5912,7 +5912,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */
- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+ se = cfs_rq_se(cfs_rq);
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
@@ -5997,7 +5997,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
long queued_delta, runnable_delta, idle_delta;
long rq_h_nr_queued = rq->cfs.h_nr_queued;
- se = cfs_rq->tg->se[cpu_of(rq)];
+ se = cfs_rq_se(cfs_rq);
cfs_rq->throttled = 0;
@@ -9801,7 +9801,6 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
{
struct cfs_rq *cfs_rq, *pos;
bool decayed = false;
- int cpu = cpu_of(rq);
/*
* Iterates the task_group tree in a bottom up fashion, see
@@ -9821,7 +9820,7 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
}
/* Propagate pending load changes to the parent, if any: */
- se = cfs_rq->tg->se[cpu];
+ se = cfs_rq_se(cfs_rq);
if (se && !skip_blocked_update(se))
update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
@@ -9847,8 +9846,7 @@ static bool __update_blocked_fair(struct rq *rq, bool *done)
*/
static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
{
- struct rq *rq = rq_of(cfs_rq);
- struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+ struct sched_entity *se = cfs_rq_se(cfs_rq);
unsigned long now = jiffies;
unsigned long load;
@@ -13364,7 +13362,6 @@ void free_fair_sched_group(struct task_group *tg)
}
kfree(tg->cfs_rq);
- kfree(tg->se);
}
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
@@ -13377,9 +13374,6 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
if (!tg->cfs_rq)
goto err;
- tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
- if (!tg->se)
- goto err;
tg->shares = NICE_0_LOAD;
@@ -13394,7 +13388,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
cfs_rq = &combined->cfs_rq;
se = &combined->se;
init_cfs_rq(cfs_rq);
- init_tg_cfs_entry(tg, cfs_rq, se, i, parent->se[i]);
+ init_tg_cfs_entry(tg, cfs_rq, se, i, tg_se(parent, i));
init_entity_runnable_average(se);
}
@@ -13413,7 +13407,7 @@ void online_fair_sched_group(struct task_group *tg)
for_each_possible_cpu(i) {
rq = cpu_rq(i);
- se = tg->se[i];
+ se = tg_se(tg, i);
rq_lock_irq(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
@@ -13430,7 +13424,7 @@ void unregister_fair_sched_group(struct task_group *tg)
for_each_possible_cpu(cpu) {
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
- struct sched_entity *se = tg->se[cpu];
+ struct sched_entity *se = tg_se(tg, cpu);
struct rq *rq = cpu_rq(cpu);
if (se) {
@@ -13467,7 +13461,6 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
init_cfs_rq_runtime(cfs_rq);
tg->cfs_rq[cpu] = cfs_rq;
- tg->se[cpu] = se;
/* se could be NULL for root_task_group */
if (!se)
@@ -13498,7 +13491,7 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
/*
* We can't change the weight of the root cgroup.
*/
- if (!tg->se[0])
+ if (is_root_task_group(tg))
return -EINVAL;
shares = clamp(shares, scale_load(MIN_SHARES), scale_load(MAX_SHARES));
@@ -13509,7 +13502,7 @@ static int __sched_group_set_shares(struct task_group *tg, unsigned long shares)
tg->shares = shares;
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
- struct sched_entity *se = tg->se[i];
+ struct sched_entity *se = tg_se(tg, i);
struct rq_flags rf;
/* Propagate contribution to hierarchy */
@@ -13560,7 +13553,7 @@ int sched_group_set_idle(struct task_group *tg, long idle)
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
- struct sched_entity *se = tg->se[i];
+ struct sched_entity *se = tg_se(tg, i);
struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
long idle_task_delta;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6f32a76d38c2..3fdcdcdab76c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -437,8 +437,6 @@ struct task_group {
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
- /* schedulable entities of this group on each CPU */
- struct sched_entity **se;
/* runqueue "owned" by this group on each CPU */
struct cfs_rq **cfs_rq;
unsigned long shares;
@@ -903,7 +901,8 @@ struct dl_rq {
};
#ifdef CONFIG_FAIR_GROUP_SCHED
-
+/* Check whether a task group is root tg */
+#define is_root_task_group(tg) ((tg) == &root_task_group)
/* An entity is a task if it doesn't "own" a runqueue */
#define entity_is_task(se) (!se->my_q)
@@ -1594,6 +1593,26 @@ static inline struct task_struct *task_of(struct sched_entity *se)
return container_of(se, struct task_struct, se);
}
+static inline struct sched_entity *tg_se(struct task_group *tg, int cpu)
+{
+ if (is_root_task_group(tg))
+ return NULL;
+
+ struct cfs_rq_with_se *combined =
+ container_of(tg->cfs_rq[cpu], struct cfs_rq_with_se, cfs_rq);
+ return &combined->se;
+}
+
+static inline struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
+{
+ if (is_root_task_group(cfs_rq->tg))
+ return NULL;
+
+ struct cfs_rq_with_se *combined =
+ container_of(cfs_rq, struct cfs_rq_with_se, cfs_rq);
+ return &combined->se;
+}
+
static inline struct cfs_rq *task_cfs_rq(struct task_struct *p)
{
return p->se.cfs_rq;
@@ -2168,8 +2187,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
p->se.cfs_rq = tg->cfs_rq[cpu];
- p->se.parent = tg->se[cpu];
- p->se.depth = tg->se[cpu] ? tg->se[cpu]->depth + 1 : 0;
+ p->se.parent = tg_se(tg, cpu);
+ p->se.depth = p->se.parent ? p->se.parent->depth + 1 : 0;
#endif
#ifdef CONFIG_RT_GROUP_SCHED
--
2.50.0
* [PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu
From: Zecheng Li @ 2025-07-01 21:02 UTC
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, Xu Liu, Blake Jones, Namhyung Kim, Josh Don,
Madadi Vineeth Reddy, linux-kernel, Zecheng Li
To remove the cfs_rq pointer array in task_group, allocate the combined
cfs_rq and sched_entity using the per-cpu allocator.
This patch implements the following:

- Changes task_group->cfs_rq from struct cfs_rq ** to struct cfs_rq
  __percpu *.
- Updates the memory allocation in alloc_fair_sched_group() and
  free_fair_sched_group() to use alloc_percpu_gfp() and free_percpu()
  respectively.
- Adds the inline accessor tg_cfs_rq(tg, cpu), which uses per_cpu_ptr()
  to retrieve the cfs_rq of the given task group and CPU.
- Replaces direct accesses tg->cfs_rq[cpu] with calls to the new
  tg_cfs_rq(tg, cpu) helper.
- Handles the root_task_group: since struct rq is already a per-cpu
  variable (runqueues), its embedded cfs_rq (rq->cfs) is also per-cpu.
  Therefore, we assign root_task_group.cfs_rq = &runqueues.cfs.
- Cleans up the root task group initialization code.
This change places each CPU's cfs_rq and sched_entity in its local
per-cpu memory area to remove the per-task_group pointer arrays.
Signed-off-by: Zecheng Li <zecheng@google.com>
---
kernel/sched/core.c | 35 ++++++++++----------------
kernel/sched/fair.c | 59 +++++++++++++++++---------------------------
kernel/sched/sched.h | 13 +++++++---
3 files changed, 45 insertions(+), 62 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2efa7e9590c7..377361fae8e8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8508,7 +8508,7 @@ static struct kmem_cache *task_group_cache __ro_after_init;
void __init sched_init(void)
{
- unsigned long ptr = 0;
+ unsigned long __maybe_unused ptr = 0;
int i;
/* Make sure the linker didn't screw up */
@@ -8526,33 +8526,24 @@ void __init sched_init(void)
wait_bit_init();
#ifdef CONFIG_FAIR_GROUP_SCHED
- ptr += nr_cpu_ids * sizeof(void **);
-#endif
-#ifdef CONFIG_RT_GROUP_SCHED
- ptr += 2 * nr_cpu_ids * sizeof(void **);
-#endif
- if (ptr) {
- ptr = (unsigned long)kzalloc(ptr, GFP_NOWAIT);
+ root_task_group.cfs_rq = &runqueues.cfs;
-#ifdef CONFIG_FAIR_GROUP_SCHED
- root_task_group.cfs_rq = (struct cfs_rq **)ptr;
- ptr += nr_cpu_ids * sizeof(void **);
-
- root_task_group.shares = ROOT_TASK_GROUP_LOAD;
- init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
+ root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+ init_cfs_bandwidth(&root_task_group.cfs_bandwidth, NULL);
#endif /* CONFIG_FAIR_GROUP_SCHED */
#ifdef CONFIG_EXT_GROUP_SCHED
- scx_tg_init(&root_task_group);
+ scx_tg_init(&root_task_group);
#endif /* CONFIG_EXT_GROUP_SCHED */
#ifdef CONFIG_RT_GROUP_SCHED
- root_task_group.rt_se = (struct sched_rt_entity **)ptr;
- ptr += nr_cpu_ids * sizeof(void **);
+ ptr += 2 * nr_cpu_ids * sizeof(void **);
+ ptr = (unsigned long)kzalloc(ptr, GFP_NOWAIT);
+ root_task_group.rt_se = (struct sched_rt_entity **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
- root_task_group.rt_rq = (struct rt_rq **)ptr;
- ptr += nr_cpu_ids * sizeof(void **);
+ root_task_group.rt_rq = (struct rt_rq **)ptr;
+ ptr += nr_cpu_ids * sizeof(void **);
#endif /* CONFIG_RT_GROUP_SCHED */
- }
#ifdef CONFIG_SMP
init_defrootdomain();
@@ -9497,7 +9488,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
}
for_each_online_cpu(i) {
- struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, i);
struct rq *rq = cfs_rq->rq;
guard(rq_lock_irq)(rq);
@@ -9745,7 +9736,7 @@ static u64 throttled_time_self(struct task_group *tg)
u64 total = 0;
for_each_possible_cpu(i) {
- total += READ_ONCE(tg->cfs_rq[i]->throttled_clock_self_time);
+ total += READ_ONCE(tg_cfs_rq(tg, i)->throttled_clock_self_time);
}
return total;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 244b20222eb5..37d6b00b3a3b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -329,7 +329,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* to a tree or when we reach the top of the tree
*/
if (cfs_rq->tg->parent &&
- cfs_rq->tg->parent->cfs_rq[cpu]->on_list) {
+ tg_cfs_rq(cfs_rq->tg->parent, cpu)->on_list) {
/*
* If parent is already on the list, we add the child
* just before. Thanks to circular linked property of
@@ -337,7 +337,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
* of the list that starts by parent.
*/
list_add_tail_rcu(&cfs_rq->leaf_cfs_rq_list,
- &(cfs_rq->tg->parent->cfs_rq[cpu]->leaf_cfs_rq_list));
+ &(tg_cfs_rq(cfs_rq->tg->parent, cpu)->leaf_cfs_rq_list));
/*
* The branch is now connected to its tree so we can
* reset tmp_alone_branch to the beginning of the
@@ -4180,7 +4180,7 @@ static void __maybe_unused clear_tg_offline_cfs_rqs(struct rq *rq)
rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, cpu_of(rq));
clear_tg_load_avg(cfs_rq);
}
@@ -5828,8 +5828,8 @@ static inline int throttled_lb_pair(struct task_group *tg,
{
struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
- src_cfs_rq = tg->cfs_rq[src_cpu];
- dest_cfs_rq = tg->cfs_rq[dest_cpu];
+ src_cfs_rq = tg_cfs_rq(tg, src_cpu);
+ dest_cfs_rq = tg_cfs_rq(tg, dest_cpu);
return throttled_hierarchy(src_cfs_rq) ||
throttled_hierarchy(dest_cfs_rq);
@@ -5838,7 +5838,7 @@ static inline int throttled_lb_pair(struct task_group *tg,
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, cpu_of(rq));
cfs_rq->throttle_count--;
if (!cfs_rq->throttle_count) {
@@ -5867,7 +5867,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, cpu_of(rq));
/* group is entering throttled state, stop time */
if (!cfs_rq->throttle_count) {
@@ -6454,8 +6454,8 @@ static void sync_throttle(struct task_group *tg, int cpu)
if (!tg->parent)
return;
- cfs_rq = tg->cfs_rq[cpu];
- pcfs_rq = tg->parent->cfs_rq[cpu];
+ cfs_rq = tg_cfs_rq(tg, cpu);
+ pcfs_rq = tg_cfs_rq(tg->parent, cpu);
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
@@ -6640,7 +6640,7 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, cpu_of(rq));
raw_spin_lock(&cfs_b->lock);
cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
@@ -6669,7 +6669,7 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
rcu_read_lock();
list_for_each_entry_rcu(tg, &task_groups, list) {
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, cpu_of(rq));
if (!cfs_rq->runtime_enabled)
continue;
@@ -9378,7 +9378,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
struct cfs_rq *dst_cfs_rq;
#ifdef CONFIG_FAIR_GROUP_SCHED
- dst_cfs_rq = task_group(p)->cfs_rq[dest_cpu];
+ dst_cfs_rq = tg_cfs_rq(task_group(p), dest_cpu);
#else
dst_cfs_rq = &cpu_rq(dest_cpu)->cfs;
#endif
@@ -13095,7 +13095,7 @@ static int task_is_throttled_fair(struct task_struct *p, int cpu)
struct cfs_rq *cfs_rq;
#ifdef CONFIG_FAIR_GROUP_SCHED
- cfs_rq = task_group(p)->cfs_rq[cpu];
+ cfs_rq = tg_cfs_rq(task_group(p), cpu);
#else
cfs_rq = &cpu_rq(cpu)->cfs;
#endif
@@ -13351,42 +13351,31 @@ static void task_change_group_fair(struct task_struct *p)
void free_fair_sched_group(struct task_group *tg)
{
- int i;
-
- for_each_possible_cpu(i) {
- if (tg->cfs_rq && tg->cfs_rq[i]) {
- struct cfs_rq_with_se *combined =
- container_of(tg->cfs_rq[i], struct cfs_rq_with_se, cfs_rq);
- kfree(combined);
- }
- }
-
- kfree(tg->cfs_rq);
+ free_percpu(tg->cfs_rq);
}
int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
{
- struct cfs_rq_with_se *combined;
+ struct cfs_rq_with_se __percpu *combined;
struct sched_entity *se;
struct cfs_rq *cfs_rq;
int i;
- tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
- if (!tg->cfs_rq)
+ combined = alloc_percpu_gfp(struct cfs_rq_with_se, GFP_KERNEL);
+ if (!combined)
goto err;
+ tg->cfs_rq = &combined->cfs_rq;
tg->shares = NICE_0_LOAD;
init_cfs_bandwidth(tg_cfs_bandwidth(tg), tg_cfs_bandwidth(parent));
for_each_possible_cpu(i) {
- combined = kzalloc_node(sizeof(struct cfs_rq_with_se),
- GFP_KERNEL, cpu_to_node(i));
- if (!combined)
+ cfs_rq = tg_cfs_rq(tg, i);
+ if (!cfs_rq)
goto err;
- cfs_rq = &combined->cfs_rq;
- se = &combined->se;
+ se = tg_se(tg, i);
init_cfs_rq(cfs_rq);
init_tg_cfs_entry(tg, cfs_rq, se, i, tg_se(parent, i));
init_entity_runnable_average(se);
@@ -13423,7 +13412,7 @@ void unregister_fair_sched_group(struct task_group *tg)
destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
for_each_possible_cpu(cpu) {
- struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+ struct cfs_rq *cfs_rq = tg_cfs_rq(tg, cpu);
struct sched_entity *se = tg_se(tg, cpu);
struct rq *rq = cpu_rq(cpu);
@@ -13460,8 +13449,6 @@ void init_tg_cfs_entry(struct task_group *tg, struct cfs_rq *cfs_rq,
cfs_rq->rq = rq;
init_cfs_rq_runtime(cfs_rq);
- tg->cfs_rq[cpu] = cfs_rq;
-
/* se could be NULL for root_task_group */
if (!se)
return;
@@ -13554,7 +13541,7 @@ int sched_group_set_idle(struct task_group *tg, long idle)
for_each_possible_cpu(i) {
struct rq *rq = cpu_rq(i);
struct sched_entity *se = tg_se(tg, i);
- struct cfs_rq *grp_cfs_rq = tg->cfs_rq[i];
+ struct cfs_rq *grp_cfs_rq = tg_cfs_rq(tg, i);
bool was_idle = cfs_rq_is_idle(grp_cfs_rq);
long idle_task_delta;
struct rq_flags rf;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fdcdcdab76c..a794bec99604 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -438,7 +438,7 @@ struct task_group {
#ifdef CONFIG_FAIR_GROUP_SCHED
/* runqueue "owned" by this group on each CPU */
- struct cfs_rq **cfs_rq;
+ struct cfs_rq __percpu *cfs_rq;
unsigned long shares;
#ifdef CONFIG_SMP
/*
@@ -1592,6 +1592,11 @@ static inline struct task_struct *task_of(struct sched_entity *se)
WARN_ON_ONCE(!entity_is_task(se));
return container_of(se, struct task_struct, se);
}
+/* Access a specific CPU's cfs_rq from a task group */
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+ return per_cpu_ptr(tg->cfs_rq, cpu);
+}
static inline struct sched_entity *tg_se(struct task_group *tg, int cpu)
{
@@ -1599,7 +1604,7 @@ static inline struct sched_entity *tg_se(struct task_group *tg, int cpu)
return NULL;
struct cfs_rq_with_se *combined =
- container_of(tg->cfs_rq[cpu], struct cfs_rq_with_se, cfs_rq);
+ container_of(tg_cfs_rq(tg, cpu), struct cfs_rq_with_se, cfs_rq);
return &combined->se;
}
@@ -2185,8 +2190,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
#ifdef CONFIG_FAIR_GROUP_SCHED
- set_task_rq_fair(&p->se, p->se.cfs_rq, tg->cfs_rq[cpu]);
- p->se.cfs_rq = tg->cfs_rq[cpu];
+ set_task_rq_fair(&p->se, p->se.cfs_rq, tg_cfs_rq(tg, cpu));
+ p->se.cfs_rq = tg_cfs_rq(tg, cpu);
p->se.parent = tg_se(tg, cpu);
p->se.depth = p->se.parent ? p->se.parent->depth + 1 : 0;
#endif
--
2.50.0
* Re: [PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu
From: kernel test robot @ 2025-07-16 8:50 UTC
To: Zecheng Li
Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Xu Liu, Blake Jones, Namhyung Kim, Josh Don, Madadi Vineeth Reddy,
Zecheng Li, oliver.sang
Hello,
kernel test robot noticed an 8.8% improvement of stress-ng.session.ops_per_sec on:
commit: ac215b990e70e247344522e1736fd878cb3c25b6 ("[PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu")
url: https://github.com/intel-lab-lkp/linux/commits/Zecheng-Li/sched-fair-Co-locate-cfs_rq-and-sched_entity/20250702-050528
patch link: https://lore.kernel.org/all/20250701210230.2985885-4-zecheng@google.com/
patch subject: [PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 192 threads 2 sockets Intel(R) Xeon(R) 6740E CPU @ 2.4GHz (Sierra Forest) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: session
cpufreq_governor: performance
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250716/202507161052.ed3213f4-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-srf-2sp3/session/stress-ng/60s
commit:
10d69ea6ba ("sched/fair: Remove task_group->se pointer array")
ac215b990e ("sched/fair: Allocate both cfs_rq and sched_entity with per-cpu")
10d69ea6ba7641d0 ac215b990e70e247344522e1736
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.00 ± 33% -0.0 0.00 mpstat.cpu.all.iowait%
4.46 -1.9 2.60 mpstat.cpu.all.soft%
119077 -40.8% 70550 numa-vmstat.node0.nr_slab_unreclaimable
108930 ± 2% -47.0% 57734 numa-vmstat.node1.nr_slab_unreclaimable
182080 +6.2% 193357 vmstat.system.cs
523378 +2.8% 538219 vmstat.system.in
492392 ± 2% +13.0% 556325 ± 4% sched_debug.cfs_rq:/.avg_vruntime.min
492392 ± 2% +13.0% 556325 ± 4% sched_debug.cfs_rq:/.min_vruntime.min
138.58 ± 5% -7.4% 128.29 ± 2% sched_debug.cfs_rq:/.util_est.stddev
472216 -40.6% 280726 numa-meminfo.node0.SUnreclaim
574944 ± 4% -37.1% 361520 ± 7% numa-meminfo.node0.Slab
432134 ± 2% -46.3% 231873 ± 2% numa-meminfo.node1.SUnreclaim
479462 ± 5% -37.2% 301105 ± 9% numa-meminfo.node1.Slab
10287656 -10.7% 9189229 meminfo.Memused
135165 +249.4% 472245 ± 2% meminfo.Percpu
904298 -43.4% 512235 meminfo.SUnreclaim
1054350 -37.2% 662242 meminfo.Slab
10568312 -11.6% 9345843 meminfo.max_used_kB
1207478 +8.7% 1312849 stress-ng.session.ops
20155 +8.8% 21919 stress-ng.session.ops_per_sec
1.019e+08 ± 2% +6.9% 1.09e+08 stress-ng.time.minor_page_faults
16524 +2.1% 16869 stress-ng.time.percent_of_cpu_this_job_got
9893 +2.1% 10097 stress-ng.time.system_time
4781984 +7.2% 5125238 stress-ng.time.voluntary_context_switches
481639 +2.6% 494200 proc-vmstat.nr_active_anon
1169838 +1.0% 1181222 proc-vmstat.nr_file_pages
41310 +2.1% 42196 proc-vmstat.nr_kernel_stack
279620 +4.1% 291007 proc-vmstat.nr_shmem
226450 -43.4% 128195 proc-vmstat.nr_slab_unreclaimable
481639 +2.6% 494200 proc-vmstat.nr_zone_active_anon
89251914 ± 3% +7.3% 95746015 ± 3% proc-vmstat.numa_hit
87248173 ± 3% +7.5% 93770181 ± 3% proc-vmstat.numa_local
1.169e+08 ± 2% -15.8% 98422922 ± 3% proc-vmstat.pgalloc_normal
1.024e+08 ± 2% +6.9% 1.095e+08 proc-vmstat.pgfault
1.161e+08 ± 2% -16.0% 97515011 ± 3% proc-vmstat.pgfree
0.73 +0.0 0.76 perf-stat.i.branch-miss-rate%
1.795e+08 +5.7% 1.897e+08 perf-stat.i.branch-misses
39.65 -2.8 36.83 perf-stat.i.cache-miss-rate%
1.485e+09 +7.2% 1.592e+09 perf-stat.i.cache-references
188846 +7.1% 202265 perf-stat.i.context-switches
4.35 -1.1% 4.30 perf-stat.i.cpi
50615 +8.6% 54987 perf-stat.i.cpu-migrations
18.17 +8.4% 19.69 perf-stat.i.metric.K/sec
1682096 ± 2% +6.7% 1795278 perf-stat.i.minor-faults
1682116 ± 2% +6.7% 1795306 perf-stat.i.page-faults
0.71 +0.0 0.74 perf-stat.overall.branch-miss-rate%
39.60 -2.9 36.74 perf-stat.overall.cache-miss-rate%
4.38 -1.1% 4.33 perf-stat.overall.cpi
0.23 +1.1% 0.23 perf-stat.overall.ipc
1.75e+08 +5.6% 1.847e+08 perf-stat.ps.branch-misses
1.454e+09 +6.9% 1.555e+09 perf-stat.ps.cache-references
184886 +6.9% 197661 perf-stat.ps.context-switches
49237 +8.5% 53411 perf-stat.ps.cpu-migrations
1646954 ± 2% +6.5% 1754401 perf-stat.ps.minor-faults
1646973 ± 2% +6.5% 1754428 perf-stat.ps.page-faults
0.14 ± 12% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.__kmalloc_cache_node_noprof.alloc_fair_sched_group.sched_create_group.sched_autogroup_create_attach
1.86 ± 55% -95.6% 0.08 ± 56% perf-sched.sch_delay.avg.ms.__cond_resched.__kmalloc_node_noprof.alloc_slab_obj_exts.allocate_slab.___slab_alloc
0.07 ±115% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.__kmalloc_noprof.alloc_fair_sched_group.sched_create_group.sched_autogroup_create_attach
0.03 ± 30% -89.1% 0.00 ± 30% perf-sched.sch_delay.avg.ms.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
0.02 ± 43% -69.1% 0.01 ± 33% perf-sched.sch_delay.avg.ms.__cond_resched.__tlb_batch_free_encoded_pages.tlb_finish_mmu.exit_mmap.__mmput
0.47 ± 8% -16.7% 0.39 ± 5% perf-sched.sch_delay.avg.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
0.27 ± 47% -69.6% 0.08 ± 97% perf-sched.sch_delay.avg.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork
0.26 ± 27% -40.1% 0.15 ± 7% perf-sched.sch_delay.avg.ms.__cond_resched.copy_page_range.dup_mmap.dup_mm.constprop
0.27 ± 16% -43.3% 0.15 ± 26% perf-sched.sch_delay.avg.ms.__cond_resched.down_write.anon_vma_clone.anon_vma_fork.dup_mmap
0.23 ± 10% -26.2% 0.17 ± 9% perf-sched.sch_delay.avg.ms.__cond_resched.down_write.dup_mmap.dup_mm.constprop
0.03 ± 41% -86.7% 0.00 ± 32% perf-sched.sch_delay.avg.ms.__cond_resched.down_write.unlink_anon_vmas.free_pgtables.exit_mmap
0.04 ± 69% -93.6% 0.00 ± 62% perf-sched.sch_delay.avg.ms.__cond_resched.down_write.unlink_file_vma_batch_process.unlink_file_vma_batch_add.free_pgtables
0.56 ± 49% -75.3% 0.14 ± 39% perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
0.25 ± 7% -21.0% 0.19 ± 8% perf-sched.sch_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.anon_vma_fork.dup_mmap.dup_mm
0.43 ± 44% -86.3% 0.06 ± 55% perf-sched.sch_delay.avg.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.mm_init.dup_mm
0.01 ± 76% +13586.7% 1.03 ± 72% perf-sched.sch_delay.avg.ms.__cond_resched.process_one_work.worker_thread.kthread.ret_from_fork
0.09 ± 44% -76.0% 0.02 ± 26% perf-sched.sch_delay.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.39 ± 52% -75.9% 0.10 ± 20% perf-sched.sch_delay.avg.ms.__cond_resched.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
0.09 ± 21% -30.0% 0.06 ± 20% perf-sched.sch_delay.avg.ms.anon_pipe_read.fifo_pipe_read.vfs_read.ksys_read
0.02 ± 4% -47.3% 0.01 perf-sched.sch_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
0.20 -12.8% 0.17 ± 3% perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.22 ± 26% -52.3% 0.10 ± 29% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_poll
0.02 ± 12% -55.8% 0.01 ± 16% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
0.07 ± 10% -24.5% 0.05 ± 14% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.anon_vma_clone
0.07 ± 6% -22.6% 0.05 ± 13% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.anon_vma_fork
0.01 ± 20% -57.3% 0.01 ± 6% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
0.10 ± 18% -27.5% 0.07 ± 10% perf-sched.sch_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
0.44 ± 4% -20.2% 0.35 ± 2% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
8.47 ± 85% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.__kmalloc_cache_node_noprof.alloc_fair_sched_group.sched_create_group.sched_autogroup_create_attach
10.95 ± 57% -95.9% 0.45 ± 59% perf-sched.sch_delay.max.ms.__cond_resched.__kmalloc_node_noprof.alloc_slab_obj_exts.allocate_slab.___slab_alloc
0.34 ±160% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.__kmalloc_noprof.alloc_fair_sched_group.sched_create_group.sched_autogroup_create_attach
17.30 ± 16% -86.0% 2.42 ± 65% perf-sched.sch_delay.max.ms.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
8.99 ± 57% -81.2% 1.69 ± 22% perf-sched.sch_delay.max.ms.__cond_resched.__tlb_batch_free_encoded_pages.tlb_finish_mmu.exit_mmap.__mmput
36.94 ± 97% -90.9% 3.35 ± 9% perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.affine_move_task.__set_cpus_allowed_ptr.__sched_setaffinity
0.67 ± 47% -75.4% 0.16 ±148% perf-sched.sch_delay.max.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork
26.40 ± 89% -69.8% 7.96 ± 67% perf-sched.sch_delay.max.ms.__cond_resched.copy_page_range.dup_mmap.dup_mm.constprop
36.26 ± 33% -70.8% 10.58 ±109% perf-sched.sch_delay.max.ms.__cond_resched.down_write.anon_vma_clone.anon_vma_fork.dup_mmap
10.02 ± 42% -86.2% 1.38 ± 6% perf-sched.sch_delay.max.ms.__cond_resched.down_write.unlink_anon_vmas.free_pgtables.exit_mmap
11.63 ± 51% -92.3% 0.89 ± 39% perf-sched.sch_delay.max.ms.__cond_resched.down_write.unlink_file_vma_batch_process.unlink_file_vma_batch_add.free_pgtables
13.19 ± 50% -88.0% 1.58 ± 44% perf-sched.sch_delay.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
9.58 ± 68% -91.3% 0.83 ± 49% perf-sched.sch_delay.max.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.mm_init.dup_mm
0.01 ± 83% +17570.0% 1.47 ± 76% perf-sched.sch_delay.max.ms.__cond_resched.process_one_work.worker_thread.kthread.ret_from_fork
13.57 ± 16% -89.9% 1.37 ± 65% perf-sched.sch_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
13.09 ± 35% -80.8% 2.51 ± 59% perf-sched.sch_delay.max.ms.__cond_resched.wp_page_copy.__handle_mm_fault.handle_mm_fault.do_user_addr_fault
23.68 ± 15% -82.9% 4.05 ± 33% perf-sched.sch_delay.max.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
3.16 ± 15% +26.9% 4.01 ± 7% perf-sched.sch_delay.max.ms.io_schedule.folio_wait_bit_common.filemap_fault.__do_fault
4.74 ±147% +499.9% 28.44 ± 59% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.pcpu_alloc_noprof
22.00 ± 24% -86.8% 2.90 ± 43% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
19.90 ± 9% -85.2% 2.95 ± 29% perf-sched.sch_delay.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
26.25 ± 10% -75.7% 6.37 ± 12% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
29.41 ± 22% -78.6% 6.29 ± 19% perf-sched.sch_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.13 ± 3% -14.3% 0.11 ± 3% perf-sched.total_sch_delay.average.ms
554.44 ± 80% -84.6% 85.42 ± 13% perf-sched.total_sch_delay.max.ms
2379 ± 19% +49.8% 3565 ± 24% perf-sched.total_wait_and_delay.max.ms
2379 ± 19% +49.8% 3564 ± 24% perf-sched.total_wait_time.max.ms
0.08 ± 30% -87.6% 0.01 ± 15% perf-sched.wait_and_delay.avg.ms.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
4.97 ± 2% -8.4% 4.56 perf-sched.wait_and_delay.avg.ms.__cond_resched.kmem_cache_alloc_noprof.anon_vma_fork.dup_mmap.dup_mm
0.04 ± 4% -46.9% 0.02 ± 3% perf-sched.wait_and_delay.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
0.10 ± 4% -33.6% 0.06 ± 4% perf-sched.wait_and_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
0.08 ± 10% -35.0% 0.05 perf-sched.wait_and_delay.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
5676 ± 2% -10.9% 5057 perf-sched.wait_and_delay.count.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
5076 -11.4% 4498 ± 2% perf-sched.wait_and_delay.count.__cond_resched.kmem_cache_alloc_noprof.anon_vma_fork.dup_mmap.dup_mm
1072 ± 15% -57.1% 460.33 ± 13% perf-sched.wait_and_delay.count.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
4236 ± 8% +29.2% 5471 ± 10% perf-sched.wait_and_delay.count.io_schedule.folio_wait_bit_common.filemap_fault.__do_fault
102.00 ± 29% +39.7% 142.50 ± 15% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_poll
2.17 ±223% +2.1e+05% 4619 ± 12% perf-sched.wait_and_delay.count.schedule_preempt_disabled.__mutex_lock.constprop.0.pcpu_alloc_noprof
11630 ± 5% -12.4% 10192 ± 7% perf-sched.wait_and_delay.count.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
24971 ± 4% -9.2% 22682 ± 2% perf-sched.wait_and_delay.count.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.anon_vma_clone
22567 ± 2% -9.3% 20475 ± 2% perf-sched.wait_and_delay.count.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
37.53 ± 12% -84.8% 5.69 ± 48% perf-sched.wait_and_delay.max.ms.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
7.92 ±223% +839.2% 74.40 ± 30% perf-sched.wait_and_delay.max.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.pcpu_alloc_noprof
45.50 ± 23% -86.0% 6.35 ± 35% perf-sched.wait_and_delay.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
39.97 ± 9% -83.3% 6.69 ± 22% perf-sched.wait_and_delay.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
4.05 ± 50% -77.4% 0.92 ± 64% perf-sched.wait_time.avg.ms.__cond_resched.__kmalloc_cache_node_noprof.__get_vm_area_node.__vmalloc_node_range_noprof.__vmalloc_node_noprof
0.01 ± 42% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.__kmalloc_cache_node_noprof.alloc_fair_sched_group.sched_create_group.sched_autogroup_create_attach
0.05 ± 29% -86.9% 0.01 ± 11% perf-sched.wait_time.avg.ms.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
0.04 ± 30% -73.8% 0.01 ± 20% perf-sched.wait_time.avg.ms.__cond_resched.__tlb_batch_free_encoded_pages.tlb_finish_mmu.exit_mmap.__mmput
0.05 ± 34% -85.3% 0.01 ± 36% perf-sched.wait_time.avg.ms.__cond_resched.down_write.unlink_anon_vmas.free_pgtables.exit_mmap
0.07 ± 50% -89.2% 0.01 ± 27% perf-sched.wait_time.avg.ms.__cond_resched.down_write.unlink_file_vma_batch_process.unlink_file_vma_batch_add.free_pgtables
2.74 ± 49% +78.9% 4.90 ± 24% perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.mas_dup_build.constprop.0
1.52 ± 12% +32.6% 2.02 ± 14% perf-sched.wait_time.avg.ms.__cond_resched.kmem_cache_alloc_noprof.prepare_creds.copy_creds.copy_process
1.29 ± 9% -21.5% 1.01 ± 7% perf-sched.wait_time.avg.ms.__cond_resched.mutex_lock.perf_event_exit_task.do_exit.do_group_exit
3.12 ± 4% +9.4% 3.41 ± 3% perf-sched.wait_time.avg.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.__percpu_counter_init_many.mm_init
0.02 ± 4% -48.1% 0.01 ± 4% perf-sched.wait_time.avg.ms.do_task_dead.do_exit.do_group_exit.__x64_sys_exit_group.x64_sys_call
0.55 ± 19% -38.1% 0.34 ± 41% perf-sched.wait_time.avg.ms.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi.[unknown]
0.08 ± 4% -28.8% 0.06 ± 4% perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
0.06 ± 9% -29.7% 0.04 perf-sched.wait_time.avg.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
34.86 ± 44% -80.5% 6.80 ± 49% perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_cache_node_noprof.__get_vm_area_node.__vmalloc_node_range_noprof.__vmalloc_node_noprof
3.98 ± 62% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.__kmalloc_cache_node_noprof.alloc_fair_sched_group.sched_create_group.sched_autogroup_create_attach
23.75 ± 25% -81.5% 4.40 ± 23% perf-sched.wait_time.max.ms.__cond_resched.__put_anon_vma.unlink_anon_vmas.free_pgtables.exit_mmap
12.62 ± 54% -83.7% 2.06 ± 20% perf-sched.wait_time.max.ms.__cond_resched.__tlb_batch_free_encoded_pages.tlb_finish_mmu.exit_mmap.__mmput
12.65 ± 29% -74.1% 3.27 ± 36% perf-sched.wait_time.max.ms.__cond_resched.down_write.unlink_anon_vmas.free_pgtables.exit_mmap
14.23 ± 33% -85.1% 2.12 ± 49% perf-sched.wait_time.max.ms.__cond_resched.down_write.unlink_file_vma_batch_process.unlink_file_vma_batch_add.free_pgtables
26.52 ± 54% -58.3% 11.06 ± 10% perf-sched.wait_time.max.ms.__cond_resched.kmem_cache_alloc_noprof.alloc_pid.copy_process.kernel_clone
21.46 ± 31% -79.3% 4.44 ± 50% perf-sched.wait_time.max.ms.__cond_resched.mutex_lock.perf_event_exit_task.do_exit.do_group_exit
36.27 ± 38% -61.4% 14.00 ± 27% perf-sched.wait_time.max.ms.__cond_resched.mutex_lock_killable.pcpu_alloc_noprof.__percpu_counter_init_many.mm_init
43.85 ± 20% -52.4% 20.88 ± 63% perf-sched.wait_time.max.ms.__cond_resched.uprobe_start_dup_mmap.dup_mm.constprop.0
1.21 ± 30% -65.9% 0.41 ± 86% perf-sched.wait_time.max.ms.exit_to_user_mode_loop.ret_from_fork.ret_from_fork_asm.[unknown]
3.06 ± 38% +816.6% 28.04 ± 89% perf-sched.wait_time.max.ms.io_schedule.folio_wait_bit_common.filemap_fault.__do_fault
11.39 ± 63% +393.7% 56.25 ± 18% perf-sched.wait_time.max.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.pcpu_alloc_noprof
26.67 ± 14% -83.3% 4.45 ± 13% perf-sched.wait_time.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.__put_anon_vma
24.94 ± 26% -82.1% 4.48 ± 5% perf-sched.wait_time.max.ms.schedule_preempt_disabled.rwsem_down_write_slowpath.down_write.unlink_anon_vmas
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu
From: Zecheng Li @ 2025-07-25 21:05 UTC
To: kernel test robot
Cc: oe-lkp, lkp, linux-kernel, aubrey.li, yu.c.chen, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
Xu Liu, Blake Jones, Namhyung Kim, Josh Don, Madadi Vineeth Reddy,
zli94
Gentle ping on this patch series.
I'll be communicating from a new email address, zli94@ncsu.edu, as I'll
soon lose access to my current corp account.
On Wed, Jul 16, 2025 at 4:51 AM kernel test robot <oliver.sang@intel.com> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed a 8.8% improvement of stress-ng.session.ops_per_sec on:
>
>
> commit: ac215b990e70e247344522e1736fd878cb3c25b6 ("[PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu")
> url: https://github.com/intel-lab-lkp/linux/commits/Zecheng-Li/sched-fair-Co-locate-cfs_rq-and-sched_entity/20250702-050528
> patch link: https://lore.kernel.org/all/20250701210230.2985885-4-zecheng@google.com/
> patch subject: [PATCH v3 3/3] sched/fair: Allocate both cfs_rq and sched_entity with per-cpu
It's great to see the kernel test robot can measure the improvement of
this patch.