* [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 2/8] sched/topology: Introduce sg->shared K Prateek Nayak
` (15 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
From: Chen Yu <yu.c.chen@intel.com>
Currently, only a domain with the SD_SHARE_LLC flag shares a single
sd_share instance among all CPUs in that domain. Remove this
restriction and extend the sharing to all other sched domains below
the NUMA level.
This shared field will be used by a later patch which optimizes
newidle balancing.
Suggested-by: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/topology.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index c49aea8c1025..815474823b3f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1680,10 +1680,10 @@ sd_init(struct sched_domain_topology_level *tl,
}
/*
- * For all levels sharing cache; connect a sched_domain_shared
+ * For all levels except for NUMA; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_LLC) {
+ if (!(sd->flags & SD_NUMA)) {
sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
atomic_inc(&sd->shared->ref);
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
--
2.43.0
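The sharing scheme the hunk switches to can be sketched in plain C. This is an illustrative userspace model only: C11 atomics stand in for the kernel's atomic_t, and the names are not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Illustrative stand-in for struct sched_domain_shared. */
struct shared_state {
    atomic_int ref;          /* one reference per attached domain */
    atomic_int nr_busy_cpus; /* seeded with the domain weight */
};

/* Attach one shared instance to a domain spanning 'weight' CPUs,
 * mirroring the sd_init() hunk: take a reference and seed the
 * busy-CPU count with the domain weight. */
static void attach_shared(struct shared_state *sds, int weight)
{
    atomic_fetch_add(&sds->ref, 1);
    atomic_store(&sds->nr_busy_cpus, weight);
}

/* Drop a reference; free the instance when the last user goes away. */
static void detach_shared(struct shared_state *sds)
{
    if (atomic_fetch_sub(&sds->ref, 1) == 1)
        free(sds);
}
```

Each CPU's sched_domain at a given level ends up pointing at the same instance, so the reference count equals the number of attached domains.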
* [RFC PATCH 2/8] sched/topology: Introduce sg->shared
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h K Prateek Nayak
` (14 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
sched_group(s) of a particular sched_domain are created using the
sched_domain struct of the child domain. Attach the sched_domain_shared
struct from the corresponding child domain to the sched_group.
This shared struct will be used to propagate the sched group stats up
the sched domain hierarchy to optimize load balancing in subsequent
commits.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/sched.h | 3 +++
kernel/sched/topology.c | 27 +++++++++++++++++++++++++++
2 files changed, 30 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 023b844159c9..38aa4cba5d1f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2089,6 +2089,9 @@ struct sched_group {
int asym_prefer_cpu; /* CPU of highest priority in group */
int flags;
+ /* sd->shared of the domain from which this group was created */
+ struct sched_domain_shared *shared;
+
/*
* The CPUs this group covers.
*
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 815474823b3f..508ee8aa492b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -612,6 +612,23 @@ static struct root_domain *alloc_rootdomain(void)
return rd;
}
+static void link_sg_shared(struct sched_group *sg, struct sched_domain_shared *sds)
+{
+ if (!sds)
+ return;
+
+ sg->shared = sds;
+ atomic_inc(&sds->ref);
+}
+
+static void free_sg_shared(struct sched_group *sg)
+{
+ if (sg->shared && atomic_dec_and_test(&sg->shared->ref))
+ kfree(sg->shared);
+
+ sg->shared = NULL;
+}
+
static void free_sched_groups(struct sched_group *sg, int free_sgc)
{
struct sched_group *tmp, *first;
@@ -626,6 +643,8 @@ static void free_sched_groups(struct sched_group *sg, int free_sgc)
if (free_sgc && atomic_dec_and_test(&sg->sgc->ref))
kfree(sg->sgc);
+ free_sg_shared(sg);
+
if (atomic_dec_and_test(&sg->ref))
kfree(sg);
sg = tmp;
@@ -746,6 +765,9 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
if (parent->parent) {
parent->parent->child = tmp;
parent->parent->groups->flags = tmp->flags;
+
+ free_sg_shared(parent->parent->groups);
+ link_sg_shared(parent->parent->groups, tmp->shared);
}
/*
@@ -773,6 +795,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
* the child is being destroyed.
*/
do {
+ free_sg_shared(sg);
sg->flags = 0;
} while (sg != sd->groups);
@@ -972,10 +995,12 @@ build_group_from_child_sched_domain(struct sched_domain *sd, int cpu)
if (!sg)
return NULL;
+ sg->shared = NULL;
sg_span = sched_group_span(sg);
if (sd->child) {
cpumask_copy(sg_span, sched_domain_span(sd->child));
sg->flags = sd->child->flags;
+ link_sg_shared(sg, sd->child->shared);
} else {
cpumask_copy(sg_span, sched_domain_span(sd));
}
@@ -1225,9 +1250,11 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
if (already_visited)
return sg;
+ sg->shared = NULL;
if (child) {
cpumask_copy(sched_group_span(sg), sched_domain_span(child));
cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
+ link_sg_shared(sg, child->shared);
sg->flags = child->flags;
} else {
cpumask_set_cpu(cpu, sched_group_span(sg));
--
2.43.0
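The lifetime rules that link_sg_shared()/free_sg_shared() implement can be modeled in userspace C. The structures below are illustrative stand-ins, not the kernel's; the point is the NULL-tolerant link and the clear-on-put that makes a second free harmless.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-ins for the kernel structures. */
struct sds   { atomic_int ref; };
struct group { struct sds *shared; };

/* Mirror link_sg_shared(): tolerate a NULL source (a domain without
 * a shared struct) and take a reference otherwise. */
static void link_shared(struct group *g, struct sds *s)
{
    if (!s)
        return;
    g->shared = s;
    atomic_fetch_add(&s->ref, 1);
}

/* Mirror free_sg_shared(): drop the group's reference, free on the
 * last put, and clear the pointer so a repeated call is a no-op. */
static void unlink_shared(struct group *g)
{
    if (g->shared && atomic_fetch_sub(&g->shared->ref, 1) == 1)
        free(g->shared);
    g->shared = NULL;
}
```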
* [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 1/8] sched/topology: Assign sd_share for all non NUMA sched domains K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 2/8] sched/topology: Introduce sg->shared K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats K Prateek Nayak
` (13 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
"struct sg_lb_stats" will be embedded into "struct sched_domain_shared"
to propagate load balancing information up the sched domain hierarchy in
the subsequent commits. Move it, along with the internal types it
depends on, from fair.c to sched.h.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 66 --------------------------------------------
kernel/sched/sched.h | 66 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+), 66 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9dafb374d76d..39bee40dde27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9168,49 +9168,6 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
enum fbq_type { regular, remote, all };
-/*
- * 'group_type' describes the group of CPUs at the moment of load balancing.
- *
- * The enum is ordered by pulling priority, with the group with lowest priority
- * first so the group_type can simply be compared when selecting the busiest
- * group. See update_sd_pick_busiest().
- */
-enum group_type {
- /* The group has spare capacity that can be used to run more tasks. */
- group_has_spare = 0,
- /*
- * The group is fully used and the tasks don't compete for more CPU
- * cycles. Nevertheless, some tasks might wait before running.
- */
- group_fully_busy,
- /*
- * One task doesn't fit with CPU's capacity and must be migrated to a
- * more powerful CPU.
- */
- group_misfit_task,
- /*
- * Balance SMT group that's fully busy. Can benefit from migration
- * a task on SMT with busy sibling to another CPU on idle core.
- */
- group_smt_balance,
- /*
- * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
- * and the task should be migrated to it instead of running on the
- * current CPU.
- */
- group_asym_packing,
- /*
- * The tasks' affinity constraints previously prevented the scheduler
- * from balancing the load across the system.
- */
- group_imbalanced,
- /*
- * The CPU is overloaded and can't provide expected CPU cycles to all
- * tasks.
- */
- group_overloaded
-};
-
enum migration_type {
migrate_load = 0,
migrate_util,
@@ -9916,29 +9873,6 @@ static void sched_balance_update_blocked_averages(int cpu)
/********** Helpers for sched_balance_find_src_group ************************/
-/*
- * sg_lb_stats - stats of a sched_group required for load-balancing:
- */
-struct sg_lb_stats {
- unsigned long avg_load; /* Avg load over the CPUs of the group */
- unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long group_capacity; /* Capacity over the CPUs of the group */
- unsigned long group_util; /* Total utilization over the CPUs of the group */
- unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
- unsigned int sum_nr_running; /* Nr of all tasks running in the group */
- unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
- unsigned int idle_cpus; /* Nr of idle CPUs in the group */
- unsigned int group_weight;
- enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned int group_smt_balance; /* Task on busy SMT be moved */
- unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
-#ifdef CONFIG_NUMA_BALANCING
- unsigned int nr_numa_running;
- unsigned int nr_preferred_running;
-#endif
-};
-
/*
* sd_lb_stats - stats of a sched_domain required for load-balancing:
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 38aa4cba5d1f..dc9d6e4c704b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2102,6 +2102,72 @@ struct sched_group {
unsigned long cpumask[];
};
+/*
+ * 'group_type' describes the group of CPUs at the moment of load balancing.
+ *
+ * The enum is ordered by pulling priority, with the group with lowest priority
+ * first so the group_type can simply be compared when selecting the busiest
+ * group. See update_sd_pick_busiest().
+ */
+enum group_type {
+ /* The group has spare capacity that can be used to run more tasks. */
+ group_has_spare = 0,
+ /*
+ * The group is fully used and the tasks don't compete for more CPU
+ * cycles. Nevertheless, some tasks might wait before running.
+ */
+ group_fully_busy,
+ /*
+ * One task doesn't fit with CPU's capacity and must be migrated to a
+ * more powerful CPU.
+ */
+ group_misfit_task,
+ /*
+ * Balance SMT group that's fully busy. Can benefit from migration
+ * a task on SMT with busy sibling to another CPU on idle core.
+ */
+ group_smt_balance,
+ /*
+ * SD_ASYM_PACKING only: One local CPU with higher capacity is available,
+ * and the task should be migrated to it instead of running on the
+ * current CPU.
+ */
+ group_asym_packing,
+ /*
+ * The tasks' affinity constraints previously prevented the scheduler
+ * from balancing the load across the system.
+ */
+ group_imbalanced,
+ /*
+ * The CPU is overloaded and can't provide expected CPU cycles to all
+ * tasks.
+ */
+ group_overloaded
+};
+
+/*
+ * sg_lb_stats - stats of a sched_group required for load-balancing:
+ */
+struct sg_lb_stats {
+ unsigned long avg_load; /* Avg load over the CPUs of the group */
+ unsigned long group_load; /* Total load over the CPUs of the group */
+ unsigned long group_capacity; /* Capacity over the CPUs of the group */
+ unsigned long group_util; /* Total utilization over the CPUs of the group */
+ unsigned long group_runnable; /* Total runnable time over the CPUs of the group */
+ unsigned int sum_nr_running; /* Nr of all tasks running in the group */
+ unsigned int sum_h_nr_running; /* Nr of CFS tasks running in the group */
+ unsigned int idle_cpus; /* Nr of idle CPUs in the group */
+ unsigned int group_weight;
+ enum group_type group_type;
+ unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned int nr_numa_running;
+ unsigned int nr_preferred_running;
+#endif
+};
+
static inline struct cpumask *sched_group_span(struct sched_group *sg)
{
return to_cpumask(sg->cpumask);
--
2.43.0
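The ordering property described in the enum's comment is what makes busiest-group selection cheap: because the values are ordered by pulling priority, the selection fast path reduces to an integer compare. A minimal userspace sketch (abbreviated; pick_busier() is illustrative, not the kernel's update_sd_pick_busiest()):

```c
#include <assert.h>

/* Abbreviated copy of the enum being moved: ordered so that a plain
 * integer compare picks the higher-priority (busier) group. */
enum group_type {
    group_has_spare = 0,
    group_fully_busy,
    group_misfit_task,
    group_smt_balance,
    group_asym_packing,
    group_imbalanced,
    group_overloaded,
};

/* Simplified shape of the selection fast path: a new candidate
 * replaces the current busiest only if its group_type is strictly
 * greater (i.e. has higher pulling priority). */
static int pick_busier(enum group_type busiest, enum group_type candidate)
{
    return candidate > busiest;
}
```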
* [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (2 preceding siblings ...)
2025-03-13 9:37 ` [RFC PATCH 3/8] sched/fair: Move "struct sg_lb_stats" and its dependencies to sched.h K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared K Prateek Nayak
` (12 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
update_sg_lb_stats() used to take bool pointers to report the group's
overloaded and overutilized status back to the caller for propagation
to the root domain. Drop the pointer passing and instead record the
status in flags within "struct sg_lb_stats". Subsequent commits will
use these flags to propagate the overloaded and overutilized status up
the sched domain hierarchy and set it at the highest domain.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 14 +++++++-------
kernel/sched/sched.h | 6 ++++--
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39bee40dde27..3b1ed14e4b5e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10287,9 +10287,7 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
static inline void update_sg_lb_stats(struct lb_env *env,
struct sd_lb_stats *sds,
struct sched_group *group,
- struct sg_lb_stats *sgs,
- bool *sg_overloaded,
- bool *sg_overutilized)
+ struct sg_lb_stats *sgs)
{
int i, nr_running, local_group, sd_flags = env->sd->flags;
bool balancing_at_rd = !env->sd->parent;
@@ -10311,7 +10309,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->sum_nr_running += nr_running;
if (cpu_overutilized(i))
- *sg_overutilized = 1;
+ sgs->overutilized = 1;
/*
* No need to call idle_cpu() if nr_running is not 0
@@ -10324,7 +10322,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Overload indicator is only updated at root domain */
if (balancing_at_rd && nr_running > 1)
- *sg_overloaded = 1;
+ sgs->overloaded = 1;
#ifdef CONFIG_NUMA_BALANCING
/* Only fbq_classify_group() uses this to classify NUMA groups */
@@ -10340,7 +10338,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
/* Check for a misfit task on the cpu */
if (sgs->group_misfit_task_load < rq->misfit_task_load) {
sgs->group_misfit_task_load = rq->misfit_task_load;
- *sg_overloaded = 1;
+ sgs->overloaded = 1;
}
} else if (env->idle && sched_reduced_capacity(rq, env->sd)) {
/* Check for a task running on a CPU with reduced capacity */
@@ -10982,7 +10980,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_group_capacity(env->sd, env->dst_cpu);
}
- update_sg_lb_stats(env, sds, sg, sgs, &sg_overloaded, &sg_overutilized);
+ update_sg_lb_stats(env, sds, sg, sgs);
if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
@@ -10992,6 +10990,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sg_overloaded |= sgs->overloaded;
+ sg_overutilized |= sgs->overutilized;
sum_util += sgs->group_util;
sg = sg->next;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc9d6e4c704b..9372a75ab3cf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2159,8 +2159,10 @@ struct sg_lb_stats {
unsigned int idle_cpus; /* Nr of idle CPUs in the group */
unsigned int group_weight;
enum group_type group_type;
- unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
- unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned char group_asym_packing; /* Tasks should be moved to preferred CPU */
+ unsigned char group_smt_balance; /* Task on busy SMT be moved */
+ unsigned char overloaded; /* Contains at least one overloaded CPU */
+ unsigned char overutilized; /* Contains at least one overutilized CPU */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
--
2.43.0
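The accumulation pattern this patch enables can be sketched as follows. The struct below is an illustrative subset of sg_lb_stats, and accumulate() is a hypothetical helper: once the flags live in the per-group stats, the caller folds them into the domain-level summary with a bitwise OR instead of threading bool pointers through update_sg_lb_stats().

```c
#include <assert.h>

/* Illustrative subset of struct sg_lb_stats after this patch: the
 * overloaded/overutilized results live in the stats themselves. */
struct sg_stats {
    unsigned char overloaded;
    unsigned char overutilized;
};

/* Fold one group's flags into the domain-level summary; any group
 * setting a flag sets it for the whole domain. */
static void accumulate(const struct sg_stats *sgs,
                       int *sd_overloaded, int *sd_overutilized)
{
    *sd_overloaded  |= sgs->overloaded;
    *sd_overutilized |= sgs->overutilized;
}
```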
* [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (3 preceding siblings ...)
2025-03-13 9:37 ` [RFC PATCH 4/8] sched/fair: Move sg_{overloaded,overutilized} calculation to sg_lb_stats K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-13 9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
` (11 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
"struct sg_lb_stats_prop" is a container around "struct sg_lb_stats"
that helps propagate the load balancing stats up the sched domain
hierarchy. Embed it in "struct sched_domain_shared" so that concurrent
load balancing instances can reuse the statistics collected for the
domains below.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched/topology.h | 9 +++++----
kernel/sched/sched.h | 11 +++++++++++
kernel/sched/topology.c | 26 +++++++++++++++++++++++---
3 files changed, 39 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 7f3dbafe1817..a16d7d9dd9d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -78,10 +78,11 @@ extern int sched_domain_level_max;
struct sched_group;
struct sched_domain_shared {
- atomic_t ref;
- atomic_t nr_busy_cpus;
- int has_idle_cores;
- int nr_idle_scan;
+ atomic_t ref;
+ atomic_t nr_busy_cpus;
+ int has_idle_cores;
+ int nr_idle_scan;
+ void *private; /* lb stats propagation field */
};
struct sched_domain {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9372a75ab3cf..391c4180eeb3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2170,6 +2170,17 @@ struct sg_lb_stats {
#endif
};
+/*
+ * sg_lb_stats_prop - Load balancer stats propagation container.
+ * This is embedded in sg->shared->private and is used to propagate
+ * sched_domain load balancing statistics up the hierarchy.
+ */
+struct sg_lb_stats_prop {
+ raw_spinlock_t stats_lock; /* Lock for updating the cached stats */
+ unsigned long last_update; /* Time when stats was last updated (jiffies) */
+ struct sg_lb_stats sg_stats; /* Cached sched_group stats */
+};
+
static inline struct cpumask *sched_group_span(struct sched_group *sg)
{
return to_cpumask(sg->cpumask);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 508ee8aa492b..aeb55f66e8d6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -621,10 +621,19 @@ static void link_sg_shared(struct sched_group *sg, struct sched_domain_shared *s
atomic_inc(&sds->ref);
}
+static void free_sched_domain_shared(struct sched_domain_shared *sd_shared)
+{
+ if (!sd_shared)
+ return;
+
+ kfree(sd_shared->private);
+ kfree(sd_shared);
+}
+
static void free_sg_shared(struct sched_group *sg)
{
if (sg->shared && atomic_dec_and_test(&sg->shared->ref))
- kfree(sg->shared);
+ free_sched_domain_shared(sg->shared);
sg->shared = NULL;
}
@@ -661,7 +670,7 @@ static void destroy_sched_domain(struct sched_domain *sd)
free_sched_groups(sd->groups, 1);
if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
- kfree(sd->shared);
+ free_sched_domain_shared(sd->shared);
kfree(sd);
}
@@ -2273,6 +2282,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
struct sched_domain_shared *sds;
struct sched_group *sg;
struct sched_group_capacity *sgc;
+ struct sg_lb_stats_prop *sg_stats;
sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
@@ -2288,6 +2298,16 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
*per_cpu_ptr(sdd->sds, j) = sds;
+ sg_stats = kzalloc_node(sizeof(struct sg_lb_stats_prop),
+ GFP_KERNEL, cpu_to_node(j));
+
+ if (!sg_stats)
+ return -ENOMEM;
+
+ raw_spin_lock_init(&sg_stats->stats_lock);
+ sg_stats->last_update = 0;
+ sds->private = (void *)sg_stats;
+
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
GFP_KERNEL, cpu_to_node(j));
if (!sg)
@@ -2332,7 +2352,7 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
if (sdd->sds)
- kfree(*per_cpu_ptr(sdd->sds, j));
+ free_sched_domain_shared(*per_cpu_ptr(sdd->sds, j));
if (sdd->sg)
kfree(*per_cpu_ptr(sdd->sg, j));
if (sdd->sgc)
--
2.43.0
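The intended reuse protocol around stats_lock and last_update can be sketched in single-threaded userspace C. This is a model only: a plain int stands in for the raw_spinlock, the cached stats are collapsed to one field, and the one-jiffy freshness window is an assumption, not a value taken from this series.

```c
#include <assert.h>

/* Illustrative stand-in for struct sg_lb_stats_prop. */
struct stats_prop {
    int locked;                 /* stand-in for stats_lock */
    unsigned long last_update;  /* jiffies of the last refresh */
    unsigned long cached_load;  /* stand-in for the cached sg_stats */
};

/* Reuse the cached stats only if the lock is free and the cache is
 * recent enough; otherwise the caller falls back to recomputing. */
static int try_reuse(struct stats_prop *p, unsigned long now,
                     unsigned long *load_out)
{
    int reused = 0;

    if (p->locked)
        return 0;               /* contended: recompute instead */
    p->locked = 1;
    if (now - p->last_update <= 1) {  /* assumed freshness window */
        *load_out = p->cached_load;
        reused = 1;
    }
    p->locked = 0;
    return reused;
}

/* Publish freshly computed stats under the lock. */
static void publish(struct stats_prop *p, unsigned long now,
                    unsigned long load)
{
    p->locked = 1;
    p->cached_load = load;
    p->last_update = now;
    p->locked = 0;
}
```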
* [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (4 preceding siblings ...)
2025-03-13 9:37 ` [RFC PATCH 5/8] sched/topology: Define sg_lb_stats_prop and embed it inside sched_domain_shared K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-17 18:07 ` Chen, Yu C
2025-03-13 9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
` (10 subsequent siblings)
16 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
The load balancer will start caching the sg_lb_stats during load
balancing and propagate it up the sched domain hierarchy in the
subsequent commits.
Increase the probability that the load balancing intervals across
domains are aligned, to improve the reuse efficiency of the propagated
stats. Go one step further and proactively explore balancing at a
higher domain if its next update time is before the next update time
of its children.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b1ed14e4b5e..60517a732c10 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
/* scale ms to jiffies */
interval = msecs_to_jiffies(interval);
-
- /*
- * Reduce likelihood of busy balancing at higher domains racing with
- * balancing at lower domains by preventing their balancing periods
- * from being multiples of each other.
- */
- if (cpu_busy)
- interval -= 1;
-
interval = clamp(interval, 1UL, max_load_balance_interval);
return interval;
@@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
int continue_balancing = 1;
int cpu = rq->cpu;
int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
- unsigned long interval;
+ unsigned long interval, prev_sd_next_balance = 0;
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
@@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ unsigned long next_interval;
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains.
@@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
goto out;
}
- if (time_after_eq(jiffies, sd->last_balance + interval)) {
+ next_interval = sd->last_balance + interval;
+ if (time_after_eq(jiffies, next_interval) ||
+ (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
* The LBF_DST_PINNED logic could have changed
@@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
+ prev_sd_next_balance = sd->last_balance + interval;
}
if (need_serialize)
atomic_set_release(&sched_balance_running, 0);
--
2.43.0
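The new trigger condition can be lifted into a small userspace sketch using copies of the kernel's wraparound-safe jiffies comparisons; should_balance() below is illustrative, mirroring the condition added in the hunk.

```c
#include <assert.h>

/* Userspace copies of the kernel's wraparound-safe jiffies compares. */
#define time_after(a, b)     ((long)((b) - (a)) < 0)
#define time_after_eq(a, b)  ((long)((a) - (b)) >= 0)

/* Balance this domain either because its own interval expired, or
 * because the previous (child) domain's next balance lands after this
 * domain's, so balancing both now lets the parent reuse the child's
 * freshly cached stats. */
static int should_balance(unsigned long jiffies,
                          unsigned long next_interval,
                          unsigned long prev_sd_next_balance)
{
    return time_after_eq(jiffies, next_interval) ||
           (prev_sd_next_balance &&
            time_after(prev_sd_next_balance, next_interval));
}
```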
* Re: [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused
2025-03-13 9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
@ 2025-03-17 18:07 ` Chen, Yu C
2025-03-19 6:51 ` K Prateek Nayak
0 siblings, 1 reply; 24+ messages in thread
From: Chen, Yu C @ 2025-03-17 18:07 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, linux-kernel, yu.c.chen,
yu.chen.surf
On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
> The load balancer will start caching the sg_lb_stats during load
> balancing and propagate it up the sched domain hierarchy in the
> subsequent commits.
>
> Increase the probability that the load balancing intervals across
> domains are aligned, to improve the reuse efficiency of the propagated
> stats. Go one step further and proactively explore balancing at a
> higher domain if its next update time is before the next update time
> of its children.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 18 +++++++-----------
> 1 file changed, 7 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3b1ed14e4b5e..60517a732c10 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>
> /* scale ms to jiffies */
> interval = msecs_to_jiffies(interval);
> -
> - /*
> - * Reduce likelihood of busy balancing at higher domains racing with
> - * balancing at lower domains by preventing their balancing periods
> - * from being multiples of each other.
> - */
> - if (cpu_busy)
> - interval -= 1;
> -
> interval = clamp(interval, 1UL, max_load_balance_interval);
>
> return interval;
> @@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> int continue_balancing = 1;
> int cpu = rq->cpu;
> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> - unsigned long interval;
> + unsigned long interval, prev_sd_next_balance = 0;
> struct sched_domain *sd;
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;
> @@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> rcu_read_lock();
> for_each_domain(cpu, sd) {
> + unsigned long next_interval;
> +
> /*
> * Decay the newidle max times here because this is a regular
> * visit to all the domains.
> @@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> goto out;
> }
>
> - if (time_after_eq(jiffies, sd->last_balance + interval)) {
> + next_interval = sd->last_balance + interval;
> + if (time_after_eq(jiffies, next_interval) ||
> + (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {
(prev_sd_next_balance && time_after(jiffies, prev_sd_next_balance))?
thanks,
Chenyu
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> * The LBF_DST_PINNED logic could have changed
> @@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> + prev_sd_next_balance = sd->last_balance + interval;
> }
> if (need_serialize)
> atomic_set_release(&sched_balance_running, 0);
* Re: [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused
2025-03-17 18:07 ` Chen, Yu C
@ 2025-03-19 6:51 ` K Prateek Nayak
0 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-19 6:51 UTC (permalink / raw)
To: Chen, Yu C
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, Ingo Molnar, Peter Zijlstra,
Juri Lelli, Vincent Guittot, linux-kernel, yu.chen.surf
Hello Chenyu,
On 3/17/2025 11:37 PM, Chen, Yu C wrote:
> On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
>> The load balancer will start caching the sg_lb_stats during load
>> balancing and propagate it up the sched domain hierarchy in the
>> subsequent commits.
>>
>> Increase the probability that the load balancing intervals across
>> domains are aligned, to improve the reuse efficiency of the propagated
>> stats. Go one step further and proactively explore balancing at a
>> higher domain if its next update time is before the next update time
>> of its children.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
>> kernel/sched/fair.c | 18 +++++++-----------
>> 1 file changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3b1ed14e4b5e..60517a732c10 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11956,15 +11956,6 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
>> /* scale ms to jiffies */
>> interval = msecs_to_jiffies(interval);
>> -
>> - /*
>> - * Reduce likelihood of busy balancing at higher domains racing with
>> - * balancing at lower domains by preventing their balancing periods
>> - * from being multiples of each other.
>> - */
>> - if (cpu_busy)
>> - interval -= 1;
>> -
>> interval = clamp(interval, 1UL, max_load_balance_interval);
>> return interval;
>> @@ -12126,7 +12117,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>> int continue_balancing = 1;
>> int cpu = rq->cpu;
>> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>> - unsigned long interval;
>> + unsigned long interval, prev_sd_next_balance = 0;
>> struct sched_domain *sd;
>> /* Earliest time when we have to do rebalance again */
>> unsigned long next_balance = jiffies + 60*HZ;
>> @@ -12136,6 +12127,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>> rcu_read_lock();
>> for_each_domain(cpu, sd) {
>> + unsigned long next_interval;
>> +
>> /*
>> * Decay the newidle max times here because this is a regular
>> * visit to all the domains.
>> @@ -12162,7 +12155,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>> goto out;
>> }
>> - if (time_after_eq(jiffies, sd->last_balance + interval)) {
>> + next_interval = sd->last_balance + interval;
>> + if (time_after_eq(jiffies, next_interval) ||
>> + (prev_sd_next_balance && time_after(prev_sd_next_balance, next_interval))) {
>
> (prev_sd_next_balance && time_after(jiffies, prev_sd_next_balance))?
So the rationale here is to sync the balancing at different levels if the
load balancing interval at the parent is somewhere between now and the
next load balancing interval of the child domain:
                                         Move MC balance
                                         here for more
                                         reuse
                                               v
  jiffies <---------------------------------------------
                ^             ^               ^
          Next balance    Next balance   Current balance
          at SMT domain   at MC domain   at SMT domain
On some topology, it can mean slightly more aggressive load balancing at
higher domains but the goal is that cost savings of a stats reuse will
eventually hide this jitter of doing load balancing at multiple domains
at once.
I would like to go one step further and modify the cpumask_first() in
should_we_balance() to instead return the last CPU doing load balancing
for this tick, but it became slightly harder to cover the case of a delay
in the SOFTIRQ handler being executed, so I left it out of this prototype.
I'll try to add something proper in v2.
--
Thanks and Regards,
Prateek
>
> thanks,
> Chenyu
>
>> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>> /*
>> * The LBF_DST_PINNED logic could have changed
>> @@ -12174,6 +12169,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>> }
>> sd->last_balance = jiffies;
>> interval = get_sd_balance_interval(sd, busy);
>> + prev_sd_next_balance = sd->last_balance + interval;
>> }
>> if (need_serialize)
>> atomic_set_release(&sched_balance_running, 0);
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (5 preceding siblings ...)
2025-03-13 9:37 ` [RFC PATCH 6/8] sched/fair: Increase probability of lb stats being reused K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-17 18:04 ` Chen, Yu C
2025-03-13 9:37 ` [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats K Prateek Nayak
` (9 subsequent siblings)
16 siblings, 1 reply; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
Allow update_sg_lb_stats() to retrieve the group stats cached in
sg_lb_stats_prop saved by another CPU performing load balancing around
the same time (same jiffy).
The current implementation, without invalidation of cached stats, has a
few limitations, namely that stats reuse is limited to busy load
balancing since stats can only be updated once a jiffy. Newidle balance
can happen frequently and concurrently on many CPUs, which can result in
readers seeing inconsistent values for the propagated stats.
For this iteration, the focus is on reducing the time taken for busy
load balancing, allowing the busy CPU to resume running the task as
quickly as possible.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 81 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60517a732c10..3b402f294f0b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
return check_cpu_capacity(rq, sd);
}
+static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
+{
+ /*
+ * Only under periodic load balancing can we ensure that no concurrent
+ * CPU modifies the stats being propagated upwards since
+ * should_we_balance() can allow multiple concurrent newidle balance
+ * to progress and an idle -> busy transition for idle balance will
+ * require the stats to be recomputed since idleness metrics will
+ * change with migration.
+ */
+ if (idle)
+ return 0;
+
+ /*
+ * If individual groups are separate NUMA domains, migrations can cause
+ * preferred task statistics to change and will require recomputing of
+ * stats.
+ */
+ if (sd->child && (sd->child->flags & SD_NUMA))
+ return 0;
+
+ /*
+ * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
+ * domains. Skip caching stats for them.
+ */
+ if (sd->flags & SD_ASYM_CPUCAPACITY)
+ return 0;
+
+ /*
+ * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
+ * transition but for the time being, save some cycles during busy
+ * load balancing.
+ */
+ return 1;
+}
+
+static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
+{
+ struct sched_domain_shared *sg_share = group->shared;
+ unsigned long current_jiffy = jiffies;
+ struct sg_lb_stats_prop *lb_prop;
+
+ if (!sg_share)
+ return 0;
+
+ lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
+ if (!lb_prop)
+ return 0;
+
+ /* Stale stats */
+ if (READ_ONCE(lb_prop->last_update) != current_jiffy)
+ return 0;
+
+ /*
+ * Pairs against the update to sgs_prop->last_update to
+ * prevent readers from seeing an inconsistent value of
+ * the propagated stats from a concurrent update.
+ */
+ smp_rmb();
+ *sg_stats = lb_prop->sg_stats;
+
+ /*
+ * If the stats were read within the same jiffy, we cannot have
+ * read an inconsistent state since stats are only updated once
+ * per jiffy.
+ */
+ return time_before_eq(jiffies, current_jiffy);
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
int i, nr_running, local_group, sd_flags = env->sd->flags;
bool balancing_at_rd = !env->sd->parent;
- memset(sgs, 0, sizeof(*sgs));
-
local_group = group == sds->local;
+ /*
+ * If stats can be retrieved, we are doing a busy load balancing.
+ * Skip right ahead to group_classify() since group_asym_packing and
+ * group_smt_balance is not possible under busy load balancing.
+ */
+ if (can_retrieve_stats(env->sd, env->idle) &&
+ retrieve_cached_stats(group, sgs))
+ goto group_classify;
+
+ memset(sgs, 0, sizeof(*sgs));
+
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
unsigned long load = cpu_load(rq);
@@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!local_group && smt_balance(env, sgs, group))
sgs->group_smt_balance = 1;
+group_classify:
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
/* Computing avg_load makes sense only when group is overloaded */
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread* Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
2025-03-13 9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
@ 2025-03-17 18:04 ` Chen, Yu C
2025-03-19 6:42 ` K Prateek Nayak
0 siblings, 1 reply; 24+ messages in thread
From: Chen, Yu C @ 2025-03-17 18:04 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, yu.c.chen, yu.chen.surf,
Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
linux-kernel
On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
> Allow update_sg_lb_stats() to retrieve the group stats cached in
> sg_lb_stats_prop saved by another CPU performing load balancing around
> the same time (same jiffy)
>
If I understand correctly, we allow update_sg_lb_stats() to retrieve
cached sg stats if another CPU in the same sched group has done load
balancing within the last jiffy, say 10 ms for CONFIG_HZ_100.
There are two roles: the writer, who updates the cached stats, and the
reader, who reads the cached stats. Do we trigger both of them only
during busy periodic load balance? If yes, consider that periodic load
balance is usually triggered on 1 CPU in each SD (should_we_balance()),
and the interval increases with the number of CPUs in that sd; just
wondering if 10 ms is a little short to find cached stats on a large LLC?
thanks,
Chenyu
> The current implementation, without invalidation of cached stats, has a
> few limitations, namely that stats reuse is limited to busy load
> balancing since stats can only be updated once a jiffy. Newidle balance
> can happen frequently and concurrently on many CPUs, which can result in
> readers seeing inconsistent values for the propagated stats.
>
> For this iteration, the focus is on reducing the time taken for busy
> load balancing, allowing the busy CPU to resume running the task as
> quickly as possible.
>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> kernel/sched/fair.c | 83 +++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 81 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 60517a732c10..3b402f294f0b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10275,6 +10275,75 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
> return check_cpu_capacity(rq, sd);
> }
>
> +static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
> +{
> + /*
> + * Only under periodic load balancing can we ensure that no concurrent
> + * CPU modifies the stats being propagated upwards since
> + * should_we_balance() can allow multiple concurrent newidle balance
> + * to progress and an idle -> busy transition for idle balance will
> + * require the stats to be recomputed since idleness metrics will
> + * change with migration.
> + */
> + if (idle)
> + return 0;
> +
> + /*
> + * If individual groups are separate NUMA domains, migrations can cause
> + * preferred task statistics to change and will require recomputing of
> + * stats.
> + */
> + if (sd->child && (sd->child->flags & SD_NUMA))
> + return 0;
> +
> + /*
> + * misfit_task_load requires recalculation on SD_ASYM_CPUCAPACITY
> + * domains. Skip caching stats for them.
> + */
> + if (sd->flags & SD_ASYM_CPUCAPACITY)
> + return 0;
> +
> + /*
> + * TODO: For CPU_IDLE case, invalidate stats for an idle -> busy
> + * transition but for the time being, save some cycles during busy
> + * load balancing.
> + */
> + return 1;
> +}
> +
> +static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
> +{
> + struct sched_domain_shared *sg_share = group->shared;
> + unsigned long current_jiffy = jiffies;
> + struct sg_lb_stats_prop *lb_prop;
> +
> + if (!sg_share)
> + return 0;
> +
> + lb_prop = (struct sg_lb_stats_prop *)sg_share->private;
> + if (!lb_prop)
> + return 0;
> +
> + /* Stale stats */
> + if (READ_ONCE(lb_prop->last_update) != current_jiffy)
> + return 0;
> +
> + /*
> + * Pairs against the update to sgs_prop->last_update to
> + * prevent readers from seeing an inconsistent value of
> + * the propagated stats from a concurrent update.
> + */
> + smp_rmb();
> + *sg_stats = lb_prop->sg_stats;
> +
> + /*
> + * If stats were read in the same interval, it cannot
> + * read an inconsistent state since stats are only
> + * updated once per jiffy.
> + */
> + return time_before_eq(jiffies, current_jiffy);
> +}
> +
> /**
> * update_sg_lb_stats - Update sched_group's statistics for load balancing.
> * @env: The load balancing environment.
> @@ -10292,10 +10361,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> int i, nr_running, local_group, sd_flags = env->sd->flags;
> bool balancing_at_rd = !env->sd->parent;
>
> - memset(sgs, 0, sizeof(*sgs));
> -
> local_group = group == sds->local;
>
> + /*
> + * If stats can be retrieved, we are doing a busy load balancing.
> + * Skip right ahead to group_classify() since group_asym_packing and
> + * group_smt_balance is not possible under busy load balancing.
> + */
> + if (can_retrieve_stats(env->sd, env->idle) &&
> + retrieve_cached_stats(group, sgs))
> + goto group_classify;
> +
> + memset(sgs, 0, sizeof(*sgs));
> +
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> unsigned long load = cpu_load(rq);
> @@ -10360,6 +10438,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> if (!local_group && smt_balance(env, sgs, group))
> sgs->group_smt_balance = 1;
>
> +group_classify:
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> /* Computing avg_load makes sense only when group is overloaded */
^ permalink raw reply [flat|nested] 24+ messages in thread* Re: [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop
2025-03-17 18:04 ` Chen, Yu C
@ 2025-03-19 6:42 ` K Prateek Nayak
0 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-19 6:42 UTC (permalink / raw)
To: Chen, Yu C
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, yu.chen.surf, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, linux-kernel
Hello Chenyu,
Thank you for taking a look at the series.
On 3/17/2025 11:34 PM, Chen, Yu C wrote:
> On 3/13/2025 5:37 PM, K Prateek Nayak wrote:
>> Allow update_sg_lb_stats() to retrieve the group stats cached in
>> sg_lb_stats_prop saved by another CPU performing load balancing around
>> the same time (same jiffy)
>>
>
> If I understand correctly, we allow update_sg_lb_stats() to retrieve
> cached sg stats if another CPU in the same sched group has just done
> load balance within a jiffy ago, say 10ms for CONFIG_100_HZ.
Quick disclaimer: All of this is best effort currently.
Periodic load balancing is the easy case to start with since it happens
at most once a jiffy, so "last_update" as a jiffy counter should be
good enough (in most cases).
Secondly, and this is slightly harder to solve for, we need to get all
the CPUs to actually sync. Currently this is best effort since the tick
can fire late due to disabled interrupts on a CPU, SCHED_SOFTIRQ may run
at different times depending on how much work is done at tick prior to
reaching the softirq handler, etc.
But assuming some amount of sync, I would like:
- During busy balance, only one CPU gets to proceed as per the
should_we_balance() heuristics. In addition to that, since all CPUs
are busy (should_we_balance() would have allowed the first idle CPU
to go ahead otherwise), the "idle_cpus" and "overloaded" situations
may change, and those are hard to propagate.
- By the time this CPU does busy balancing, the other groups below it
have hopefully had enough time to reach update_sd_lb_stats() and cache
their copy for this jiffy in there. If not, the load balancing CPU
will recompute.
- Since stats at a higher domain are used only once, there was no need
to invalidate them, which I couldn't get right back then (or maybe
even now :)
>
> There are two roles, writer who updates the cached stats,
> the reader who reads the cache stats. For both cache writer and
> the cache reader, do we trigger them only when it is in busy periodic
> load balance? If yes, consider the periodic load balance is usually
> triggered on 1 CPU in each SD(should_we_balance()), and the
> interval increases with the number of CPUs in that sd, just wonder
> if 10 ms is a little short to find a cached stats on large LLC?
So the reader is always the CPU going to the higher domain to recompute
stats. The writer should have updated the stats by then, or the reader
won't care and will recompute them.
At the very least, since the CPU has to look at local stats too, the
logic ensures those are reused and not recomputed.
Beyond the annotated PATCH 9, I've moved to a versioning scheme that
could also be reused for newidle balancing with stats invalidation
and that should help reuse stats more. There are some stats on the
empty PATCH 9.
--
Thanks and Regards,
Prateek
> thanks,
> Chenyu
>
>
>> The current implementation, without invalidation of cached stats, has a
>> few limitations, namely that stats reuse is limited to busy load
>> balancing since stats can only be updated once a jiffy. Newidle balance
>> can happen frequently and concurrently on many CPUs, which can result in
>> readers seeing inconsistent values for the propagated stats.
>>
>> For this iteration, the focus is on reducing the time taken for busy
>> load balancing, allowing the busy CPU to resume running the task as
>> quickly as possible.
>>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>> ---
^ permalink raw reply [flat|nested] 24+ messages in thread
* [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (6 preceding siblings ...)
2025-03-13 9:37 ` [RFC PATCH 7/8] sched/fair: Retrieve cached group stats from sg_lb_stats_prop K Prateek Nayak
@ 2025-03-13 9:37 ` K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation K Prateek Nayak
` (8 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-13 9:37 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
Aggregate the individual sched_group stats to compute the stats for the
entire sched_domain. Cache them in sd->shared, which sg->shared also
points to for the sched_group corresponding to sd in its parent domain.
This ensures that the stats are readily available at the higher domains
if the load balancing continues.
With the new infrastructure in place, following are the benchmark
numbers:
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) stats_prop[pct imp](CV)
1-groups 1.00 [ -0.00](10.12) 1.09 [ -9.11](11.93)
2-groups 1.00 [ -0.00]( 6.92) 1.00 [ -0.22]( 4.57)
4-groups 1.00 [ -0.00]( 3.14) 0.99 [ 0.83]( 1.77)
8-groups 1.00 [ -0.00]( 1.35) 1.00 [ -0.31]( 2.24)
16-groups 1.00 [ -0.00]( 1.32) 0.99 [ 0.84]( 0.67)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) stats_prop[pct imp](CV)
1 1.00 [ 0.00]( 0.43) 0.99 [ -0.87]( 1.34)
2 1.00 [ 0.00]( 0.58) 1.02 [ 2.14]( 0.29)
4 1.00 [ 0.00]( 0.54) 1.01 [ 1.24]( 0.82)
8 1.00 [ 0.00]( 0.49) 1.01 [ 0.62]( 0.97)
16 1.00 [ 0.00]( 1.06) 1.01 [ 0.94]( 0.70)
32 1.00 [ 0.00]( 1.27) 0.99 [ -1.24]( 1.38)
64 1.00 [ 0.00]( 1.54) 1.00 [ -0.43]( 0.36)
128 1.00 [ 0.00]( 0.38) 1.00 [ -0.01]( 1.22)
256 1.00 [ 0.00]( 1.85) 1.02 [ 1.58]( 0.90)
512 1.00 [ 0.00]( 0.31) 1.01 [ 0.76]( 1.19)
1024 1.00 [ 0.00]( 0.19) 1.00 [ 0.44]( 0.35)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) stats_prop[pct imp](CV)
Copy 1.00 [ 0.00](11.31) 1.02 [ 1.69]( 6.44)
Scale 1.00 [ 0.00]( 6.62) 1.01 [ 0.80]( 5.37)
Add 1.00 [ 0.00]( 7.06) 1.02 [ 1.54]( 6.72)
Triad 1.00 [ 0.00]( 8.91) 1.01 [ 1.36]( 6.73)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) stats_prop[pct imp](CV)
Copy 1.00 [ 0.00]( 2.01) 0.98 [ -1.55]( 2.15)
Scale 1.00 [ 0.00]( 1.49) 1.00 [ 0.23]( 0.58)
Add 1.00 [ 0.00]( 2.67) 1.01 [ 0.65]( 1.95)
Triad 1.00 [ 0.00]( 2.19) 1.01 [ 0.61]( 1.37)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) stats_prop[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.43) 1.00 [ 0.17]( 0.32)
2-clients 1.00 [ 0.00]( 1.02) 1.01 [ 1.00]( 0.44)
4-clients 1.00 [ 0.00]( 0.83) 1.01 [ 0.62]( 0.36)
8-clients 1.00 [ 0.00]( 0.73) 1.00 [ -0.11]( 0.65)
16-clients 1.00 [ 0.00]( 0.97) 1.00 [ 0.49]( 0.77)
32-clients 1.00 [ 0.00]( 0.88) 1.00 [ 0.30]( 0.94)
64-clients 1.00 [ 0.00]( 1.49) 1.00 [ 0.36]( 1.57)
128-clients 1.00 [ 0.00]( 1.05) 1.00 [ 0.14]( 1.46)
256-clients 1.00 [ 0.00]( 3.85) 1.00 [ -0.04]( 4.85)
512-clients 1.00 [ 0.00](59.63) 1.00 [ -0.02](62.28)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) stats_prop[pct imp](CV)
1 1.00 [ -0.00]( 6.67) 0.76 [ 24.44](35.80)
2 1.00 [ -0.00](10.18) 0.87 [ 13.04](10.38)
4 1.00 [ -0.00]( 4.49) 1.04 [ -4.26]( 3.14)
8 1.00 [ -0.00]( 6.68) 0.98 [ 1.89]( 8.07)
16 1.00 [ -0.00]( 1.87) 1.03 [ -3.28]( 5.21)
32 1.00 [ -0.00]( 4.01) 0.98 [ 2.20]( 1.31)
64 1.00 [ -0.00]( 3.21) 1.00 [ -0.00]( 3.23)
128 1.00 [ -0.00](44.13) 1.06 [ -6.43](113.66)
256 1.00 [ -0.00](14.46) 1.04 [ -3.52]( 8.43)
512 1.00 [ -0.00]( 1.95) 1.02 [ -1.80]( 1.14)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) stats_prop[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 1.00 [ 0.00]( 0.55)
2 1.00 [ 0.00]( 0.15) 0.99 [ -0.88]( 0.26)
4 1.00 [ 0.00]( 0.15) 0.99 [ -0.59]( 0.15)
8 1.00 [ 0.00]( 0.15) 0.99 [ -0.88]( 0.26)
16 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
32 1.00 [ 0.00]( 3.40) 1.07 [ 6.59]( 0.16)
64 1.00 [ 0.00]( 7.09) 1.00 [ -0.38]( 0.96)
128 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.20)
256 1.00 [ 0.00]( 1.12) 1.00 [ -0.30]( 1.50)
512 1.00 [ 0.00]( 0.22) 1.05 [ 4.86]( 0.71)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) stats_prop[pct imp](CV)
1 1.00 [ -0.00](19.72) 0.85 [ 15.38](16.75)
2 1.00 [ -0.00](15.96) 1.00 [ -0.00]( 0.00)
4 1.00 [ -0.00]( 3.87) 1.00 [ -0.00]( 4.08)
8 1.00 [ -0.00]( 8.15) 1.00 [ -0.00](11.71)
16 1.00 [ -0.00]( 3.87) 0.92 [ 7.69]( 4.19)
32 1.00 [ -0.00](12.99) 0.73 [ 26.67]( 0.00)
64 1.00 [ -0.00]( 6.20) 1.12 [-12.50]( 9.94)
128 1.00 [ -0.00]( 0.96) 0.98 [ 1.55]( 0.95)
256 1.00 [ -0.00]( 2.76) 0.99 [ 1.45]( 1.38)
512 1.00 [ -0.00]( 0.20) 1.20 [-20.42]( 0.00)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) stats_prop[pct imp](CV)
1 1.00 [ -0.00]( 1.07) 1.02 [ -2.08]( 0.13)
2 1.00 [ -0.00]( 0.14) 1.04 [ -3.97]( 0.13)
4 1.00 [ -0.00]( 1.39) 1.03 [ -3.15]( 0.13)
8 1.00 [ -0.00]( 0.36) 1.03 [ -3.16]( 0.00)
16 1.00 [ -0.00]( 1.18) 1.02 [ -1.59]( 0.75)
32 1.00 [ -0.00]( 8.42) 0.81 [ 19.08]( 0.25)
64 1.00 [ -0.00]( 4.85) 1.01 [ -1.10]( 2.58)
128 1.00 [ -0.00]( 0.28) 1.00 [ -0.21]( 0.38)
256 1.00 [ -0.00](10.52) 0.95 [ 4.74]( 6.94)
512 1.00 [ -0.00]( 0.69) 1.09 [ -8.99]( 0.27)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -0.54%
ycsb-mongodb 0.09%
deathstarbench-1x -0.30%
deathstarbench-2x 2.38%
deathstarbench-3x 0.58%
deathstarbench-6x 0.62%
hammerdb+mysql 16VU 0.76%
hammerdb+mysql 64VU 0.74%
* The tail latencies reported by schbench increase, possibly due to the
syncing of load balancing across multiple domains; however, this remains
to be investigated.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 99 +++++++++++++++++++++++++++++++++++++++++----
1 file changed, 92 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3b402f294f0b..212bee3e9f35 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,6 +10275,38 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
return check_cpu_capacity(rq, sd);
}
+static inline void cache_sd_stats(struct sched_domain *sd, struct sg_lb_stats *sd_stats)
+{
+ struct sched_domain_shared *sd_share = sd->shared;
+ unsigned long current_jiffy = jiffies;
+ struct sg_lb_stats_prop *lb_prop;
+
+ if (!sd_share)
+ return;
+
+ lb_prop = (struct sg_lb_stats_prop *)sd_share->private;
+ if (!lb_prop)
+ return;
+
+ /* Concurrent load balancing instance already updated the stats */
+ if (READ_ONCE(lb_prop->last_update) == current_jiffy)
+ return;
+
+ scoped_guard(raw_spinlock_irqsave_try, &lb_prop->stats_lock) {
+ if (READ_ONCE(lb_prop->last_update) == current_jiffy)
+ break;
+
+ lb_prop->sg_stats = *sd_stats;
+
+ /*
+ * Pairs against readers checking the last_update
+ * before reading the cached stats.
+ */
+ smp_wmb();
+ WRITE_ONCE(lb_prop->last_update, current_jiffy);
+ }
+}
+
static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
{
/*
@@ -10344,6 +10376,35 @@ static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_
return time_before_eq(jiffies, current_jiffy);
}
+/**
+ * aggregate_sd_stats - Compute the sched domain's stats from group stats.
+ * @env: The load balancing environment.
+ * @sd_stats: variable to hold the aggregated statistics to propagate for the sd
+ * @sg_stats: group stats that were computed or retrieved
+ */
+static inline void aggregate_sd_stats(struct lb_env *env,
+ struct sg_lb_stats *sd_stats,
+ struct sg_lb_stats *sg_stats)
+{
+ sd_stats->group_load += sg_stats->group_load;
+ sd_stats->group_util += sg_stats->group_util;
+ sd_stats->group_runnable += sg_stats->group_runnable;
+ sd_stats->sum_h_nr_running += sg_stats->sum_h_nr_running;
+ sd_stats->sum_nr_running += sg_stats->sum_nr_running;
+ sd_stats->idle_cpus += sg_stats->idle_cpus;
+ sd_stats->group_capacity += sg_stats->group_capacity;
+ sd_stats->group_weight += sg_stats->group_weight;
+ sd_stats->overloaded |= sg_stats->overloaded;
+ sd_stats->overutilized |= sg_stats->overutilized;
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (env->sd->flags & SD_NUMA) {
+ sd_stats->nr_numa_running += sg_stats->nr_numa_running;
+ sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
+ }
+#endif
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
@@ -11041,9 +11102,18 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
{
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
- struct sg_lb_stats tmp_sgs;
- unsigned long sum_util = 0;
bool sg_overloaded = 0, sg_overutilized = 0;
+ struct sg_lb_stats tmp_sgs, sd_stats = {};
+ unsigned long sum_util = 0;
+ bool should_prop = false;
+
+ /*
+ * If a parent domain exists and the cached stats can be retrieved when
+ * load balancing there, aggregate the statistics at current domain
+ * to be retrieved when load balancing at parent.
+ */
+ if (env->sd->parent && can_retrieve_stats(env->sd->parent, env->idle))
+ should_prop = true;
do {
struct sg_lb_stats *sgs = &tmp_sgs;
@@ -11061,21 +11131,36 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
update_sg_lb_stats(env, sds, sg, sgs);
+ if (should_prop)
+ aggregate_sd_stats(env, &sd_stats, sgs);
+
if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
sds->busiest_stat = *sgs;
}
/* Now, start updating sd_lb_stats */
- sds->total_load += sgs->group_load;
- sds->total_capacity += sgs->group_capacity;
- sg_overloaded |= sgs->overloaded;
- sg_overutilized |= sgs->overutilized;
+ if (!should_prop) {
+ sds->total_load += sgs->group_load;
+ sds->total_capacity += sgs->group_capacity;
+ sg_overloaded |= sgs->overloaded;
+ sg_overutilized |= sgs->overutilized;
+ sum_util += sgs->group_util;
+ }
- sum_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
+ if (should_prop) {
+ sds->total_load = sd_stats.group_load;
+ sds->total_capacity = sd_stats.group_capacity;
+ sg_overloaded = sd_stats.overloaded;
+ sg_overutilized = sd_stats.overutilized;
+ sum_util = sd_stats.group_util;
+
+ cache_sd_stats(env->sd, &sd_stats);
+ }
+
/*
* Indicate that the child domain of the busiest group prefers tasks
* go to a child's sibling domains first. NB the flags of a sched group
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread* [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (7 preceding siblings ...)
2025-03-13 9:37 ` [RFC PATCH 8/8] sched/fair: Update stats for sched_domain using the sched_group stats K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains K Prateek Nayak
` (7 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
I would have loved to spin another version of this, but being slightly
short on time before OSPM, I decided to add these bits on top of the
RFC. Sorry for the inconvenience.
Stats versioning
================
Earlier experiments looked at aggressive stats caching and reuse. Load
balancing instances computed and cached the stats for non-local groups
hoping that they would be reused.
With stats versioning, the load balancing CPU only caches the stats for
the local hierarchy. Instead of the jiffy based "last_update" freshness,
this moves to versioning based on sched_clock_cpu() value.
Cached stats are invalidated once the CPU doing load balancing is done,
allowing fresher stats to be propagated. Stats computed by a concurrent
load balancing instance can now be reused, allowing idle and newidle
balance to reuse stats effectively.
Stats versioning nuances are explained in Patch 11/08. Since idle and
newidle balance can reuse stats, changes have been made in the
aggregation to consider reduced capacity, but also to forego computing
total capacity.
Benchmark results are as follows:
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) versioning[pct imp](CV)
1-groups 1.00 [ -0.00](10.12) 1.00 [ 0.44](13.86)
2-groups 1.00 [ -0.00]( 6.92) 1.04 [ -4.32]( 3.00)
4-groups 1.00 [ -0.00]( 3.14) 1.00 [ -0.21]( 2.16)
8-groups 1.00 [ -0.00]( 1.35) 1.01 [ -1.25]( 1.32)
16-groups 1.00 [ -0.00]( 1.32) 1.01 [ -0.50]( 2.00)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ 0.00]( 0.43) 0.98 [ -1.65]( 0.15)
2 1.00 [ 0.00]( 0.58) 1.01 [ 1.27]( 0.49)
4 1.00 [ 0.00]( 0.54) 1.00 [ 0.47]( 0.40)
8 1.00 [ 0.00]( 0.49) 1.00 [ -0.44]( 1.18)
16 1.00 [ 0.00]( 1.06) 1.00 [ -0.07]( 1.14)
32 1.00 [ 0.00]( 1.27) 1.00 [ 0.02]( 0.11)
64 1.00 [ 0.00]( 1.54) 0.99 [ -1.12]( 1.09)
128 1.00 [ 0.00]( 0.38) 0.98 [ -2.43]( 1.00)
256 1.00 [ 0.00]( 1.85) 0.99 [ -0.50]( 0.94)
512 1.00 [ 0.00]( 0.31) 0.99 [ -1.03]( 0.35)
1024 1.00 [ 0.00]( 0.19) 0.99 [ -0.56]( 0.42)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) versioning[pct imp](CV)
Copy 1.00 [ 0.00](11.31) 1.08 [ 7.51]( 4.74)
Scale 1.00 [ 0.00]( 6.62) 1.00 [ -0.31]( 7.45)
Add 1.00 [ 0.00]( 7.06) 1.02 [ 2.50]( 7.34)
Triad 1.00 [ 0.00]( 8.91) 1.08 [ 7.78]( 2.88)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) versioning[pct imp](CV)
Copy 1.00 [ 0.00]( 2.01) 1.02 [ 1.82]( 1.26)
Scale 1.00 [ 0.00]( 1.49) 1.00 [ 0.26]( 0.80)
Add 1.00 [ 0.00]( 2.67) 1.01 [ 0.98]( 1.29)
Triad 1.00 [ 0.00]( 2.19) 1.02 [ 2.06]( 1.01)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) versioning[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.43) 0.99 [ -0.72]( 0.81)
2-clients 1.00 [ 0.00]( 1.02) 1.00 [ -0.09]( 1.11)
4-clients 1.00 [ 0.00]( 0.83) 1.00 [ 0.31]( 0.29)
8-clients 1.00 [ 0.00]( 0.73) 1.00 [ -0.25]( 0.61)
16-clients 1.00 [ 0.00]( 0.97) 1.00 [ -0.26]( 0.89)
32-clients 1.00 [ 0.00]( 0.88) 0.99 [ -0.61]( 0.82)
64-clients 1.00 [ 0.00]( 1.49) 0.99 [ -1.11]( 1.77)
128-clients 1.00 [ 0.00]( 1.05) 1.00 [ -0.03]( 1.13)
256-clients 1.00 [ 0.00]( 3.85) 1.00 [ -0.24]( 2.63)
512-clients 1.00 [ 0.00](59.63) 0.99 [ -0.74](59.01)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ -0.00]( 6.67) 0.93 [ 6.67](15.25)
2 1.00 [ -0.00](10.18) 0.83 [ 17.39]( 7.15)
4 1.00 [ -0.00]( 4.49) 1.04 [ -4.26]( 6.12)
8 1.00 [ -0.00]( 6.68) 1.06 [ -5.66](12.98)
16 1.00 [ -0.00]( 1.87) 1.00 [ -0.00]( 3.38)
32 1.00 [ -0.00]( 4.01) 0.98 [ 2.20]( 4.79)
64 1.00 [ -0.00]( 3.21) 1.02 [ -1.68]( 0.84)
128 1.00 [ -0.00](44.13) 1.16 [-15.98](14.99)
256 1.00 [ -0.00](14.46) 0.90 [ 9.99](17.45)
512 1.00 [ -0.00]( 1.95) 0.98 [ 1.54]( 1.13)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 1.00 [ 0.00]( 0.26)
2 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.15)
4 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.30)
8 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.26)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.40) 1.06 [ 5.93]( 1.22)
64 1.00 [ 0.00]( 7.09) 1.00 [ 0.00]( 0.20)
128 1.00 [ 0.00]( 0.00) 0.98 [ -1.52]( 0.34)
256 1.00 [ 0.00]( 1.12) 0.98 [ -2.41]( 1.19)
512 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 0.43)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ -0.00](19.72) 1.00 [ -0.00]( 8.37)
2 1.00 [ -0.00](15.96) 1.09 [ -9.09](11.08)
4 1.00 [ -0.00]( 3.87) 1.15 [-15.38](17.44)
8 1.00 [ -0.00]( 8.15) 0.92 [ 8.33]( 8.85)
16 1.00 [ -0.00]( 3.87) 1.23 [-23.08]( 5.59)
32 1.00 [ -0.00](12.99) 0.73 [ 26.67](16.75)
64 1.00 [ -0.00]( 6.20) 1.25 [-25.00]( 2.63)
128 1.00 [ -0.00]( 0.96) 1.62 [-62.37]( 1.30)
256 1.00 [ -0.00]( 2.76) 0.82 [ 17.89](10.56)
512 1.00 [ -0.00]( 0.20) 1.00 [ -0.00]( 0.34)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ -0.00]( 1.07) 1.02 [ -2.34]( 0.13)
2 1.00 [ -0.00]( 0.14) 1.04 [ -3.97]( 0.13)
4 1.00 [ -0.00]( 1.39) 1.03 [ -3.15]( 0.13)
8 1.00 [ -0.00]( 0.36) 1.03 [ -3.43]( 0.66)
16 1.00 [ -0.00]( 1.18) 0.99 [ 0.79]( 1.22)
32 1.00 [ -0.00]( 8.42) 0.82 [ 18.29]( 9.02)
64 1.00 [ -0.00]( 4.85) 1.00 [ -0.44]( 1.61)
128 1.00 [ -0.00]( 0.28) 1.06 [ -5.64]( 1.10)
256 1.00 [ -0.00](10.52) 0.81 [ 19.18](12.55)
512 1.00 [ -0.00]( 0.69) 1.00 [ 0.33]( 1.27)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -0.76%
ycsb-mongodb 0.49%
deathstarbench-1x -2.37%
deathstarbench-2x 0.12%
deathstarbench-3x 2.30%
deathstarbench-6x 1.88%
hammerdb+mysql 16VU 3.85%
hammerdb+mysql 64VU 0.27%
Following are the schedstats diff for sched-messaging 4-group and
16-groups:
o 4-groups:
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC COUNT1 COUNT2 PCT_CHANGE PCT_CHANGE1 PCT_CHANGE2
----------------------------------------------------------------------------------------------------
sched_yield() count : 0, 0 | 0.00% |
Legacy counter can be ignored : 0, 0 | 0.00% |
schedule() called : 174683, 176871 | 1.25% |
schedule() left the processor idle : 86742, 88113 | 1.58% | ( 49.66%, 49.82% )
try_to_wake_up() was called : 87675, 88622 | 1.08% |
try_to_wake_up() was called to wake up the local cpu : 28, 26 | -7.14% | ( 0.03%, 0.03% )
total runtime by tasks on this processor (in jiffies) : 2124248214, 2118780927 | -0.26% |
total waittime by tasks on this processor (in jiffies) : 24160304, 16912073 | -30.00% | ( 1.14%, 0.80% )
total timeslices run on this cpu : 87936, 88753 | 0.93% |
----------------------------------------------------------------------------------------------------
---------------------------------------- <Category newidle> ----------------------------------------
SMT:
load_balance() total time to balance on newly idle : 449650, 465044 | 3.42% |
load_balance() stats reused on newly idle : 0, 0 | 0.00% |
load_balance() stats recomputed on newly idle : 2493, 2679 | 7.46% |
MC:
load_balance() total time to balance on newly idle : 660742, 610346 | -7.63% |
load_balance() stats reused on newly idle : 0, 1898 | 0.00% |
load_balance() stats recomputed on newly idle : 3985, 3527 | -11.49% |
PKG:
load_balance() total time to balance on newly idle : 725938, 530707 | -26.89% |
load_balance() stats reused on newly idle : 0, 401 | 0.00% |
load_balance() stats recomputed on newly idle : 722, 474 | -34.35% |
NUMA:
load_balance() total time to balance on newly idle : 406862, 410386 | 0.87% |
load_balance() stats reused on newly idle : 0, 36 | 0.00% |
load_balance() stats recomputed on newly idle : 48, 39 | -18.75% |
o 16-groups:
----------------------------------------------------------------------------------------------------
CPU <ALL CPUS SUMMARY>
----------------------------------------------------------------------------------------------------
DESC COUNT1 COUNT2 PCT_CHANGE PCT_CHANGE1 PCT_CHANGE2
----------------------------------------------------------------------------------------------------
sched_yield() count : 0, 0 | 0.00% |
Legacy counter can be ignored : 0, 0 | 0.00% |
schedule() called : 566558, 554784 | -2.08% |
schedule() left the processor idle : 222161, 212164 | -4.50% | ( 39.21%, 38.24% )
try_to_wake_up() was called : 325303, 322690 | -0.80% |
try_to_wake_up() was called to wake up the local cpu : 990, 1017 | 2.73% | ( 0.30%, 0.32% )
total runtime by tasks on this processor (in jiffies) : 8807593610, 9142526964 | 3.80% |
total waittime by tasks on this processor (in jiffies) : 4093286876, 4314147489 | 5.40% | ( 46.47%, 47.19% )
total timeslices run on this cpu : 344281, 342495 | -0.52% |
----------------------------------------------------------------------------------------------------
---------------------------------------- <Category newidle> ----------------------------------------
SMT:
load_balance() total time to balance on newly idle : 9841719, 11615891 | 18.03% |
load_balance() stats reused on newly idle : 0, 0 | 0.00% |
load_balance() stats recomputed on newly idle : 28103, 27084 | -3.63% |
MC:
load_balance() total time to balance on newly idle : 20079305, 18103792 | -9.84% |
load_balance() stats reused on newly idle : 0, 37820 | 0.00% |
load_balance() stats recomputed on newly idle : 63885, 33518 | -47.53% |
PKG:
load_balance() total time to balance on newly idle : 17972213, 16430220 | -8.58% |
load_balance() stats reused on newly idle : 0, 8461 | 0.00% |
load_balance() stats recomputed on newly idle : 11513, 6318 | -45.12% |
NUMA:
load_balance() total time to balance on newly idle : 11050651, 9890509 | -10.50% |
load_balance() stats reused on newly idle : 0, 496 | 0.00% |
load_balance() stats recomputed on newly idle : 827, 524 | -36.64% |
---
Note: perf sched stats cannot properly aggregate "min" and "max" fields
yet.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
2.43.0
* [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (8 preceding siblings ...)
2025-03-16 10:29 ` [RFC PATCH 09/08] [ANNOTATE] sched/fair: Stats versioning and invalidation K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning K Prateek Nayak
` (6 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
Migrations within a NUMA domain will not change the
nr_{numa,preferred}_running stats. Compute them for non-NUMA groups as
well so that they can be propagated and reused at the first NUMA domain
when it exists.
While at it, also clear sd_stats before aggregation.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 212bee3e9f35..d09f900a3107 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10398,10 +10398,8 @@ static inline void aggregate_sd_stats(struct lb_env *env,
sd_stats->overutilized |= sg_stats->overutilized;
#ifdef CONFIG_NUMA_BALANCING
- if (env->sd->flags & SD_NUMA) {
- sd_stats->nr_numa_running += sg_stats->nr_numa_running;
- sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
- }
+ sd_stats->nr_numa_running += sg_stats->nr_numa_running;
+ sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
#endif
}
@@ -10464,11 +10462,8 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->overloaded = 1;
#ifdef CONFIG_NUMA_BALANCING
- /* Only fbq_classify_group() uses this to classify NUMA groups */
- if (sd_flags & SD_NUMA) {
- sgs->nr_numa_running += rq->nr_numa_running;
- sgs->nr_preferred_running += rq->nr_preferred_running;
- }
+ sgs->nr_numa_running += rq->nr_numa_running;
+ sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
if (local_group)
continue;
@@ -11112,8 +11107,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
* load balancing there, aggregate the statistics at current domain
* to be retrieved when load balancing at parent.
*/
- if (env->sd->parent && can_retrieve_stats(env->sd->parent, env->idle))
+ if (env->sd->parent && can_retrieve_stats(env->sd->parent, env->idle)) {
+ memset(&sd_stats, 0, sizeof(sd_stats));
should_prop = true;
+ }
do {
struct sg_lb_stats *sgs = &tmp_sgs;
--
2.43.0
^ permalink raw reply related [flat|nested] 24+ messages in thread* [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (9 preceding siblings ...)
2025-03-16 10:29 ` [RFC PATCH 10/08] sched/fair: Compute nr_{numa,preferred}_running for non-NUMA domains K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last K Prateek Nayak
` (5 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
The combination of "stats_lock" and the jiffy-based "last_update" is
not scalable for newidle balance. Instead, move to a versioning-based
scheme where the version number lets readers see consistent data
without taking a lock, and serves writers both as a lock and as an
indication of stats freshness.
Additional semantics have been added for writers to take over and
update stale stats once the time elapsed since the last update crosses
the 50us threshold.
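A corresponding writer-side sketch (again hypothetical userspace C11
rather than the kernel implementation; the cache_stats() helper and
the stats_slot payload are invented, and the 50us threshold is spelled
out in nanoseconds) shows the cmpxchg-to-LLONG_MIN locking and the
single-retry rule:

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Slot semantics: 0 = stale, LLONG_MIN = locked, > 0 = timestamp (ns). */
#define STALE_THRESH_NS	(50 * 1000L)

static int cache_stats(atomic_long *version, long now_ns,
		       long *stats_slot, long new_stats)
{
	long lock, v = atomic_load_explicit(version, memory_order_acquire);

	if (v < 0)		/* raced with a concurrent writer */
		return 0;

	/* Version is still fresh, no need to be rude yet. */
	if (v > 0 && now_ns - v <= STALE_THRESH_NS)
		return 0;
retry:
	lock = v;
	/* Grab the slot: version -> LLONG_MIN. Acquire pairs with readers. */
	if (!atomic_compare_exchange_strong_explicit(version, &lock, LLONG_MIN,
						     memory_order_acquire,
						     memory_order_relaxed)) {
		if (!lock) {	/* slot was freed meanwhile: one retry only */
			v = 0;
			goto retry;
		}
		return 0;
	}

	*stats_slot = new_stats;	/* publish the payload */

	/* Release the slot with a fresh timestamp as the new version. */
	atomic_store_explicit(version, now_ns, memory_order_release);
	return 1;
}
```

On a failed cmpxchg, `lock` holds the observed value: zero means the
slot just became free and is worth exactly one more attempt; anything
else means another writer won and this instance backs off.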
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 83 +++++++++++++++++++++++++++--------------
kernel/sched/sched.h | 22 ++++++++++-
kernel/sched/topology.c | 3 +-
3 files changed, 77 insertions(+), 31 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d09f900a3107..6c486e194a9d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10275,11 +10275,13 @@ sched_reduced_capacity(struct rq *rq, struct sched_domain *sd)
return check_cpu_capacity(rq, sd);
}
-static inline void cache_sd_stats(struct sched_domain *sd, struct sg_lb_stats *sd_stats)
+static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_stats)
{
- struct sched_domain_shared *sd_share = sd->shared;
- unsigned long current_jiffy = jiffies;
+ struct sched_domain_shared *sd_share = env->sd->shared;
struct sg_lb_stats_prop *lb_prop;
+ int cpu, retry_limit = 3;
+ u64 time, lock;
+ long version;
if (!sd_share)
return;
@@ -10288,23 +10290,52 @@ static inline void cache_sd_stats(struct sched_domain *sd, struct sg_lb_stats *s
if (!lb_prop)
return;
- /* Concurrent load balancing instance already updated the stats */
- if (READ_ONCE(lb_prop->last_update) == current_jiffy)
+ version = atomic_long_read_acquire(&lb_prop->version);
+ if (version < 0) /* Raced with a concurrent update. */
return;
- scoped_guard(raw_spinlock_irqsave_try, &lb_prop->stats_lock) {
- if (READ_ONCE(lb_prop->last_update) == current_jiffy)
- break;
+ guard(irqsave)(); /* Minimize interruptions. */
+
+ cpu = smp_processor_id();
+ time = sched_clock_cpu(cpu);
- lb_prop->sg_stats = *sd_stats;
+ /* Version is still fresh, no need to be rude yet. */
+ if (version > 0 && (s64)(time - version) <= 50 * NSEC_PER_USEC)
+ return;
+retry:
+ /*
+ * Try to grab the stats for update. If the cmpxchg fails,
+ * a concurrent writer succeeded to grab the stats before
+ * this load balancing instance did. The acquire ordering
+ * also pairs against readers checking the version after
+ * reading the stats to ensure consistent state.
+ */
+ lock = atomic_long_cmpxchg_acquire(&lb_prop->version, version, LLONG_MIN);
+
+ /* Someone else grabbed the version. */
+ if (lock != version) {
/*
- * Pairs against readers checking the last_update
- * before reading the cached stats.
+ * Version is up for grabs! Try again. If the CPU grabs
+ * the lock next time around lock = version = 0 and this
+ * is skipped. If it cannot grab the version, lock != 0
+ * and we return from here thus ensuring on a single
+ * retry.
*/
- smp_wmb();
- WRITE_ONCE(lb_prop->last_update, current_jiffy);
+ if (!lock) {
+ version = 0;
+ goto retry;
+ }
+ return;
}
+
+ lb_prop->sg_stats = *sd_stats;
+
+ /*
+ * Pairs against readers checking the version
+ * before reading the stats.
+ */
+ atomic_long_set_release(&lb_prop->version, time);
}
static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
@@ -10346,8 +10377,8 @@ static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type
static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_stats *sg_stats)
{
struct sched_domain_shared *sg_share = group->shared;
- unsigned long current_jiffy = jiffies;
struct sg_lb_stats_prop *lb_prop;
+ long version;
if (!sg_share)
return 0;
@@ -10356,24 +10387,22 @@ static inline int retrieve_cached_stats(struct sched_group *group, struct sg_lb_
if (!lb_prop)
return 0;
- /* Stale stats */
- if (READ_ONCE(lb_prop->last_update) != current_jiffy)
- return 0;
-
/*
- * Pairs against the update to sgs_prop->last_update to
- * prevent readers from seeing an inconsistent value of
- * the propagated stats from a concurrent update.
+ * Pairs with writer atomically updating version after
+ * writing the stats.
*/
- smp_rmb();
+ version = atomic_long_read_acquire(&lb_prop->version);
+ if (version <= 0) /* Stats have gone stale / being updated. */
+ return 0;
+
*sg_stats = lb_prop->sg_stats;
/*
- * If stats were read in the same interval, it cannot
- * read an inconsistent state since stats are only
- * updated once per jiffy.
+ * Pairs with writer atomically invalidating a version
+ * before updating the stats.
*/
- return time_before_eq(jiffies, current_jiffy);
+ smp_rmb();
+ return atomic_long_read(&lb_prop->version) == version;
}
/**
@@ -11155,7 +11184,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
sg_overutilized = sd_stats.overutilized;
sum_util = sd_stats.group_util;
- cache_sd_stats(env->sd, &sd_stats);
+ cache_sd_stats(env, &sd_stats);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 391c4180eeb3..64f7e013fd59 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2176,8 +2176,26 @@ struct sg_lb_stats {
* sched_domain load balancing statistics up the hierarchy.
*/
struct sg_lb_stats_prop {
- raw_spinlock_t stats_lock; /* Lock for updating the cached stats */
- unsigned long last_update; /* Time when stats was last updated (jiffies) */
+ /*
+ * Stats version has the following semantics:
+ *
+ * When 0, stats are considered stale. A writer can lock the
+ * stats by atomically changing it to LLONG_MIN. Once the
+ * stats are written, the version is atomically updated to the
+ * value returned by sched_clock_cpu().
+ *
+ * If the reader finds a positive value for version, the stats
+ * are considered to be fresh and the reader will copy it for
+ * load balancing. The version seen before and after the read
+ * is compared to ensure the stats copied are consistent.
+ *
+ * Since invalidations under uncertain circumstances can take a
+ * long time, a rude writer can always attempt to take over the
+ * stats by atomically updating the version to LLONG_MIN if it
+ * finds a large difference between a valid version and the
+ * value returned by sched_clock_cpu().
+ */
+ atomic_long_t version;
struct sg_lb_stats sg_stats; /* Cached sched_group stats */
};
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index aeb55f66e8d6..2e72ef8d8d8e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2304,8 +2304,7 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
if (!sg_stats)
return -ENOMEM;
- raw_spin_lock_init(&sg_stats->stats_lock);
- sg_stats->last_update = 0;
+ atomic_long_set(&sg_stats->version, 0);
sds->private = (void *)sg_stats;
sg = kzalloc_node(sizeof(struct sched_group) + cpumask_size(),
--
2.43.0
* [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (10 preceding siblings ...)
2025-03-16 10:29 ` [RFC PATCH 11/08] sched/fair: Move from "last_update" to stats versioning K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done K Prateek Nayak
` (4 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
Record which CPU updated the stats last. This will be used to invalidate
the stats in the following commits.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 5 +++--
kernel/sched/sched.h | 1 +
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c486e194a9d..2a34d73d824b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10279,9 +10279,9 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
{
struct sched_domain_shared *sd_share = env->sd->shared;
struct sg_lb_stats_prop *lb_prop;
- int cpu, retry_limit = 3;
u64 time, lock;
long version;
+ int cpu;
if (!sd_share)
return;
@@ -10319,7 +10319,7 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
* Version is up for grabs! Try again. If the CPU grabs
* the lock next time around lock = version = 0 and this
* is skipped. If it cannot grab the version, lock != 0
- * and we return from here thus ensuring on a single
+ * and we return from here thus ensuring only a single
* retry.
*/
if (!lock) {
@@ -10330,6 +10330,7 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
}
lb_prop->sg_stats = *sd_stats;
+ lb_prop->update_cpu = cpu;
/*
* Pairs against readers checking the version
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 64f7e013fd59..adf4fa2ed031 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2197,6 +2197,7 @@ struct sg_lb_stats_prop {
*/
atomic_long_t version;
struct sg_lb_stats sg_stats; /* Cached sched_group stats */
+ int update_cpu; /* CPU that updated the stats */
};
static inline struct cpumask *sched_group_span(struct sched_group *sg)
--
2.43.0
* [RFC PATCH 13/08] sched/fair: Invalidate stats once the load balancing instance is done
2025-03-13 9:37 [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy K Prateek Nayak
` (11 preceding siblings ...)
2025-03-16 10:29 ` [RFC PATCH 12/08] sched/fair: Record the cpu that updated the stats last K Prateek Nayak
@ 2025-03-16 10:29 ` K Prateek Nayak
2025-03-16 10:29 ` [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse K Prateek Nayak
` (3 subsequent siblings)
16 siblings, 0 replies; 24+ messages in thread
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
The CPU doing the load balancing propagates the stats bottom-up,
reusing them as it traverses up the hierarchy. Once it is done, or if a
decision is taken to migrate tasks in a way that affects the stats, the
old version needs to be invalidated so that a newer CPU with a more
recent view can recompute and cache it.
Invalidate the old version once the load balancing instance is done.
Rudely take over the stats if another CPU sees that the cached stats
are older than 50us.
This allows idle and newidle balance to propagate at the very least
their local stats up the hierarchy.
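A minimal sketch of the invalidation step (hypothetical userspace C11;
the invalidate_stats() name is invented, and the real code also
consults the update_cpu field from the previous patch to decide
ownership): the balancing CPU clears the version back to 0 with a
cmpxchg so it never clobbers a version published by a newer writer:

```c
#include <assert.h>
#include <stdatomic.h>

/* Invalidate our own published version once the balancing pass is
 * done, so a CPU with a more recent view recomputes the stats. The
 * cmpxchg leaves a newer writer's version untouched. */
static void invalidate_stats(atomic_long *version, long my_version)
{
	long expected = my_version;

	atomic_compare_exchange_strong_explicit(version, &expected, 0,
						memory_order_release,
						memory_order_relaxed);
}
```

If the cmpxchg fails, a concurrent writer already replaced the version
with a fresher timestamp, which is exactly the state we want to keep.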
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) versioning[pct imp](CV)
1-groups 1.00 [ -0.00](10.12) 1.00 [ 0.44](13.86)
2-groups 1.00 [ -0.00]( 6.92) 1.04 [ -4.32]( 3.00)
4-groups 1.00 [ -0.00]( 3.14) 1.00 [ -0.21]( 2.16)
8-groups 1.00 [ -0.00]( 1.35) 1.01 [ -1.25]( 1.32)
16-groups 1.00 [ -0.00]( 1.32) 1.01 [ -0.50]( 2.00)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ 0.00]( 0.43) 0.98 [ -1.65]( 0.15)
2 1.00 [ 0.00]( 0.58) 1.01 [ 1.27]( 0.49)
4 1.00 [ 0.00]( 0.54) 1.00 [ 0.47]( 0.40)
8 1.00 [ 0.00]( 0.49) 1.00 [ -0.44]( 1.18)
16 1.00 [ 0.00]( 1.06) 1.00 [ -0.07]( 1.14)
32 1.00 [ 0.00]( 1.27) 1.00 [ 0.02]( 0.11)
64 1.00 [ 0.00]( 1.54) 0.99 [ -1.12]( 1.09)
128 1.00 [ 0.00]( 0.38) 0.98 [ -2.43]( 1.00)
256 1.00 [ 0.00]( 1.85) 0.99 [ -0.50]( 0.94)
512 1.00 [ 0.00]( 0.31) 0.99 [ -1.03]( 0.35)
1024 1.00 [ 0.00]( 0.19) 0.99 [ -0.56]( 0.42)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) versioning[pct imp](CV)
Copy 1.00 [ 0.00](11.31) 1.08 [ 7.51]( 4.74)
Scale 1.00 [ 0.00]( 6.62) 1.00 [ -0.31]( 7.45)
Add 1.00 [ 0.00]( 7.06) 1.02 [ 2.50]( 7.34)
Triad 1.00 [ 0.00]( 8.91) 1.08 [ 7.78]( 2.88)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) versioning[pct imp](CV)
Copy 1.00 [ 0.00]( 2.01) 1.02 [ 1.82]( 1.26)
Scale 1.00 [ 0.00]( 1.49) 1.00 [ 0.26]( 0.80)
Add 1.00 [ 0.00]( 2.67) 1.01 [ 0.98]( 1.29)
Triad 1.00 [ 0.00]( 2.19) 1.02 [ 2.06]( 1.01)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) versioning[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.43) 0.99 [ -0.72]( 0.81)
2-clients 1.00 [ 0.00]( 1.02) 1.00 [ -0.09]( 1.11)
4-clients 1.00 [ 0.00]( 0.83) 1.00 [ 0.31]( 0.29)
8-clients 1.00 [ 0.00]( 0.73) 1.00 [ -0.25]( 0.61)
16-clients 1.00 [ 0.00]( 0.97) 1.00 [ -0.26]( 0.89)
32-clients 1.00 [ 0.00]( 0.88) 0.99 [ -0.61]( 0.82)
64-clients 1.00 [ 0.00]( 1.49) 0.99 [ -1.11]( 1.77)
128-clients 1.00 [ 0.00]( 1.05) 1.00 [ -0.03]( 1.13)
256-clients 1.00 [ 0.00]( 3.85) 1.00 [ -0.24]( 2.63)
512-clients 1.00 [ 0.00](59.63) 0.99 [ -0.74](59.01)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ -0.00]( 6.67) 0.93 [ 6.67](15.25)
2 1.00 [ -0.00](10.18) 0.83 [ 17.39]( 7.15)
4 1.00 [ -0.00]( 4.49) 1.04 [ -4.26]( 6.12)
8 1.00 [ -0.00]( 6.68) 1.06 [ -5.66](12.98)
16 1.00 [ -0.00]( 1.87) 1.00 [ -0.00]( 3.38)
32 1.00 [ -0.00]( 4.01) 0.98 [ 2.20]( 4.79)
64 1.00 [ -0.00]( 3.21) 1.02 [ -1.68]( 0.84)
128 1.00 [ -0.00](44.13) 1.16 [-15.98](14.99)
256 1.00 [ -0.00](14.46) 0.90 [ 9.99](17.45)
512 1.00 [ -0.00]( 1.95) 0.98 [ 1.54]( 1.13)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ 0.00]( 0.46) 1.00 [ 0.00]( 0.26)
2 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.15)
4 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.30)
8 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.26)
16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 3.40) 1.06 [ 5.93]( 1.22)
64 1.00 [ 0.00]( 7.09) 1.00 [ 0.00]( 0.20)
128 1.00 [ 0.00]( 0.00) 0.98 [ -1.52]( 0.34)
256 1.00 [ 0.00]( 1.12) 0.98 [ -2.41]( 1.19)
512 1.00 [ 0.00]( 0.22) 1.00 [ 0.00]( 0.43)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ -0.00](19.72) 1.00 [ -0.00]( 8.37)
2 1.00 [ -0.00](15.96) 1.09 [ -9.09](11.08)
4 1.00 [ -0.00]( 3.87) 1.15 [-15.38](17.44)
8 1.00 [ -0.00]( 8.15) 0.92 [ 8.33]( 8.85)
16 1.00 [ -0.00]( 3.87) 1.23 [-23.08]( 5.59)
32 1.00 [ -0.00](12.99) 0.73 [ 26.67](16.75)
64 1.00 [ -0.00]( 6.20) 1.25 [-25.00]( 2.63)
128 1.00 [ -0.00]( 0.96) 1.62 [-62.37]( 1.30)
256 1.00 [ -0.00]( 2.76) 0.82 [ 17.89](10.56)
512 1.00 [ -0.00]( 0.20) 1.00 [ -0.00]( 0.34)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) versioning[pct imp](CV)
1 1.00 [ -0.00]( 1.07) 1.02 [ -2.34]( 0.13)
2 1.00 [ -0.00]( 0.14) 1.04 [ -3.97]( 0.13)
4 1.00 [ -0.00]( 1.39) 1.03 [ -3.15]( 0.13)
8 1.00 [ -0.00]( 0.36) 1.03 [ -3.43]( 0.66)
16 1.00 [ -0.00]( 1.18) 0.99 [ 0.79]( 1.22)
32 1.00 [ -0.00]( 8.42) 0.82 [ 18.29]( 9.02)
64 1.00 [ -0.00]( 4.85) 1.00 [ -0.44]( 1.61)
128 1.00 [ -0.00]( 0.28) 1.06 [ -5.64]( 1.10)
256 1.00 [ -0.00](10.52) 0.81 [ 19.18](12.55)
512 1.00 [ -0.00]( 0.69) 1.00 [ 0.33]( 1.27)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -0.76%
ycsb-mongodb 0.49%
deathstarbench-1x -2.37%
deathstarbench-2x 0.12%
deathstarbench-3x 2.30%
deathstarbench-6x 1.88%
hammerdb+mysql 16VU 3.85%
hammerdb+mysql 64VU 0.27%
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 92 ++++++++++++++++++++++++++++++++++-----------
1 file changed, 71 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a34d73d824b..31501b933d45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10341,17 +10341,6 @@ static inline void cache_sd_stats(struct lb_env *env, struct sg_lb_stats *sd_sta
static inline int can_retrieve_stats(struct sched_domain *sd, enum cpu_idle_type idle)
{
- /*
- * Only under perioric load balancing can we ensure that no concurrent
- * CPUs modifies the stats being propagated upwards since
- * should_we_balance() can allow multiple concurrent newidle balance
- * to progress and an idle -> busy transition for idle balance will
- * require the stats to be recomputed since idleness metrics will
- * change with migration.
- */
- if (idle)
- return 0;
-
/*
* If individual groups are separate NUMA domains, migrations can cause
* preferred task statistics to change and will require recomputing of
@@ -10422,8 +10411,6 @@ static inline void aggregate_sd_stats(struct lb_env *env,
sd_stats->sum_h_nr_running += sg_stats->sum_h_nr_running;
sd_stats->sum_nr_running += sg_stats->sum_nr_running;
sd_stats->idle_cpus += sg_stats->idle_cpus;
- sd_stats->group_capacity += sg_stats->group_capacity;
- sd_stats->group_weight += sg_stats->group_weight;
sd_stats->overloaded |= sg_stats->overloaded;
sd_stats->overutilized |= sg_stats->overutilized;
@@ -10431,6 +10418,52 @@ static inline void aggregate_sd_stats(struct lb_env *env,
sd_stats->nr_numa_running += sg_stats->nr_numa_running;
sd_stats->nr_preferred_running += sg_stats->nr_preferred_running;
#endif
+
+ if (env->idle &&
+ sg_stats->group_misfit_task_load > sd_stats->group_misfit_task_load)
+ sd_stats->group_misfit_task_load = sg_stats->group_misfit_task_load;
+}
+
+static inline void __invalidate_stats(struct sched_domain *sd)
+{
+ struct sched_domain_shared *sd_share = sd->shared;
+ struct sg_lb_stats_prop *lb_prop;
+ long version;
+
+ if (!sd_share)
+ return;
+
+ lb_prop = (struct sg_lb_stats_prop *)sd_share->private;
+ if (!lb_prop)
+ return;
+
+ /*
+ * The acquire ordering pairs against the writer updating the
+ * "update_cpu" before setting a valid version.
+ */
+ version = atomic_long_read_acquire(&lb_prop->version);
+ if (version <= 0) /* Stats are invalidated / being updated. */
+ return;
+
+ guard(irqsave)();
+
+ /*
+ * Stats were not updated by this CPU. Leave it to the
+ * update_cpu to clean it up.
+ */
+ if (lb_prop->update_cpu != smp_processor_id())
+ return;
+
+ /* Invalidate the stats. */
+ atomic_long_cmpxchg_release(&lb_prop->version, version, 0);
+}
+
+static inline void invalidate_below(struct sched_domain *sd)
+{
+ struct sched_domain *child;
+
+ for (child = sd->child; child; child = child->child)
+ __invalidate_stats(child);
}
/**
@@ -10495,8 +10528,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->nr_numa_running += rq->nr_numa_running;
sgs->nr_preferred_running += rq->nr_preferred_running;
#endif
- if (local_group)
- continue;
if (sd_flags & SD_ASYM_CPUCAPACITY) {
/* Check for a misfit task on the cpu */
@@ -10511,10 +10542,13 @@ static inline void update_sg_lb_stats(struct lb_env *env,
}
}
+group_classify:
sgs->group_capacity = group->sgc->capacity;
-
sgs->group_weight = group->group_weight;
+ if (local_group || !env->idle)
+ sgs->group_misfit_task_load = 0;
+
/* Check if dst CPU is idle and preferred to this group */
if (!local_group && env->idle && sgs->sum_h_nr_running &&
sched_group_asym(env, sgs, group))
@@ -10524,7 +10558,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (!local_group && smt_balance(env, sgs, group))
sgs->group_smt_balance = 1;
-group_classify:
sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
/* Computing avg_load makes sense only when group is overloaded */
@@ -11167,9 +11200,10 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
/* Now, start updating sd_lb_stats */
+ sds->total_capacity += sgs->group_capacity;
+
if (!should_prop) {
sds->total_load += sgs->group_load;
- sds->total_capacity += sgs->group_capacity;
sg_overloaded |= sgs->overloaded;
sg_overutilized |= sgs->overutilized;
sum_util += sgs->group_util;
@@ -11180,7 +11214,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
if (should_prop) {
sds->total_load = sd_stats.group_load;
- sds->total_capacity = sd_stats.group_capacity;
sg_overloaded = sd_stats.overloaded;
sg_overutilized = sd_stats.overutilized;
sum_util = sd_stats.group_util;
@@ -11947,6 +11980,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (cur_ld_moved) {
attach_tasks(&env);
+ /*
+ * If tasks have moved to an idle CPU, the idleness
+ * metrics have changed. Invalidate stats for the
+ * next instance to compute them afresh.
+ */
+ if (env.idle)
+ __invalidate_stats(env.sd);
ld_moved += cur_ld_moved;
}
@@ -12308,12 +12348,12 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
int continue_balancing = 1;
int cpu = rq->cpu;
int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
+ int need_serialize, need_decay = 0, invalidate = 1;
unsigned long interval, prev_sd_next_balance = 0;
struct sched_domain *sd;
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -12333,6 +12373,11 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
* actively.
*/
if (!continue_balancing) {
+ if (invalidate) {
+ invalidate_below(sd);
+ invalidate = 0;
+ }
+
if (need_decay)
continue;
break;
@@ -12369,6 +12414,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
next_balance = sd->last_balance + interval;
update_next_balance = 1;
}
+
+ if (!sd->parent)
+ invalidate_below(sd);
}
if (need_decay) {
/*
@@ -12987,8 +13035,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
* Stop searching for tasks to pull if there are
* now runnable tasks on this rq.
*/
- if (pulled_task || !continue_balancing)
+ if (pulled_task || !continue_balancing || !sd->parent) {
+ invalidate_below(sd);
break;
+ }
}
rcu_read_unlock();
--
2.43.0
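For readers skimming the archive, here is a minimal userspace sketch of the version-based cache protocol that the __invalidate_stats() hunks above implement: a writer publishes stats with a positive version, readers reuse them only while a positive version is visible, and invalidation uses a compare-and-swap so a stale invalidator cannot clobber a newer publication. All names, types, and the single-publisher assumption here are illustrative; this is not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct stats_cache {
	atomic_long version;   /* <= 0: invalid / being updated */
	int update_cpu;        /* CPU that last published the stats */
	long cached_load;      /* stand-in for the propagated stats */
};

/* Publish fresh stats, making them visible under a new version. */
static void publish_stats(struct stats_cache *c, int cpu, long load, long ver)
{
	c->update_cpu = cpu;
	c->cached_load = load;
	/* Release pairs with the acquire in try_reuse_stats(). */
	atomic_store_explicit(&c->version, ver, memory_order_release);
}

/* Reader: reuse cached stats only if a valid version is visible. */
static bool try_reuse_stats(struct stats_cache *c, long *load)
{
	long v = atomic_load_explicit(&c->version, memory_order_acquire);

	if (v <= 0)
		return false;
	*load = c->cached_load;
	return true;
}

/* Only the publishing CPU may invalidate, mirroring the patch. */
static bool invalidate_stats(struct stats_cache *c, int cpu)
{
	long v = atomic_load_explicit(&c->version, memory_order_acquire);

	if (v <= 0 || c->update_cpu != cpu)
		return false;
	/* Lose the race gracefully if a newer version has appeared. */
	return atomic_compare_exchange_strong(&c->version, &v, 0);
}

/* Self-check exercising publish, reuse, and both invalidate paths. */
static int demo(void)
{
	struct stats_cache c = {0};
	long load = 0;

	if (try_reuse_stats(&c, &load))     /* nothing published yet */
		return 0;
	publish_stats(&c, 3, 42, 1);
	if (!try_reuse_stats(&c, &load) || load != 42)
		return 0;
	if (invalidate_stats(&c, 5))        /* wrong CPU must back off */
		return 0;
	if (!invalidate_stats(&c, 3))       /* publishing CPU succeeds */
		return 0;
	return !try_reuse_stats(&c, &load); /* stats now invalid */
}
```

The compare-and-swap plays the role of atomic_long_cmpxchg_release() in the patch: an invalidator that raced with a newer publication simply leaves the fresher stats in place.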
* [RFC PATCH 14/08] [DEBUG] sched/fair: Add more lb_stats around lb_time and stats reuse
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
Add stats for load balancing time and stats reuse efficiency.
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
include/linux/sched/topology.h | 5 +++++
kernel/sched/fair.c | 21 ++++++++++++++++++++-
kernel/sched/stats.c | 9 +++++++--
3 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index a16d7d9dd9d3..dea65eb263c6 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -123,6 +123,11 @@ struct sched_domain {
unsigned int lb_hot_gained[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyg[CPU_MAX_IDLE_TYPES];
unsigned int lb_nobusyq[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_min_time[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_max_time[CPU_MAX_IDLE_TYPES];
+ unsigned long lb_total_time[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_stats_reused[CPU_MAX_IDLE_TYPES];
+ unsigned int lb_stats_recomputed[CPU_MAX_IDLE_TYPES];
/* Active load balancing */
unsigned int alb_count;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 31501b933d45..bb7b21421415 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10491,10 +10491,13 @@ static inline void update_sg_lb_stats(struct lb_env *env,
* group_smt_balance is not possible under busy load balancing.
*/
if (can_retrieve_stats(env->sd, env->idle) &&
- retrieve_cached_stats(group, sgs))
+ retrieve_cached_stats(group, sgs)) {
+ schedstat_inc(env->sd->lb_stats_reused[env->idle]);
goto group_classify;
+ }
memset(sgs, 0, sizeof(*sgs));
+ schedstat_inc(env->sd->lb_stats_recomputed[env->idle]);
for_each_cpu_and(i, sched_group_span(group), env->cpus) {
struct rq *rq = cpu_rq(i);
@@ -11901,6 +11904,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
{
int ld_moved, cur_ld_moved, active_balance = 0;
struct sched_domain *sd_parent = sd->parent;
+ u64 lb_start = sched_clock_cpu(this_cpu);
struct sched_group *group;
struct rq *busiest;
struct rq_flags rf;
@@ -12174,6 +12178,21 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if (schedstat_enabled()) {
+ u64 now = sched_clock_cpu(this_cpu);
+ u64 elapsed = now - lb_start;
+
+ if (!schedstat_val(sd->lb_min_time[idle]) ||
+ elapsed < schedstat_val(sd->lb_min_time[idle]))
+ __schedstat_set(sd->lb_min_time[idle], elapsed);
+
+ if (!schedstat_val(sd->lb_max_time[idle]) ||
+ elapsed > schedstat_val(sd->lb_max_time[idle]))
+ __schedstat_set(sd->lb_max_time[idle], elapsed);
+
+ __schedstat_add(sd->lb_total_time[idle], elapsed);
+ }
+
return ld_moved;
}
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 4346fd81c31f..b2ace3c51062 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -141,7 +141,7 @@ static int show_schedstat(struct seq_file *seq, void *v)
seq_printf(seq, "domain%d %s %*pb", dcount++, sd->name,
cpumask_pr_args(sched_domain_span(sd)));
for (itype = 0; itype < CPU_MAX_IDLE_TYPES; itype++) {
- seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u",
+ seq_printf(seq, " %u %u %u %u %u %u %u %u %u %u %u %u %u %lu %u %u",
sd->lb_count[itype],
sd->lb_balanced[itype],
sd->lb_failed[itype],
@@ -152,7 +152,12 @@ static int show_schedstat(struct seq_file *seq, void *v)
sd->lb_gained[itype],
sd->lb_hot_gained[itype],
sd->lb_nobusyq[itype],
- sd->lb_nobusyg[itype]);
+ sd->lb_nobusyg[itype],
+ sd->lb_min_time[itype],
+ sd->lb_max_time[itype],
+ sd->lb_total_time[itype],
+ sd->lb_stats_reused[itype],
+ sd->lb_stats_recomputed[itype]);
}
seq_printf(seq,
" %u %u %u %u %u %u %u %u %u %u %u %u\n",
--
2.43.0
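Since the seq_printf() format above grows from 11 to 16 per-idle-type fields, any consumer of /proc/schedstat needs matching parsing. Below is a hedged sketch of a userspace helper for the five new fields; the field order follows this debug patch only and is not an established schedstat ABI, and the helper names are made up for illustration.

```c
#include <stdio.h>
#include <stdbool.h>

/* Counters appended per idle type by the patch above (order matters). */
struct lb_timing {
	unsigned int min_time, max_time;
	unsigned long total_time;
	unsigned int reused, recomputed;
};

/*
 * Parse the five fields the modified show_schedstat() emits after the
 * existing eleven per-idle-type counters; "tail" points just past them.
 */
static bool parse_lb_timing(const char *tail, struct lb_timing *t)
{
	return sscanf(tail, "%u %u %lu %u %u",
		      &t->min_time, &t->max_time, &t->total_time,
		      &t->reused, &t->recomputed) == 5;
}

/* Fraction of update_sg_lb_stats() invocations served from the cache. */
static double reuse_ratio(const struct lb_timing *t)
{
	unsigned long total = (unsigned long)t->reused + t->recomputed;

	return total ? (double)t->reused / total : 0.0;
}

/* Self-check with a hand-written sample of the five new fields. */
static int demo(void)
{
	struct lb_timing t;

	if (!parse_lb_timing("12 340 5600 75 25", &t))
		return 0;
	if (t.min_time != 12 || t.max_time != 340 || t.total_time != 5600)
		return 0;
	return reuse_ratio(&t) == 0.75;
}
```

The reuse ratio is the quantity the lb_stats_reused/lb_stats_recomputed pair is meant to expose: how often the propagation machinery actually avoided a recompute.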
* [RFC PATCH 15/08] [DEBUG] tools/lib/perf: Extend schedstats v17 headers to include the new debug fields
From: K Prateek Nayak @ 2025-03-16 10:29 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Chen Yu,
linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak
The previous commit hacked up schedstats v17 to add more fields.
Extend the corresponding perf sched stats header file for analysis.
These changes depend on the perf sched stats tooling being developed in [1].
Link: https://lore.kernel.org/lkml/20250311120230.61774-1-swapnil.sapkal@amd.com/ [1]
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
tools/lib/perf/include/perf/schedstat-v17.h | 30 +++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/tools/lib/perf/include/perf/schedstat-v17.h b/tools/lib/perf/include/perf/schedstat-v17.h
index 00009bd5f006..888dfa982a55 100644
--- a/tools/lib/perf/include/perf/schedstat-v17.h
+++ b/tools/lib/perf/include/perf/schedstat-v17.h
@@ -47,6 +47,16 @@ DOMAIN_FIELD(__u32, busy_lb_nobusyq,
"load_balance() failed to find busier queue on cpu busy", "%11u", true, v17);
DOMAIN_FIELD(__u32, busy_lb_nobusyg,
"load_balance() failed to find busier group on cpu busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_min_time,
+ "load_balance() min time to balance on busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_max_time,
+ "load_balance() max time to balance on busy", "%11u", true, v17);
+DOMAIN_FIELD(unsigned long, busy_lb_total_time,
+ "load_balance() total time to balance on busy", "%11lu", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_stats_reused,
+ "load_balance() stats reused on busy", "%11u", true, v17);
+DOMAIN_FIELD(__u32, busy_lb_stats_recompute,
+ "load_balance() stats recomputed on busy", "%11u", true, v17);
#ifdef DERIVED_CNT_FIELD
DERIVED_CNT_FIELD("load_balance() success count on cpu busy", "%11u",
busy_lb_count, busy_lb_balanced, busy_lb_failed, v17);
@@ -80,6 +90,16 @@ DOMAIN_FIELD(__u32, idle_lb_nobusyq,
"load_balance() failed to find busier queue on cpu idle", "%11u", true, v17);
DOMAIN_FIELD(__u32, idle_lb_nobusyg,
"load_balance() failed to find busier group on cpu idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_min_time,
+ "load_balance() min time to balance on idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_max_time,
+ "load_balance() max time to balance on idle", "%11u", true, v17);
+DOMAIN_FIELD(unsigned long, idle_lb_total_time,
+ "load_balance() total time to balance on idle", "%11lu", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_stats_reused,
+ "load_balance() stats reused on idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, idle_lb_stats_recompute,
+ "load_balance() stats recomputed on idle", "%11u", true, v17);
#ifdef DERIVED_CNT_FIELD
DERIVED_CNT_FIELD("load_balance() success count on cpu idle", "%11u",
idle_lb_count, idle_lb_balanced, idle_lb_failed, v17);
@@ -113,6 +133,16 @@ DOMAIN_FIELD(__u32, newidle_lb_nobusyq,
"load_balance() failed to find busier queue on cpu newly idle", "%11u", true, v17);
DOMAIN_FIELD(__u32, newidle_lb_nobusyg,
"load_balance() failed to find busier group on cpu newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_min_time,
+ "load_balance() min time to balance on newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_max_time,
+ "load_balance() max time to balance on newly idle", "%11u", true, v17);
+DOMAIN_FIELD(unsigned long, newidle_lb_total_time,
+ "load_balance() total time to balance on newly idle", "%11lu", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_stats_reused,
+ "load_balance() stats reused on newly idle", "%11u", true, v17);
+DOMAIN_FIELD(__u32, newidle_lb_stats_recompute,
+ "load_balance() stats recomputed on newly idle", "%11u", true, v17);
#ifdef DERIVED_CNT_FIELD
DERIVED_CNT_FIELD("load_balance() success count on cpu newly idle", "%11u",
newidle_lb_count, newidle_lb_balanced, newidle_lb_failed, v17);
--
2.43.0
* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
From: Peter Zijlstra @ 2025-03-17 17:25 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Chen Yu, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde
On Thu, Mar 13, 2025 at 09:37:38AM +0000, K Prateek Nayak wrote:
> tl;dr
>
> This prototype is currently limited in the sense that it can only reuse
> statistics for busy load balancing. Reusing stats for newidle load
> balancing specifically ran into issues elaborated below.
Right, it makes sense for busy load balance, newidle I think:
> David had proposed SHARED_RUNQ [4] to improve on the shortcomings of
> newidle balance for Meta's production workloads.
we need to look at this again. Something around the EEVDF merge made the
thing unhappy -- if we figure out what and fix it, I think this makes
more sense than trying to optimize the current scheme for newidle.
newidle really is about getting *any* work fast, which is a totally
different game than the regular busy balancing.
Anyway, I'll try and have a look through the patches.
* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
From: Chen, Yu C @ 2025-03-17 18:23 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, linux-kernel,
Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde, K Prateek Nayak, yu.c.chen,
yu.chen.surf
On 3/18/2025 1:25 AM, Peter Zijlstra wrote:
> On Thu, Mar 13, 2025 at 09:37:38AM +0000, K Prateek Nayak wrote:
>> tl;dr
>>
>> This prototype is currently limited in the sense that it can only reuse
>> statistics for busy load balancing. Reusing stats for newidle load
>> balancing specifically ran into issues elaborated below.
>
> Right, it makes sense for busy load balance, newidle I think:
>
>> David had proposed SHARED_RUNQ [4] to improve on the shortcomings of
>> newidle balance for Meta's production workloads.
>
> we need to look at this again. Something around the EEVDF merge made the
> thing unhappy -- if we figure out what and fix it, I think this makes
Could you give some links on what the issue is? Does the newly-idle
balance fail to pull tasks after switching to EEVDF? (I don't
see the connection between EEVDF and newly-idle balance off the top of
my head.)
> more sense than trying to optimize the current scheme for newidle.
>
> newidle really is about getting *any* work fast, which is a totally
> different game than the regular busy balancing.
>
The newly-idle balance iterates over every CPU in the domain to find the
busiest one. Would the following work: find a relatively busy CPU and stop
the search early, say, one with rq->nr_running >= 2, and also consider
the candidate task's average duration?
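A rough userspace model of that early-exit idea, assuming a threshold of two runnable tasks; the threshold, the flat array standing in for per-CPU runqueues, and the runner-up fallback are all placeholders rather than a proposed kernel implementation:

```c
#define BUSY_THRESHOLD 2 /* illustrative cut-off, not a tuned value */

/*
 * Scan CPUs in order; return the first one that is "busy enough" to
 * pull from, falling back to the most loaded CPU seen if none
 * crosses the threshold. nr_running[] models per-rq task counts.
 */
static int find_busy_enough_cpu(const int *nr_running, int nr_cpus)
{
	int best = -1, best_nr = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (nr_running[cpu] >= BUSY_THRESHOLD)
			return cpu;              /* early exit */
		if (nr_running[cpu] > best_nr) { /* remember runner-up */
			best_nr = nr_running[cpu];
			best = cpu;
		}
	}
	return best; /* -1 if the whole domain is idle */
}

/* Self-check: early exit, runner-up fallback, and the all-idle case. */
static int demo(void)
{
	int busy[]  = { 0, 1, 3, 2 };
	int light[] = { 0, 1, 0, 0 };
	int idle[]  = { 0, 0 };

	return find_busy_enough_cpu(busy, 4) == 2 &&
	       find_busy_enough_cpu(light, 4) == 1 &&
	       find_busy_enough_cpu(idle, 2) == -1;
}
```

The trade-off the sketch makes visible: the early exit bounds the scan cost, at the price of possibly picking a merely adequate rq instead of the true busiest one.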
thanks,
Chenyu
* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
From: Libo Chen @ 2025-03-21 10:04 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Chen Yu, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde
On 3/13/25 02:37, K Prateek Nayak wrote:
> Benchmark results
> =================
>
Hi Prateek,
Definitely like the idea, esp. if we can pull this off on newidle lb,
which tends to be more problematic on systems with a large number
of cores. But the data below on periodic lb isn't, I guess, as good as
I expected. So I am wondering: did the cost of update_[sd|sg]_lb_stats()
actually go down as a result of the caching?
Thanks,
Libo
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1-groups 1.00 [ -0.00](10.12) 1.09 [ -9.11](11.93)
> 2-groups 1.00 [ -0.00]( 6.92) 1.00 [ -0.22]( 4.57)
> 4-groups 1.00 [ -0.00]( 3.14) 0.99 [ 0.83]( 1.77)
> 8-groups 1.00 [ -0.00]( 1.35) 1.00 [ -0.31]( 2.24)
> 16-groups 1.00 [ -0.00]( 1.32) 0.99 [ 0.84]( 0.67)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1 1.00 [ 0.00]( 0.43) 0.99 [ -0.87]( 1.34)
> 2 1.00 [ 0.00]( 0.58) 1.02 [ 2.14]( 0.29)
> 4 1.00 [ 0.00]( 0.54) 1.01 [ 1.24]( 0.82)
> 8 1.00 [ 0.00]( 0.49) 1.01 [ 0.62]( 0.97)
> 16 1.00 [ 0.00]( 1.06) 1.01 [ 0.94]( 0.70)
> 32 1.00 [ 0.00]( 1.27) 0.99 [ -1.24]( 1.38)
> 64 1.00 [ 0.00]( 1.54) 1.00 [ -0.43]( 0.36)
> 128 1.00 [ 0.00]( 0.38) 1.00 [ -0.01]( 1.22)
> 256 1.00 [ 0.00]( 1.85) 1.02 [ 1.58]( 0.90)
> 512 1.00 [ 0.00]( 0.31) 1.01 [ 0.76]( 1.19)
> 1024 1.00 [ 0.00]( 0.19) 1.00 [ 0.44]( 0.35)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) stats_prop[pct imp](CV)
> Copy 1.00 [ 0.00](11.31) 1.02 [ 1.69]( 6.44)
> Scale 1.00 [ 0.00]( 6.62) 1.01 [ 0.80]( 5.37)
> Add 1.00 [ 0.00]( 7.06) 1.02 [ 1.54]( 6.72)
> Triad 1.00 [ 0.00]( 8.91) 1.01 [ 1.36]( 6.73)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) stats_prop[pct imp](CV)
> Copy 1.00 [ 0.00]( 2.01) 0.98 [ -1.55]( 2.15)
> Scale 1.00 [ 0.00]( 1.49) 1.00 [ 0.23]( 0.58)
> Add 1.00 [ 0.00]( 2.67) 1.01 [ 0.65]( 1.95)
> Triad 1.00 [ 0.00]( 2.19) 1.01 [ 0.61]( 1.37)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Througput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 1.43) 1.00 [ 0.17]( 0.32)
> 2-clients 1.00 [ 0.00]( 1.02) 1.01 [ 1.00]( 0.44)
> 4-clients 1.00 [ 0.00]( 0.83) 1.01 [ 0.62]( 0.36)
> 8-clients 1.00 [ 0.00]( 0.73) 1.00 [ -0.11]( 0.65)
> 16-clients 1.00 [ 0.00]( 0.97) 1.00 [ 0.49]( 0.77)
> 32-clients 1.00 [ 0.00]( 0.88) 1.00 [ 0.30]( 0.94)
> 64-clients 1.00 [ 0.00]( 1.49) 1.00 [ 0.36]( 1.57)
> 128-clients 1.00 [ 0.00]( 1.05) 1.00 [ 0.14]( 1.46)
> 256-clients 1.00 [ 0.00]( 3.85) 1.00 [ -0.04]( 4.85)
> 512-clients 1.00 [ 0.00](59.63) 1.00 [ -0.02](62.28)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1 1.00 [ -0.00]( 6.67) 0.76 [ 24.44](35.80)
> 2 1.00 [ -0.00](10.18) 0.87 [ 13.04](10.38)
> 4 1.00 [ -0.00]( 4.49) 1.04 [ -4.26]( 3.14)
> 8 1.00 [ -0.00]( 6.68) 0.98 [ 1.89]( 8.07)
> 16 1.00 [ -0.00]( 1.87) 1.03 [ -3.28]( 5.21)
> 32 1.00 [ -0.00]( 4.01) 0.98 [ 2.20]( 1.31)
> 64 1.00 [ -0.00]( 3.21) 1.00 [ -0.00]( 3.23)
> 128 1.00 [ -0.00](44.13) 1.06 [ -6.43](113.66)
> 256 1.00 [ -0.00](14.46) 1.04 [ -3.52]( 8.43)
> 512 1.00 [ -0.00]( 1.95) 1.02 [ -1.80]( 1.14)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1 1.00 [ 0.00]( 0.46) 1.00 [ 0.00]( 0.55)
> 2 1.00 [ 0.00]( 0.15) 0.99 [ -0.88]( 0.26)
> 4 1.00 [ 0.00]( 0.15) 0.99 [ -0.59]( 0.15)
> 8 1.00 [ 0.00]( 0.15) 0.99 [ -0.88]( 0.26)
> 16 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
> 32 1.00 [ 0.00]( 3.40) 1.07 [ 6.59]( 0.16)
> 64 1.00 [ 0.00]( 7.09) 1.00 [ -0.38]( 0.96)
> 128 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.20)
> 256 1.00 [ 0.00]( 1.12) 1.00 [ -0.30]( 1.50)
> 512 1.00 [ 0.00]( 0.22) 1.05 [ 4.86]( 0.71)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1 1.00 [ -0.00](19.72) 0.85 [ 15.38](16.75)
> 2 1.00 [ -0.00](15.96) 1.00 [ -0.00]( 0.00)
> 4 1.00 [ -0.00]( 3.87) 1.00 [ -0.00]( 4.08)
> 8 1.00 [ -0.00]( 8.15) 1.00 [ -0.00](11.71)
> 16 1.00 [ -0.00]( 3.87) 0.92 [ 7.69]( 4.19)
> 32 1.00 [ -0.00](12.99) 0.73 [ 26.67]( 0.00)
> 64 1.00 [ -0.00]( 6.20) 1.12 [-12.50]( 9.94)
> 128 1.00 [ -0.00]( 0.96) 0.98 [ 1.55]( 0.95)
> 256 1.00 [ -0.00]( 2.76) 0.99 [ 1.45]( 1.38)
> 512 1.00 [ -0.00]( 0.20) 1.20 [-20.42]( 0.00)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) stats_prop[pct imp](CV)
> 1 1.00 [ -0.00]( 1.07) 1.02 [ -2.08]( 0.13)
> 2 1.00 [ -0.00]( 0.14) 1.04 [ -3.97]( 0.13)
> 4 1.00 [ -0.00]( 1.39) 1.03 [ -3.15]( 0.13)
> 8 1.00 [ -0.00]( 0.36) 1.03 [ -3.16]( 0.00)
> 16 1.00 [ -0.00]( 1.18) 1.02 [ -1.59]( 0.75)
> 32 1.00 [ -0.00]( 8.42) 0.81 [ 19.08]( 0.25)
> 64 1.00 [ -0.00]( 4.85) 1.01 [ -1.10]( 2.58)
> 128 1.00 [ -0.00]( 0.28) 1.00 [ -0.21]( 0.38)
> 256 1.00 [ -0.00](10.52) 0.95 [ 4.74]( 6.94)
> 512 1.00 [ -0.00]( 0.69) 1.09 [ -8.99]( 0.27)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
>
> ycsb-cassandra -0.54%
> ycsb-mongodb 0.09%
>
> deathstarbench-1x -0.30%
> deathstarbench-2x 2.38%
> deathstarbench-3x 0.58%
> deathstarbench-6x 0.62%
>
> hammerdb+mysql 16VU 0.76%
> hammerdb+mysql 64VU 0.74%
> ---
>
* Re: [RFC PATCH 0/8] sched/fair: Propagate load balancing stats up the sched domain hierarchy
From: K Prateek Nayak @ 2025-03-24 3:58 UTC (permalink / raw)
To: Libo Chen, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Chen Yu, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, David Vernet, Gautham R. Shenoy,
Swapnil Sapkal, Shrikanth Hegde
Hello Libo,
Thank you for taking a look at the series and sorry for the late
response.
On 3/21/2025 3:34 PM, Libo Chen wrote:
>
>
> On 3/13/25 02:37, K Prateek Nayak wrote:
>
>> Benchmark results
>> =================
>>
>
> Hi Prateek,
>
> Definitely like the idea, esp. if we can pull this off on newidle lb
> which tends to be more problematic on systems with a large number
> of cores. But the data below on periodic lb isn't I guess as good as
> I expect. So I am wondering if the cost of update_[sd|sg]_lb_stats()
> actually went down as the result of the caching?
I have some numbers for the versioning idea that I got working just
before OSPM in [1]. The benchmark results don't move much, but the total
time for newidle balance reduces by ~5% overall.
There is a ~30% overhead of aggregating and propagating the stats
upwards at the SMT domain that offsets some of the benefits of propagation
at higher domains, but I'm working to see if this can be reduced and
only done when required.
Some ideas were discussed at OSPM to reduce the overheads further and
to share the burden of busy load balancing across all the CPUs in the
domain; I'll tackle those next.
If you have any benchmark where this shows up prominently, please do let
me know and I can try adding it to the bunch.
[1] https://lore.kernel.org/lkml/20250316102916.10614-1-kprateek.nayak@amd.com/
--
Thanks and Regards,
Prateek
>
> Thanks,
> Libo