* [Patch v4 01/16] sched/cache: Allow only 1 thread of the process to calculate the LLC occupancy
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Jianyong Wu, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Jianyong Wu <wujianyong@hygon.cn>
Scanning the online CPUs to calculate the LLC occupancy can be
time-consuming. Allow only one thread of a process to scan the
CPUs at a time, similar to what NUMA balancing does in
task_numa_work().
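With this gate, at most one thread of the process performs the scan
per EPOCH_PERIOD window (HZ / 100 jiffies, i.e. 10 ms): the thread
that wins the try_cmpxchg() below advances next_scan, and every other
thread that enters task_cache_work() within that window returns early
without touching the CPU masks.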
Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2010483cd77..6d883f109ba3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
struct sched_cache_time __percpu *pcpu_sched;
raw_spinlock_t lock;
unsigned long epoch;
+ unsigned long next_scan;
int cpu;
} ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f22e5a097cf..a759ea669d74 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1451,6 +1451,7 @@ void mm_init_sched(struct mm_struct *mm,
raw_spin_lock_init(&mm->sc_stat.lock);
mm->sc_stat.epoch = epoch;
mm->sc_stat.cpu = -1;
+ mm->sc_stat.next_scan = jiffies;
/*
* The update to mm->sc_stat should not be reordered
@@ -1661,6 +1662,7 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
static void task_cache_work(struct callback_head *work)
{
+ unsigned long next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
unsigned long m_a_occ = 0;
@@ -1675,6 +1677,15 @@ static void task_cache_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;
+ next_scan = READ_ONCE(mm->sc_stat.next_scan);
+ if (time_before(now, next_scan))
+ return;
+
+ /* only 1 thread is allowed to scan */
+ if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
+ now + EPOCH_PERIOD))
+ return;
+
if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
return;
--
2.32.0
* [Patch v4 02/16] sched/cache: Disable cache aware scheduling for processes with high thread counts
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
A performance regression was observed by Prateek when running hackbench
with many threads per process (high fd count). To avoid this, processes
with a large number of active threads are excluded from cache-aware
scheduling.
With sched_cache enabled, record the number of active threads in each
process during the periodic task_cache_work(). While iterating over
CPUs, if the currently running task belongs to the same process as the
task that launched task_cache_work(), increment the active thread count.
If the number of active threads in the process exceeds the number of
cores in the LLC (i.e. the LLC CPU count divided by the SMT thread
count), do not enable cache-aware scheduling. However, on systems with
a small number of CPUs per LLC, such as Power10/Power11 with SMT4 and
an LLC size of 4, this check effectively disables cache-aware
scheduling for any process.
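To put numbers on this (assuming fits_capacity() keeps its usual ~25%
headroom, i.e. the limit is roughly 0.8 * LLC CPUs / SMT threads): an
LLC with 32 CPUs and SMT2 excludes a process once its average active
thread count reaches about 0.8 * 32 / 2 = 12.8, i.e. ~13 threads; with
SMT4 and an LLC size of 4 CPUs, the limit is 0.8 * 4 / 4 = 0.8, so a
process with even one running thread is excluded.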
One possible solution, suggested by Peter, is to use an LLC mask
instead of a single preferred LLC. Once a 'few' LLCs can be preferred,
this constraint becomes easier to satisfy. That could be a future
enhancement.
For users who wish to perform task aggregation regardless, a debugfs knob
is provided for tuning in a subsequent change.
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 48 ++++++++++++++++++++++++++++++++++++++-----
2 files changed, 44 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d883f109ba3..6701911eaaf7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2423,6 +2423,7 @@ struct sched_cache_stat {
struct sched_cache_time __percpu *pcpu_sched;
raw_spinlock_t lock;
unsigned long epoch;
+ u64 nr_running_avg;
unsigned long next_scan;
int cpu;
} ____cacheline_aligned_in_smp;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a759ea669d74..808f614fc2d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,12 @@ static int llc_id(int cpu)
return per_cpu(sd_llc_id, cpu);
}
+static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+{
+ return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
+ per_cpu(sd_llc_size, cpu));
+}
+
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
{
struct sched_domain *sd;
@@ -1452,7 +1458,7 @@ void mm_init_sched(struct mm_struct *mm,
mm->sc_stat.epoch = epoch;
mm->sc_stat.cpu = -1;
mm->sc_stat.next_scan = jiffies;
-
+ mm->sc_stat.nr_running_avg = 0;
/*
* The update to mm->sc_stat should not be reordered
* before initialization to mm's other fields, in case
@@ -1574,7 +1580,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* If this process hasn't hit task_cache_work() for a while invalidate
* its preferred state.
*/
- if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT) {
+ if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+ invalid_llc_nr(mm, cpu_of(rq))) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
@@ -1660,14 +1667,32 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
cpumask_copy(cpus, cpu_online_mask);
}
+static inline void update_avg_scale(u64 *avg, u64 sample)
+{
+ int factor = per_cpu(sd_llc_size, raw_smp_processor_id());
+ s64 diff = sample - *avg;
+ u32 divisor;
+
+ /*
+ * Scale the divisor based on the number of CPUs contained
+ * in the LLC. This scaling ensures smaller LLC domains use
+ * a smaller divisor to achieve more precise sensitivity to
+ * changes in nr_running, while larger LLC domains are capped
+ * at a maximum divisor of 8 which is the default smoothing
+ * factor of EWMA in update_avg().
+ */
+ divisor = clamp_t(u32, (factor >> 2), 2, 8);
+ *avg += div64_s64(diff, divisor);
+}
+
static void task_cache_work(struct callback_head *work)
{
unsigned long next_scan, now = jiffies;
- struct task_struct *p = current;
+ struct task_struct *p = current, *cur;
+ int cpu, m_a_cpu = -1, nr_running = 0;
+ unsigned long curr_m_a_occ = 0;
struct mm_struct *mm = p->mm;
unsigned long m_a_occ = 0;
- unsigned long curr_m_a_occ = 0;
- int cpu, m_a_cpu = -1;
cpumask_var_t cpus;
WARN_ON_ONCE(work != &p->cache_work);
@@ -1711,6 +1736,11 @@ static void task_cache_work(struct callback_head *work)
m_occ = occ;
m_cpu = i;
}
+
+ cur = rcu_dereference_all(cpu_rq(i)->curr);
+ if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
+ cur->mm == mm)
+ nr_running++;
}
/*
@@ -1754,6 +1784,7 @@ static void task_cache_work(struct callback_head *work)
mm->sc_stat.cpu = m_a_cpu;
}
+ update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
free_cpumask_var(cpus);
}
@@ -10294,6 +10325,13 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
return mig_unrestricted;
+ /* skip cache aware load balance for too many threads */
+ if (invalid_llc_nr(mm, dst_cpu)) {
+ if (mm->sc_stat.cpu != -1)
+ mm->sc_stat.cpu = -1;
+ return mig_unrestricted;
+ }
+
if (cpus_share_cache(dst_cpu, cpu))
to_pref = true;
else if (cpus_share_cache(src_cpu, cpu))
--
2.32.0
* [Patch v4 03/16] sched/cache: Skip cache-aware scheduling for single-threaded processes
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
For a single-threaded process, the current wakeup path already tends
to place the task on the LLC where it was previously running, with its
data still cache-hot. There is no need to enable cache-aware scheduling
for single-threaded processes, for the following reasons:
1. Cache-aware scheduling primarily benefits multi-threaded
processes where threads share data. Single-threaded processes
typically have no inter-thread data sharing and thus gain little.
2. Enabling it incurs the additional overhead of tracking the
thread's residency in the LLCs.
3. Bypassing single-threaded processes avoids excessive
concentration of such tasks on a single LLC.
Nevertheless, this check can be bypassed if users explicitly provide
a hint for single-threaded workloads in which different processes
share memory, e.g., via prctl() or other interfaces to be added in
the future.
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 808f614fc2d2..df21366ba1ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,8 +1384,12 @@ static int llc_id(int cpu)
return per_cpu(sd_llc_id, cpu);
}
-static bool invalid_llc_nr(struct mm_struct *mm, int cpu)
+static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
+ int cpu)
{
+ if (get_nr_threads(p) <= 1)
+ return true;
+
return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
per_cpu(sd_llc_size, cpu));
}
@@ -1581,7 +1585,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* its preferred state.
*/
if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
- invalid_llc_nr(mm, cpu_of(rq))) {
+ invalid_llc_nr(mm, p, cpu_of(rq))) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
@@ -1687,9 +1691,9 @@ static inline void update_avg_scale(u64 *avg, u64 sample)
static void task_cache_work(struct callback_head *work)
{
+ int cpu, m_a_cpu = -1, nr_running = 0, curr_cpu;
unsigned long next_scan, now = jiffies;
struct task_struct *p = current, *cur;
- int cpu, m_a_cpu = -1, nr_running = 0;
unsigned long curr_m_a_occ = 0;
struct mm_struct *mm = p->mm;
unsigned long m_a_occ = 0;
@@ -1711,6 +1715,14 @@ static void task_cache_work(struct callback_head *work)
now + EPOCH_PERIOD))
return;
+ curr_cpu = task_cpu(p);
+ if (invalid_llc_nr(mm, p, curr_cpu)) {
+ if (mm->sc_stat.cpu != -1)
+ mm->sc_stat.cpu = -1;
+
+ return;
+ }
+
if (!zalloc_cpumask_var(&cpus, GFP_KERNEL))
return;
@@ -10326,7 +10338,7 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
return mig_unrestricted;
/* skip cache aware load balance for too many threads */
- if (invalid_llc_nr(mm, dst_cpu)) {
+ if (invalid_llc_nr(mm, p, dst_cpu)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
return mig_unrestricted;
--
2.32.0
* [Patch v4 04/16] sched/cache: Calculate the LLC size and store it in sched_domain
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Cache aware scheduling needs to know the LLC size that a process can
use, so as to prevent memory-intensive tasks from being over-aggregated
on a single LLC.
As a preparation, add get_effective_llc_bytes() to return the LLC size
that a CPU can use. The function can later be enhanced to subtract the
LLC cache ways reserved by resctrl (CAT in Intel RDT, etc.).
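The computation is a simple proportion. For example (numbers purely
illustrative): if a 32 MB L3 is physically shared by 16 CPUs but the
topmost SD_SHARE_LLC domain of a CPU spans only 8 of them (e.g. due to
a cpuset partition), the effective size is 32 MB * 8 / 16 = 16 MB.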
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
drivers/base/cacheinfo.c | 23 ++++++++
include/linux/cacheinfo.h | 1 +
include/linux/sched/topology.h | 7 +++
kernel/sched/topology.c | 98 ++++++++++++++++++++++++++++++++--
4 files changed, 126 insertions(+), 3 deletions(-)
diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 391ac5e3d2f5..70701d3bc81c 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -17,6 +17,7 @@
#include <linux/init.h>
#include <linux/of.h>
#include <linux/sched.h>
+#include <linux/sched/topology.h>
#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/sysfs.h>
@@ -68,6 +69,24 @@ bool last_level_cache_is_valid(unsigned int cpu)
}
+/*
+ * Get the cacheinfo of the LLC associated with @cpu.
+ * Derived from update_per_cpu_data_slice_size_cpu().
+ */
+struct cacheinfo *get_cpu_cacheinfo_llc(unsigned int cpu)
+{
+ struct cacheinfo *llc;
+
+ if (!last_level_cache_is_valid(cpu))
+ return NULL;
+
+ llc = per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1);
+ if (llc->type != CACHE_TYPE_DATA && llc->type != CACHE_TYPE_UNIFIED)
+ return NULL;
+
+ return llc;
+}
+
bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y)
{
struct cacheinfo *llc_x, *llc_y;
@@ -1018,6 +1037,7 @@ static int cacheinfo_cpu_online(unsigned int cpu)
goto err;
if (cpu_map_shared_cache(true, cpu, &cpu_map))
update_per_cpu_data_slice_size(true, cpu, cpu_map);
+ sched_update_llc_bytes(cpu);
return 0;
err:
free_cache_attributes(cpu);
@@ -1036,6 +1056,9 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
free_cache_attributes(cpu);
if (nr_shared > 1)
update_per_cpu_data_slice_size(false, cpu, cpu_map);
+
+ sched_update_llc_bytes(cpu);
+
return 0;
}
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index c8f4f0a0b874..fc879ac4cc4f 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -89,6 +89,7 @@ int populate_cache_leaves(unsigned int cpu);
int cache_setup_acpi(unsigned int cpu);
bool last_level_cache_is_valid(unsigned int cpu);
bool last_level_cache_is_shared(unsigned int cpu_x, unsigned int cpu_y);
+struct cacheinfo *get_cpu_cacheinfo_llc(unsigned int cpu);
int fetch_cache_info(unsigned int cpu);
int detect_cache_attributes(unsigned int cpu);
#ifndef CONFIG_ACPI_PPTT
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 0036d6b4bd67..fe09d3268bc9 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -106,6 +106,7 @@ struct sched_domain {
#ifdef CONFIG_SCHED_CACHE
unsigned int llc_max;
unsigned int *llc_counts __counted_by_ptr(llc_max);
+ unsigned long llc_bytes;
#endif
#ifdef CONFIG_SCHEDSTATS
@@ -265,4 +266,10 @@ static inline int task_node(const struct task_struct *p)
return cpu_to_node(task_cpu(p));
}
+#ifdef CONFIG_SCHED_CACHE
+extern void sched_update_llc_bytes(unsigned int cpu);
+#else
+static inline void sched_update_llc_bytes(unsigned int cpu) { }
+#endif
+
#endif /* _LINUX_SCHED_TOPOLOGY_H */
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 9fc99346ef4f..7248a7279abe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -776,9 +776,11 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
/* move buffer to parent as child is being destroyed */
sd->llc_counts = tmp->llc_counts;
sd->llc_max = tmp->llc_max;
+ sd->llc_bytes = tmp->llc_bytes;
/* make sure destroy_sched_domain() does not free it */
tmp->llc_counts = NULL;
tmp->llc_max = 0;
+ tmp->llc_bytes = 0;
#endif
/*
* sched groups hold the flags of the child sched
@@ -831,10 +833,42 @@ DEFINE_STATIC_KEY_FALSE(sched_cache_active);
/* user wants cache aware scheduling [0 or 1] */
int sysctl_sched_cache_user = 1;
+/*
+ * Get the effective LLC size in bytes that @cpu's bottom sched_domain
+ * can use. A CPU within a cpuset partition can only use a proportion
+ * of the physical LLC, scaled by the ratio of the partition's span
+ * weight to the hardware LLC sharing weight. @sd should be the
+ * topmost domain with SD_SHARE_LLC.
+ *
+ * Returns 0 if cacheinfo is not yet populated. This happens during
+ * early boot when build_sched_domains() runs before the generic
+ * cacheinfo framework has been initialized (cacheinfo_cpu_online()
+ * is a device_initcall cpuhp callback). In that case,
+ * cacheinfo_cpu_online() will later call sched_update_llc_bytes()
+ * to fill in the bottom domain's llc_bytes once the cache attributes
+ * are available.
+ */
+static unsigned long get_effective_llc_bytes(int cpu,
+ struct sched_domain *sd)
+{
+ struct cacheinfo *ci;
+ unsigned int hw_weight;
+
+ ci = get_cpu_cacheinfo_llc(cpu);
+ if (!ci)
+ return 0;
+
+ hw_weight = cpumask_weight(&ci->shared_cpu_map);
+ if (!hw_weight)
+ return 0;
+
+ return div_u64((u64)ci->size * sd->span_weight, hw_weight);
+}
+
static bool alloc_sd_llc(const struct cpumask *cpu_map,
struct s_data *d)
{
- struct sched_domain *sd;
+ struct sched_domain *sd, *top_llc, *parent;
unsigned int *p;
int i;
@@ -848,8 +882,24 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
if (!p)
goto err;
- sd->llc_max = max_lid + 1;
- sd->llc_counts = p;
+ top_llc = sd;
+ /*
+ * Find the topmost SD_SHARE_LLC domain.
+ * Not yet attached to the CPU, so per_cpu(sd_llc, i)
+ * can not be used.
+ */
+ while ((parent = rcu_dereference_protected(top_llc->parent, true)) &&
+ (parent->flags & SD_SHARE_LLC))
+ top_llc = parent;
+
+ if (top_llc->flags & SD_SHARE_LLC) {
+ sd->llc_max = max_lid + 1;
+ sd->llc_counts = p;
+ sd->llc_bytes = get_effective_llc_bytes(i, top_llc);
+ } else {
+ /* avoid memory leak */
+ kfree(p);
+ }
}
return true;
@@ -860,6 +910,7 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
kfree(sd->llc_counts);
sd->llc_counts = NULL;
sd->llc_max = 0;
+ sd->llc_bytes = 0;
}
}
@@ -919,6 +970,47 @@ void sched_cache_active_set_unlocked(void)
{
return sched_cache_active_set(false);
}
+
+/*
+ * Update the bottom sched_domain's llc_bytes for @cpu and all its
+ * LLC siblings. Called from cacheinfo_cpu_online() or
+ * cacheinfo_cpu_pre_down() with cpu hotplug lock held.
+ *
+ * Note: get_effective_llc_bytes() returns 0 on PowerPC,
+ * thus cache aware scheduling is disabled on PowerPC for
+ * now. PowerPC does not use the generic cacheinfo framework --
+ * it has its own cacheinfo with a separate struct cache hierarchy
+ * and does not populate the per-CPU struct cpu_cacheinfo array
+ * that get_cpu_cacheinfo_llc() reads.
+ */
+void sched_update_llc_bytes(unsigned int cpu)
+{
+ struct sched_domain *sd, *sdp;
+ unsigned int i;
+
+ sched_domains_mutex_lock();
+
+ sdp = rcu_dereference_sched_domain(per_cpu(sd_llc, cpu));
+ if (!sdp)
+ goto unlock;
+
+ /*
+ * ci->shared_cpu_map is built incrementally as CPUs come
+ * online, so the first CPU in an LLC initially sees
+ * hw_weight == 1 and computes an inflated llc_bytes in
+ * get_effective_llc_bytes(). Re-evaluating every LLC
+ * sibling on each online event corrects this once the full
+ * shared_cpu_map is known.
+ */
+ for_each_cpu(i, sched_domain_span(sdp)) {
+ sd = rcu_dereference_sched_domain(cpu_rq(i)->sd);
+ if (sd)
+ sd->llc_bytes = get_effective_llc_bytes(i, sdp);
+ }
+
+unlock:
+ sched_domains_mutex_unlock();
+}
#else
static bool alloc_sd_llc(const struct cpumask *cpu_map,
struct s_data *d)
--
2.32.0
* [Patch v4 05/16] sched/cache: Avoid cache-aware scheduling for memory-heavy processes
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Prateek and Tingyin reported that memory-intensive workloads (such as
stream) can saturate memory bandwidth and caches on the preferred LLC
when sched_cache aggregates too many threads.
To mitigate this, estimate a process's memory footprint from its NUMA
balancing fault statistics and compare it to the size of the LLC. If
the footprint exceeds the LLC size, skip cache-aware scheduling.
Note that this footprint is only an approximation, since the kernel
lacks suitable metrics to estimate the real working set. A
user-provided hint, if available in the future, would be more accurate.
A later patch will allow users to provide a hint to adjust this
threshold.
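As a rough illustration (treating each recorded NUMA fault as one
resident page): with 4 KB pages and a 32 MB effective LLC, a process
whose accumulated fault-based footprint exceeds 32 MB / 4 KB = 8192
pages is skipped.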
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Vern Hao <vernhao@tencent.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
include/linux/sched.h | 1 +
kernel/exit.c | 29 ++++++++++++++++++++
kernel/sched/fair.c | 62 ++++++++++++++++++++++++++++++++++++++++---
3 files changed, 89 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6701911eaaf7..95729670929c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2425,6 +2425,7 @@ struct sched_cache_stat {
unsigned long epoch;
u64 nr_running_avg;
unsigned long next_scan;
+ unsigned long footprint;
int cpu;
} ____cacheline_aligned_in_smp;
diff --git a/kernel/exit.c b/kernel/exit.c
index ede3117fa7d4..77275c26a2a1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -543,6 +543,32 @@ void mm_update_next_owner(struct mm_struct *mm)
}
#endif /* CONFIG_MEMCG */
+#if defined(CONFIG_SCHED_CACHE) && defined(CONFIG_NUMA_BALANCING)
+/*
+ * Subtract the memory footprint of the current task from
+ * mm.
+ */
+static void exit_mm_sched_cache(struct mm_struct *mm)
+{
+ unsigned long fp, sub;
+
+ if (!current->total_numa_faults)
+ return;
+ /*
+ * No lock protection due to performance considerations.
+ * Make sure mm->sc_stat.footprint does not become
+ * negative.
+ */
+ fp = READ_ONCE(mm->sc_stat.footprint);
+ sub = min(fp, current->total_numa_faults);
+ WRITE_ONCE(mm->sc_stat.footprint, fp - sub);
+}
+#else
+static inline void exit_mm_sched_cache(struct mm_struct *mm)
+{
+}
+#endif /* CONFIG_SCHED_CACHE CONFIG_NUMA_BALANCING */
+
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -554,6 +580,9 @@ static void exit_mm(void)
exit_mm_release(current, mm);
if (!mm)
return;
+
+ exit_mm_sched_cache(mm);
+
mmap_read_lock(mm);
mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df21366ba1ca..a10116ffe0d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1384,6 +1384,32 @@ static int llc_id(int cpu)
return per_cpu(sd_llc_id, cpu);
}
+static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
+{
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long llc, footprint;
+ struct sched_domain *sd;
+
+ guard(rcu)();
+
+ sd = rcu_dereference_sched_domain(cpu_rq(cpu)->sd);
+ if (!sd)
+ return true;
+
+ if (static_branch_likely(&sched_numa_balancing)) {
+ /*
+ * TBD: RDT exclusive LLC ways reserved should be
+ * excluded.
+ */
+ llc = sd->llc_bytes;
+ footprint = READ_ONCE(mm->sc_stat.footprint);
+
+ return (llc < (footprint * PAGE_SIZE));
+ }
+#endif
+ return false;
+}
+
static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
int cpu)
{
@@ -1463,6 +1489,7 @@ void mm_init_sched(struct mm_struct *mm,
mm->sc_stat.cpu = -1;
mm->sc_stat.next_scan = jiffies;
mm->sc_stat.nr_running_avg = 0;
+ mm->sc_stat.footprint = 0;
/*
* The update to mm->sc_stat should not be reordered
* before initialization to mm's other fields, in case
@@ -1585,7 +1612,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* its preferred state.
*/
if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
- invalid_llc_nr(mm, p, cpu_of(rq))) {
+ invalid_llc_nr(mm, p, cpu_of(rq)) ||
+ exceed_llc_capacity(mm, cpu_of(rq))) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
}
@@ -1716,7 +1744,8 @@ static void task_cache_work(struct callback_head *work)
return;
curr_cpu = task_cpu(p);
- if (invalid_llc_nr(mm, p, curr_cpu)) {
+ if (invalid_llc_nr(mm, p, curr_cpu) ||
+ exceed_llc_capacity(mm, curr_cpu)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
@@ -3515,6 +3544,7 @@ static void task_numa_placement(struct task_struct *p)
unsigned long total_faults;
u64 runtime, period;
spinlock_t *group_lock = NULL;
+ long __maybe_unused new_fp;
struct numa_group *ng;
/*
@@ -3589,6 +3619,31 @@ static void task_numa_placement(struct task_struct *p)
ng->total_faults += diff;
group_faults += ng->faults[mem_idx];
}
+#ifdef CONFIG_SCHED_CACHE
+ /*
+ * Per task p->numa_faults[mem_idx] converges,
+ * so the accumulation of each task's faults
+ * converges too - Given the number of threads,
+ * it cannot overflow an unsigned long.
+ * Racy with concurrent updates from other threads
+ * sharing this mm. Acceptable since footprint is a
+ * heuristic and occasional lost updates are tolerable.
+ *
+ * If a task exits, its corresponding footprint must
+ * be subtracted from the mm->sc_stat.footprint, otherwise
+ * the mm->sc_stat.footprint will not converge:
+ * the exiting thread's footprint remains unchanged/undecayed
+ * in mm->sc_stat.footprint. See exit_mm().
+ *
+ * Lost updates and unsynchronized subtraction
+ * in exit_mm() can cause footprint + diff to
+ * go negative. Clamp to zero to prevent the
+ * unsigned footprint from wrapping.
+ */
+ new_fp = (long)READ_ONCE(p->mm->sc_stat.footprint) + diff;
+ WRITE_ONCE(p->mm->sc_stat.footprint,
+ max(new_fp, 0L));
+#endif
}
if (!ng) {
@@ -10338,7 +10393,8 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
return mig_unrestricted;
/* skip cache aware load balance for too many threads */
- if (invalid_llc_nr(mm, p, dst_cpu)) {
+ if (invalid_llc_nr(mm, p, dst_cpu) ||
+ exceed_llc_capacity(mm, dst_cpu)) {
if (mm->sc_stat.cpu != -1)
mm->sc_stat.cpu = -1;
return mig_unrestricted;
--
2.32.0
* [Patch v4 06/16] sched/cache: Add user control to adjust the aggressiveness of cache-aware scheduling
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Introduce a set of debugfs knobs to control how aggressively
cache-aware scheduling aggregates tasks.
(1) aggr_tolerance
With sched_cache enabled, the scheduler uses a process's footprint
as a proxy for its LLC footprint to determine if aggregating tasks
on the preferred LLC could cause cache contention. If the footprint
exceeds the LLC size, aggregation is skipped. Since the kernel
cannot efficiently track per-task cache usage (resctrl is
user-space only), userspace can provide a more accurate hint.
Introduce /sys/kernel/debug/sched/llc_balancing/aggr_tolerance to
let users control how strictly footprint limits aggregation. Values
range from 0 to 100:
- 0: Cache-aware scheduling is disabled.
- 1: Strict; tasks with footprint larger than LLC size are skipped.
- >=100: Aggressive; tasks are aggregated regardless of footprint.
For example, with a 32MB L3 cache:
- aggr_tolerance=1 -> tasks with footprint > 32MB are skipped.
- aggr_tolerance=99 -> tasks with footprint > 784GB are skipped
(784GB = (1 + (99 - 1) * 256) * 32MB).
Similarly, /sys/kernel/debug/sched/llc_balancing/aggr_tolerance also
controls how strictly the number of active threads is considered during
cache-aware load balancing. The SMT thread count is taken into account
as well: high SMT counts reduce the aggregation capacity, preventing
excessive task aggregation on SMT-heavy systems like Power10/Power11.
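For the thread-count check, the scale factor is 1 + (aggr_tolerance - 1),
so with aggr_tolerance=2 an LLC with 32 CPUs and SMT2 (again assuming
the usual ~25% fits_capacity() headroom) tolerates roughly
0.8 * 2 * 32 / 2 = 25.6, i.e. about 26 average active threads instead
of about 13.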
Yangyu suggested introducing separate aggregation controls for the
number of active threads and memory footprint checks. Since there are
plans to add per-process/task group controls, fine-grained tunables are
deferred to that implementation.
(2) epoch_period, epoch_affinity_timeout, imb_pct and overaggr_pct are
also exposed as tunables.
Tested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>
Suggested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Suggested-by: Tingyin Duan <tingyin.duan@gmail.com>
Suggested-by: Jianyong Wu <jianyong.wu@outlook.com>
Suggested-by: Yangyu Chen <cyy@cyyself.name>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/debug.c | 10 +++++++
kernel/sched/fair.c | 68 ++++++++++++++++++++++++++++++++++++++------
kernel/sched/sched.h | 5 ++++
3 files changed, 75 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2eae67cd2ba2..fe569539e888 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -670,6 +670,16 @@ static __init int sched_init_debug(void)
llc = debugfs_create_dir("llc_balancing", debugfs_sched);
debugfs_create_file("enabled", 0644, llc, NULL,
&sched_cache_enable_fops);
+ debugfs_create_u32("aggr_tolerance", 0644, llc,
+ &llc_aggr_tolerance);
+ debugfs_create_u32("epoch_period", 0644, llc,
+ &llc_epoch_period);
+ debugfs_create_u32("epoch_affinity_timeout", 0644, llc,
+ &llc_epoch_affinity_timeout);
+ debugfs_create_u32("overaggr_pct", 0644, llc,
+ &llc_overaggr_pct);
+ debugfs_create_u32("imb_pct", 0644, llc,
+ &llc_imb_pct);
#endif
debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a10116ffe0d1..01ce646792ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1375,6 +1375,11 @@ static void set_next_buddy(struct sched_entity *se);
*/
#define EPOCH_PERIOD (HZ / 100) /* 10 ms */
#define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */
+__read_mostly unsigned int llc_aggr_tolerance = 1;
+__read_mostly unsigned int llc_epoch_period = EPOCH_PERIOD;
+__read_mostly unsigned int llc_epoch_affinity_timeout = EPOCH_LLC_AFFINITY_TIMEOUT;
+__read_mostly unsigned int llc_imb_pct = 20;
+__read_mostly unsigned int llc_overaggr_pct = 50;
static int llc_id(int cpu)
{
@@ -1384,11 +1389,25 @@ static int llc_id(int cpu)
return per_cpu(sd_llc_id, cpu);
}
+static inline int get_sched_cache_scale(int mul)
+{
+ unsigned int tol = READ_ONCE(llc_aggr_tolerance);
+
+ if (!tol)
+ return 0;
+
+ if (tol >= 100)
+ return INT_MAX;
+
+ return (1 + (tol - 1) * mul);
+}
+
static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
{
#ifdef CONFIG_NUMA_BALANCING
unsigned long llc, footprint;
struct sched_domain *sd;
+ int scale;
guard(rcu)();
@@ -1404,7 +1423,28 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
llc = sd->llc_bytes;
footprint = READ_ONCE(mm->sc_stat.footprint);
- return (llc < (footprint * PAGE_SIZE));
+ /*
+ * Scale the LLC size by 256*llc_aggr_tolerance
+ * and compare it to the task's footprint.
+ *
+ * Suppose the L3 size is 32MB. If the
+ * llc_aggr_tolerance is 1:
+ * When the footprint is larger than 32MB, the
+ * process is regarded as exceeding the LLC
+ * capacity. If the llc_aggr_tolerance is 99:
+ * When the footprint is larger than 784GB, the
+ * process is regarded as exceeding the LLC
+ * capacity:
+ * 784GB = (1 + (99 - 1) * 256) * 32MB
+ * If the llc_aggr_tolerance is 100:
+ * ignore the footprint and do the aggregation
+ * anyway.
+ */
+ scale = get_sched_cache_scale(256);
+ if (scale == INT_MAX)
+ return false;
+
+ return ((llc * (u64)scale) < (footprint * PAGE_SIZE));
}
#endif
return false;
@@ -1413,11 +1453,21 @@ static bool exceed_llc_capacity(struct mm_struct *mm, int cpu)
static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
int cpu)
{
+ int scale;
+
if (get_nr_threads(p) <= 1)
return true;
+ /*
+ * Scale the number of 'cores' in a LLC by llc_aggr_tolerance
+ * and compare it to the task's active threads.
+ */
+ scale = get_sched_cache_scale(1);
+ if (scale == INT_MAX)
+ return false;
+
return !fits_capacity((mm->sc_stat.nr_running_avg * cpu_smt_num_threads),
- per_cpu(sd_llc_size, cpu));
+ (scale * per_cpu(sd_llc_size, cpu)));
}
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
@@ -1513,13 +1563,14 @@ static inline void __update_mm_sched(struct rq *rq,
{
lockdep_assert_held(&rq->cpu_epoch_lock);
+ unsigned int period = max(READ_ONCE(llc_epoch_period), 1U);
unsigned long n, now = jiffies;
long delta = now - rq->cpu_epoch_next;
if (delta > 0) {
- n = (delta + EPOCH_PERIOD - 1) / EPOCH_PERIOD;
+ n = (delta + period - 1) / period;
rq->cpu_epoch += n;
- rq->cpu_epoch_next += n * EPOCH_PERIOD;
+ rq->cpu_epoch_next += n * period;
__shr_u64(&rq->cpu_runtime, n);
}
@@ -1611,7 +1662,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* If this process hasn't hit task_cache_work() for a while invalidate
* its preferred state.
*/
- if (epoch - READ_ONCE(mm->sc_stat.epoch) > EPOCH_LLC_AFFINITY_TIMEOUT ||
+ if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
invalid_llc_nr(mm, p, cpu_of(rq)) ||
exceed_llc_capacity(mm, cpu_of(rq))) {
if (mm->sc_stat.cpu != -1)
@@ -1740,7 +1791,8 @@ static void task_cache_work(struct callback_head *work)
/* only 1 thread is allowed to scan */
if (!try_cmpxchg(&mm->sc_stat.next_scan, &next_scan,
- now + EPOCH_PERIOD))
+ now + max_t(unsigned long,
+ READ_ONCE(llc_epoch_period), 1)))
return;
curr_cpu = task_cpu(p);
@@ -10232,7 +10284,7 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
*/
static bool fits_llc_capacity(unsigned long util, unsigned long max)
{
- u32 aggr_pct = 50;
+ u32 aggr_pct = llc_overaggr_pct;
/*
* For single core systems, raise the aggregation
@@ -10252,7 +10304,7 @@ static bool fits_llc_capacity(unsigned long util, unsigned long max)
*/
/* Allows dst util to be bigger than src util by up to bias percent */
#define util_greater(util1, util2) \
- ((util1) * 100 > (util2) * 120)
+ ((util1) * 100 > (util2) * (100 + llc_imb_pct))
static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
unsigned long *cap)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f499d5dd1130..27409399137c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4072,6 +4072,11 @@ static inline void mm_cid_switch_to(struct task_struct *prev, struct task_struct
DECLARE_STATIC_KEY_FALSE(sched_cache_present);
DECLARE_STATIC_KEY_FALSE(sched_cache_active);
extern int sysctl_sched_cache_user;
+extern unsigned int llc_aggr_tolerance;
+extern unsigned int llc_epoch_period;
+extern unsigned int llc_epoch_affinity_timeout;
+extern unsigned int llc_imb_pct;
+extern unsigned int llc_overaggr_pct;
static inline bool sched_cache_enabled(void)
{
--
2.32.0
* [Patch v4 07/16] sched/cache: Fix rcu warning when accessing sd_llc domain
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
rcu_dereference_all() should be used to access the
sd_llc domain under RCU protection.
This bug was reported by sashiko.
Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01ce646792ff..be96d80c9310 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1814,7 +1814,7 @@ static void task_cache_work(struct callback_head *work)
for_each_cpu(cpu, cpus) {
/* XXX sched_cluster_active */
- struct sched_domain *sd = per_cpu(sd_llc, cpu);
+ struct sched_domain *sd = rcu_dereference_all(per_cpu(sd_llc, cpu));
unsigned long occ, m_occ = 0, a_occ = 0;
int m_cpu = -1, i;
--
2.32.0
* [Patch v4 08/16] sched/cache: Fix potential NULL mm pointer access
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
A concurrent task exit might cause a NULL pointer dereference in
account_mm_sched(). Use the locally cached mm pointer instead, since
the active_mm reference guarantees that the structure remains
allocated. Also skip kernel threads, because they have nothing to do
with cache-aware scheduling.
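The fix boils down to the following pattern (a simplified fragment of
account_mm_sched(); see the diff below):
        struct mm_struct *mm = p->mm;
        if (!mm || !mm->sc_stat.pcpu_sched)
                return;
        /*
         * Use the local 'mm' from here on: p->mm can be cleared by a
         * concurrent exit after the check above, while the active_mm
         * reference keeps the structure itself allocated.
         */
        pcpu_sched = per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu_of(rq));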
This bug was reported by sashiko and Vern.
Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing")
Reported-by: Vern Hao <haoxing990@gmail.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Link: https://lore.kernel.org/all/09cf7ee3-6e27-4505-9692-4b4a4707c8b2@gmail.com/
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index be96d80c9310..913b09254732 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1649,7 +1649,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
if (!mm || !mm->sc_stat.pcpu_sched)
return;
- pcpu_sched = per_cpu_ptr(p->mm->sc_stat.pcpu_sched, cpu_of(rq));
+ pcpu_sched = per_cpu_ptr(mm->sc_stat.pcpu_sched, cpu_of(rq));
scoped_guard (raw_spinlock, &rq->cpu_epoch_lock) {
__update_mm_sched(rq, pcpu_sched);
@@ -1689,7 +1689,8 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
if (!sched_cache_enabled())
return;
- if (!mm || !mm->sc_stat.pcpu_sched)
+ if (!mm || p->flags & PF_KTHREAD ||
+ !mm->sc_stat.pcpu_sched)
return;
epoch = rq->cpu_epoch;
--
2.32.0
* [Patch v4 09/16] sched/cache: Annotate lockless accesses to mm->sc_stat.cpu
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
mm->sc_stat.cpu is written by task_cache_work() and can be read
locklessly by several functions on other CPUs. Use READ_ONCE() and
WRITE_ONCE() for the reads and writes of mm->sc_stat.cpu, so that
compiler optimizations cannot produce inconsistent values when the
field is accessed multiple times.
For example in get_pref_llc(), if the writer updated the field between
two compiler-generated loads, the validation (e.g., cpu != -1) and
subsequent use (e.g., llc_id(cpu)) could operate on different values,
allowing a negative CPU ID to be used as an index.
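In get_pref_llc(), the change amounts to (simplified; see the diff
below):
        /* Before: two loads of mm->sc_stat.cpu may observe different values. */
        if (mm->sc_stat.cpu != -1)
                mm_sched_llc = llc_id(mm->sc_stat.cpu);
        /* After: load the field once into a local, then validate and use that copy. */
        mm_sched_cpu = READ_ONCE(mm->sc_stat.cpu);
        if (mm_sched_cpu != -1)
                mm_sched_llc = llc_id(mm_sched_cpu);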
Leave the plain write in mm_init_sched(), where the mm is not
yet visible to other CPUs.
This bug was reported by sashiko.
Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 913b09254732..73f185ba6e48 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1598,13 +1598,14 @@ static unsigned long fraction_mm_sched(struct rq *rq,
static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
{
- int mm_sched_llc = -1;
+ int mm_sched_llc = -1, mm_sched_cpu;
if (!mm)
return -1;
- if (mm->sc_stat.cpu != -1) {
- mm_sched_llc = llc_id(mm->sc_stat.cpu);
+ mm_sched_cpu = READ_ONCE(mm->sc_stat.cpu);
+ if (mm_sched_cpu != -1) {
+ mm_sched_llc = llc_id(mm_sched_cpu);
#ifdef CONFIG_NUMA_BALANCING
/*
@@ -1619,7 +1620,7 @@ static int get_pref_llc(struct task_struct *p, struct mm_struct *mm)
*/
if (static_branch_likely(&sched_numa_balancing) &&
p->numa_preferred_nid >= 0 &&
- cpu_to_node(mm->sc_stat.cpu) != p->numa_preferred_nid)
+ cpu_to_node(mm_sched_cpu) != p->numa_preferred_nid)
mm_sched_llc = -1;
#endif
}
@@ -1665,8 +1666,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
invalid_llc_nr(mm, p, cpu_of(rq)) ||
exceed_llc_capacity(mm, cpu_of(rq))) {
- if (mm->sc_stat.cpu != -1)
- mm->sc_stat.cpu = -1;
+ if (READ_ONCE(mm->sc_stat.cpu) != -1)
+ WRITE_ONCE(mm->sc_stat.cpu, -1);
}
mm_sched_llc = get_pref_llc(p, mm);
@@ -1714,7 +1715,7 @@ static void get_scan_cpumasks(cpumask_var_t cpus, struct task_struct *p)
if (!static_branch_likely(&sched_numa_balancing))
goto out;
- cpu = p->mm->sc_stat.cpu;
+ cpu = READ_ONCE(p->mm->sc_stat.cpu);
if (cpu != -1)
nid = cpu_to_node(cpu);
curr_cpu = task_cpu(p);
@@ -1799,8 +1800,8 @@ static void task_cache_work(struct callback_head *work)
curr_cpu = task_cpu(p);
if (invalid_llc_nr(mm, p, curr_cpu) ||
exceed_llc_capacity(mm, curr_cpu)) {
- if (mm->sc_stat.cpu != -1)
- mm->sc_stat.cpu = -1;
+ if (READ_ONCE(mm->sc_stat.cpu) != -1)
+ WRITE_ONCE(mm->sc_stat.cpu, -1);
return;
}
@@ -1857,7 +1858,7 @@ static void task_cache_work(struct callback_head *work)
m_a_cpu = m_cpu;
}
- if (llc_id(cpu) == llc_id(mm->sc_stat.cpu))
+ if (llc_id(cpu) == llc_id(READ_ONCE(mm->sc_stat.cpu)))
curr_m_a_occ = a_occ;
cpumask_andnot(cpus, cpus, sched_domain_span(sd));
@@ -1875,7 +1876,7 @@ static void task_cache_work(struct callback_head *work)
* 3. 2X is chosen based on test results, as it delivers
* the optimal performance gain so far.
*/
- mm->sc_stat.cpu = m_a_cpu;
+ WRITE_ONCE(mm->sc_stat.cpu, m_a_cpu);
}
update_avg_scale(&mm->sc_stat.nr_running_avg, nr_running);
@@ -10441,15 +10442,15 @@ static enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
if (!mm)
return mig_unrestricted;
- cpu = mm->sc_stat.cpu;
+ cpu = READ_ONCE(mm->sc_stat.cpu);
if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
return mig_unrestricted;
/* skip cache aware load balance for too many threads */
if (invalid_llc_nr(mm, p, dst_cpu) ||
exceed_llc_capacity(mm, dst_cpu)) {
- if (mm->sc_stat.cpu != -1)
- mm->sc_stat.cpu = -1;
+ if (READ_ONCE(mm->sc_stat.cpu) != -1)
+ WRITE_ONCE(mm->sc_stat.cpu, -1);
return mig_unrestricted;
}
--
2.32.0
* [Patch v4 10/16] sched/cache: Fix unpaired account_llc_enqueue/dequeue
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
There is a race condition: after a task is enqueued on a runqueue,
task_llc(p) may change due to CPU hotplug, because the llc_id is
dynamically allocated and adjusted at runtime.
Therefore, checking task_llc(p) to determine whether the
task is being dequeued from its preferred LLC is unreliable
and can cause inconsistent values.
To fix this problem, record at enqueue time whether p is on its
preferred LLC, and have account_llc_dequeue() use that recorded
value, so that nr_pref_llc_running stays consistent per runqueue.
This bug was reported by sashiko, and the solution was suggested
by Prateek.
Fixes: 46afe3af7ead ("sched/cache: Track LLC-preferred tasks per runqueue")
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
include/linux/sched.h | 2 ++
init/init_task.c | 1 +
kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++---
3 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 95729670929c..2c9e8e2edde1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1410,6 +1410,8 @@ struct task_struct {
#ifdef CONFIG_SCHED_CACHE
struct callback_head cache_work;
int preferred_llc;
+ /* 1: task was enqueued to its preferred LLC, 0 otherwise */
+ int pref_llc_queued;
#endif
struct rseq_data rseq;
diff --git a/init/init_task.c b/init/init_task.c
index 5d90db4ff1f8..3ecd66fbd563 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -217,6 +217,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
#endif
#ifdef CONFIG_SCHED_CACHE
.preferred_llc = -1,
+ .pref_llc_queued = 0,
#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
.kasan_depth = 1,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 73f185ba6e48..9e6edd40cd80 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1472,15 +1472,32 @@ static bool invalid_llc_nr(struct mm_struct *mm, struct task_struct *p,
static void account_llc_enqueue(struct rq *rq, struct task_struct *p)
{
+ int pref_llc, pref_llc_queued;
struct sched_domain *sd;
- int pref_llc;
pref_llc = p->preferred_llc;
if (pref_llc < 0)
return;
+ pref_llc_queued = (pref_llc == task_llc(p));
rq->nr_llc_running++;
- rq->nr_pref_llc_running += (pref_llc == task_llc(p));
+ rq->nr_pref_llc_running += pref_llc_queued;
+
+ /*
+ * Record whether p is enqueued on its preferred
+ * LLC, in order to pair with account_llc_dequeue()
+ * to maintain a consistent nr_pref_llc_running per
+ * runqueue.
+ * This is necessary because a race condition exists:
+ * after a task is enqueued on a runqueue, task_llc(p)
+ * may change due to CPU hotplug. Therefore, checking
+ * task_llc(p) to determine whether the task is being
+ * dequeued from its preferred LLC is unreliable and
+ * can cause inconsistent values - checking the
+ * p->pref_llc_queued in account_llc_dequeue() would
+ * be reliable.
+ */
+ p->pref_llc_queued = pref_llc_queued;
sd = rcu_dereference_all(rq->sd);
if (sd && (unsigned int)pref_llc < sd->llc_max)
@@ -1497,7 +1514,15 @@ static void account_llc_dequeue(struct rq *rq, struct task_struct *p)
return;
rq->nr_llc_running--;
- rq->nr_pref_llc_running -= (pref_llc == task_llc(p));
+ if (p->pref_llc_queued) {
+ rq->nr_pref_llc_running--;
+ /*
+ * Update the status in case
+ * other logic might query
+ * this.
+ */
+ p->pref_llc_queued = 0;
+ }
sd = rcu_dereference_all(rq->sd);
if (sd && (unsigned int)pref_llc < sd->llc_max) {
--
2.32.0
* [Patch v4 11/16] sched/cache: Fix checking active load balance by only considering the CFS task
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
The currently running task cur may not be a CFS task; it could be an
RT or deadline task. For non-CFS tasks, the task_util(cur) utilization
average is not maintained, so a stale or meaningless value might be
passed to can_migrate_llc().
Check that the task is a CFS task before reading its task_util().
This bug was reported by sashiko.
Fixes: 714059f79ff0 ("sched/cache: Handle moving single tasks to/from their preferred LLC")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9e6edd40cd80..8617cd3642c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10509,7 +10509,8 @@ alb_break_llc(struct lb_env *env)
/*
* All tasks prefer to stay on their current CPU.
* Do not pull a task from its preferred CPU if:
- * 1. It is the only task running there(not too imbalance); OR
+ * 1. It is the only task running and does not exceed
+ * imbalance allowance; OR
* 2. Migrating it away from its preferred LLC would violate
* the cache-aware scheduling policy.
*/
@@ -10522,7 +10523,7 @@ alb_break_llc(struct lb_env *env)
return true;
cur = rcu_dereference_all(env->src_rq->curr);
- if (cur)
+ if (cur && cur->sched_class == &fair_sched_class)
util = task_util(cur);
if (can_migrate_llc(env->src_cpu, env->dst_cpu,
--
2.32.0
* [Patch v4 12/16] sched/cache: Fix race condition during sched domain rebuild
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
sched_cache_active_set_unlocked() checks hardware support without
locks:
static void sched_cache_active_set(bool locked)
{
/* hardware does not support */
if (!static_branch_likely(&sched_cache_present)) {
_sched_cache_active_set(false, locked);
return;
}
...
If build_sched_domains() runs concurrently during CPU hotplug,
it can disable sched_cache_present under sched_domains_mutex
and the CPU hotplug lock. If a debugfs write thread evaluates
sched_cache_present as true right before that, and then blocks
or is preempted, it can go on to enable sched_cache_active
after hardware support has already been marked absent. Close
this race by taking cpus_read_lock() and sched_domains_mutex_lock()
when the user changes sched_cache_active via debugfs.
This bug was reported by sashiko.
Fixes: 067a31358143 ("sched/cache: Allow the user space to turn on and off cache aware scheduling")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/debug.c | 4 +++-
kernel/sched/sched.h | 2 +-
kernel/sched/topology.c | 42 +++++++++++++++--------------------------
3 files changed, 19 insertions(+), 29 deletions(-)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index fe569539e888..ed3a0d65da0c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -224,7 +224,9 @@ sched_cache_enable_write(struct file *filp, const char __user *ubuf,
sysctl_sched_cache_user = val;
- sched_cache_active_set_unlocked();
+ sched_cache_active_set();
+
+ *ppos += cnt;
return cnt;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 27409399137c..45a3b77f46aa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4083,7 +4083,7 @@ static inline bool sched_cache_enabled(void)
return static_branch_unlikely(&sched_cache_active);
}
-extern void sched_cache_active_set_unlocked(void);
+extern void sched_cache_active_set(void);
#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 7248a7279abe..cff5a0ecd64d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -917,30 +917,19 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
return false;
}
-static void _sched_cache_active_set(bool enable, bool locked)
-{
- if (enable) {
- if (locked)
- static_branch_enable_cpuslocked(&sched_cache_active);
- else
- static_branch_enable(&sched_cache_active);
- } else {
- if (locked)
- static_branch_disable_cpuslocked(&sched_cache_active);
- else
- static_branch_disable(&sched_cache_active);
- }
-}
-
/*
* Enable/disable cache aware scheduling according to
* user input and the presence of hardware support.
+ * Expected to be protected by cpus_read_lock() and
+ * sched_domains_mutex_lock()
*/
-static void sched_cache_active_set(bool locked)
+static void _sched_cache_active_set(void)
{
/* hardware does not support */
if (!static_branch_likely(&sched_cache_present)) {
- _sched_cache_active_set(false, locked);
+ static_branch_disable_cpuslocked(&sched_cache_active);
+ if (sched_debug())
+ pr_info("%s: cache aware scheduling not supported on this platform\n", __func__);
return;
}
@@ -951,24 +940,23 @@ static void sched_cache_active_set(bool locked)
* for now.
*/
if (sysctl_sched_cache_user) {
- _sched_cache_active_set(true, locked);
+ static_branch_enable_cpuslocked(&sched_cache_active);
if (sched_debug())
pr_info("%s: enabling cache aware scheduling\n", __func__);
} else {
- _sched_cache_active_set(false, locked);
+ static_branch_disable_cpuslocked(&sched_cache_active);
if (sched_debug())
pr_info("%s: disabling cache aware scheduling\n", __func__);
}
}
-static void sched_cache_active_set_locked(void)
-{
- return sched_cache_active_set(true);
-}
-
-void sched_cache_active_set_unlocked(void)
+void sched_cache_active_set(void)
{
- return sched_cache_active_set(false);
+ cpus_read_lock();
+ sched_domains_mutex_lock();
+ _sched_cache_active_set();
+ sched_domains_mutex_unlock();
+ cpus_read_unlock();
}
/*
@@ -3082,7 +3070,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
else
static_branch_disable_cpuslocked(&sched_cache_present);
- sched_cache_active_set_locked();
+ _sched_cache_active_set();
#endif
__free_domain_allocs(&d, alloc_state, cpu_map);
--
2.32.0
^ permalink raw reply related [flat|nested] 17+ messages in thread* [Patch v4 13/16] sched/cache: Fix cache aware scheduling enabling for multi LLCs system
2026-05-13 20:39 [Patch v4 00/16] Cache aware scheduling enhancements Tim Chen
` (11 preceding siblings ...)
2026-05-13 20:39 ` [Patch v4 12/16] sched/cache: Fix race condition during sched domain rebuild Tim Chen
@ 2026-05-13 20:39 ` Tim Chen
2026-05-13 20:39 ` [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs Tim Chen
` (2 subsequent siblings)
15 siblings, 0 replies; 17+ messages in thread
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
Cache aware scheduling should be enabled when the system has
multiple LLCs. However, there is a corner case: with a single
NUMA node and a single LLC per node, the current implementation
still turns cache aware scheduling on, because at this point the
parent domain has not yet been degenerated and may have the same
CPU span as the current domain. There is no need to enable cache
aware scheduling in this scenario.
Fix it by walking the parent domains to find one that is a strict
superset of the current sd_llc, so that cache aware scheduling
only takes effect on topologies that still have multiple LLCs
after the duplicated parent domains have been degenerated.
For example, the expected behavior would be:
2 sockets, 1 LLC per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
1 socket, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
2 sockets, 2 LLCs per socket: MC span=0-3, PKG span=0-7, has_multi_llcs=true
1 socket, 1 LLC per socket: MC span=0-3, PKG span=0-3, has_multi_llcs=false
This bug was reported by sashiko.
Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/topology.c | 39 ++++++++++++++++++++++++++++++++++++---
1 file changed, 36 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cff5a0ecd64d..07f0a3d28253 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1007,6 +1007,37 @@ static bool alloc_sd_llc(const struct cpumask *cpu_map,
}
#endif
+/*
+ * Return true if @sd belongs to an LLC group whose enclosing
+ * partition spans more than one LLC. @sd must be the topmost
+ * SD_SHARE_LLC domain.
+ *
+ * Any duplicated parent domains with the same span as @sd are
+ * skipped: before cpu_attach_domain() degeneration these still
+ * exist, after degeneration the loop is a no-op. This makes the
+ * helper usable both during sched domain build and against an
+ * already-attached domain tree.
+ *
+ * Note: For systems with a single LLC per node, cache-aware
+ * scheduling is still enabled when multiple nodes exist.
+ * However, NUMA balancing decisions take precedence over
+ * cache-aware scheduling. Conversely, if there is only one
+ * LLC per partition, cache-aware scheduling should be disabled.
+ */
+static bool sd_in_multi_llcs(struct sched_domain *sd)
+{
+ struct sched_domain *sdp = sd->parent;
+
+ /* it does not make sense to aggregate to 1 CPU */
+ if (sd->span_weight == 1)
+ return false;
+
+ while (sdp && sdp->span_weight == sd->span_weight)
+ sdp = sdp->parent;
+
+ return !!sdp;
+}
+
/*
* Return the canonical balance CPU for this group, this is the first CPU
* of this group that's also in the balance mask.
@@ -3016,9 +3047,11 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
* NUMA imbalance stats for the hierarchy.
*/
if (sd->parent) {
- if (IS_ENABLED(CONFIG_NUMA))
- adjust_numa_imbalance(sd);
- has_multi_llcs = true;
+ if (IS_ENABLED(CONFIG_NUMA))
+ adjust_numa_imbalance(sd);
+
+ if (sd_in_multi_llcs(sd))
+ has_multi_llcs = true;
}
}
}
--
2.32.0
^ permalink raw reply related [flat|nested] 17+ messages in thread* [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs
2026-05-13 20:39 [Patch v4 00/16] Cache aware scheduling enhancements Tim Chen
` (12 preceding siblings ...)
2026-05-13 20:39 ` [Patch v4 13/16] sched/cache: Fix cache aware scheduling enabling for multi LLCs system Tim Chen
@ 2026-05-13 20:39 ` Tim Chen
2026-05-13 20:39 ` [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating the preferred CPU Tim Chen
2026-05-13 20:39 ` [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task Tim Chen
15 siblings, 0 replies; 17+ messages in thread
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
sched_cache_present is a global static key, but build_sched_domains()
is called per partition from the "Build new domains" loop in
partition_sched_domains_locked(). Each call unconditionally sets the
key based solely on the has_multi_llcs local variable for that partition.
The call for the last partition therefore decides the final value,
even when earlier partitions have multiple LLCs.
If partition A (multi-LLC) is built first, the key is enabled. Then
when partition B (single-LLC) is built, the key is disabled. The
multi-LLC partition A is still active but the key is now off.
Fix it the same way sched_energy_present is handled: accumulate the
multi-LLC state across the iteration over all partitions rather than
deciding it from a single partition.
This bug was reported by sashiko.
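The shape of the fix is accumulate-then-commit; a minimal sketch under
hypothetical names (build_one() stands in for build_sched_domains(),
only sched_cache_set() comes from this patch):

/*
 * Hypothetical sketch: gather the per-partition result and flip the
 * global key exactly once, so a single-LLC partition built last can
 * no longer override an earlier multi-LLC one.
 */
static void rebuild_all_partitions(bool (*build_one)(int idx), int nr)
{
	bool has_multi_llcs = false;
	int i;

	for (i = 0; i < nr; i++)
		has_multi_llcs |= build_one(i);	/* true if partition i has >1 LLC */

	sched_cache_set(has_multi_llcs);	/* commit once for all partitions */
}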
Fixes: d59f4fd1d303 ("sched/cache: Enable cache aware scheduling for multi LLCs NUMA node")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/topology.c | 69 +++++++++++++++++++++++++++++++----------
1 file changed, 53 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 07f0a3d28253..4c5ea369d835 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -950,6 +950,7 @@ static void _sched_cache_active_set(void)
}
}
+/* used by debugfs */
void sched_cache_active_set(void)
{
cpus_read_lock();
@@ -999,12 +1000,27 @@ void sched_update_llc_bytes(unsigned int cpu)
unlock:
sched_domains_mutex_unlock();
}
+
+static void sched_cache_set(bool has_multi_llcs)
+{
+ /*
+ * TBD: check before writing to it. sched domain rebuild
+ * is not in the critical path, leave as-is for now.
+ */
+ if (has_multi_llcs)
+ static_branch_enable_cpuslocked(&sched_cache_present);
+ else
+ static_branch_disable_cpuslocked(&sched_cache_present);
+
+ _sched_cache_active_set();
+}
#else
static bool alloc_sd_llc(const struct cpumask *cpu_map,
struct s_data *d)
{
return false;
}
+static inline void sched_cache_set(bool has_multi_llcs) { }
#endif
/*
@@ -2949,7 +2965,8 @@ void sched_domains_free_llc_id(int cpu)
* to the individual CPUs
*/
static int
-build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
+build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr,
+ bool *multi_llcs)
{
enum s_alloc alloc_state = sa_none;
bool has_multi_llcs = false;
@@ -3093,18 +3110,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
ret = 0;
error:
-#ifdef CONFIG_SCHED_CACHE
- /*
- * TBD: check before writing to it. sched domain rebuild
- * is not in the critical path, leave as-is for now.
- */
- if (!ret && has_multi_llcs)
- static_branch_enable_cpuslocked(&sched_cache_present);
- else
- static_branch_disable_cpuslocked(&sched_cache_present);
-
- _sched_cache_active_set();
-#endif
+ *multi_llcs = has_multi_llcs;
__free_domain_allocs(&d, alloc_state, cpu_map);
return ret;
@@ -3167,6 +3173,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
*/
int __init sched_init_domains(const struct cpumask *cpu_map)
{
+ bool multi_llcs;
int err;
zalloc_cpumask_var(&sched_domains_llc_id_allocmask, GFP_KERNEL);
@@ -3181,7 +3188,9 @@ int __init sched_init_domains(const struct cpumask *cpu_map)
if (!doms_cur)
doms_cur = &fallback_doms;
cpumask_and(doms_cur[0], cpu_map, housekeeping_cpumask(HK_TYPE_DOMAIN));
- err = build_sched_domains(doms_cur[0], NULL);
+ err = build_sched_domains(doms_cur[0], NULL, &multi_llcs);
+ if (!err)
+ sched_cache_set(multi_llcs);
return err;
}
@@ -3254,6 +3263,7 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
struct sched_domain_attr *dattr_new)
{
bool __maybe_unused has_eas = false;
+ bool has_multi_llcs = false, multi_llcs;
int i, j, n;
int new_topology;
@@ -3303,14 +3313,41 @@ static void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < n && !new_topology; j++) {
if (cpumask_equal(doms_new[i], doms_cur[j]) &&
- dattrs_equal(dattr_new, i, dattr_cur, j))
+ dattrs_equal(dattr_new, i, dattr_cur, j)) {
+ /*
+ * Reused partitions must be accounted for
+ * here as well, because there is a corner
+ * case: if a reused partition is skipped
+ * and only newly built partitions are
+ * considered, an incorrect has_multi_llcs
+ * would be set. For example:
+ * If the only multi-LLC partition is reused
+ * and a new single-LLC partition is built,
+ * sched_cache_set(false) disables cache-aware
+ * scheduling globally despite the reused
+ * multi-LLC partition still being active.
+ */
+ struct sched_domain *sd;
+ int cpu = cpumask_first(doms_cur[j]);
+
+ guard(rcu)();
+ sd = rcu_dereference(cpu_rq(cpu)->sd);
+ while (sd && sd->parent && (sd->parent->flags & SD_SHARE_LLC))
+ sd = sd->parent;
+ if (sd && (sd->flags & SD_SHARE_LLC) && sd->parent &&
+ sd_in_multi_llcs(sd))
+ has_multi_llcs = true;
goto match2;
+ }
}
/* No match - add a new doms_new */
- build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL);
+ build_sched_domains(doms_new[i], dattr_new ? dattr_new + i : NULL,
+ &multi_llcs);
+ has_multi_llcs |= multi_llcs;
match2:
;
}
+ sched_cache_set(has_multi_llcs);
#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
/* Build perf domains: */
--
2.32.0
^ permalink raw reply related [flat|nested] 17+ messages in thread* [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating the preferred CPU
2026-05-13 20:39 [Patch v4 00/16] Cache aware scheduling enhancements Tim Chen
` (13 preceding siblings ...)
2026-05-13 20:39 ` [Patch v4 14/16] sched/cache: Fix has_multi_llcs iff at least one partition has multiple LLCs Tim Chen
@ 2026-05-13 20:39 ` Tim Chen
2026-05-13 20:39 ` [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task Tim Chen
15 siblings, 0 replies; 17+ messages in thread
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
In account_mm_sched(), epoch comes from the local rq->cpu_epoch, but
mm->sc_stat.epoch is written by task_tick_cache() running on any CPU,
potentially a different CPU whose rq->cpu_epoch is further ahead. In that
case the unsigned subtraction underflows and wraps to a huge number, so
the timeout condition fires incorrectly.
Fix this by doing the comparison as a signed long.
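The effect is easy to reproduce in a standalone program; a minimal
sketch of the wraparound (userspace C, values made up for illustration):

#include <stdio.h>

int main(void)
{
	unsigned long epoch = 100;	/* local rq->cpu_epoch */
	unsigned long mm_epoch = 105;	/* mm->sc_stat.epoch, written on another CPU */
	unsigned long timeout = 10;	/* stands in for llc_epoch_affinity_timeout */

	/* unsigned: 100 - 105 wraps to a huge number, the check fires */
	printf("unsigned: %d\n", (epoch - mm_epoch) > timeout);

	/* signed cast keeps the difference negative, the check stays quiet */
	printf("signed:   %d\n", (long)(epoch - mm_epoch) > (long)timeout);

	return 0;
}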
Fixes: df0d98475954 ("sched/cache: Introduce infrastructure for cache-aware load balancing")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8617cd3642c7..7e64cd18727e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1688,7 +1688,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
* If this process hasn't hit task_cache_work() for a while invalidate
* its preferred state.
*/
- if (epoch - READ_ONCE(mm->sc_stat.epoch) > llc_epoch_affinity_timeout ||
+ if ((long)(epoch - READ_ONCE(mm->sc_stat.epoch)) > (long)llc_epoch_affinity_timeout ||
invalid_llc_nr(mm, p, cpu_of(rq)) ||
exceed_llc_capacity(mm, cpu_of(rq))) {
if (READ_ONCE(mm->sc_stat.cpu) != -1)
--
2.32.0
^ permalink raw reply related [flat|nested] 17+ messages in thread* [Patch v4 16/16] sched/cache: Fix stale preferred_llc for a new task
2026-05-13 20:39 [Patch v4 00/16] Cache aware scheduling enhancements Tim Chen
` (14 preceding siblings ...)
2026-05-13 20:39 ` [Patch v4 15/16] sched/cache: Fix possible overflow when invalidating the preferred CPU Tim Chen
@ 2026-05-13 20:39 ` Tim Chen
15 siblings, 0 replies; 17+ messages in thread
From: Tim Chen @ 2026-05-13 20:39 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, K Prateek Nayak, Vincent Guittot
Cc: Chen Yu, Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Madadi Vineeth Reddy,
Hillf Danton, Shrikanth Hegde, Jianyong Wu, Yangyu Chen,
Tingyin Duan, Vern Hao, Vern Hao, Len Brown, Tim Chen, Aubrey Li,
Zhao Liu, Chen Yu, Adam Li, Aaron Lu, Tim Chen, Josh Don,
Gavin Guo, Qais Yousef, Libo Chen, Luo Gengkun, linux-kernel
From: Chen Yu <yu.c.chen@intel.com>
On fork without CLONE_VM, the child gets a new mm, so the
preferred_llc value inherited from the parent is stale for
the child.
Fix this by resetting the child's preferred_llc to -1 in
init_sched_mm().
This bug was reported by sashiko.
Fixes: 47d8696b95f7 ("sched/cache: Assign preferred LLC ID to processes")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e64cd18727e..73da6f8fc9ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1914,6 +1914,11 @@ void init_sched_mm(struct task_struct *p)
init_task_work(work, task_cache_work);
work->next = work;
+ /*
+ * Reset the new task's preference to
+ * avoid polluting account_llc_enqueue().
+ */
+ p->preferred_llc = -1;
}
#else /* CONFIG_SCHED_CACHE */
--
2.32.0
^ permalink raw reply related [flat|nested] 17+ messages in thread