* [RFC PATCH v5 0/9] steal tasks to improve CPU utilization
@ 2026-03-20 5:59 Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 1/9] sched: Provide sparsemask, a reduced contention bitmap Chen Jinghuang
` (9 more replies)
0 siblings, 10 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
When a CPU has no more CFS tasks to run, and idle_balance() fails to
find a task, then attempt to steal a task from an overloaded CPU in the
same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
identify candidates. To minimize search time, steal the first migratable
task that is found when the bitmap is traversed. For fairness, search
for migratable tasks on an overloaded CPU in order of next to run.
This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.
The bitmap of overloaded CPUs is a new type of sparse bitmap, designed to
reduce cache contention vs the usual bitmap when many threads concurrently
set, clear, and visit elements.
Patch 1 defines the sparsemask type and its operations.
Patches 2, 3, and 4 implement the bitmap of overloaded CPUs.
Patches 5 and 6 refactor existing code for a cleaner merge of later
patches
Patches 7 and 8 implement task stealing using the overloaded CPUs bitmap.
Patch 9 adds schedstats for comparing the new behavior to the old, and
provided as a convenience for developers only, not for integration.
The patch series is based on kernel 7.0.0-rc1. It compiles, boots, and
runs with/without each of CONFIG_SCHED_SMT, CONFIG_SMP, CONFIG_SCHED_DEBUG,
and CONFIG_PREEMPT.
Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default.
Stealing imprroves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:
steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000
baseline
grps time %busy sched idle wake steal
1 2.182 20.00 35876 17905 17958 0
2 2.391 39.00 67753 33808 33921 0
3 2.871 47.00 100944 48966 51538 0
4 2.928 62.00 114489 55171 59059 0
8 4.852 83.00 219907 92961 121703 0
new
grps time %busy sched idle wake steal %speedup
1 2.229 18.00 45450 22691 22751 52 -2.1
2 2.123 40.00 49975 24977 24990 6 12.6
3 2.690 61.00 56118 22641 32780 9073 6.7
4 2.828 80.00 37927 12828 24165 8442 3.5
8 4.120 95.00 85929 8613 57858 11098 17.8
Elapsed time improves by 17.8, and CPU busy utilization is up
by 1 to 18% hitting 95% at peak load.
Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation. Although I have implemented the necessary
adaptations to support DELAY_DEQUEUE, I observed a noticeable performance
regression in hackbench when both DELAY_DEQUEUE and SCHED_STEAL are enabled
simultaneously, specifically in heavily overloaded scenarios where the
number of tasks far exceeds the number of CPUs. Any suggestions on how to
address this would be appreciated.
---
Changes from v1 to v2:
- Remove stray find_time hunk from patch 5
- Fix "warning: label out defined but not used" for !CONFIG_SCHED_SMT
- Set SCHED_STEAL_NODE_LIMIT_DEFAULT to 2
- Steal iff avg_idle exceeds the cost of stealing
Changes from v2 to v3:
- Update series for kernel 4.20. Context changes only.
Changes from v3 to v4:
- Avoid 64-bit division on 32-bit processors in compute_skid()
- Replace IF_SMP with inline functions to set idle_stamp
- Push ZALLOC_MASK body into calling function
- Set rq->cfs_overload_cpus in update_top_cache_domain instead of
cpu_attach_domain
- Rewrite sparsemask iterator for complete inlining
- Cull and clean up sparsemask functions and moved all into
sched/sparsemask.h
Changes from v4 to v5:
- Adapt overload_set and overload_clear to support the DELAY_DEQUEUE
feature.
- Rebase onto kernel v7.0.0-rc1 and refresh all benchmark results.
- Remove the previous Patch 9 ("sched/fair: disable stealing if too many
NUMA nodes").
Steve Sistare (9):
sched: Provide sparsemask, a reduced contention bitmap
sched/topology: Provide hooks to allocate data shared per LLC
sched/topology: Provide cfs_overload_cpus bitmap
sched/fair: Dynamically update cfs_overload_cpus
sched/fair: Hoist idle_stamp up from idle_balance
sched/fair: Generalize the detach_task interface
sched/fair: Provide can_migrate_task_llc
sched/fair: Steal work from an overloaded CPU when CPU goes idle
sched/fair: Provide idle search schedstats
include/linux/sched/topology.h | 1 +
kernel/sched/core.c | 31 ++-
kernel/sched/fair.c | 340 +++++++++++++++++++++++++++++++--
kernel/sched/features.h | 6 +
kernel/sched/sched.h | 11 ++
kernel/sched/sparsemask.h | 210 ++++++++++++++++++++
kernel/sched/stats.c | 9 +
kernel/sched/stats.h | 13 ++
kernel/sched/topology.c | 96 +++++++++-
9 files changed, 694 insertions(+), 23 deletions(-)
create mode 100644 kernel/sched/sparsemask.h
--
2.34.1
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH v5 1/9] sched: Provide sparsemask, a reduced contention bitmap
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 2/9] sched/topology: Provide hooks to allocate data shared per LLC Chen Jinghuang
` (8 subsequent siblings)
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
Provide struct sparsemask and functions to manipulate it. A sparsemask is
a sparse bitmap. It reduces cache contention vs the usual bitmap when many
threads concurrently set, clear, and visit elements, by reducing the number
of significant bits per cacheline. For each cacheline chunk of the mask,
only the first K bits of the first word are used, and the remaining bits
are ignored, where K is a creation time parameter. Thus a sparsemask that
can represent a set of N elements is approximately (N/K * CACHELINE) bytes
in size.
This type is simpler and more efficient than the struct sbitmap used by
block drivers.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
kernel/sched/sparsemask.h | 210 ++++++++++++++++++++++++++++++++++++++
1 file changed, 210 insertions(+)
create mode 100644 kernel/sched/sparsemask.h
diff --git a/kernel/sched/sparsemask.h b/kernel/sched/sparsemask.h
new file mode 100644
index 000000000000..11948620a1a2
--- /dev/null
+++ b/kernel/sched/sparsemask.h
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * sparsemask.h - sparse bitmap operations
+ *
+ * Copyright (c) 2018 Oracle Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __LINUX_SPARSEMASK_H
+#define __LINUX_SPARSEMASK_H
+
+#include <linux/kernel.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+
+/*
+ * A sparsemask is a sparse bitmap. It reduces cache contention vs the usual
+ * bitmap when many threads concurrently set, clear, and visit elements. For
+ * each cacheline chunk of the mask, only the first K bits of the first word are
+ * used, and the remaining bits are ignored, where K is a creation time
+ * parameter. Thus a sparsemask that can represent a set of N elements is
+ * approximately (N/K * CACHELINE) bytes in size.
+ *
+ * Clients pass and receive element numbers in the public API, and the
+ * implementation translates them to bit numbers to perform the bitmap
+ * operations.
+ */
+
+struct sparsemask_chunk {
+ unsigned long word; /* the significant bits */
+} ____cacheline_aligned_in_smp;
+
+struct sparsemask {
+ short nelems; /* current number of elements */
+ short density; /* store 2^density elements per chunk */
+ struct sparsemask_chunk chunks[0]; /* embedded array of chunks */
+};
+
+#define _SMASK_INDEX(density, elem) ((elem) >> (density))
+#define _SMASK_BIT(density, elem) ((elem) & ((1U << (density)) - 1U))
+#define SMASK_INDEX(mask, elem) _SMASK_INDEX((mask)->density, elem)
+#define SMASK_BIT(mask, elem) _SMASK_BIT((mask)->density, elem)
+#define SMASK_WORD(mask, elem) \
+ (&(mask)->chunks[SMASK_INDEX((mask), (elem))].word)
+
+/*
+ * sparsemask_next() - Return the next one bit in a bitmap, starting at a
+ * specified position and wrapping from the last bit to the first, up to but
+ * not including a specified origin. This is a helper, so do not call it
+ * directly.
+ *
+ * @mask: Bitmap to search.
+ * @origin: Origin.
+ * @prev: Previous bit. Start search after this bit number.
+ * If -1, start search at @origin.
+ *
+ * Return: the bit number, else mask->nelems if no bits are set in the range.
+ */
+static inline int
+sparsemask_next(const struct sparsemask *mask, int origin, int prev)
+{
+ int density = mask->density;
+ int bits_per_word = 1U << density;
+ const struct sparsemask_chunk *chunk;
+ int nelems = mask->nelems;
+ int next, bit, nbits;
+ unsigned long word;
+
+ /* Calculate number of bits to be searched. */
+ if (prev == -1) {
+ nbits = nelems;
+ next = origin;
+ } else if (prev < origin) {
+ nbits = origin - prev;
+ next = prev + 1;
+ } else {
+ nbits = nelems - prev + origin - 1;
+ next = prev + 1;
+ }
+
+ if (unlikely(next >= nelems))
+ return nelems;
+
+ /*
+ * Fetch and adjust first word. Clear word bits below @next, and round
+ * @next down to @bits_per_word boundary because later ffs will add
+ * those bits back.
+ */
+ chunk = &mask->chunks[_SMASK_INDEX(density, next)];
+ bit = _SMASK_BIT(density, next);
+ word = chunk->word & (~0UL << bit);
+ next -= bit;
+ nbits += bit;
+
+ while (!word) {
+ next += bits_per_word;
+ nbits -= bits_per_word;
+ if (nbits <= 0)
+ return nelems;
+
+ if (next >= nelems) {
+ chunk = mask->chunks;
+ nbits -= (next - nelems);
+ next = 0;
+ } else {
+ chunk++;
+ }
+ word = chunk->word;
+ }
+
+ next += __ffs(word);
+ if (next >= origin && prev != -1)
+ return nelems;
+ return next;
+}
+
+/****************** The public API ********************/
+
+/*
+ * Max value for the density parameter, limited by 64 bits in the chunk word.
+ */
+#define SMASK_DENSITY_MAX 6
+
+/*
+ * Return bytes to allocate for a sparsemask, for custom allocators.
+ */
+static inline size_t sparsemask_size(int nelems, int density)
+{
+ int index = _SMASK_INDEX(density, nelems) + 1;
+
+ return offsetof(struct sparsemask, chunks[index]);
+}
+
+/*
+ * Initialize an allocated sparsemask, for custom allocators.
+ */
+static inline void
+sparsemask_init(struct sparsemask *mask, int nelems, int density)
+{
+ WARN_ON(density < 0 || density > SMASK_DENSITY_MAX || nelems < 0);
+ mask->nelems = nelems;
+ mask->density = density;
+}
+
+/*
+ * sparsemask_alloc_node() - Allocate, initialize, and return a sparsemask.
+ *
+ * @nelems - maximum number of elements.
+ * @density - store 2^density elements per cacheline chunk.
+ * values from 0 to SMASK_DENSITY_MAX inclusive.
+ * @flags - kmalloc allocation flags
+ * @node - numa node
+ */
+static inline struct sparsemask *
+sparsemask_alloc_node(int nelems, int density, gfp_t flags, int node)
+{
+ int nbytes = sparsemask_size(nelems, density);
+ struct sparsemask *mask = kmalloc_node(nbytes, flags, node);
+
+ if (mask)
+ sparsemask_init(mask, nelems, density);
+ return mask;
+}
+
+static inline void sparsemask_free(struct sparsemask *mask)
+{
+ kfree(mask);
+}
+
+static inline void sparsemask_set_elem(struct sparsemask *dst, int elem)
+{
+ set_bit(SMASK_BIT(dst, elem), SMASK_WORD(dst, elem));
+}
+
+static inline void sparsemask_clear_elem(struct sparsemask *dst, int elem)
+{
+ clear_bit(SMASK_BIT(dst, elem), SMASK_WORD(dst, elem));
+}
+
+static inline int sparsemask_test_elem(const struct sparsemask *mask, int elem)
+{
+ return test_bit(SMASK_BIT(mask, elem), SMASK_WORD(mask, elem));
+}
+
+/*
+ * sparsemask_for_each() - iterate over each set bit in a bitmap, starting at a
+ * specified position, and wrapping from the last bit to the first.
+ *
+ * @mask: Bitmap to iterate over.
+ * @origin: Bit number at which to start searching.
+ * @elem: Iterator. Can be signed or unsigned integer.
+ *
+ * The implementation does not assume any bit in @mask is set, including
+ * @origin. After the loop, @elem = @mask->nelems.
+ */
+#define sparsemask_for_each(mask, origin, elem) \
+ for ((elem) = -1; \
+ (elem) = sparsemask_next((mask), (origin), (elem)), \
+ (elem) < (mask)->nelems;)
+
+#endif /* __LINUX_SPARSEMASK_H */
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 2/9] sched/topology: Provide hooks to allocate data shared per LLC
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 1/9] sched: Provide sparsemask, a reduced contention bitmap Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 3/9] sched/topology: Provide cfs_overload_cpus bitmap Chen Jinghuang
` (7 subsequent siblings)
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
Add functions sd_llc_alloc_all() and sd_llc_free_all() to allocate and
free data pointed to by struct sched_domain_shared at the last-level-cache
domain. sd_llc_alloc_all() is called after the SD hierarchy is known, to
eliminate the unnecessary allocations that would occur if we instead
allocated in __sdt_alloc() and then figured out which shared nodes are
redundant.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
kernel/sched/topology.c | 75 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 74 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 32dcddaead82..fac1b9155b6e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -21,6 +21,12 @@ void sched_domains_mutex_unlock(void)
static cpumask_var_t sched_domains_tmpmask;
static cpumask_var_t sched_domains_tmpmask2;
+struct s_data;
+static int sd_llc_alloc(struct sched_domain *sd);
+static void sd_llc_free(struct sched_domain *sd);
+static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d);
+static void sd_llc_free_all(const struct cpumask *cpu_map);
+
static int __init sched_debug_setup(char *str)
{
sched_debug_verbose = true;
@@ -630,8 +636,10 @@ static void destroy_sched_domain(struct sched_domain *sd)
*/
free_sched_groups(sd->groups, 1);
- if (sd->shared && atomic_dec_and_test(&sd->shared->ref))
+ if (sd->shared && atomic_dec_and_test(&sd->shared->ref)) {
+ sd_llc_free(sd);
kfree(sd->shared);
+ }
kfree(sd);
}
@@ -1546,6 +1554,7 @@ static void __free_domain_allocs(struct s_data *d, enum s_alloc what,
free_percpu(d->sd);
fallthrough;
case sa_sd_storage:
+ sd_llc_free_all(cpu_map);
__sdt_free(cpu_map);
fallthrough;
case sa_none:
@@ -2463,6 +2472,62 @@ static void __sdt_free(const struct cpumask *cpu_map)
}
}
+static int sd_llc_alloc(struct sched_domain *sd)
+{
+ /* Allocate sd->shared data here. Empty for now. */
+
+ return 0;
+}
+
+static void sd_llc_free(struct sched_domain *sd)
+{
+ struct sched_domain_shared *sds = sd->shared;
+
+ if (!sds)
+ return;
+
+ /* Free data here. Empty for now. */
+}
+
+static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d)
+{
+ struct sched_domain *sd, *hsd;
+ int i;
+
+ for_each_cpu(i, cpu_map) {
+ /* Find highest domain that shares resources */
+ hsd = NULL;
+ for (sd = *per_cpu_ptr(d->sd, i); sd; sd = sd->parent) {
+ if (!(sd->flags & SD_SHARE_LLC))
+ break;
+ hsd = sd;
+ }
+ if (hsd && sd_llc_alloc(hsd))
+ return 1;
+ }
+
+ return 0;
+}
+
+static void sd_llc_free_all(const struct cpumask *cpu_map)
+{
+ struct sched_domain_topology_level *tl;
+ struct sched_domain *sd;
+ struct sd_data *sdd;
+ int j;
+
+ for_each_sd_topology(tl) {
+ sdd = &tl->data;
+ if (!sdd || !sdd->sd)
+ continue;
+ for_each_cpu(j, cpu_map) {
+ sd = *per_cpu_ptr(sdd->sd, j);
+ if (sd)
+ sd_llc_free(sd);
+ }
+ }
+}
+
static struct sched_domain *build_sched_domain(struct sched_domain_topology_level *tl,
const struct cpumask *cpu_map, struct sched_domain_attr *attr,
struct sched_domain *child, int cpu)
@@ -2674,6 +2739,14 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
}
}
+ /*
+ * Allocate shared sd data at last level cache. Must be done after
+ * domains are built above, but before the data is used in
+ * cpu_attach_domain and descendants below.
+ */
+ if (sd_llc_alloc_all(cpu_map, &d))
+ goto error;
+
/* Attach the domains */
rcu_read_lock();
for_each_cpu(i, cpu_map) {
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 3/9] sched/topology: Provide cfs_overload_cpus bitmap
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 1/9] sched: Provide sparsemask, a reduced contention bitmap Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 2/9] sched/topology: Provide hooks to allocate data shared per LLC Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus Chen Jinghuang
` (6 subsequent siblings)
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steve.sistare@oracle.com>
Define and initialize a sparse bitmap of overloaded CPUs, per
last-level-cache scheduling domain, for use by the CFS scheduling class.
Save a pointer to cfs_overload_cpus in the rq for efficient access.
Signed-off-by: Steve Sistare <steve.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
include/linux/sched/topology.h | 1 +
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 25 +++++++++++++++++++++++--
3 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..472c3dcf5a34 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -67,6 +67,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ struct sparsemask *cfs_overload_cpus;
int nr_idle_scan;
};
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b82fb70a9d54..4989a92eeb9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -85,6 +85,7 @@ struct cfs_rq;
struct rt_rq;
struct sched_group;
struct cpuidle_state;
+struct sparsemask;
#if defined(CONFIG_PARAVIRT) && !defined(CONFIG_HAVE_PV_STEAL_CLOCK_GEN)
# include <asm/paravirt.h>
@@ -1173,6 +1174,7 @@ struct rq {
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
+ struct sparsemask *cfs_overload_cpus;
#ifdef CONFIG_SCHED_CLASS_EXT
struct scx_rq scx;
struct sched_dl_entity ext_server;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index fac1b9155b6e..7bf1f68dac32 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -6,6 +6,7 @@
#include <linux/sched/isolation.h>
#include <linux/bsearch.h>
#include "sched.h"
+#include "sparsemask.h"
DEFINE_MUTEX(sched_domains_mutex);
void sched_domains_mutex_lock(void)
@@ -683,7 +684,9 @@ DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
static void update_top_cache_domain(int cpu)
{
+ struct sparsemask *cfs_overload_cpus = NULL;
struct sched_domain_shared *sds = NULL;
+ struct rq *rq = cpu_rq(cpu);
struct sched_domain *sd;
int id = cpu;
int size = 1;
@@ -693,8 +696,10 @@ static void update_top_cache_domain(int cpu)
id = cpumask_first(sched_domain_span(sd));
size = cpumask_weight(sched_domain_span(sd));
sds = sd->shared;
+ cfs_overload_cpus = sds->cfs_overload_cpus;
}
+ rcu_assign_pointer(rq->cfs_overload_cpus, cfs_overload_cpus);
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_size, cpu) = size;
per_cpu(sd_llc_id, cpu) = id;
@@ -2474,7 +2479,22 @@ static void __sdt_free(const struct cpumask *cpu_map)
static int sd_llc_alloc(struct sched_domain *sd)
{
- /* Allocate sd->shared data here. Empty for now. */
+ struct sched_domain_shared *sds = sd->shared;
+ struct cpumask *span = sched_domain_span(sd);
+ int nid = cpu_to_node(cpumask_first(span));
+ int flags = __GFP_ZERO | GFP_KERNEL;
+ struct sparsemask *mask;
+
+ /*
+ * Allocate the bitmap if not already allocated. This is called for
+ * every CPU in the LLC but only allocates once per sd_llc_shared.
+ */
+ if (!sds->cfs_overload_cpus) {
+ mask = sparsemask_alloc_node(nr_cpu_ids, 3, flags, nid);
+ if (!mask)
+ return 1;
+ sds->cfs_overload_cpus = mask;
+ }
return 0;
}
@@ -2486,7 +2506,8 @@ static void sd_llc_free(struct sched_domain *sd)
if (!sds)
return;
- /* Free data here. Empty for now. */
+ sparsemask_free(sds->cfs_overload_cpus);
+ sds->cfs_overload_cpus = NULL;
}
static int sd_llc_alloc_all(const struct cpumask *cpu_map, struct s_data *d)
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (2 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 3/9] sched/topology: Provide cfs_overload_cpus bitmap Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-24 13:56 ` kernel test robot
2026-03-20 5:59 ` [RFC PATCH v5 5/9] sched/fair: Hoist idle_stamp up from idle_balance Chen Jinghuang
` (5 subsequent siblings)
9 siblings, 1 reply; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
An overloaded CPU has more than 1 runnable task. When a CFS task wakes
on a CPU, if h_nr_runnable transitions from 1 to more, then set the CPU in
the cfs_overload_cpus bitmap. When a CFS task sleeps, if h_nr_runnable
transitions from 2 to less, then clear the CPU in cfs_overload_cpus.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
v5: Rename h_nr_running to h_nr_runnable and reposition
overload_set/overload_clear to fix overload detection for delay dequeue.
v4: Detect CPU overload via changes in h_nr_running.
---
kernel/sched/fair.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 44 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eea99ec01a3f..92c3bcff5b6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -55,6 +55,7 @@
#include <uapi/linux/sched/types.h>
#include "sched.h"
+#include "sparsemask.h"
#include "stats.h"
#include "autogroup.h"
@@ -5076,6 +5077,33 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
}
+#ifdef CONFIG_SMP
+static void overload_clear(struct rq *rq)
+{
+ struct sparsemask *overload_cpus;
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
+ if (overload_cpus)
+ sparsemask_clear_elem(overload_cpus, rq->cpu);
+ rcu_read_unlock();
+}
+
+static void overload_set(struct rq *rq)
+{
+ struct sparsemask *overload_cpus;
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
+ if (overload_cpus)
+ sparsemask_set_elem(overload_cpus, rq->cpu);
+ rcu_read_unlock();
+}
+#else /* CONFIG_SMP */
+static inline void overload_clear(struct rq *rq) {}
+static inline void overload_set(struct rq *rq) {}
+#endif
+
void __setparam_fair(struct task_struct *p, const struct sched_attr *attr)
{
struct sched_entity *se = &p->se;
@@ -5955,6 +5983,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */
+
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
@@ -6875,6 +6904,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int h_nr_idle = task_has_idle_policy(p);
int h_nr_runnable = 1;
int task_new = !(flags & ENQUEUE_WAKEUP);
+ unsigned int prev_nr = rq->cfs.h_nr_runnable;
int rq_h_nr_queued = rq->cfs.h_nr_queued;
u64 slice = 0;
@@ -6892,6 +6922,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (flags & ENQUEUE_DELAYED) {
requeue_delayed_entity(se);
+
+ if (prev_nr <= 1 && rq->cfs.h_nr_runnable >= 2)
+ overload_set(rq);
+
return;
}
@@ -6961,6 +6995,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* At this point se is NULL and we are at root level*/
add_nr_running(rq, 1);
+ if (prev_nr <= 1 && rq->cfs.h_nr_runnable >= 2)
+ overload_set(rq);
/*
* Since new tasks are assigned an initial util_avg equal to
@@ -7003,6 +7039,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
int h_nr_idle = 0;
int h_nr_queued = 0;
int h_nr_runnable = 0;
+ unsigned int prev_nr = rq->cfs.h_nr_runnable;
struct cfs_rq *cfs_rq;
u64 slice = 0;
@@ -7018,8 +7055,12 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
cfs_rq = cfs_rq_of(se);
if (!dequeue_entity(cfs_rq, se, flags)) {
- if (p && &p->se == se)
+ if (p && &p->se == se) {
+ if (prev_nr >= 2 && rq->cfs.h_nr_runnable <= 1)
+ overload_clear(rq);
+
return -1;
+ }
slice = cfs_rq_min_slice(cfs_rq);
break;
@@ -7077,6 +7118,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
}
sub_nr_running(rq, h_nr_queued);
+ if (prev_nr >= 2 && rq->cfs.h_nr_runnable <= 1)
+ overload_clear(rq);
/* balance early to pull high priority tasks */
if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 5/9] sched/fair: Hoist idle_stamp up from idle_balance
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (3 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 6/9] sched/fair: Generalize the detach_task interface Chen Jinghuang
` (4 subsequent siblings)
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
Move the update of idle_stamp from idle_balance to the call site in
pick_next_task_fair, to prepare for a future patch that adds work to
pick_next_task_fair which must be included in the idle_stamp interval.
No functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
kernel/sched/fair.c | 32 ++++++++++++++++++++++----------
1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92c3bcff5b6b..742462d41118 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5078,6 +5078,16 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
}
#ifdef CONFIG_SMP
+static inline void rq_idle_stamp_update(struct rq *rq)
+{
+ rq->idle_stamp = rq_clock(rq);
+}
+
+static inline void rq_idle_stamp_clear(struct rq *rq)
+{
+ rq->idle_stamp = 0;
+}
+
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
@@ -5100,6 +5110,8 @@ static void overload_set(struct rq *rq)
rcu_read_unlock();
}
#else /* CONFIG_SMP */
+static inline void rq_idle_stamp_update(struct rq *rq) {}
+static inline void rq_idle_stamp_clear(struct rq *rq) {}
static inline void overload_clear(struct rq *rq) {}
static inline void overload_set(struct rq *rq) {}
#endif
@@ -9011,8 +9023,17 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
idle:
if (rf) {
+ /*
+ * We must set idle_stamp _before_ calling idle_balance(), such that we
+ * measure the duration of idle_balance() as idle time.
+ */
+ rq_idle_stamp_update(rq);
+
new_tasks = sched_balance_newidle(rq, rf);
+ if (new_tasks)
+ rq_idle_stamp_clear(rq);
+
/*
* Because sched_balance_newidle() releases (and re-acquires)
* rq->lock, it is possible for any higher priority task to
@@ -12911,13 +12932,6 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
if (this_rq->ttwu_pending)
return 0;
- /*
- * We must set idle_stamp _before_ calling sched_balance_rq()
- * for CPU_NEWLY_IDLE, such that we measure the this duration
- * as idle time.
- */
- this_rq->idle_stamp = rq_clock(this_rq);
-
/*
* Do not pull tasks towards !active CPUs...
*/
@@ -13026,9 +13040,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;
- if (pulled_task)
- this_rq->idle_stamp = 0;
- else
+ if (!pulled_task)
nohz_newidle_balance(this_rq);
rq_repin_lock(this_rq, rf);
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 6/9] sched/fair: Generalize the detach_task interface
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (4 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 5/9] sched/fair: Hoist idle_stamp up from idle_balance Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 7/9] sched/fair: Provide can_migrate_task_llc Chen Jinghuang
` (3 subsequent siblings)
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
The detach_task function takes a struct lb_env argument, but only needs a
few of its members. Pass the rq and cpu arguments explicitly so the
function may be called from code that is not based on lb_env. No
functional change.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 742462d41118..ebb13108dabe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9602,6 +9602,17 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
set_task_cpu(p, env->dst_cpu);
}
+/*
+ * detach_task_steal() -- detach the task for the migration from @src_rq to @dst_cpu.
+ */
+static void detach_task_steal(struct task_struct *p, struct rq *src_rq, int dst_cpu)
+{
+ lockdep_assert_rq_held(src_rq);
+
+ deactivate_task(src_rq, p, DEQUEUE_NOCLOCK);
+ set_task_cpu(p, dst_cpu);
+}
+
/*
* detach_one_task() -- tries to dequeue exactly one task from env->src_rq, as
* part of active balancing operations within "domain".
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 7/9] sched/fair: Provide can_migrate_task_llc
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (5 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 6/9] sched/fair: Generalize the detach_task interface Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle Chen Jinghuang
` (2 subsequent siblings)
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
Define a simpler version of can_migrate_task called can_migrate_task_llc
which does not require a struct lb_env argument, and judges whether a
migration from one CPU to another within the same LLC should be allowed.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
changes from v4 to v5:
Skip tasks with the sched_delayed flag during overload detection.
---
kernel/sched/fair.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebb13108dabe..0bf6d18dac05 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9582,6 +9582,34 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}
+/*
+ * Return true if task @p can migrate from @rq to @dst_rq in the same LLC.
+ * No need to test for co-locality, and no need to test task_hot(), as sharing
+ * LLC provides cache warmth at that level.
+ */
+static bool
+can_migrate_task_llc(struct task_struct *p, struct rq *rq, struct rq *dst_rq)
+{
+ int dst_cpu = dst_rq->cpu;
+
+ lockdep_assert_rq_held(rq);
+
+ if (!cpumask_test_cpu(dst_cpu, p->cpus_ptr)) {
+ schedstat_inc(p->stats.nr_failed_migrations_affine);
+ return false;
+ }
+
+ if (task_on_cpu(rq, p)) {
+ schedstat_inc(p->stats.nr_failed_migrations_running);
+ return false;
+ }
+
+ if (p->se.sched_delayed)
+ return false;
+
+ return true;
+}
+
/*
* detach_task() -- detach the task for the migration specified in env
*/
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (6 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 7/9] sched/fair: Provide can_migrate_task_llc Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-22 12:09 ` kernel test robot
` (2 more replies)
2026-03-20 5:59 ` [RFC PATCH v5 9/9] sched/fair: Provide idle search schedstats Chen Jinghuang
2026-03-20 8:53 ` [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Peter Zijlstra
9 siblings, 3 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
When a CPU has no more CFS tasks to run, and idle_balance() fails to find a
task, then attempt to steal a task from an overloaded CPU in the same LLC,
using the cfs_overload_cpus bitmap to efficiently identify candidates. To
minimize search time, steal the first migratable task that is found when
the bitmap is traversed. For fairness, search for migratable tasks on an
overloaded CPU in order of next to run.
This simple stealing yields a higher CPU utilization than idle_balance()
alone, because the search is cheap, so it may be called every time the CPU
is about to go idle. idle_balance() does more work because it searches
widely for the busiest queue, so to limit its CPU consumption, it declines
to search if the system is too busy. Simple stealing does not offload the
globally busiest queue, but it is much better than running nothing at all.
Stealing is controlled by the sched feature SCHED_STEAL, which is enabled
by default. Note that all test results presented below are based on the
NO_DELAY_DEQUEUE implementation.
Stealing imprroves utilization with only a modest CPU overhead in scheduler
code. In the following experiment, hackbench is run with varying numbers
of groups (40 tasks per group), and the delta in /proc/schedstat is shown
for each run, averaged per CPU, augmented with these non-standard stats:
steal - number of times a task is stolen from another CPU.
X6-2: 2 socket * 40 cores * 2 hyperthreads = 160 CPUs
Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
hackbench <grps> process 100000
baseline
grps time %busy sched idle wake steal
1 2.182 20.00 35876 17905 17958 0
2 2.391 39.00 67753 33808 33921 0
3 2.871 47.00 100944 48966 51538 0
4 2.928 62.00 114489 55171 59059 0
8 4.852 83.00 219907 92961 121703 0
new
grps time %busy sched idle wake steal %speedup
1 2.229 18.00 45450 22691 22751 52 -2.1
2 2.123 40.00 49975 24977 24990 6 12.6
3 2.690 61.00 56118 22641 32780 9073 6.7
4 2.828 80.00 37927 12828 24165 8442 3.5
8 4.120 95.00 85929 8613 57858 11098 17.8
Elapsed time improves by 17.8, and CPU busy utilization is up
by 1 to 18% hitting 95% at peak load.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
kernel/sched/fair.c | 174 ++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 6 ++
2 files changed, 174 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0bf6d18dac05..500215a57392 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5092,6 +5092,9 @@ static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ if (!sched_feat(STEAL))
+ return;
+
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
@@ -5103,17 +5106,29 @@ static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ if (!sched_feat(STEAL))
+ return;
+
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_set_elem(overload_cpus, rq->cpu);
rcu_read_unlock();
}
+
+static int try_steal(struct rq *this_rq, struct rq_flags *rf);
+
#else /* CONFIG_SMP */
static inline void rq_idle_stamp_update(struct rq *rq) {}
static inline void rq_idle_stamp_clear(struct rq *rq) {}
static inline void overload_clear(struct rq *rq) {}
static inline void overload_set(struct rq *rq) {}
+
+static inline int try_steal(struct rq *this_rq, struct rq_flags *rf)
+{
+ return 0;
+}
+
#endif
void __setparam_fair(struct task_struct *p, const struct sched_attr *attr)
@@ -9024,21 +9039,24 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
idle:
if (rf) {
/*
- * We must set idle_stamp _before_ calling idle_balance(), such that we
- * measure the duration of idle_balance() as idle time.
+ * We must set idle_stamp _before_ calling try_steal() or
+ * sched_balance_newidle(), such that we measure the duration
+ * as idle time.
*/
rq_idle_stamp_update(rq);
new_tasks = sched_balance_newidle(rq, rf);
+ if (new_tasks == 0)
+ new_tasks = try_steal(rq, rf);
if (new_tasks)
rq_idle_stamp_clear(rq);
/*
- * Because sched_balance_newidle() releases (and re-acquires)
- * rq->lock, it is possible for any higher priority task to
- * appear. In that case we must re-start the pick_next_entity()
- * loop.
+ * Because try_steal() and sched_balance_newidle() release
+ * (and re-acquire) rq->lock, it is possible for any higher priority
+ * task to appear. In that case we must re-start the
+ * pick_next_entity() loop.
*/
if (new_tasks < 0)
return RETRY_TASK;
@@ -13133,6 +13151,150 @@ void sched_balance_trigger(struct rq *rq)
nohz_balancer_kick(rq);
}
+/*
+ * Search the runnable tasks in @cfs_rq in order of next to run, and find
+ * the first one that can be migrated to @dst_rq. @cfs_rq is locked on entry.
+ * On success, dequeue the task from @cfs_rq and return it, else return NULL.
+ */
+static struct task_struct *
+detach_next_task(struct cfs_rq *cfs_rq, struct rq *dst_rq)
+{
+ int dst_cpu = dst_rq->cpu;
+ struct task_struct *p;
+ struct rq *rq = rq_of(cfs_rq);
+
+ lockdep_assert_rq_held(rq);
+
+ list_for_each_entry_reverse(p, &rq->cfs_tasks, se.group_node) {
+ if (can_migrate_task_llc(p, rq, dst_rq)) {
+ detach_task_steal(p, rq, dst_cpu);
+ return p;
+ }
+ }
+ return NULL;
+}
+
+/*
+ * Attempt to migrate a CFS task from @src_cpu to @dst_rq. @locked indicates
+ * whether @dst_rq is already locked on entry. This function may lock or
+ * unlock @dst_rq, and updates @locked to indicate the locked state on return.
+ * The locking protocol is based on idle_balance().
+ * Returns 1 on success and 0 on failure.
+ */
+static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
+ int src_cpu)
+{
+ struct task_struct *p;
+ struct rq_flags rf;
+ int stolen = 0;
+ int dst_cpu = dst_rq->cpu;
+ struct rq *src_rq = cpu_rq(src_cpu);
+
+ if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
+ return 0;
+
+ if (*locked) {
+ rq_unpin_lock(dst_rq, dst_rf);
+ raw_spin_rq_unlock(dst_rq);
+ *locked = false;
+ }
+ rq_lock_irqsave(src_rq, &rf);
+ update_rq_clock(src_rq);
+
+ if (src_rq->cfs.h_nr_runnable < 2 || !cpu_active(src_cpu))
+ p = NULL;
+ else
+ p = detach_next_task(&src_rq->cfs, dst_rq);
+
+ rq_unlock(src_rq, &rf);
+
+ if (p) {
+ raw_spin_rq_lock(dst_rq);
+ rq_repin_lock(dst_rq, dst_rf);
+ *locked = true;
+ update_rq_clock(dst_rq);
+ attach_task(dst_rq, p);
+ stolen = 1;
+ }
+ local_irq_restore(rf.flags);
+
+ return stolen;
+}
+
+/*
+ * Conservative upper bound on the max cost of a steal, in nsecs (the typical
+ * cost is 1-2 microsec). Do not steal if average idle time is less.
+ */
+#define SCHED_STEAL_COST 10000
+
+/*
+ * Try to steal a runnable CFS task from a CPU in the same LLC as @dst_rq,
+ * and migrate it to @dst_rq. rq_lock is held on entry and return, but
+ * may be dropped in between. Return 1 on success, 0 on failure, and -1
+ * if a task in a different scheduling class has become runnable on @dst_rq.
+ */
+static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
+{
+ int src_cpu;
+ int dst_cpu = dst_rq->cpu;
+ bool locked = true;
+ int stolen = 0;
+ struct sparsemask *overload_cpus;
+
+ if (!sched_feat(STEAL))
+ return 0;
+
+ if (!cpu_active(dst_cpu))
+ return 0;
+
+ if (dst_rq->avg_idle < SCHED_STEAL_COST)
+ return 0;
+
+ /* Get bitmap of overloaded CPUs in the same LLC as @dst_rq */
+
+ rcu_read_lock();
+ overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
+ if (!overload_cpus) {
+ rcu_read_unlock();
+ return 0;
+ }
+
+#ifdef CONFIG_SCHED_SMT
+ /*
+ * First try overloaded CPUs on the same core to preserve cache warmth.
+ */
+ if (static_branch_likely(&sched_smt_present)) {
+ for_each_cpu(src_cpu, cpu_smt_mask(dst_cpu)) {
+ if (sparsemask_test_elem(overload_cpus, src_cpu) &&
+ steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
+ stolen = 1;
+ goto out;
+ }
+ }
+ }
+#endif /* CONFIG_SCHED_SMT */
+
+ /* Accept any suitable task in the LLC */
+
+ sparsemask_for_each(overload_cpus, dst_cpu, src_cpu) {
+ if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
+ stolen = 1;
+ goto out;
+ }
+ }
+
+out:
+ rcu_read_unlock();
+ if (!locked) {
+ raw_spin_rq_lock(dst_rq);
+ rq_repin_lock(dst_rq, dst_rf);
+ }
+ stolen |= (dst_rq->cfs.h_nr_runnable > 0);
+ if (dst_rq->nr_running != dst_rq->cfs.h_nr_runnable)
+ stolen = -1;
+ return stolen;
+}
+
static void rq_online_fair(struct rq *rq)
{
update_sysctl();
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 136a6584be79..e8c3e19bf585 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -87,6 +87,12 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_UTIL, true)
+/*
+ * Steal a CFS task from another CPU when going idle.
+ * Improves CPU utilization.
+ */
+SCHED_FEAT(STEAL, true)
+
/*
* Issue a WARN when we do multiple update_rq_clock() calls
* in a single rq->lock section. Default disabled because the
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH v5 9/9] sched/fair: Provide idle search schedstats
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (7 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle Chen Jinghuang
@ 2026-03-20 5:59 ` Chen Jinghuang
2026-03-20 8:53 ` [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Peter Zijlstra
9 siblings, 0 replies; 18+ messages in thread
From: Chen Jinghuang @ 2026-03-20 5:59 UTC (permalink / raw)
To: mingo, peterz, juri.lelli, vincent.guittot, linux-kernel
Cc: bsegall, dietmar.eggemann, rostedt, mgorman, vschneid,
kprateek.nayak, chenl311, steve.sistare
From: Steve Sistare <steven.sistare@oracle.com>
Add schedstats to measure the effectiveness of searching for idle CPUs
and stealing tasks. This is a temporary patch intended for use during
development only. SCHEDSTAT_VERSION is bumped to 16, and the following
fields are added to the per-CPU statistics of /proc/schedstat:
field 10: # of times select_idle_sibling "easily" found an idle CPU --
prev or target is idle.
field 11: # of times select_idle_sibling searched and found an idle cpu.
field 12: # of times select_idle_sibling searched and found an idle core.
field 13: # of times select_idle_sibling failed to find anything idle.
field 14: time in nanoseconds spent in functions that search for idle
CPUs and search for tasks to steal.
field 15: # of times an idle CPU steals a task from another CPU.
field 16: # of times try_steal finds overloaded CPUs but no task is
migratable.
Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Chen Jinghuang <chenjinghuang2@huawei.com>
---
kernel/sched/core.c | 31 +++++++++++++++++++++++--
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 9 ++++++++
kernel/sched/stats.c | 9 ++++++++
kernel/sched/stats.h | 13 +++++++++++
5 files changed, 109 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 759777694c78..841a4ca7e173 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4505,17 +4505,44 @@ static int sysctl_numa_balancing(const struct ctl_table *table, int write,
DEFINE_STATIC_KEY_FALSE(sched_schedstats);
+unsigned long schedstat_skid;
+
+static void compute_skid(void)
+{
+ int i, n = 0;
+ s64 t;
+ int skid = 0;
+
+ for (i = 0; i < 100; i++) {
+ t = local_clock();
+ t = local_clock() - t;
+ if (t > 0 && t < 1000) { /* only use sane samples */
+ skid += (int) t;
+ n++;
+ }
+ }
+
+ if (n > 0)
+ schedstat_skid = skid / n;
+ else
+ schedstat_skid = 0;
+ pr_info("schedstat_skid = %lu\n", schedstat_skid);
+}
+
static void set_schedstats(bool enabled)
{
- if (enabled)
+ if (enabled) {
+ compute_skid();
static_branch_enable(&sched_schedstats);
- else
+ } else {
static_branch_disable(&sched_schedstats);
+ }
}
void force_schedstat_enabled(void)
{
if (!schedstat_enabled()) {
+ compute_skid();
pr_info("kernel profiling enabled schedstats, disable via kernel.sched_schedstats.\n");
static_branch_enable(&sched_schedstats);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 500215a57392..ba2b9f811135 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5091,29 +5091,35 @@ static inline void rq_idle_stamp_clear(struct rq *rq)
static void overload_clear(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ unsigned long time;
if (!sched_feat(STEAL))
return;
+ time = schedstat_start_time();
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_clear_elem(overload_cpus, rq->cpu);
rcu_read_unlock();
+ schedstat_end_time(rq->find_time, time);
}
static void overload_set(struct rq *rq)
{
struct sparsemask *overload_cpus;
+ unsigned long time;
if (!sched_feat(STEAL))
return;
+ time = schedstat_start_time();
rcu_read_lock();
overload_cpus = rcu_dereference(rq->cfs_overload_cpus);
if (overload_cpus)
sparsemask_set_elem(overload_cpus, rq->cpu);
rcu_read_unlock();
+ schedstat_end_time(rq->find_time, time);
}
static int try_steal(struct rq *this_rq, struct rq_flags *rf);
@@ -7830,6 +7836,16 @@ static inline bool asym_fits_cpu(unsigned long util,
return true;
}
+#define SET_STAT(STAT) \
+ do { \
+ if (schedstat_enabled()) { \
+ struct rq *rq = this_rq(); \
+ \
+ if (rq) \
+ __schedstat_inc(rq->STAT); \
+ } \
+ } while (0)
+
/*
* Try and locate an idle core/thread in the LLC cache domain.
*/
@@ -7857,8 +7873,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
lockdep_assert_irqs_disabled();
if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
- asym_fits_cpu(task_util, util_min, util_max, target))
+ asym_fits_cpu(task_util, util_min, util_max, target)) {
+ SET_STAT(found_idle_cpu_easy);
return target;
+ }
/*
* If the previous CPU is cache affine and idle, don't be stupid:
@@ -7868,8 +7886,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
asym_fits_cpu(task_util, util_min, util_max, prev)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
- cpus_share_resources(prev, target))
+ cpus_share_resources(prev, target)) {
+ SET_STAT(found_idle_cpu_easy);
return prev;
+ }
prev_aff = prev;
}
@@ -7887,6 +7907,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
prev == smp_processor_id() &&
this_rq()->nr_running <= 1 &&
asym_fits_cpu(task_util, util_min, util_max, prev)) {
+ SET_STAT(found_idle_cpu_easy);
return prev;
}
@@ -7901,8 +7922,10 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
- cpus_share_resources(recent_used_cpu, target))
+ cpus_share_resources(recent_used_cpu, target)) {
+ SET_STAT(found_idle_cpu_easy);
return recent_used_cpu;
+ }
} else {
recent_used_cpu = -1;
@@ -7924,13 +7947,16 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
if (sd) {
i = select_idle_capacity(p, sd, target);
+ SET_STAT(found_idle_cpu_capacity);
return ((unsigned)i < nr_cpumask_bits) ? i : target;
}
}
sd = rcu_dereference_all(per_cpu(sd_llc, target));
- if (!sd)
+ if (!sd) {
+ SET_STAT(nofound_idle_cpu);
return target;
+ }
if (sched_smt_active()) {
has_idle_core = test_idle_cores(target);
@@ -7943,8 +7969,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
}
i = select_idle_cpu(p, sd, has_idle_core, target);
- if ((unsigned)i < nr_cpumask_bits)
+ if ((unsigned)i < nr_cpumask_bits) {
+ SET_STAT(found_idle_cpu);
return i;
+ }
+
+ SET_STAT(nofound_idle_cpu);
/*
* For cluster machines which have lower sharing cache like L2 or
@@ -8580,6 +8610,7 @@ static int
select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
{
int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
+ unsigned long time;
struct sched_domain *tmp, *sd = NULL;
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
@@ -8587,6 +8618,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
/* SD_flags and WF_flags share the first nibble */
int sd_flag = wake_flags & 0xF;
+ time = schedstat_start_time();
+
/*
* required for stable ->cpus_allowed
*/
@@ -8643,6 +8676,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
}
rcu_read_unlock();
+ schedstat_end_time(cpu_rq(cpu)->find_time, time);
+
return new_cpu;
}
@@ -8981,6 +9016,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
+ unsigned long time;
again:
p = pick_task_fair(rq, rf);
@@ -9038,6 +9074,7 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
idle:
if (rf) {
+ time = schedstat_start_time();
/*
* We must set idle_stamp _before_ calling try_steal() or
* sched_balance_newidle(), such that we measure the duration
@@ -9052,6 +9089,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
if (new_tasks)
rq_idle_stamp_clear(rq);
+ schedstat_end_time(rq->find_time, time);
+
/*
* Because try_steal() and sched_balance_newidle() release
* (and re-acquire) rq->lock, it is possible for any higher priority
@@ -13215,6 +13254,7 @@ static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
update_rq_clock(dst_rq);
attach_task(dst_rq, p);
stolen = 1;
+ schedstat_inc(dst_rq->steal);
}
local_irq_restore(rf.flags);
@@ -13239,6 +13279,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
int dst_cpu = dst_rq->cpu;
bool locked = true;
int stolen = 0;
+ bool any_overload = false;
struct sparsemask *overload_cpus;
if (!sched_feat(STEAL))
@@ -13281,6 +13322,7 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
stolen = 1;
goto out;
}
+ any_overload = true;
}
out:
@@ -13292,6 +13334,8 @@ static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
stolen |= (dst_rq->cfs.h_nr_runnable > 0);
if (dst_rq->nr_running != dst_rq->cfs.h_nr_runnable)
stolen = -1;
+ if (!stolen && any_overload)
+ schedstat_inc(dst_rq->steal_fail);
return stolen;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4989a92eeb9b..530b80fbf897 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1304,6 +1304,15 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ /* Idle search stats */
+ unsigned int found_idle_cpu_capacity;
+ unsigned int found_idle_cpu;
+ unsigned int found_idle_cpu_easy;
+ unsigned int nofound_idle_cpu;
+ unsigned long find_time;
+ unsigned int steal;
+ unsigned int steal_fail;
#endif
#ifdef CONFIG_CPU_IDLE
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index d1c9429a4ac5..7063c9712f68 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -129,6 +129,15 @@ static int show_schedstat(struct seq_file *seq, void *v)
rq->rq_cpu_time,
rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ seq_printf(seq, " %u %u %u %u %lu %u %u",
+ rq->found_idle_cpu_easy,
+ rq->found_idle_cpu_capacity,
+ rq->found_idle_cpu,
+ rq->nofound_idle_cpu,
+ rq->find_time,
+ rq->steal,
+ rq->steal_fail);
+
seq_printf(seq, "\n");
/* domain-specific stats */
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index a612cf253c87..55f31a4df8fa 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -43,6 +43,17 @@ rq_sched_info_dequeue(struct rq *rq, unsigned long long delta)
#define schedstat_set(var, val) do { if (schedstat_enabled()) { var = (val); } } while (0)
#define schedstat_val(var) (var)
#define schedstat_val_or_zero(var) ((schedstat_enabled()) ? (var) : 0)
+#define schedstat_start_time() schedstat_val_or_zero(local_clock())
+#define schedstat_end_time(stat, time) \
+ do { \
+ unsigned long endtime; \
+ \
+ if (schedstat_enabled() && (time)) { \
+ endtime = local_clock() - (time) - schedstat_skid; \
+ schedstat_add((stat), endtime); \
+ } \
+ } while (0)
+extern unsigned long schedstat_skid;
void __update_stats_wait_start(struct rq *rq, struct task_struct *p,
struct sched_statistics *stats);
@@ -81,6 +92,8 @@ static inline void rq_sched_info_depart (struct rq *rq, unsigned long long delt
# define schedstat_set(var, val) do { } while (0)
# define schedstat_val(var) 0
# define schedstat_val_or_zero(var) 0
+# define schedstat_start_time() 0
+# define schedstat_end_time(stat, t) do { } while (0)
# define __update_stats_wait_start(rq, p, stats) do { } while (0)
# define __update_stats_wait_end(rq, p, stats) do { } while (0)
--
2.34.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 0/9] steal tasks to improve CPU utilization
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
` (8 preceding siblings ...)
2026-03-20 5:59 ` [RFC PATCH v5 9/9] sched/fair: Provide idle search schedstats Chen Jinghuang
@ 2026-03-20 8:53 ` Peter Zijlstra
2026-03-27 13:54 ` Valentin Schneider
2026-03-28 2:48 ` chenjinghuang
9 siblings, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2026-03-20 8:53 UTC (permalink / raw)
To: Chen Jinghuang
Cc: mingo, juri.lelli, vincent.guittot, linux-kernel, bsegall,
dietmar.eggemann, rostedt, mgorman, vschneid, kprateek.nayak,
chenl311, steve.sistare
On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
> When a CPU has no more CFS tasks to run, and idle_balance() fails to
> find a task, then attempt to steal a task from an overloaded CPU in the
> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
> identify candidates. To minimize search time, steal the first migratable
> task that is found when the bitmap is traversed. For fairness, search
> for migratable tasks on an overloaded CPU in order of next to run.
This makes no sense. It is the task of newidle to get more work -- if
that is failing for you, then we should fix that, not build a second way
to get tasks.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
2026-03-20 5:59 ` [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle Chen Jinghuang
@ 2026-03-22 12:09 ` kernel test robot
2026-03-23 0:14 ` kernel test robot
2026-03-23 4:51 ` kernel test robot
2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2026-03-22 12:09 UTC (permalink / raw)
To: Chen Jinghuang; +Cc: oe-kbuild-all
Hi Chen,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v7.0-rc4]
[cannot apply to tip/sched/core peterz-queue/sched/core next-20260320]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jinghuang/sched-Provide-sparsemask-a-reduced-contention-bitmap/20260321-084706
base: linus/master
patch link: https://lore.kernel.org/r/20260320055920.2518389-9-chenjinghuang2%40huawei.com
patch subject: [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
config: nios2-allnoconfig (https://download.01.org/0day-ci/archive/20260322/202603222032.E2CS3R19-lkp@intel.com/config)
compiler: nios2-linux-gcc (GCC) 11.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260322/202603222032.E2CS3R19-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603222032.E2CS3R19-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
>> kernel/sched/fair.c:13310:12: error: redefinition of 'try_steal'
13310 | static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
| ^~~~~~~~~
kernel/sched/fair.c:5195:19: note: previous definition of 'try_steal' with type 'int(struct rq *, struct rq_flags *)'
5195 | static inline int try_steal(struct rq *this_rq, struct rq_flags *rf)
| ^~~~~~~~~
>> kernel/sched/fair.c:13310:12: warning: 'try_steal' defined but not used [-Wunused-function]
13310 | static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
| ^~~~~~~~~
vim +/try_steal +13310 kernel/sched/fair.c
13303
13304 /*
13305 * Try to steal a runnable CFS task from a CPU in the same LLC as @dst_rq,
13306 * and migrate it to @dst_rq. rq_lock is held on entry and return, but
13307 * may be dropped in between. Return 1 on success, 0 on failure, and -1
13308 * if a task in a different scheduling class has become runnable on @dst_rq.
13309 */
13310 static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
13311 {
13312 int src_cpu;
13313 int dst_cpu = dst_rq->cpu;
13314 bool locked = true;
13315 int stolen = 0;
13316 struct sparsemask *overload_cpus;
13317
13318 if (!sched_feat(STEAL))
13319 return 0;
13320
13321 if (!cpu_active(dst_cpu))
13322 return 0;
13323
13324 if (dst_rq->avg_idle < SCHED_STEAL_COST)
13325 return 0;
13326
13327 /* Get bitmap of overloaded CPUs in the same LLC as @dst_rq */
13328
13329 rcu_read_lock();
13330 overload_cpus = rcu_dereference(dst_rq->cfs_overload_cpus);
13331 if (!overload_cpus) {
13332 rcu_read_unlock();
13333 return 0;
13334 }
13335
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
2026-03-20 5:59 ` [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle Chen Jinghuang
2026-03-22 12:09 ` kernel test robot
@ 2026-03-23 0:14 ` kernel test robot
2026-03-23 4:51 ` kernel test robot
2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2026-03-23 0:14 UTC (permalink / raw)
To: Chen Jinghuang; +Cc: llvm, oe-kbuild-all
Hi Chen,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v7.0-rc5]
[cannot apply to tip/sched/core peterz-queue/sched/core next-20260320]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jinghuang/sched-Provide-sparsemask-a-reduced-contention-bitmap/20260321-084706
base: linus/master
patch link: https://lore.kernel.org/r/20260320055920.2518389-9-chenjinghuang2%40huawei.com
patch subject: [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
config: arm-randconfig-001-20260323 (https://download.01.org/0day-ci/archive/20260323/202603230856.FTVv1O5z-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project b1cf9b0835d214bcbd0d6e8882760c07cfccb298)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260323/202603230856.FTVv1O5z-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603230856.FTVv1O5z-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/sched/fair.c:13272:3: warning: releasing raw_spinlock 'rq_lockp(dst_rq)' that was not held [-Wthread-safety-analysis]
13272 | raw_spin_rq_unlock(dst_rq);
| ^
>> kernel/sched/fair.c:13293:2: warning: raw_spinlock 'rq_lockp(dst_rq)' is not held on every path through here [-Wthread-safety-analysis]
13293 | local_irq_restore(rf.flags);
| ^
include/linux/irqflags.h:223:8: note: expanded from macro 'local_irq_restore'
223 | if (!raw_irqs_disabled_flags(flags)) \
| ^
include/linux/irqflags.h:188:13: note: expanded from macro 'raw_irqs_disabled_flags'
188 | typecheck(unsigned long, flags); \
| ^
kernel/sched/fair.c:13286:3: note: raw_spinlock acquired here
13286 | raw_spin_rq_lock(dst_rq);
| ^
kernel/sched/fair.c:13310:12: error: redefinition of 'try_steal'
13310 | static int try_steal(struct rq *dst_rq, struct rq_flags *dst_rf)
| ^
kernel/sched/fair.c:5195:19: note: previous definition is here
5195 | static inline int try_steal(struct rq *this_rq, struct rq_flags *rf)
| ^
2 warnings and 1 error generated.
vim +13272 kernel/sched/fair.c
13250
13251 /*
13252 * Attempt to migrate a CFS task from @src_cpu to @dst_rq. @locked indicates
13253 * whether @dst_rq is already locked on entry. This function may lock or
13254 * unlock @dst_rq, and updates @locked to indicate the locked state on return.
13255 * The locking protocol is based on idle_balance().
13256 * Returns 1 on success and 0 on failure.
13257 */
13258 static int steal_from(struct rq *dst_rq, struct rq_flags *dst_rf, bool *locked,
13259 int src_cpu)
13260 {
13261 struct task_struct *p;
13262 struct rq_flags rf;
13263 int stolen = 0;
13264 int dst_cpu = dst_rq->cpu;
13265 struct rq *src_rq = cpu_rq(src_cpu);
13266
13267 if (dst_cpu == src_cpu || src_rq->cfs.h_nr_runnable < 2)
13268 return 0;
13269
13270 if (*locked) {
13271 rq_unpin_lock(dst_rq, dst_rf);
13272 raw_spin_rq_unlock(dst_rq);
13273 *locked = false;
13274 }
13275 rq_lock_irqsave(src_rq, &rf);
13276 update_rq_clock(src_rq);
13277
13278 if (src_rq->cfs.h_nr_runnable < 2 || !cpu_active(src_cpu))
13279 p = NULL;
13280 else
13281 p = detach_next_task(&src_rq->cfs, dst_rq);
13282
13283 rq_unlock(src_rq, &rf);
13284
13285 if (p) {
13286 raw_spin_rq_lock(dst_rq);
13287 rq_repin_lock(dst_rq, dst_rf);
13288 *locked = true;
13289 update_rq_clock(dst_rq);
13290 attach_task(dst_rq, p);
13291 stolen = 1;
13292 }
13293 local_irq_restore(rf.flags);
13294
13295 return stolen;
13296 }
13297
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
2026-03-20 5:59 ` [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle Chen Jinghuang
2026-03-22 12:09 ` kernel test robot
2026-03-23 0:14 ` kernel test robot
@ 2026-03-23 4:51 ` kernel test robot
2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2026-03-23 4:51 UTC (permalink / raw)
To: Chen Jinghuang; +Cc: llvm, oe-kbuild-all
Hi Chen,
[This is a private test report for your RFC patch.]
kernel test robot noticed the following build warnings:
[auto build test WARNING on linus/master]
[also build test WARNING on v7.0-rc5]
[cannot apply to tip/sched/core peterz-queue/sched/core next-20260320]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jinghuang/sched-Provide-sparsemask-a-reduced-contention-bitmap/20260321-084706
base: linus/master
patch link: https://lore.kernel.org/r/20260320055920.2518389-9-chenjinghuang2%40huawei.com
patch subject: [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle
config: riscv-allmodconfig (https://download.01.org/0day-ci/archive/20260323/202603231203.g4WzI66q-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project b1cf9b0835d214bcbd0d6e8882760c07cfccb298)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260323/202603231203.g4WzI66q-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603231203.g4WzI66q-lkp@intel.com/
All warnings (new ones prefixed by >>):
kernel/sched/fair.c:13272:3: warning: releasing raw_spinlock 'rq_lockp(dst_rq)' that was not held [-Wthread-safety-analysis]
13272 | raw_spin_rq_unlock(dst_rq);
| ^
kernel/sched/fair.c:13293:2: warning: raw_spinlock 'rq_lockp(dst_rq)' is not held on every path through here [-Wthread-safety-analysis]
13293 | local_irq_restore(rf.flags);
| ^
include/linux/irqflags.h:223:8: note: expanded from macro 'local_irq_restore'
223 | if (!raw_irqs_disabled_flags(flags)) \
| ^
include/linux/irqflags.h:188:13: note: expanded from macro 'raw_irqs_disabled_flags'
188 | typecheck(unsigned long, flags); \
| ^
kernel/sched/fair.c:13286:3: note: raw_spinlock acquired here
13286 | raw_spin_rq_lock(dst_rq);
| ^
>> kernel/sched/fair.c:13366:2: warning: raw_spinlock 'rq_lockp(this_rq)' is not held on every path through here [-Wthread-safety-analysis]
13366 | stolen |= (dst_rq->cfs.h_nr_runnable > 0);
| ^
kernel/sched/fair.c:13363:3: note: raw_spinlock acquired here
13363 | raw_spin_rq_lock(dst_rq);
| ^
3 warnings generated.
vim +13366 kernel/sched/fair.c
13350
13351 /* Accept any suitable task in the LLC */
13352
13353 sparsemask_for_each(overload_cpus, dst_cpu, src_cpu) {
13354 if (steal_from(dst_rq, dst_rf, &locked, src_cpu)) {
13355 stolen = 1;
13356 goto out;
13357 }
13358 }
13359
13360 out:
13361 rcu_read_unlock();
13362 if (!locked) {
13363 raw_spin_rq_lock(dst_rq);
13364 rq_repin_lock(dst_rq, dst_rf);
13365 }
13366 stolen |= (dst_rq->cfs.h_nr_runnable > 0);
13367 if (dst_rq->nr_running != dst_rq->cfs.h_nr_runnable)
13368 stolen = -1;
13369 return stolen;
13370 }
13371
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus
2026-03-20 5:59 ` [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus Chen Jinghuang
@ 2026-03-24 13:56 ` kernel test robot
0 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2026-03-24 13:56 UTC (permalink / raw)
To: Chen Jinghuang
Cc: oe-lkp, lkp, Chen Jinghuang, linux-kernel, aubrey.li, yu.c.chen,
mingo, peterz, juri.lelli, vincent.guittot, bsegall,
dietmar.eggemann, rostedt, mgorman, vschneid, kprateek.nayak,
chenl311, steve.sistare, oliver.sang
Hello,
kernel test robot noticed "UBSAN:array-index-out-of-bounds_in_kernel/sched/sparsemask.h" on:
commit: 7c86475c501e5d551136d0c2c8702d7c61a6b73f ("[RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus")
url: https://github.com/intel-lab-lkp/linux/commits/Chen-Jinghuang/sched-Provide-sparsemask-a-reduced-contention-bitmap/20260321-084706
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 42bddab0563fe67882b2722620a66dd98c8dbf33
patch link: https://lore.kernel.org/all/20260320055920.2518389-5-chenjinghuang2@huawei.com/
patch subject: [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus
in testcase: boot
config: x86_64-randconfig-016-20260323
compiler: clang-20
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202603242133.f66e336f-lkp@intel.com
[ 3.884372][ T1] ------------[ cut here ]------------
[ 3.884372][ T1] UBSAN: array-index-out-of-bounds in kernel/sched/sparsemask.h:181:32
[ 3.884372][ T1] index 0 is out of range for type 'struct sparsemask_chunk[0]'
[ 3.884372][ T1] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc4-00274-g7c86475c501e #1 PREEMPT 3f169c7b099bc50b6180dc864130cea2b43f07ba
[ 3.884372][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 3.884372][ T1] Call Trace:
[ 3.884372][ T1] <TASK>
[ 3.884372][ T1] __dump_stack (lib/dump_stack.c:95)
[ 3.884372][ T1] dump_stack_lvl (lib/dump_stack.c:123 (discriminator 1))
[ 3.884372][ T1] dump_stack (lib/dump_stack.c:130)
[ 3.884372][ T1] ubsan_epilogue (lib/ubsan.c:234 (discriminator 2))
[ 3.884372][ T1] __ubsan_handle_out_of_bounds (lib/ubsan.c:?)
[ 3.884372][ T1] ? overload_set (include/linux/rcupdate.h:312 (discriminator 1) include/linux/rcupdate.h:850 (discriminator 1) kernel/sched/fair.c:5164 (discriminator 1))
[ 3.884372][ T1] overload_set (kernel/sched/sparsemask.h:181 (discriminator 4) kernel/sched/fair.c:5167 (discriminator 4))
[ 3.884372][ T1] ? overload_set (include/linux/rcupdate.h:312 (discriminator 1) include/linux/rcupdate.h:850 (discriminator 1) kernel/sched/fair.c:5164 (discriminator 1))
[ 3.884372][ T1] enqueue_task_fair (kernel/sched/fair.c:401 kernel/sched/fair.c:7092)
[ 3.884372][ T1] enqueue_task (kernel/sched/core.c:2111)
[ 3.884372][ T1] ttwu_do_activate (kernel/sched/core.c:2149 (discriminator 4) kernel/sched/core.c:3667 (discriminator 4))
[ 3.884372][ T1] try_to_wake_up (kernel/sched/sched.h:1883 kernel/sched/sched.h:1969 kernel/sched/core.c:3919 kernel/sched/core.c:4242)
[ 3.884372][ T1] wake_up_process (kernel/sched/core.c:4374)
[ 3.884372][ T1] __kthread_create_on_node (kernel/kthread.c:?)
[ 3.884372][ T1] kthread_create_on_node (kernel/kthread.c:562)
[ 3.884372][ T1] rcu_spawn_tasks_kthread_generic (kernel/rcu/tasks.h:682 (discriminator 512))
[ 3.884372][ T1] rcu_init_tasks_generic (kernel/rcu/tasks.h:1587)
[ 3.884372][ T1] do_one_initcall (init/main.c:1382)
[ 3.884372][ T1] ? stack_depot_save_flags (lib/stackdepot.c:?)
[ 3.884372][ T1] ? tasks_cblist_init_generic (kernel/rcu/tasks.h:1577)
[ 3.884372][ T1] ? kasan_save_track (arch/x86/include/asm/current.h:25 (discriminator 3) mm/kasan/common.c:70 (discriminator 3) mm/kasan/common.c:79 (discriminator 3))
[ 3.884372][ T1] ? kasan_save_track (mm/kasan/common.c:58 mm/kasan/common.c:78)
[ 3.884372][ T1] ? kasan_save_alloc_info (mm/kasan/generic.c:571 (discriminator 1))
[ 3.884372][ T1] ? __kasan_kmalloc (mm/kasan/common.c:419)
[ 3.884372][ T1] ? __kmalloc_noprof (mm/slub.c:5261 mm/slub.c:5272)
[ 3.884372][ T1] ? do_initcalls (init/main.c:1454)
[ 3.884372][ T1] ? do_basic_setup (init/main.c:1480)
[ 3.884372][ T1] ? kernel_init_freeable (init/main.c:1694)
[ 3.884372][ T1] ? kernel_init (init/main.c:1584)
[ 3.884372][ T1] ? ret_from_fork (arch/x86/kernel/process.c:164)
[ 3.884372][ T1] ? ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
[ 3.884372][ T1] ? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:179 (discriminator 1) kernel/locking/spinlock.c:194 (discriminator 1))
[ 3.884372][ T1] ? stack_depot_save_flags (lib/stackdepot.c:?)
[ 3.884372][ T1] ? stack_depot_save (lib/stackdepot.c:747)
[ 3.884372][ T1] ? set_track_prepare (mm/slub.c:1036)
[ 3.884372][ T1] ? do_initcalls (init/main.c:1454)
[ 3.884372][ T1] ? do_basic_setup (init/main.c:1480)
[ 3.884372][ T1] ? kernel_init_freeable (init/main.c:1694)
[ 3.884372][ T1] ? kernel_init (init/main.c:1584)
[ 3.884372][ T1] ? ret_from_fork (arch/x86/kernel/process.c:164)
[ 3.884372][ T1] ? ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
[ 3.884372][ T1] ? next_arg (lib/cmdline.c:273)
[ 3.884372][ T1] ? parameq (kernel/params.c:90 (discriminator 1) kernel/params.c:99 (discriminator 1))
[ 3.884372][ T1] ? parse_args (kernel/params.c:153 kernel/params.c:186)
[ 3.884372][ T1] do_initcall_level (init/main.c:1443 (discriminator 6))
[ 3.884372][ T1] do_initcalls (init/main.c:1457 (discriminator 2))
[ 3.884372][ T1] ? kernel_init (init/main.c:1584)
[ 3.884372][ T1] do_basic_setup (init/main.c:1480)
[ 3.884372][ T1] kernel_init_freeable (init/main.c:1694)
[ 3.884372][ T1] ? rest_init (init/main.c:1574)
[ 3.884372][ T1] kernel_init (init/main.c:1584)
[ 3.884372][ T1] ? rest_init (init/main.c:1574)
[ 3.884372][ T1] ret_from_fork (arch/x86/kernel/process.c:164)
[ 3.884372][ T1] ? rest_init (init/main.c:1574)
[ 3.884372][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:258)
[ 3.884372][ T1] </TASK>
[ 3.884372][ T1] ---[ end trace ]---
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260324/202603242133.f66e336f-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 0/9] steal tasks to improve CPU utilization
2026-03-20 8:53 ` [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Peter Zijlstra
@ 2026-03-27 13:54 ` Valentin Schneider
2026-03-28 2:48 ` chenjinghuang
1 sibling, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2026-03-27 13:54 UTC (permalink / raw)
To: Peter Zijlstra, Chen Jinghuang
Cc: mingo, juri.lelli, vincent.guittot, linux-kernel, bsegall,
dietmar.eggemann, rostedt, mgorman, kprateek.nayak, chenl311,
steve.sistare
On 20/03/26 09:53, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>
> This makes no sense. It is the task of newidle to get more work -- if
> that is failing for you, then we should fix that, not build a second way
> to get tasks.
Yeah I don't recall exactly how it was stitched together, but IIRC Steve's
patches had stealing as the default newidle balance up until we got to NUMA
balancing because it acted funny there.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 0/9] steal tasks to improve CPU utilization
2026-03-20 8:53 ` [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Peter Zijlstra
2026-03-27 13:54 ` Valentin Schneider
@ 2026-03-28 2:48 ` chenjinghuang
2026-04-20 4:01 ` chenjinghuang
1 sibling, 1 reply; 18+ messages in thread
From: chenjinghuang @ 2026-03-28 2:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo@redhat.com, juri.lelli@redhat.com,
vincent.guittot@linaro.org, linux-kernel@vger.kernel.org,
bsegall@google.com, dietmar.eggemann@arm.com, rostedt@goodmis.org,
mgorman@suse.de, vschneid@redhat.com, kprateek.nayak@amd.com,
chenl311@chinatelecom.cn, steve.sistare@oracle.com
On 3/20/2026 4:53 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>
> This makes no sense. It is the task of newidle to get more work -- if
> that is failing for you, then we should fix that, not build a second way
> to get tasks.
>
>
>
Hi Peter,
That is a very valid point. However, profiling data indicates that
newidle_balance() still incurs excessive scanning overhead. Sharing the
expensive tick balance path is becoming an issue on high-core-count
systems.
As highlighted in recent ILB threads, profiling sqlite on a 224-CPU
Sapphire Rapids shows:
https://lore.kernel.org/all/cover.1690273854.git.yu.c.chen@intel.com/
6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
Walking the sd hierarchy in update_sd_lb_stats alone consumes over 5% of
CPU cycles. If we spend all that time scanning just to pull a tiny,
short-lived task, the search overhead dwarfs the actual runtime. It's a net
loss.
To mitigate this cost, I'd like to check if you are open to these two
directions:
1.Goal: As Tim questioned in the thread above:
"Do we always have to find the busiest group and pull from it? Would a
relatively busy group be enough?"
Instead of chasing absolute fairness, shouldn't newidle_balance()
prioritize fast task acquisition? A task-stealing mechanism can be
effective in this regard.
2.Refactoring: Instead of a standalone fix, we could track LLC overload in
sched_domain_shared and add a lightweight fast path atop newidle_balance(),
gated by rq->avg_idle.
Do you think refactoring along these lines to integrate into the existing
framework makes sense?
Thanks, Chen Jinghuang
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v5 0/9] steal tasks to improve CPU utilization
2026-03-28 2:48 ` chenjinghuang
@ 2026-04-20 4:01 ` chenjinghuang
0 siblings, 0 replies; 18+ messages in thread
From: chenjinghuang @ 2026-04-20 4:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mingo@redhat.com, juri.lelli@redhat.com,
vincent.guittot@linaro.org, linux-kernel@vger.kernel.org,
bsegall@google.com, dietmar.eggemann@arm.com, rostedt@goodmis.org,
mgorman@suse.de, vschneid@redhat.com, kprateek.nayak@amd.com,
chenl311@chinatelecom.cn, steve.sistare@oracle.com
On 3/28/2026 10:48 AM, chenjinghuang wrote:
> On 3/20/2026 4:53 PM, Peter Zijlstra wrote:
>> On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
>>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>>> find a task, then attempt to steal a task from an overloaded CPU in the
>>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>>> identify candidates. To minimize search time, steal the first migratable
>>> task that is found when the bitmap is traversed. For fairness, search
>>> for migratable tasks on an overloaded CPU in order of next to run.
>>
>> This makes no sense. It is the task of newidle to get more work -- if
>> that is failing for you, then we should fix that, not build a second way
>> to get tasks.
>>
>>
>>
> Hi Peter,
>
> That is a very valid point. However, profiling data indicates that
> newidle_balance() still incurs excessive scanning overhead. Sharing the
> expensive tick balance path is becoming an issue on high-core-count
> systems.
>
> As highlighted in recent ILB threads, profiling sqlite on a 224-CPU
> Sapphire Rapids shows:
>
> https://lore.kernel.org/all/cover.1690273854.git.yu.c.chen@intel.com/
>
> 6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
> 5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats
>
> Walking the sd hierarchy in update_sd_lb_stats alone consumes over 5% of
> CPU cycles. If we spend all that time scanning just to pull a tiny,
> short-lived task, the search overhead dwarfs the actual runtime. It's a net
> loss.
>
> To mitigate this cost, I'd like to check if you are open to these two
> directions:
>
> 1.Goal: As Tim questioned in the thread above:
>
> "Do we always have to find the busiest group and pull from it? Would a
> relatively busy group be enough?"
>
> Instead of chasing absolute fairness, shouldn't newidle_balance()
> prioritize fast task acquisition? A task-stealing mechanism can be
> effective in this regard.
>
> 2.Refactoring: Instead of a standalone fix, we could track LLC overload in
> sched_domain_shared and add a lightweight fast path atop newidle_balance(),
> gated by rq->avg_idle.
>
> Do you think refactoring along these lines to integrate into the existing
> framework makes sense?
>
> Thanks, Chen Jinghuang
>
Hi,
Gentle ping on this RFC.
I'd appreciate any feedback whenever you get time, or let me know if I
should resend/rework anything.
Regards,
Chen Jinghuang
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2026-04-20 4:01 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-20 5:59 [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 1/9] sched: Provide sparsemask, a reduced contention bitmap Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 2/9] sched/topology: Provide hooks to allocate data shared per LLC Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 3/9] sched/topology: Provide cfs_overload_cpus bitmap Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 4/9] sched/fair: Dynamically update cfs_overload_cpus Chen Jinghuang
2026-03-24 13:56 ` kernel test robot
2026-03-20 5:59 ` [RFC PATCH v5 5/9] sched/fair: Hoist idle_stamp up from idle_balance Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 6/9] sched/fair: Generalize the detach_task interface Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 7/9] sched/fair: Provide can_migrate_task_llc Chen Jinghuang
2026-03-20 5:59 ` [RFC PATCH v5 8/9] sched/fair: Steal work from an overloaded CPU when CPU goes idle Chen Jinghuang
2026-03-22 12:09 ` kernel test robot
2026-03-23 0:14 ` kernel test robot
2026-03-23 4:51 ` kernel test robot
2026-03-20 5:59 ` [RFC PATCH v5 9/9] sched/fair: Provide idle search schedstats Chen Jinghuang
2026-03-20 8:53 ` [RFC PATCH v5 0/9] steal tasks to improve CPU utilization Peter Zijlstra
2026-03-27 13:54 ` Valentin Schneider
2026-03-28 2:48 ` chenjinghuang
2026-04-20 4:01 ` chenjinghuang
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.