* [RFC PATCH] sched: idle: Introduce CPU-specific idle=poll
From: Aaron Tomlin @ 2025-06-21 23:57 UTC (permalink / raw)
To: tglx, mingo, bp, dave.hansen, x86, peterz, juri.lelli,
vincent.guittot
Cc: hpa, oleg, atomlin, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid, linux-kernel
Currently, the idle=poll kernel boot parameter applies globally, forcing
all CPUs into a shallow polling idle state to ensure ultra-low latency
responsiveness. While this is beneficial for extremely latency-sensitive
workloads, this global application lacks flexibility and can lead to
significant power inefficiency. This is particularly evident in systems
with a high CPU count, such as those utilising the
Full Dynticks/Adaptive Tick feature (i.e., nohz_full). In such
environments, only a subset of CPUs might genuinely require
sub-microsecond responsiveness, while others, though active, could
benefit from entering deeper idle states to conserve power.
This patch addresses this limitation by introducing the ability to
configure idle=poll on a per-CPU basis. This new feature allows
administrators to specifically designate which CPUs are permitted to
remain in the polling idle state.
This enables a significant reduction in power consumption through a more
nuanced power management strategy. CPUs running workloads with stringent
ultra-low-latency requirements can continue to benefit from idle=poll,
while other CPUs that are in Full Dynticks mode but not constantly busy
can dynamically enter deeper, power-saving idle states. This granular
control offers significantly enhanced flexibility and efficiency
compared to the previous system-wide limitation of idle=poll.
Consider a CPU configured in Full Dynticks mode with idle=poll in effect.
A "perf report" from such a system, even when the CPU is largely idle,
frequently reveals the following dominant activity:
99.70% swapper [kernel.kallsyms] [k] cpu_idle_poll.isra.0
0.10% swapper [kernel.kallsyms] [k] sched_tick
0.10% swapper [kernel.kallsyms] [k] native_read_msr
0.10% swapper [kernel.kallsyms] [k] native_sched_clock
The high percentage of time spent in cpu_idle_poll() indicates the CPU is
spending virtually all its time busy-looping in a shallow polling idle
state. This behaviour, while ensuring responsiveness, directly translates
to substantial, unnecessary power consumption for CPUs that are not
"actively" processing latency-critical workloads.
Now consider nohz_full=2-47 and idle=poll,2-26 (an example command line is
shown after the list below). This setup allows for a highly optimised
balance between extreme performance for critical workloads and significant
energy efficiency for the rest of the system:
- Dedicated Responsiveness. Cores 2-26 provide unparalleled low latency
for the most critical workloads by remaining in constant polling,
consciously trading increased power consumption for absolute speed
and predictability.
- Significant Power Savings. Cores 27-47 achieve substantial energy
conservation by entering the deeper idle states that a global idle=poll
previously precluded, directly addressing and mitigating the power
waste observed in the perf report above.
- Enhanced Flexibility. This approach avoids the previous
"all-or-nothing" trade-off inherent in a global idle=poll setting. It
empowers administrators with fine-grained control, enabling a
precisely tuned power and performance profile for specific
application needs and optimising resource utilisation across the
entire 48-core system.
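As an illustrative sketch only (assuming the cpulist syntax introduced by
this patch), such a 48-core system could be booted with:

  nohz_full=2-47 idle=poll,2-26

leaving CPUs 0-1 for housekeeping, CPUs 2-26 polling for minimum latency,
and CPUs 27-47 free to enter deeper C-states whenever they go idle.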
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
arch/x86/kernel/process.c | 27 +++++++++++++++++++++++----
include/linux/cpu.h | 1 +
kernel/sched/idle.c | 33 ++++++++++++++++++++-------------
3 files changed, 44 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index c1d2dac72b9c..43d0cc2bed73 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -973,15 +973,34 @@ void __init arch_post_acpi_subsys_init(void)
pr_info("System has AMD C1E erratum E400. Workaround enabled.\n");
}
+cpumask_var_t idle_poll_mask;
+EXPORT_SYMBOL_GPL(idle_poll_mask);
+
+static int __init idle_poll_setup(char *str)
+{
+ int err = 0;
+
+ if (cpulist_parse(str, idle_poll_mask) < 0) {
+ pr_warn("idle poll: incorrect CPU range\n");
+ err = 1;
+ } else {
+ boot_option_idle_override = IDLE_POLL;
+ cpu_idle_poll_update(idle_poll_mask);
+ }
+
+ return err;
+}
+
static int __init idle_setup(char *str)
{
if (!str)
return -EINVAL;
- if (!strcmp(str, "poll")) {
- pr_info("using polling idle threads\n");
- boot_option_idle_override = IDLE_POLL;
- cpu_idle_poll_ctrl(true);
+ if (!strncmp(str, "poll,", 5)) {
+ str += 5;
+ idle_poll_setup(str);
+ } else if (!strcmp(str, "poll")) {
+ cpu_idle_poll_update(cpu_present_mask);
} else if (!strcmp(str, "halt")) {
/* 'idle=halt' HALT for idle. C-states are disabled. */
boot_option_idle_override = IDLE_HALT;
diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index e6089abc28e2..ce909b1839c9 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -164,6 +164,7 @@ static inline void suspend_enable_secondary_cpus(void) { }
void __noreturn cpu_startup_entry(enum cpuhp_state state);
void cpu_idle_poll_ctrl(bool enable);
+void cpu_idle_poll_update(const struct cpumask *mask);
bool cpu_in_idle(unsigned long pc);
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 2c85c86b455f..86365bbbc111 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -19,22 +19,29 @@ void sched_idle_set_state(struct cpuidle_state *idle_state)
idle_set_state(this_rq(), idle_state);
}
-static int __read_mostly cpu_idle_force_poll;
+static DEFINE_PER_CPU(int, idle_force_poll);
void cpu_idle_poll_ctrl(bool enable)
{
if (enable) {
- cpu_idle_force_poll++;
- } else {
- cpu_idle_force_poll--;
- WARN_ON_ONCE(cpu_idle_force_poll < 0);
- }
+ this_cpu_inc(idle_force_poll);
+ } else
+ WARN_ON_ONCE(this_cpu_dec_return(idle_force_poll) < 0);
+}
+
+void cpu_idle_poll_update(const struct cpumask *mask)
+{
+ int cpu;
+
+ pr_info_once("using polling idle threads\n");
+ for_each_cpu(cpu, mask)
+ per_cpu(idle_force_poll, cpu) = 1;
}
#ifdef CONFIG_GENERIC_IDLE_POLL_SETUP
static int __init cpu_idle_poll_setup(char *__unused)
{
- cpu_idle_force_poll = 1;
+ cpu_idle_poll_update(cpu_present_mask);
return 1;
}
@@ -42,8 +49,6 @@ __setup("nohlt", cpu_idle_poll_setup);
static int __init cpu_idle_nopoll_setup(char *__unused)
{
- cpu_idle_force_poll = 0;
-
return 1;
}
__setup("hlt", cpu_idle_nopoll_setup);
@@ -51,14 +56,16 @@ __setup("hlt", cpu_idle_nopoll_setup);
static noinline int __cpuidle cpu_idle_poll(void)
{
+ int cpu = smp_processor_id();
+
instrumentation_begin();
- trace_cpu_idle(0, smp_processor_id());
+ trace_cpu_idle(0, cpu);
stop_critical_timings();
ct_cpuidle_enter();
raw_local_irq_enable();
while (!tif_need_resched() &&
- (cpu_idle_force_poll || tick_check_broadcast_expired()))
+ (per_cpu(idle_force_poll, cpu) || tick_check_broadcast_expired()))
cpu_relax();
raw_local_irq_disable();
@@ -78,7 +85,7 @@ void __weak arch_cpu_idle_exit(void) { }
void __weak __noreturn arch_cpu_idle_dead(void) { while (1); }
void __weak arch_cpu_idle(void)
{
- cpu_idle_force_poll = 1;
+ this_cpu_inc(idle_force_poll);
}
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE
@@ -318,7 +325,7 @@ static void do_idle(void)
* broadcast device expired for us, we don't want to go deep
* idle as we know that the IPI is going to arrive right away.
*/
- if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
+ if (__this_cpu_read(idle_force_poll) || tick_check_broadcast_expired()) {
tick_nohz_idle_restart_tick();
cpu_idle_poll();
} else {
--
2.49.0
* Re: [RFC PATCH] sched: idle: Introduce CPU-specific idle=poll
From: Peter Zijlstra @ 2025-06-23 10:23 UTC (permalink / raw)
To: Aaron Tomlin
Cc: tglx, mingo, bp, dave.hansen, x86, juri.lelli, vincent.guittot,
hpa, oleg, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel
On Sat, Jun 21, 2025 at 07:57:45PM -0400, Aaron Tomlin wrote:
> Currently, the idle=poll kernel boot parameter applies globally, forcing
> all CPUs into a shallow polling idle state to ensure ultra-low latency
> responsiveness. While this is beneficial for extremely latency-sensitive
> workloads, this global application lacks flexibility and can lead to
> significant power inefficiency. This is particularly evident in systems
> with a high CPU count, such as those utilising the
> Full Dynticks/Adaptive Tick feature (i.e., nohz_full). In such
> environments, only a subset of CPUs might genuinely require
> sub-microsecond responsiveness, while others, though active, could
> benefit from entering deeper idle states to conserve power.
Can't we already do this at runtime with pmqos? If you set your latency
demand very low, it should end up picking the poll state, no? And you
can do this per-cpu.
* Re: [RFC PATCH] sched: idle: Introduce CPU-specific idle=poll
From: Mel Gorman @ 2025-06-23 21:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Aaron Tomlin, tglx, mingo, bp, dave.hansen, x86, juri.lelli,
vincent.guittot, hpa, oleg, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel
On Mon, Jun 23, 2025 at 12:23:34PM +0200, Peter Zijlstra wrote:
> On Sat, Jun 21, 2025 at 07:57:45PM -0400, Aaron Tomlin wrote:
> > Currently, the idle=poll kernel boot parameter applies globally, forcing
> > all CPUs into a shallow polling idle state to ensure ultra-low latency
> > responsiveness. While this is beneficial for extremely latency-sensitive
> > workloads, this global application lacks flexibility and can lead to
> > significant power inefficiency. This is particularly evident in systems
> > with a high CPU count, such as those utilising the
> > Full Dynticks/Adaptive Tick feature (i.e., nohz_full). In such
> > environments, only a subset of CPUs might genuinely require
> > sub-microsecond responsiveness, while others, though active, could
> > benefit from entering deeper idle states to conserve power.
>
> Can't we already do this at runtime with pmqos? If you set your latency
> demand very low, it should end up picking the poll state, no? And you
> can do this per-cpu.
Yes, we can. idle=poll can be hazardous in weird ways and it's not like
pmqos is hard to use. For example, let's say you had an RT application with
latency constraints running on isolated CPUs while leaving housekeeping
CPUs alone, then it's simply a case of:
for CPU in $ISOLATED_CPUS; do
SYSFS_PARAM="/sys/devices/system/cpu/cpu$CPU/power/pm_qos_resume_latency_us"
if [ ! -e $SYSFS_PARAM ]; then
echo "WARNING: Unable to set PM QOS max latency for CPU $CPU\n"
continue
fi
echo $MAX_EXIT_LATENCY > $SYSFS_PARAM
echo "Set PM QOS maximum resume latency on CPU $CPU to ${MAX_EXIT_LATENCY}us"
done
In too many cases I've seen idle=poll being used when the user didn't know
PM QOS existed. The most common response I've received is that the latency
requirements were unknown resulting in much headbanging off the table.
Don't get me started on the hazards of limiting c-states by index without
checking what the c-states actually are, or splitting isolated/housekeeping
across SMT siblings.
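For what it's worth, a minimal sketch of checking what the c-states on a
given CPU actually are (index, name and worst-case exit latency in
microseconds) before limiting anything by index, using cpu0 only as an
example:

for STATE in /sys/devices/system/cpu/cpu0/cpuidle/state*; do
	# print the state directory (index), human-readable name and exit latency
	echo "$(basename $STATE): $(cat $STATE/name), exit latency $(cat $STATE/latency)us"
done

Substitute whichever CPU you actually care about for cpu0.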
--
Mel Gorman
SUSE Labs
* Re: [RFC PATCH] sched: idle: Introduce CPU-specific idle=poll
From: Aaron Tomlin @ 2025-06-25 13:39 UTC (permalink / raw)
To: Mel Gorman
Cc: Peter Zijlstra, tglx, mingo, bp, dave.hansen, x86, juri.lelli,
vincent.guittot, hpa, oleg, dietmar.eggemann, rostedt, bsegall,
mgorman, vschneid, linux-kernel
On Mon, Jun 23, 2025 at 10:49:59PM +0100, Mel Gorman wrote:
> > > Full Dynticks/Adaptive Tick feature (i.e., nohz_full). In such
> > > environments, only a subset of CPUs might genuinely require
> > > sub-microsecond responsiveness, while others, though active, could
> > > benefit from entering deeper idle states to conserve power.
> >
> > Can't we already do this at runtime with pmqos? If you set your latency
> > demand very low, it should end up picking the poll state, no? And you
> > can do this per-cpu.
>
> Yes, we can. idle=poll can be hazardous in weird ways and it's not like
> pmqos is hard to use. For example, let's say you had an RT application with
> latency constraints running on isolated CPUs while leaving housekeeping
> CPUs alone, then it's simply a case of:
>
> for CPU in $ISOLATED_CPUS; do
> SYSFS_PARAM="/sys/devices/system/cpu/cpu$CPU/power/pm_qos_resume_latency_us"
> if [ ! -e $SYSFS_PARAM ]; then
> echo "WARNING: Unable to set PM QOS max latency for CPU $CPU\n"
> continue
> fi
> echo $MAX_EXIT_LATENCY > $SYSFS_PARAM
> echo "Set PM QOS maximum resume latency on CPU $CPU to ${MAX_EXIT_LATENCY}us"
> done
>
>
> In too many cases I've seen idle=poll being used when the user didn't know
> PM QOS existed. The most common response I've received is that the latency
> requirements were unknown resulting in much headbanging off the table.
> Don't get me started on the hazards of limiting c-states by index without
> checking what the c-states actually are, or splitting isolated/housekeeping
> across SMT siblings.
>
> --
> Mel Gorman
> SUSE Labs
Hi Peter, Mel,
Interesting. I was not aware of PM QOS. I will look into it, thank you!
As far as I can tell, the function cpu_idle_poll_ctrl() is used in a few
locations to ensure the running CPU does not enter a "deep" idle state,
i.e., it uses cpu_idle_poll() only. I do not see why cpu_idle_force_poll
should remain global. Perhaps I am missing something?
Kind regards,
--
Aaron Tomlin
* Re: [RFC PATCH] sched: idle: Introduce CPU-specific idle=poll
From: Aaron Tomlin @ 2025-08-31 22:29 UTC (permalink / raw)
To: Mel Gorman, Peter Zijlstra
Cc: tglx, mingo, bp, dave.hansen, x86, juri.lelli, vincent.guittot,
hpa, oleg, dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
linux-kernel, atomlin
On Mon, Jun 23, 2025 at 10:49:59PM +0100, Mel Gorman wrote:
> On Mon, Jun 23, 2025 at 12:23:34PM +0200, Peter Zijlstra wrote:
> > On Sat, Jun 21, 2025 at 07:57:45PM -0400, Aaron Tomlin wrote:
> > > Currently, the idle=poll kernel boot parameter applies globally, forcing
> > > all CPUs into a shallow polling idle state to ensure ultra-low latency
> > > responsiveness. While this is beneficial for extremely latency-sensitive
> > > workloads, this global application lacks flexibility and can lead to
> > > significant power inefficiency. This is particularly evident in systems
> > > with a high CPU count, such as those utilising the
> > > Full Dynticks/Adaptive Tick feature (i.e., nohz_full). In such
> > > environments, only a subset of CPUs might genuinely require
> > > sub-microsecond responsiveness, while others, though active, could
> > > benefit from entering deeper idle states to conserve power.
> >
> > Can't we already do this at runtime with pmqos? If you set your latency
> > demand very low, it should end up picking the poll state, no? And you
> > can do this per-cpu.
>
> Yes, we can. idle=poll can be hazardous in weird ways and it's not like
> pmqos is hard to use. For example, let's say you had an RT application with
> latency constraints running on isolated CPUs while leaving housekeeping
> CPUs alone, then it's simply a case of:
>
> for CPU in $ISOLATED_CPUS; do
> SYSFS_PARAM="/sys/devices/system/cpu/cpu$CPU/power/pm_qos_resume_latency_us"
> if [ ! -e $SYSFS_PARAM ]; then
> echo "WARNING: Unable to set PM QOS max latency for CPU $CPU\n"
> continue
> fi
> echo $MAX_EXIT_LATENCY > $SYSFS_PARAM
> echo "Set PM QOS maximum resume latency on CPU $CPU to ${MAX_EXIT_LATENCY}us"
> done
>
>
> In too many cases I've seen idle=poll being used when the user didn't know
> PM QOS existed. The most common response I've received is that the latency
> requirements were unknown resulting in much headbanging off the table.
> Don't get me started on the hazards of limiting c-states by index without
> checking what the c-states actually are, or splitting isolated/housekeeping
> across SMT siblings.
Mel, Peter,
Yes, I can confirm that the PM QoS subsystem allows one to set constraints
on a per-CPU basis or for the entire system, by specifying a maximum
allowed resume latency. As per pm_qos_resume_latency_us_store(), a value of
"n/a" will prevent the specified CPU from entering even the shallowest CPU
idle state (namely "C1"), given its non-zero exit latency, leaving only the
polling state available.
Indeed, using "idle=poll" to prevent a CPU from entering idle C-states is
problematic given its crude, all-or-nothing approach.
Thank you for the suggestion.
...
do_idle
cpuidle_idle_call
{
next_state = cpuidle_select(drv, dev, &stop_tick)
// cpuidle_curr_governor->select(drv, dev, stop_tick)
menu_select(drv, dev, stop_tick)
{
latency_req = cpuidle_governor_latency_req(dev->cpu)
{
*device = get_cpu_device(cpu)
device_req = dev_pm_qos_raw_resume_latency(device)
global_req = cpu_latency_qos_limit()
if (device_req > global_req)
device_req = global_req
return (s64)device_req * NSEC_PER_USEC
}
if (unlikely(drv->state_count <= 1 || latency_req == 0) || ...
... ) {
// A CPU idle driver with more than one C-state and a
// latency requirement of 0 will force state 0 (POLL)
*stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING)
return 0
}
}
entered_state = call_cpuidle(drv, dev, next_state)
}
crash> p cpuidle_curr_governor
cpuidle_curr_governor = $1 = (struct cpuidle_governor *) 0xffffffff9ab913e0 <menu_governor>
crash> p cpuidle_curr_governor.select
$2 = (int (*)(struct cpuidle_driver *, struct cpuidle_device *, bool *)) 0xffffffff99157ed0 <menu_select>
crash> p cpuidle_curr_driver
cpuidle_curr_driver = $3 = (struct cpuidle_driver *) 0xffffffff9ab04dc0 <intel_idle_driver>
crash> p ((struct cpuidle_driver *)0xffffffff9ab04dc0)->states[0].enter
$4 = (int (*)(struct cpuidle_device *, struct cpuidle_driver *, int)) 0xffffffff994d32a0 <poll_idle>
crash> p -d ((struct cpuidle_driver *)0xffffffff9ab04dc0)->states[1].exit_latency_ns
$5 = 2000
crash> p -d ((struct cpuidle_driver *)0xffffffff9ab04dc0)->states[0].exit_latency_ns
$6 = 0
# cat /sys/devices/system/cpu/cpuidle/current_driver
intel_idle
# cat /sys/devices/system/cpu/cpu7/cpuidle/state0/name
POLL
# cat /sys/devices/system/cpu/cpu7/cpuidle/state0/latency
0
# echo "n/a" > /sys/devices/system/cpu/cpu7/power/pm_qos_resume_latency_us
#
# Samples: 2K of event 'cpu-cycles:k'
# Event count (approx.): 89401819821
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... .............................................. ..................................................
#
99.80% 0.05% swapper [kernel.kallsyms] [k] do_idle
99.80% 0.00% swapper [kernel.kallsyms] [k] common_startup_64
99.80% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
99.80% 0.00% swapper [kernel.kallsyms] [k] start_secondary
99.75% 0.00% swapper [kernel.kallsyms] [k] cpuidle_idle_call
99.34% 0.05% swapper [kernel.kallsyms] [k] cpuidle_enter_state
99.34% 0.00% swapper [kernel.kallsyms] [k] cpuidle_enter
98.97% 98.56% swapper [kernel.kallsyms] [k] poll_idle
swapper 0 [007] 1203865.059685: 6998090 cpu-cycles:k:
ffffffff994d32f8 poll_idle+0x58 ([kernel.kallsyms])
ffffffff994d16b4 cpuidle_enter_state+0x84 ([kernel.kallsyms])
ffffffff99156241 cpuidle_enter+0x31 ([kernel.kallsyms])
ffffffff9844ab77 cpuidle_idle_call+0xf7 ([kernel.kallsyms])
ffffffff9844ac68 do_idle+0x78 ([kernel.kallsyms])
ffffffff9844aec9 cpu_startup_entry+0x29 ([kernel.kallsyms])
ffffffff983663db start_secondary+0x12b ([kernel.kallsyms])
ffffffff9831452d common_startup_64+0x13e ([kernel.kallsyms])
--
Aaron Tomlin