* [PATCH v2 0/2] cpuidle: governor: Modify the handling of stopped tick

From: Rafael J. Wysocki @ 2026-02-23 15:37 UTC
To: Linux PM
Cc: LKML, Christian Loehle, Doug Smythies, Aboorva Devarajan,
    Ionut Nechita (Sunlight Linux)

Hi All,

This is an update of

https://lore.kernel.org/linux-pm/1953482.tdWV9SEqCh@rafael.j.wysocki/

that fixes an issue in the second patch.  The first patch has not changed
and the changelog below still applies.

While thinking about possible ways to address high CPU wakeup latency on
isolated CPUs, resulting from the selection of deep idle states by cpuidle
governors, it occurred to me that it is not always necessary to select a
deep idle state when the scheduler tick has been stopped.  Namely, if a
timer is going to trigger relatively soon, a shallow state may as well be
selected, because that timer will kick the CPU out of it anyway, so
getting stuck in it for a long time is not a concern.

Changing the menu governor to take that observation into account is a
2-line patch, modulo a comment update (patch [1/2]).  Of course, the
SAFE_TIMER_RANGE_NS value is somewhat arbitrary.

Updating the teo governor accordingly is a bit more challenging, but
overall it is a major simplification of the stopped tick handling there,
so IMV it is very much worth doing (patch [2/2]).

By itself, this is not going to help workloads running on isolated CPUs
too much, but if SAFE_TIMER_RANGE_NS were replaced with a per-CPU tunable,
that could help people configure their systems to avoid the latency issue
mentioned above.

Thanks,
Rafael
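[Editor's illustration, not part of the thread.] The gist of the series can be sketched as a tiny standalone decision helper. All names and values below are illustrative rather than the kernel's: `adjust_prediction()` is a hypothetical function, and the `TICK_NSEC` stand-in assumes HZ=250 (a 4 ms tick), so the "safe timer range" is 8 ms:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants: TICK_NSEC for an assumed HZ=250 kernel. */
#define TICK_NSEC_EX            4000000ULL
#define SAFE_TIMER_RANGE_NS_EX  (2 * TICK_NSEC_EX)

/*
 * Hypothetical helper mirroring the idea: with the tick stopped, only
 * override a short prediction with the next-timer distance when no timer
 * is close enough to act as a safety net.
 */
static uint64_t adjust_prediction(bool tick_stopped, uint64_t predicted_ns,
                                  uint64_t next_timer_ns)
{
        if (tick_stopped && predicted_ns < TICK_NSEC_EX &&
            next_timer_ns > SAFE_TIMER_RANGE_NS_EX)
                return next_timer_ns;   /* no safety net: assume a long idle */

        return predicted_ns;            /* a timer will wake the CPU soon anyway */
}
```

With a per-CPU tunable in place of the fixed range, the same helper would simply take the range as a parameter.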
* [PATCH v2 1/2] cpuidle: governors: menu: Refine stopped tick handling

From: Rafael J. Wysocki @ 2026-02-23 15:38 UTC
To: Linux PM
Cc: LKML, Christian Loehle, Doug Smythies, Aboorva Devarajan,
    Ionut Nechita (Sunlight Linux)

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This change is based on the observation that it is not in fact necessary
to select a deep idle state every time the scheduler tick has been
stopped before the idle state selection takes place.  Namely, if the
time till the closest timer (that is not the tick) is short enough,
a shallow idle state can be selected because the timer will kick the
CPU out of that state, so the damage from a possible overly optimistic
selection will be limited.

Update the menu governor in accordance with the above and use twice
the tick period length as the "safe timer range" for allowing the
original predicted_ns value to be used even if the tick has been
stopped.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2: No changes

---
 drivers/cpuidle/governors/gov.h  |  5 +++++
 drivers/cpuidle/governors/menu.c | 15 +++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

--- a/drivers/cpuidle/governors/gov.h
+++ b/drivers/cpuidle/governors/gov.h
@@ -10,5 +10,10 @@
  * check the time till the closest expected timer event.
  */
 #define RESIDENCY_THRESHOLD_NS (15 * NSEC_PER_USEC)
+/*
+ * If the closest timer is in this range, the governor idle state selection need
+ * not be adjusted after the scheduler tick has been stopped.
+ */
+#define SAFE_TIMER_RANGE_NS (2 * TICK_NSEC)
 
 #endif /* __CPUIDLE_GOVERNOR_H */
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -261,13 +261,16 @@ static int menu_select(struct cpuidle_dr
 		predicted_ns = min((u64)timer_us * NSEC_PER_USEC, predicted_ns);
 		/*
 		 * If the tick is already stopped, the cost of possible short
-		 * idle duration misprediction is much higher, because the CPU
-		 * may be stuck in a shallow idle state for a long time as a
-		 * result of it.  In that case, say we might mispredict and use
-		 * the known time till the closest timer event for the idle
-		 * state selection.
+		 * idle duration misprediction is higher because the CPU may get
+		 * stuck in a shallow idle state then.  To avoid that, if
+		 * predicted_ns is small enough, say it might be mispredicted
+		 * and use the known time till the closest timer for idle state
+		 * selection unless that timer is going to trigger within
+		 * SAFE_TIMER_RANGE_NS in which case it can be regarded as a
+		 * sufficient safety net.
 		 */
-		if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC)
+		if (tick_nohz_tick_stopped() && predicted_ns < TICK_NSEC &&
+		    data->next_timer_ns > SAFE_TIMER_RANGE_NS)
 			predicted_ns = data->next_timer_ns;
 	} else {
 		/*
* Re: [PATCH v2 1/2] cpuidle: governors: menu: Refine stopped tick handling

From: Christian Loehle @ 2026-03-05 10:45 UTC
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Doug Smythies, Aboorva Devarajan, Ionut Nechita (Sunlight Linux)

On 2/23/26 15:38, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This change is based on the observation that it is not in fact necessary
> to select a deep idle state every time the scheduler tick has been
> stopped before the idle state selection takes place.  Namely, if the
> time till the closest timer (that is not the tick) is short enough,
> a shallow idle state can be selected because the timer will kick the
> CPU out of that state, so the damage from a possible overly optimistic
> selection will be limited.
>
> Update the menu governor in accordance with the above and use twice
> the tick period length as the "safe timer range" for allowing the
> original predicted_ns value to be used even if the tick has been
> stopped.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[...]

So FWIW both patches look sane to me.  I'm still trying to get a test
setup to see what this looks like and should look like, but for now:

Reviewed-by: Christian Loehle <christian.loehle@arm.com>
* Re: [PATCH v2 1/2] cpuidle: governors: menu: Refine stopped tick handling

From: Ionut Nechita (Wind River) @ 2026-04-03 17:07 UTC
To: rafael
Cc: aboorvad, christian.loehle, dsmythies, linux-kernel, linux-pm,
    sunlightlinux

On Mon, 23 Feb 2026 16:38:55 +0100, Rafael J. Wysocki wrote:
> Update the menu governor in accordance with the above and use twice
> the tick period length as the "safe timer range" for allowing the
> original predicted_ns value to be used even if the tick has been
> stopped.

Tested this on 6.12.79-rt17 with isolated CPUs (nohz_full=1-16,
isolcpus=nohz,domain,managed_irq,1-16) on an Intel Xeon Gold 6338N:

cyclictest --priority 95 --nsecs --duration 600 --affinity 1-15 \
           --threads 15 --mainaffinity 0

Before (6.12.79-rt17 without patch): Avg: ~1780ns, Max T:3-T:8:  9300-9700ns
After  (6.12.79-rt17 + this patch):  Avg: ~1790ns, Max T:3-T:14: 5200-6100ns

The patch reduces worst-case latency on threads T:3-T:14 from ~9500ns to
~5800ns on isolated CPUs with nohz_full.  T:0-T:2 still show occasional
higher spikes (9400-10700ns), but the overall tail latency improvement is
clear.

Tested-by: Ionut Nechita <sunlightlinux@gmail.com>
* [PATCH v2 2/2] cpuidle: governors: teo: Rearrange stopped tick handling

From: Rafael J. Wysocki @ 2026-02-23 15:40 UTC
To: Linux PM
Cc: LKML, Christian Loehle, Doug Smythies, Aboorva Devarajan,
    Ionut Nechita (Sunlight Linux)

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

This change is based on the observation that it is not in fact necessary
to select a deep idle state every time the scheduler tick has been
stopped before the idle state selection takes place.  Namely, if the
time till the closest timer (that is not the tick) is short enough,
a shallow idle state can be selected because the timer will kick the
CPU out of that state, so the damage from a possible overly optimistic
selection will be limited.

Update the teo governor in accordance with the above, in analogy with
the menu governor update in the previous patch.

Among other things, this will cause the teo governor to call
tick_nohz_get_sleep_length() whenever the tick has been stopped and only
change the original idle state selection if the time till the closest
timer is beyond SAFE_TIMER_RANGE_NS, which is way more straightforward
than the current code flow.

Of course, this effectively throws away some recent teo governor changes,
but the resulting simplification is worth it in my view.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

v1 -> v2: Take constraint_idx into account when looking for a deeper idle
          state (Christian)

---
 drivers/cpuidle/governors/teo.c | 81 ++++++++++++++++------------------
 1 file changed, 34 insertions(+), 47 deletions(-)

--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -407,50 +407,13 @@ static int teo_select(struct cpuidle_dri
 	 * better choice.
 	 */
 	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
-		int min_idx = idx0;
-
-		if (tick_nohz_tick_stopped()) {
-			/*
-			 * Look for the shallowest idle state below the current
-			 * candidate one whose target residency is at least
-			 * equal to the tick period length.
-			 */
-			while (min_idx < idx &&
-			       drv->states[min_idx].target_residency_ns < TICK_NSEC)
-				min_idx++;
-
-			/*
-			 * Avoid selecting a state with a lower index, but with
-			 * the same target residency as the current candidate
-			 * one.
-			 */
-			if (drv->states[min_idx].target_residency_ns ==
-			    drv->states[idx].target_residency_ns)
-				goto constraint;
-		}
-
-		/*
-		 * If the minimum state index is greater than or equal to the
-		 * index of the state with the maximum intercepts metric and
-		 * the corresponding state is enabled, there is no need to look
-		 * at the deeper states.
-		 */
-		if (min_idx >= intercept_max_idx &&
-		    !dev->states_usage[min_idx].disable) {
-			idx = min_idx;
-			goto constraint;
-		}
-
 		/*
 		 * Look for the deepest enabled idle state, at most as deep as
 		 * the one with the maximum intercepts metric, whose target
 		 * residency had not been greater than the idle duration in over
 		 * a half of the relevant cases in the past.
-		 *
-		 * Take the possible duration limitation present if the tick
-		 * has been stopped already into account.
 		 */
-		for (i = idx - 1, intercept_sum = 0; i >= min_idx; i--) {
+		for (i = idx - 1, intercept_sum = 0; i >= idx0; i--) {
 			intercept_sum += cpu_data->state_bins[i].intercepts;
 
 			if (dev->states_usage[i].disable)
@@ -463,7 +426,6 @@ static int teo_select(struct cpuidle_dri
 		}
 	}
 
-constraint:
 	/*
 	 * If there is a latency constraint, it may be necessary to select an
 	 * idle state shallower than the current candidate one.
@@ -472,13 +434,13 @@ constraint:
 		idx = constraint_idx;
 
 	/*
-	 * If either the candidate state is state 0 or its target residency is
-	 * low enough, there is basically nothing more to do, but if the sleep
-	 * length is not updated, the subsequent wakeup will be counted as an
-	 * "intercept" which may be problematic in the cases when timer wakeups
-	 * are dominant.  Namely, it may effectively prevent deeper idle states
-	 * from being selected at one point even if no imminent timers are
-	 * scheduled.
+	 * If the tick has not been stopped and either the candidate state is
+	 * state 0 or its target residency is low enough, there is basically
+	 * nothing more to do, but if the sleep length is not updated, the
+	 * subsequent wakeup will be counted as an "intercept".  That may be
+	 * problematic in the cases when timer wakeups are dominant because it
+	 * may effectively prevent deeper idle states from being selected at one
+	 * point even if no imminent timers are scheduled.
 	 *
 	 * However, frequent timers in the RESIDENCY_THRESHOLD_NS range on one
 	 * CPU are unlikely (user space has a default 50 us slack value for
@@ -494,7 +456,8 @@ constraint:
 	 * shallow idle states regardless of the wakeup type, so the sleep
 	 * length need not be known in that case.
 	 */
-	if ((!idx || drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
+	if (!tick_nohz_tick_stopped() && (!idx ||
+	    drv->states[idx].target_residency_ns < RESIDENCY_THRESHOLD_NS) &&
 	    (2 * cpu_data->short_idles >= cpu_data->total ||
 	     latency_req < LATENCY_THRESHOLD_NS))
 		goto out_tick;
@@ -502,6 +465,30 @@ constraint:
 	duration_ns = tick_nohz_get_sleep_length(&delta_tick);
 	cpu_data->sleep_length_ns = duration_ns;
 
+	/*
+	 * If the tick has been stopped and the closest timer is too far away,
+	 * update the selection to prevent the CPU from getting stuck in a
+	 * shallow idle state for too long.
+	 */
+	if (tick_nohz_tick_stopped() && duration_ns > SAFE_TIMER_RANGE_NS &&
+	    drv->states[idx].target_residency_ns < TICK_NSEC) {
+		/*
+		 * Look for the deepest enabled idle state with exit latency
+		 * within the PM QoS limit and with target residency within
+		 * duration_ns.
+		 */
+		for (i = constraint_idx; i > idx; i--) {
+			if (dev->states_usage[i].disable)
+				continue;
+
+			if (drv->states[i].target_residency_ns <= duration_ns) {
+				idx = i;
+				break;
+			}
+		}
+		return idx;
+	}
+
 	if (!idx)
 		goto out_tick;
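[Editor's illustration, not part of the thread.] The new teo fallback loop above can be modeled as a standalone function over a hypothetical, simplified state table; `struct state_ex` and `pick_deeper_state()` are illustrative names, not the kernel's types:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for drv->states[] / dev->states_usage[].disable. */
struct state_ex {
        uint64_t target_residency_ns;
        int disabled;
};

/*
 * Sketch of the fallback: starting from the deepest state allowed by the
 * latency constraint, pick the first enabled one whose target residency
 * fits within the known sleep length; otherwise keep the shallow candidate.
 */
static int pick_deeper_state(const struct state_ex *states, int constraint_idx,
                             int idx, uint64_t duration_ns)
{
        for (int i = constraint_idx; i > idx; i--) {
                if (states[i].disabled)
                        continue;

                if (states[i].target_residency_ns <= duration_ns)
                        return i;
        }
        return idx;
}
```

Walking down from constraint_idx rather than up from idx is what makes the v2 change take the PM QoS limit into account.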
* Re: [PATCH v2 2/2] cpuidle: governors: teo: Rearrange stopped tick handling

From: Christian Loehle @ 2026-03-05 10:45 UTC
To: Rafael J. Wysocki, Linux PM
Cc: LKML, Doug Smythies, Aboorva Devarajan, Ionut Nechita (Sunlight Linux)

On 2/23/26 15:40, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> This change is based on the observation that it is not in fact necessary
> to select a deep idle state every time the scheduler tick has been
> stopped before the idle state selection takes place.  Namely, if the
> time till the closest timer (that is not the tick) is short enough,
> a shallow idle state can be selected because the timer will kick the
> CPU out of that state, so the damage from a possible overly optimistic
> selection will be limited.
>
> Update the teo governor in accordance with the above in analogy with
> the menu governor update in the previous patch.
>
> Among other things, this will cause the teo governor to call
> tick_nohz_get_sleep_length() every time when the tick has been
> stopped already and only change the original idle state selection
> if the time till the closest timer is beyond SAFE_TIMER_RANGE_NS
> which is way more straightforward than the current code flow.
>
> Of course, this effectively throws away some recent teo governor
> changes, but the resulting simplification is worth it in my view.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

[...]

Reviewed-by: Christian Loehle <christian.loehle@arm.com>