From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 0C23A2FF66C for ; Wed, 12 Nov 2025 09:51:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762941092; cv=none; b=e9yec3HoTETEh1Vu6T98Mu7hOTPjU2zvP4v29j7gdUQeoDdbpSVMFm/BrW1et72uG6ig+SwSZ0DBnREDpUKsSM95ryFPbgM0NsIslaEhHTnfUe5aPwyIj/uxve1hPjB+yh3QQAdQvME4qb9sjz1BWRRFPmujDZwYjat3MdlrlfY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762941092; c=relaxed/simple; bh=pluTr9oSP0JDpWAWT2rPvadTtR3WY8/KrKbtCxzriYw=; h=Message-ID:Date:MIME-Version:Subject:From:To:Cc:References: In-Reply-To:Content-Type; b=OAK0Sbwl7sDyUx9FwPP6eMMkX6XdP+Jk9PZQBFB7EHTnPTwFM3JbJriClta2gBQVZ4WXNDQSl6uyGN547PXaE85oRaIi0B8iLkWLmpCmrlpmYYg4BGFw2p4cLCjZge0tKJ18sDS8Idy6r7ag1LB8jvyCg9SYltGEirgDf215Hks= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 4F1501515; Wed, 12 Nov 2025 01:51:21 -0800 (PST) Received: from [10.1.28.59] (unknown [10.1.28.59]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id DF3583F66E; Wed, 12 Nov 2025 01:51:27 -0800 (PST) Message-ID: Date: Wed, 12 Nov 2025 09:51:25 +0000 Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Regression in TEO cpuidle governor between 6.6 and 6.12 From: Christian Loehle To: "Rafael J. Wysocki" Cc: Reka Norman , daniel.lezcano@linaro.org, linux-pm@vger.kernel.org References: <0c018867-c092-4f8e-8f7a-32bb02de3ad5@arm.com> <2a429c41-8624-408c-9db0-4450ab76e52f@arm.com> <939deff8-7856-4d9b-be91-eda06fac21d0@arm.com> Content-Language: en-US In-Reply-To: <939deff8-7856-4d9b-be91-eda06fac21d0@arm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 11/11/25 17:20, Christian Loehle wrote: > On 11/11/25 11:48, Rafael J. Wysocki wrote: >> On Tue, Nov 11, 2025 at 11:48 AM Christian Loehle >> wrote: >>> >>> On 11/11/25 10:00, Christian Loehle wrote: >>>> On 11/11/25 04:23, Reka Norman wrote: >>>>> On Mon, Nov 10, 2025 at 11:06 PM Christian Loehle >>>>> wrote: >>>>>> >>>>>> On 11/10/25 06:10, Reka Norman wrote: >>>>>>> On Mon, Nov 10, 2025 at 7:35 AM Christian Loehle >>>>>>> wrote: >>>>>>>> >>>>>>>> On 11/7/25 11:35, Rafael J. Wysocki wrote: >>>>>>>>> On Fri, Nov 7, 2025 at 4:28 AM Reka Norman wrote: >>>>>>>>>> >>>>>>>>>> On Fri, Nov 7, 2025 at 7:33 AM Rafael J. Wysocki wrote: >>>>>>>>>>> >>>>>>>>>>> On Thu, Nov 6, 2025 at 12:13 PM Christian Loehle >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 11/5/25 20:48, Rafael J. Wysocki wrote: >>>>>>>>>>>>> On Wed, Nov 5, 2025 at 12:24 AM Christian Loehle >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 11/4/25 09:03, Christian Loehle wrote: >>>>>>>>>>>>>>> On 11/4/25 03:36, Reka Norman wrote: >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m seeing a regression in the TEO governor between 6.6 and 6.12. At >>>>>>>>>>>>>>>> 6.12, when the system is idle it’s spending almost 100% of time in >>>>>>>>>>>>>>>> WFI, compared to about 6% at 6.6. At mainline it has improved compared >>>>>>>>>>>>>>>> to 6.12 but is still a lot worse than 6.6, spending about 50% in WFI. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The system is a ChromeOS device with Mediatek MT8196. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Bisecting showed the specific commit which caused the regression is: >>>>>>>>>>>>>>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’ve attached sysfs dumps showing the issue. All were taken a couple >>>>>>>>>>>>>>>> of minutes after boot, with the device having been idle since boot. >>>>>>>>>>>>>>>> The cases tested are: >>>>>>>>>>>>>>>> cpuidle_6_6.txt = 6.6 kernel >>>>>>>>>>>>>>>> cpuidle_6_12.txt = 6.6 kernel with teo commits up to 6.12 >>>>>>>>>>>>>>>> cpuidle_mainline.txt = 6.6 kernel with teo commits up to mainline >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Summary of the percentage time spent in each state (averaged across CPUs): >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> | | 6.6 | 6.12 | mainline | >>>>>>>>>>>>>>>> |------------|------:|------:|---------:| >>>>>>>>>>>>>>>> | WFI | 6.02 | 99.94 | 56.84 | >>>>>>>>>>>>>>>> | cpuoff | 11.02 | 0 | 0.65 | >>>>>>>>>>>>>>>> | clusteroff | 82.96 | 0.05 | 42.51 | >>>>>>>>>>>>>>>> | s2idle | 0 | 0 | 0 | >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Any help would be much appreciated. Let me know if there's any other >>>>>>>>>>>>>>>> debugging information I should provide. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> That's not good. >>>>>>>>>>>>>>> If the system is mostly idle (only boot activity but dumps are taken after >>>>>>>>>>>>>>> ~3mins?), what is causing the wakeups? Even in 6.6 There are definitely more >>>>>>>>>>>>>>> than I would've expected? >>>>>>>>>>>>>>> I noticed that clusteroff and cpuoff have equal residency, which is >>>>>>>>>>>>>>> obviously a bit awkward for cpuidle, but shouldn't be relevant to your issue. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm a bit puzzled by your bisect results. >>>>>>>>>>>>>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>>>>>>>>>>>>> made the intercept logic *less* prone to count (false) intercepts, yet it >>>>>>>>>>>>>>> seems to count more of them? (resulting in more WFI). >>>>>>>>>>>>>>> I'll think about it some more, for now of course a trace would be very >>>>>>>>>>>>>>> helpful. (cpuidle events, ipi_raise, irqs?) >>>>>>>>>>>>>>> Are there ever any latency constraints set? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> FWIW the mainline results look the most reasonable, from a 30000 feet view >>>>>>>>>>>>>>> anyway: >>>>>>>>>>>>>>> Cluster State above below usage above% below% >>>>>>>>>>>>>>> LITTLE cpuoff-l ~75 ~65 ~140 23% 20% >>>>>>>>>>>>>>> LITTLE clusteroff-l ~800 0 ~100 89% 0% >>>>>>>>>>>>>>> MID cpuoff-m ~3–4 ~15 ~20 15% 55% >>>>>>>>>>>>>>> MID clusteroff-m ~1300 0 ~4000 24% 0% >>>>>>>>>>>>>>> BIG cpuoff-b 0 1 1 — — >>>>>>>>>>>>>>> BIG clusteroff-b ~800 0 ~1900 30% 0% >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> (WFI seems mostly the correct choice for little CPUs, that's fine, the energy >>>>>>>>>>>>>>> savings compared to cpuoff should be marginal anyway.) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Do you mind trying: >>>>>>>>>>>>>>> 13ed5c4a6d9c cpuidle: teo: Skip getting the sleep length if wakeups are very frequent >>>>>>>>>>>>>>> on 6.12? >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> So just thinking out loud, the only case I can actually thing of to explain your >>>>>>>>>>>>>> bisect to 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>>>>>>>>>>>> is that the workload essentially changed dramatically because of our calls >>>>>>>>>>>>>> to tick_nohz_get_sleep_length() now. >>>>>>>>>>>>>> I'm not sure how likely I think that is, but I'm lacking imagination for another >>>>>>>>>>>>>> cause. That's why results with >>>>>>>>>>>>>> 13ed5c4a6d9c ("cpuidle: teo: Skip getting the sleep length if wakeups are very frequent") >>>>>>>>>>>>>> would be interesting. >>>>>>>>>>>>> >>>>>>>>>>>>> My current theory is that this issue is related to the >>>>>>>>>>>>> tick_nohz_get_sleep_length() overhead and the way "intercepts" are >>>>>>>>>>>>> distinguished from "hits" in teo. >>>>>>>>>>>>> >>>>>>>>>>>>> Namely, teo assumes that its own overhead is negligible and so it >>>>>>>>>>>>> counts a given event as an "intercept" if the measured time spent in >>>>>>>>>>>>> the idle state (with the exit latency roughly taken into account) >>>>>>>>>>>>> falls into a different "state bin" than the sleep length (the expected >>>>>>>>>>>>> time till the next timer). However, the sleep length is computed as a >>>>>>>>>>>>> difference between the upcoming timer wakeup event time and >>>>>>>>>>>>> ts->idle_entrytime, so it actually includes the time taken by >>>>>>>>>>>>> tick_nohz_next_event(). If the latter is significant, it may >>>>>>>>>>>>> contribute to the difference seen by teo_update() and cause extra >>>>>>>>>>>>> "intercepts" to appear. >>>>>>>>>>>> >>>>>>>>>>>> Right, additionally with psci pc-mode and the exposed clusteroff states we end >>>>>>>>>>>> up vastly exaggerating the wakeup latency (i.e. underestimating the actual idle time) >>>>>>>>>>>> for three reasons: >>>>>>>>>>>> - wakeup latency = entry+exit latency (worst case: pay full latencies on both >>>>>>>>>>>> even though for most cases we don't incur the entry latency) >>>>>>>>>>>> - Wakeup latency is a worst-case and often is more like 2x-3x of the average. >>>>>>>>>>>> - We use the (higher) clusteroff values even though the clusteroff state couldn't >>>>>>>>>>>> possibly have been entered as not the entire cluster is idle. >>>>>>>>>>>> >>>>>>>>>>>> Nonetheless these are all just a "intercept counting is significantly more likely" >>>>>>>>>>>> while the results show not a single state >0 entered => the intercept logic >>>>>>>>>>>> probably triggers every cpuidle entry. >>>>>>>>>>> >>>>>>>>>>> It has to for this to happen, if timers are not frequent enough. >>>>>>>>>>> >>>>>>>>>>>> Feels like there should be an issue in the feedback loop. >>>>>>>>>>> >>>>>>>>>>> I'm wondering what the issue could be though. The change in commit >>>>>>>>>>> 4b20b07ce72f only affects the cases when idle state 0 is about to be >>>>>>>>>>> selected and it only really changes the sleep length value from >>>>>>>>>>> KTIME_MAX to something more realistic (but it still may be KTIME_MAX). >>>>>>>>>>> >>>>>>>>>>> It may turn an "intercept" into a "hit", but only if the CPU is not >>>>>>>>>>> woken up by the tick because those cases had been already counted as >>>>>>>>>>> "hits" before commit 4b20b07ce72f. >>>>>>>>>>> >>>>>>>>>>> Now, if the majority of wakeups in the workload are tick wakeups, the >>>>>>>>>>> only real difference appears to be the presence of >>>>>>>>>>> tick_nohz_get_sleep_length() in that code path. >>>>>>>>>>> >>>>>>>>>>> Frankly, I would try to remove the update of cpu_data->sleep_length_ns >>>>>>>>>>> right before the "goto out_tick" statement (in 6.12 that should be >>>>>>>>>>> line 426) and see what happens. >>>>>>>>>> >>>>>>>>>> Just tried this quickly. Results attached. It goes back to behaving >>>>>>>>>> the same as 6.6 - about 2% WFI. >>>>>>>>> >>>>>>>>> Thanks for checking this! It means that the >>>>>>>>> tick_nohz_get_sleep_length() overhead doesn't matter here that much. >>>>>>>>> >>>>>>>>> Instead of making the change above, can you please try the 6.12 >>>>>>>>> equivalent of the attached patch? >>>>>>>>> >>>>>>>>> Or alternatively, apply this one to the mainline and see if it changes >>>>>>>>> the idle states selection proportions? >>>>>>> >>>>>>> It doesn't seem to have an effect. Results are attached for both 6.12 >>>>>>> (6_12_teo_reflect) and mainline (mainline_teo_reflect). 6.12 is still >>>>>>> 100% WFI and mainline is about 20% WFI on average, the same as before. >>>>>>> >>>>>>>> >>>>>>>> I don't quite follow this. >>>>>>>> While I don't really believe that the tick_nohz_get_sleep_length() overhead >>>>>>>> plays a role here, how does removing that assignment prove it isn't? >>>>>>>> >>>>>>>> The below (if that's what you meant) might lead to the overhead being optimized >>>>>>>> out? [1] >>>>>>> >>>>>>> Oh true. I tried the diff below instead (which I think should avoid >>>>>>> that?). This also behaves the same as 6.6. Results attached as >>>>>>> 6_12_sleep_length2. >>>>>> >>>>>> Thanks for testing >>>>>> >>>>>>> >>>>>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>>>>>> index 173ddcac540a..3c9595bc6c80 100644 >>>>>>> --- a/drivers/cpuidle/governors/teo.c >>>>>>> +++ b/drivers/cpuidle/governors/teo.c >>>>>>> @@ -427,8 +427,7 @@ static int teo_select(struct cpuidle_driver *drv, >>>>>>> struct cpuidle_device *dev, >>>>>>> * We have to query the sleep length here otherwise we don't >>>>>>> * know after wakeup if our guess was correct. >>>>>>> */ >>>>>>> - duration_ns = tick_nohz_get_sleep_length(&delta_tick); >>>>>>> - cpu_data->sleep_length_ns = duration_ns; >>>>>>> + volatile s64 __maybe_unused duration_volatile_ns = >>>>>>> tick_nohz_get_sleep_length(&delta_tick); >>>>>>> goto out_tick; >>>>>>> } >>>>>>> >>>>>>>> I'd be curious if [2] behaves like 6.12. So far I haven't been able to >>>>>>>> reproduce the issue Reka is seeing. >>>>>>> >>>>>>> Results with [2] attached as 6_12_sleep_length3. Yes, it behaves like 6.12. >>>>>>> >>>>>> >>>>>> Hmm interesting. >>>>>> >>>>>>>> There's one oddity that immediately came to mind: state1 and state2 having >>>>>>>> the same residency (slightly different latency though), but when I set these >>>>>>>> as such teo works fine. So I don't think it is an issue here. >>>>>>>> >>>>>>>> If Rafael's patch fixes the issue I'd still be curious what the predicted sleep >>>>>>>> length values are here (they must be <1ms, but do in fact never trigger?). >>>>>>> >>>>>>> In case it's helpful, I've attached a trace (trace_duration.dat) which >>>>>>> includes a trace_printk of the duration_ns at this point (on top of >>>>>>> unmodified 6.12). >>>>>> While I was gonna ask for this anyway, it's not that useful (duration_ns is some >>>>>> time in the far future, most wakeups are obviously ticks). >>>>>> May I ask you to run the following hopefully one last time, just to check if we're >>>>>> not missing something here: >>>>> >>>>> Trace attached. >>>>> >>>> >>>> Thank you! >>>> So the intercept bin counts themselves aren't the issue: >>>> -0 [007] d..1. 1097.218324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=62615312 last_residency_ns=997307 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000 >>>> -0 [007] d..1. 1097.218324: cpu_idle: state=0 cpu_id=7 >>>> -0 [007] d..1. 1097.219321: cpu_idle: state=4294967295 cpu_id=7 >>>> -0 [007] d.h1. 1097.219322: irq_handler_entry: irq=19 name=arch_timer >>>> -0 [007] d.h1. 1097.219323: irq_handler_exit: irq=19 ret=handled >>>> -0 [007] d..1. 1097.219324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=61615389 last_residency_ns=997538 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000 >>>> -0 [007] d..1. 1097.219324: cpu_idle: state=0 cpu_id=7 >>>> -0 [007] d..1. 1097.220321: cpu_idle: state=4294967295 cpu_id=7 >>>> -0 [007] d.h1. 1097.220322: irq_handler_entry: irq=19 name=arch_timer >>>> -0 [007] d.h1. 1097.220323: irq_handler_exit: irq=19 ret=handled >>>> -0 [007] d..1. 1097.220324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=60615312 last_residency_ns=997538 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000 >>>> -0 [007] d..1. 1097.220324: cpu_idle: state=0 cpu_id=7 >>>> -0 [007] d..1. 1097.221321: cpu_idle: state=4294967295 cpu_id=7 >>>> -0 [007] d.h1. 1097.221322: irq_handler_entry: irq=19 name=arch_timer >>>> -0 [007] d.h1. 1097.221323: irq_handler_exit: irq=19 ret=handled >>>> -0 [007] d..1. 1097.221324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=59615312 last_residency_ns=997538 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000 >>>> >>>> >>>> 386 if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) >>>> will be 2*7 > 8206 - (7 + 8192) => True >>>> The intercept state selection will choose a state that covers at least half of the >>>> intercepts, but only bin 0 has intercepts at all => state0 selected. >>>> >>>> I see two issues: >>>> 1) Because of DECAY_SHIFT 3 values < 8 cannot decay (I guess this wouldn't really be an issue without 2)) >> >> This shouldn't be a problem. > > Agreed, it should be a non-issue. Nonetheless if this wasn't the case $subject would've likely > never been an issue. > >> >>>> 2) if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) isn't an appropriate check, it will >>>> exclude the state if it its idx_hit_sum make up the vast majority of cpu_data->total (i.e. it would >>>> have been a really good candidate actually). >> >> Well, it would exclude the state if the sum of hits for the states >> below it is large enough. This is questionable (because why would >> hits matter here), but I attempted to make the change below and >> somebody reported a regression IIRC. >> >> This check is related to the problem at hand though (see below). >> >>>> >>>> I lightly tested the below, it seems to be at least comparable to mainline teo. >>>> (the documentation/comments would need adapting too, of course) >>>> >>>> -----8<----- >>>> >>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>>> index bfa55c1eab5b..f8f76e3b8364 100644 >>>> --- a/drivers/cpuidle/governors/teo.c >>>> +++ b/drivers/cpuidle/governors/teo.c >>>> @@ -355,7 +355,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, >>>> * all of the deeper states, a shallower idle state is likely to be a >>>> * better choice. >>>> */ >>>> - if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) { >>>> + if (2 * idx_intercept_sum > idx_hit_sum) { >>>> int first_suitable_idx = idx; >>>> >>>> /* >>>> >>>> >>> >>> ... nevermind the patch, idx_hit_sum is of course the sum of 0...idx-1. >>> Maybe something like this, again lightly tested: >>> >>> -----8<----- >>> >>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>> index 173ddcac540a..6bfb9cedb75e 100644 >>> --- a/drivers/cpuidle/governors/teo.c >>> +++ b/drivers/cpuidle/governors/teo.c >>> @@ -383,13 +395,15 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, >>> * has been stopped already into account. >>> */ >>> intercept_sum = 0; >>> + hit_sum = 0; >>> >>> for (i = idx - 1; i >= 0; i--) { >>> struct teo_bin *bin = &cpu_data->state_bins[i]; >>> >>> intercept_sum += bin->intercepts; >>> + hit_sum += bin->hits; >>> >>> - if (2 * intercept_sum > idx_intercept_sum) { >>> + if (2 * intercept_sum > cpu_data->total || 2 * hit_sum > cpu_data->total) { >>> /* >>> * Use the current state unless it is too >>> * shallow or disabled, in which case take the >> >> This will only matter after the deepest state has been rejected >> already and on the system in question this means selecting state 0 no >> matter what. >> > > Ah, right! > > >> The pre-6.12 behavior can be explained if tick wakeups are taken into account. >> >> Namely, when state 0 is chosen (because of the check mentioned above), >> the tick is not stopped and the sleep length is KTIME_MAX. If the >> subsequent wakeup is a tick one, it will be counted as a hit on the >> deepest state (and it will contribute to the total sum in the check >> mentioned above). Then, at one point, cpu_data->total will be large >> enough and the deepest state will become the candidate one. If >> tick_nohz_get_sleep_length() returns a large value at that point, the >> tick will be stopped and the deepest state will be entered. Nirvana >> ensues. > > So fundamentally we will have to count tick-wakeups as a) nothing, which > doesn't allow us to ever break out of the intercept logic that caused us > to leave the tick on b) intercepts, which is bonkers and doesn't allow us > to ever break out and c) hits == sleep_length would've been accurate. > Of course counting a tick wakeup as a hit for sleep_length negates the > intercept logic. > >> >> The change in commit 4b20b07ce72f causes the sleep length to fall >> below the deepest idle state target residency at least sometimes in >> the scenario above, and if the subsequent wakeup is a tick one, it >> doesn't matter how it is counted - it will contribute to selecting >> idle state 0. >> >> The mainline is affected to a lesser extent because it sometimes does >> what pre-6.12 did. >> >> IMV addressing this would require changing the check you've identified >> as the culprit, but I'm not sure how to change it TBH. > > I guess we don't really have a choice but to have every 10th or so idle > during intercept try the sleep_length state with tick off (i.e. disregard > intercept) and see if that's still the case? If the sleep length is now > accurate and we haven't been woken up then keep disregarding until it does > again? > It's the smartest thing I can come up with for now, but I'll give it some > more thought! Alternatively we could have the above but have it depend on getting measured_ns significantly below delta_tick, but that might be a bit risky, I'll send patches for both later with some preliminary results unless someone objects. For 6.12 specifically we also get away with for now with just backporting (up to) 13ed5c4a6d9c ("cpuidle: teo: Skip getting the sleep length if wakeups are very frequent"). Obviously this doesn't fix the underlying problem, which menu can also suffer from AFAICT.