From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2a429c41-8624-408c-9db0-4450ab76e52f@arm.com>
Date: Tue, 11 Nov 2025 10:00:05 +0000
X-Mailing-List: linux-pm@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Regression in TEO cpuidle governor between 6.6 and 6.12
To: Reka Norman
Cc: "Rafael J. Wysocki" , daniel.lezcano@linaro.org, linux-pm@vger.kernel.org
References: <0c018867-c092-4f8e-8f7a-32bb02de3ad5@arm.com>
Content-Language: en-US
From: Christian Loehle
In-Reply-To:
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 11/11/25 04:23, Reka Norman wrote:
> On Mon, Nov 10, 2025 at 11:06 PM Christian Loehle wrote:
>>
>> On 11/10/25 06:10, Reka Norman wrote:
>>> On Mon, Nov 10, 2025 at 7:35 AM Christian Loehle wrote:
>>>>
>>>> On 11/7/25 11:35, Rafael J. Wysocki wrote:
>>>>> On Fri, Nov 7, 2025 at 4:28 AM Reka Norman wrote:
>>>>>>
>>>>>> On Fri, Nov 7, 2025 at 7:33 AM Rafael J. Wysocki wrote:
>>>>>>>
>>>>>>> On Thu, Nov 6, 2025 at 12:13 PM Christian Loehle wrote:
>>>>>>>>
>>>>>>>> On 11/5/25 20:48, Rafael J. Wysocki wrote:
>>>>>>>>> On Wed, Nov 5, 2025 at 12:24 AM Christian Loehle wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/4/25 09:03, Christian Loehle wrote:
>>>>>>>>>>> On 11/4/25 03:36, Reka Norman wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I’m seeing a regression in the TEO governor between 6.6 and 6.12. At
>>>>>>>>>>>> 6.12, when the system is idle it’s spending almost 100% of the time in
>>>>>>>>>>>> WFI, compared to about 6% at 6.6. Mainline has improved compared
>>>>>>>>>>>> to 6.12 but is still a lot worse than 6.6, spending about 50% in WFI.
>>>>>>>>>>>>
>>>>>>>>>>>> The system is a ChromeOS device with a Mediatek MT8196.
>>>>>>>>>>>>
>>>>>>>>>>>> Bisecting showed that the specific commit which caused the regression is:
>>>>>>>>>>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
>>>>>>>>>>>>
>>>>>>>>>>>> I’ve attached sysfs dumps showing the issue. All were taken a couple
>>>>>>>>>>>> of minutes after boot, with the device having been idle since boot.
>>>>>>>>>>>> The cases tested are:
>>>>>>>>>>>> cpuidle_6_6.txt = 6.6 kernel
>>>>>>>>>>>> cpuidle_6_12.txt = 6.6 kernel with teo commits up to 6.12
>>>>>>>>>>>> cpuidle_mainline.txt = 6.6 kernel with teo commits up to mainline
>>>>>>>>>>>>
>>>>>>>>>>>> Summary of the percentage of time spent in each state (averaged across CPUs):
>>>>>>>>>>>>
>>>>>>>>>>>> |            |   6.6 |  6.12 | mainline |
>>>>>>>>>>>> |------------|------:|------:|---------:|
>>>>>>>>>>>> | WFI        |  6.02 | 99.94 |    56.84 |
>>>>>>>>>>>> | cpuoff     | 11.02 |     0 |     0.65 |
>>>>>>>>>>>> | clusteroff | 82.96 |  0.05 |    42.51 |
>>>>>>>>>>>> | s2idle     |     0 |     0 |        0 |
>>>>>>>>>>>>
>>>>>>>>>>>> Any help would be much appreciated. Let me know if there's any other
>>>>>>>>>>>> debugging information I should provide.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That's not good.
>>>>>>>>>>> If the system is mostly idle (only boot activity, but the dumps are taken
>>>>>>>>>>> after ~3 mins?), what is causing the wakeups? Even in 6.6 there are
>>>>>>>>>>> definitely more than I would've expected.
>>>>>>>>>>> I noticed that clusteroff and cpuoff have equal residency, which is
>>>>>>>>>>> obviously a bit awkward for cpuidle, but shouldn't be relevant to your issue.
>>>>>>>>>>>
>>>>>>>>>>> I'm a bit puzzled by your bisect results.
>>>>>>>>>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
>>>>>>>>>>> made the intercept logic *less* prone to count (false) intercepts, yet it
>>>>>>>>>>> seems to count more of them (resulting in more WFI)?
>>>>>>>>>>> I'll think about it some more; for now of course a trace would be very
>>>>>>>>>>> helpful (cpuidle events, ipi_raise, irqs?).
>>>>>>>>>>> Are there ever any latency constraints set?
>>>>>>>>>>>
>>>>>>>>>>> FWIW the mainline results look the most reasonable, from a 30,000-foot
>>>>>>>>>>> view anyway:
>>>>>>>>>>> Cluster  State         above  below  usage  above%  below%
>>>>>>>>>>> LITTLE   cpuoff-l      ~75    ~65    ~140   23%     20%
>>>>>>>>>>> LITTLE   clusteroff-l  ~800   0      ~100   89%     0%
>>>>>>>>>>> MID      cpuoff-m      ~3–4   ~15    ~20    15%     55%
>>>>>>>>>>> MID      clusteroff-m  ~1300  0      ~4000  24%     0%
>>>>>>>>>>> BIG      cpuoff-b      0      1      1      —       —
>>>>>>>>>>> BIG      clusteroff-b  ~800   0      ~1900  30%     0%
>>>>>>>>>>>
>>>>>>>>>>> (WFI seems mostly the correct choice for the little CPUs; that's fine, the
>>>>>>>>>>> energy savings compared to cpuoff should be marginal anyway.)
>>>>>>>>>>>
>>>>>>>>>>> Do you mind trying:
>>>>>>>>>>> 13ed5c4a6d9c cpuidle: teo: Skip getting the sleep length if wakeups are very frequent
>>>>>>>>>>> on 6.12?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So just thinking out loud, the only case I can actually think of to explain your
>>>>>>>>>> bisect to 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts")
>>>>>>>>>> is that the workload essentially changed dramatically because of our calls
>>>>>>>>>> to tick_nohz_get_sleep_length() now.
>>>>>>>>>> I'm not sure how likely I think that is, but I'm lacking imagination for another
>>>>>>>>>> cause. That's why results with
>>>>>>>>>> 13ed5c4a6d9c ("cpuidle: teo: Skip getting the sleep length if wakeups are very frequent")
>>>>>>>>>> would be interesting.
>>>>>>>>>
>>>>>>>>> My current theory is that this issue is related to the
>>>>>>>>> tick_nohz_get_sleep_length() overhead and the way "intercepts" are
>>>>>>>>> distinguished from "hits" in teo.
>>>>>>>>>
>>>>>>>>> Namely, teo assumes that its own overhead is negligible and so it
>>>>>>>>> counts a given event as an "intercept" if the measured time spent in
>>>>>>>>> the idle state (with the exit latency roughly taken into account)
>>>>>>>>> falls into a different "state bin" than the sleep length (the expected
>>>>>>>>> time till the next timer). However, the sleep length is computed as the
>>>>>>>>> difference between the upcoming timer wakeup event time and
>>>>>>>>> ts->idle_entrytime, so it actually includes the time taken by
>>>>>>>>> tick_nohz_next_event(). If the latter is significant, it may
>>>>>>>>> contribute to the difference seen by teo_update() and cause extra
>>>>>>>>> "intercepts" to appear.
>>>>>>>>
>>>>>>>> Right, additionally with psci pc-mode and the exposed clusteroff states we end
>>>>>>>> up vastly exaggerating the wakeup latency (i.e. underestimating the actual idle
>>>>>>>> time) for three reasons:
>>>>>>>> - Wakeup latency = entry + exit latency (worst case: we pay the full latencies
>>>>>>>>   on both, even though in most cases we don't incur the entry latency).
>>>>>>>> - The wakeup latency is a worst case and is often more like 2x-3x the average.
>>>>>>>> - We use the (higher) clusteroff values even though the clusteroff state couldn't
>>>>>>>>   possibly have been entered, as not the entire cluster is idle.
>>>>>>>>
>>>>>>>> Nonetheless all of these only make counting an intercept significantly more
>>>>>>>> likely, while the results show not a single state >0 entered => the intercept
>>>>>>>> logic probably triggers on every cpuidle entry.
>>>>>>>
>>>>>>> It has to for this to happen, if timers are not frequent enough.
>>>>>>>
>>>>>>>> Feels like there should be an issue in the feedback loop.
>>>>>>>
>>>>>>> I'm wondering what the issue could be though. The change in commit
>>>>>>> 4b20b07ce72f only affects the cases when idle state 0 is about to be
>>>>>>> selected, and it only really changes the sleep length value from
>>>>>>> KTIME_MAX to something more realistic (but it still may be KTIME_MAX).
>>>>>>>
>>>>>>> It may turn an "intercept" into a "hit", but only if the CPU is not
>>>>>>> woken up by the tick, because those cases had already been counted as
>>>>>>> "hits" before commit 4b20b07ce72f.
>>>>>>>
>>>>>>> Now, if the majority of wakeups in the workload are tick wakeups, the
>>>>>>> only real difference appears to be the presence of
>>>>>>> tick_nohz_get_sleep_length() in that code path.
>>>>>>>
>>>>>>> Frankly, I would try to remove the update of cpu_data->sleep_length_ns
>>>>>>> right before the "goto out_tick" statement (in 6.12 that should be
>>>>>>> line 426) and see what happens.
>>>>>>
>>>>>> Just tried this quickly. Results attached. It goes back to behaving
>>>>>> the same as 6.6 - about 2% WFI.
>>>>>
>>>>> Thanks for checking this! It means that the
>>>>> tick_nohz_get_sleep_length() overhead doesn't matter here that much.
>>>>>
>>>>> Instead of making the change above, can you please try the 6.12
>>>>> equivalent of the attached patch?
>>>>>
>>>>> Or alternatively, apply this one to mainline and see if it changes
>>>>> the idle state selection proportions?
>>>
>>> It doesn't seem to have an effect. Results are attached for both 6.12
>>> (6_12_teo_reflect) and mainline (mainline_teo_reflect).
>>> 6.12 is still 100% WFI and mainline is about 20% WFI on average, the same as before.
>>>
>>>>
>>>> I don't quite follow this.
>>>> While I don't really believe that the tick_nohz_get_sleep_length() overhead
>>>> plays a role here, how does removing that assignment prove it isn't?
>>>>
>>>> The below (if that's what you meant) might lead to the overhead being optimized
>>>> out? [1]
>>>
>>> Oh true. I tried the diff below instead (which I think should avoid
>>> that?). This also behaves the same as 6.6. Results attached as
>>> 6_12_sleep_length2.
>>
>> Thanks for testing.
>>
>>>
>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
>>> index 173ddcac540a..3c9595bc6c80 100644
>>> --- a/drivers/cpuidle/governors/teo.c
>>> +++ b/drivers/cpuidle/governors/teo.c
>>> @@ -427,8 +427,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
>>>  		 * We have to query the sleep length here otherwise we don't
>>>  		 * know after wakeup if our guess was correct.
>>>  		 */
>>> -		duration_ns = tick_nohz_get_sleep_length(&delta_tick);
>>> -		cpu_data->sleep_length_ns = duration_ns;
>>> +		volatile s64 __maybe_unused duration_volatile_ns =
>>> +			tick_nohz_get_sleep_length(&delta_tick);
>>>  		goto out_tick;
>>>  	}
>>>
>>>> I'd be curious if [2] behaves like 6.12. So far I haven't been able to
>>>> reproduce the issue Reka is seeing.
>>>
>>> Results with [2] attached as 6_12_sleep_length3. Yes, it behaves like 6.12.
>>>
>>
>> Hmm, interesting.
>>
>>>> There's one oddity that immediately came to mind: state1 and state2 having
>>>> the same residency (slightly different latency though), but when I set these
>>>> up as such, teo works fine. So I don't think it is an issue here.
>>>>
>>>> If Rafael's patch fixes the issue I'd still be curious what the predicted sleep
>>>> length values are here (they must be <1ms, but do in fact never trigger?).
>>>
>>> In case it's helpful, I've attached a trace (trace_duration.dat) which
>>> includes a trace_printk of the duration_ns at this point (on top of
>>> unmodified 6.12).
>>
>> I was going to ask for this anyway, but it's not that useful (duration_ns is some
>> time in the far future; most wakeups are obviously ticks).
>> May I ask you to run the following hopefully one last time, just to check if we're
>> not missing something here:
>
> Trace attached.
>

Thank you! So the intercept bin counts themselves aren't the issue:

<idle>-0  [007] d..1.  1097.218324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=62615312 last_residency_ns=997307 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000
<idle>-0  [007] d..1.  1097.218324: cpu_idle: state=0 cpu_id=7
<idle>-0  [007] d..1.  1097.219321: cpu_idle: state=4294967295 cpu_id=7
<idle>-0  [007] d.h1.  1097.219322: irq_handler_entry: irq=19 name=arch_timer
<idle>-0  [007] d.h1.  1097.219323: irq_handler_exit: irq=19 ret=handled
<idle>-0  [007] d..1.  1097.219324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=61615389 last_residency_ns=997538 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000
<idle>-0  [007] d..1.  1097.219324: cpu_idle: state=0 cpu_id=7
<idle>-0  [007] d..1.  1097.220321: cpu_idle: state=4294967295 cpu_id=7
<idle>-0  [007] d.h1.  1097.220322: irq_handler_entry: irq=19 name=arch_timer
<idle>-0  [007] d.h1.  1097.220323: irq_handler_exit: irq=19 ret=handled
<idle>-0  [007] d..1.  1097.220324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=60615312 last_residency_ns=997538 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000
<idle>-0  [007] d..1.  1097.220324: cpu_idle: state=0 cpu_id=7
<idle>-0  [007] d..1.  1097.221321: cpu_idle: state=4294967295 cpu_id=7
<idle>-0  [007] d.h1.  1097.221322: irq_handler_entry: irq=19 name=arch_timer
<idle>-0  [007] d.h1.  1097.221323: irq_handler_exit: irq=19 ret=handled
<idle>-0  [007] d..1.  1097.221324: bprint: teo_update: teo_update cpu=7 last_state_idx=0 state_count=4 total=7182 sleep_len_ns=59615312 last_residency_ns=997538 measured_ns=18446744073709551615 lat_ns=1000 idx_timer=2 idx_duration=2 tick_hits=0 | bins: [0]h=7,i=7,tr=1000 [1]h=0,i=0,tr=2580000 [2]h=8192,i=0,tr=2580000 [3]h=0,i=0,tr=4294967295000

Line 386:
    if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum)
will be 2*7 > 8206 - (7 + 8192) => True

The intercept state selection will choose a state that covers at least half of the
intercepts, but only bin 0 has intercepts at all => state0 selected.

I see two issues:
1) Because DECAY_SHIFT is 3, values < 8 cannot decay (I guess this wouldn't really
be an issue without 2)).
2) if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) isn't an appropriate
check; it will exclude the state even if its idx_hit_sum makes up the vast majority
of cpu_data->total (i.e. it would actually have been a really good candidate).

I lightly tested the below; it seems to be at least comparable to mainline teo.
(The documentation/comments would need adapting too, of course.)

-----8<-----
diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index bfa55c1eab5b..f8f76e3b8364 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -355,7 +355,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 	 * all of the deeper states, a shallower idle state is likely to be a
 	 * better choice.
 	 */
-	if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) {
+	if (2 * idx_intercept_sum > idx_hit_sum) {
 		int first_suitable_idx = idx;
 
 		/*
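
To make 1) and 2) concrete, below is a small userspace-only sketch (an illustration,
not kernel code) that plugs the bin values from the trace above into the decay step
and into the old and the proposed check. It assumes the decay step works like
count -= count >> DECAY_SHIFT with DECAY_SHIFT == 3, as described in 1), and uses
idx_intercept_sum = 7, idx_hit_sum = 7 + 8192 and total = 8206 from the inequality
above. With these numbers the seven stale intercepts in bin 0 never decay away, the
old check keeps forcing the shallow state, and the proposed one does not:

#include <stdio.h>

#define DECAY_SHIFT	3	/* per 1) above */

int main(void)
{
	/* Bin values taken from the teo_update trace lines above. */
	unsigned long bin0_intercepts = 7;
	unsigned long idx_intercept_sum = 7;
	unsigned long idx_hit_sum = 7 + 8192;
	unsigned long total = 8206;
	int i;

	/* Issue 1): counts below 8 never decay because 7 >> 3 == 0. */
	for (i = 1; i <= 5; i++) {
		bin0_intercepts -= bin0_intercepts >> DECAY_SHIFT;
		printf("after decay %d: bin0 intercepts = %lu\n", i, bin0_intercepts);
	}

	/* Issue 2): old check vs. the proposed one for the deepest candidate. */
	printf("old check: 2 * %lu > %lu - %lu -> %s (forces a shallower state)\n",
	       idx_intercept_sum, total, idx_hit_sum,
	       2 * idx_intercept_sum > total - idx_hit_sum ? "true" : "false");
	printf("new check: 2 * %lu > %lu -> %s\n",
	       idx_intercept_sum, idx_hit_sum,
	       2 * idx_intercept_sum > idx_hit_sum ? "true" : "false");

	return 0;
}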