From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id AF72B2FE071 for ; Mon, 10 Nov 2025 12:06:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762776417; cv=none; b=KoKCEnix7QWXK6NH/DoWBvAgNpDSit8uL2EYX+cLLoNpDeaFVFbT1mIJX6yvJWKcarLyWrNMDMHcMEVlWO+Bn74jE0HByhcI81U0irhPPQzajC9GjF4KORI2CQZzuSo0gPg/CyAurUdnA8YCdWaFEEzB81trq2EeOHUy4x8SkQQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762776417; c=relaxed/simple; bh=hJDgXZ4BHt2bT688+VdsOlX+wwSbDJlt45xwSEqnRHk=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=M4TBtp4t9r3sJiUl+eRzCLICM2x4WfmacCyo3UP+MEHoflMdPt29CkOmVsfYF4dVl3tZXU/LJ4fkKXN3Gu5rHAvjd+emO8ZnRuZpsMf+bDfz0yfPy6e4NId2uYWR1UaBH7Ypz2rD787qSxjsKF8EFaW8NIQyRYysvunZEk9+Ji4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0AB69497; Mon, 10 Nov 2025 04:06:46 -0800 (PST) Received: from [10.1.34.30] (e127648.arm.com [10.1.34.30]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 868A63F66E; Mon, 10 Nov 2025 04:06:52 -0800 (PST) Message-ID: <0c018867-c092-4f8e-8f7a-32bb02de3ad5@arm.com> Date: Mon, 10 Nov 2025 12:06:50 +0000 Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Regression in TEO cpuidle governor between 6.6 and 6.12 To: Reka Norman Cc: "Rafael J. Wysocki" , daniel.lezcano@linaro.org, linux-pm@vger.kernel.org References: Content-Language: en-US From: Christian Loehle In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 11/10/25 06:10, Reka Norman wrote: > On Mon, Nov 10, 2025 at 7:35 AM Christian Loehle > wrote: >> >> On 11/7/25 11:35, Rafael J. Wysocki wrote: >>> On Fri, Nov 7, 2025 at 4:28 AM Reka Norman wrote: >>>> >>>> On Fri, Nov 7, 2025 at 7:33 AM Rafael J. Wysocki wrote: >>>>> >>>>> On Thu, Nov 6, 2025 at 12:13 PM Christian Loehle >>>>> wrote: >>>>>> >>>>>> On 11/5/25 20:48, Rafael J. Wysocki wrote: >>>>>>> On Wed, Nov 5, 2025 at 12:24 AM Christian Loehle >>>>>>> wrote: >>>>>>>> >>>>>>>> On 11/4/25 09:03, Christian Loehle wrote: >>>>>>>>> On 11/4/25 03:36, Reka Norman wrote: >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I’m seeing a regression in the TEO governor between 6.6 and 6.12. At >>>>>>>>>> 6.12, when the system is idle it’s spending almost 100% of time in >>>>>>>>>> WFI, compared to about 6% at 6.6. At mainline it has improved compared >>>>>>>>>> to 6.12 but is still a lot worse than 6.6, spending about 50% in WFI. >>>>>>>>>> >>>>>>>>>> The system is a ChromeOS device with Mediatek MT8196. >>>>>>>>>> >>>>>>>>>> Bisecting showed the specific commit which caused the regression is: >>>>>>>>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>>>>>>>> >>>>>>>>>> I’ve attached sysfs dumps showing the issue. All were taken a couple >>>>>>>>>> of minutes after boot, with the device having been idle since boot. >>>>>>>>>> The cases tested are: >>>>>>>>>> cpuidle_6_6.txt = 6.6 kernel >>>>>>>>>> cpuidle_6_12.txt = 6.6 kernel with teo commits up to 6.12 >>>>>>>>>> cpuidle_mainline.txt = 6.6 kernel with teo commits up to mainline >>>>>>>>>> >>>>>>>>>> Summary of the percentage time spent in each state (averaged across CPUs): >>>>>>>>>> >>>>>>>>>> | | 6.6 | 6.12 | mainline | >>>>>>>>>> |------------|------:|------:|---------:| >>>>>>>>>> | WFI | 6.02 | 99.94 | 56.84 | >>>>>>>>>> | cpuoff | 11.02 | 0 | 0.65 | >>>>>>>>>> | clusteroff | 82.96 | 0.05 | 42.51 | >>>>>>>>>> | s2idle | 0 | 0 | 0 | >>>>>>>>>> >>>>>>>>>> Any help would be much appreciated. Let me know if there's any other >>>>>>>>>> debugging information I should provide. >>>>>>>>>> >>>>>>>>> >>>>>>>>> That's not good. >>>>>>>>> If the system is mostly idle (only boot activity but dumps are taken after >>>>>>>>> ~3mins?), what is causing the wakeups? Even in 6.6 There are definitely more >>>>>>>>> than I would've expected? >>>>>>>>> I noticed that clusteroff and cpuoff have equal residency, which is >>>>>>>>> obviously a bit awkward for cpuidle, but shouldn't be relevant to your issue. >>>>>>>>> >>>>>>>>> I'm a bit puzzled by your bisect results. >>>>>>>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>>>>>>> made the intercept logic *less* prone to count (false) intercepts, yet it >>>>>>>>> seems to count more of them? (resulting in more WFI). >>>>>>>>> I'll think about it some more, for now of course a trace would be very >>>>>>>>> helpful. (cpuidle events, ipi_raise, irqs?) >>>>>>>>> Are there ever any latency constraints set? >>>>>>>>> >>>>>>>>> FWIW the mainline results look the most reasonable, from a 30000 feet view >>>>>>>>> anyway: >>>>>>>>> Cluster State above below usage above% below% >>>>>>>>> LITTLE cpuoff-l ~75 ~65 ~140 23% 20% >>>>>>>>> LITTLE clusteroff-l ~800 0 ~100 89% 0% >>>>>>>>> MID cpuoff-m ~3–4 ~15 ~20 15% 55% >>>>>>>>> MID clusteroff-m ~1300 0 ~4000 24% 0% >>>>>>>>> BIG cpuoff-b 0 1 1 — — >>>>>>>>> BIG clusteroff-b ~800 0 ~1900 30% 0% >>>>>>>>> >>>>>>>>> (WFI seems mostly the correct choice for little CPUs, that's fine, the energy >>>>>>>>> savings compared to cpuoff should be marginal anyway.) >>>>>>>>> >>>>>>>>> Do you mind trying: >>>>>>>>> 13ed5c4a6d9c cpuidle: teo: Skip getting the sleep length if wakeups are very frequent >>>>>>>>> on 6.12? >>>>>>>>> >>>>>>>> >>>>>>>> So just thinking out loud, the only case I can actually thing of to explain your >>>>>>>> bisect to 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>>>>>> is that the workload essentially changed dramatically because of our calls >>>>>>>> to tick_nohz_get_sleep_length() now. >>>>>>>> I'm not sure how likely I think that is, but I'm lacking imagination for another >>>>>>>> cause. That's why results with >>>>>>>> 13ed5c4a6d9c ("cpuidle: teo: Skip getting the sleep length if wakeups are very frequent") >>>>>>>> would be interesting. >>>>>>> >>>>>>> My current theory is that this issue is related to the >>>>>>> tick_nohz_get_sleep_length() overhead and the way "intercepts" are >>>>>>> distinguished from "hits" in teo. >>>>>>> >>>>>>> Namely, teo assumes that its own overhead is negligible and so it >>>>>>> counts a given event as an "intercept" if the measured time spent in >>>>>>> the idle state (with the exit latency roughly taken into account) >>>>>>> falls into a different "state bin" than the sleep length (the expected >>>>>>> time till the next timer). However, the sleep length is computed as a >>>>>>> difference between the upcoming timer wakeup event time and >>>>>>> ts->idle_entrytime, so it actually includes the time taken by >>>>>>> tick_nohz_next_event(). If the latter is significant, it may >>>>>>> contribute to the difference seen by teo_update() and cause extra >>>>>>> "intercepts" to appear. >>>>>> >>>>>> Right, additionally with psci pc-mode and the exposed clusteroff states we end >>>>>> up vastly exaggerating the wakeup latency (i.e. underestimating the actual idle time) >>>>>> for three reasons: >>>>>> - wakeup latency = entry+exit latency (worst case: pay full latencies on both >>>>>> even though for most cases we don't incur the entry latency) >>>>>> - Wakeup latency is a worst-case and often is more like 2x-3x of the average. >>>>>> - We use the (higher) clusteroff values even though the clusteroff state couldn't >>>>>> possibly have been entered as not the entire cluster is idle. >>>>>> >>>>>> Nonetheless these are all just a "intercept counting is significantly more likely" >>>>>> while the results show not a single state >0 entered => the intercept logic >>>>>> probably triggers every cpuidle entry. >>>>> >>>>> It has to for this to happen, if timers are not frequent enough. >>>>> >>>>>> Feels like there should be an issue in the feedback loop. >>>>> >>>>> I'm wondering what the issue could be though. The change in commit >>>>> 4b20b07ce72f only affects the cases when idle state 0 is about to be >>>>> selected and it only really changes the sleep length value from >>>>> KTIME_MAX to something more realistic (but it still may be KTIME_MAX). >>>>> >>>>> It may turn an "intercept" into a "hit", but only if the CPU is not >>>>> woken up by the tick because those cases had been already counted as >>>>> "hits" before commit 4b20b07ce72f. >>>>> >>>>> Now, if the majority of wakeups in the workload are tick wakeups, the >>>>> only real difference appears to be the presence of >>>>> tick_nohz_get_sleep_length() in that code path. >>>>> >>>>> Frankly, I would try to remove the update of cpu_data->sleep_length_ns >>>>> right before the "goto out_tick" statement (in 6.12 that should be >>>>> line 426) and see what happens. >>>> >>>> Just tried this quickly. Results attached. It goes back to behaving >>>> the same as 6.6 - about 2% WFI. >>> >>> Thanks for checking this! It means that the >>> tick_nohz_get_sleep_length() overhead doesn't matter here that much. >>> >>> Instead of making the change above, can you please try the 6.12 >>> equivalent of the attached patch? >>> >>> Or alternatively, apply this one to the mainline and see if it changes >>> the idle states selection proportions? > > It doesn't seem to have an effect. Results are attached for both 6.12 > (6_12_teo_reflect) and mainline (mainline_teo_reflect). 6.12 is still > 100% WFI and mainline is about 20% WFI on average, the same as before. > >> >> I don't quite follow this. >> While I don't really believe that the tick_nohz_get_sleep_length() overhead >> plays a role here, how does removing that assignment prove it isn't? >> >> The below (if that's what you meant) might lead to the overhead being optimized >> out? [1] > > Oh true. I tried the diff below instead (which I think should avoid > that?). This also behaves the same as 6.6. Results attached as > 6_12_sleep_length2. Thanks for testing > > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c > index 173ddcac540a..3c9595bc6c80 100644 > --- a/drivers/cpuidle/governors/teo.c > +++ b/drivers/cpuidle/governors/teo.c > @@ -427,8 +427,7 @@ static int teo_select(struct cpuidle_driver *drv, > struct cpuidle_device *dev, > * We have to query the sleep length here otherwise we don't > * know after wakeup if our guess was correct. > */ > - duration_ns = tick_nohz_get_sleep_length(&delta_tick); > - cpu_data->sleep_length_ns = duration_ns; > + volatile s64 __maybe_unused duration_volatile_ns = > tick_nohz_get_sleep_length(&delta_tick); > goto out_tick; > } > >> I'd be curious if [2] behaves like 6.12. So far I haven't been able to >> reproduce the issue Reka is seeing. > > Results with [2] attached as 6_12_sleep_length3. Yes, it behaves like 6.12. > Hmm interesting. >> There's one oddity that immediately came to mind: state1 and state2 having >> the same residency (slightly different latency though), but when I set these >> as such teo works fine. So I don't think it is an issue here. >> >> If Rafael's patch fixes the issue I'd still be curious what the predicted sleep >> length values are here (they must be <1ms, but do in fact never trigger?). > > In case it's helpful, I've attached a trace (trace_duration.dat) which > includes a trace_printk of the duration_ns at this point (on top of > unmodified 6.12). While I was gonna ask for this anyway, it's not that useful (duration_ns is some time in the far future, most wakeups are obviously ticks). May I ask you to run the following hopefully one last time, just to check if we're not missing something here: -----8<----- diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c index 173ddcac540a..f4cd648133e7 100644 --- a/drivers/cpuidle/governors/teo.c +++ b/drivers/cpuidle/governors/teo.c @@ -237,6 +237,28 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev) else cpu_data->state_bins[idx_duration].intercepts += PULSE; + if (dev->cpu == 7) { + char buf[1024]; + size_t off; + + off = scnprintf(buf, sizeof(buf), + "cpu=%d last_state_idx=%d state_count=%u " + "total=%u sleep_len_ns=%llu last_residency_ns=%llu measured_ns=%llu " + "lat_ns=%llu idx_timer=%d idx_duration=%d tick_hits=%u | bins:", + dev->cpu, dev->last_state_idx, drv->state_count, cpu_data->total, + cpu_data->sleep_length_ns, dev->last_residency_ns, measured_ns, + drv->states[dev->last_state_idx].exit_latency_ns, + idx_timer, idx_duration, cpu_data->tick_hits); + + for (i = 0; i < drv->state_count && off < sizeof(buf); i++) { + struct teo_bin *bin = &cpu_data->state_bins[i]; + + off += scnprintf(buf + off, sizeof(buf) - off, + " [%d]h=%u,i=%u,tr=%lld", i, bin->hits, + bin->intercepts, drv->states[i].target_residency_ns); + } + trace_printk("teo_update %s\n", buf); + } end: cpu_data->total += PULSE; }