From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 8FAEB3074B1 for ; Thu, 6 Nov 2025 11:13:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762427593; cv=none; b=jZu3ETyH/d3POUobH23NvA+tqOiZC5C5ajI3CXTWBfn069YP81Y3iwgMQEdUcBRK9vSUWmgQuE7sjqThvNxTrUAiF7WYI6R642qnLdKInzMCXjSZrwNi7WnAFXIGnTWMtEAp6vMvzB7PxZXLL/P1jNHB8ekp7CLTAgCDrkGpjbk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762427593; c=relaxed/simple; bh=SzCSS56vB8/bzF8mya6d920cNSvBzOQEFAsgjKM1HtQ=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=ZmafC9UamFu3Xg1kiRG5l1bghDSG8LA4lV8OuTi3P+Pyq1zFlxDPBGyrWj1grsaeTiNLvpckDLW98/uLxzGsCd+cj6aqy+jTmXWiZbU+X2TV0sUWtkyRSVSkLUv2++y+BWnxU9JmCvUERdH+7igEMY34dMrR1CCfl1DYZk47XMQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 24ED01596; Thu, 6 Nov 2025 03:13:03 -0800 (PST) Received: from [10.1.28.78] (e127648.arm.com [10.1.28.78]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 254983F63F; Thu, 6 Nov 2025 03:13:10 -0800 (PST) Message-ID: Date: Thu, 6 Nov 2025 11:13:07 +0000 Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Regression in TEO cpuidle governor between 6.6 and 6.12 To: "Rafael J. Wysocki" Cc: Reka Norman , daniel.lezcano@linaro.org, linux-pm@vger.kernel.org References: Content-Language: en-US From: Christian Loehle In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 11/5/25 20:48, Rafael J. Wysocki wrote: > On Wed, Nov 5, 2025 at 12:24 AM Christian Loehle > wrote: >> >> On 11/4/25 09:03, Christian Loehle wrote: >>> On 11/4/25 03:36, Reka Norman wrote: >>>> Hi, >>>> >>>> I’m seeing a regression in the TEO governor between 6.6 and 6.12. At >>>> 6.12, when the system is idle it’s spending almost 100% of time in >>>> WFI, compared to about 6% at 6.6. At mainline it has improved compared >>>> to 6.12 but is still a lot worse than 6.6, spending about 50% in WFI. >>>> >>>> The system is a ChromeOS device with Mediatek MT8196. >>>> >>>> Bisecting showed the specific commit which caused the regression is: >>>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>>> >>>> I’ve attached sysfs dumps showing the issue. All were taken a couple >>>> of minutes after boot, with the device having been idle since boot. >>>> The cases tested are: >>>> cpuidle_6_6.txt = 6.6 kernel >>>> cpuidle_6_12.txt = 6.6 kernel with teo commits up to 6.12 >>>> cpuidle_mainline.txt = 6.6 kernel with teo commits up to mainline >>>> >>>> Summary of the percentage time spent in each state (averaged across CPUs): >>>> >>>> | | 6.6 | 6.12 | mainline | >>>> |------------|------:|------:|---------:| >>>> | WFI | 6.02 | 99.94 | 56.84 | >>>> | cpuoff | 11.02 | 0 | 0.65 | >>>> | clusteroff | 82.96 | 0.05 | 42.51 | >>>> | s2idle | 0 | 0 | 0 | >>>> >>>> Any help would be much appreciated. Let me know if there's any other >>>> debugging information I should provide. >>>> >>> >>> That's not good. >>> If the system is mostly idle (only boot activity but dumps are taken after >>> ~3mins?), what is causing the wakeups? Even in 6.6 There are definitely more >>> than I would've expected? >>> I noticed that clusteroff and cpuoff have equal residency, which is >>> obviously a bit awkward for cpuidle, but shouldn't be relevant to your issue. >>> >>> I'm a bit puzzled by your bisect results. >>> 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >>> made the intercept logic *less* prone to count (false) intercepts, yet it >>> seems to count more of them? (resulting in more WFI). >>> I'll think about it some more, for now of course a trace would be very >>> helpful. (cpuidle events, ipi_raise, irqs?) >>> Are there ever any latency constraints set? >>> >>> FWIW the mainline results look the most reasonable, from a 30000 feet view >>> anyway: >>> Cluster State above below usage above% below% >>> LITTLE cpuoff-l ~75 ~65 ~140 23% 20% >>> LITTLE clusteroff-l ~800 0 ~100 89% 0% >>> MID cpuoff-m ~3–4 ~15 ~20 15% 55% >>> MID clusteroff-m ~1300 0 ~4000 24% 0% >>> BIG cpuoff-b 0 1 1 — — >>> BIG clusteroff-b ~800 0 ~1900 30% 0% >>> >>> (WFI seems mostly the correct choice for little CPUs, that's fine, the energy >>> savings compared to cpuoff should be marginal anyway.) >>> >>> Do you mind trying: >>> 13ed5c4a6d9c cpuidle: teo: Skip getting the sleep length if wakeups are very frequent >>> on 6.12? >>> >> >> So just thinking out loud, the only case I can actually thing of to explain your >> bisect to 4b20b07ce72f ("cpuidle: teo: Don't count non-existent intercepts") >> is that the workload essentially changed dramatically because of our calls >> to tick_nohz_get_sleep_length() now. >> I'm not sure how likely I think that is, but I'm lacking imagination for another >> cause. That's why results with >> 13ed5c4a6d9c ("cpuidle: teo: Skip getting the sleep length if wakeups are very frequent") >> would be interesting. > > My current theory is that this issue is related to the > tick_nohz_get_sleep_length() overhead and the way "intercepts" are > distinguished from "hits" in teo. > > Namely, teo assumes that its own overhead is negligible and so it > counts a given event as an "intercept" if the measured time spent in > the idle state (with the exit latency roughly taken into account) > falls into a different "state bin" than the sleep length (the expected > time till the next timer). However, the sleep length is computed as a > difference between the upcoming timer wakeup event time and > ts->idle_entrytime, so it actually includes the time taken by > tick_nohz_next_event(). If the latter is significant, it may > contribute to the difference seen by teo_update() and cause extra > "intercepts" to appear. Right, additionally with psci pc-mode and the exposed clusteroff states we end up vastly exaggerating the wakeup latency (i.e. underestimating the actual idle time) for three reasons: - wakeup latency = entry+exit latency (worst case: pay full latencies on both even though for most cases we don't incur the entry latency) - Wakeup latency is a worst-case and often is more like 2x-3x of the average. - We use the (higher) clusteroff values even though the clusteroff state couldn't possibly have been entered as not the entire cluster is idle. Nonetheless these are all just a "intercept counting is significantly more likely" while the results show not a single state >0 entered => the intercept logic probably triggers every cpuidle entry. Feels like there should be an issue in the feedback loop. Anyway I'll try to reproduce what Reka is seeing.