From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 0D1962222C4 for ; Fri, 14 Nov 2025 09:20:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763112046; cv=none; b=s/vB/X7N1j7DV35bYWE6e/niUnosulOvPuWfqYx//rnCTN3hDukLOkgbACS904FdpZuOqnfvGQ6pZbxJPTv6+1lOsahX8Kw3pHNLPpw1BY8+wSKlH7xz4BvXRadcIk391gn4CW/9rQ0S07O2WGJpLYKXLfAAxOfSsB8c9f9BUao= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763112046; c=relaxed/simple; bh=kOIsSHtdCwDJGbbxZzMvtgo/nNRJnVAxST1d2gum57I=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=sGJm1PZsjjIO3L3QYL5/xTRs9KfB54sBke3ELr0F7d30KpZoOEfBJYxr4Ry4EwHbf3x0ZKt/pIU26RUveJRI1WNHc1+gEmwIg0e2Sbw0BOx/aO0kWW5A4d6v63v5coeEIZlAbFjQ8dLCyRC4inbvOaTxdSaOBdolY1dkR6+Lycw= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 693361063; Fri, 14 Nov 2025 01:20:35 -0800 (PST) Received: from [10.57.41.249] (unknown [10.57.41.249]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 0C2733F66E; Fri, 14 Nov 2025 01:20:41 -0800 (PST) Message-ID: <1c9672cb-d167-4f7d-884a-7e9a2af2dadf@arm.com> Date: Fri, 14 Nov 2025 09:20:40 +0000 Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v1] cpuidle: governors: teo: Rework the handling of tick wakeups To: Reka Norman , "Rafael J. Wysocki" Cc: daniel.lezcano@linaro.org, linux-pm@vger.kernel.org References: <00928b9d-7189-4929-afc9-7684fc5ef531@arm.com> <6228387.lOV4Wx5bFT@rafael.j.wysocki> Content-Language: en-US From: Christian Loehle In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 11/14/25 04:05, Reka Norman wrote: > On Fri, Nov 14, 2025 at 3:56 AM Rafael J. Wysocki wrote: >> >> On Thursday, November 13, 2025 4:43:18 PM CET Christian Loehle wrote: >>> On 11/12/25 18:33, Christian Loehle wrote: >>>> On 11/12/25 14:16, Rafael J. Wysocki wrote: >>>>> On Wed, Nov 12, 2025 at 3:03 PM Christian Loehle >>>>> wrote: >>>>>> >>>>>> On 11/12/25 13:32, Rafael J. Wysocki wrote: >>>>>>> On Tue, Nov 11, 2025 at 6:20 PM Christian Loehle >>>>>>> wrote: >>>>>>>> >>>>>>>> On 11/11/25 11:48, Rafael J. Wysocki wrote: >>>>>>>>> On Tue, Nov 11, 2025 at 11:48 AM Christian Loehle >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> On 11/11/25 10:00, Christian Loehle wrote: >>>>>>> >>>>>>> [...] >>>>>>> >>>>>>>>>>> I see two issues: >>>>>>>>>>> 1) Because of DECAY_SHIFT 3 values < 8 cannot decay (I guess this wouldn't really be an issue without 2)) >>>>>>>>> >>>>>>>>> This shouldn't be a problem. >>>>>>>> >>>>>>>> Agreed, it should be a non-issue. Nonetheless if this wasn't the case $subject would've likely >>>>>>>> never been an issue. >>>>>>> >>>>>>> Well, I think that the leftovers can be cleared when they become less than 8. >>>>>>> >>>>>>>>> >>>>>>>>>>> 2) if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) isn't an appropriate check, it will >>>>>>>>>>> exclude the state if it its idx_hit_sum make up the vast majority of cpu_data->total (i.e. it would >>>>>>>>>>> have been a really good candidate actually). >>>>>>>>> >>>>>>>>> Well, it would exclude the state if the sum of hits for the states >>>>>>>>> below it is large enough. This is questionable (because why would >>>>>>>>> hits matter here), but I attempted to make the change below and >>>>>>>>> somebody reported a regression IIRC. >>>>>>>>> >>>>>>>>> This check is related to the problem at hand though (see below). >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I lightly tested the below, it seems to be at least comparable to mainline teo. >>>>>>>>>>> (the documentation/comments would need adapting too, of course) >>>>>>>>>>> >>>>>>>>>>> -----8<----- >>>>>>>>>>> >>>>>>>>>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>>>>>>>>>> index bfa55c1eab5b..f8f76e3b8364 100644 >>>>>>>>>>> --- a/drivers/cpuidle/governors/teo.c >>>>>>>>>>> +++ b/drivers/cpuidle/governors/teo.c >>>>>>>>>>> @@ -355,7 +355,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, >>>>>>>>>>> * all of the deeper states, a shallower idle state is likely to be a >>>>>>>>>>> * better choice. >>>>>>>>>>> */ >>>>>>>>>>> - if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) { >>>>>>>>>>> + if (2 * idx_intercept_sum > idx_hit_sum) { >>>>>>>>>>> int first_suitable_idx = idx; >>>>>>>>>>> >>>>>>>>>>> /* >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ... nevermind the patch, idx_hit_sum is of course the sum of 0...idx-1. >>>>>>>>>> Maybe something like this, again lightly tested: >>>>>>>>>> >>>>>>>>>> -----8<----- >>>>>>>>>> >>>>>>>>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>>>>>>>>> index 173ddcac540a..6bfb9cedb75e 100644 >>>>>>>>>> --- a/drivers/cpuidle/governors/teo.c >>>>>>>>>> +++ b/drivers/cpuidle/governors/teo.c >>>>>>>>>> @@ -383,13 +395,15 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, >>>>>>>>>> * has been stopped already into account. >>>>>>>>>> */ >>>>>>>>>> intercept_sum = 0; >>>>>>>>>> + hit_sum = 0; >>>>>>>>>> >>>>>>>>>> for (i = idx - 1; i >= 0; i--) { >>>>>>>>>> struct teo_bin *bin = &cpu_data->state_bins[i]; >>>>>>>>>> >>>>>>>>>> intercept_sum += bin->intercepts; >>>>>>>>>> + hit_sum += bin->hits; >>>>>>>>>> >>>>>>>>>> - if (2 * intercept_sum > idx_intercept_sum) { >>>>>>>>>> + if (2 * intercept_sum > cpu_data->total || 2 * hit_sum > cpu_data->total) { >>>>>>>>>> /* >>>>>>>>>> * Use the current state unless it is too >>>>>>>>>> * shallow or disabled, in which case take the >>>>>>>>> >>>>>>>>> This will only matter after the deepest state has been rejected >>>>>>>>> already and on the system in question this means selecting state 0 no >>>>>>>>> matter what. >>>>>>>>> >>>>>>>> >>>>>>>> Ah, right! >>>>>>>> >>>>>>>> >>>>>>>>> The pre-6.12 behavior can be explained if tick wakeups are taken into account. >>>>>>>>> >>>>>>>>> Namely, when state 0 is chosen (because of the check mentioned above), >>>>>>>>> the tick is not stopped and the sleep length is KTIME_MAX. If the >>>>>>>>> subsequent wakeup is a tick one, it will be counted as a hit on the >>>>>>>>> deepest state (and it will contribute to the total sum in the check >>>>>>>>> mentioned above). Then, at one point, cpu_data->total will be large >>>>>>>>> enough and the deepest state will become the candidate one. If >>>>>>>>> tick_nohz_get_sleep_length() returns a large value at that point, the >>>>>>>>> tick will be stopped and the deepest state will be entered. Nirvana >>>>>>>>> ensues. >>>>>>>> >>>>>>>> So fundamentally we will have to count tick-wakeups as a) nothing, which >>>>>>>> doesn't allow us to ever break out of the intercept logic that caused us >>>>>>>> to leave the tick on b) intercepts, which is bonkers and doesn't allow us >>>>>>>> to ever break out and c) hits == sleep_length would've been accurate. >>>>>>>> Of course counting a tick wakeup as a hit for sleep_length negates the >>>>>>>> intercept logic. >>>>>>> >>>>>>> Not quite. The intercept logic is there for wakeups other than tick >>>>>>> wakeups and timer wakeups. >>>>>>> >>>>>>> I actually think that tick wakeups can be counted as hits on the >>>>>>> deepest available state - maybe only when tick wakeups dominate the >>>>>>> wakeup pattern - but generally this is not unreasonable: When the >>>>>>> wakeup pattern is dominated by tick wakeups, this by itself is a good >>>>>>> enough reason to stop the tick. >>>>>> >>>>>> (assuming HZ=1000 below but it doesn't matter) >>>>>> That will exclude any 'intercept' logic from having much effect if the >>>>>> avg idle duration is >TICK_NSEC/2, which is potentially still quite a bit >>>>>> off from state1 residency, like in Reka's case here. >>>>>> That's why I thought it would cause unreasonable regressions here. >>>>>> I'll give it a go as well though! >>>>> >>>>> Thanks! >>>>> >>>>> Note that I'd prefer to add a check if tick wakeups dominate the >>>>> wakeup pattern before setting sleep_length_ns to KTIME_MAX though. >>>>> I'd first like to know how the Reka's system reacts to the more >>>>> drastic variant of this change. >>>> >>>> Below are my usual tests, it's definitely visible but the impact is limited >>>> on this platform anyway. I think if we gate the KTIME_MAX setting behind >>>> the "tick wakeup dominate" it should be acceptable! >>>> Let's see what Reka reports. >>>> >>> Forgot to post the full results, anyway as expected with mtdblock (a very slow >>> / low frequent wakeup scenario) the impact becomes clearly visible. >>> Still hopeful that the more conservative approach will be acceptable! >> >> Speaking of which, the patch to test is appended below, but it doesn't apply >> directly on top of the mainline. It is based on some other patches that have >> been posted recently, so here's a git branch with all of the requisite >> material: >> >> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git cpuidle-teo-testing >> >> Reka, please try this one and let us know how it goes. > > Results attached. The residencies are a bit less deep than before - > about 4.5% in WFI vs 2% at 6.6 or with the more aggressive patch. But > I’m guessing that’s expected. > > I also measured the power on a slightly different system where I first > noticed this regression, and it’s indistinguishable from 6.6. So from > my side this looks great, thank you! Could you share details on the other platform? I wasn't able to reproduce your exact behaviour on my systems, so I'd still like to close my testing gap here!