From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 2855F320CA9 for ; Fri, 14 Nov 2025 14:49:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763131771; cv=none; b=EPtwnKaqU6H6ub9pC/NzULe7nIT2IsbNFXkiGbXxyAX5Kg7znkMfuvDmg00g6c1jbfBmFzS9GrF7r3mQfXZD0BIFEmtJsGoceHta7g7z6InHu035VMiwdff7buXGzfgcRqPtz7a5m//a89pTiFXplpwtkB/55iKzXHGLdKma4tI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763131771; c=relaxed/simple; bh=7PKN+2Fu6WLQzSsXm7FR2PKtaogvZCMB9pbmO5p4Mrg=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=ZVL4wgfQhN4rGI/hG53FJ1NPWSmeo8032oaibBVQzjdq1ir74MNgxHftzYAVbGvN/g1eTqfllh8f+P0IAu0StEt2ivgWmVy/GYfedvg36rs/iZdVCm1nW8RoMKOx8sCSuUbY2NxWA2QfQ5WyDxw0XDcQuCwUlw2NX0/NRsWyR4A= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id CE16C1063; Fri, 14 Nov 2025 06:49:21 -0800 (PST) Received: from [10.1.35.10] (e127648.arm.com [10.1.35.10]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E28963F5A1; Fri, 14 Nov 2025 06:49:27 -0800 (PST) Message-ID: Date: Fri, 14 Nov 2025 14:49:25 +0000 Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v1] cpuidle: governors: teo: Rework the handling of tick wakeups To: "Rafael J. Wysocki" , Reka Norman Cc: daniel.lezcano@linaro.org, linux-pm@vger.kernel.org References: <00928b9d-7189-4929-afc9-7684fc5ef531@arm.com> <6228387.lOV4Wx5bFT@rafael.j.wysocki> <431db236-736d-4fc3-95c2-876bc767aa0c@arm.com> Content-Language: en-US From: Christian Loehle In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 11/14/25 14:15, Rafael J. Wysocki wrote: > On Fri, Nov 14, 2025 at 9:33 AM Christian Loehle > wrote: >> >> On 11/14/25 04:05, Reka Norman wrote: >>> On Fri, Nov 14, 2025 at 3:56 AM Rafael J. Wysocki wrote: >>>> >>>> On Thursday, November 13, 2025 4:43:18 PM CET Christian Loehle wrote: >>>>> On 11/12/25 18:33, Christian Loehle wrote: >>>>>> On 11/12/25 14:16, Rafael J. Wysocki wrote: >>>>>>> On Wed, Nov 12, 2025 at 3:03 PM Christian Loehle >>>>>>> wrote: >>>>>>>> >>>>>>>> On 11/12/25 13:32, Rafael J. Wysocki wrote: >>>>>>>>> On Tue, Nov 11, 2025 at 6:20 PM Christian Loehle >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> On 11/11/25 11:48, Rafael J. Wysocki wrote: >>>>>>>>>>> On Tue, Nov 11, 2025 at 11:48 AM Christian Loehle >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On 11/11/25 10:00, Christian Loehle wrote: >>>>>>>>> >>>>>>>>> [...] >>>>>>>>> >>>>>>>>>>>>> I see two issues: >>>>>>>>>>>>> 1) Because of DECAY_SHIFT 3 values < 8 cannot decay (I guess this wouldn't really be an issue without 2)) >>>>>>>>>>> >>>>>>>>>>> This shouldn't be a problem. >>>>>>>>>> >>>>>>>>>> Agreed, it should be a non-issue. Nonetheless if this wasn't the case $subject would've likely >>>>>>>>>> never been an issue. >>>>>>>>> >>>>>>>>> Well, I think that the leftovers can be cleared when they become less than 8. >>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> 2) if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) isn't an appropriate check, it will >>>>>>>>>>>>> exclude the state if it its idx_hit_sum make up the vast majority of cpu_data->total (i.e. it would >>>>>>>>>>>>> have been a really good candidate actually). >>>>>>>>>>> >>>>>>>>>>> Well, it would exclude the state if the sum of hits for the states >>>>>>>>>>> below it is large enough. This is questionable (because why would >>>>>>>>>>> hits matter here), but I attempted to make the change below and >>>>>>>>>>> somebody reported a regression IIRC. >>>>>>>>>>> >>>>>>>>>>> This check is related to the problem at hand though (see below). >>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I lightly tested the below, it seems to be at least comparable to mainline teo. >>>>>>>>>>>>> (the documentation/comments would need adapting too, of course) >>>>>>>>>>>>> >>>>>>>>>>>>> -----8<----- >>>>>>>>>>>>> >>>>>>>>>>>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>>>>>>>>>>>> index bfa55c1eab5b..f8f76e3b8364 100644 >>>>>>>>>>>>> --- a/drivers/cpuidle/governors/teo.c >>>>>>>>>>>>> +++ b/drivers/cpuidle/governors/teo.c >>>>>>>>>>>>> @@ -355,7 +355,7 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, >>>>>>>>>>>>> * all of the deeper states, a shallower idle state is likely to be a >>>>>>>>>>>>> * better choice. >>>>>>>>>>>>> */ >>>>>>>>>>>>> - if (2 * idx_intercept_sum > cpu_data->total - idx_hit_sum) { >>>>>>>>>>>>> + if (2 * idx_intercept_sum > idx_hit_sum) { >>>>>>>>>>>>> int first_suitable_idx = idx; >>>>>>>>>>>>> >>>>>>>>>>>>> /* >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ... nevermind the patch, idx_hit_sum is of course the sum of 0...idx-1. >>>>>>>>>>>> Maybe something like this, again lightly tested: >>>>>>>>>>>> >>>>>>>>>>>> -----8<----- >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c >>>>>>>>>>>> index 173ddcac540a..6bfb9cedb75e 100644 >>>>>>>>>>>> --- a/drivers/cpuidle/governors/teo.c >>>>>>>>>>>> +++ b/drivers/cpuidle/governors/teo.c >>>>>>>>>>>> @@ -383,13 +395,15 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev, >>>>>>>>>>>> * has been stopped already into account. >>>>>>>>>>>> */ >>>>>>>>>>>> intercept_sum = 0; >>>>>>>>>>>> + hit_sum = 0; >>>>>>>>>>>> >>>>>>>>>>>> for (i = idx - 1; i >= 0; i--) { >>>>>>>>>>>> struct teo_bin *bin = &cpu_data->state_bins[i]; >>>>>>>>>>>> >>>>>>>>>>>> intercept_sum += bin->intercepts; >>>>>>>>>>>> + hit_sum += bin->hits; >>>>>>>>>>>> >>>>>>>>>>>> - if (2 * intercept_sum > idx_intercept_sum) { >>>>>>>>>>>> + if (2 * intercept_sum > cpu_data->total || 2 * hit_sum > cpu_data->total) { >>>>>>>>>>>> /* >>>>>>>>>>>> * Use the current state unless it is too >>>>>>>>>>>> * shallow or disabled, in which case take the >>>>>>>>>>> >>>>>>>>>>> This will only matter after the deepest state has been rejected >>>>>>>>>>> already and on the system in question this means selecting state 0 no >>>>>>>>>>> matter what. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Ah, right! >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> The pre-6.12 behavior can be explained if tick wakeups are taken into account. >>>>>>>>>>> >>>>>>>>>>> Namely, when state 0 is chosen (because of the check mentioned above), >>>>>>>>>>> the tick is not stopped and the sleep length is KTIME_MAX. If the >>>>>>>>>>> subsequent wakeup is a tick one, it will be counted as a hit on the >>>>>>>>>>> deepest state (and it will contribute to the total sum in the check >>>>>>>>>>> mentioned above). Then, at one point, cpu_data->total will be large >>>>>>>>>>> enough and the deepest state will become the candidate one. If >>>>>>>>>>> tick_nohz_get_sleep_length() returns a large value at that point, the >>>>>>>>>>> tick will be stopped and the deepest state will be entered. Nirvana >>>>>>>>>>> ensues. >>>>>>>>>> >>>>>>>>>> So fundamentally we will have to count tick-wakeups as a) nothing, which >>>>>>>>>> doesn't allow us to ever break out of the intercept logic that caused us >>>>>>>>>> to leave the tick on b) intercepts, which is bonkers and doesn't allow us >>>>>>>>>> to ever break out and c) hits == sleep_length would've been accurate. >>>>>>>>>> Of course counting a tick wakeup as a hit for sleep_length negates the >>>>>>>>>> intercept logic. >>>>>>>>> >>>>>>>>> Not quite. The intercept logic is there for wakeups other than tick >>>>>>>>> wakeups and timer wakeups. >>>>>>>>> >>>>>>>>> I actually think that tick wakeups can be counted as hits on the >>>>>>>>> deepest available state - maybe only when tick wakeups dominate the >>>>>>>>> wakeup pattern - but generally this is not unreasonable: When the >>>>>>>>> wakeup pattern is dominated by tick wakeups, this by itself is a good >>>>>>>>> enough reason to stop the tick. >>>>>>>> >>>>>>>> (assuming HZ=1000 below but it doesn't matter) >>>>>>>> That will exclude any 'intercept' logic from having much effect if the >>>>>>>> avg idle duration is >TICK_NSEC/2, which is potentially still quite a bit >>>>>>>> off from state1 residency, like in Reka's case here. >>>>>>>> That's why I thought it would cause unreasonable regressions here. >>>>>>>> I'll give it a go as well though! >>>>>>> >>>>>>> Thanks! >>>>>>> >>>>>>> Note that I'd prefer to add a check if tick wakeups dominate the >>>>>>> wakeup pattern before setting sleep_length_ns to KTIME_MAX though. >>>>>>> I'd first like to know how the Reka's system reacts to the more >>>>>>> drastic variant of this change. >>>>>> >>>>>> Below are my usual tests, it's definitely visible but the impact is limited >>>>>> on this platform anyway. I think if we gate the KTIME_MAX setting behind >>>>>> the "tick wakeup dominate" it should be acceptable! >>>>>> Let's see what Reka reports. >>>>>> >>>>> Forgot to post the full results, anyway as expected with mtdblock (a very slow >>>>> / low frequent wakeup scenario) the impact becomes clearly visible. >>>>> Still hopeful that the more conservative approach will be acceptable! >>>> >>>> Speaking of which, the patch to test is appended below, but it doesn't apply >>>> directly on top of the mainline. It is based on some other patches that have >>>> been posted recently, so here's a git branch with all of the requisite >>>> material: >>>> >>>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git cpuidle-teo-testing >>>> >>>> Reka, please try this one and let us know how it goes. >>> >>> Results attached. The residencies are a bit less deep than before - >>> about 4.5% in WFI vs 2% at 6.6 or with the more aggressive patch. But >>> I’m guessing that’s expected. >>> >>> I also measured the power on a slightly different system where I first >>> noticed this regression, and it’s indistinguishable from 6.6. So from >>> my side this looks great, thank you! >> >> Good news! > > So I'm going to queue this up for 6.19 along with the material it depends on. > > I'd rather not rush it into 6.18-rc7 because it would not get > meaningful exposure in linux-next before appearing in the mainline in > that case. Ack > > Reka, I'm going to add your Tested-by: to it, please let me know if > there are any concerns regarding that. > > Now, I think that it needs to go into -stable (6.12+ I suppose?) and > I'm going to instruct -stable to pick up the dependencies along with > it. > Would be nice to get teo slightly more aligned, agreed! As noted I don't think it's 6.12+ only but clearly that's where the impact of the issue is the highest. Thanks for fixing this!