From: Christian Loehle <christian.loehle@arm.com>
To: Harshvardhan Jha <harshvardhan.j.jha@oracle.com>,
"Rafael J. Wysocki" <rafael@kernel.org>,
Doug Smythies <dsmythies@telus.net>
Cc: Sasha Levin <sashal@kernel.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
linux-pm@vger.kernel.org, stable@vger.kernel.org,
Daniel Lezcano <daniel.lezcano@linaro.org>,
Sergey Senozhatsky <senozhatsky@chromium.org>
Subject: Re: Performance regressions introduced via Revert "cpuidle: menu: Avoid discarding useful information" on 5.15 LTS
Date: Tue, 3 Feb 2026 09:07:42 +0000 [thread overview]
Message-ID: <8fd5a9d4-e555-4db1-aa02-8fe5b8a2962c@arm.com> (raw)
In-Reply-To: <5d4b624c-f993-49aa-95ab-5f279f7f6599@oracle.com>
On 2/2/26 17:31, Harshvardhan Jha wrote:
>
> On 02/02/26 12:50 AM, Christian Loehle wrote:
>> On 1/30/26 19:28, Rafael J. Wysocki wrote:
>>> On Thu, Jan 29, 2026 at 11:27 PM Doug Smythies <dsmythies@telus.net> wrote:
>>>> On 2026.01.28 15:53 Doug Smythies wrote:
>>>>> On 2026.01.27 21:07 Doug Smythies wrote:
>>>>>> On 2026.01.27 07:45 Harshvardhan Jha wrote:
>>>>>>> On 08/12/25 6:17 PM, Christian Loehle wrote:
>>>>>>>> On 12/8/25 11:33, Harshvardhan Jha wrote:
>>>>>>>>> On 04/12/25 4:00 AM, Doug Smythies wrote:
>>>>>>>>>> On 2025.12.03 08:45 Christian Loehle wrote:
>>>>>>>>>>> On 12/3/25 16:18, Harshvardhan Jha wrote:
>>>>> ... snip ...
>>>>>
>>>>>>>> It would be nice to get the idle states here, ideally how the states' usage changed
>>>>>>>> from base to revert.
>>>>>>>> The mentioned thread did this and should show how it can be done, but a dump of
>>>>>>>> cat /sys/devices/system/cpu/cpu*/cpuidle/state*/*
>>>>>>>> before and after the workload is usually fine to work with:
>>>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-pm/8da42386-282e-4f97-af93-4715ae206361@arm.com/__;!!ACWV5N9M2RV99hQ!PEhkFcO7emFLMaNxWEoE2Gtnw3zSkpghP17iuEvZM3W6KUpmkbgKw_tr91FwGfpzm4oA5f7c5sz8PkYvKiEVwI_iLIPpMt53$
>>>>>>> Apologies for the late reply, I'm attaching a tar ball which has the cpu
>>>>>>> states for the test suites before and after tests. The folders with the
>>>>>>> name of the test contain two folders good-kernel and bad-kernel
>>>>>>> containing two files having the before and after states. Please note
>>>>>>> that different machines were used for different test suites due to
>>>>>>> compatibility reasons. The jbb test was run using containers.
>>>>> Please provide the results of the test runs that were done for
>>>>> the supplied before and after idle data.
>>>>> In particular, what is the "fio" test and it results. Its idle data is not very revealing.
>>>>> Is it a test I can run on my test computer?
>>>> I see that I have fio installed on my test computer.
>>>>
>>>>>> It is a considerable amount of work to manually extract and summarize the data.
>>>>>> I have only done it for the phoronix-sqlite data.
>>>>> I have done the rest now, see below.
>>>>> I have also attached the results, in case the formatting gets screwed up.
>>>>>
>>>>>> There seems to be 40 CPUs, 5 idle states, with idle state 3 defaulting to disabled.
>>>>>> I remember seeing a Linux-pm email about why but couldn't find it just now.
>>>>>> Summary (also attached as a PNG file, in case the formatting gets messed up):
>>>>>> The total idle entries (usage) and time seem low to me, which is why the ???.
>>>>>>
>>>>>> phoronix-sqlite
>>>>>> Good Kernel: Time between samples 4 seconds (estimated and ???)
>>>>>> Usage Above Below Above Below
>>>>>> state 0 220 0 218 0.00% 99.09%
>>>>>> state 1 70212 5213 34602 7.42% 49.28%
>>>>>> state 2 30273 5237 1806 17.30% 5.97%
>>>>>> state 3 0 0 0 0.00% 0.00%
>>>>>> state 4 11824 2120 0 17.93% 0.00%
>>>>>>
>>>>>> total 112529 12570 36626 43.72% <<< Misses %
>>>>>>
>>>>>> Bad Kernel: Time between samples 3.8 seconds (estimated and ???)
>>>>>> Usage Above Below Above Below
>>>>>> state 0 262 0 260 0.00% 99.24%
>>>>>> state 1 62751 3985 35588 6.35% 56.71%
>>>>>> state 2 24941 7896 1433 31.66% 5.75%
>>>>>> state 3 0 0 0 0.00% 0.00%
>>>>>> state 4 24489 11543 0 47.14% 0.00%
>>>>>>
>>>>>> total 112443 23424 37281 53.99% <<< Misses %
>>>>>>
>>>>>> Observe 2X use of idle state 4 for the "Bad Kernel"
>>>>>>
>>>>>> I have a template now, and can summarize the other 40 CPU data
>>>>>> faster, but I would have to rework the template for the 56 CPU data,
>>>>>> and is it a 64 CPU data set at 4 idle states per CPU?
>>>>> jbb: 40 CPU's; 5 idle states, with idle state 3 defaulting to disabled.
>>>>> POLL, C1, C1E, C3 (disabled), C6
>>>>>
>>>>> Good Kernel: Time between samples > 2 hours (estimated)
>>>>> Usage Above Below Above Below
>>>>> state 0 297550 0 296084 0.00% 99.51%
>>>>> state 1 8062854 341043 4962635 4.23% 61.55%
>>>>> state 2 56708358 12688379 6252051 22.37% 11.02%
>>>>> state 3 0 0 0 0.00% 0.00%
>>>>> state 4 54624476 15868752 0 29.05% 0.00%
>>>>>
>>>>> total 119693238 28898174 11510770 33.76% <<< Misses
>>>>>
>>>>> Bad Kernel: Time between samples > 2 hours (estimated)
>>>>> Usage Above Below Above Below
>>>>> state 0 90715 0 75134 0.00% 82.82%
>>>>> state 1 8878738 312970 6082180 3.52% 68.50%
>>>>> state 2 12048728 2576251 603316 21.38% 5.01%
>>>>> state 3 0 0 0 0.00% 0.00%
>>>>> state 4 85999424 44723273 0 52.00% 0.00%
>>>>>
>>>>> total 107017605 47612494 6760630 50.81% <<< Misses
>>>>>
>>>>> As with the previous test, observe 1.6X use of idle state 4 for the "Bad Kernel"
>>>>>
>>>>> fio: 64 CPUs; 4 idle states; POLL, C1, C1E, C6.
>>>>>
>>>>> fio
>>>>> Good Kernel: Time between samples ~ 1 minute (estimated)
>>>>> Usage Above Below Above Below
>>>>> state 0 3822 0 3818 0.00% 99.90%
>>>>> state 1 148640 4406 68956 2.96% 46.39%
>>>>> state 2 593455 45344 105675 7.64% 17.81%
>>>>> state 3 3209648 807014 0 25.14% 0.00%
>>>>>
>>>>> total 3955565 856764 178449 26.17% <<< Misses
>>>>>
>>>>> Bad Kernel: Time between samples ~ 1 minute (estimated)
>>>>> Usage Above Below Above Below
>>>>> state 0 916 0 756 0.00% 82.53%
>>>>> state 1 80230 2028 42791 2.53% 53.34%
>>>>> state 2 59231 6888 6791 11.63% 11.47%
>>>>> state 3 2455784 564797 0 23.00% 0.00%
>>>>>
>>>>> total 2596161 573713 50338 24.04% <<< Misses
>>>>>
>>>>> It is not clear why the number of idle entries differs so much
>>>>> between the tests, but there is a bit of a different distribution
>>>>> of the workload among the CPUs.
>>>>>
>>>>> rds-stress: 56 CPUs; 5 idle states, with idle state 3 defaulting to disabled.
>>>>> POLL, C1, C1E, C3 (disabled), C6
>>>>>
>>>>> rds-stress-test
>>>>> Good Kernel: Time between samples ~70 Seconds (estimated)
>>>>> Usage Above Below Above Below
>>>>> state 0 1561 0 1435 0.00% 91.93%
>>>>> state 1 13855 899 2410 6.49% 17.39%
>>>>> state 2 467998 139254 23679 29.76% 5.06%
>>>>> state 3 0 0 0 0.00% 0.00%
>>>>> state 4 213132 107417 0 50.40% 0.00%
>>>>>
>>>>> total 696546 247570 27524 39.49% <<< Misses
>>>>>
>>>>> Bad Kernel: Time between samples ~ 70 Seconds (estimated)
>>>>> Usage Above Below Above Below
>>>>> state 0 231 0 231 0.00% 100.00%
>>>>> state 1 5413 266 1186 4.91% 21.91%
>>>>> state 2 54365 719 3789 1.32% 6.97%
>>>>> state 3 0 0 0 0.00% 0.00%
>>>>> state 4 267055 148327 0 55.54% 0.00%
>>>>>
>>>>> total 327064 149312 5206 47.24% <<< Misses
>>>>>
>>>>> Again, differing numbers of idle entries between tests.
>>>>> This time the load distribution between CPUs is more
>>>>> obvious. In the "Bad" case most work is done on 2 or 3 CPU's.
>>>>> In the "Good" case the work is distributed over more CPUs.
>>>>> I assume without proof, that the scheduler is deciding not to migrate
>>>>> the next bit of work to another CPU in the one case verses the other.
>>>> The above is incorrect. The CPUs involved between the "Good"
>>>> and "Bad" tests are very similar, mainly 2 CPUs with a little of
>>>> a 3rd and 4th. See the attached graph for more detail / clarity.
>>>>
>>>> All of the tests show higher usage of shallower idle states with
>>>> the "Good" verses the "Bad", which was the expectation of the
>>>> original patch, as has been mentioned a few times in the emails.
>>>>
>>>> My input is to revert the reversion.
>>> OK, noted, thanks!
>>>
>>> Christian, what do you think?
>> I've attached readable diffs of the values provided the tldr is:
>>
>> +--------------------+-----------+-----------+
>> | Workload | Δ above % | Δ below % |
>> +--------------------+-----------+-----------+
>> | fio | -10.11 | +2.36 |
>> | rds-stress-test | -0.44 | +2.57 |
>> | jbb | -20.35 | +3.30 |
>> | phoronix-sqlite | -9.66 | -0.61 |
>> +--------------------+-----------+-----------+
>>
>> I think the overall trend however is clear, the commit
>> 85975daeaa4d ("cpuidle: menu: Avoid discarding useful information")
>> improved menu on many systems and workloads, I'd dare to say most.
>>
>> Even on the reported regression introduced by it, the cpuidle governor
>> performed better on paper, system metrics regressed because other
>> CPUs' P-states weren't available due to being in a shallower state.
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/__;!!ACWV5N9M2RV99hQ!KSEGRBOHs7E_E4fRenT3y3MovrhDewsTY-E4lu1JCX0Py-r4GiEJefoLfcHrummpmvmeO_vp1beh-OO_MYxG9xLU0BuBunAS$
>> (+CC Sergey)
>> It could be argued that this is a limitation of a per-CPU cpuidle
>> governor and a more holistic approach would be needed for that platform
>> (i.e. power/thermal-budget-sharing-CPUs want to use higher P-states,
>> skew towards deeper cpuidle states).
>>
>> I also think that the change made sense, for small residency values
>> with a bit of random noise mixed in, performing the same statistical
>> test doesn't seem sensible, the short intervals will look noisier.
>>
>> So options are:
>> 1. Revert revert on mainline+stable
>> 2. Revert revert on mainline only
>> 3. Keep revert, miss out on the improvement for many.
>> 4. Revert only when we have a good solution for the platforms like
>> Sergey's.
>>
>> I'd lean towards 2 because 4 won't be easy, unless of course a minor
>> hack like playing with the deep idle state residency values would
>> be enough to mitigate.
>
> Wouldn't it be better to choose option 1 as reverting the revert has
> even more pronounced improvements on older kernels? I've tested this on
> 6.12, 5.15 and 5.4 stable based kernels and found massive improvements.
> Since the revert has optimizations present only in Jasper Lake Systems
> which is new, isn't reverting the revert more relevant on stable
> kernels? It's more likely that older hardware runs older kernels than
> newer hardware although not always necessary imo.
>
FWIW Jasper Lake seems to be supported from 5.6 on, see
b2d32af0bff4 ("x86/cpu: Add Jasper Lake to Intel family")
next prev parent reply other threads:[~2026-02-03 9:07 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-03 16:18 Performance regressions introduced via Revert "cpuidle: menu: Avoid discarding useful information" on 5.15 LTS Harshvardhan Jha
2025-12-03 16:44 ` Christian Loehle
2025-12-03 22:30 ` Doug Smythies
2025-12-08 11:33 ` Harshvardhan Jha
2025-12-08 12:47 ` Christian Loehle
2026-01-13 7:06 ` Harshvardhan Jha
2026-01-13 14:13 ` Rafael J. Wysocki
2026-01-13 14:18 ` Rafael J. Wysocki
2026-01-14 4:28 ` Sergey Senozhatsky
2026-01-14 4:49 ` Sergey Senozhatsky
2026-01-14 5:15 ` Tomasz Figa
2026-01-14 20:07 ` Rafael J. Wysocki
2026-01-29 10:23 ` Harshvardhan Jha
2026-01-29 22:47 ` Doug Smythies
2026-01-27 15:45 ` Harshvardhan Jha
2026-01-28 5:06 ` Doug Smythies
2026-01-28 23:53 ` Doug Smythies
2026-01-29 22:27 ` Doug Smythies
2026-01-30 19:28 ` Rafael J. Wysocki
2026-02-01 19:20 ` Christian Loehle
2026-02-02 17:31 ` Harshvardhan Jha
2026-02-03 9:07 ` Christian Loehle [this message]
2026-02-03 9:16 ` Harshvardhan Jha
2026-02-03 9:31 ` Christian Loehle
2026-02-03 10:22 ` Harshvardhan Jha
2026-02-03 10:30 ` Christian Loehle
2026-02-03 16:45 ` Rafael J. Wysocki
2026-02-05 0:45 ` Doug Smythies
2026-02-05 2:37 ` Sergey Senozhatsky
2026-02-05 5:18 ` Doug Smythies
2026-02-10 9:17 ` Sergey Senozhatsky
2026-02-11 4:27 ` Doug Smythies
2026-02-05 7:15 ` Christian Loehle
2026-02-10 8:02 ` Sergey Senozhatsky
2026-02-10 8:57 ` Christian Loehle
2026-02-11 4:27 ` Doug Smythies
2026-02-05 5:02 ` Doug Smythies
2026-02-10 9:33 ` Xueqin Luo
2026-02-10 10:04 ` Sergey Senozhatsky
2026-02-10 14:24 ` Rafael J. Wysocki
2026-02-11 1:34 ` Sergey Senozhatsky
2026-02-11 4:17 ` Doug Smythies
2026-02-11 8:58 ` Xueqin Luo
2026-02-10 14:20 ` Rafael J. Wysocki
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8fd5a9d4-e555-4db1-aa02-8fe5b8a2962c@arm.com \
--to=christian.loehle@arm.com \
--cc=daniel.lezcano@linaro.org \
--cc=dsmythies@telus.net \
--cc=gregkh@linuxfoundation.org \
--cc=harshvardhan.j.jha@oracle.com \
--cc=linux-pm@vger.kernel.org \
--cc=rafael@kernel.org \
--cc=sashal@kernel.org \
--cc=senozhatsky@chromium.org \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox