Re: Performance regressions introduced via Revert "cpuidle: menu: Avoid discarding useful information" on 5.15 LTS

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

From: Christian Loehle <christian.loehle@arm.com>
To: Harshvardhan Jha <harshvardhan.j.jha@oracle.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Doug Smythies <dsmythies@telus.net>
Cc: Sasha Levin <sashal@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	linux-pm@vger.kernel.org, stable@vger.kernel.org,
	Daniel Lezcano <daniel.lezcano@linaro.org>,
	Sergey Senozhatsky <senozhatsky@chromium.org>
Subject: Re: Performance regressions introduced via Revert "cpuidle: menu: Avoid discarding useful information" on 5.15 LTS
Date: Tue, 3 Feb 2026 09:07:42 +0000	[thread overview]
Message-ID: <8fd5a9d4-e555-4db1-aa02-8fe5b8a2962c@arm.com> (raw)
In-Reply-To: <5d4b624c-f993-49aa-95ab-5f279f7f6599@oracle.com>

On 2/2/26 17:31, Harshvardhan Jha wrote:
> 
> On 02/02/26 12:50 AM, Christian Loehle wrote:
>> On 1/30/26 19:28, Rafael J. Wysocki wrote:
>>> On Thu, Jan 29, 2026 at 11:27 PM Doug Smythies <dsmythies@telus.net> wrote:
>>>> On 2026.01.28 15:53 Doug Smythies wrote:
>>>>> On 2026.01.27 21:07 Doug Smythies wrote:
>>>>>> On 2026.01.27 07:45 Harshvardhan Jha wrote:
>>>>>>> On 08/12/25 6:17 PM, Christian Loehle wrote:
>>>>>>>> On 12/8/25 11:33, Harshvardhan Jha wrote:
>>>>>>>>> On 04/12/25 4:00 AM, Doug Smythies wrote:
>>>>>>>>>> On 2025.12.03 08:45 Christian Loehle wrote:
>>>>>>>>>>> On 12/3/25 16:18, Harshvardhan Jha wrote:
>>>>> ... snip ...
>>>>>
>>>>>>>> It would be nice to get the idle states here, ideally how the states' usage changed
>>>>>>>> from base to revert.
>>>>>>>> The mentioned thread did this and should show how it can be done, but a dump of
>>>>>>>> cat /sys/devices/system/cpu/cpu*/cpuidle/state*/*
>>>>>>>> before and after the workload is usually fine to work with:
>>>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-pm/8da42386-282e-4f97-af93-4715ae206361@arm.com/__;!!ACWV5N9M2RV99hQ!PEhkFcO7emFLMaNxWEoE2Gtnw3zSkpghP17iuEvZM3W6KUpmkbgKw_tr91FwGfpzm4oA5f7c5sz8PkYvKiEVwI_iLIPpMt53$
>>>>>>> Apologies for the late reply, I'm attaching a tar ball which has the cpu
>>>>>>> states for the test suites before and after tests. The folders with the
>>>>>>> name of the test contain two folders good-kernel and bad-kernel
>>>>>>> containing two files having the before and after states. Please note
>>>>>>> that different machines were used for different test suites due to
>>>>>>> compatibility reasons. The jbb test was run using containers.
>>>>> Please provide the results of the test runs that were done for
>>>>> the supplied before and after idle data.
>>>>> In particular, what is the "fio" test and it results. Its idle data is not very revealing.
>>>>> Is it a test I can run on my test computer?
>>>> I see that I have fio installed on my test computer.
>>>>
>>>>>> It is a considerable amount of work to manually extract and summarize the data.
>>>>>> I have only done it for the phoronix-sqlite data.
>>>>> I have done the rest now, see below.
>>>>> I have also attached the results, in case the formatting gets screwed up.
>>>>>
>>>>>> There seems to be 40 CPUs, 5 idle states, with idle state 3 defaulting to disabled.
>>>>>> I remember seeing a Linux-pm email about why but couldn't find it just now.
>>>>>> Summary (also attached as a PNG file, in case the formatting gets messed up):
>>>>>> The total idle entries (usage)  and time seem low to me, which is why the ???.
>>>>>>
>>>>>> phoronix-sqlite
>>>>>>      Good Kernel: Time between samples 4 seconds (estimated and ???)
>>>>>>              Usage   Above   Below   Above   Below
>>>>>> state 0      220     0       218     0.00%   99.09%
>>>>>> state 1      70212   5213    34602   7.42%   49.28%
>>>>>> state 2      30273   5237    1806    17.30%  5.97%
>>>>>> state 3      0       0       0       0.00%   0.00%
>>>>>> state 4      11824   2120    0       17.93%  0.00%
>>>>>>
>>>>>> total                112529  12570   36626   43.72%   <<< Misses %
>>>>>>
>>>>>>      Bad Kernel: Time between samples 3.8 seconds (estimated and ???)
>>>>>>              Usage   Above   Below   Above   Below
>>>>>> state 0      262     0       260     0.00%   99.24%
>>>>>> state 1      62751   3985    35588   6.35%   56.71%
>>>>>> state 2      24941   7896    1433    31.66%  5.75%
>>>>>> state 3      0       0       0       0.00%   0.00%
>>>>>> state 4      24489   11543   0       47.14%  0.00%
>>>>>>
>>>>>> total                112443  23424   37281   53.99%   <<< Misses %
>>>>>>
>>>>>> Observe 2X use of idle state 4 for the "Bad Kernel"
>>>>>>
>>>>>> I have a template now, and can summarize the other 40 CPU data
>>>>>> faster, but I would have to rework the template for the 56 CPU data,
>>>>>> and is it a 64 CPU data set at 4 idle states per CPU?
>>>>> jbb: 40 CPU's; 5 idle states, with idle state 3 defaulting to disabled.
>>>>> POLL, C1, C1E, C3 (disabled), C6
>>>>>
>>>>>       Good Kernel: Time between samples > 2 hours (estimated)
>>>>>       Usage           Above           Below           Above   Below
>>>>> state 0       297550          0               296084          0.00%   99.51%
>>>>> state 1       8062854 341043          4962635 4.23%   61.55%
>>>>> state 2       56708358        12688379        6252051 22.37%  11.02%
>>>>> state 3       0               0               0               0.00%   0.00%
>>>>> state 4       54624476        15868752        0               29.05%  0.00%
>>>>>
>>>>> total 119693238       28898174        11510770        33.76%  <<< Misses
>>>>>
>>>>>       Bad Kernel: Time between samples > 2 hours (estimated)
>>>>>       Usage           Above           Below           Above   Below
>>>>> state 0       90715           0               75134           0.00%   82.82%
>>>>> state 1       8878738 312970          6082180 3.52%   68.50%
>>>>> state 2       12048728        2576251 603316          21.38%  5.01%
>>>>> state 3       0               0               0               0.00%   0.00%
>>>>> state 4       85999424        44723273        0               52.00%  0.00%
>>>>>
>>>>> total 107017605       47612494        6760630 50.81%  <<< Misses
>>>>>
>>>>> As with the previous test, observe 1.6X use of idle state 4 for the "Bad Kernel"
>>>>>
>>>>> fio: 64 CPUs; 4 idle states; POLL, C1, C1E, C6.
>>>>>
>>>>> fio
>>>>>       Good Kernel: Time between samples ~ 1 minute (estimated)
>>>>>       Usage           Above   Below   Above   Below
>>>>> state 0       3822            0       3818    0.00%   99.90%
>>>>> state 1       148640          4406    68956   2.96%   46.39%
>>>>> state 2       593455          45344   105675  7.64%   17.81%
>>>>> state 3       3209648 807014  0       25.14%  0.00%
>>>>>
>>>>> total 3955565 856764  178449  26.17%  <<< Misses
>>>>>
>>>>>       Bad Kernel: Time between samples ~ 1 minute (estimated)
>>>>>       Usage           Above   Below   Above   Below
>>>>> state 0       916             0       756     0.00%   82.53%
>>>>> state 1       80230           2028    42791   2.53%   53.34%
>>>>> state 2       59231           6888    6791    11.63%  11.47%
>>>>> state 3       2455784 564797  0       23.00%  0.00%
>>>>>
>>>>> total 2596161 573713  50338   24.04%  <<< Misses
>>>>>
>>>>> It is not clear why the number of idle entries differs so much
>>>>> between the tests, but there is a bit of a different distribution
>>>>> of the workload among the CPUs.
>>>>>
>>>>> rds-stress: 56 CPUs; 5 idle states, with idle state 3 defaulting to disabled.
>>>>> POLL, C1, C1E, C3 (disabled), C6
>>>>>
>>>>> rds-stress-test
>>>>>       Good Kernel: Time between samples ~70 Seconds (estimated)
>>>>>       Usage   Above   Below   Above   Below
>>>>> state 0       1561    0       1435    0.00%   91.93%
>>>>> state 1       13855   899     2410    6.49%   17.39%
>>>>> state 2       467998  139254  23679   29.76%  5.06%
>>>>> state 3       0       0       0       0.00%   0.00%
>>>>> state 4       213132  107417  0       50.40%  0.00%
>>>>>
>>>>> total 696546  247570  27524   39.49%  <<< Misses
>>>>>
>>>>>       Bad Kernel: Time between samples ~ 70 Seconds (estimated)
>>>>>       Usage   Above   Below   Above   Below
>>>>> state 0       231     0       231     0.00%   100.00%
>>>>> state 1       5413    266     1186    4.91%   21.91%
>>>>> state 2       54365   719     3789    1.32%   6.97%
>>>>> state 3       0       0       0       0.00%   0.00%
>>>>> state 4       267055  148327  0       55.54%  0.00%
>>>>>
>>>>> total 327064  149312  5206    47.24%  <<< Misses
>>>>>
>>>>> Again, differing numbers of idle entries between tests.
>>>>> This time the load distribution between CPUs is more
>>>>> obvious. In the "Bad" case most work is done on 2 or 3 CPU's.
>>>>> In the "Good" case the work is distributed over more CPUs.
>>>>> I assume without proof, that the scheduler is deciding not to migrate
>>>>> the next bit of work to another CPU in the one case verses the other.
>>>> The above is incorrect. The CPUs involved between the "Good"
>>>> and "Bad" tests are very similar, mainly 2 CPUs with a little of
>>>> a 3rd and 4th. See the attached graph for more detail / clarity.
>>>>
>>>> All of the tests show higher usage of shallower idle states with
>>>> the "Good" verses the "Bad", which was the expectation of the
>>>> original patch, as has been mentioned a few times in the emails.
>>>>
>>>> My input is to revert the reversion.
>>> OK, noted, thanks!
>>>
>>> Christian, what do you think?
>> I've attached readable diffs of the values provided the tldr is:
>>
>> +--------------------+-----------+-----------+
>> | Workload           | Δ above % | Δ below % |
>> +--------------------+-----------+-----------+
>> | fio                |  -10.11   |  +2.36    |
>> | rds-stress-test    |   -0.44   |  +2.57    |
>> | jbb                |  -20.35   |  +3.30    |
>> | phoronix-sqlite    |   -9.66   |  -0.61    |
>> +--------------------+-----------+-----------+
>>
>> I think the overall trend however is clear, the commit
>> 85975daeaa4d ("cpuidle: menu: Avoid discarding useful information")
>> improved menu on many systems and workloads, I'd dare to say most.
>>
>> Even on the reported regression introduced by it, the cpuidle governor
>> performed better on paper, system metrics regressed because other
>> CPUs' P-states weren't available due to being in a shallower state.
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-pm/36iykr223vmcfsoysexug6s274nq2oimcu55ybn6ww4il3g3cv@cohflgdbpnq7/__;!!ACWV5N9M2RV99hQ!KSEGRBOHs7E_E4fRenT3y3MovrhDewsTY-E4lu1JCX0Py-r4GiEJefoLfcHrummpmvmeO_vp1beh-OO_MYxG9xLU0BuBunAS$ 
>> (+CC Sergey)
>> It could be argued that this is a limitation of a per-CPU cpuidle
>> governor and a more holistic approach would be needed for that platform
>> (i.e. power/thermal-budget-sharing-CPUs want to use higher P-states,
>> skew towards deeper cpuidle states).
>>
>> I also think that the change made sense, for small residency values
>> with a bit of random noise mixed in, performing the same statistical
>> test doesn't seem sensible, the short intervals will look noisier.
>>
>> So options are:
>> 1. Revert revert on mainline+stable
>> 2. Revert revert on mainline only
>> 3. Keep revert, miss out on the improvement for many.
>> 4. Revert only when we have a good solution for the platforms like
>> Sergey's.
>>
>> I'd lean towards 2 because 4 won't be easy, unless of course a minor
>> hack like playing with the deep idle state residency values would
>> be enough to mitigate.
> 
> Wouldn't it be better to choose option 1 as reverting the revert has
> even more pronounced improvements on older kernels? I've tested this on
> 6.12, 5.15 and 5.4 stable based kernels and found massive improvements.
> Since the revert has optimizations present only in Jasper Lake Systems
> which is new, isn't reverting the revert more relevant on stable
> kernels? It's more likely that older hardware runs older kernels than
> newer hardware although not always necessary imo.
> 

FWIW Jasper Lake seems to be supported from 5.6 on, see
b2d32af0bff4 ("x86/cpu: Add Jasper Lake to Intel family")

next prev parent reply	other threads:[~2026-02-03  9:07 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-03 16:18 Performance regressions introduced via Revert "cpuidle: menu: Avoid discarding useful information" on 5.15 LTS Harshvardhan Jha
2025-12-03 16:44 ` Christian Loehle
2025-12-03 22:30   ` Doug Smythies
2025-12-08 11:33     ` Harshvardhan Jha
2025-12-08 12:47       ` Christian Loehle
2026-01-13  7:06         ` Harshvardhan Jha
2026-01-13 14:13           ` Rafael J. Wysocki
2026-01-13 14:18             ` Rafael J. Wysocki
2026-01-14  4:28               ` Sergey Senozhatsky
2026-01-14  4:49                 ` Sergey Senozhatsky
2026-01-14  5:15                   ` Tomasz Figa
2026-01-14 20:07                     ` Rafael J. Wysocki
2026-01-29 10:23                       ` Harshvardhan Jha
2026-01-29 22:47                   ` Doug Smythies
2026-01-27 15:45         ` Harshvardhan Jha
2026-01-28  5:06           ` Doug Smythies
2026-01-28 23:53             ` Doug Smythies
2026-01-29 22:27               ` Doug Smythies
2026-01-30 19:28                 ` Rafael J. Wysocki
2026-02-01 19:20                   ` Christian Loehle
2026-02-02 17:31                     ` Harshvardhan Jha
2026-02-03  9:07                       ` Christian Loehle [this message]
2026-02-03  9:16                         ` Harshvardhan Jha
2026-02-03  9:31                           ` Christian Loehle
2026-02-03 10:22                             ` Harshvardhan Jha
2026-02-03 10:30                               ` Christian Loehle
2026-02-03 16:45                             ` Rafael J. Wysocki
2026-02-05  0:45                               ` Doug Smythies
2026-02-05  2:37                                 ` Sergey Senozhatsky
2026-02-05  5:18                                   ` Doug Smythies
2026-02-10  9:17                                     ` Sergey Senozhatsky
2026-02-11  4:27                                       ` Doug Smythies
2026-02-05  7:15                                   ` Christian Loehle
2026-02-10  8:02                                     ` Sergey Senozhatsky
2026-02-10  8:57                                       ` Christian Loehle
2026-02-11  4:27                                         ` Doug Smythies
2026-02-05  5:02                                 ` Doug Smythies
2026-02-10  9:33                                   ` Xueqin Luo
2026-02-10 10:04                                     ` Sergey Senozhatsky
2026-02-10 14:24                                       ` Rafael J. Wysocki
2026-02-11  1:34                                         ` Sergey Senozhatsky
2026-02-11  4:17                                           ` Doug Smythies
2026-02-11  8:58                                         ` Xueqin Luo
2026-02-10 14:20                                     ` Rafael J. Wysocki

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8fd5a9d4-e555-4db1-aa02-8fe5b8a2962c@arm.com \
    --to=christian.loehle@arm.com \
    --cc=daniel.lezcano@linaro.org \
    --cc=dsmythies@telus.net \
    --cc=gregkh@linuxfoundation.org \
    --cc=harshvardhan.j.jha@oracle.com \
    --cc=linux-pm@vger.kernel.org \
    --cc=rafael@kernel.org \
    --cc=sashal@kernel.org \
    --cc=senozhatsky@chromium.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox