From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Doug Smythies"
To: "'Rafael J. Wysocki'"
Cc: "'Mel Gorman'", "'Rafael Wysocki'", "'Jörg Otte'", "'Linux Kernel Mailing List'", "'Linux PM'", "'Srinivas Pandruvada'", "Doug Smythies"
Subject: RE: Performance of low-cpu utilisation benchmark regressed severely since 4.6
Date: Sun, 23 Apr 2017 08:31:25 -0700
Message-ID: <000501d2bc46$ad4b1fc0$07e15f40$@net>
References: <20170410084117.rjh3mtdx7hd2i5ze@techsingularity.net> <000a01d2b9e6$393afef0$abb0fcd0$@net> <000301d2bb31$c0037790$400a66b0$@net> 22LpdqAXDopZn22LudrJa9
In-Reply-To: 22LpdqAXDopZn22LudrJa9
Sender: linux-kernel-owner@vger.kernel.org
List-ID: X-Mailing-List:
linux-kernel@vger.kernel.org

On 2017.04.22 14:08 Rafael wrote:
> On Friday, April 21, 2017 11:29:06 PM Doug Smythies wrote:
>> On 2017.04.20 18:18 Rafael wrote:
>>> On Thursday, April 20, 2017 07:55:57 AM Doug Smythies wrote:
>>>> On 2017.04.19 01:16 Mel Gorman wrote:
>>>>> On Fri, Apr 14, 2017 at 04:01:40PM -0700, Doug Smythies wrote:
>>>>>> Hi Mel,
>>>
>>> [cut]
>>>
>>>>> And the revert does help, albeit not being an option for reasons Rafael
>>>>> covered.
>>>>
>>>> New data point: kernel 4.11-rc7 intel_pstate, powersave forcing the
>>>> load-based algorithm: Elapsed 3178 seconds.
>>>>
>>>> If I understand your data correctly, my load-based results are the
>>>> opposite of yours.
>>>>
>>>> Mel: 4.11-rc5 vanilla: Elapsed mean: 3750.20 seconds
>>>> Mel: 4.11-rc5 load based: Elapsed mean: 2503.27 seconds
>>>> Or: 33.25%
>>>>
>>>> Doug: 4.11-rc6 stock: Elapsed total (5 runs): 2364.45 seconds
>>>> Doug: 4.11-rc7 force load based: Elapsed total (5 runs): 3178 seconds
>>>> Or: -34.4%
>>>
>>> I wonder if you can do the same thing I've just advised Mel to do. That is,
>>> take my linux-next branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
>>>
>>> (which is new material for 4.12 on top of 4.11-rc7) and reduce
>>> INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL (in intel_pstate.c) in it by 1/2
>>> (force load-based if need be; I'm not sure what the PM profile of your
>>> test system is).
>>
>> I did not need to force load-based. I do not know how to figure it out from
>> an acpidump the way Srinivas does, so I did a trace and determined which
>> algorithm it was using from the data.
>>
>> Reference test, before changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
>> 3239.4 seconds.
>>
>> Test after changing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL:
>> 3195.5 seconds.
>
> So it does have an effect, but relatively small.

I don't know how repeatable the test results are, i.e. whether the 1.36%
change is within experimental error or not.
That being said, the trend does seem consistent.

> I wonder if further reducing INTEL_PSTATE_DEFAULT_SAMPLING_INTERVAL to 2 ms
> will make any difference.

I went all the way to 1 ms, just for the test: 3123.9 seconds.

>> By far, and with any code, I get the fastest elapsed time, of course next
>> to performance mode, but not by much, by limiting the test to use just
>> 1 CPU: 1814.2 seconds.
>
> Interesting.
>
> It looks like the cost is mostly related to moving the load from one CPU to
> another and waiting for the new one to ramp up then.
>
> I guess the workload consists of many small tasks that each start on new
> CPUs and cause that ping-pong to happen.

Yes, and (from the trace data) many tasks are very, very small. Also, the
test appears to take a few holidays, of up to 1 second, during execution.

>> (performance governor, restated from a previous e-mail: 1776.05 seconds)
>
> But that causes the processor to stay in the maximum sustainable P-state all
> the time, which on Sandy Bridge is quite costly energetically.

Agreed. I only provide these data points as a reference and so that we know
what the boundary conditions (limits) are.

> We can do one more trick I forgot about. Namely, if we are about to increase
> the P-state, we can jump to the average of the target and the max instead of
> just the target, like in the appended patch (on top of linux-next).
>
> That will make the P-state selection really aggressive, and so costly
> energetically, but it should allow small jumps of the average load above 0
> to cause big jumps of the target P-state.

I'm already seeing the energy costs of some of this stuff.

Result: 3050.2 seconds. Idle power: 4.06 watts.

Idle power for kernel 4.11-rc7 (performance-based): 3.89 watts.
Idle power for kernel 4.11-rc7, using load-based: 4.01 watts.
Idle power for kernel 4.11-rc7 next linux-pm: 3.91 watts.