[BISECTED] "conservative" cpufreq governor broken

All of lore.kernel.org
 help / color / mirror / Atom feed

* [BISECTED] "conservative" cpufreq governor broken
@ 2009-10-05 16:32 Steven Noonan
  2009-10-06  7:31 ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-05 16:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: Steven Noonan, Eero Nurkkala, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

I noticed on my machine that the "conservative" cpufreq governor wasn't 
working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on 
v2.6.30.8, so I decided I should figure out where this issue was coming 
from. The issue is pretty clear...


Here's the expected:

cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 0
  hardware limits: 1000 MHz - 2.33 GHz
  available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
  available cpufreq governors: ondemand, userspace, powersave, conservative, performance
  current policy: frequency should be within 1000 MHz and 2.33 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency is 1000 MHz (asserted by call to hardware).
  cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63%  (33)
analyzing CPU 1:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 1
  hardware limits: 1000 MHz - 2.33 GHz
  available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
  available cpufreq governors: ondemand, userspace, powersave, conservative, performance
  current policy: frequency should be within 1000 MHz and 2.33 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency is 1000 MHz (asserted by call to hardware).
  cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27%  (7)



And here is the broken version (note the 'cpufreq stats' line):

cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 0
  hardware limits: 1000 MHz - 2.33 GHz
  available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
  available cpufreq governors: ondemand, userspace, powersave, conservative, performance
  current policy: frequency should be within 1000 MHz and 2.33 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency is 2.33 GHz (asserted by call to hardware).
  cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
analyzing CPU 1:
  driver: acpi-cpufreq
  CPUs which need to switch frequency at the same time: 1
  hardware limits: 1000 MHz - 2.33 GHz
  available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
  available cpufreq governors: ondemand, userspace, powersave, conservative, performance
  current policy: frequency should be within 1000 MHz and 2.33 GHz.
                  The governor "conservative" may decide which speed to use
                  within this range.
  current CPU frequency is 2.33 GHz (asserted by call to hardware).
  cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%


So basically, it just never clocks down from the maximum frequency.


Here's the bisection log:

 # bad:  [2147b209] Linux 2.6.31.1
 # good: [a1c4c06a] Linux 2.6.30.8
 # good: [07a2039b] Linux 2.6.30
 # good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
 # bad:  [906e8d97] e1000e: delay second read of PHY_STATUS register o
 # good: [36e84467] Staging: heci: fix userspace pointer mess
 # bad:  [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
 # skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
 # good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
 # bad:  [eca41044] n_r3964: fix lock imbalance
 # good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
 # bad:  [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
 # good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
 # good: [71e308a2] function-graph: add stack frame test
 # bad:  [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
 # good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
 # good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
 # good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
 # bad:  [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor


And finally, the commit that broke "conservative":

commit f2e21c9610991e95621a81407cdbab881226419b
Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Date:   Mon May 25 09:57:37 2009 +0300

    NOHZ: Properly feed cpufreq ondemand governor
    
    A call from irq_exit() may occasionally pause the timing
    info for cpufreq ondemand governor. This results in the
    cpufreq ondemand governor to fail to calculate the
    system load properly. Thus, relocate the checks for this
    particular case to keep the governor always functional.
    
    Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
    Reported-by: Tero Kristo <tero.kristo@nokia.com>
    Acked-by: Rik van Riel <riel@redhat.com>
    Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>


I'd work on fixing it myself and whip up a patch, but I'm going to be gone 
all day and I'm not too familiar with cpufreq anyway.

- Steven

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-05 16:32 [BISECTED] "conservative" cpufreq governor broken Steven Noonan
@ 2009-10-06  7:31 ` Eero Nurkkala
  2009-10-06 10:22   ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-06  7:31 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Mon, 2009-10-05 at 18:32 +0200, ext Steven Noonan wrote:
> I noticed on my machine that the "conservative" cpufreq governor wasn't 
> working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on 
> v2.6.30.8, so I decided I should figure out where this issue was coming 
> from. The issue is pretty clear...
> 

I had some troubles with cpufreq-info as all values in "cpufreq stats"
were being as 0,00% (I fixed it by replacing unsigned long longs with
unsigned longs, and recompiled)

If this shows still insane values:
cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
I guess your system is indeed broken.

However. I get:
(cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state)

(OP1 == highest Frequency)
OP1 7148
OP2 242
OP3 2307
OP4 43145

And another round:

cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)

Just once more after doing nothing:
OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)

So I can't agree it's broken. The patch you bisected, actually filtered
out such phenomenon, in which an IRQ made the cpufreq framework
occasionally think we were idling, although we were not. So you got
"bonus" idle time that shouldn't been there in the first place. Now that
the "bonus" idle time is not there, your system load may indeed be so
high that the system never spends 80% or more time in idle? Could that
be the case? Of course, even though I can't agree it's broken, doesn't
mean it isn't somehow broken ;) It'd be nice to get info on other
systems as well...

> 
> Here's the expected:
> 
> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
> Report errors and bugs to cpufreq@vger.kernel.org, please.
> analyzing CPU 0:
>   driver: acpi-cpufreq
>   CPUs which need to switch frequency at the same time: 0
>   hardware limits: 1000 MHz - 2.33 GHz
>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>                   The governor "conservative" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1000 MHz (asserted by call to hardware).
>   cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63%  (33)
> analyzing CPU 1:
>   driver: acpi-cpufreq
>   CPUs which need to switch frequency at the same time: 1
>   hardware limits: 1000 MHz - 2.33 GHz
>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>                   The governor "conservative" may decide which speed to use
>                   within this range.
>   current CPU frequency is 1000 MHz (asserted by call to hardware).
>   cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27%  (7)
> 
> 
> 
> And here is the broken version (note the 'cpufreq stats' line):
> 
> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
> Report errors and bugs to cpufreq@vger.kernel.org, please.
> analyzing CPU 0:
>   driver: acpi-cpufreq
>   CPUs which need to switch frequency at the same time: 0
>   hardware limits: 1000 MHz - 2.33 GHz
>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>                   The governor "conservative" may decide which speed to use
>                   within this range.
>   current CPU frequency is 2.33 GHz (asserted by call to hardware).
>   cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
> analyzing CPU 1:
>   driver: acpi-cpufreq
>   CPUs which need to switch frequency at the same time: 1
>   hardware limits: 1000 MHz - 2.33 GHz
>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>                   The governor "conservative" may decide which speed to use
>                   within this range.
>   current CPU frequency is 2.33 GHz (asserted by call to hardware).
>   cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
> 
> 
> So basically, it just never clocks down from the maximum frequency.
> 
> 
> Here's the bisection log:
> 
>  # bad:  [2147b209] Linux 2.6.31.1
>  # good: [a1c4c06a] Linux 2.6.30.8
>  # good: [07a2039b] Linux 2.6.30
>  # good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
>  # bad:  [906e8d97] e1000e: delay second read of PHY_STATUS register o
>  # good: [36e84467] Staging: heci: fix userspace pointer mess
>  # bad:  [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
>  # skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
>  # good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
>  # bad:  [eca41044] n_r3964: fix lock imbalance
>  # good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
>  # bad:  [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
>  # good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
>  # good: [71e308a2] function-graph: add stack frame test
>  # bad:  [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
>  # good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
>  # good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
>  # good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
>  # bad:  [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor
> 
> 
> And finally, the commit that broke "conservative":
> 
> commit f2e21c9610991e95621a81407cdbab881226419b
> Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
> Date:   Mon May 25 09:57:37 2009 +0300
> 
>     NOHZ: Properly feed cpufreq ondemand governor
>     
>     A call from irq_exit() may occasionally pause the timing
>     info for cpufreq ondemand governor. This results in the
>     cpufreq ondemand governor to fail to calculate the
>     system load properly. Thus, relocate the checks for this
>     particular case to keep the governor always functional.
>     
>     Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
>     Reported-by: Tero Kristo <tero.kristo@nokia.com>
>     Acked-by: Rik van Riel <riel@redhat.com>
>     Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> 
> I'd work on fixing it myself and whip up a patch, but I'm going to be gone 
> all day and I'm not too familiar with cpufreq anyway.
> 
> - Steven


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-06  7:31 ` Eero Nurkkala
@ 2009-10-06 10:22   ` Steven Noonan
  2009-10-06 10:43     ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-06 10:22 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, Oct 6, 2009 at 12:31 AM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Mon, 2009-10-05 at 18:32 +0200, ext Steven Noonan wrote:
>> I noticed on my machine that the "conservative" cpufreq governor wasn't
>> working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on
>> v2.6.30.8, so I decided I should figure out where this issue was coming
>> from. The issue is pretty clear...
>>
>
> I had some troubles with cpufreq-info as all values in "cpufreq stats"
> were being as 0,00% (I fixed it by replacing unsigned long longs with
> unsigned longs, and recompiled)

That doesn't make much sense for my case. I used the same build of
cpufreq-info through the whole bisection and the problem was very
consistently reproducible throughout. And the bisected commit is
indeed related to cpufreq.

Regardless, cpufreq-info wasn't the only reason that I thought
something was wrong. My machine was unusually warm, fans were going
pretty much constantly, and the Gnome applet for CPU frequency
monitoring constantly showed it was using 2.33GHz and never scaling
down. I checked system load, which stayed at zero. Multiple different
sources (top, perf top, etc) seemed to indicate there was no
justification for the "conservative" governor's behaviour.

> If this shows still insane values:
> cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
> I guess your system is indeed broken.

Here's the output:

2333000 6880
2167000 0
2000000 0
1833000 0
1667000 0
1500000 0
1333000 0
1000000 0
2333000 6880
2167000 0
2000000 0
1833000 0
1667000 0
1500000 0
1333000 0
1000000 0

Pretty broken, alright.

> However. I get:
> (cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state)
>
> (OP1 == highest Frequency)
> OP1 7148
> OP2 242
> OP3 2307
> OP4 43145

I would suspect you have to have CONFIG_NO_HZ enabled to be able to
reproduce the issue (considering the title of the bisected commit and
my own config). Do you have it enabled?

> And another round:
>
> cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)
>
> Just once more after doing nothing:
> OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)
>
> So I can't agree it's broken. The patch you bisected, actually filtered
> out such phenomenon, in which an IRQ made the cpufreq framework
> occasionally think we were idling, although we were not. So you got
> "bonus" idle time that shouldn't been there in the first place. Now that
> the "bonus" idle time is not there, your system load may indeed be so
> high that the system never spends 80% or more time in idle? Could that
> be the case? Of course, even though I can't agree it's broken, doesn't
> mean it isn't somehow broken ;) It'd be nice to get info on other
> systems as well...

Interestingly, "ondemand" (the governor fixed by the bisected commit)
works fine. "conservative" is the only broken one.

>>
>> Here's the expected:
>>
>> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
>> Report errors and bugs to cpufreq@vger.kernel.org, please.
>> analyzing CPU 0:
>>   driver: acpi-cpufreq
>>   CPUs which need to switch frequency at the same time: 0
>>   hardware limits: 1000 MHz - 2.33 GHz
>>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>>                   The governor "conservative" may decide which speed to use
>>                   within this range.
>>   current CPU frequency is 1000 MHz (asserted by call to hardware).
>>   cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63%  (33)
>> analyzing CPU 1:
>>   driver: acpi-cpufreq
>>   CPUs which need to switch frequency at the same time: 1
>>   hardware limits: 1000 MHz - 2.33 GHz
>>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>>                   The governor "conservative" may decide which speed to use
>>                   within this range.
>>   current CPU frequency is 1000 MHz (asserted by call to hardware).
>>   cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27%  (7)
>>
>>
>>
>> And here is the broken version (note the 'cpufreq stats' line):
>>
>> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
>> Report errors and bugs to cpufreq@vger.kernel.org, please.
>> analyzing CPU 0:
>>   driver: acpi-cpufreq
>>   CPUs which need to switch frequency at the same time: 0
>>   hardware limits: 1000 MHz - 2.33 GHz
>>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>>                   The governor "conservative" may decide which speed to use
>>                   within this range.
>>   current CPU frequency is 2.33 GHz (asserted by call to hardware).
>>   cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
>> analyzing CPU 1:
>>   driver: acpi-cpufreq
>>   CPUs which need to switch frequency at the same time: 1
>>   hardware limits: 1000 MHz - 2.33 GHz
>>   available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
>>   available cpufreq governors: ondemand, userspace, powersave, conservative, performance
>>   current policy: frequency should be within 1000 MHz and 2.33 GHz.
>>                   The governor "conservative" may decide which speed to use
>>                   within this range.
>>   current CPU frequency is 2.33 GHz (asserted by call to hardware).
>>   cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
>>
>>
>> So basically, it just never clocks down from the maximum frequency.
>>
>>
>> Here's the bisection log:
>>
>>  # bad:  [2147b209] Linux 2.6.31.1
>>  # good: [a1c4c06a] Linux 2.6.30.8
>>  # good: [07a2039b] Linux 2.6.30
>>  # good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
>>  # bad:  [906e8d97] e1000e: delay second read of PHY_STATUS register o
>>  # good: [36e84467] Staging: heci: fix userspace pointer mess
>>  # bad:  [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
>>  # skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
>>  # good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
>>  # bad:  [eca41044] n_r3964: fix lock imbalance
>>  # good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
>>  # bad:  [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
>>  # good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
>>  # good: [71e308a2] function-graph: add stack frame test
>>  # bad:  [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
>>  # good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
>>  # good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
>>  # good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
>>  # bad:  [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor
>>
>>
>> And finally, the commit that broke "conservative":
>>
>> commit f2e21c9610991e95621a81407cdbab881226419b
>> Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
>> Date:   Mon May 25 09:57:37 2009 +0300
>>
>>     NOHZ: Properly feed cpufreq ondemand governor
>>
>>     A call from irq_exit() may occasionally pause the timing
>>     info for cpufreq ondemand governor. This results in the
>>     cpufreq ondemand governor to fail to calculate the
>>     system load properly. Thus, relocate the checks for this
>>     particular case to keep the governor always functional.
>>
>>     Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
>>     Reported-by: Tero Kristo <tero.kristo@nokia.com>
>>     Acked-by: Rik van Riel <riel@redhat.com>
>>     Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
>>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>
>>
>> I'd work on fixing it myself and whip up a patch, but I'm going to be gone
>> all day and I'm not too familiar with cpufreq anyway.
>>
>> - Steven
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-06 10:22   ` Steven Noonan
@ 2009-10-06 10:43     ` Eero Nurkkala
  2009-10-06 11:22       ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-06 10:43 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
> 
> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
> reproduce the issue (considering the title of the bisected commit and
> my own config). Do you have it enabled?
> 

Yes, it's enabled.

> > And another round:
> >
> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)
> >
> > Just once more after doing nothing:
> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)
> >
> > So I can't agree it's broken. The patch you bisected, actually filtered
> > out such phenomenon, in which an IRQ made the cpufreq framework
> > occasionally think we were idling, although we were not. So you got
> > "bonus" idle time that shouldn't been there in the first place. Now that
> > the "bonus" idle time is not there, your system load may indeed be so
> > high that the system never spends 80% or more time in idle? Could that
> > be the case? Of course, even though I can't agree it's broken, doesn't
> > mean it isn't somehow broken ;) It'd be nice to get info on other
> > systems as well...
> 
> Interestingly, "ondemand" (the governor fixed by the bisected commit)
> works fine. "conservative" is the only broken one.
> 

If you took timestamps in /arch/x86/kernel/process_**.c:
(let's assume process_64.c) in cpu_idle()
around enter_idle(); and __exit_idle(), took the diff,
added the diffs up, and compared it to system uptime, you could see how
much time you spend in idle()? I think it's possible that
even if the cpu load is near 0%, the system may idle only for a bare
moment (that translates to a buggy pm_idle()), and time is spent
elsewhere (less than 80% in idle). 

If I had similar findings like you, I'd be worried; but traces here and
there will point out where things go wrong ;)

- Eero


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-06 10:43     ` Eero Nurkkala
@ 2009-10-06 11:22       ` Steven Noonan
  2009-10-06 12:36         ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-06 11:22 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>>
>> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
>> reproduce the issue (considering the title of the bisected commit and
>> my own config). Do you have it enabled?
>>
>
> Yes, it's enabled.
>
>> > And another round:
>> >
>> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)
>> >
>> > Just once more after doing nothing:
>> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)
>> >
>> > So I can't agree it's broken. The patch you bisected, actually filtered
>> > out such phenomenon, in which an IRQ made the cpufreq framework
>> > occasionally think we were idling, although we were not. So you got
>> > "bonus" idle time that shouldn't been there in the first place. Now that
>> > the "bonus" idle time is not there, your system load may indeed be so
>> > high that the system never spends 80% or more time in idle? Could that
>> > be the case? Of course, even though I can't agree it's broken, doesn't
>> > mean it isn't somehow broken ;) It'd be nice to get info on other
>> > systems as well...
>>
>> Interestingly, "ondemand" (the governor fixed by the bisected commit)
>> works fine. "conservative" is the only broken one.
>>
>
> If you took timestamps in /arch/x86/kernel/process_**.c:
> (let's assume process_64.c) in cpu_idle()
> around enter_idle(); and __exit_idle(), took the diff,
> added the diffs up, and compared it to system uptime, you could see how
> much time you spend in idle()? I think it's possible that
> even if the cpu load is near 0%, the system may idle only for a bare
> moment (that translates to a buggy pm_idle()), and time is spent
> elsewhere (less than 80% in idle).

This makes logical sense, but how should I test this? Is there a way
to do this with existing tracers?

By the way, a revert of the bisected commit changes the cpufreq stats favorably:

2333000 31
2167000 13
2000000 6
1833000 79
1667000 239
1500000 193
1333000 959
1000000 10260
2333000 11
2167000 13
2000000 13
1833000 26
1667000 33
1500000 333
1333000 59
1000000 11293

- Steven

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-06 11:22       ` Steven Noonan
@ 2009-10-06 12:36         ` Eero Nurkkala
  2009-10-07  0:54           ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-06 12:36 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote:
> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
> <ext-eero.nurkkala@nokia.com> wrote:
> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
> >>
> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
> >> reproduce the issue (considering the title of the bisected commit and
> >> my own config). Do you have it enabled?
> >>
> >
> > Yes, it's enabled.
> >
> >> > And another round:
> >> >
> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)
> >> >
> >> > Just once more after doing nothing:
> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)
> >> >
> >> > So I can't agree it's broken. The patch you bisected, actually filtered
> >> > out such phenomenon, in which an IRQ made the cpufreq framework
> >> > occasionally think we were idling, although we were not. So you got
> >> > "bonus" idle time that shouldn't been there in the first place. Now that
> >> > the "bonus" idle time is not there, your system load may indeed be so
> >> > high that the system never spends 80% or more time in idle? Could that
> >> > be the case? Of course, even though I can't agree it's broken, doesn't
> >> > mean it isn't somehow broken ;) It'd be nice to get info on other
> >> > systems as well...
> >>
> >> Interestingly, "ondemand" (the governor fixed by the bisected commit)
> >> works fine. "conservative" is the only broken one.
> >>
> >
> > If you took timestamps in /arch/x86/kernel/process_**.c:
> > (let's assume process_64.c) in cpu_idle()
> > around enter_idle(); and __exit_idle(), took the diff,
> > added the diffs up, and compared it to system uptime, you could see how
> > much time you spend in idle()? I think it's possible that
> > even if the cpu load is near 0%, the system may idle only for a bare
> > moment (that translates to a buggy pm_idle()), and time is spent
> > elsewhere (less than 80% in idle).
> 
> This makes logical sense, but how should I test this? Is there a way
> to do this with existing tracers?

Tracers may by themselves add some load into the system.

If I were you, I'd add something like: (I have only one CPU BTW)

static ktime_t time_prior_idle;
static int64_t idle_total;

time_prior_idle = ktime_get();
<idle stuff>
idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));

and have a sysfs hook (something already present, so you can just cat
it) with a trace:

printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));

Sample output:
374758812519, 386768249832

So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
this case.

(Right, this should provide somewhat descent info, hopefully ;) )

- Eero


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-06 12:36         ` Eero Nurkkala
@ 2009-10-07  0:54           ` Steven Noonan
  2009-10-07  1:22             ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  0:54 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, Oct 6, 2009 at 5:36 AM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote:
>> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
>> <ext-eero.nurkkala@nokia.com> wrote:
>> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>> >>
>> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
>> >> reproduce the issue (considering the title of the bisected commit and
>> >> my own config). Do you have it enabled?
>> >>
>> >
>> > Yes, it's enabled.
>> >
>> >> > And another round:
>> >> >
>> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)
>> >> >
>> >> > Just once more after doing nothing:
>> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)
>> >> >
>> >> > So I can't agree it's broken. The patch you bisected, actually filtered
>> >> > out such phenomenon, in which an IRQ made the cpufreq framework
>> >> > occasionally think we were idling, although we were not. So you got
>> >> > "bonus" idle time that shouldn't been there in the first place. Now that
>> >> > the "bonus" idle time is not there, your system load may indeed be so
>> >> > high that the system never spends 80% or more time in idle? Could that
>> >> > be the case? Of course, even though I can't agree it's broken, doesn't
>> >> > mean it isn't somehow broken ;) It'd be nice to get info on other
>> >> > systems as well...
>> >>
>> >> Interestingly, "ondemand" (the governor fixed by the bisected commit)
>> >> works fine. "conservative" is the only broken one.
>> >>
>> >
>> > If you took timestamps in /arch/x86/kernel/process_**.c:
>> > (let's assume process_64.c) in cpu_idle()
>> > around enter_idle(); and __exit_idle(), took the diff,
>> > added the diffs up, and compared it to system uptime, you could see how
>> > much time you spend in idle()? I think it's possible that
>> > even if the cpu load is near 0%, the system may idle only for a bare
>> > moment (that translates to a buggy pm_idle()), and time is spent
>> > elsewhere (less than 80% in idle).
>>
>> This makes logical sense, but how should I test this? Is there a way
>> to do this with existing tracers?
>
> Tracers may by themselves add some load into the system.
>
> If I were you, I'd add something like: (I have only one CPU BTW)
>
> static ktime_t time_prior_idle;
> static int64_t idle_total;
>
> time_prior_idle = ktime_get();
> <idle stuff>
> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
>
> and have a sysfs hook (something already present, so you can just cat
> it) with a trace:
>
> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
>
> Sample output:
> 374758812519, 386768249832
>
> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
> this case.
>
> (Right, this should provide somewhat descent info, hopefully ;) )
>

Well, I tried adding the code to cpu_idle() as suggested, but it never
printed anything. Apparently cpu_idle() isn't ever being called here.
Even added a 'BUG();' at the beginning of the function and it never
hit it. Of course, I'm probably missing something obvious. Is there a
separate cpu_idle()-esque function for SMP?

- Steven

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  0:54           ` Steven Noonan
@ 2009-10-07  1:22             ` Steven Noonan
  2009-10-07  5:27               ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  1:22 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, Oct 6, 2009 at 5:54 PM, Steven Noonan <steven@uplinklabs.net> wrote:
> On Tue, Oct 6, 2009 at 5:36 AM, Eero Nurkkala
> <ext-eero.nurkkala@nokia.com> wrote:
>> On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote:
>>> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala
>>> <ext-eero.nurkkala@nokia.com> wrote:
>>> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote:
>>> >>
>>> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to
>>> >> reproduce the issue (considering the title of the bisected commit and
>>> >> my own config). Do you have it enabled?
>>> >>
>>> >
>>> > Yes, it's enabled.
>>> >
>>> >> > And another round:
>>> >> >
>>> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83%  (72)
>>> >> >
>>> >> > Just once more after doing nothing:
>>> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10%  (82)
>>> >> >
>>> >> > So I can't agree it's broken. The patch you bisected, actually filtered
>>> >> > out such phenomenon, in which an IRQ made the cpufreq framework
>>> >> > occasionally think we were idling, although we were not. So you got
>>> >> > "bonus" idle time that shouldn't been there in the first place. Now that
>>> >> > the "bonus" idle time is not there, your system load may indeed be so
>>> >> > high that the system never spends 80% or more time in idle? Could that
>>> >> > be the case? Of course, even though I can't agree it's broken, doesn't
>>> >> > mean it isn't somehow broken ;) It'd be nice to get info on other
>>> >> > systems as well...
>>> >>
>>> >> Interestingly, "ondemand" (the governor fixed by the bisected commit)
>>> >> works fine. "conservative" is the only broken one.
>>> >>
>>> >
>>> > If you took timestamps in /arch/x86/kernel/process_**.c:
>>> > (let's assume process_64.c) in cpu_idle()
>>> > around enter_idle(); and __exit_idle(), took the diff,
>>> > added the diffs up, and compared it to system uptime, you could see how
>>> > much time you spend in idle()? I think it's possible that
>>> > even if the cpu load is near 0%, the system may idle only for a bare
>>> > moment (that translates to a buggy pm_idle()), and time is spent
>>> > elsewhere (less than 80% in idle).
>>>
>>> This makes logical sense, but how should I test this? Is there a way
>>> to do this with existing tracers?
>>
>> Tracers may by themselves add some load into the system.
>>
>> If I were you, I'd add something like: (I have only one CPU BTW)
>>
>> static ktime_t time_prior_idle;
>> static int64_t idle_total;
>>
>> time_prior_idle = ktime_get();
>> <idle stuff>
>> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
>>
>> and have a sysfs hook (something already present, so you can just cat
>> it) with a trace:
>>
>> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
>>
>> Sample output:
>> 374758812519, 386768249832
>>
>> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
>> this case.
>>
>> (Right, this should provide somewhat descent info, hopefully ;) )
>>
>
> Well, I tried adding the code to cpu_idle() as suggested, but it never
> printed anything. Apparently cpu_idle() isn't ever being called here.
> Even added a 'BUG();' at the beginning of the function and it never
> hit it. Of course, I'm probably missing something obvious. Is there a
> separate cpu_idle()-esque function for SMP?
>

Oh crap. Perhaps it's more insidious. I reverted the bisected commit
and it _DID_ hit the line I added. So cpu_idle is never entered with
the bisected commit. Bizarre.

- Steven

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  1:22             ` Steven Noonan
@ 2009-10-07  5:27               ` Eero Nurkkala
  2009-10-07  7:30                 ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-07  5:27 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, 2009-10-07 at 03:22 +0200, ext Steven Noonan wrote:
> >>
> >> Tracers may by themselves add some load into the system.
> >>
> >> If I were you, I'd add something like: (I have only one CPU BTW)
> >>
> >> static ktime_t time_prior_idle;
> >> static int64_t idle_total;
> >>
> >> time_prior_idle = ktime_get();
> >> <idle stuff>
> >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
> >>
> >> and have a sysfs hook (something already present, so you can just cat
> >> it) with a trace:
> >>
> >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
> >>
> >> Sample output:
> >> 374758812519, 386768249832
> >>
> >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
> >> this case.
> >>
> >> (Right, this should provide somewhat descent info, hopefully ;) )
> >>
> >
> > Well, I tried adding the code to cpu_idle() as suggested, but it never
> > printed anything. Apparently cpu_idle() isn't ever being called here.
> > Even added a 'BUG();' at the beginning of the function and it never
> > hit it. Of course, I'm probably missing something obvious. Is there a
> > separate cpu_idle()-esque function for SMP?
> >
> 
> Oh crap. Perhaps it's more insidious. I reverted the bisected commit
> and it _DID_ hit the line I added. So cpu_idle is never entered with
> the bisected commit. Bizarre.
> 
> - Steven

So you're saying that cpu_idle() is not ever called.. The commit in
question - whether you got it or had it reverted - should, and must not
be a trigger for your system to hit cpu_idle(). What if you took a look
at init/main.c, and added traces before the first call to cpu_idle():

start_kernel() ->
rest_init() ->
cpu_idle()

With the patch not reverted?

So I guess, that an irq may actually start your "system up" (patch
reverted) somehow:

irq_exit() ->
tick_nohz_stop_sched_tick() ->
tick_nohz_start_idle() ->
sched_clock_idle_sleep_event() ?

So it looks that maybe the conservative governor isn't broken after
all ;)


(BTW I guess, with SMP, you may wish to have fix things to work in your
setup
static ktime_t time_prior_idle[2];
static int64_t idle_total[2];
and
time_prior_idle[smp_processor_id()] = ktime_get();
etc, and print out times separately).

- Eero


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  5:27               ` Eero Nurkkala
@ 2009-10-07  7:30                 ` Steven Noonan
  2009-10-07  7:49                   ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  7:30 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Tue, Oct 6, 2009 at 10:27 PM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Wed, 2009-10-07 at 03:22 +0200, ext Steven Noonan wrote:
>> >>
>> >> Tracers may by themselves add some load into the system.
>> >>
>> >> If I were you, I'd add something like: (I have only one CPU BTW)
>> >>
>> >> static ktime_t time_prior_idle;
>> >> static int64_t idle_total;
>> >>
>> >> time_prior_idle = ktime_get();
>> >> <idle stuff>
>> >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle));
>> >>
>> >> and have a sysfs hook (something already present, so you can just cat
>> >> it) with a trace:
>> >>
>> >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get()));
>> >>
>> >> Sample output:
>> >> 374758812519, 386768249832
>> >>
>> >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in
>> >> this case.
>> >>
>> >> (Right, this should provide somewhat descent info, hopefully ;) )
>> >>
>> >
>> > Well, I tried adding the code to cpu_idle() as suggested, but it never
>> > printed anything. Apparently cpu_idle() isn't ever being called here.
>> > Even added a 'BUG();' at the beginning of the function and it never
>> > hit it. Of course, I'm probably missing something obvious. Is there a
>> > separate cpu_idle()-esque function for SMP?
>> >
>>
>> Oh crap. Perhaps it's more insidious. I reverted the bisected commit
>> and it _DID_ hit the line I added. So cpu_idle is never entered with
>> the bisected commit. Bizarre.
>>
>> - Steven
>
> So you're saying that cpu_idle() is not ever called.. The commit in
> question - whether you got it or had it reverted - should, and must not
> be a trigger for your system to hit cpu_idle(). What if you took a look
> at init/main.c, and added traces before the first call to cpu_idle():
>
> start_kernel() ->
> rest_init() ->
> cpu_idle()

Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
for. I thought that cpu_idle() was a function that was periodically
called whenever the CPU had nothing to do, but now I see that it's
actually the main loop. I should really read the code next time.

I've moved the statistics printout code to the _inside_ of that
infinite loop and retested. I had it print every several hundred
iterations. Here's the results (note the machine was idle the whole
time, except for about the first 10-20 seconds while the machine
booted):

[    3.627716] timings[0]: 2250511125 / 3627716116
[    6.946216] timings[0]: 4780901366 / 6946213531
[   13.355182] timings[0]: 9385417604 / 13355183525
[   18.551304] timings[1]: 16300853077 / 18551301189
[   21.589039] timings[0]: 15984495433 / 21589037480
[   47.152733] timings[1]: 44386121538 / 47152731476
[   51.682630] timings[0]: 45713834076 / 51682628295
[   79.587359] timings[0]: 73524821916 / 79587356820
[   88.630110] timings[1]: 85324277596 / 88630109605
[   96.082386] timings[0]: 89691306072 / 96082384539


>
> With the patch not reverted?
>
> So I guess, that an irq may actually start your "system up" (patch
> reverted) somehow:
>
> irq_exit() ->
> tick_nohz_stop_sched_tick() ->
> tick_nohz_start_idle() ->
> sched_clock_idle_sleep_event() ?
>
> So it looks that maybe the conservative governor isn't broken after
> all ;)

Symptomatically, it is broken. But it's likely just that: a symptom of
a larger issue.

>
>
> (BTW I guess, with SMP, you may wish to have fix things to work in your
> setup
> static ktime_t time_prior_idle[2];
> static int64_t idle_total[2];
> and
> time_prior_idle[smp_processor_id()] = ktime_get();
> etc, and print out times separately).
>
> - Eero
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  7:30                 ` Steven Noonan
@ 2009-10-07  7:49                   ` Eero Nurkkala
  2009-10-07  8:05                     ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-07  7:49 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote:
> 
> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
> for. I thought that cpu_idle() was a function that was periodically
> called whenever the CPU had nothing to do, but now I see that it's
> actually the main loop. I should really read the code next time.
> 
> I've moved the statistics printout code to the _inside_ of that
> infinite loop and retested. I had it print every several hundred
> iterations. Here's the results (note the machine was idle the whole
> time, except for about the first 10-20 seconds while the machine
> booted):
> 
> [    3.627716] timings[0]: 2250511125 / 3627716116
> [    6.946216] timings[0]: 4780901366 / 6946213531
> [   13.355182] timings[0]: 9385417604 / 13355183525
> [   18.551304] timings[1]: 16300853077 / 18551301189
> [   21.589039] timings[0]: 15984495433 / 21589037480
> [   47.152733] timings[1]: 44386121538 / 47152731476
> [   51.682630] timings[0]: 45713834076 / 51682628295
> [   79.587359] timings[0]: 73524821916 / 79587356820
> [   88.630110] timings[1]: 85324277596 / 88630109605
> [   96.082386] timings[0]: 89691306072 / 96082384539
> 

Those look good.

Well, might as well then go for:
/drivers/cpufreq/cpufreq_conservative.c
dbs_check_cpu() ->
load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?

Let assume load is sane, look for (in dbs_check_cpu())
	if (load < (dbs_tuners_ins.down_threshold - 10)) {

whether it is taken ever...if not, what is your
(dbs_tuners_ins.down_threshold - 10) ?

- Eero


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  7:49                   ` Eero Nurkkala
@ 2009-10-07  8:05                     ` Steven Noonan
  2009-10-07  8:11                       ` Steven Noonan
  2009-10-07  8:14                       ` Eero Nurkkala
  0 siblings, 2 replies; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  8:05 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, Oct 7, 2009 at 12:49 AM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote:
>>
>> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
>> for. I thought that cpu_idle() was a function that was periodically
>> called whenever the CPU had nothing to do, but now I see that it's
>> actually the main loop. I should really read the code next time.
>>
>> I've moved the statistics printout code to the _inside_ of that
>> infinite loop and retested. I had it print every several hundred
>> iterations. Here's the results (note the machine was idle the whole
>> time, except for about the first 10-20 seconds while the machine
>> booted):
>>
>> [    3.627716] timings[0]: 2250511125 / 3627716116
>> [    6.946216] timings[0]: 4780901366 / 6946213531
>> [   13.355182] timings[0]: 9385417604 / 13355183525
>> [   18.551304] timings[1]: 16300853077 / 18551301189
>> [   21.589039] timings[0]: 15984495433 / 21589037480
>> [   47.152733] timings[1]: 44386121538 / 47152731476
>> [   51.682630] timings[0]: 45713834076 / 51682628295
>> [   79.587359] timings[0]: 73524821916 / 79587356820
>> [   88.630110] timings[1]: 85324277596 / 88630109605
>> [   96.082386] timings[0]: 89691306072 / 96082384539
>>
>
> Those look good.
>
> Well, might as well then go for:
> /drivers/cpufreq/cpufreq_conservative.c
> dbs_check_cpu() ->
> load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?

That's probably the problem...

[   40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
[   40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
[   73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
[   74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
[  107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
[  107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
[  140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
[  140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95

idle_time is wrong.

> Let assume load is sane, look for (in dbs_check_cpu())
>        if (load < (dbs_tuners_ins.down_threshold - 10)) {
>
> whether it is taken ever...if not, what is your
> (dbs_tuners_ins.down_threshold - 10) ?
>
> - Eero
>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  8:05                     ` Steven Noonan
@ 2009-10-07  8:11                       ` Steven Noonan
  2009-10-07  8:14                       ` Eero Nurkkala
  1 sibling, 0 replies; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  8:11 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, Oct 7, 2009 at 1:05 AM, Steven Noonan <steven@uplinklabs.net> wrote:
> On Wed, Oct 7, 2009 at 12:49 AM, Eero Nurkkala
> <ext-eero.nurkkala@nokia.com> wrote:
>> On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote:
>>>
>>> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be
>>> for. I thought that cpu_idle() was a function that was periodically
>>> called whenever the CPU had nothing to do, but now I see that it's
>>> actually the main loop. I should really read the code next time.
>>>
>>> I've moved the statistics printout code to the _inside_ of that
>>> infinite loop and retested. I had it print every several hundred
>>> iterations. Here's the results (note the machine was idle the whole
>>> time, except for about the first 10-20 seconds while the machine
>>> booted):
>>>
>>> [    3.627716] timings[0]: 2250511125 / 3627716116
>>> [    6.946216] timings[0]: 4780901366 / 6946213531
>>> [   13.355182] timings[0]: 9385417604 / 13355183525
>>> [   18.551304] timings[1]: 16300853077 / 18551301189
>>> [   21.589039] timings[0]: 15984495433 / 21589037480
>>> [   47.152733] timings[1]: 44386121538 / 47152731476
>>> [   51.682630] timings[0]: 45713834076 / 51682628295
>>> [   79.587359] timings[0]: 73524821916 / 79587356820
>>> [   88.630110] timings[1]: 85324277596 / 88630109605
>>> [   96.082386] timings[0]: 89691306072 / 96082384539
>>>
>>
>> Those look good.
>>
>> Well, might as well then go for:
>> /drivers/cpufreq/cpufreq_conservative.c
>> dbs_check_cpu() ->
>> load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
>
> That's probably the problem...
>
> [   40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
> [   40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
> [   73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
> [   74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
> [  107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
> [  107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
> [  140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
> [  140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
>
> idle_time is wrong.

Actually, it's more likely that the idle_time there is correct and
there's something else that's going runaway. My system's fans are
running at about 4000 RPM, when they'd normally be running at 2000 RPM
for this load average. I suspect there's something actually going
wild.

>
>> Let assume load is sane, look for (in dbs_check_cpu())
>>        if (load < (dbs_tuners_ins.down_threshold - 10)) {
>>
>> whether it is taken ever...if not, what is your
>> (dbs_tuners_ins.down_threshold - 10) ?
>>
>> - Eero
>>
>>
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  8:05                     ` Steven Noonan
  2009-10-07  8:11                       ` Steven Noonan
@ 2009-10-07  8:14                       ` Eero Nurkkala
  2009-10-07  8:24                         ` Steven Noonan
  1 sibling, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-07  8:14 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, 2009-10-07 at 10:05 +0200, ext Steven Noonan wrote:
> > Those look good.
> >
> > Well, might as well then go for:
> > /drivers/cpufreq/cpufreq_conservative.c
> > dbs_check_cpu() ->
> > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
> 
> That's probably the problem...
> 
> [   40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
> [   40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
> [   73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
> [   74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
> [  107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
> [  107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
> [  140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
> [  140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
> 
> idle_time is wrong.
> 
> > Let assume load is sane, look for (in dbs_check_cpu())
> >        if (load < (dbs_tuners_ins.down_threshold - 10)) {
> >
> > whether it is taken ever...if not, what is your
> > (dbs_tuners_ins.down_threshold - 10) ?
> >
> > - Eero
> >
> >

<------------Sidenote------------------------------>
Vankatesh, I think the following is not smp friendly:
	cpufreq_conservative.c:
	/* Get Absolute Load */
	for_each_cpu(j, policy->cpus) {
	...
	load = 100 * (wall_time - idle_time) / wall_time;

	now the load is only considered from the last cpu in line?
	in ondemand, load goes to max_load_freq...so max is taken
	from all cpus out there. (well, doesn't seem like the case with
	conservative). Of course I may be wrong, and things are like
	supposed.
</------------Sidenote------------------------------>

Steven, how do the cpu loads look like without the patch?

- Eero




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  8:14                       ` Eero Nurkkala
@ 2009-10-07  8:24                         ` Steven Noonan
  2009-10-07  8:43                           ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  8:24 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, Oct 7, 2009 at 1:14 AM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Wed, 2009-10-07 at 10:05 +0200, ext Steven Noonan wrote:
>> > Those look good.
>> >
>> > Well, might as well then go for:
>> > /drivers/cpufreq/cpufreq_conservative.c
>> > dbs_check_cpu() ->
>> > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load?
>>
>> That's probably the problem...
>>
>> [   40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95
>> [   40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95
>> [   73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80
>> [   74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98
>> [  107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80
>> [  107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95
>> [  140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95
>> [  140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95
>>
>> idle_time is wrong.
>>
>> > Let assume load is sane, look for (in dbs_check_cpu())
>> >        if (load < (dbs_tuners_ins.down_threshold - 10)) {
>> >
>> > whether it is taken ever...if not, what is your
>> > (dbs_tuners_ins.down_threshold - 10) ?
>> >
>> > - Eero
>> >
>> >
>
> <------------Sidenote------------------------------>
> Vankatesh, I think the following is not smp friendly:
>        cpufreq_conservative.c:
>        /* Get Absolute Load */
>        for_each_cpu(j, policy->cpus) {
>        ...
>        load = 100 * (wall_time - idle_time) / wall_time;
>
>        now the load is only considered from the last cpu in line?
>        in ondemand, load goes to max_load_freq...so max is taken
>        from all cpus out there. (well, doesn't seem like the case with
>        conservative). Of course I may be wrong, and things are like
>        supposed.
> </------------Sidenote------------------------------>
>
> Steven, how do the cpu loads look like without the patch?

They're sane:

[   40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
[   40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
[   73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
[   73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  8:24                         ` Steven Noonan
@ 2009-10-07  8:43                           ` Eero Nurkkala
  2009-10-07  8:52                             ` Steven Noonan
  0 siblings, 1 reply; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-07  8:43 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi

On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote:
> >
> > Steven, how do the cpu loads look like without the patch?
> 
> They're sane:
> 
> [   40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
> [   40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
> [   73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
> [   73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4

Thank you. Could you please try the following:
Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set
and all subsequent calls from irq_exit() think we weren't idling,
which is not true.

--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
 
        now = tick_nohz_start_idle(ts);
 
+       ts->inidle = 1;
+
        /*
         * If this cpu is offline and it is the one which updates
         * jiffies, then give up the assignment and let it be taken by
@@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
        if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
                goto end;
 
-       ts->inidle = 1;
-
        if (need_resched())
                goto end;
 





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  8:43                           ` Eero Nurkkala
@ 2009-10-07  8:52                             ` Steven Noonan
  2009-10-07  9:31                               ` Eero Nurkkala
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-07  8:52 UTC (permalink / raw)
  To: ext-eero.nurkkala
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi, Greg KH, Ingo Molnar

On Wed, Oct 7, 2009 at 1:43 AM, Eero Nurkkala
<ext-eero.nurkkala@nokia.com> wrote:
> On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote:
>> >
>> > Steven, how do the cpu loads look like without the patch?
>>
>> They're sane:
>>
>> [   40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
>> [   40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
>> [   73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
>> [   73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4
>
> Thank you. Could you please try the following:
> Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set
> and all subsequent calls from irq_exit() think we weren't idling,
> which is not true.
>
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
>
>        now = tick_nohz_start_idle(ts);
>
> +       ts->inidle = 1;
> +
>        /*
>         * If this cpu is offline and it is the one which updates
>         * jiffies, then give up the assignment and let it be taken by
> @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
>        if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
>                goto end;
>
> -       ts->inidle = 1;
> -
>        if (need_resched())
>                goto end;
>

Yeah, that fixed it. The load is now sane, my system isn't heating up,
and 'conservative' is now clocking my CPU down to the minimum as is
appropriate.

Can you explain why this only affected 'conservative', why it caused
my machine to heat up, etc?

Also, this fix should probably be passed on to Greg K. H. (so it goes
in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
hopefully to Linus for 2.6.32). Both CC'd.

- Steven

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  8:52                             ` Steven Noonan
@ 2009-10-07  9:31                               ` Eero Nurkkala
  2009-10-07 10:42                                 ` Thomas Gleixner
  2009-10-07 11:06                                 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala
  0 siblings, 2 replies; 23+ messages in thread
From: Eero Nurkkala @ 2009-10-07  9:31 UTC (permalink / raw)
  To: ext Steven Noonan
  Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel,
	Venkatesh Pallipadi, Greg KH, Ingo Molnar

On Wed, 2009-10-07 at 10:52 +0200, ext Steven Noonan wrote:
> On Wed, Oct 7, 2009 at 1:43 AM, Eero Nurkkala
> <ext-eero.nurkkala@nokia.com> wrote:
> > On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote:
> >> >
> >> > Steven, how do the cpu loads look like without the patch?
> >>
> >> They're sane:
> >>
> >> [   40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0
> >> [   40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0
> >> [   73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0
> >> [   73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4
> >
> > Thank you. Could you please try the following:
> > Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set
> > and all subsequent calls from irq_exit() think we weren't idling,
> > which is not true.
> >
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
> >
> >        now = tick_nohz_start_idle(ts);
> >
> > +       ts->inidle = 1;
> > +
> >        /*
> >         * If this cpu is offline and it is the one which updates
> >         * jiffies, then give up the assignment and let it be taken by
> > @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
> >        if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
> >                goto end;
> >
> > -       ts->inidle = 1;
> > -
> >        if (need_resched())
> >                goto end;
> >
> 
> Yeah, that fixed it. The load is now sane, my system isn't heating up,
> and 'conservative' is now clocking my CPU down to the minimum as is
> appropriate.
> 

For some reason, this path is taken: 
if (ts->nohz_mode == NOHZ_MODE_INACTIVE)
        goto end;

> Can you explain why this only affected 'conservative', why it caused
> my machine to heat up, etc?
> 
Maybe the loads dont look good with ondemand either? if you trace them
out. Below is a patch that has explanation on what's happening in
your system.

> Also, this fix should probably be passed on to Greg K. H. (so it goes
> in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
> hopefully to Linus for 2.6.32). Both CC'd.
> 
> - Steven

From: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Date: Wed, 7 Oct 2009 11:54:26 +0300
Subject: [PATCH] NOHZ: update idle state properly

Commit f2e21c9610991e95621a81407cdbab881226419b
had unfortunate side effects with cpufreq governors on
some systems.

If NOHZ_MODE_INACTIVE was set, ts->inidle was not being
set. Then, all subsequent calls from irq_exit() bypassed
calls to tick_nohz_start_idle() which resulted in wrong
information passed to cpufreq governors. Fix this by
updating the state of ts->inidle where it fits the best.

Reported-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
---
 kernel/time/tick-sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e0f59a2..4c1b0ac 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle)
 
 	now = tick_nohz_start_idle(ts);
 
+	ts->inidle = 1;
+
 	/*
 	 * If this cpu is offline and it is the one which updates
 	 * jiffies, then give up the assignment and let it be taken by
@@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle)
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		goto end;
 
-	ts->inidle = 1;
-
 	if (need_resched())
 		goto end;
 
-- 
1.6.0.4



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07  9:31                               ` Eero Nurkkala
@ 2009-10-07 10:42                                 ` Thomas Gleixner
  2009-10-07 11:03                                   ` Steven Noonan
  2009-10-07 11:06                                 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2009-10-07 10:42 UTC (permalink / raw)
  To: Eero Nurkkala
  Cc: ext Steven Noonan, linux-kernel@vger.kernel.org, Rik van Riel,
	Venkatesh Pallipadi, Greg KH, Ingo Molnar

On Wed, 7 Oct 2009, Eero Nurkkala wrote:
> For some reason, this path is taken: 
> if (ts->nohz_mode == NOHZ_MODE_INACTIVE)
>         goto end;

Well, probably because the machine never switched to NOHZ :)
 
> > Also, this fix should probably be passed on to Greg K. H. (so it goes
> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
> > hopefully to Linus for 2.6.32). Both CC'd.

Don't worry, it'll go the right way. No need to cc the world and some
more :)
 
> > - Steven
> 
> From: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
> Date: Wed, 7 Oct 2009 11:54:26 +0300
> Subject: [PATCH] NOHZ: update idle state properly
> 
> Commit f2e21c9610991e95621a81407cdbab881226419b
> had unfortunate side effects with cpufreq governors on
> some systems.
> 
> If NOHZ_MODE_INACTIVE was set, ts->inidle was not being
> set. Then, all subsequent calls from irq_exit() bypassed
> calls to tick_nohz_start_idle() which resulted in wrong
> information passed to cpufreq governors. Fix this by
> updating the state of ts->inidle where it fits the best.

Hmm, nasty that code silently relies on NOHZ functionality even when
NOHZ is not active. That needs a big fat comment at least.

Nice detective work though !

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07 10:42                                 ` Thomas Gleixner
@ 2009-10-07 11:03                                   ` Steven Noonan
  2009-10-07 11:15                                     ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-07 11:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Eero Nurkkala, linux-kernel@vger.kernel.org, Rik van Riel,
	Venkatesh Pallipadi, Greg KH, Ingo Molnar

On Wed, Oct 7, 2009 at 3:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 7 Oct 2009, Eero Nurkkala wrote:
>> For some reason, this path is taken:
>> if (ts->nohz_mode == NOHZ_MODE_INACTIVE)
>>         goto end;
>
> Well, probably because the machine never switched to NOHZ :)
>
>> > Also, this fix should probably be passed on to Greg K. H. (so it goes
>> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
>> > hopefully to Linus for 2.6.32). Both CC'd.
>
> Don't worry, it'll go the right way. No need to cc the world and some
> more :)
>

Sorry if it's not kosher. I was told that when dealing with a
bisection you should CC:
1. those involved with a bisected patch (the Signed-off-by people),
2. those responsible for the affected subsystem (doing an author check
via script[1]),
3. and those to whom a fix should be passed on to.

Any tips so I don't bug the wrong people (or too many people) in the future?

[1] http://www.uplinklabs.net/~tycho/linux/scripts/git-authors-email.sh
(credits to Ingo for making this one. It's handy)

>>
>> From: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
>> Date: Wed, 7 Oct 2009 11:54:26 +0300
>> Subject: [PATCH] NOHZ: update idle state properly
>>
>> Commit f2e21c9610991e95621a81407cdbab881226419b
>> had unfortunate side effects with cpufreq governors on
>> some systems.
>>
>> If NOHZ_MODE_INACTIVE was set, ts->inidle was not being
>> set. Then, all subsequent calls from irq_exit() bypassed
>> calls to tick_nohz_start_idle() which resulted in wrong
>> information passed to cpufreq governors. Fix this by
>> updating the state of ts->inidle where it fits the best.
>
> Hmm, nasty that code silently relies on NOHZ functionality even when
> NOHZ is not active. That needs a big fat comment at least.
>
> Nice detective work though !
>
> Thanks,
>
>        tglx
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [BISECTED] "conservative" cpufreq governor broken
  2009-10-07 11:03                                   ` Steven Noonan
@ 2009-10-07 11:15                                     ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2009-10-07 11:15 UTC (permalink / raw)
  To: Steven Noonan
  Cc: Eero Nurkkala, linux-kernel@vger.kernel.org, Rik van Riel,
	Venkatesh Pallipadi, Greg KH, Ingo Molnar

On Wed, 7 Oct 2009, Steven Noonan wrote:
> > Well, probably because the machine never switched to NOHZ :)
> >
> >> > Also, this fix should probably be passed on to Greg K. H. (so it goes
> >> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and
> >> > hopefully to Linus for 2.6.32). Both CC'd.
> >
> > Don't worry, it'll go the right way. No need to cc the world and some
> > more :)
> >
> 
> Sorry if it's not kosher. I was told that when dealing with a
> bisection you should CC:
> 1. those involved with a bisected patch (the Signed-off-by people),
> 2. those responsible for the affected subsystem (doing an author check
> via script[1]),
> 3. and those to whom a fix should be passed on to.
> 
> Any tips so I don't bug the wrong people (or too many people) in the future?

No worry. It's just Linus and Greg who got some extra mail into their
inbox. The flow is that the patch is picked up by the maintainer and
pushed into his git tree which is then offered to Linus for pulling it
into mainline. We add a Cc: stable@kernel.org tag to the commit, which
triggers a notifier to the stable kernel team when the patch hits
Linus tree.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive
  2009-10-07  9:31                               ` Eero Nurkkala
  2009-10-07 10:42                                 ` Thomas Gleixner
@ 2009-10-07 11:06                                 ` tip-bot for Eero Nurkkala
  2009-10-07 13:35                                   ` Rik van Riel
  1 sibling, 1 reply; 23+ messages in thread
From: tip-bot for Eero Nurkkala @ 2009-10-07 11:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, greg, venkatesh.pallipadi, riel, steven,
	ext-eero.nurkkala, tglx

Commit-ID:  fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
Gitweb:     http://git.kernel.org/tip/fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
Author:     Eero Nurkkala <ext-eero.nurkkala@nokia.com>
AuthorDate: Wed, 7 Oct 2009 11:54:26 +0300
Committer:  Thomas Gleixner <tglx@linutronix.de>
CommitDate: Wed, 7 Oct 2009 13:05:05 +0200

NOHZ: update idle state also when NOHZ is inactive

Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side
effects with cpufreq governors on some systems.

If the system did not switch into NOHZ mode ts->inidle is not set when
tick_nohz_stop_sched_tick() is called from the idle routine. Therefor
all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
fail to call tick_nohz_start_idle(). This results in bogus idle
accounting information which is passed to cpufreq governors.

Set the inidle flag unconditionally of the NOHZ active state to keep
the idle time accounting correct in any case.

[ tglx: Added comment and tweaked the changelog ]

Reported-by: Steven Noonan <steven@uplinklabs.net>
Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Greg KH <greg@kroah.com>
Cc: Steven Noonan <steven@uplinklabs.net>
Cc: stable@kernel.org
LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/time/tick-sched.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e0f59a2..89aed59 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -231,6 +231,13 @@ void tick_nohz_stop_sched_tick(int inidle)
 	if (!inidle && !ts->inidle)
 		goto end;
 
+	/*
+	 * Set ts->inidle unconditionally. Even if the system did not
+	 * switch to NOHZ mode the cpu frequency governers rely on the
+	 * update of the idle time accounting in tick_nohz_start_idle().
+	 */
+	ts->inidle = 1;
+
 	now = tick_nohz_start_idle(ts);
 
 	/*
@@ -248,8 +255,6 @@ void tick_nohz_stop_sched_tick(int inidle)
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		goto end;
 
-	ts->inidle = 1;
-
 	if (need_resched())
 		goto end;
 

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive
  2009-10-07 11:06                                 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala
@ 2009-10-07 13:35                                   ` Rik van Riel
  0 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2009-10-07 13:35 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, venkatesh.pallipadi, greg, riel, steven,
	ext-eero.nurkkala, tglx
  Cc: linux-tip-commits

tip-bot for Eero Nurkkala wrote:
> Commit-ID:  fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
> Gitweb:     http://git.kernel.org/tip/fdc6f192e7e1ae80565af23cc33dc88e3dcdf184
> Author:     Eero Nurkkala <ext-eero.nurkkala@nokia.com>
> AuthorDate: Wed, 7 Oct 2009 11:54:26 +0300
> Committer:  Thomas Gleixner <tglx@linutronix.de>
> CommitDate: Wed, 7 Oct 2009 13:05:05 +0200
> 
> NOHZ: update idle state also when NOHZ is inactive
> 
> Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side
> effects with cpufreq governors on some systems.
> 
> If the system did not switch into NOHZ mode ts->inidle is not set when
> tick_nohz_stop_sched_tick() is called from the idle routine. Therefor
> all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
> fail to call tick_nohz_start_idle(). This results in bogus idle
> accounting information which is passed to cpufreq governors.
> 
> Set the inidle flag unconditionally of the NOHZ active state to keep
> the idle time accounting correct in any case.

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2009-10-07 13:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-10-05 16:32 [BISECTED] "conservative" cpufreq governor broken Steven Noonan
2009-10-06  7:31 ` Eero Nurkkala
2009-10-06 10:22   ` Steven Noonan
2009-10-06 10:43     ` Eero Nurkkala
2009-10-06 11:22       ` Steven Noonan
2009-10-06 12:36         ` Eero Nurkkala
2009-10-07  0:54           ` Steven Noonan
2009-10-07  1:22             ` Steven Noonan
2009-10-07  5:27               ` Eero Nurkkala
2009-10-07  7:30                 ` Steven Noonan
2009-10-07  7:49                   ` Eero Nurkkala
2009-10-07  8:05                     ` Steven Noonan
2009-10-07  8:11                       ` Steven Noonan
2009-10-07  8:14                       ` Eero Nurkkala
2009-10-07  8:24                         ` Steven Noonan
2009-10-07  8:43                           ` Eero Nurkkala
2009-10-07  8:52                             ` Steven Noonan
2009-10-07  9:31                               ` Eero Nurkkala
2009-10-07 10:42                                 ` Thomas Gleixner
2009-10-07 11:03                                   ` Steven Noonan
2009-10-07 11:15                                     ` Thomas Gleixner
2009-10-07 11:06                                 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala
2009-10-07 13:35                                   ` Rik van Riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.