* [BISECTED] "conservative" cpufreq governor broken
@ 2009-10-05 16:32 Steven Noonan
2009-10-06 7:31 ` Eero Nurkkala
0 siblings, 1 reply; 23+ messages in thread
From: Steven Noonan @ 2009-10-05 16:32 UTC (permalink / raw)
To: linux-kernel
Cc: Steven Noonan, Eero Nurkkala, Thomas Gleixner, Rik van Riel,
Venkatesh Pallipadi
I noticed on my machine that the "conservative" cpufreq governor wasn't
working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on
v2.6.30.8, so I decided I should figure out where this issue was coming
from. The issue is pretty clear...
Here's the expected:
cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 0
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 1000 MHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63% (33)
analyzing CPU 1:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 1
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 1000 MHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27% (7)
And here is the broken version (note the 'cpufreq stats' line):
cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to cpufreq@vger.kernel.org, please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 0
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 2.33 GHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
analyzing CPU 1:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 1
hardware limits: 1000 MHz - 2.33 GHz
available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz
available cpufreq governors: ondemand, userspace, powersave, conservative, performance
current policy: frequency should be within 1000 MHz and 2.33 GHz.
The governor "conservative" may decide which speed to use
within this range.
current CPU frequency is 2.33 GHz (asserted by call to hardware).
cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00%
So basically, it just never clocks down from the maximum frequency.
Here's the bisection log:
# bad: [2147b209] Linux 2.6.31.1
# good: [a1c4c06a] Linux 2.6.30.8
# good: [07a2039b] Linux 2.6.30
# good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION
# bad: [906e8d97] e1000e: delay second read of PHY_STATUS register o
# good: [36e84467] Staging: heci: fix userspace pointer mess
# bad: [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o
# skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git
# good: [48c93112] powerpc: Fix invalid construct in our CPU selectio
# bad: [eca41044] n_r3964: fix lock imbalance
# good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p
# bad: [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git.
# good: [1d991001] Merge branch 'x86/mce3' into x86/urgent
# good: [71e308a2] function-graph: add stack frame test
# bad: [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git
# good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi
# good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li
# good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke
# bad: [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor
And finally, the commit that broke "conservative":
commit f2e21c9610991e95621a81407cdbab881226419b
Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Date: Mon May 25 09:57:37 2009 +0300
NOHZ: Properly feed cpufreq ondemand governor
A call from irq_exit() may occasionally pause the timing
info for cpufreq ondemand governor. This results in the
cpufreq ondemand governor to fail to calculate the
system load properly. Thus, relocate the checks for this
particular case to keep the governor always functional.
Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com>
Reported-by: Tero Kristo <tero.kristo@nokia.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I'd work on fixing it myself and whip up a patch, but I'm going to be gone
all day and I'm not too familiar with cpufreq anyway.
- Steven
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-05 16:32 [BISECTED] "conservative" cpufreq governor broken Steven Noonan @ 2009-10-06 7:31 ` Eero Nurkkala 2009-10-06 10:22 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-06 7:31 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Mon, 2009-10-05 at 18:32 +0200, ext Steven Noonan wrote: > I noticed on my machine that the "conservative" cpufreq governor wasn't > working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on > v2.6.30.8, so I decided I should figure out where this issue was coming > from. The issue is pretty clear... > I had some troubles with cpufreq-info as all values in "cpufreq stats" were being as 0,00% (I fixed it by replacing unsigned long longs with unsigned longs, and recompiled) If this shows still insane values: cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state I guess your system is indeed broken. However. I get: (cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state) (OP1 == highest Frequency) OP1 7148 OP2 242 OP3 2307 OP4 43145 And another round: cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) Just once more after doing nothing: OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) So I can't agree it's broken. The patch you bisected, actually filtered out such phenomenon, in which an IRQ made the cpufreq framework occasionally think we were idling, although we were not. So you got "bonus" idle time that shouldn't been there in the first place. Now that the "bonus" idle time is not there, your system load may indeed be so high that the system never spends 80% or more time in idle? Could that be the case? Of course, even though I can't agree it's broken, doesn't mean it isn't somehow broken ;) It'd be nice to get info on other systems as well... > > Here's the expected: > > cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 > Report errors and bugs to cpufreq@vger.kernel.org, please. > analyzing CPU 0: > driver: acpi-cpufreq > CPUs which need to switch frequency at the same time: 0 > hardware limits: 1000 MHz - 2.33 GHz > available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz > available cpufreq governors: ondemand, userspace, powersave, conservative, performance > current policy: frequency should be within 1000 MHz and 2.33 GHz. > The governor "conservative" may decide which speed to use > within this range. > current CPU frequency is 1000 MHz (asserted by call to hardware). > cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63% (33) > analyzing CPU 1: > driver: acpi-cpufreq > CPUs which need to switch frequency at the same time: 1 > hardware limits: 1000 MHz - 2.33 GHz > available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz > available cpufreq governors: ondemand, userspace, powersave, conservative, performance > current policy: frequency should be within 1000 MHz and 2.33 GHz. > The governor "conservative" may decide which speed to use > within this range. > current CPU frequency is 1000 MHz (asserted by call to hardware). > cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27% (7) > > > > And here is the broken version (note the 'cpufreq stats' line): > > cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 > Report errors and bugs to cpufreq@vger.kernel.org, please. > analyzing CPU 0: > driver: acpi-cpufreq > CPUs which need to switch frequency at the same time: 0 > hardware limits: 1000 MHz - 2.33 GHz > available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz > available cpufreq governors: ondemand, userspace, powersave, conservative, performance > current policy: frequency should be within 1000 MHz and 2.33 GHz. > The governor "conservative" may decide which speed to use > within this range. > current CPU frequency is 2.33 GHz (asserted by call to hardware). > cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00% > analyzing CPU 1: > driver: acpi-cpufreq > CPUs which need to switch frequency at the same time: 1 > hardware limits: 1000 MHz - 2.33 GHz > available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz > available cpufreq governors: ondemand, userspace, powersave, conservative, performance > current policy: frequency should be within 1000 MHz and 2.33 GHz. > The governor "conservative" may decide which speed to use > within this range. > current CPU frequency is 2.33 GHz (asserted by call to hardware). > cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00% > > > So basically, it just never clocks down from the maximum frequency. > > > Here's the bisection log: > > # bad: [2147b209] Linux 2.6.31.1 > # good: [a1c4c06a] Linux 2.6.30.8 > # good: [07a2039b] Linux 2.6.30 > # good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION > # bad: [906e8d97] e1000e: delay second read of PHY_STATUS register o > # good: [36e84467] Staging: heci: fix userspace pointer mess > # bad: [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o > # skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git > # good: [48c93112] powerpc: Fix invalid construct in our CPU selectio > # bad: [eca41044] n_r3964: fix lock imbalance > # good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p > # bad: [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git. > # good: [1d991001] Merge branch 'x86/mce3' into x86/urgent > # good: [71e308a2] function-graph: add stack frame test > # bad: [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git > # good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi > # good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li > # good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke > # bad: [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor > > > And finally, the commit that broke "conservative": > > commit f2e21c9610991e95621a81407cdbab881226419b > Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com> > Date: Mon May 25 09:57:37 2009 +0300 > > NOHZ: Properly feed cpufreq ondemand governor > > A call from irq_exit() may occasionally pause the timing > info for cpufreq ondemand governor. This results in the > cpufreq ondemand governor to fail to calculate the > system load properly. Thus, relocate the checks for this > particular case to keep the governor always functional. > > Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> > Reported-by: Tero Kristo <tero.kristo@nokia.com> > Acked-by: Rik van Riel <riel@redhat.com> > Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > > > I'd work on fixing it myself and whip up a patch, but I'm going to be gone > all day and I'm not too familiar with cpufreq anyway. > > - Steven ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-06 7:31 ` Eero Nurkkala @ 2009-10-06 10:22 ` Steven Noonan 2009-10-06 10:43 ` Eero Nurkkala 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-06 10:22 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, Oct 6, 2009 at 12:31 AM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Mon, 2009-10-05 at 18:32 +0200, ext Steven Noonan wrote: >> I noticed on my machine that the "conservative" cpufreq governor wasn't >> working properly in v2.6.31.1 or Linus' latest tree, but it worked fine on >> v2.6.30.8, so I decided I should figure out where this issue was coming >> from. The issue is pretty clear... >> > > I had some troubles with cpufreq-info as all values in "cpufreq stats" > were being as 0,00% (I fixed it by replacing unsigned long longs with > unsigned longs, and recompiled) That doesn't make much sense for my case. I used the same build of cpufreq-info through the whole bisection and the problem was very consistently reproducible throughout. And the bisected commit is indeed related to cpufreq. Regardless, cpufreq-info wasn't the only reason that I thought something was wrong. My machine was unusually warm, fans were going pretty much constantly, and the Gnome applet for CPU frequency monitoring constantly showed it was using 2.33GHz and never scaling down. I checked system load, which stayed at zero. Multiple different sources (top, perf top, etc) seemed to indicate there was no justification for the "conservative" governor's behaviour. > If this shows still insane values: > cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state > I guess your system is indeed broken. Here's the output: 2333000 6880 2167000 0 2000000 0 1833000 0 1667000 0 1500000 0 1333000 0 1000000 0 2333000 6880 2167000 0 2000000 0 1833000 0 1667000 0 1500000 0 1333000 0 1000000 0 Pretty broken, alright. > However. I get: > (cat /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state) > > (OP1 == highest Frequency) > OP1 7148 > OP2 242 > OP3 2307 > OP4 43145 I would suspect you have to have CONFIG_NO_HZ enabled to be able to reproduce the issue (considering the title of the bisected commit and my own config). Do you have it enabled? > And another round: > > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) > > Just once more after doing nothing: > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) > > So I can't agree it's broken. The patch you bisected, actually filtered > out such phenomenon, in which an IRQ made the cpufreq framework > occasionally think we were idling, although we were not. So you got > "bonus" idle time that shouldn't been there in the first place. Now that > the "bonus" idle time is not there, your system load may indeed be so > high that the system never spends 80% or more time in idle? Could that > be the case? Of course, even though I can't agree it's broken, doesn't > mean it isn't somehow broken ;) It'd be nice to get info on other > systems as well... Interestingly, "ondemand" (the governor fixed by the bisected commit) works fine. "conservative" is the only broken one. >> >> Here's the expected: >> >> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 >> Report errors and bugs to cpufreq@vger.kernel.org, please. >> analyzing CPU 0: >> driver: acpi-cpufreq >> CPUs which need to switch frequency at the same time: 0 >> hardware limits: 1000 MHz - 2.33 GHz >> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz >> available cpufreq governors: ondemand, userspace, powersave, conservative, performance >> current policy: frequency should be within 1000 MHz and 2.33 GHz. >> The governor "conservative" may decide which speed to use >> within this range. >> current CPU frequency is 1000 MHz (asserted by call to hardware). >> cpufreq stats: 2.33 GHz:0.59%, 2.17 GHz:1.41%, 2.00 GHz:0.88%, 1.83 GHz:1.22%, 1.67 GHz:0.88%, 1.50 GHz:1.41%, 1.33 GHz:10.98%, 1000 MHz:82.63% (33) >> analyzing CPU 1: >> driver: acpi-cpufreq >> CPUs which need to switch frequency at the same time: 1 >> hardware limits: 1000 MHz - 2.33 GHz >> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz >> available cpufreq governors: ondemand, userspace, powersave, conservative, performance >> current policy: frequency should be within 1000 MHz and 2.33 GHz. >> The governor "conservative" may decide which speed to use >> within this range. >> current CPU frequency is 1000 MHz (asserted by call to hardware). >> cpufreq stats: 2.33 GHz:0.40%, 2.17 GHz:0.16%, 2.00 GHz:0.16%, 1.83 GHz:0.35%, 1.67 GHz:0.16%, 1.50 GHz:0.35%, 1.33 GHz:0.16%, 1000 MHz:98.27% (7) >> >> >> >> And here is the broken version (note the 'cpufreq stats' line): >> >> cpufrequtils 005: cpufreq-info (C) Dominik Brodowski 2004-2006 >> Report errors and bugs to cpufreq@vger.kernel.org, please. >> analyzing CPU 0: >> driver: acpi-cpufreq >> CPUs which need to switch frequency at the same time: 0 >> hardware limits: 1000 MHz - 2.33 GHz >> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz >> available cpufreq governors: ondemand, userspace, powersave, conservative, performance >> current policy: frequency should be within 1000 MHz and 2.33 GHz. >> The governor "conservative" may decide which speed to use >> within this range. >> current CPU frequency is 2.33 GHz (asserted by call to hardware). >> cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00% >> analyzing CPU 1: >> driver: acpi-cpufreq >> CPUs which need to switch frequency at the same time: 1 >> hardware limits: 1000 MHz - 2.33 GHz >> available frequency steps: 2.33 GHz, 2.17 GHz, 2.00 GHz, 1.83 GHz, 1.67 GHz, 1.50 GHz, 1.33 GHz, 1000 MHz >> available cpufreq governors: ondemand, userspace, powersave, conservative, performance >> current policy: frequency should be within 1000 MHz and 2.33 GHz. >> The governor "conservative" may decide which speed to use >> within this range. >> current CPU frequency is 2.33 GHz (asserted by call to hardware). >> cpufreq stats: 2.33 GHz:100.00%, 2.17 GHz:0.00%, 2.00 GHz:0.00%, 1.83 GHz:0.00%, 1.67 GHz:0.00%, 1.50 GHz:0.00%, 1.33 GHz:0.00%, 1000 MHz:0.00% >> >> >> So basically, it just never clocks down from the maximum frequency. >> >> >> Here's the bisection log: >> >> # bad: [2147b209] Linux 2.6.31.1 >> # good: [a1c4c06a] Linux 2.6.30.8 >> # good: [07a2039b] Linux 2.6.30 >> # good: [452dac45] V4L/DVB (11761): dvb-ttpci: Fixed VIDEO_SLOWMOTION >> # bad: [906e8d97] e1000e: delay second read of PHY_STATUS register o >> # good: [36e84467] Staging: heci: fix userspace pointer mess >> # bad: [df36b439] Merge branch 'for-2.6.31' of git://git.linux-nfs.o >> # skip: [12e24f34] Merge branch 'perfcounters-fixes-for-linus' of git >> # good: [48c93112] powerpc: Fix invalid construct in our CPU selectio >> # bad: [eca41044] n_r3964: fix lock imbalance >> # good: [93db6294] Merge branch 'for-linus' of git://git.kernel.org/p >> # bad: [1eb51c33] Merge branch 'sched-fixes-for-linus' of git://git. >> # good: [1d991001] Merge branch 'x86/mce3' into x86/urgent >> # good: [71e308a2] function-graph: add stack frame test >> # bad: [38df92b8] Merge branch 'timers-fixes-for-linus' of git://git >> # good: [ad5cf46b] Merge git://git.kernel.org/pub/scm/linux/kernel/gi >> # good: [7fd5b632] Merge branch 'for-linus' of git://git.monstr.eu/li >> # good: [c4c5ab30] Merge branch 'x86-fixes-for-linus' of git://git.ke >> # bad: [f2e21c96] NOHZ: Properly feed cpufreq ondemand governor >> >> >> And finally, the commit that broke "conservative": >> >> commit f2e21c9610991e95621a81407cdbab881226419b >> Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com> >> Date: Mon May 25 09:57:37 2009 +0300 >> >> NOHZ: Properly feed cpufreq ondemand governor >> >> A call from irq_exit() may occasionally pause the timing >> info for cpufreq ondemand governor. This results in the >> cpufreq ondemand governor to fail to calculate the >> system load properly. Thus, relocate the checks for this >> particular case to keep the governor always functional. >> >> Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> >> Reported-by: Tero Kristo <tero.kristo@nokia.com> >> Acked-by: Rik van Riel <riel@redhat.com> >> Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> >> >> >> I'd work on fixing it myself and whip up a patch, but I'm going to be gone >> all day and I'm not too familiar with cpufreq anyway. >> >> - Steven > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-06 10:22 ` Steven Noonan @ 2009-10-06 10:43 ` Eero Nurkkala 2009-10-06 11:22 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-06 10:43 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote: > > I would suspect you have to have CONFIG_NO_HZ enabled to be able to > reproduce the issue (considering the title of the bisected commit and > my own config). Do you have it enabled? > Yes, it's enabled. > > And another round: > > > > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) > > > > Just once more after doing nothing: > > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) > > > > So I can't agree it's broken. The patch you bisected, actually filtered > > out such phenomenon, in which an IRQ made the cpufreq framework > > occasionally think we were idling, although we were not. So you got > > "bonus" idle time that shouldn't been there in the first place. Now that > > the "bonus" idle time is not there, your system load may indeed be so > > high that the system never spends 80% or more time in idle? Could that > > be the case? Of course, even though I can't agree it's broken, doesn't > > mean it isn't somehow broken ;) It'd be nice to get info on other > > systems as well... > > Interestingly, "ondemand" (the governor fixed by the bisected commit) > works fine. "conservative" is the only broken one. > If you took timestamps in /arch/x86/kernel/process_**.c: (let's assume process_64.c) in cpu_idle() around enter_idle(); and __exit_idle(), took the diff, added the diffs up, and compared it to system uptime, you could see how much time you spend in idle()? I think it's possible that even if the cpu load is near 0%, the system may idle only for a bare moment (that translates to a buggy pm_idle()), and time is spent elsewhere (less than 80% in idle). If I had similar findings like you, I'd be worried; but traces here and there will point out where things go wrong ;) - Eero ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-06 10:43 ` Eero Nurkkala @ 2009-10-06 11:22 ` Steven Noonan 2009-10-06 12:36 ` Eero Nurkkala 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-06 11:22 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote: >> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to >> reproduce the issue (considering the title of the bisected commit and >> my own config). Do you have it enabled? >> > > Yes, it's enabled. > >> > And another round: >> > >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) >> > >> > Just once more after doing nothing: >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) >> > >> > So I can't agree it's broken. The patch you bisected, actually filtered >> > out such phenomenon, in which an IRQ made the cpufreq framework >> > occasionally think we were idling, although we were not. So you got >> > "bonus" idle time that shouldn't been there in the first place. Now that >> > the "bonus" idle time is not there, your system load may indeed be so >> > high that the system never spends 80% or more time in idle? Could that >> > be the case? Of course, even though I can't agree it's broken, doesn't >> > mean it isn't somehow broken ;) It'd be nice to get info on other >> > systems as well... >> >> Interestingly, "ondemand" (the governor fixed by the bisected commit) >> works fine. "conservative" is the only broken one. >> > > If you took timestamps in /arch/x86/kernel/process_**.c: > (let's assume process_64.c) in cpu_idle() > around enter_idle(); and __exit_idle(), took the diff, > added the diffs up, and compared it to system uptime, you could see how > much time you spend in idle()? I think it's possible that > even if the cpu load is near 0%, the system may idle only for a bare > moment (that translates to a buggy pm_idle()), and time is spent > elsewhere (less than 80% in idle). This makes logical sense, but how should I test this? Is there a way to do this with existing tracers? By the way, a revert of the bisected commit changes the cpufreq stats favorably: 2333000 31 2167000 13 2000000 6 1833000 79 1667000 239 1500000 193 1333000 959 1000000 10260 2333000 11 2167000 13 2000000 13 1833000 26 1667000 33 1500000 333 1333000 59 1000000 11293 - Steven ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-06 11:22 ` Steven Noonan @ 2009-10-06 12:36 ` Eero Nurkkala 2009-10-07 0:54 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-06 12:36 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote: > On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala > <ext-eero.nurkkala@nokia.com> wrote: > > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote: > >> > >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to > >> reproduce the issue (considering the title of the bisected commit and > >> my own config). Do you have it enabled? > >> > > > > Yes, it's enabled. > > > >> > And another round: > >> > > >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) > >> > > >> > Just once more after doing nothing: > >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) > >> > > >> > So I can't agree it's broken. The patch you bisected, actually filtered > >> > out such phenomenon, in which an IRQ made the cpufreq framework > >> > occasionally think we were idling, although we were not. So you got > >> > "bonus" idle time that shouldn't been there in the first place. Now that > >> > the "bonus" idle time is not there, your system load may indeed be so > >> > high that the system never spends 80% or more time in idle? Could that > >> > be the case? Of course, even though I can't agree it's broken, doesn't > >> > mean it isn't somehow broken ;) It'd be nice to get info on other > >> > systems as well... > >> > >> Interestingly, "ondemand" (the governor fixed by the bisected commit) > >> works fine. "conservative" is the only broken one. > >> > > > > If you took timestamps in /arch/x86/kernel/process_**.c: > > (let's assume process_64.c) in cpu_idle() > > around enter_idle(); and __exit_idle(), took the diff, > > added the diffs up, and compared it to system uptime, you could see how > > much time you spend in idle()? I think it's possible that > > even if the cpu load is near 0%, the system may idle only for a bare > > moment (that translates to a buggy pm_idle()), and time is spent > > elsewhere (less than 80% in idle). > > This makes logical sense, but how should I test this? Is there a way > to do this with existing tracers? Tracers may by themselves add some load into the system. If I were you, I'd add something like: (I have only one CPU BTW) static ktime_t time_prior_idle; static int64_t idle_total; time_prior_idle = ktime_get(); <idle stuff> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle)); and have a sysfs hook (something already present, so you can just cat it) with a trace: printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get())); Sample output: 374758812519, 386768249832 So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in this case. (Right, this should provide somewhat descent info, hopefully ;) ) - Eero ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-06 12:36 ` Eero Nurkkala @ 2009-10-07 0:54 ` Steven Noonan 2009-10-07 1:22 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-07 0:54 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, Oct 6, 2009 at 5:36 AM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote: >> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala >> <ext-eero.nurkkala@nokia.com> wrote: >> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote: >> >> >> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to >> >> reproduce the issue (considering the title of the bisected commit and >> >> my own config). Do you have it enabled? >> >> >> > >> > Yes, it's enabled. >> > >> >> > And another round: >> >> > >> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) >> >> > >> >> > Just once more after doing nothing: >> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) >> >> > >> >> > So I can't agree it's broken. The patch you bisected, actually filtered >> >> > out such phenomenon, in which an IRQ made the cpufreq framework >> >> > occasionally think we were idling, although we were not. So you got >> >> > "bonus" idle time that shouldn't been there in the first place. Now that >> >> > the "bonus" idle time is not there, your system load may indeed be so >> >> > high that the system never spends 80% or more time in idle? Could that >> >> > be the case? Of course, even though I can't agree it's broken, doesn't >> >> > mean it isn't somehow broken ;) It'd be nice to get info on other >> >> > systems as well... >> >> >> >> Interestingly, "ondemand" (the governor fixed by the bisected commit) >> >> works fine. "conservative" is the only broken one. >> >> >> > >> > If you took timestamps in /arch/x86/kernel/process_**.c: >> > (let's assume process_64.c) in cpu_idle() >> > around enter_idle(); and __exit_idle(), took the diff, >> > added the diffs up, and compared it to system uptime, you could see how >> > much time you spend in idle()? I think it's possible that >> > even if the cpu load is near 0%, the system may idle only for a bare >> > moment (that translates to a buggy pm_idle()), and time is spent >> > elsewhere (less than 80% in idle). >> >> This makes logical sense, but how should I test this? Is there a way >> to do this with existing tracers? > > Tracers may by themselves add some load into the system. > > If I were you, I'd add something like: (I have only one CPU BTW) > > static ktime_t time_prior_idle; > static int64_t idle_total; > > time_prior_idle = ktime_get(); > <idle stuff> > idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle)); > > and have a sysfs hook (something already present, so you can just cat > it) with a trace: > > printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get())); > > Sample output: > 374758812519, 386768249832 > > So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in > this case. > > (Right, this should provide somewhat descent info, hopefully ;) ) > Well, I tried adding the code to cpu_idle() as suggested, but it never printed anything. Apparently cpu_idle() isn't ever being called here. Even added a 'BUG();' at the beginning of the function and it never hit it. Of course, I'm probably missing something obvious. Is there a separate cpu_idle()-esque function for SMP? - Steven ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 0:54 ` Steven Noonan @ 2009-10-07 1:22 ` Steven Noonan 2009-10-07 5:27 ` Eero Nurkkala 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-07 1:22 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, Oct 6, 2009 at 5:54 PM, Steven Noonan <steven@uplinklabs.net> wrote: > On Tue, Oct 6, 2009 at 5:36 AM, Eero Nurkkala > <ext-eero.nurkkala@nokia.com> wrote: >> On Tue, 2009-10-06 at 13:22 +0200, ext Steven Noonan wrote: >>> On Tue, Oct 6, 2009 at 3:43 AM, Eero Nurkkala >>> <ext-eero.nurkkala@nokia.com> wrote: >>> > On Tue, 2009-10-06 at 12:22 +0200, ext Steven Noonan wrote: >>> >> >>> >> I would suspect you have to have CONFIG_NO_HZ enabled to be able to >>> >> reproduce the issue (considering the title of the bisected commit and >>> >> my own config). Do you have it enabled? >>> >> >>> > >>> > Yes, it's enabled. >>> > >>> >> > And another round: >>> >> > >>> >> > cpufreq stats: OP1:16,78%, OP2:0,24%, OP3:5,14%, OP4:77,83% (72) >>> >> > >>> >> > Just once more after doing nothing: >>> >> > OP1:7,41%, OP2:0,11%, OP3:2,38%, OP4:90,10% (82) >>> >> > >>> >> > So I can't agree it's broken. The patch you bisected, actually filtered >>> >> > out such phenomenon, in which an IRQ made the cpufreq framework >>> >> > occasionally think we were idling, although we were not. So you got >>> >> > "bonus" idle time that shouldn't been there in the first place. Now that >>> >> > the "bonus" idle time is not there, your system load may indeed be so >>> >> > high that the system never spends 80% or more time in idle? Could that >>> >> > be the case? Of course, even though I can't agree it's broken, doesn't >>> >> > mean it isn't somehow broken ;) It'd be nice to get info on other >>> >> > systems as well... >>> >> >>> >> Interestingly, "ondemand" (the governor fixed by the bisected commit) >>> >> works fine. "conservative" is the only broken one. >>> >> >>> > >>> > If you took timestamps in /arch/x86/kernel/process_**.c: >>> > (let's assume process_64.c) in cpu_idle() >>> > around enter_idle(); and __exit_idle(), took the diff, >>> > added the diffs up, and compared it to system uptime, you could see how >>> > much time you spend in idle()? I think it's possible that >>> > even if the cpu load is near 0%, the system may idle only for a bare >>> > moment (that translates to a buggy pm_idle()), and time is spent >>> > elsewhere (less than 80% in idle). >>> >>> This makes logical sense, but how should I test this? Is there a way >>> to do this with existing tracers? >> >> Tracers may by themselves add some load into the system. >> >> If I were you, I'd add something like: (I have only one CPU BTW) >> >> static ktime_t time_prior_idle; >> static int64_t idle_total; >> >> time_prior_idle = ktime_get(); >> <idle stuff> >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle)); >> >> and have a sysfs hook (something already present, so you can just cat >> it) with a trace: >> >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get())); >> >> Sample output: >> 374758812519, 386768249832 >> >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in >> this case. >> >> (Right, this should provide somewhat descent info, hopefully ;) ) >> > > Well, I tried adding the code to cpu_idle() as suggested, but it never > printed anything. Apparently cpu_idle() isn't ever being called here. > Even added a 'BUG();' at the beginning of the function and it never > hit it. Of course, I'm probably missing something obvious. Is there a > separate cpu_idle()-esque function for SMP? > Oh crap. Perhaps it's more insidious. I reverted the bisected commit and it _DID_ hit the line I added. So cpu_idle is never entered with the bisected commit. Bizarre. - Steven ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 1:22 ` Steven Noonan @ 2009-10-07 5:27 ` Eero Nurkkala 2009-10-07 7:30 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-07 5:27 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, 2009-10-07 at 03:22 +0200, ext Steven Noonan wrote: > >> > >> Tracers may by themselves add some load into the system. > >> > >> If I were you, I'd add something like: (I have only one CPU BTW) > >> > >> static ktime_t time_prior_idle; > >> static int64_t idle_total; > >> > >> time_prior_idle = ktime_get(); > >> <idle stuff> > >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle)); > >> > >> and have a sysfs hook (something already present, so you can just cat > >> it) with a trace: > >> > >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get())); > >> > >> Sample output: > >> 374758812519, 386768249832 > >> > >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in > >> this case. > >> > >> (Right, this should provide somewhat descent info, hopefully ;) ) > >> > > > > Well, I tried adding the code to cpu_idle() as suggested, but it never > > printed anything. Apparently cpu_idle() isn't ever being called here. > > Even added a 'BUG();' at the beginning of the function and it never > > hit it. Of course, I'm probably missing something obvious. Is there a > > separate cpu_idle()-esque function for SMP? > > > > Oh crap. Perhaps it's more insidious. I reverted the bisected commit > and it _DID_ hit the line I added. So cpu_idle is never entered with > the bisected commit. Bizarre. > > - Steven So you're saying that cpu_idle() is not ever called.. The commit in question - whether you got it or had it reverted - should, and must not be a trigger for your system to hit cpu_idle(). What if you took a look at init/main.c, and added traces before the first call to cpu_idle(): start_kernel() -> rest_init() -> cpu_idle() With the patch not reverted? So I guess, that an irq may actually start your "system up" (patch reverted) somehow: irq_exit() -> tick_nohz_stop_sched_tick() -> tick_nohz_start_idle() -> sched_clock_idle_sleep_event() ? So it looks that maybe the conservative governor isn't broken after all ;) (BTW I guess, with SMP, you may wish to have fix things to work in your setup static ktime_t time_prior_idle[2]; static int64_t idle_total[2]; and time_prior_idle[smp_processor_id()] = ktime_get(); etc, and print out times separately). - Eero ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 5:27 ` Eero Nurkkala @ 2009-10-07 7:30 ` Steven Noonan 2009-10-07 7:49 ` Eero Nurkkala 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-07 7:30 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Tue, Oct 6, 2009 at 10:27 PM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Wed, 2009-10-07 at 03:22 +0200, ext Steven Noonan wrote: >> >> >> >> Tracers may by themselves add some load into the system. >> >> >> >> If I were you, I'd add something like: (I have only one CPU BTW) >> >> >> >> static ktime_t time_prior_idle; >> >> static int64_t idle_total; >> >> >> >> time_prior_idle = ktime_get(); >> >> <idle stuff> >> >> idle_total += ktime_to_ns(ktime_sub(ktime_get(), time_prior_idle)); >> >> >> >> and have a sysfs hook (something already present, so you can just cat >> >> it) with a trace: >> >> >> >> printk("Times: %lld, %lld \n", idle_total, ktime_to_ns(ktime_get())); >> >> >> >> Sample output: >> >> 374758812519, 386768249832 >> >> >> >> So I have 386768249832 / 386768249832 = 96.9 % time spent in idle in >> >> this case. >> >> >> >> (Right, this should provide somewhat descent info, hopefully ;) ) >> >> >> > >> > Well, I tried adding the code to cpu_idle() as suggested, but it never >> > printed anything. Apparently cpu_idle() isn't ever being called here. >> > Even added a 'BUG();' at the beginning of the function and it never >> > hit it. Of course, I'm probably missing something obvious. Is there a >> > separate cpu_idle()-esque function for SMP? >> > >> >> Oh crap. Perhaps it's more insidious. I reverted the bisected commit >> and it _DID_ hit the line I added. So cpu_idle is never entered with >> the bisected commit. Bizarre. >> >> - Steven > > So you're saying that cpu_idle() is not ever called.. The commit in > question - whether you got it or had it reverted - should, and must not > be a trigger for your system to hit cpu_idle(). What if you took a look > at init/main.c, and added traces before the first call to cpu_idle(): > > start_kernel() -> > rest_init() -> > cpu_idle() Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be for. I thought that cpu_idle() was a function that was periodically called whenever the CPU had nothing to do, but now I see that it's actually the main loop. I should really read the code next time. I've moved the statistics printout code to the _inside_ of that infinite loop and retested. I had it print every several hundred iterations. Here's the results (note the machine was idle the whole time, except for about the first 10-20 seconds while the machine booted): [ 3.627716] timings[0]: 2250511125 / 3627716116 [ 6.946216] timings[0]: 4780901366 / 6946213531 [ 13.355182] timings[0]: 9385417604 / 13355183525 [ 18.551304] timings[1]: 16300853077 / 18551301189 [ 21.589039] timings[0]: 15984495433 / 21589037480 [ 47.152733] timings[1]: 44386121538 / 47152731476 [ 51.682630] timings[0]: 45713834076 / 51682628295 [ 79.587359] timings[0]: 73524821916 / 79587356820 [ 88.630110] timings[1]: 85324277596 / 88630109605 [ 96.082386] timings[0]: 89691306072 / 96082384539 > > With the patch not reverted? > > So I guess, that an irq may actually start your "system up" (patch > reverted) somehow: > > irq_exit() -> > tick_nohz_stop_sched_tick() -> > tick_nohz_start_idle() -> > sched_clock_idle_sleep_event() ? > > So it looks that maybe the conservative governor isn't broken after > all ;) Symptomatically, it is broken. But it's likely just that: a symptom of a larger issue. > > > (BTW I guess, with SMP, you may wish to have fix things to work in your > setup > static ktime_t time_prior_idle[2]; > static int64_t idle_total[2]; > and > time_prior_idle[smp_processor_id()] = ktime_get(); > etc, and print out times separately). > > - Eero > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 7:30 ` Steven Noonan @ 2009-10-07 7:49 ` Eero Nurkkala 2009-10-07 8:05 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-07 7:49 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote: > > Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be > for. I thought that cpu_idle() was a function that was periodically > called whenever the CPU had nothing to do, but now I see that it's > actually the main loop. I should really read the code next time. > > I've moved the statistics printout code to the _inside_ of that > infinite loop and retested. I had it print every several hundred > iterations. Here's the results (note the machine was idle the whole > time, except for about the first 10-20 seconds while the machine > booted): > > [ 3.627716] timings[0]: 2250511125 / 3627716116 > [ 6.946216] timings[0]: 4780901366 / 6946213531 > [ 13.355182] timings[0]: 9385417604 / 13355183525 > [ 18.551304] timings[1]: 16300853077 / 18551301189 > [ 21.589039] timings[0]: 15984495433 / 21589037480 > [ 47.152733] timings[1]: 44386121538 / 47152731476 > [ 51.682630] timings[0]: 45713834076 / 51682628295 > [ 79.587359] timings[0]: 73524821916 / 79587356820 > [ 88.630110] timings[1]: 85324277596 / 88630109605 > [ 96.082386] timings[0]: 89691306072 / 96082384539 > Those look good. Well, might as well then go for: /drivers/cpufreq/cpufreq_conservative.c dbs_check_cpu() -> load = 100 * (wall_time - idle_time) / wall_time; <- What is your load? Let assume load is sane, look for (in dbs_check_cpu()) if (load < (dbs_tuners_ins.down_threshold - 10)) { whether it is taken ever...if not, what is your (dbs_tuners_ins.down_threshold - 10) ? - Eero ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 7:49 ` Eero Nurkkala @ 2009-10-07 8:05 ` Steven Noonan 2009-10-07 8:11 ` Steven Noonan 2009-10-07 8:14 ` Eero Nurkkala 0 siblings, 2 replies; 23+ messages in thread From: Steven Noonan @ 2009-10-07 8:05 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, Oct 7, 2009 at 12:49 AM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote: >> >> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be >> for. I thought that cpu_idle() was a function that was periodically >> called whenever the CPU had nothing to do, but now I see that it's >> actually the main loop. I should really read the code next time. >> >> I've moved the statistics printout code to the _inside_ of that >> infinite loop and retested. I had it print every several hundred >> iterations. Here's the results (note the machine was idle the whole >> time, except for about the first 10-20 seconds while the machine >> booted): >> >> [ 3.627716] timings[0]: 2250511125 / 3627716116 >> [ 6.946216] timings[0]: 4780901366 / 6946213531 >> [ 13.355182] timings[0]: 9385417604 / 13355183525 >> [ 18.551304] timings[1]: 16300853077 / 18551301189 >> [ 21.589039] timings[0]: 15984495433 / 21589037480 >> [ 47.152733] timings[1]: 44386121538 / 47152731476 >> [ 51.682630] timings[0]: 45713834076 / 51682628295 >> [ 79.587359] timings[0]: 73524821916 / 79587356820 >> [ 88.630110] timings[1]: 85324277596 / 88630109605 >> [ 96.082386] timings[0]: 89691306072 / 96082384539 >> > > Those look good. > > Well, might as well then go for: > /drivers/cpufreq/cpufreq_conservative.c > dbs_check_cpu() -> > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load? That's probably the problem... [ 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95 [ 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95 [ 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80 [ 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98 [ 107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80 [ 107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95 [ 140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95 [ 140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95 idle_time is wrong. > Let assume load is sane, look for (in dbs_check_cpu()) > if (load < (dbs_tuners_ins.down_threshold - 10)) { > > whether it is taken ever...if not, what is your > (dbs_tuners_ins.down_threshold - 10) ? > > - Eero > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 8:05 ` Steven Noonan @ 2009-10-07 8:11 ` Steven Noonan 2009-10-07 8:14 ` Eero Nurkkala 1 sibling, 0 replies; 23+ messages in thread From: Steven Noonan @ 2009-10-07 8:11 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, Oct 7, 2009 at 1:05 AM, Steven Noonan <steven@uplinklabs.net> wrote: > On Wed, Oct 7, 2009 at 12:49 AM, Eero Nurkkala > <ext-eero.nurkkala@nokia.com> wrote: >> On Wed, 2009-10-07 at 09:30 +0200, ext Steven Noonan wrote: >>> >>> Okay, wow, I'm a moron. I misread what cpu_idle() was intended to be >>> for. I thought that cpu_idle() was a function that was periodically >>> called whenever the CPU had nothing to do, but now I see that it's >>> actually the main loop. I should really read the code next time. >>> >>> I've moved the statistics printout code to the _inside_ of that >>> infinite loop and retested. I had it print every several hundred >>> iterations. Here's the results (note the machine was idle the whole >>> time, except for about the first 10-20 seconds while the machine >>> booted): >>> >>> [ 3.627716] timings[0]: 2250511125 / 3627716116 >>> [ 6.946216] timings[0]: 4780901366 / 6946213531 >>> [ 13.355182] timings[0]: 9385417604 / 13355183525 >>> [ 18.551304] timings[1]: 16300853077 / 18551301189 >>> [ 21.589039] timings[0]: 15984495433 / 21589037480 >>> [ 47.152733] timings[1]: 44386121538 / 47152731476 >>> [ 51.682630] timings[0]: 45713834076 / 51682628295 >>> [ 79.587359] timings[0]: 73524821916 / 79587356820 >>> [ 88.630110] timings[1]: 85324277596 / 88630109605 >>> [ 96.082386] timings[0]: 89691306072 / 96082384539 >>> >> >> Those look good. >> >> Well, might as well then go for: >> /drivers/cpufreq/cpufreq_conservative.c >> dbs_check_cpu() -> >> load = 100 * (wall_time - idle_time) / wall_time; <- What is your load? > > That's probably the problem... > > [ 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95 > [ 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95 > [ 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80 > [ 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98 > [ 107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80 > [ 107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95 > [ 140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95 > [ 140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95 > > idle_time is wrong. Actually, it's more likely that the idle_time there is correct and there's something else that's going runaway. My system's fans are running at about 4000 RPM, when they'd normally be running at 2000 RPM for this load average. I suspect there's something actually going wild. > >> Let assume load is sane, look for (in dbs_check_cpu()) >> if (load < (dbs_tuners_ins.down_threshold - 10)) { >> >> whether it is taken ever...if not, what is your >> (dbs_tuners_ins.down_threshold - 10) ? >> >> - Eero >> >> > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 8:05 ` Steven Noonan 2009-10-07 8:11 ` Steven Noonan @ 2009-10-07 8:14 ` Eero Nurkkala 2009-10-07 8:24 ` Steven Noonan 1 sibling, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-07 8:14 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, 2009-10-07 at 10:05 +0200, ext Steven Noonan wrote: > > Those look good. > > > > Well, might as well then go for: > > /drivers/cpufreq/cpufreq_conservative.c > > dbs_check_cpu() -> > > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load? > > That's probably the problem... > > [ 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95 > [ 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95 > [ 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80 > [ 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98 > [ 107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80 > [ 107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95 > [ 140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95 > [ 140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95 > > idle_time is wrong. > > > Let assume load is sane, look for (in dbs_check_cpu()) > > if (load < (dbs_tuners_ins.down_threshold - 10)) { > > > > whether it is taken ever...if not, what is your > > (dbs_tuners_ins.down_threshold - 10) ? > > > > - Eero > > > > <------------Sidenote------------------------------> Vankatesh, I think the following is not smp friendly: cpufreq_conservative.c: /* Get Absolute Load */ for_each_cpu(j, policy->cpus) { ... load = 100 * (wall_time - idle_time) / wall_time; now the load is only considered from the last cpu in line? in ondemand, load goes to max_load_freq...so max is taken from all cpus out there. (well, doesn't seem like the case with conservative). Of course I may be wrong, and things are like supposed. </------------Sidenote------------------------------> Steven, how do the cpu loads look like without the patch? - Eero ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 8:14 ` Eero Nurkkala @ 2009-10-07 8:24 ` Steven Noonan 2009-10-07 8:43 ` Eero Nurkkala 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-07 8:24 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, Oct 7, 2009 at 1:14 AM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Wed, 2009-10-07 at 10:05 +0200, ext Steven Noonan wrote: >> > Those look good. >> > >> > Well, might as well then go for: >> > /drivers/cpufreq/cpufreq_conservative.c >> > dbs_check_cpu() -> >> > load = 100 * (wall_time - idle_time) / wall_time; <- What is your load? >> >> That's probably the problem... >> >> [ 40.632277] cpufreq load = 100 * (66667 - 3310) / 66667 = 95 >> [ 40.698947] cpufreq load = 100 * (66661 - 3238) / 66661 = 95 >> [ 73.965425] cpufreq load = 100 * (66667 - 12820) / 66667 = 80 >> [ 74.032095] cpufreq load = 100 * (66661 - 1124) / 66661 = 98 >> [ 107.298571] cpufreq load = 100 * (66666 - 13092) / 66666 = 80 >> [ 107.365301] cpufreq load = 100 * (66722 - 3317) / 66722 = 95 >> [ 140.631717] cpufreq load = 100 * (66666 - 3311) / 66666 = 95 >> [ 140.698387] cpufreq load = 100 * (66662 - 3237) / 66662 = 95 >> >> idle_time is wrong. >> >> > Let assume load is sane, look for (in dbs_check_cpu()) >> > if (load < (dbs_tuners_ins.down_threshold - 10)) { >> > >> > whether it is taken ever...if not, what is your >> > (dbs_tuners_ins.down_threshold - 10) ? >> > >> > - Eero >> > >> > > > <------------Sidenote------------------------------> > Vankatesh, I think the following is not smp friendly: > cpufreq_conservative.c: > /* Get Absolute Load */ > for_each_cpu(j, policy->cpus) { > ... > load = 100 * (wall_time - idle_time) / wall_time; > > now the load is only considered from the last cpu in line? > in ondemand, load goes to max_load_freq...so max is taken > from all cpus out there. (well, doesn't seem like the case with > conservative). Of course I may be wrong, and things are like > supposed. > </------------Sidenote------------------------------> > > Steven, how do the cpu loads look like without the patch? They're sane: [ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0 [ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0 [ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0 [ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4 ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 8:24 ` Steven Noonan @ 2009-10-07 8:43 ` Eero Nurkkala 2009-10-07 8:52 ` Steven Noonan 0 siblings, 1 reply; 23+ messages in thread From: Eero Nurkkala @ 2009-10-07 8:43 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote: > > > > Steven, how do the cpu loads look like without the patch? > > They're sane: > > [ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0 > [ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0 > [ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0 > [ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4 Thank you. Could you please try the following: Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set and all subsequent calls from irq_exit() think we weren't idling, which is not true. --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle) now = tick_nohz_start_idle(ts); + ts->inidle = 1; + /* * If this cpu is offline and it is the one which updates * jiffies, then give up the assignment and let it be taken by @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle) if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) goto end; - ts->inidle = 1; - if (need_resched()) goto end; ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 8:43 ` Eero Nurkkala @ 2009-10-07 8:52 ` Steven Noonan 2009-10-07 9:31 ` Eero Nurkkala 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-07 8:52 UTC (permalink / raw) To: ext-eero.nurkkala Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi, Greg KH, Ingo Molnar On Wed, Oct 7, 2009 at 1:43 AM, Eero Nurkkala <ext-eero.nurkkala@nokia.com> wrote: > On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote: >> > >> > Steven, how do the cpu loads look like without the patch? >> >> They're sane: >> >> [ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0 >> [ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0 >> [ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0 >> [ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4 > > Thank you. Could you please try the following: > Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set > and all subsequent calls from irq_exit() think we weren't idling, > which is not true. > > --- a/kernel/time/tick-sched.c > +++ b/kernel/time/tick-sched.c > @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle) > > now = tick_nohz_start_idle(ts); > > + ts->inidle = 1; > + > /* > * If this cpu is offline and it is the one which updates > * jiffies, then give up the assignment and let it be taken by > @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle) > if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) > goto end; > > - ts->inidle = 1; > - > if (need_resched()) > goto end; > Yeah, that fixed it. The load is now sane, my system isn't heating up, and 'conservative' is now clocking my CPU down to the minimum as is appropriate. Can you explain why this only affected 'conservative', why it caused my machine to heat up, etc? Also, this fix should probably be passed on to Greg K. H. (so it goes in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and hopefully to Linus for 2.6.32). Both CC'd. - Steven ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 8:52 ` Steven Noonan @ 2009-10-07 9:31 ` Eero Nurkkala 2009-10-07 10:42 ` Thomas Gleixner 2009-10-07 11:06 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala 0 siblings, 2 replies; 23+ messages in thread From: Eero Nurkkala @ 2009-10-07 9:31 UTC (permalink / raw) To: ext Steven Noonan Cc: linux-kernel@vger.kernel.org, Thomas Gleixner, Rik van Riel, Venkatesh Pallipadi, Greg KH, Ingo Molnar On Wed, 2009-10-07 at 10:52 +0200, ext Steven Noonan wrote: > On Wed, Oct 7, 2009 at 1:43 AM, Eero Nurkkala > <ext-eero.nurkkala@nokia.com> wrote: > > On Wed, 2009-10-07 at 10:24 +0200, ext Steven Noonan wrote: > >> > > >> > Steven, how do the cpu loads look like without the patch? > >> > >> They're sane: > >> > >> [ 40.019381] cpufreq load = 100 * (66666 - 66337) / 66666 = 0 > >> [ 40.019396] cpufreq load = 100 * (66666 - 66299) / 66666 = 0 > >> [ 73.352580] cpufreq load = 100 * (66717 - 66349) / 66717 = 0 > >> [ 73.352595] cpufreq load = 100 * (66634 - 63848) / 66634 = 4 > > > > Thank you. Could you please try the following: > > Now, if ts->nohz_mode == NOHZ_MODE_INACTIVE, ts->inidle is not set > > and all subsequent calls from irq_exit() think we weren't idling, > > which is not true. > > > > --- a/kernel/time/tick-sched.c > > +++ b/kernel/time/tick-sched.c > > @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle) > > > > now = tick_nohz_start_idle(ts); > > > > + ts->inidle = 1; > > + > > /* > > * If this cpu is offline and it is the one which updates > > * jiffies, then give up the assignment and let it be taken by > > @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle) > > if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) > > goto end; > > > > - ts->inidle = 1; > > - > > if (need_resched()) > > goto end; > > > > Yeah, that fixed it. The load is now sane, my system isn't heating up, > and 'conservative' is now clocking my CPU down to the minimum as is > appropriate. > For some reason, this path is taken: if (ts->nohz_mode == NOHZ_MODE_INACTIVE) goto end; > Can you explain why this only affected 'conservative', why it caused > my machine to heat up, etc? > Maybe the loads dont look good with ondemand either? if you trace them out. Below is a patch that has explanation on what's happening in your system. > Also, this fix should probably be passed on to Greg K. H. (so it goes > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and > hopefully to Linus for 2.6.32). Both CC'd. > > - Steven From: Eero Nurkkala <ext-eero.nurkkala@nokia.com> Date: Wed, 7 Oct 2009 11:54:26 +0300 Subject: [PATCH] NOHZ: update idle state properly Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side effects with cpufreq governors on some systems. If NOHZ_MODE_INACTIVE was set, ts->inidle was not being set. Then, all subsequent calls from irq_exit() bypassed calls to tick_nohz_start_idle() which resulted in wrong information passed to cpufreq governors. Fix this by updating the state of ts->inidle where it fits the best. Reported-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> --- kernel/time/tick-sched.c | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index e0f59a2..4c1b0ac 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -233,6 +233,8 @@ void tick_nohz_stop_sched_tick(int inidle) now = tick_nohz_start_idle(ts); + ts->inidle = 1; + /* * If this cpu is offline and it is the one which updates * jiffies, then give up the assignment and let it be taken by @@ -248,8 +250,6 @@ void tick_nohz_stop_sched_tick(int inidle) if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) goto end; - ts->inidle = 1; - if (need_resched()) goto end; -- 1.6.0.4 ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 9:31 ` Eero Nurkkala @ 2009-10-07 10:42 ` Thomas Gleixner 2009-10-07 11:03 ` Steven Noonan 2009-10-07 11:06 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala 1 sibling, 1 reply; 23+ messages in thread From: Thomas Gleixner @ 2009-10-07 10:42 UTC (permalink / raw) To: Eero Nurkkala Cc: ext Steven Noonan, linux-kernel@vger.kernel.org, Rik van Riel, Venkatesh Pallipadi, Greg KH, Ingo Molnar On Wed, 7 Oct 2009, Eero Nurkkala wrote: > For some reason, this path is taken: > if (ts->nohz_mode == NOHZ_MODE_INACTIVE) > goto end; Well, probably because the machine never switched to NOHZ :) > > Also, this fix should probably be passed on to Greg K. H. (so it goes > > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and > > hopefully to Linus for 2.6.32). Both CC'd. Don't worry, it'll go the right way. No need to cc the world and some more :) > > - Steven > > From: Eero Nurkkala <ext-eero.nurkkala@nokia.com> > Date: Wed, 7 Oct 2009 11:54:26 +0300 > Subject: [PATCH] NOHZ: update idle state properly > > Commit f2e21c9610991e95621a81407cdbab881226419b > had unfortunate side effects with cpufreq governors on > some systems. > > If NOHZ_MODE_INACTIVE was set, ts->inidle was not being > set. Then, all subsequent calls from irq_exit() bypassed > calls to tick_nohz_start_idle() which resulted in wrong > information passed to cpufreq governors. Fix this by > updating the state of ts->inidle where it fits the best. Hmm, nasty that code silently relies on NOHZ functionality even when NOHZ is not active. That needs a big fat comment at least. Nice detective work though ! Thanks, tglx ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 10:42 ` Thomas Gleixner @ 2009-10-07 11:03 ` Steven Noonan 2009-10-07 11:15 ` Thomas Gleixner 0 siblings, 1 reply; 23+ messages in thread From: Steven Noonan @ 2009-10-07 11:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Eero Nurkkala, linux-kernel@vger.kernel.org, Rik van Riel, Venkatesh Pallipadi, Greg KH, Ingo Molnar On Wed, Oct 7, 2009 at 3:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, 7 Oct 2009, Eero Nurkkala wrote: >> For some reason, this path is taken: >> if (ts->nohz_mode == NOHZ_MODE_INACTIVE) >> goto end; > > Well, probably because the machine never switched to NOHZ :) > >> > Also, this fix should probably be passed on to Greg K. H. (so it goes >> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and >> > hopefully to Linus for 2.6.32). Both CC'd. > > Don't worry, it'll go the right way. No need to cc the world and some > more :) > Sorry if it's not kosher. I was told that when dealing with a bisection you should CC: 1. those involved with a bisected patch (the Signed-off-by people), 2. those responsible for the affected subsystem (doing an author check via script[1]), 3. and those to whom a fix should be passed on to. Any tips so I don't bug the wrong people (or too many people) in the future? [1] http://www.uplinklabs.net/~tycho/linux/scripts/git-authors-email.sh (credits to Ingo for making this one. It's handy) >> >> From: Eero Nurkkala <ext-eero.nurkkala@nokia.com> >> Date: Wed, 7 Oct 2009 11:54:26 +0300 >> Subject: [PATCH] NOHZ: update idle state properly >> >> Commit f2e21c9610991e95621a81407cdbab881226419b >> had unfortunate side effects with cpufreq governors on >> some systems. >> >> If NOHZ_MODE_INACTIVE was set, ts->inidle was not being >> set. Then, all subsequent calls from irq_exit() bypassed >> calls to tick_nohz_start_idle() which resulted in wrong >> information passed to cpufreq governors. Fix this by >> updating the state of ts->inidle where it fits the best. > > Hmm, nasty that code silently relies on NOHZ functionality even when > NOHZ is not active. That needs a big fat comment at least. > > Nice detective work though ! > > Thanks, > > tglx > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [BISECTED] "conservative" cpufreq governor broken 2009-10-07 11:03 ` Steven Noonan @ 2009-10-07 11:15 ` Thomas Gleixner 0 siblings, 0 replies; 23+ messages in thread From: Thomas Gleixner @ 2009-10-07 11:15 UTC (permalink / raw) To: Steven Noonan Cc: Eero Nurkkala, linux-kernel@vger.kernel.org, Rik van Riel, Venkatesh Pallipadi, Greg KH, Ingo Molnar On Wed, 7 Oct 2009, Steven Noonan wrote: > > Well, probably because the machine never switched to NOHZ :) > > > >> > Also, this fix should probably be passed on to Greg K. H. (so it goes > >> > in 2.6.31-stable) as well as Ingo Molnar (so it goes into -tip, and > >> > hopefully to Linus for 2.6.32). Both CC'd. > > > > Don't worry, it'll go the right way. No need to cc the world and some > > more :) > > > > Sorry if it's not kosher. I was told that when dealing with a > bisection you should CC: > 1. those involved with a bisected patch (the Signed-off-by people), > 2. those responsible for the affected subsystem (doing an author check > via script[1]), > 3. and those to whom a fix should be passed on to. > > Any tips so I don't bug the wrong people (or too many people) in the future? No worry. It's just Linus and Greg who got some extra mail into their inbox. The flow is that the patch is picked up by the maintainer and pushed into his git tree which is then offered to Linus for pulling it into mainline. We add a Cc: stable@kernel.org tag to the commit, which triggers a notifier to the stable kernel team when the patch hits Linus tree. Thanks, tglx ^ permalink raw reply [flat|nested] 23+ messages in thread
* [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive 2009-10-07 9:31 ` Eero Nurkkala 2009-10-07 10:42 ` Thomas Gleixner @ 2009-10-07 11:06 ` tip-bot for Eero Nurkkala 2009-10-07 13:35 ` Rik van Riel 1 sibling, 1 reply; 23+ messages in thread From: tip-bot for Eero Nurkkala @ 2009-10-07 11:06 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, hpa, mingo, greg, venkatesh.pallipadi, riel, steven, ext-eero.nurkkala, tglx Commit-ID: fdc6f192e7e1ae80565af23cc33dc88e3dcdf184 Gitweb: http://git.kernel.org/tip/fdc6f192e7e1ae80565af23cc33dc88e3dcdf184 Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com> AuthorDate: Wed, 7 Oct 2009 11:54:26 +0300 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Wed, 7 Oct 2009 13:05:05 +0200 NOHZ: update idle state also when NOHZ is inactive Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side effects with cpufreq governors on some systems. If the system did not switch into NOHZ mode ts->inidle is not set when tick_nohz_stop_sched_tick() is called from the idle routine. Therefor all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick() fail to call tick_nohz_start_idle(). This results in bogus idle accounting information which is passed to cpufreq governors. Set the inidle flag unconditionally of the NOHZ active state to keep the idle time accounting correct in any case. [ tglx: Added comment and tweaked the changelog ] Reported-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> Cc: Rik van Riel <riel@redhat.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Greg KH <greg@kroah.com> Cc: Steven Noonan <steven@uplinklabs.net> Cc: stable@kernel.org LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/time/tick-sched.c | 9 +++++++-- 1 files changed, 7 insertions(+), 2 deletions(-) diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c index e0f59a2..89aed59 100644 --- a/kernel/time/tick-sched.c +++ b/kernel/time/tick-sched.c @@ -231,6 +231,13 @@ void tick_nohz_stop_sched_tick(int inidle) if (!inidle && !ts->inidle) goto end; + /* + * Set ts->inidle unconditionally. Even if the system did not + * switch to NOHZ mode the cpu frequency governers rely on the + * update of the idle time accounting in tick_nohz_start_idle(). + */ + ts->inidle = 1; + now = tick_nohz_start_idle(ts); /* @@ -248,8 +255,6 @@ void tick_nohz_stop_sched_tick(int inidle) if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE)) goto end; - ts->inidle = 1; - if (need_resched()) goto end; ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive 2009-10-07 11:06 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala @ 2009-10-07 13:35 ` Rik van Riel 0 siblings, 0 replies; 23+ messages in thread From: Rik van Riel @ 2009-10-07 13:35 UTC (permalink / raw) To: mingo, hpa, linux-kernel, venkatesh.pallipadi, greg, riel, steven, ext-eero.nurkkala, tglx Cc: linux-tip-commits tip-bot for Eero Nurkkala wrote: > Commit-ID: fdc6f192e7e1ae80565af23cc33dc88e3dcdf184 > Gitweb: http://git.kernel.org/tip/fdc6f192e7e1ae80565af23cc33dc88e3dcdf184 > Author: Eero Nurkkala <ext-eero.nurkkala@nokia.com> > AuthorDate: Wed, 7 Oct 2009 11:54:26 +0300 > Committer: Thomas Gleixner <tglx@linutronix.de> > CommitDate: Wed, 7 Oct 2009 13:05:05 +0200 > > NOHZ: update idle state also when NOHZ is inactive > > Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side > effects with cpufreq governors on some systems. > > If the system did not switch into NOHZ mode ts->inidle is not set when > tick_nohz_stop_sched_tick() is called from the idle routine. Therefor > all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick() > fail to call tick_nohz_start_idle(). This results in bogus idle > accounting information which is passed to cpufreq governors. > > Set the inidle flag unconditionally of the NOHZ active state to keep > the idle time accounting correct in any case. Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed. ^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2009-10-07 13:36 UTC | newest] Thread overview: 23+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-10-05 16:32 [BISECTED] "conservative" cpufreq governor broken Steven Noonan 2009-10-06 7:31 ` Eero Nurkkala 2009-10-06 10:22 ` Steven Noonan 2009-10-06 10:43 ` Eero Nurkkala 2009-10-06 11:22 ` Steven Noonan 2009-10-06 12:36 ` Eero Nurkkala 2009-10-07 0:54 ` Steven Noonan 2009-10-07 1:22 ` Steven Noonan 2009-10-07 5:27 ` Eero Nurkkala 2009-10-07 7:30 ` Steven Noonan 2009-10-07 7:49 ` Eero Nurkkala 2009-10-07 8:05 ` Steven Noonan 2009-10-07 8:11 ` Steven Noonan 2009-10-07 8:14 ` Eero Nurkkala 2009-10-07 8:24 ` Steven Noonan 2009-10-07 8:43 ` Eero Nurkkala 2009-10-07 8:52 ` Steven Noonan 2009-10-07 9:31 ` Eero Nurkkala 2009-10-07 10:42 ` Thomas Gleixner 2009-10-07 11:03 ` Steven Noonan 2009-10-07 11:15 ` Thomas Gleixner 2009-10-07 11:06 ` [tip:timers/urgent] NOHZ: update idle state also when NOHZ is inactive tip-bot for Eero Nurkkala 2009-10-07 13:35 ` Rik van Riel
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.