50 Watt idle power regression bisected to Linux-3.10

public inbox for linux-pm@vger.kernel.org
 help / color / mirror / Atom feed

* 50 Watt idle power regression bisected to Linux-3.10
@ 2013-12-07  8:00 Len Brown
  2013-12-07  8:39 ` Mike Galbraith
                   ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Len Brown @ 2013-12-07  8:00 UTC (permalink / raw)
  To: tglx, Peter Zijlstra
  Cc: Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

Hello Thomas,

An idle WSM-EX box (40 Xeon cores) runs 50 Watts hotter after this patch:

commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Mar 21 22:50:03 2013 +0100

    x86: Use generic idle loop

ie. the commit before this patch (aba92c9e2cf3042bf6efc68fa2e4235ba01bf499)
runs at 50 watts less, as do Linux 3.7, 3.8 and 3.9.

The difference is that the good kernels allow about 98% residence
in the package C6 state, while the bad kernel is so noisy that it
gets into pc6 0% of the time.
(indeed, even core C6 is reduced to about 50% from over 99%)

No, Linux-3.13-rc3 does not fix this issue, even though it contains
the following patch, claiming to address an issue with the commit above:

commit ea8117478918a4734586d35ff530721b682425be
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Wed Sep 11 12:43:13 2013 +0200

    sched, idle: Fix the idle polling state logic

    Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
    regressed several workloads and caused excessive reschedule
    interrupts.

    The patch in question failed to notice that the x86 code had an
    inverted sense of the polling state versus the new generic code (x86:
    default polling, generic: default !polling).

    Fix the two prominent x86 mwait based idle drivers and introduce a few
    new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
    usage).

    Also switch the idle routines to using tif_need_resched() which is an
    immediate TIF_NEED_RESCHED test as opposed to need_resched which will
    end up being slightly different.

    Reported-by: Mike Galbraith <bitbucket@online.de>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Cc: lenb@kernel.org
    Cc: tglx@linutronix.de
    Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

How shall we proceed?

thanks,
-Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
@ 2013-12-07  8:39 ` Mike Galbraith
  2013-12-07 16:01   ` Len Brown
  2013-12-07 12:54 ` Thomas Gleixner
  2013-12-08  4:57 ` Mike Galbraith
  2 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-07  8:39 UTC (permalink / raw)
  To: Len Brown
  Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote:

> No, Linux-3.13-rc3 does not fix this issue, even though it contains
> the following patch, claiming to address an issue with the commit above:
> 
> commit ea8117478918a4734586d35ff530721b682425be
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Wed Sep 11 12:43:13 2013 +0200
> 
>     sched, idle: Fix the idle polling state logic
> 
>     Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
>     regressed several workloads and caused excessive reschedule
>     interrupts.

It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
beloved old box doesn't play hints game, so it continues to flog itself.

-Mike

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
  2013-12-07  8:39 ` Mike Galbraith
@ 2013-12-07 12:54 ` Thomas Gleixner
  2013-12-08  4:57 ` Mike Galbraith
  2 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2013-12-07 12:54 UTC (permalink / raw)
  To: Len Brown
  Cc: Peter Zijlstra, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

Len,

On Sat, 7 Dec 2013, Len Brown wrote:
>
> How shall we proceed?

Can you please gather a function trace so I can see what the system is
doing? Preferrably with 3.13-rc3 so we have Peters fix included.

Please send it offlist or put it somehwere for download.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:39 ` Mike Galbraith
@ 2013-12-07 16:01   ` Len Brown
  2013-12-07 16:45     ` Len Brown
  0 siblings, 1 reply; 51+ messages in thread
From: Len Brown @ 2013-12-07 16:01 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Sat, Dec 7, 2013 at 3:39 AM, Mike Galbraith <bitbucket@online.de> wrote:
> On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote:
>
>> No, Linux-3.13-rc3 does not fix this issue, even though it contains
>> the following patch, claiming to address an issue with the commit above:
>>
>> commit ea8117478918a4734586d35ff530721b682425be
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Wed Sep 11 12:43:13 2013 +0200
>>
>>     sched, idle: Fix the idle polling state logic
>>
>>     Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
>>     regressed several workloads and caused excessive reschedule
>>     interrupts.
>
> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
> beloved old box doesn't play hints game, so it continues to flog itself.

Hi Mike,
What C-states are available on your Q6600 box?
Is it running HALT only?
(if so, I would expect C1 MWAIT be about the same, if it still existed)

But I would expect the box to have deeper C-states than just C1, so...

Can you show me the output of
dmesg | grep idle
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*

Please elaborate on exactly what symptom you see.
Do you have a power meter?

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07 16:01   ` Len Brown
@ 2013-12-07 16:45     ` Len Brown
  2013-12-07 19:17       ` Mike Galbraith
  0 siblings, 1 reply; 51+ messages in thread
From: Len Brown @ 2013-12-07 16:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

>> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
>> beloved old box doesn't play hints game, so it continues to flog itself.

Thanks for pointing this out, Mike!

A Q6600 is a Kentsfield.  I dug one of those up.
Indeed, the only idle capabilities it has are  HALT
and old style MWAIT, and the latter is much more effective.
running 3.8 it idles at 75 watts.
running 3.8 with idle=nomwait it idles at 100 watts,
which is what it will do with 3.9 and later due to the patch below.

commit 69fb3676df3329a7142803bb3502fa59dc0db2e3
Author: Len Brown <len.brown@intel.com>
Date:   Sun Feb 10 01:38:39 2013 -0500

    x86 idle: remove mwait_idle() and "idle=mwait" cmdline param

Kentsfield proves that patch was based on a fault assumption.
Sweet box in its day, ECC memory and everything -- probably still
a fair number of them running...

Plus, I've found another machine that depends on having an idle=mwait
idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in
https://bugzilla.kernel.org/show_bug.cgi?id=60770)

So it looks like I need to (also) restore the simple idle=mwait idle loop
to make some machines happy.

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07 16:45     ` Len Brown
@ 2013-12-07 19:17       ` Mike Galbraith
  2013-12-10 11:41         ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-07 19:17 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Sat, 2013-12-07 at 11:45 -0500, Len Brown wrote: 
> >> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
> >> beloved old box doesn't play hints game, so it continues to flog itself.
> 
> 
> Thanks for pointing this out, Mike!
> 
> A Q6600 is a Kentsfield.  I dug one of those up.
> Indeed, the only idle capabilities it has are  HALT
> and old style MWAIT, and the latter is much more effective.
> running 3.8 it idles at 75 watts.
> running 3.8 with idle=nomwait it idles at 100 watts,
> which is what it will do with 3.9 and later due to the patch below.
> 
> commit 69fb3676df3329a7142803bb3502fa59dc0db2e3
> Author: Len Brown <len.brown@intel.com>
> Date:   Sun Feb 10 01:38:39 2013 -0500
> 
>     x86 idle: remove mwait_idle() and "idle=mwait" cmdline param
> 
> Kentsfield proves that patch was based on a fault assumption.
> Sweet box in its day, ECC memory and everything -- probably still
> a fair number of them running...
> 
> Plus, I've found another machine that depends on having an idle=mwait
> idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in
> https://bugzilla.kernel.org/show_bug.cgi?id=60770)
> 
> So it looks like I need to (also) restore the simple idle=mwait idle loop
> to make some machines happy.

Cool, box will definitely be happier.

-Mike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
  2013-12-07  8:39 ` Mike Galbraith
  2013-12-07 12:54 ` Thomas Gleixner
@ 2013-12-08  4:57 ` Mike Galbraith
  2013-12-08 20:40   ` Len Brown
  2 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-08  4:57 UTC (permalink / raw)
  To: Len Brown
  Cc: tglx, Peter Zijlstra, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Sat, 2013-12-07 at 03:00 -0500, Len Brown wrote: 
> Hello Thomas,
> 
> An idle WSM-EX box (40 Xeon cores) runs 50 Watts hotter after this patch:
> 
> commit 7d1a941731fabf27e5fb6edbebb79fe856edb4e5
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Thu Mar 21 22:50:03 2013 +0100
> 
>     x86: Use generic idle loop
> 
> ie. the commit before this patch (aba92c9e2cf3042bf6efc68fa2e4235ba01bf499)
> runs at 50 watts less, as do Linux 3.7, 3.8 and 3.9.
> 
> The difference is that the good kernels allow about 98% residence
> in the package C6 state, while the bad kernel is so noisy that it
> gets into pc6 0% of the time.
> (indeed, even core C6 is reduced to about 50% from over 99%)

FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per
powertop, deepest state it has.  It took a big hit from 69a37bea as well
as 7d1a9417, but with revert of 69a37bea, and addition of ea811747, it's
fairly close to 3.7 in fastpath weight, and greenery _seems_ fine.

-Mike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-08  4:57 ` Mike Galbraith
@ 2013-12-08 20:40   ` Len Brown
  2013-12-09  3:16     ` Mike Galbraith
  0 siblings, 1 reply; 51+ messages in thread
From: Len Brown @ 2013-12-08 20:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

> FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per
> powertop, deepest state it has.

Please run turbostat, which always describes what the hardware does.
Depending on the version of powertop and the hardware involved,
it may be describing that the OS requested -- which is not always the same.

The hardware reserves the right to demote any requested C-state to a
shallower one.  Also, it is the hardware decision whether to promote
core c-states requested by the OS into power saving package C-states.

cheers,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-08 20:40   ` Len Brown
@ 2013-12-09  3:16     ` Mike Galbraith
  2013-12-10  5:17       ` Mike Galbraith
  0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-09  3:16 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Sun, 2013-12-08 at 15:40 -0500, Len Brown wrote: 
> > FWIW, my little x3550 (E5620) box is spending ~99% of its time in C3 per
> > powertop, deepest state it has.
> 
> Please run turbostat, which always describes what the hardware does.
> Depending on the version of powertop and the hardware involved,
> it may be describing that the OS requested -- which is not always the same.
> 
> The hardware reserves the right to demote any requested C-state to a
> shallower one.  Also, it is the hardware decision whether to promote
> core c-states requested by the OS into power saving package C-states.

Hrm, turbostat says rtbox has a C6, and is using it.  Ah, BIOS says ACPI
C3 == Intel C6, so I suppose Intel C3 == ACPI C2.  Whatever, both tools
seem to agree that these particular boxen (x3550/E5620 and DL980/X7560)
are snoozing peacefully.

rtbox:~ # turbostat -i 60
cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
          0.18 1.73 2.40   0   0.19   0.02  99.61   25   0.08  92.74
  0   0   0.47 1.60 2.40   0   0.62   0.08  98.83   18   0.08  92.74
  1   1   0.03 1.61 2.40   0   0.05   0.00  99.92   25
  9   2   0.05 1.73 2.40   0   0.06   0.00  99.89   16
 10   3   0.15 2.13 2.40   0   0.05   0.01  99.79   20

vogelweide:~/:[0]# turbostat -i 60
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.06 2.03 2.26   0   0.76  99.18   0.00   49  94.97   0.00
 0   0   0   0.40 1.57 2.26   3  25.28  74.32   0.00   49  64.24   0.00
 0   1   1   0.08 2.16 2.26   3  19.28  80.64   0.00   48
 0   2   2   0.09 2.20 2.26   3   0.00  99.91   0.00   49
 0   3   3   0.09 2.20 2.26   3   0.00  99.91   0.00   49
 0   8   4   0.08 2.21 2.26   3   0.00  99.91   0.00   45
 0   9   5   0.08 2.22 2.26   3   0.00  99.92   0.00   46
 0  10   6   0.08 2.23 2.26   3   0.00  99.92   0.00   45
 0  11   7   0.08 2.23 2.26   3   0.00  99.92   0.00   45
 1   0   8   0.10 1.94 2.26   3   0.01  99.89   0.00   38  99.88   0.00
 1   1   9   0.08 2.11 2.26   3   0.00  99.92   0.00   39
 1   2  10   0.08 2.12 2.26   3   0.00  99.92   0.00   41
 1   3  11   0.07 2.14 2.26   3   0.00  99.92   0.00   41
 1   8  12   0.08 2.15 2.26   3   0.00  99.92   0.00   42
 1   9  13   0.07 2.17 2.26   3   0.00  99.92   0.00   43
 1  10  14   0.07 2.19 2.26   3   0.00  99.92   0.00   41
 1  11  15   0.07 2.20 2.26   3   0.00  99.93   0.00   41
 2   0  16   0.09 1.93 2.26   3   0.01  99.90   0.00   41  99.89   0.00
 2   1  17   0.07 2.09 2.26   3   0.00  99.92   0.00   40
 2   2  18   0.07 2.11 2.26   3   0.00  99.93   0.00   41
 2   3  19   0.07 2.12 2.26   3   0.00  99.93   0.00   42
 2   8  20   0.07 2.14 2.26   3   0.00  99.93   0.00   41
 2   9  21   0.07 2.16 2.26   3   0.00  99.93   0.00   41
 2  10  22   0.07 2.17 2.26   3   0.00  99.93   0.00   37
 2  11  23   0.07 2.20 2.26   3   0.00  99.93   0.00   39
 3   0  24   0.08 1.91 2.26   3   0.01  99.91   0.00   41  99.88   0.00
 3   1  25   0.06 2.07 2.26   3   0.00  99.93   0.00   42
 3   2  26   0.07 2.09 2.26   3   0.02  99.92   0.00   41
 3   3  27   0.06 2.11 2.26   3   0.00  99.94   0.00   42
 3   8  28   0.06 2.12 2.26   3   0.00  99.94   0.00   47
 3   9  29   0.06 2.15 2.26   3   0.00  99.94   0.00   46
 3  10  30   0.06 2.18 2.26   3   0.00  99.94   0.00   44
 3  11  31   0.06 2.19 2.26   3   0.02  99.92   0.00   46
 4   0  32   0.07 1.92 2.26   3   0.01  99.92   0.00   36  96.25   0.00
 4   1  33   0.06 2.07 2.26   3   0.00  99.94   0.00   36
 4   2  34   0.06 2.08 2.26   3   0.00  99.94   0.00   39
 4   3  35   0.05 2.11 2.26   3   0.00  99.94   0.00   39
 4   8  36   0.05 2.12 2.26   3   3.67  96.28   0.00   40
 4   9  37   0.05 2.15 2.26   3   0.00  99.94   0.00   42
 4  10  38   0.05 2.18 2.26   3   0.00  99.95   0.00   39
 4  11  39   0.05 2.20 2.26   3   0.00  99.95   0.00   40
 5   0  40   0.07 1.73 2.26   3   0.02  99.91   0.00   35  99.90   0.00
 5   1  41   0.05 2.00 2.26   3   0.00  99.95   0.00   36
 5   2  42   0.05 2.04 2.26   3   0.00  99.95   0.00   39
 5   3  43   0.04 2.04 2.26   3   0.00  99.95   0.00   38
 5   8  44   0.05 2.09 2.26   3   0.00  99.95   0.00   40
 5   9  45   0.04 2.10 2.26   3   0.00  99.95   0.00   41
 5  10  46   0.04 2.13 2.26   3   0.00  99.96   0.00   37
 5  11  47   0.04 2.17 2.26   3   0.00  99.96   0.00   37
 6   0  48   0.08 1.61 2.26   3   0.02  99.90   0.00   35  99.87   0.00
 6   1  49   0.05 1.79 2.26   3   0.01  99.94   0.00   36
 6   2  50   0.04 1.98 2.26   3   0.00  99.96   0.00   36
 6   3  51   0.04 2.00 2.26   3   0.00  99.96   0.00   37
 6   8  52   0.04 2.01 2.26   3   0.00  99.96   0.00   38
 6   9  53   0.03 2.07 2.26   3   0.00  99.96   0.00   36
 6  10  54   0.03 2.08 2.26   3   0.00  99.96   0.00   36
 6  11  55   0.03 2.14 2.26   3   0.00  99.96   0.00   36
 7   0  56   0.05 1.69 2.26   3   0.01  99.94   0.00   41  99.89   0.00
 7   1  57   0.03 1.87 2.26   3   0.00  99.97   0.00   41
 7   2  58   0.03 1.89 2.26   3   0.00  99.97   0.00   44
 7   3  59   0.03 1.93 2.26   3   0.00  99.97   0.00   44
 7   8  60   0.03 1.97 2.26   3   0.05  99.92   0.00   41
 7   9  61   0.03 2.02 2.26   3   0.00  99.97   0.00   38
 7  10  62   0.02 2.06 2.26   3   0.00  99.97   0.00   39
 7  11  63   0.03 2.12 2.26   3   0.00  99.97   0.00   36


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-09  3:16     ` Mike Galbraith
@ 2013-12-10  5:17       ` Mike Galbraith
  2013-12-10 11:45         ` Ingo Molnar
  2013-12-10 14:29         ` Thomas Gleixner
  0 siblings, 2 replies; 51+ messages in thread
From: Mike Galbraith @ 2013-12-10  5:17 UTC (permalink / raw)
  To: Len Brown
  Cc: Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

Hi Len,

I'm unable to reproduce those DL980 results.  I updated the kernel and
config yesterday, and happened to run turbostat again.. and the box was
nowhere near as quiet.  I ended up spending all day futzing with the
darn thing, checking various config/kernel combos, and none produced the
previous result.

ATM, running the same exact updated kernel as the x3550 is running
(still happily) with only a couple needed drivers added:

vogelweide:~/:[130]# turbostat -P -i 60
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
 0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
 1   0   8   0.04 2.02 2.26   5  63.10  36.86   0.00   41  13.31   0.00
 2   0  16   0.04 1.90 2.26   5  35.70  64.25   0.00   43  37.88   0.00
 3   0  24   0.03 2.08 2.26   5  39.78  60.19   0.00   42  29.00   0.00
 4   0  32   0.03 1.95 2.26   5  14.64  85.33   0.00   37  65.00   0.00
 5   0  40   0.03 1.95 2.26   5  15.96  84.02   0.00   36  74.34   0.00
 6   0  48   0.02 1.99 2.26   5  36.97  63.01   0.00   37  40.20   0.00
 7   0  56   0.02 2.08 2.26   5  57.54  42.44   0.00   44   0.08   0.00
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.03 2.08 2.26   0  31.24  68.73   0.00   53  44.26   0.00
 0   0   0   0.25 1.76 2.26   8  52.96  46.78   0.00   51  31.99   0.00
 1   0   8   0.03 1.96 2.26   8  50.69  49.28   0.00   41  30.80   0.00
 2   0  16   0.04 1.91 2.26   8  36.10  63.85   0.00   42  44.37   0.00
 3   0  24   0.03 1.96 2.26   8  24.15  75.82   0.00   42  59.14   0.00
 4   0  32   0.03 1.94 2.26   8  14.68  85.29   0.00   37  71.64   0.00
 5   0  40   0.03 1.94 2.26   8  16.01  83.96   0.00   36  78.62   0.00
 6   0  48   0.02 2.18 2.26   8  46.79  53.18   0.00   36  11.05   0.00
 7   0  56   0.02 2.01 2.26   8  50.84  49.14   0.00   45  26.50   0.00
pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.03 2.10 2.26   0  30.88  69.09   0.00   51  48.09   0.00
 0   0   0   0.23 1.84 2.26   8  51.98  47.79   0.00   50  31.06   0.00
 1   0   8   0.04 2.05 2.26   8  59.44  40.52   0.00   41  21.86   0.00
 2   0  16   0.04 1.73 2.26   8  22.87  77.09   0.00   41  62.26   0.00
 3   0  24   0.03 1.93 2.26   8   9.83  90.14   0.00   41  77.33   0.00
 4   0  32   0.03 1.90 2.26   8   8.66  91.31   0.00   37  89.16   0.00
 5   0  40   0.03 2.05 2.26   8  27.30  72.67   0.00   35  50.62   0.00
 6   0  48   0.02 2.07 2.26   8  44.81  55.17   0.00   36  18.97   0.00
 7   0  56   0.02 2.00 2.26   8  48.06  51.91   0.00   42  33.44   0.00

That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
dunno, kernel called itself master.  Whatever, I don't like test bogons,
and this is one, so I thought I should let you know.  DL980 refuses to
come close to the previous result.

'nuff of that, off to the slave pits before yet another day evaporates.

-Mike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-07 19:17       ` Mike Galbraith
@ 2013-12-10 11:41         ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-10 11:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Mike Galbraith <bitbucket@online.de> wrote:

> On Sat, 2013-12-07 at 11:45 -0500, Len Brown wrote: 
> > >> It fixes that, except for my Q6600 box.  Too bad mwait_idle() went away,
> > >> beloved old box doesn't play hints game, so it continues to flog itself.
> > 
> > 
> > Thanks for pointing this out, Mike!
> > 
> > A Q6600 is a Kentsfield.  I dug one of those up.
> > Indeed, the only idle capabilities it has are  HALT
> > and old style MWAIT, and the latter is much more effective.
> > running 3.8 it idles at 75 watts.
> > running 3.8 with idle=nomwait it idles at 100 watts,
> > which is what it will do with 3.9 and later due to the patch below.
> > 
> > commit 69fb3676df3329a7142803bb3502fa59dc0db2e3
> > Author: Len Brown <len.brown@intel.com>
> > Date:   Sun Feb 10 01:38:39 2013 -0500
> > 
> >     x86 idle: remove mwait_idle() and "idle=mwait" cmdline param
> > 
> > Kentsfield proves that patch was based on a fault assumption.
> > Sweet box in its day, ECC memory and everything -- probably still
> > a fair number of them running...
> > 
> > Plus, I've found another machine that depends on having an idle=mwait
> > idle loop (A Sony Vaio BIOS SMM code apparently assumes we use it in
> > https://bugzilla.kernel.org/show_bug.cgi?id=60770)
> > 
> > So it looks like I need to (also) restore the simple idle=mwait idle loop
> > to make some machines happy.
> 
> Cool, box will definitely be happier.

I assume old-style MWAIT will be activated automatically on such 
boxes, there's no need to pass in idle=mwait on the boot command line, 
correct?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10  5:17       ` Mike Galbraith
@ 2013-12-10 11:45         ` Ingo Molnar
  2013-12-10 14:29         ` Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-10 11:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Thomas Gleixner, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Mike Galbraith <bitbucket@online.de> wrote:

> Hi Len,
> 
> I'm unable to reproduce those DL980 results.  I updated the kernel and
> config yesterday, and happened to run turbostat again.. and the box was
> nowhere near as quiet.  I ended up spending all day futzing with the
> darn thing, checking various config/kernel combos, and none produced the
> previous result.
> 
> ATM, running the same exact updated kernel as the x3550 is running
> (still happily) with only a couple needed drivers added:
> 
> vogelweide:~/:[130]# turbostat -P -i 60
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
>  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
>  1   0   8   0.04 2.02 2.26   5  63.10  36.86   0.00   41  13.31   0.00
>  2   0  16   0.04 1.90 2.26   5  35.70  64.25   0.00   43  37.88   0.00
>  3   0  24   0.03 2.08 2.26   5  39.78  60.19   0.00   42  29.00   0.00
>  4   0  32   0.03 1.95 2.26   5  14.64  85.33   0.00   37  65.00   0.00
>  5   0  40   0.03 1.95 2.26   5  15.96  84.02   0.00   36  74.34   0.00
>  6   0  48   0.02 1.99 2.26   5  36.97  63.01   0.00   37  40.20   0.00
>  7   0  56   0.02 2.08 2.26   5  57.54  42.44   0.00   44   0.08   0.00
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.03 2.08 2.26   0  31.24  68.73   0.00   53  44.26   0.00
>  0   0   0   0.25 1.76 2.26   8  52.96  46.78   0.00   51  31.99   0.00
>  1   0   8   0.03 1.96 2.26   8  50.69  49.28   0.00   41  30.80   0.00
>  2   0  16   0.04 1.91 2.26   8  36.10  63.85   0.00   42  44.37   0.00
>  3   0  24   0.03 1.96 2.26   8  24.15  75.82   0.00   42  59.14   0.00
>  4   0  32   0.03 1.94 2.26   8  14.68  85.29   0.00   37  71.64   0.00
>  5   0  40   0.03 1.94 2.26   8  16.01  83.96   0.00   36  78.62   0.00
>  6   0  48   0.02 2.18 2.26   8  46.79  53.18   0.00   36  11.05   0.00
>  7   0  56   0.02 2.01 2.26   8  50.84  49.14   0.00   45  26.50   0.00
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.03 2.10 2.26   0  30.88  69.09   0.00   51  48.09   0.00
>  0   0   0   0.23 1.84 2.26   8  51.98  47.79   0.00   50  31.06   0.00
>  1   0   8   0.04 2.05 2.26   8  59.44  40.52   0.00   41  21.86   0.00
>  2   0  16   0.04 1.73 2.26   8  22.87  77.09   0.00   41  62.26   0.00
>  3   0  24   0.03 1.93 2.26   8   9.83  90.14   0.00   41  77.33   0.00
>  4   0  32   0.03 1.90 2.26   8   8.66  91.31   0.00   37  89.16   0.00
>  5   0  40   0.03 2.05 2.26   8  27.30  72.67   0.00   35  50.62   0.00
>  6   0  48   0.02 2.07 2.26   8  44.81  55.17   0.00   36  18.97   0.00
>  7   0  56   0.02 2.00 2.26   8  48.06  51.91   0.00   42  33.44   0.00
> 
> That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What 
> happened, dunno, kernel called itself master.  Whatever, I don't 
> like test bogons, and this is one, so I thought I should let you 
> know.  DL980 refuses to come close to the previous result.
> 
> 'nuff of that, off to the slave pits before yet another day 
> evaporates.

An ftrace function trace of the anomaly would be awfully useful.

(or a trace generated of a dozen strategically placed trace_printk()s 
debug lines tracing the guts of cpuidle.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10  5:17       ` Mike Galbraith
  2013-12-10 11:45         ` Ingo Molnar
@ 2013-12-10 14:29         ` Thomas Gleixner
  2013-12-10 15:06           ` Ingo Molnar
  2013-12-11  2:05           ` Thomas Gleixner
  1 sibling, 2 replies; 51+ messages in thread
From: Thomas Gleixner @ 2013-12-10 14:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Tue, 10 Dec 2013, Mike Galbraith wrote:
> vogelweide:~/:[130]# turbostat -P -i 60
> pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
>              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
>  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00

...
 
> That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
> dunno, kernel called itself master.  Whatever, I don't like test bogons,
> and this is one, so I thought I should let you know.  DL980 refuses to
> come close to the previous result.

Hmm, that's very similar to the issue Len is seing on his quad socket WM.

But I can't see it on my dual socket SNB, neither does PeterZ on his
dual socket WM machine. Heisenbugs ....

Can you offline all cpus except core 0, enable all trace events (nop
tracer), run "sleep 100" and stop and grab the trace? Please make sure
that your tracebuffer is big enough to cover the full thing.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10 14:29         ` Thomas Gleixner
@ 2013-12-10 15:06           ` Ingo Molnar
  2013-12-11  2:05           ` Thomas Gleixner
  1 sibling, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-10 15:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Tue, 10 Dec 2013, Mike Galbraith wrote:
> > vogelweide:~/:[130]# turbostat -P -i 60
> > pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
> >              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
> >  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
> 
> ...
>  
> > That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
> > dunno, kernel called itself master.  Whatever, I don't like test bogons,
> > and this is one, so I thought I should let you know.  DL980 refuses to
> > come close to the previous result.
> 
> Hmm, that's very similar to the issue Len is seing on his quad 
> socket WM.
> 
> But I can't see it on my dual socket SNB, neither does PeterZ on his 
> dual socket WM machine. Heisenbugs ....
> 
> Can you offline all cpus except core 0, enable all trace events (nop 
> tracer), run "sleep 100" and stop and grab the trace? Please make 
> sure that your tracebuffer is big enough to cover the full thing.

I guess because we are facing a Heisenbug it might be prudent to check 
that the bug is still present with all those cpus offlined and with 
tracing enabled :-/

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-10 14:29         ` Thomas Gleixner
  2013-12-10 15:06           ` Ingo Molnar
@ 2013-12-11  2:05           ` Thomas Gleixner
  2013-12-11  3:21             ` Mike Galbraith
  1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2013-12-11  2:05 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86, Borislav Petkov

On Tue, 10 Dec 2013, Thomas Gleixner wrote:
> On Tue, 10 Dec 2013, Mike Galbraith wrote:
> > vogelweide:~/:[130]# turbostat -P -i 60
> > pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
> >              0.02 2.12 2.26   0  43.40  56.57   0.00   53  33.81   0.00
> >  0   0   0   0.23 2.10 2.26   5  65.47  34.30   0.00   52  10.69   0.00
> 
> ...
>  
> > That, vs solid >99% pc3 for old 3.0 enterprise kernel.  What happened,
> > dunno, kernel called itself master.  Whatever, I don't like test bogons,
> > and this is one, so I thought I should let you know.  DL980 refuses to
> > come close to the previous result.
> 
> Hmm, that's very similar to the issue Len is seing on his quad socket WM.
> 
> But I can't see it on my dual socket SNB, neither does PeterZ on his
> dual socket WM machine. Heisenbugs ....
> 
> Can you offline all cpus except core 0, enable all trace events (nop
> tracer), run "sleep 100" and stop and grab the trace? Please make sure
> that your tracebuffer is big enough to cover the full thing.

So Boris found a quad socket WM which shows the issue and he also
confirmed that dual socket ones cannot reproduce it...

The trace Boris provided shows similar bogosity as you and Len are
seing. Decoding it shows:

Sleep time	state=1	state=2	state=3	state=4
  0 -  49 ms:	202	  6	 32	948
 50 -  99 ms:	 15	  0	  0	 66
100 - 149 ms:	  3	  0	  0	 12
150 - 199 ms:	  1	  0	  0	  0

So we actually sleep for more than 150ms in C1!?! Lets look at the
trace:

          <idle>-0     [000] d...  1977.728444: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728446: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728448: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728450: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728452: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728453: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728455: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728456: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728458: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728460: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728462: cpu_idle: state=4 cpu_id=0
          <idle>-0     [000] ....  1977.728463: cpu_idle: state=4294967295 cpu_id=0
          <idle>-0     [000] d...  1977.728465: cpu_idle: state=1 cpu_id=0
          <idle>-0     [000] d.h.  1977.891539: local_timer_entry: vector=239

So what happens is that SIX consecutive calls into the idle machinery
with state=4 return right away. That means either need_resched() is
set, which is nowhere indicated by the trace and would be causing a
reschedule in between. And after that the idle governor decides to go
into state=1 where we stay for a whopping 163ms.

Now I wanted to analyze that a bit more and asked Boris to apply the
debug patch below.

Drumroll, then laugther and then banging the head against the wall:

Sleep time	state=1	state=2	state=3	state=4
  0 -  49 ms:	  8	  3	112	2849
 50 -  99 ms:	  0	  0	  0 	 419
100 - 149 ms:	  0	  0	  0	  84
150 - 199 ms:	  0	  0	  0	  21
200 - 249 ms:	  0	  0	  0	   3

Also turbostat confirms that the machine is staying in its deepest
power state as much as it can.

I neither can observe similar massive bouncing in and out of a state
as in the above trace. There are a few lonely ones, but they do not
do any damage.

Now what does this patch change?

Nothing in the logic, it merily changes the timing of that idle
code by adding trace_printks.

And what did the x86 -> generic idle code change do ?

 Lets ignore the subtle SMP issues vs. the polling flag which are
 fixed in current mainline and completely irrelevant for an idle UP
 machine. And all the traces are from an UP machine (all cores
 offlined except 0)

 It changed the timing as well.

So the trace addons making the issue go away AND the fact that nobody
can reproduce the issue on a dual socket machine makes me wonder even
more.

Now there is an important data point:

  The problem is timing sensitive and the issue is only exposed on
  quad socket machines.

So I looked at the code pathes once more and the only way we can
bounce out of that is that the mwait in intel_idle() comes back
prematurely.

Now what makes mwait come back early? Especially in a situation where
only a single core is involved in code execution?

Definitely not a change in the idle code, which is relating to UP
exacly ZERO, except the fact that we touch the cpu local nmi watchdog
variable via touch_nmi_watchdog() inside the irq disabled region
instead of the previous irq enabled region.

So what makes mwait come back right away?

Lets look at a snippet of the trace which was taken with the debug
patch applied:

<idle>-0  [000] d... 1747.667184: menu_select: expected_us: 39875 predicted_us: 22697

So we have according to the nohz code 39875 usec idle time ahead. The
governer heuristics make that: 22697 predicted time.

<idle>-0  [000] d... 1747.667187: menu_select: Trying idle state 1 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667188: menu_select: residency 6
<idle>-0  [000] d... 1747.667188: menu_select: exit_latency 3 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667189: menu_select: Trying idle state 2 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667190: menu_select: residency 20
<idle>-0  [000] d... 1747.667191: menu_select: exit_latency 10 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667191: menu_select: Trying idle state 3 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667192: menu_select: residency 80
<idle>-0  [000] d... 1747.667193: menu_select: exit_latency 20 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667194: menu_select: Trying idle state 4 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667194: menu_select: residency 800
<idle>-0  [000] d... 1747.667195: menu_select: exit_latency 200 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667195: cpu_idle: state=4 cpu_id=0

So the governor settles on state=4 which is the deepest one on this machine.

<idle>-0  [000] .... 1747.667197: cpuidle_idle_call: coupled 0: entered state 4

And TWO micro seconds later we are out of that state, but
need_resched() is NOT set and we did not get an interrupt and nothing
could have fiddled with the thread_info->state cacheline simply
because there is only ONE cpu online, i.e. CPU0 the very cpu which
trace we are staring at.

<idle>-0  [000] .... 1747.667198: cpu_idle: state=4294967295 cpu_id=0
<idle>-0  [000] d... 1747.667199: menu_select: expected_us: 39875 predicted_us: 19860

So we still have an expected 39875 usec sleep time ahead, which is
nonsense, but understandable because the nohz idle code did not go
through an update. But of course the immediate return takes the
residence of about 0 usec in state=4 into account and reduces the
predicted_us to 19860 usec

<idle>-0  [000] d... 1747.667201: menu_select: Trying idle state 1 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667201: menu_select: residency 6
<idle>-0  [000] d... 1747.667202: menu_select: exit_latency 3 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667203: menu_select: Trying idle state 2 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667204: menu_select: residency 20
<idle>-0  [000] d... 1747.667204: menu_select: exit_latency 10 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667205: menu_select: Trying idle state 3 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667206: menu_select: residency 80
<idle>-0  [000] d... 1747.667206: menu_select: exit_latency 20 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667207: menu_select: Trying idle state 4 s->dis 0 su->dis 0
<idle>-0  [000] d... 1747.667208: menu_select: residency 800
<idle>-0  [000] d... 1747.667208: menu_select: exit_latency 200 vs. 2000000000 multiplier 1
<idle>-0  [000] d... 1747.667209: cpu_idle: state=4 cpu_id=0

Now here we get lucky for whatever reasons and actually stay in state
4 until a real reason for leaving it happens.

<idle>-0  [000] d.h.  1747.669851: irq_handler_entry: irq=120 name=eth0-TxRx-0

So now the subtle difference between the old x86 idle code and the
generic one is that the new code fiddles with the polling state in
thread_info->state more than necessary. We need to fix that, but
nevertheless this whole mwait thing seems to be broken in some way or
the other and ever worked by chance.

So the difference boils down to:

+ modify(idle->thread_info->status)

  <some random code>

  mwait(idle->thread_info->status)

So on an UP machine the only CPU which cares about that state is that
very same idle thread which goes to sleep. Now there are a few
questions:

1) Why does writing to that cacheline from the very same cpu just
   before going to sleep cause an issue?

2) Why does it come back less often if the timing is different,
   e.g.when the debug patch is active?

3) Why does adding a mb() instead of the trace printks make the thing
   NOT go away?

   Is it really just timing related? I'm going to ask Boris to replace
   the mb() with a plain udelay(10) tomorrow.

4) Why can't we observe that on dual socket machines?

And for the record, I have analyzed traces before the switch to the
generic idle code which show the same drop in/ drop out issue on an UP
environment:

  <idle>-0     [000] d...   307.256288: cpu_idle: state=4 cpu_id=0
  <idle>-0     [000] ....   307.256289: cpu_idle: state=4294967295 cpu_id=0
  <idle>-0     [000] d...   307.256291: cpu_idle: state=4 cpu_id=0
  <idle>-0     [000] d.h.   307.257337: irq_handler_entry: irq=21 name=ata_piix

Yes, it does not cause the same issue as you can observe with the
current code, but definitely there is something extremly fishy in an
UP situation also there.

So it boils down to the following:

If something fiddles with the cache line which is monitored by mwait()
close before the mwait() happens then mwait() is playing silly
buggers. And that sillyness depends on the number of sockets.

And what's even worse is that we get a regression report for probably
exposing hardware sillyness. Something which should have been known
already - but unfortunately never has been documented as a requirement
for this mwait() stuff to work proper.

Sorry to be blunt about that. If stuff like this happens:

  <idle>-0     [000] d...   307.256288: cpu_idle: state=4 cpu_id=0
  <idle>-0     [000] ....   307.256289: cpu_idle: state=4294967295 cpu_id=0
  <idle>-0     [000] d...   307.256291: cpu_idle: state=4 cpu_id=0

on a kernel prior to that change in an UP scenario, then someone
should have investigated that years ago. But the preferred solution
was obviously to ignore that because it did not do substantial damage.

Now that the damage comes to light you cry murder for the completely
wrong reason.

The change exposes a shortcoming in mwait() rather than the more
obvious performance issue which mwait() is supposed to avoid in the
first place. Mike DID notice it and it got fixed, but of course it did
not heal the now obvious shortcomings of mwait()

Thanks,

	tglx
---

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index a55e68f..b6399af 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -118,7 +118,7 @@ int cpuidle_idle_call(void)
 	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
 	struct cpuidle_driver *drv;
 	int next_state, entered_state;
-	bool broadcast;
+	bool broadcast, coupled = false;

 	if (off || !initialized)
 		return -ENODEV;
@@ -147,15 +147,18 @@ int cpuidle_idle_call(void)
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);

-	if (cpuidle_state_is_coupled(dev, drv, next_state))
+	if (cpuidle_state_is_coupled(dev, drv, next_state)) {
 		entered_state = cpuidle_enter_state_coupled(dev, drv,
 							    next_state);
-	else
+		coupled = true;
+	} else
 		entered_state = cpuidle_enter_state(dev, drv, next_state);

 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);

+	trace_printk("coupled %d: entered state %d\n", coupled, entered_state);
+
 	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);

 	/* give the governor an opportunity to reflect on the outcome */
diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index cf7f2f0..9de7ee2 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -309,7 +309,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 	data->expected_us =
 		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;

-
 	data->bucket = which_bucket(data->expected_us);

 	multiplier = performance_multiplier();
@@ -330,6 +329,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 					 data->correction_factor[data->bucket],
 					 RESOLUTION * DECAY);

+	trace_printk("expected_us: %d predicted_us: %d\n", data->expected_us,
+		     data->predicted_us);
+
 	get_typical_interval(data);

 	/*
@@ -349,10 +351,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
 		struct cpuidle_state *s = &drv->states[i];
 		struct cpuidle_state_usage *su = &dev->states_usage[i];

+		trace_printk("Trying idle state %d s->dis %d su->dis %d\n", i,
+			     s->disabled, su->disable);
 		if (s->disabled || su->disable)
 			continue;
+		trace_printk("residency %d\n", s->target_residency);
 		if (s->target_residency > data->predicted_us)
 			continue;
+		trace_printk("exit_latency %d vs. %d multiplier %d\n",
+			     s->exit_latency, latency_req, multiplier);
 		if (s->exit_latency > latency_req)
 			continue;
 		if (s->exit_latency * multiplier > data->predicted_us)

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11  2:05           ` Thomas Gleixner
@ 2013-12-11  3:21             ` Mike Galbraith
  2013-12-11 11:28               ` Thomas Gleixner
  0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-11  3:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86, Borislav Petkov

Alakazam..

pk cor CPU    %c0  GHz  TSC SMI    %c1    %c3    %c6 CTMP   %pc3   %pc6
             0.17 2.01 2.26   0   0.02  99.82   0.00   49  99.55   0.00
 0   0   0   0.95 1.45 2.26   2   0.43  98.62   0.00   48  98.48   0.00
 1   0   8   0.24 1.99 2.26   2   0.02  99.75   0.00   38  99.68   0.00
 2   0  16   0.17 1.97 2.26   2   0.02  99.81   0.00   40  99.65   0.00
 3   0  24   0.18 1.92 2.26   2   0.02  99.80   0.00   41  99.68   0.00
 4   0  32   0.18 1.95 2.26   2   0.02  99.80   0.00   36  99.66   0.00
 5   0  40   0.15 1.85 2.26   0   0.03  99.83   0.00   35  99.70   0.00
 6   0  48   0.10 1.83 2.26   0   0.01  99.89   0.00   36  99.79   0.00
 7   0  56   0.10 1.97 2.26   0   0.01  99.89   0.00   43  99.75   0.00

Yup, magical gremlin repellent works on 8 socket DL980 too.

> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
> index a55e68f..b6399af 100644
> --- a/drivers/cpuidle/cpuidle.c
> +++ b/drivers/cpuidle/cpuidle.c
> @@ -118,7 +118,7 @@ int cpuidle_idle_call(void)
>  	struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
>  	struct cpuidle_driver *drv;
>  	int next_state, entered_state;
> -	bool broadcast;
> +	bool broadcast, coupled = false;
>  
>  	if (off || !initialized)
>  		return -ENODEV;
> @@ -147,15 +147,18 @@ int cpuidle_idle_call(void)
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &dev->cpu);
>  
> -	if (cpuidle_state_is_coupled(dev, drv, next_state))
> +	if (cpuidle_state_is_coupled(dev, drv, next_state)) {
>  		entered_state = cpuidle_enter_state_coupled(dev, drv,
>  							    next_state);
> -	else
> +		coupled = true;
> +	} else
>  		entered_state = cpuidle_enter_state(dev, drv, next_state);
>  
>  	if (broadcast)
>  		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
>  
> +	trace_printk("coupled %d: entered state %d\n", coupled, entered_state);
> +
>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, dev->cpu);
>  
>  	/* give the governor an opportunity to reflect on the outcome */
> diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
> index cf7f2f0..9de7ee2 100644
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@ -309,7 +309,6 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  	data->expected_us =
>  		t.tv_sec * USEC_PER_SEC + t.tv_nsec / NSEC_PER_USEC;
>  
> -
>  	data->bucket = which_bucket(data->expected_us);
>  
>  	multiplier = performance_multiplier();
> @@ -330,6 +329,9 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  					 data->correction_factor[data->bucket],
>  					 RESOLUTION * DECAY);
>  
> +	trace_printk("expected_us: %d predicted_us: %d\n", data->expected_us,
> +		     data->predicted_us);
> +
>  	get_typical_interval(data);
>  
>  	/*
> @@ -349,10 +351,15 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
>  		struct cpuidle_state *s = &drv->states[i];
>  		struct cpuidle_state_usage *su = &dev->states_usage[i];
>  
> +		trace_printk("Trying idle state %d s->dis %d su->dis %d\n", i,
> +			     s->disabled, su->disable);
>  		if (s->disabled || su->disable)
>  			continue;
> +		trace_printk("residency %d\n", s->target_residency);
>  		if (s->target_residency > data->predicted_us)
>  			continue;
> +		trace_printk("exit_latency %d vs. %d multiplier %d\n",
> +			     s->exit_latency, latency_req, multiplier);
>  		if (s->exit_latency > latency_req)
>  			continue;
>  		if (s->exit_latency * multiplier > data->predicted_us)
> 



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11  3:21             ` Mike Galbraith
@ 2013-12-11 11:28               ` Thomas Gleixner
  2013-12-11 11:38                 ` Borislav Petkov
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2013-12-11 11:28 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86, Borislav Petkov

On Wed, 11 Dec 2013, Mike Galbraith wrote:
> Alakazam..
> Yup, magical gremlin repellent works on 8 socket DL980 too.

Now here is a less magical version of the gremlin repellent.

And just for the amusement value: The erratum for the series 7400
says:

AAI65. MONITOR/MWAIT May Have Excessive False Wakeups

Problem:	Normally, if MWAIT is used to enter a C-state that is
		C1 or higher, a store to the address range armed by
		the MONITOR instruction will cause the processor to
		exit MWAIT. Due to this erratum, false wakeups may
		occur when the monitored address range was recently
		written prior to executing the MONITOR instruction.

Implication:	Due to this erratum, performance and power savings may
		be impacted due to excessive false wakeups.

Workaround: 	Execute a CLFLUSH Instruction immediately before every
		MONITOR instruction when the monitored location may
		have been recently written.

Now that looks like the very same issue on these westmere EX
machines.

These false wakeups can be observed already before the idle changes
and now they are just more prominent.

Adding that clflush() unconditionally fixes the issue at least on
Boris machine.

Mike, can you retest on that 8 socket monstrum, please?

So it looks like the idle power regression is actually a software
change which exhibits a hardware "regression".

So much for proper validated advertising which promises core power 0W
at idle for these beasts :)

Thanks,

	tglx
---
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..50299ad 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -376,7 +376,7 @@ static int intel_idle(struct cpuidle_device *dev,
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
 	if (!current_set_polling_and_test()) {
-
+		clflush(&current_thread_info()->flags);
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:28               ` Thomas Gleixner
@ 2013-12-11 11:38                 ` Borislav Petkov
  2013-12-11 11:52                   ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2013-12-11 11:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, Len Brown, Peter Zijlstra, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 12:28:36PM +0100, Thomas Gleixner wrote:
> On Wed, 11 Dec 2013, Mike Galbraith wrote:
> > Alakazam..
> > Yup, magical gremlin repellent works on 8 socket DL980 too.
> 
> Now here is a less magical version of the gremlin repellent.
> 
> And just for the amusement value: The erratum for the series 7400
> says:
> 
> AAI65. MONITOR/MWAIT May Have Excessive False Wakeups
> 
> Problem:	Normally, if MWAIT is used to enter a C-state that is
> 		C1 or higher, a store to the address range armed by
> 		the MONITOR instruction will cause the processor to
> 		exit MWAIT. Due to this erratum, false wakeups may
> 		occur when the monitored address range was recently
> 		written prior to executing the MONITOR instruction.
> 
> Implication:	Due to this erratum, performance and power savings may
> 		be impacted due to excessive false wakeups.
> 
> Workaround: 	Execute a CLFLUSH Instruction immediately before every
> 		MONITOR instruction when the monitored location may
> 		have been recently written.
> 
> Now that looks like the very same issue on these westmere EX
> machines.
> 
> These false wakeups can be observed already before the idle changes
> and now they are just more prominent.
> 
> Adding that clflush() unconditionally fixes the issue at least on
> Boris machine.
> 
> Mike, can you retest on that 8 socket monstrum, please?
> 
> So it looks like the idle power regression is actually a software
> change which exhibits a hardware "regression".

Right, if it turns out that this is really the case and that this
erratum hasn't been fixed for models later than 29 - we'd need the
additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:38                 ` Borislav Petkov
@ 2013-12-11 11:52                   ` Peter Zijlstra
  2013-12-11 12:29                     ` Mike Galbraith
  2013-12-11 21:43                     ` Len Brown
  0 siblings, 2 replies; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-11 11:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Mike Galbraith, Len Brown, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> Right, if it turns out that this is really the case and that this
> erratum hasn't been fixed for models later than 29 - we'd need the
> additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.

You also need: https://lkml.org/lkml/2013/11/19/143

Because obviously not all mwait idle loops check that cpu bit.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:52                   ` Peter Zijlstra
@ 2013-12-11 12:29                     ` Mike Galbraith
  2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 21:43                     ` Len Brown
  1 sibling, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-11 12:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > Right, if it turns out that this is really the case and that this
> > erratum hasn't been fixed for models later than 29 - we'd need the
> > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> 
> You also need: https://lkml.org/lkml/2013/11/19/143
> 
> Because obviously not all mwait idle loops check that cpu bit.

I had tried that patch, to see if it would magically make the thing
start working, nope.  I had also tried...

---
 drivers/idle/intel_idle.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/idle/intel_idle.c
===================================================================
--- linux-2.6.orig/drivers/idle/intel_idle.c
+++ linux-2.6/drivers/idle/intel_idle.c
@@ -376,11 +376,14 @@ static int intel_idle(struct cpuidle_dev
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
 	if (!current_set_polling_and_test()) {
-
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())
 			__mwait(eax, ecx);
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
 	}
 
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))

..a cflush before _and_ after, among other (shazam!.. darn) guesses, but
nogo.  Turning that into the tglx one liner indeed did fix the thing, as
did adding this to your patch.

---
 arch/x86/include/asm/mwait.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/x86/include/asm/mwait.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/mwait.h
+++ linux-2.6/arch/x86/include/asm/mwait.h
@@ -43,7 +43,7 @@ static inline void __mwait(unsigned long
 static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
 {
 	if (!current_set_polling_and_test()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+//		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
 			clflush((void *)&current_thread_info()->flags);
 
 		__monitor((void *)&current_thread_info()->flags, 0, 0);



Grrr.

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2
ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt lahf_lm dtherm
tpr_shadow vnmi flexpriority ept vpid

-Mike

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:29                     ` Mike Galbraith
@ 2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 13:10                         ` Mike Galbraith
                                           ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-11 12:43 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote:
> On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > > Right, if it turns out that this is really the case and that this
> > > erratum hasn't been fixed for models later than 29 - we'd need the
> > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> > 
> > You also need: https://lkml.org/lkml/2013/11/19/143
> > 
> > Because obviously not all mwait idle loops check that cpu bit.
> 
> I had tried that patch, to see if it would magically make the thing
> start working, nope.  I had also tried...

> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +			clflush((void *)&current_thread_info()->flags);

Yeah, you need a bit extra to enable that feature bit for your CPU as
bpetkov said.

Something like the below.. someone needs to double check and possibly
add SNB/IVB EX parts if they're already available.

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0dff939..015642eed045 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,8 +387,15 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
-		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
+	if (c->x86 == 6 && cpu_has_clflush) {
+		switch (c->x86_model) {
+		case 29: /* Core2 EX */
+		case 46: /* NHM EX */
+		case 47: /* WSM EX */
+			set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
+			break;
+		}
+	}
 
 #ifdef CONFIG_X86_64
 	if (c->x86 == 15)

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:43                       ` Peter Zijlstra
@ 2013-12-11 13:10                         ` Mike Galbraith
  2013-12-11 13:40                         ` Borislav Petkov
  2013-12-11 14:42                         ` Ingo Molnar
  2 siblings, 0 replies; 51+ messages in thread
From: Mike Galbraith @ 2013-12-11 13:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, 2013-12-11 at 13:43 +0100, Peter Zijlstra wrote: 
> On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote:
> > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > > > Right, if it turns out that this is really the case and that this
> > > > erratum hasn't been fixed for models later than 29 - we'd need the
> > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> > > 
> > > You also need: https://lkml.org/lkml/2013/11/19/143
> > > 
> > > Because obviously not all mwait idle loops check that cpu bit.
> > 
> > I had tried that patch, to see if it would magically make the thing
> > start working, nope.  I had also tried...
> 
> > +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > +			clflush((void *)&current_thread_info()->flags);
> 
> Yeah, you need a bit extra to enable that feature bit for your CPU as
> bpetkov said.

Works for me, one more for the stable bucket.

So as soon as Len resurrects mwait_idle for Q6600 (and other core2 when
booted max_cstates=1 so tsc clocksource is used instead of pos hpet),
all the (known) idle regressions should be history.

-Mike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 13:10                         ` Mike Galbraith
@ 2013-12-11 13:40                         ` Borislav Petkov
  2013-12-11 14:56                           ` Ingo Molnar
  2013-12-11 14:42                         ` Ingo Molnar
  2 siblings, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2013-12-11 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Thomas Gleixner, Len Brown, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> Something like the below.. someone needs to double check and possibly
> add SNB/IVB EX parts if they're already available.

Right, our friends at Intel would need to tell us which families/models
does AAI65 span... if, this is actually the case.

But now that Mike's box is fixed too, it very much looks like it.

Oh, and AAIxx errata are Intel Xeon Processor 7400 Series ones, we would
also need to know whether other incarnations are affected too.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 12:43                       ` Peter Zijlstra
  2013-12-11 13:10                         ` Mike Galbraith
  2013-12-11 13:40                         ` Borislav Petkov
@ 2013-12-11 14:42                         ` Ingo Molnar
  2013-12-11 15:02                           ` Thomas Gleixner
  2013-12-11 16:44                           ` Peter Zijlstra
  2 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-11 14:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 01:29:15PM +0100, Mike Galbraith wrote:
> > On Wed, 2013-12-11 at 12:52 +0100, Peter Zijlstra wrote: 
> > > On Wed, Dec 11, 2013 at 12:38:39PM +0100, Borislav Petkov wrote:
> > > > Right, if it turns out that this is really the case and that this
> > > > erratum hasn't been fixed for models later than 29 - we'd need the
> > > > additional model numbers to set X86_FEATURE_CLFLUSH_MONITOR correctly.
> > > 
> > > You also need: https://lkml.org/lkml/2013/11/19/143
> > > 
> > > Because obviously not all mwait idle loops check that cpu bit.
> > 
> > I had tried that patch, to see if it would magically make the thing
> > start working, nope.  I had also tried...
> 
> > +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> > +			clflush((void *)&current_thread_info()->flags);
> 
> Yeah, you need a bit extra to enable that feature bit for your CPU as
> bpetkov said.
> 
> Something like the below.. someone needs to double check and possibly
> add SNB/IVB EX parts if they're already available.
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0dff939..015642eed045 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -387,8 +387,15 @@ static void init_intel(struct cpuinfo_x86 *c)
>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>  	}
>  
> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
> -		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
> +	if (c->x86 == 6 && cpu_has_clflush) {
> +		switch (c->x86_model) {
> +		case 29: /* Core2 EX */
> +		case 46: /* NHM EX */
> +		case 47: /* WSM EX */
> +			set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
> +			break;
> +		}
> +	}

Another thing that is required I think is to issue a write barrier 
before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
the documentation CLFLUSH does not appear to be ordered (at all), so 
it might execute before the modification to the affected memory?


So something like:

		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
			smp_wmb(); /* order CLFLUSH */
			clflush(&current_thread_info()->flags);
		}

possibly put behind some utility function, smp_clflush() or so, hiding 
the CPU feature bit check as well:

		smp_clflush(&current_thread_info()->flags);

or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 13:40                         ` Borislav Petkov
@ 2013-12-11 14:56                           ` Ingo Molnar
  2013-12-11 16:02                             ` Borislav Petkov
  2013-12-11 16:43                             ` Peter Zijlstra
  0 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-11 14:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Borislav Petkov <bp@alien8.de> wrote:

> On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> > Something like the below.. someone needs to double check and possibly
> > add SNB/IVB EX parts if they're already available.
> 
> Right, our friends at Intel would need to tell us which families/models
> does AAI65 span... if, this is actually the case.

I think CLFLUSH should be pretty universally available, IIRC graphics 
drivers were using it rather heavily in combination with 
write-combining MTRRs, both on Linux and on Windows.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:42                         ` Ingo Molnar
@ 2013-12-11 15:02                           ` Thomas Gleixner
  2013-12-11 15:09                             ` Ingo Molnar
  2013-12-11 16:44                           ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2013-12-11 15:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, 11 Dec 2013, Ingo Molnar wrote:
> 
> Another thing that is required I think is to issue a write barrier 
> before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> the documentation CLFLUSH does not appear to be ordered (at all), so 
> it might execute before the modification to the affected memory?

We already have a write barrier in the code which modifies
current_thread_info()->flags.
 
Thanks,

	tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 15:02                           ` Thomas Gleixner
@ 2013-12-11 15:09                             ` Ingo Molnar
  2013-12-11 16:44                               ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2013-12-11 15:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 11 Dec 2013, Ingo Molnar wrote:
> > 
> > Another thing that is required I think is to issue a write barrier 
> > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > it might execute before the modification to the affected memory?
> 
> We already have a write barrier in the code which modifies
> current_thread_info()->flags.

Which code is that, could you please cite it? (I tried finding it but 
my grep-fu failed me.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:56                           ` Ingo Molnar
@ 2013-12-11 16:02                             ` Borislav Petkov
  2013-12-11 16:43                             ` Peter Zijlstra
  1 sibling, 0 replies; 51+ messages in thread
From: Borislav Petkov @ 2013-12-11 16:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote:
> I think CLFLUSH should be pretty universally available, IIRC
> graphics drivers were using it rather heavily in combination with
> write-combining MTRRs, both on Linux and on Windows.

... and it is also very expensive. So I don't think it would be in
Intel's best interest to do CLFLUSH unconditionally on all families but
only on those which really need to.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:56                           ` Ingo Molnar
  2013-12-11 16:02                             ` Borislav Petkov
@ 2013-12-11 16:43                             ` Peter Zijlstra
  2013-12-11 17:50                               ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-11 16:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote:
> 
> * Borislav Petkov <bp@alien8.de> wrote:
> 
> > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> > > Something like the below.. someone needs to double check and possibly
> > > add SNB/IVB EX parts if they're already available.
> > 
> > Right, our friends at Intel would need to tell us which families/models
> > does AAI65 span... if, this is actually the case.
> 
> I think CLFLUSH should be pretty universally available, IIRC graphics 
> drivers were using it rather heavily in combination with 
> write-combining MTRRs, both on Linux and on Windows.

The availability isn't the problem; the cost is. We shouldn't issue one
if its not required. Only 'broken' EX hardware needs it.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 14:42                         ` Ingo Molnar
  2013-12-11 15:02                           ` Thomas Gleixner
@ 2013-12-11 16:44                           ` Peter Zijlstra
  2013-12-11 17:47                             ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-11 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:42:38PM +0100, Ingo Molnar wrote:
> Another thing that is required I think is to issue a write barrier 
> before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> the documentation CLFLUSH does not appear to be ordered (at all), so 
> it might execute before the modification to the affected memory?
> 
> 
> So something like:
> 
> 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> 			smp_wmb(); /* order CLFLUSH */
> 			clflush(&current_thread_info()->flags);
> 		}

smp_wmb() is a NO-OP on x86 remember :-)

Also, a wmb doesn't actually need to flush the store buffers.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 15:09                             ` Ingo Molnar
@ 2013-12-11 16:44                               ` Peter Zijlstra
  2013-12-11 17:48                                 ` Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-11 16:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Thomas Gleixner, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, Dec 11, 2013 at 04:09:11PM +0100, Ingo Molnar wrote:
> 
> * Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 11 Dec 2013, Ingo Molnar wrote:
> > > 
> > > Another thing that is required I think is to issue a write barrier 
> > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > > it might execute before the modification to the affected memory?
> > 
> > We already have a write barrier in the code which modifies
> > current_thread_info()->flags.
> 
> Which code is that, could you please cite it? (I tried finding it but 
> my grep-fu failed me.)

current_set_polling_and_test().

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 16:44                           ` Peter Zijlstra
@ 2013-12-11 17:47                             ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-11 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mike Galbraith, Borislav Petkov, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 03:42:38PM +0100, Ingo Molnar wrote:
> > Another thing that is required I think is to issue a write barrier 
> > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > it might execute before the modification to the affected memory?
> > 
> > 
> > So something like:
> > 
> > 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR)) {
> > 			smp_wmb(); /* order CLFLUSH */
> > 			clflush(&current_thread_info()->flags);
> > 		}
> 
> smp_wmb() is a NO-OP on x86 remember :-)

Well, it's a compiler barrier but yes - I suspect a smp_mb() might be 
needed - at least according to the CLFLUSH documentation it has no 
implicit guaranteed ordering wrt. preceding writes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 16:44                               ` Peter Zijlstra
@ 2013-12-11 17:48                                 ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2013-12-11 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mike Galbraith, Borislav Petkov, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 04:09:11PM +0100, Ingo Molnar wrote:
> > 
> > * Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > On Wed, 11 Dec 2013, Ingo Molnar wrote:
> > > > 
> > > > Another thing that is required I think is to issue a write barrier 
> > > > before CLFLUSH instruction. By my (possibly incorrect ...) reading of 
> > > > the documentation CLFLUSH does not appear to be ordered (at all), so 
> > > > it might execute before the modification to the affected memory?
> > > 
> > > We already have a write barrier in the code which modifies
> > > current_thread_info()->flags.
> > 
> > Which code is that, could you please cite it? (I tried finding it but 
> > my grep-fu failed me.)
> 
> current_set_polling_and_test().

Thx, those variants have the right barrier indeed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 16:43                             ` Peter Zijlstra
@ 2013-12-11 17:50                               ` Ingo Molnar
  2013-12-11 23:08                                 ` H. Peter Anvin
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2013-12-11 17:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 03:56:55PM +0100, Ingo Molnar wrote:
> > 
> > * Borislav Petkov <bp@alien8.de> wrote:
> > 
> > > On Wed, Dec 11, 2013 at 01:43:52PM +0100, Peter Zijlstra wrote:
> > > > Something like the below.. someone needs to double check and possibly
> > > > add SNB/IVB EX parts if they're already available.
> > > 
> > > Right, our friends at Intel would need to tell us which 
> > > families/models does AAI65 span... if, this is actually the 
> > > case.
> > 
> > I think CLFLUSH should be pretty universally available, IIRC 
> > graphics drivers were using it rather heavily in combination with 
> > write-combining MTRRs, both on Linux and on Windows.
> 
> The availability isn't the problem; the cost is. We shouldn't issue 
> one if its not required. Only 'broken' EX hardware needs it.

Well, availability could be a problem too, if some CPU (real or 
virtual) implements MWAIT but not CLFLUSH.

In theory we could make mwait an alternatives variant and patch in the 
right combination of instructions? The CLFLUSH goes to the same 
address as on which the monitoring happens, so it could be considered 
one meta-instruction.

Thansk,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 11:52                   ` Peter Zijlstra
  2013-12-11 12:29                     ` Mike Galbraith
@ 2013-12-11 21:43                     ` Len Brown
  2013-12-11 22:22                       ` Thomas Gleixner
  1 sibling, 1 reply; 51+ messages in thread
From: Len Brown @ 2013-12-11 21:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Thomas Gleixner, Mike Galbraith, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

For the record...

intel_idle doesn't check that bit because it doesn't run on model 29 --
the Xeon 7400 was the "Dunnington" 4-socket generation based on Core2.
Until now, i was not aware that this issue might apply to models other
than that one.

Checking w/ the HW guys...

thanks,
Len Brown, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 21:43                     ` Len Brown
@ 2013-12-11 22:22                       ` Thomas Gleixner
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2013-12-11 22:22 UTC (permalink / raw)
  To: Len Brown
  Cc: Peter Zijlstra, Borislav Petkov, Mike Galbraith, Linux PM list,
	linux-kernel@vger.kernel.org, Jeremy Eder, x86

On Wed, 11 Dec 2013, Len Brown wrote:

> For the record...
> 
> intel_idle doesn't check that bit because it doesn't run on model 29 --
> the Xeon 7400 was the "Dunnington" 4-socket generation based on Core2.

Right, we wondered about the restricted model check
already. Interesting that this is a 4 socket issue as well.

> Until now, i was not aware that this issue might apply to models other
> than that one.
> 
> Checking w/ the HW guys...

I hope they are not equally surprised :)

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 17:50                               ` Ingo Molnar
@ 2013-12-11 23:08                                 ` H. Peter Anvin
  2013-12-11 23:14                                   ` Borislav Petkov
  2013-12-12  8:51                                   ` Peter Zijlstra
  0 siblings, 2 replies; 51+ messages in thread
From: H. Peter Anvin @ 2013-12-11 23:08 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On 12/11/2013 09:50 AM, Ingo Molnar wrote:
> 
> Well, availability could be a problem too, if some CPU (real or 
> virtual) implements MWAIT but not CLFLUSH.
> 
> In theory we could make mwait an alternatives variant and patch in the 
> right combination of instructions? The CLFLUSH goes to the same 
> address as on which the monitoring happens, so it could be considered 
> one meta-instruction.
> 

The first thing to do is probably to drop the use of thread_info as a
wakeup doorbell.  It seemed like a good idea at the time -- after all,
there is one for each thread -- but it is extremely likely to be dirty
in the cache, which is (presumably) what causes these kinds of bugs to
be maximally likely.  Even if we don't do the CLFLUSH it is likely that
the hardware has to do something expensive behind the scenes.

So I would like to propose that we switch to using a percpu variable
which is a single cache line of nothing at all.  It would only ever be
touched by MONITOR and for explicit wakeup.  Hopefully that will resolve
this problem without the need for the CLFLUSH.

	-hpa

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 23:08                                 ` H. Peter Anvin
@ 2013-12-11 23:14                                   ` Borislav Petkov
  2013-12-12  0:52                                     ` H. Peter Anvin
  2013-12-12  8:51                                   ` Peter Zijlstra
  1 sibling, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2013-12-11 23:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> So I would like to propose that we switch to using a percpu variable
> which is a single cache line of nothing at all. It would only ever
> be touched by MONITOR and for explicit wakeup. Hopefully that will
> resolve this problem without the need for the CLFLUSH.

Yep, makes a lot of sense to me to have an exclusive (overloaded meaning
here :-)) cacheline only for that. And, if it works, we'll save us the
penalty from the CLFLUSH too, cool.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 23:14                                   ` Borislav Petkov
@ 2013-12-12  0:52                                     ` H. Peter Anvin
  2013-12-12  4:25                                       ` Mike Galbraith
  0 siblings, 1 reply; 51+ messages in thread
From: H. Peter Anvin @ 2013-12-12  0:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, Peter Zijlstra, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

[-- Attachment #1: Type: text/plain, Size: 824 bytes --]

On 12/11/2013 03:14 PM, Borislav Petkov wrote:
> On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
>> So I would like to propose that we switch to using a percpu variable
>> which is a single cache line of nothing at all. It would only ever
>> be touched by MONITOR and for explicit wakeup. Hopefully that will
>> resolve this problem without the need for the CLFLUSH.
> 
> Yep, makes a lot of sense to me to have an exclusive (overloaded meaning
> here :-)) cacheline only for that. And, if it works, we'll save us the
> penalty from the CLFLUSH too, cool.
> 

Here is a POC patch... anyone willing to test it out?

Two obvious things to watch out for:

1. I couldn't actually spot any obvious cases of a deliberate monitor
   trigger.

2. Should we do cpu_relax() for all users, not just powerclamp?

	-hpa


[-- Attachment #2: 0001-x86-mwait-Use-a-dedicated-percpu-area-for-mwait-door.patch --]
[-- Type: text/x-patch, Size: 9373 bytes --]

>From 20a54d54ea06f050650ab8923b7d9ee1d21b5317 Mon Sep 17 00:00:00 2001
From: "H. Peter Anvin" <hpa@linux.intel.com>
Date: Wed, 11 Dec 2013 16:31:04 -0800
Subject: [PATCH] x86, mwait: Use a dedicated percpu area for mwait doorbell

We have used the cache line that includes thread_info.flags as the
mwait doorbell.  However, this is liable to be dirty in the cache,
which may trigger hardware errata, plus it might cause false wakeups.
Instead, use a dedicated wakeup doorbell area.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h  |  1 -
 arch/x86/include/asm/mwait.h       | 46 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/processor.h   | 23 -------------------
 arch/x86/kernel/acpi/cstate.c      |  6 +----
 arch/x86/kernel/cpu/common.c       | 17 ++++++++++++++
 arch/x86/kernel/cpu/intel.c        |  3 ---
 arch/x86/kernel/setup_percpu.c     |  3 +++
 arch/x86/kernel/smpboot.c          | 19 +---------------
 drivers/acpi/acpi_pad.c            |  3 +--
 drivers/idle/intel_idle.c          |  4 +---
 drivers/thermal/intel_powerclamp.c |  2 +-
 11 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index e099f95..cdc77f3 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -96,7 +96,6 @@
 #define X86_FEATURE_XTOPOLOGY	(3*32+22) /* cpu topology enum extensions */
 #define X86_FEATURE_TSC_RELIABLE (3*32+23) /* TSC is known to be reliable */
 #define X86_FEATURE_NONSTOP_TSC	(3*32+24) /* TSC does not stop in C states */
-#define X86_FEATURE_CLFLUSH_MONITOR (3*32+25) /* "" clflush reqd with monitor */
 #define X86_FEATURE_EXTD_APICID	(3*32+26) /* has extended APICID (8 bits) */
 #define X86_FEATURE_AMD_DCM     (3*32+27) /* multi-node processor */
 #define X86_FEATURE_APERFMPERF	(3*32+28) /* APERFMPERF */
diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 2f366d0..4c82863 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -13,4 +13,50 @@
 
 #define MWAIT_ECX_INTERRUPT_BREAK	0x1
 
+#ifndef __ASSEMBLY__
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+
+static inline void __monitor(const void *eax, unsigned long ecx,
+			     unsigned long edx)
+{
+	/* "monitor %eax, %ecx, %edx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc8;"
+		     :: "a" (eax), "c" (ecx), "d"(edx));
+}
+
+static inline void __mwait(unsigned long eax, unsigned long ecx)
+{
+	/* "mwait %eax, %ecx;" */
+	asm volatile(".byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
+{
+	trace_hardirqs_on();
+	/* "mwait %eax, %ecx;" */
+	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
+		     :: "a" (eax), "c" (ecx));
+}
+
+extern char __percpu *mwait_doorbell;
+
+void __init setup_mwait_doorbell(void);
+
+static inline void x86_monitor_doorbell(unsigned long ecx, unsigned long edx)
+{
+	__monitor(__this_cpu_ptr(mwait_doorbell), ecx, edx);
+	mb();
+}
+
+static inline void x86_mwait_doorbell_bing_bong(int cpu)
+{
+	ACCESS_ONCE(*per_cpu_ptr(mwait_doorbell, cpu)) = 0;
+}
+
+#endif /* !__ASSEMBLY__ */
+
 #endif /* _ASM_X86_MWAIT_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index b7845a1..a51a79e 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -723,29 +723,6 @@ static inline void sync_core(void)
 #endif
 }
 
-static inline void __monitor(const void *eax, unsigned long ecx,
-			     unsigned long edx)
-{
-	/* "monitor %eax, %ecx, %edx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc8;"
-		     :: "a" (eax), "c" (ecx), "d"(edx));
-}
-
-static inline void __mwait(unsigned long eax, unsigned long ecx)
-{
-	/* "mwait %eax, %ecx;" */
-	asm volatile(".byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
-static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
-{
-	trace_hardirqs_on();
-	/* "mwait %eax, %ecx;" */
-	asm volatile("sti; .byte 0x0f, 0x01, 0xc9;"
-		     :: "a" (eax), "c" (ecx));
-}
-
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27..aec26c5 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
+		x86_monitor_doorbell(0, 0);
 		if (!need_resched())
 			__mwait(ax, cx);
 	}
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6abc172..b6b19ab 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -24,6 +24,7 @@
 #include <linux/cpumask.h>
 #include <asm/pgtable.h>
 #include <linux/atomic.h>
+#include <asm/mwait.h>
 #include <asm/proto.h>
 #include <asm/setup.h>
 #include <asm/apic.h>
@@ -63,6 +64,22 @@ void __init setup_cpu_local_masks(void)
 	alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
 }
 
+/* allocate percpu area for mwait doorbell */
+char __percpu *mwait_doorbell;
+
+void __init setup_mwait_doorbell(void)
+{
+	if (boot_cpu_has(X86_FEATURE_MWAIT)) {
+		mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size,
+						boot_cpu_data.clflush_size);
+
+		if (!mwait_doorbell) {
+			/* This should never happen... */
+			panic("Unable to allocate mwait doorbell!\n");
+		}
+	}
+}
+
 static void default_init(struct cpuinfo_x86 *c)
 {
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..47b4e7b 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,9 +387,6 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
-		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
-
 #ifdef CONFIG_X86_64
 	if (c->x86 == 15)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index 5cdff03..917e1af 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -284,4 +284,7 @@ void __init setup_per_cpu_areas(void)
 
 	/* Setup cpu initialized, callin, callout masks */
 	setup_cpu_local_masks();
+
+	/* Setup MWAIT doorbell */
+	setup_mwait_doorbell();
 }
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85dc05a..558e097 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1368,7 +1368,6 @@ static inline void mwait_play_dead(void)
 	unsigned int eax, ebx, ecx, edx;
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
-	void *mwait_ptr;
 	int i;
 
 	if (!this_cpu_has(X86_FEATURE_MWAIT))
@@ -1400,26 +1399,10 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
-	/*
-	 * This should be a memory location in a cache line which is
-	 * unlikely to be touched by other processors.  The actual
-	 * content is immaterial as it is not actually modified in any way.
-	 */
-	mwait_ptr = &current_thread_info()->flags;
-
 	wbinvd();
 
 	while (1) {
-		/*
-		 * The CLFLUSH is a workaround for erratum AAI65 for
-		 * the Xeon 7400 series.  It's not clear it is actually
-		 * needed, but it should be harmless in either case.
-		 * The WBINVD is insufficient due to the spurious-wakeup
-		 * case where we return around the loop.
-		 */
-		clflush(mwait_ptr);
-		__monitor(mwait_ptr, 0, 0);
-		mb();
+		x86_monitor_doorbell(0, 0);
 		__mwait(eax, 0);
 		/*
 		 * If NMI wants to wake up CPU0, start CPU0.
diff --git a/drivers/acpi/acpi_pad.c b/drivers/acpi/acpi_pad.c
index fc6008f..38aaa8b 100644
--- a/drivers/acpi/acpi_pad.c
+++ b/drivers/acpi/acpi_pad.c
@@ -193,8 +193,7 @@ static int power_saving_thread(void *data)
 					CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 			stop_critical_timings();
 
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
-			smp_mb();
+			x86_monitor_doorbell(0, 0);
 			if (!need_resched())
 				__mwait(power_saving_mwait_eax, 1);
 
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..6fef6d9 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -376,9 +376,7 @@ static int intel_idle(struct cpuidle_device *dev,
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
 	if (!current_set_polling_and_test()) {
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
+		x86_monitor_doorbell(0, 0);
 		if (!need_resched())
 			__mwait(eax, ecx);
 	}
diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
index 8f181b3..15cf013 100644
--- a/drivers/thermal/intel_powerclamp.c
+++ b/drivers/thermal/intel_powerclamp.c
@@ -438,7 +438,7 @@ static int clamp_thread(void *arg)
 			 */
 			local_touch_nmi();
 			stop_critical_timings();
-			__monitor((void *)&current_thread_info()->flags, 0, 0);
+			x86_monitor_doorbell(0, 0);
 			cpu_relax(); /* allow HT sibling to run */
 			__mwait(eax, ecx);
 			start_critical_timings();
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  0:52                                     ` H. Peter Anvin
@ 2013-12-12  4:25                                       ` Mike Galbraith
  2013-12-12  4:49                                         ` H. Peter Anvin
  0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-12  4:25 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Wed, 2013-12-11 at 16:52 -0800, H. Peter Anvin wrote: 
> On 12/11/2013 03:14 PM, Borislav Petkov wrote:
> > On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> >> So I would like to propose that we switch to using a percpu variable
> >> which is a single cache line of nothing at all. It would only ever
> >> be touched by MONITOR and for explicit wakeup. Hopefully that will
> >> resolve this problem without the need for the CLFLUSH.
> > 
> > Yep, makes a lot of sense to me to have an exclusive (overloaded meaning
> > here :-)) cacheline only for that. And, if it works, we'll save us the
> > penalty from the CLFLUSH too, cool.
> > 
> 
> Here is a POC patch... anyone willing to test it out?

Got it built, but it went boom on boot.  Off to rummage. 

[    0.000000] setup_percpu: NR_CPUS:64 nr_cpumask_bits:64 nr_cpu_ids:64 nr_node_ids:8
[    0.000000] PERCPU: Embedded 26 pages/cpu @ffff88027ee00000 s75904 r8192 d22400 u131072
[    0.000000] pcpu-alloc: s75904 r8192 d22400 u131072 alloc=1*2097152
[    0.000000] pcpu-alloc: [0] 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 
[    0.000000] pcpu-alloc: [0] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 
[    0.000000] pcpu-alloc: [0] 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 
[    0.000000] pcpu-alloc: [0] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 
[    0.000000] BUG: unable to handle kernel paging request at 000000000000b8a0
[    0.000000] IP: [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38
[    0.000000] PGD 0 
[    0.000000] Oops: 0002 [#1] SMP 
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.0-master #185
[    0.000000] Hardware name: Hewlett-Packard ProLiant DL980 G7, BIOS P66 07/07/2010
[    0.000000] task: ffffffff81a10460 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[    0.000000] RIP: 0010:[<ffffffff81a9d072>]  [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38
[    0.000000] RSP: 0000:ffffffff81a01f28  EFLAGS: 00010002
[    0.000000] RAX: 0000000000014880 RBX: 0000000000000040 RCX: 0000000000000000
[    0.000000] RDX: 0000000000000040 RSI: 0000000000000040 RDI: ffffffff81a38e60
[    0.000000] RBP: ffffffff81a01f28 R08: 0000000000000040 R09: 0000000000000000
[    0.000000] R10: ffff88027f5f4880 R11: 0000000000000001 R12: 000000000000b850
[    0.000000] R13: 000000000000b026 R14: 000000000000b024 R15: 000000000000b020
[    0.000000] FS:  0000000000000000(0000) GS:ffff88027ee00000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: 000000000000b8a0 CR3: 0000000001a0b000 CR4: 00000000000000b0
[    0.000000] Stack:
[    0.000000]  ffffffff81a01f78 ffffffff81aa3641 ffffffff81a01f98 000000000000cd48
[    0.000000]  ffff88027ee00000 0000000000000000 0000000000000000 0000000000000000
[    0.000000]  0000000000000000 0000000000000000 ffffffff81a01fa8 ffffffff81a96d89
[    0.000000] Call Trace:
[    0.000000]  [<ffffffff81aa3641>] setup_per_cpu_areas+0x233/0x242
[    0.000000]  [<ffffffff81a96d89>] start_kernel+0x84/0x370
[    0.000000]  [<ffffffff81a964cc>] x86_64_start_reservations+0x1b/0x35
[    0.000000]  [<ffffffff81a96614>] x86_64_start_kernel+0x12e/0x135
[    0.000000] Code: 40 8f a7 81 e8 f6 fe ff ff c9 c3 55 48 8b 05 0a bf fd ff 48 89 e5 a8 08 75 02 c9 c3 0f b7 3d 84 bf fd ff 48 89 fe e8 fe dc 64 ff <48> 89 05 27 e8 56 7e 48 85 c0 75 e3 48 c7 c7 f0 83 78 81 e8 55 
[    0.000000] RIP  [<ffffffff81a9d072>] setup_mwait_doorbell+0x20/0x38
[    0.000000]  RSP <ffffffff81a01f28>
[    0.000000] CR2: 000000000000b8a0
[    0.000000] ---[ end trace f6e32c58e0729292 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

Build delta.

---
 arch/x86/include/asm/mwait.h   |    4 ++--
 arch/x86/kernel/cpu/common.c   |    7 ++++---
 arch/x86/kernel/setup_percpu.c |    1 +
 3 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/arch/x86/kernel/cpu/common.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/common.c
+++ linux-2.6/arch/x86/kernel/cpu/common.c
@@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
 }
 
 /* allocate percpu area for mwait doorbell */
-char __percpu *mwait_doorbell;
+DEFINE_PER_CPU(char *, mwait_doorbell);
+EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
 
 void __init setup_mwait_doorbell(void)
 {
 	if (boot_cpu_has(X86_FEATURE_MWAIT)) {
-		mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size,
-						boot_cpu_data.clflush_size);
+		mwait_doorbell = __alloc_percpu(boot_cpu_data.x86_clflush_size,
+						boot_cpu_data.x86_clflush_size);
 
 		if (!mwait_doorbell) {
 			/* This should never happen... */
Index: linux-2.6/arch/x86/kernel/setup_percpu.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup_percpu.c
+++ linux-2.6/arch/x86/kernel/setup_percpu.c
@@ -20,6 +20,7 @@
 #include <asm/cpumask.h>
 #include <asm/cpu.h>
 #include <asm/stackprotector.h>
+#include <asm/mwait.h>
 
 DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
 EXPORT_PER_CPU_SYMBOL(cpu_number);
Index: linux-2.6/arch/x86/include/asm/mwait.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/mwait.h
+++ linux-2.6/arch/x86/include/asm/mwait.h
@@ -42,9 +42,9 @@ static inline void __sti_mwait(unsigned
 		     :: "a" (eax), "c" (ecx));
 }
 
-extern char __percpu *mwait_doorbell;
+DECLARE_PER_CPU(char *, mwait_doorbell);
 
-void __init setup_mwait_doorbell(void);
+extern void __init setup_mwait_doorbell(void);
 
 static inline void x86_monitor_doorbell(unsigned long ecx, unsigned long edx)
 {




^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  4:25                                       ` Mike Galbraith
@ 2013-12-12  4:49                                         ` H. Peter Anvin
  2013-12-12  4:59                                           ` Mike Galbraith
  2013-12-12  5:37                                           ` Mike Galbraith
  0 siblings, 2 replies; 51+ messages in thread
From: H. Peter Anvin @ 2013-12-12  4:49 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On 12/11/2013 08:25 PM, Mike Galbraith wrote:
>  arch/x86/include/asm/mwait.h   |    4 ++--
>  arch/x86/kernel/cpu/common.c   |    7 ++++---
>  arch/x86/kernel/setup_percpu.c |    1 +
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> Index: linux-2.6/arch/x86/kernel/cpu/common.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> +++ linux-2.6/arch/x86/kernel/cpu/common.c
> @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
>  }
>  
>  /* allocate percpu area for mwait doorbell */
> -char __percpu *mwait_doorbell;
> +DEFINE_PER_CPU(char *, mwait_doorbell);
> +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
>  

Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer to
a percpu allocation, but the variable itself is not a percpu variable.
This explains your boom.

>  void __init setup_mwait_doorbell(void)
>  {
>  	if (boot_cpu_has(X86_FEATURE_MWAIT)) {
> -		mwait_doorbell = __alloc_percpu(boot_cpu_data.clflush_size,
> -						boot_cpu_data.clflush_size);
> +		mwait_doorbell = __alloc_percpu(boot_cpu_data.x86_clflush_size,
> +						boot_cpu_data.x86_clflush_size);
>  
>  		if (!mwait_doorbell) {
>  			/* This should never happen... */

	-hpa



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  4:49                                         ` H. Peter Anvin
@ 2013-12-12  4:59                                           ` Mike Galbraith
  2013-12-12  5:37                                           ` Mike Galbraith
  1 sibling, 0 replies; 51+ messages in thread
From: Mike Galbraith @ 2013-12-12  4:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: 
> On 12/11/2013 08:25 PM, Mike Galbraith wrote:
> >  arch/x86/include/asm/mwait.h   |    4 ++--
> >  arch/x86/kernel/cpu/common.c   |    7 ++++---
> >  arch/x86/kernel/setup_percpu.c |    1 +
> >  3 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6/arch/x86/kernel/cpu/common.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> > +++ linux-2.6/arch/x86/kernel/cpu/common.c
> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
> >  }
> >  
> >  /* allocate percpu area for mwait doorbell */
> > -char __percpu *mwait_doorbell;
> > +DEFINE_PER_CPU(char *, mwait_doorbell);
> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
> >  
> 
> Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer to
> a percpu allocation, but the variable itself is not a percpu variable.
> This explains your boom.

Yeah, I know, I already slapped myself upside the head.

(what were you thinking mikie...la la la la la:)

-Mike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  4:49                                         ` H. Peter Anvin
  2013-12-12  4:59                                           ` Mike Galbraith
@ 2013-12-12  5:37                                           ` Mike Galbraith
  2013-12-12  5:45                                             ` H. Peter Anvin
  1 sibling, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-12  5:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: 
> On 12/11/2013 08:25 PM, Mike Galbraith wrote:
> >  arch/x86/include/asm/mwait.h   |    4 ++--
> >  arch/x86/kernel/cpu/common.c   |    7 ++++---
> >  arch/x86/kernel/setup_percpu.c |    1 +
> >  3 files changed, 7 insertions(+), 5 deletions(-)
> > 
> > Index: linux-2.6/arch/x86/kernel/cpu/common.c
> > ===================================================================
> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
> > +++ linux-2.6/arch/x86/kernel/cpu/common.c
> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
> >  }
> >  
> >  /* allocate percpu area for mwait doorbell */
> > -char __percpu *mwait_doorbell;
> > +DEFINE_PER_CPU(char *, mwait_doorbell);
> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
> >  
> 
> Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer to
> a percpu allocation, but the variable itself is not a percpu variable.
> This explains your boom.

With that fixed, it boots, but is not quite perfect.

... 
[  258.560079] fbcon: radeondrmfb (fb0) is primary device
[  258.722483] Console: switching to colour frame buffer device 128x48
[  258.847076] radeon 0000:01:03.0: fb0: radeondrmfb frame buffer device
[  258.911991] radeon 0000:01:03.0: registered panic notifier
[  258.968772] [drm] Initialized radeon 2.35.0 20080528 for 0000:01:03.0 on minor 0
...
[  469.738604] netxen_nic 0000:04:00.3: using msi-x interrupts
[  469.739078] netxen_nic 0000:04:00.3: eth5: GbE port initialized
[  469.830512] ipmi_si 00:01: Found new BMC (man_id: 0x00000b, prod_id: 0x2000, dev_id: 0x13)
[  469.830524] ipmi_si 00:01: IPMI kcs interface initialized
[  473.729862] iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
...
[  711.636741] fuse init (API version 7.22)

...zzzz ok box, doctor appointment is in an hour away.

-Mike 



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  5:37                                           ` Mike Galbraith
@ 2013-12-12  5:45                                             ` H. Peter Anvin
  2013-12-12  5:57                                               ` Mike Galbraith
  0 siblings, 1 reply; 51+ messages in thread
From: H. Peter Anvin @ 2013-12-12  5:45 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

As in it hangs at that point?

Mike Galbraith <bitbucket@online.de> wrote:
>On Wed, 2013-12-11 at 20:49 -0800, H. Peter Anvin wrote: 
>> On 12/11/2013 08:25 PM, Mike Galbraith wrote:
>> >  arch/x86/include/asm/mwait.h   |    4 ++--
>> >  arch/x86/kernel/cpu/common.c   |    7 ++++---
>> >  arch/x86/kernel/setup_percpu.c |    1 +
>> >  3 files changed, 7 insertions(+), 5 deletions(-)
>> > 
>> > Index: linux-2.6/arch/x86/kernel/cpu/common.c
>> > ===================================================================
>> > --- linux-2.6.orig/arch/x86/kernel/cpu/common.c
>> > +++ linux-2.6/arch/x86/kernel/cpu/common.c
>> > @@ -65,13 +65,14 @@ void __init setup_cpu_local_masks(void)
>> >  }
>> >  
>> >  /* allocate percpu area for mwait doorbell */
>> > -char __percpu *mwait_doorbell;
>> > +DEFINE_PER_CPU(char *, mwait_doorbell);
>> > +EXPORT_PER_CPU_SYMBOL(mwait_doorbell);
>> >  
>> 
>> Sorry, this is wrong.  This is NOT a percpu variable, it is a pointer
>to
>> a percpu allocation, but the variable itself is not a percpu
>variable.
>> This explains your boom.
>
>With that fixed, it boots, but is not quite perfect.
>
>... 
>[  258.560079] fbcon: radeondrmfb (fb0) is primary device
>[  258.722483] Console: switching to colour frame buffer device 128x48
>[  258.847076] radeon 0000:01:03.0: fb0: radeondrmfb frame buffer
>device
>[  258.911991] radeon 0000:01:03.0: registered panic notifier
>[  258.968772] [drm] Initialized radeon 2.35.0 20080528 for
>0000:01:03.0 on minor 0
>...
>[  469.738604] netxen_nic 0000:04:00.3: using msi-x interrupts
>[  469.739078] netxen_nic 0000:04:00.3: eth5: GbE port initialized
>[  469.830512] ipmi_si 00:01: Found new BMC (man_id: 0x00000b, prod_id:
>0x2000, dev_id: 0x13)
>[  469.830524] ipmi_si 00:01: IPMI kcs interface initialized
>[  473.729862] iTCO_wdt: unable to reset NO_REBOOT flag, device
>disabled by hardware/BIOS
>...
>[  711.636741] fuse init (API version 7.22)
>
>...zzzz ok box, doctor appointment is in an hour away.
>
>-Mike 

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  5:45                                             ` H. Peter Anvin
@ 2013-12-12  5:57                                               ` Mike Galbraith
  2013-12-12  6:05                                                 ` Mike Galbraith
  0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-12  5:57 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: 
> As in it hangs at that point?

Nope, it's still going.

[1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz

Funny, continents move faster :)  Maybe missing a write or two.

-Mike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  5:57                                               ` Mike Galbraith
@ 2013-12-12  6:05                                                 ` Mike Galbraith
  2013-12-12  7:57                                                   ` H. Peter Anvin
  0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2013-12-12  6:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Thu, 2013-12-12 at 06:57 +0100, Mike Galbraith wrote: 
> On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: 
> > As in it hangs at that point?
> 
> Nope, it's still going.
> 
> [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz
> 
> Funny, continents move faster :)  Maybe missing a write or two.

When I get back it may be done booting.  I'm gonna let it try for grins
while I'm away, then take a peek, see if I can spot it.

[ 1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1064 MHz, 2266 MHz
                                                                      done
Starting HAL daemon                                                   done


Setting up (localfs) network interfaces:
    lo        
    lo        IP address: 127.0.0.1/8   
              IP address: 127.0.0.2/8                                 done
    eth0      device: Broadcom Corporation NetXtreme II BCM5709 Gig
              No configuration found for eth0                         unused
    eth1      device: Broadcom Corporation NetXtreme II BCM5709 Gig
              No configuration found for eth1                         unused
    eth2      device: NetXen Incorporated NX3031 Multifunction 1/10
[ 2457.114007] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
[ 2457.114455] netxen_nic: eth2 NIC Link is up
[ 2457.223582] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
    eth2      IP address: 0.0.0.0/32


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  6:05                                                 ` Mike Galbraith
@ 2013-12-12  7:57                                                   ` H. Peter Anvin
  0 siblings, 0 replies; 51+ messages in thread
From: H. Peter Anvin @ 2013-12-12  7:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Ingo Molnar, Peter Zijlstra, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

I'm guessing there is an implicit or explicit wake-up somewhere that I managed to miss.  If it is implicit it could be hard to catch.

Mike Galbraith <bitbucket@online.de> wrote:
>On Thu, 2013-12-12 at 06:57 +0100, Mike Galbraith wrote: 
>> On Wed, 2013-12-11 at 21:45 -0800, H. Peter Anvin wrote: 
>> > As in it hangs at that point?
>> 
>> Nope, it's still going.
>> 
>> [1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency
>limits: 1064 MHz, 2266 MHz
>> 
>> Funny, continents move faster :)  Maybe missing a write or two.
>
>When I get back it may be done booting.  I'm gonna let it try for grins
>while I'm away, then take a peek, see if I can spot it.
>
>[ 1567.578340] pcc-cpufreq: (v1.10.00) driver loaded with frequency
>limits: 1064 MHz, 2266 MHz
>                                                                   done
>Starting HAL daemon                                                  
>done
>
>
>Setting up (localfs) network interfaces:
>    lo        
>    lo        IP address: 127.0.0.1/8   
>           IP address: 127.0.0.2/8                                 done
>    eth0      device: Broadcom Corporation NetXtreme II BCM5709 Gig
>         No configuration found for eth0                         unused
>    eth1      device: Broadcom Corporation NetXtreme II BCM5709 Gig
>         No configuration found for eth1                         unused
>    eth2      device: NetXen Incorporated NX3031 Multifunction 1/10
>[ 2457.114007] IPv6: ADDRCONF(NETDEV_UP): eth2: link is not ready
>[ 2457.114455] netxen_nic: eth2 NIC Link is up
>[ 2457.223582] IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
>    eth2      IP address: 0.0.0.0/32

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-11 23:08                                 ` H. Peter Anvin
  2013-12-11 23:14                                   ` Borislav Petkov
@ 2013-12-12  8:51                                   ` Peter Zijlstra
  2013-12-12 13:28                                     ` Ingo Molnar
  1 sibling, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-12  8:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Borislav Petkov, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> On 12/11/2013 09:50 AM, Ingo Molnar wrote:
> > 
> > Well, availability could be a problem too, if some CPU (real or 
> > virtual) implements MWAIT but not CLFLUSH.
> > 
> > In theory we could make mwait an alternatives variant and patch in the 
> > right combination of instructions? The CLFLUSH goes to the same 
> > address as on which the monitoring happens, so it could be considered 
> > one meta-instruction.
> > 
> 
> The first thing to do is probably to drop the use of thread_info as a
> wakeup doorbell.  It seemed like a good idea at the time -- after all,
> there is one for each thread -- but it is extremely likely to be dirty
> in the cache, which is (presumably) what causes these kinds of bugs to
> be maximally likely.  Even if we don't do the CLFLUSH it is likely that
> the hardware has to do something expensive behind the scenes.
> 
> So I would like to propose that we switch to using a percpu variable
> which is a single cache line of nothing at all.  It would only ever be
> touched by MONITOR and for explicit wakeup.  Hopefully that will resolve
> this problem without the need for the CLFLUSH.

The reason we use thread_info::flags is because we need to write
TIF_NEED_RESCHED into it to wake up anyhow.

Using another cacheline would mean the wakeup path would need to write a
second cross cpu cacheline -- that is badness too.

So no, I don't think we want to listen to another line.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12  8:51                                   ` Peter Zijlstra
@ 2013-12-12 13:28                                     ` Ingo Molnar
  2013-12-12 15:06                                       ` H. Peter Anvin
  0 siblings, 1 reply; 51+ messages in thread
From: Ingo Molnar @ 2013-12-12 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Borislav Petkov, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Dec 11, 2013 at 03:08:35PM -0800, H. Peter Anvin wrote:
> > On 12/11/2013 09:50 AM, Ingo Molnar wrote:
> > > 
> > > Well, availability could be a problem too, if some CPU (real or 
> > > virtual) implements MWAIT but not CLFLUSH.
> > > 
> > > In theory we could make mwait an alternatives variant and patch in the 
> > > right combination of instructions? The CLFLUSH goes to the same 
> > > address as on which the monitoring happens, so it could be considered 
> > > one meta-instruction.
> > > 
> > 
> > The first thing to do is probably to drop the use of thread_info as a
> > wakeup doorbell.  It seemed like a good idea at the time -- after all,
> > there is one for each thread -- but it is extremely likely to be dirty
> > in the cache, which is (presumably) what causes these kinds of bugs to
> > be maximally likely.  Even if we don't do the CLFLUSH it is likely that
> > the hardware has to do something expensive behind the scenes.
> > 
> > So I would like to propose that we switch to using a percpu variable
> > which is a single cache line of nothing at all.  It would only ever be
> > touched by MONITOR and for explicit wakeup.  Hopefully that will resolve
> > this problem without the need for the CLFLUSH.
> 
> The reason we use thread_info::flags is because we need to write
> TIF_NEED_RESCHED into it to wake up anyhow.
> 
> Using another cacheline would mean the wakeup path would need to write a
> second cross cpu cacheline -- that is badness too.
> 
> So no, I don't think we want to listen to another line.

Seconded ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12 13:28                                     ` Ingo Molnar
@ 2013-12-12 15:06                                       ` H. Peter Anvin
  2013-12-12 15:51                                         ` Peter Zijlstra
  0 siblings, 1 reply; 51+ messages in thread
From: H. Peter Anvin @ 2013-12-12 15:06 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	Linux PM list, linux-kernel@vger.kernel.org, Jeremy Eder, x86

On 12/12/2013 05:28 AM, Ingo Molnar wrote:
>>
>> The reason we use thread_info::flags is because we need to write
>> TIF_NEED_RESCHED into it to wake up anyhow.
>>
>> Using another cacheline would mean the wakeup path would need to write a
>> second cross cpu cacheline -- that is badness too.
>>
>> So no, I don't think we want to listen to another line.
> 

Right, okay, so that's the implicit wakeup.  However, I would think the
CLFLUSH would hurt a lot more.

	-hpa



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: 50 Watt idle power regression bisected to Linux-3.10
  2013-12-12 15:06                                       ` H. Peter Anvin
@ 2013-12-12 15:51                                         ` Peter Zijlstra
  0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2013-12-12 15:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Borislav Petkov, Mike Galbraith, Thomas Gleixner,
	Len Brown, Linux PM list, linux-kernel@vger.kernel.org,
	Jeremy Eder, x86

On Thu, Dec 12, 2013 at 07:06:57AM -0800, H. Peter Anvin wrote:
> On 12/12/2013 05:28 AM, Ingo Molnar wrote:
> >>
> >> The reason we use thread_info::flags is because we need to write
> >> TIF_NEED_RESCHED into it to wake up anyhow.
> >>
> >> Using another cacheline would mean the wakeup path would need to write a
> >> second cross cpu cacheline -- that is badness too.
> >>
> >> So no, I don't think we want to listen to another line.
> > 
> 
> Right, okay, so that's the implicit wakeup.  However, I would think the
> CLFLUSH would hurt a lot more.

Maybe, but still, who cares? Its only a few broken cpus that actually
need the clflush, normal cpus do not. We should not optimize for the
broken case.



^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2013-12-12 15:52 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-07  8:00 50 Watt idle power regression bisected to Linux-3.10 Len Brown
2013-12-07  8:39 ` Mike Galbraith
2013-12-07 16:01   ` Len Brown
2013-12-07 16:45     ` Len Brown
2013-12-07 19:17       ` Mike Galbraith
2013-12-10 11:41         ` Ingo Molnar
2013-12-07 12:54 ` Thomas Gleixner
2013-12-08  4:57 ` Mike Galbraith
2013-12-08 20:40   ` Len Brown
2013-12-09  3:16     ` Mike Galbraith
2013-12-10  5:17       ` Mike Galbraith
2013-12-10 11:45         ` Ingo Molnar
2013-12-10 14:29         ` Thomas Gleixner
2013-12-10 15:06           ` Ingo Molnar
2013-12-11  2:05           ` Thomas Gleixner
2013-12-11  3:21             ` Mike Galbraith
2013-12-11 11:28               ` Thomas Gleixner
2013-12-11 11:38                 ` Borislav Petkov
2013-12-11 11:52                   ` Peter Zijlstra
2013-12-11 12:29                     ` Mike Galbraith
2013-12-11 12:43                       ` Peter Zijlstra
2013-12-11 13:10                         ` Mike Galbraith
2013-12-11 13:40                         ` Borislav Petkov
2013-12-11 14:56                           ` Ingo Molnar
2013-12-11 16:02                             ` Borislav Petkov
2013-12-11 16:43                             ` Peter Zijlstra
2013-12-11 17:50                               ` Ingo Molnar
2013-12-11 23:08                                 ` H. Peter Anvin
2013-12-11 23:14                                   ` Borislav Petkov
2013-12-12  0:52                                     ` H. Peter Anvin
2013-12-12  4:25                                       ` Mike Galbraith
2013-12-12  4:49                                         ` H. Peter Anvin
2013-12-12  4:59                                           ` Mike Galbraith
2013-12-12  5:37                                           ` Mike Galbraith
2013-12-12  5:45                                             ` H. Peter Anvin
2013-12-12  5:57                                               ` Mike Galbraith
2013-12-12  6:05                                                 ` Mike Galbraith
2013-12-12  7:57                                                   ` H. Peter Anvin
2013-12-12  8:51                                   ` Peter Zijlstra
2013-12-12 13:28                                     ` Ingo Molnar
2013-12-12 15:06                                       ` H. Peter Anvin
2013-12-12 15:51                                         ` Peter Zijlstra
2013-12-11 14:42                         ` Ingo Molnar
2013-12-11 15:02                           ` Thomas Gleixner
2013-12-11 15:09                             ` Ingo Molnar
2013-12-11 16:44                               ` Peter Zijlstra
2013-12-11 17:48                                 ` Ingo Molnar
2013-12-11 16:44                           ` Peter Zijlstra
2013-12-11 17:47                             ` Ingo Molnar
2013-12-11 21:43                     ` Len Brown
2013-12-11 22:22                       ` Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox