public inbox for linux-pm@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86 idle: repair large-server 50-watt idle-power regression
@ 2013-12-18 21:44 Len Brown
  2013-12-19 12:22 ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: Len Brown @ 2013-12-18 21:44 UTC (permalink / raw)
  To: x86; +Cc: linux-pm, linux-kernel, Len Brown, stable

From: Len Brown <len.brown@intel.com>

Linux 3.10 changed the timing of how thread_info->flags is touched:

	x86: Use generic idle loop
	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)

This caused Intel NHM-EX and WSM-EX servers to experience a large number
of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
from deep C-states to shallow C-states, which caused these platforms
to experience a significant increase in idle power.

Note that this issue was already present before the commit above,
however, it wasn't seen often enough to be noticed in power measurements.

Here we extend an errata workaround from the Core2 EX "Dunnington"
to extend to NHM-EX and WSM-EX, to prevent these immediate
returns from MWAIT, reducing idle power on these platforms.

While only acpi_idle ran on Dunnington, intel_idle
may also run on these two newer systems.
As of today, there are no other models that are known
to need this tweak.

ref: https://lkml.org/lkml/2013/12/7/22
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
---
 arch/x86/kernel/cpu/intel.c | 3 ++-
 drivers/idle/intel_idle.c   | 3 +++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index dc1ec0d..ea04b34 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
 			set_cpu_cap(c, X86_FEATURE_PEBS);
 	}
 
-	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
+	if (c->x86 == 6 && cpu_has_clflush &&
+	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
 		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
 
 #ifdef CONFIG_X86_64
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 92d1206..f80b700 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	if (!current_set_polling_and_test()) {
 
+		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
+			clflush((void *)&current_thread_info()->flags);
+
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		smp_mb();
 		if (!need_resched())
-- 
1.8.5.1.19.gdaad3aa

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-18 21:44 [PATCH] x86 idle: repair large-server 50-watt idle-power regression Len Brown
@ 2013-12-19 12:22 ` Ingo Molnar
  2013-12-19 14:40   ` H. Peter Anvin
  2013-12-19 15:55   ` H. Peter Anvin
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 12:22 UTC (permalink / raw)
  To: Len Brown
  Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds,
	H. Peter Anvin, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov


* Len Brown <lenb@kernel.org> wrote:

> From: Len Brown <len.brown@intel.com>
> 
> Linux 3.10 changed the timing of how thread_info->flags is touched:
> 
> 	x86: Use generic idle loop
> 	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
> 
> This caused Intel NHM-EX and WSM-EX servers to experience a large number
> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to demote
> from deep C-states to shallow C-states, which caused these platforms
> to experience a significant increase in idle power.
> 
> Note that this issue was already present before the commit above,
> however, it wasn't seen often enough to be noticed in power measurements.
> 
> Here we extend an errata workaround from the Core2 EX "Dunnington"
> to extend to NHM-EX and WSM-EX, to prevent these immediate
> returns from MWAIT, reducing idle power on these platforms.
> 
> While only acpi_idle ran on Dunnington, intel_idle
> may also run on these two newer systems.
> As of today, there are no other models that are known
> to need this tweak.
> 
> ref: https://lkml.org/lkml/2013/12/7/22
> Signed-off-by: Len Brown <len.brown@intel.com>
> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
> ---
>  arch/x86/kernel/cpu/intel.c | 3 ++-
>  drivers/idle/intel_idle.c   | 3 +++
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index dc1ec0d..ea04b34 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>  	}
>  
> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
> +	if (c->x86 == 6 && cpu_has_clflush &&
> +	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model == 47))
>  		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
>  
>  #ifdef CONFIG_X86_64
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 92d1206..f80b700 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>  
>  	if (!current_set_polling_and_test()) {
>  
> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> +			clflush((void *)&current_thread_info()->flags);
> +
>  		__monitor((void *)&current_thread_info()->flags, 0, 0);

I don't think either of these casts to '(void *)' is needed, both the 
clflush() and __monitor() will take pointers.

Looks good to me otherwise - except that maybe the best way to 
represent this quirk would be for the CLFLUSH+MONITOR sequence to be a 
single 'instruction' which is patched in dynamically during bootup, 
using our usual alternatives framework.

On non-affected CPUs a NOP would remain in place of the CLFLUSH, 
eliminating the branch above.

So the whole thing could be thought of as a slightly more complex 
'monitor' instruction - not exposing the quirk details to actual usage 
sites.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 12:22 ` Ingo Molnar
@ 2013-12-19 14:40   ` H. Peter Anvin
  2013-12-19 15:45     ` Borislav Petkov
  2013-12-19 15:55   ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 14:40 UTC (permalink / raw)
  To: Ingo Molnar, Len Brown
  Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov

... or just use static_cpu_has() maybe?

Ingo Molnar <mingo@kernel.org> wrote:
>
>* Len Brown <lenb@kernel.org> wrote:
>
>> From: Len Brown <len.brown@intel.com>
>> 
>> Linux 3.10 changed the timing of how thread_info->flags is touched:
>> 
>> 	x86: Use generic idle loop
>> 	(7d1a941731fabf27e5fb6edbebb79fe856edb4e5)
>> 
>> This caused Intel NHM-EX and WSM-EX servers to experience a large
>number
>> of immediate MONITOR/MWAIT break wakeups, which caused cpuidle to
>demote
>> from deep C-states to shallow C-states, which caused these platforms
>> to experience a significant increase in idle power.
>> 
>> Note that this issue was already present before the commit above,
>> however, it wasn't seen often enough to be noticed in power
>measurements.
>> 
>> Here we extend an errata workaround from the Core2 EX "Dunnington"
>> to extend to NHM-EX and WSM-EX, to prevent these immediate
>> returns from MWAIT, reducing idle power on these platforms.
>> 
>> While only acpi_idle ran on Dunnington, intel_idle
>> may also run on these two newer systems.
>> As of today, there are no other models that are known
>> to need this tweak.
>> 
>> ref: https://lkml.org/lkml/2013/12/7/22
>> Signed-off-by: Len Brown <len.brown@intel.com>
>> Cc: <stable@vger.kernel.org> # 3.12.x, 3.11.x, 3.10.x
>> ---
>>  arch/x86/kernel/cpu/intel.c | 3 ++-
>>  drivers/idle/intel_idle.c   | 3 +++
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/kernel/cpu/intel.c
>b/arch/x86/kernel/cpu/intel.c
>> index dc1ec0d..ea04b34 100644
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -387,7 +387,8 @@ static void init_intel(struct cpuinfo_x86 *c)
>>  			set_cpu_cap(c, X86_FEATURE_PEBS);
>>  	}
>>  
>> -	if (c->x86 == 6 && c->x86_model == 29 && cpu_has_clflush)
>> +	if (c->x86 == 6 && cpu_has_clflush &&
>> +	    (c->x86_model == 29 || c->x86_model == 46 || c->x86_model ==
>47))
>>  		set_cpu_cap(c, X86_FEATURE_CLFLUSH_MONITOR);
>>  
>>  #ifdef CONFIG_X86_64
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 92d1206..f80b700 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>>  
>>  	if (!current_set_polling_and_test()) {
>>  
>> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>> +			clflush((void *)&current_thread_info()->flags);
>> +
>>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
>
>I don't think either of these casts to '(void *)' is needed, both the 
>clflush() and __monitor() will take pointers.
>
>Looks good to me otherwise - except that maybe the best way to 
>represent this quirk would be for the CLFLUSH+MONITOR sequence to be a 
>single 'instruction' which is patched in dynamically during bootup, 
>using our usual alternatives framework.
>
>On non-affected CPUs a NOP would remain in place of the CLFLUSH, 
>eliminating the branch above.
>
>So the whole thing could be thought of as a slightly more complex 
>'monitor' instruction - not exposing the quirk details to actual usage 
>sites.
>
>Thanks,
>
>	Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 14:40   ` H. Peter Anvin
@ 2013-12-19 15:45     ` Borislav Petkov
  0 siblings, 0 replies; 27+ messages in thread
From: Borislav Petkov @ 2013-12-19 15:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Mike Galbraith

On Thu, Dec 19, 2013 at 06:40:41AM -0800, H. Peter Anvin wrote:
> ... or just use static_cpu_has() maybe?

Right, if we can get the compiler to generate the shortest CLFLUSH of 3
bytes with register indirect addressing for the operand, then stomping
over those 3 bytes with the alternatives would be the fastest solution.

If, of course, three NOPs which get discarded in the front end are
cheaper than the near JMP from static_cpu_has.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 12:22 ` Ingo Molnar
  2013-12-19 14:40   ` H. Peter Anvin
@ 2013-12-19 15:55   ` H. Peter Anvin
  2013-12-19 16:02     ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 15:55 UTC (permalink / raw)
  To: Ingo Molnar, Len Brown
  Cc: x86, linux-pm, linux-kernel, Len Brown, stable, Linus Torvalds,
	Thomas Gleixner, Peter Zijlstra, Mike Galbraith, Borislav Petkov

On 12/19/2013 04:22 AM, Ingo Molnar wrote:
>> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
>> index 92d1206..f80b700 100644
>> --- a/drivers/idle/intel_idle.c
>> +++ b/drivers/idle/intel_idle.c
>> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
>>  
>>  	if (!current_set_polling_and_test()) {
>>  
>> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
>> +			clflush((void *)&current_thread_info()->flags);
>> +
>>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> 
> I don't think either of these casts to '(void *)' is needed, both the 
> clflush() and __monitor() will take pointers.

__monitor() currently doesn't, which is idiotic.

	-hpa


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 15:55   ` H. Peter Anvin
@ 2013-12-19 16:02     ` Ingo Molnar
  2013-12-19 16:09       ` H. Peter Anvin
  2013-12-19 16:13       ` H. Peter Anvin
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 16:02 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 04:22 AM, Ingo Molnar wrote:
> >> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> >> index 92d1206..f80b700 100644
> >> --- a/drivers/idle/intel_idle.c
> >> +++ b/drivers/idle/intel_idle.c
> >> @@ -377,6 +377,9 @@ static int intel_idle(struct cpuidle_device *dev,
> >>  
> >>  	if (!current_set_polling_and_test()) {
> >>  
> >> +		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> >> +			clflush((void *)&current_thread_info()->flags);
> >> +
> >>  		__monitor((void *)&current_thread_info()->flags, 0, 0);
> > 
> > I don't think either of these casts to '(void *)' is needed, both the 
> > clflush() and __monitor() will take pointers.
> 
> __monitor() currently doesn't, which is idiotic.

Hm, __monitor() seems to take a void *:

 arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx,

So writing:

		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
			clflush(&current_thread_info()->flags);

  		__monitor(&current_thread_info()->flags, 0, 0);

ought to work just fine.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:02     ` Ingo Molnar
@ 2013-12-19 16:09       ` H. Peter Anvin
  2013-12-19 16:13       ` H. Peter Anvin
  1 sibling, 0 replies; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 16:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 08:02 AM, Ingo Molnar wrote:
>>
>> __monitor() currently doesn't, which is idiotic.
> 
> Hm, __monitor() seems to take a void *:
> 
>  arch/x86/include/asm/processor.h:static inline void __monitor(const void *eax, unsigned long ecx,
> 
> So writing:
> 
> 		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
> 			clflush(&current_thread_info()->flags);
> 
>   		__monitor(&current_thread_info()->flags, 0, 0);
> 

Yes, brain failure on my part.

	-hpa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:02     ` Ingo Molnar
  2013-12-19 16:09       ` H. Peter Anvin
@ 2013-12-19 16:13       ` H. Peter Anvin
  2013-12-19 16:21         ` Peter Zijlstra
  1 sibling, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 16:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Mike Galbraith,
	Borislav Petkov

[-- Attachment #1: Type: text/plain, Size: 189 bytes --]

How does this look?  Completely untested, of course.

I do wonder if we need more memory barriers, though.

An alternative would be to move everything into mwait_idle_with_hints().

	-hpa


[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1626 bytes --]

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7b034a4057f9..6dce588f94b4 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
 		     :: "a" (eax), "c" (ecx));
 }
 
+/*
+ * Issue a clflush in preparation for a monitor instruction if the CPU
+ * needs it.  We force the address into the ax register to get a fixed
+ * length for the instruction, however, this is what the monitor instruction
+ * is going to need anyway, so it shouldn't add any additional code.
+ */
+static inline void clflush_monitor(const void *addr, unsigned long ecx,
+				   unsigned long edx)
+{
+	alternative_input(ASM_NOP3,
+			  "clflush (%0)",
+			  X86_FEATURE_CLFLUSH_MONITOR,
+			  "a" (addr));
+	__monitor(addr, eax, edx);
+	smp_mb();
+}
+
 extern void select_idle_routine(const struct cpuinfo_x86 *c);
 extern void init_amd_e400_c1e_mask(void);
 
diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index d2b7f27781bc..b14d02354134 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -163,11 +163,7 @@ EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_probe);
 void mwait_idle_with_hints(unsigned long ax, unsigned long cx)
 {
 	if (!need_resched()) {
-		if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
-			clflush((void *)&current_thread_info()->flags);
-
-		__monitor((void *)&current_thread_info()->flags, 0, 0);
-		smp_mb();
+		clflush_monitor(&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__mwait(ax, cx);
 	}

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:13       ` H. Peter Anvin
@ 2013-12-19 16:21         ` Peter Zijlstra
  2013-12-19 16:50           ` H. Peter Anvin
  0 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2013-12-19 16:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 08:13:21AM -0800, H. Peter Anvin wrote:
> How does this look?  Completely untested, of course.
> 
> I do wonder if we need more memory barriers, though.
> 
> An alternative would be to move everything into mwait_idle_with_hints().
> 
> 	-hpa
> 

> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 7b034a4057f9..6dce588f94b4 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -723,6 +723,23 @@ static inline void __sti_mwait(unsigned long eax, unsigned long ecx)
>  		     :: "a" (eax), "c" (ecx));
>  }
>  
> +/*
> + * Issue a clflush in preparation for a monitor instruction if the CPU
> + * needs it.  We force the address into the ax register to get a fixed
> + * length for the instruction, however, this is what the monitor instruction
> + * is going to need anyway, so it shouldn't add any additional code.
> + */
> +static inline void clflush_monitor(const void *addr, unsigned long ecx,
> +				   unsigned long edx)
> +{
> +	alternative_input(ASM_NOP3,
> +			  "clflush (%0)",
> +			  X86_FEATURE_CLFLUSH_MONITOR,
> +			  "a" (addr));
> +	__monitor(addr, eax, edx);
> +	smp_mb();
> +}

What's that mb for?

Also, can you please merge:

  http://marc.info/?l=linux-kernel&m=138685838420632

Thomas said he would pick up that series, but seems to have gone missing
the past week or so.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:21         ` Peter Zijlstra
@ 2013-12-19 16:50           ` H. Peter Anvin
  2013-12-19 17:07             ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 16:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> 
> What's that mb for?
> 

It already exists in mwait_idle_with_hints(); I just moved it into this
common function.  It is a bit odd, I have to admit; it seems like it
should be *before* the monitor (and possibly we should have one after
the CLFLUSH as well?)

> Also, can you please merge:
> 
>   http://marc.info/?l=linux-kernel&m=138685838420632
> 
> Thomas said he would pick up that series, but seems to have gone missing
> the past week or so.
> 

We need to get the immediate regression fixed.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 16:50           ` H. Peter Anvin
@ 2013-12-19 17:07             ` Ingo Molnar
  2013-12-19 17:25               ` Peter Zijlstra
  2013-12-19 18:09               ` H. Peter Anvin
  0 siblings, 2 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 17:07 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > 
> > What's that mb for?
> > 
> 
> It already exists in mwait_idle_with_hints(); I just moved it into 
> this common function.  It is a bit odd, I have to admit; it seems 
> like it should be *before* the monitor (and possibly we should have 
> one after the CLFLUSH as well?)

Yes, I think we need a barrier before the CLFLUSH, because according 
to my reading of the Intel documentation CLFLUSH has no implicit 
ordering so it might get reordered with the store to ->flags in 
current_set_polling_and_test(), which might result in spurious wakeup 
problems again.

(And CLFLUSH is a store in a sense, so special in that the regular 
ordering for stores does not apply.)

Likewise, having a barrier before the MONITOR looks sensible as well. 
Having it _after_ monitor looks weird and is probably wrong. [It might 
have been the effects of someone seeing the spurious wakeup problems 
with realizing the true source, or so.]

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:07             ` Ingo Molnar
@ 2013-12-19 17:25               ` Peter Zijlstra
  2013-12-19 17:36                 ` Peter Zijlstra
                                   ` (2 more replies)
  2013-12-19 18:09               ` H. Peter Anvin
  1 sibling, 3 replies; 27+ messages in thread
From: Peter Zijlstra @ 2013-12-19 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> 
> * H. Peter Anvin <hpa@zytor.com> wrote:
> 
> > On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > > 
> > > What's that mb for?
> > > 
> > 
> > It already exists in mwait_idle_with_hints(); I just moved it into 
> > this common function.  It is a bit odd, I have to admit; it seems 
> > like it should be *before* the monitor (and possibly we should have 
> > one after the CLFLUSH as well?)
> 
> Yes, I think we need a barrier before the CLFLUSH, because according 
> to my reading of the Intel documentation CLFLUSH has no implicit 
> ordering so it might get reordered with the store to ->flags in 
> current_set_polling_and_test(), which might result in spurious wakeup 
> problems again.

No it cannot; since current_set_polling_and_test() already has a barrier
to prevent that.

Also, the location patched by hpa doesn't actually call that at all.

That said, I would find it very strange indeed if a CLFLUSH doesn't also
flush the store buffer.

> (And CLFLUSH is a store in a sense, so special in that the regular 
> ordering for stores does not apply.)
> 
> Likewise, having a barrier before the MONITOR looks sensible as well. 
> Having it _after_ monitor looks weird and is probably wrong. [It might 
> have been the effects of someone seeing the spurious wakeup problems 
> with realizing the true source, or so.]

I again have to disagree, one would expect monitor to flush all that is
required to start the monitor -- and it actually does so. As is
testified by this extra CLFLUSH being called a bug workaround.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:25               ` Peter Zijlstra
@ 2013-12-19 17:36                 ` Peter Zijlstra
  2013-12-19 18:05                   ` H. Peter Anvin
  2013-12-19 17:50                 ` Peter Zijlstra
  2013-12-19 18:10                 ` Ingo Molnar
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2013-12-19 17:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> That said, I would find it very strange indeed if a CLFLUSH doesn't also
> flush the store buffer.

OK, it explicitly states it does not do that and you indeed need an
mfence before the clflush.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:25               ` Peter Zijlstra
  2013-12-19 17:36                 ` Peter Zijlstra
@ 2013-12-19 17:50                 ` Peter Zijlstra
  2013-12-19 18:18                   ` Ingo Molnar
  2013-12-19 18:10                 ` Ingo Molnar
  2 siblings, 1 reply; 27+ messages in thread
From: Peter Zijlstra @ 2013-12-19 17:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > Likewise, having a barrier before the MONITOR looks sensible as well. 
> 
> I again have to disagree, one would expect monitor to flush all that is
> required to start the monitor -- and it actually does so. As is
> testified by this extra CLFLUSH being called a bug workaround.

SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot pass
a previous STORE to the same address.

That said; there's enough holes in there to swim a titanic through,
seeing how MONITOR stares at an entire cacheline and LOAD/STORE order is
specified on location, whatever that means.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:36                 ` Peter Zijlstra
@ 2013-12-19 18:05                   ` H. Peter Anvin
  2013-12-19 18:14                     ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 18:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov

On 12/19/2013 09:36 AM, Peter Zijlstra wrote:
> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
>> That said, I would find it very strange indeed if a CLFLUSH doesn't also
>> flush the store buffer.
> 
> OK, it explicitly states it does not do that and you indeed need an
> mfence before the clflush.
> 

So, MONITOR is defined to be ordered as a load, which I think should be
adequate, but I really wonder if we should have mfence on both sides of
clflush.  This now is up to 9 bytes, and perhaps pushing it a bit with
how much we would be willing to patch out.

On the other hand - the CLFLUSH seems to have worked well enough by
itself, and this is all probabilistic anyway, so perhaps we should just
leave the naked CLFLUSH in and not worry about it unless measurements
say otherwise?

	-hpa


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:07             ` Ingo Molnar
  2013-12-19 17:25               ` Peter Zijlstra
@ 2013-12-19 18:09               ` H. Peter Anvin
  2013-12-19 18:19                 ` H. Peter Anvin
  2013-12-19 18:19                 ` Ingo Molnar
  1 sibling, 2 replies; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 18:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> 
> Likewise, having a barrier before the MONITOR looks sensible as well. 
> Having it _after_ monitor looks weird and is probably wrong. [It might 
> have been the effects of someone seeing the spurious wakeup problems 
> with realizing the true source, or so.]
> 

Does anyone know the history of this barrier after the monitor?  I know
Len is looking for a minimal patchset that can go into -stable, and it
seems prudent to not preturb the code more than necessary, but going
forward it would be nice to know...

	-hpa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:25               ` Peter Zijlstra
  2013-12-19 17:36                 ` Peter Zijlstra
  2013-12-19 17:50                 ` Peter Zijlstra
@ 2013-12-19 18:10                 ` Ingo Molnar
  2 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > 
> > * H. Peter Anvin <hpa@zytor.com> wrote:
> > 
> > > On 12/19/2013 08:21 AM, Peter Zijlstra wrote:
> > > > 
> > > > What's that mb for?
> > > > 
> > > 
> > > It already exists in mwait_idle_with_hints(); I just moved it into 
> > > this common function.  It is a bit odd, I have to admit; it seems 
> > > like it should be *before* the monitor (and possibly we should have 
> > > one after the CLFLUSH as well?)
> > 
> > Yes, I think we need a barrier before the CLFLUSH, because according 
> > to my reading of the Intel documentation CLFLUSH has no implicit 
> > ordering so it might get reordered with the store to ->flags in 
> > current_set_polling_and_test(), which might result in spurious wakeup 
> > problems again.
> 
> No it cannot; since current_set_polling_and_test() already has a 
> barrier to prevent that.

See below:

> Also, the location patched by hpa doesn't actually call that at all.
> 
> That said, I would find it very strange indeed if a CLFLUSH doesn't 
> also flush the store buffer.

So, the Intel documentation says (sorry about the lazy-link):

  http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc31.htm

 "CLFLUSH is only ordered by the MFENCE instruction. It is not 
  guaranteed to be ordered by any other fencing, serializing or other 
  CLFLUSH instruction. For example, software can use an MFENCE 
  instruction to insure that previous stores are included in the 
  write-back."

So a specific MFENCE barrier is needed.

Also note that this wording excludes implicit serialization such as 
LOCK prefix or XCHG barriers. As it happens 
current_set_polling_and_test() uses smp_mb(), which happens to map to 
MFENCE on all CPUs that can do CLFLUSH, but that's really just an 
accident and in no way engineered.

_At minimum_ we need a prominent comment at the clflush usage site 
that we rely on the MFENCE in current_set_polling_and_test() ...

> > (And CLFLUSH is a store in a sense, so special in that the regular 
> > ordering for stores does not apply.)
> > 
> > Likewise, having a barrier before the MONITOR looks sensible as 
> > well. Having it _after_ monitor looks weird and is probably wrong. 
> > [It might have been the effects of someone seeing the spurious 
> > wakeup problems with realizing the true source, or so.]
> 
> I again have to disagree, one would expect monitor to flush all that 
> is required to start the monitor -- and it actually does so. As is 
> testified by this extra CLFLUSH being called a bug workaround.

This assumption would be safer - although AFAICS the Intel 
MONITOR/MWAIT documentation is quiet about this aspect.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:05                   ` H. Peter Anvin
@ 2013-12-19 18:14                     ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 09:36 AM, Peter Zijlstra wrote:
> > On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> >> That said, I would find it very strange indeed if a CLFLUSH doesn't also
> >> flush the store buffer.
> > 
> > OK, it explicitly states it does not do that and you indeed need 
> > an mfence before the clflush.
> 
> So, MONITOR is defined to be ordered as a load, which I think should 
> be adequate, but I really wonder if we should have mfence on both 
> sides of clflush.  This now is up to 9 bytes, and perhaps pushing it 
> a bit with how much we would be willing to patch out.
> 
> On the other hand - the CLFLUSH seems to have worked well enough by 
> itself, and this is all probabilistic anyway, so perhaps we should 
> just leave the naked CLFLUSH in and not worry about it unless 
> measurements say otherwise?

So I think the window of breakage was rather large here, and since it 
seems to trigger on rare types of hardware I think we'd be better off 
by erring on the side of robustness this time around ...

This is the 'go to idle' path, which isn't as time-critical as the 
'get out of idle' code path.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 17:50                 ` Peter Zijlstra
@ 2013-12-19 18:18                   ` Ingo Molnar
  2013-12-19 21:05                     ` H. Peter Anvin
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
> > > Likewise, having a barrier before the MONITOR looks sensible as well. 
> > 
> > I again have to disagree, one would expect monitor to flush all that is
> > required to start the monitor -- and it actually does so. As is
> > testified by this extra CLFLUSH being called a bug workaround.
> 
> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot 
> pass a previous STORE to the same address.

Yes ... but you could argue that CLFLUSH is neither a load nor a 
store, it's a _cache sync_ operation, with its special ordering 
properties.

> That said; there's enough holes in there to swim a titanic through, 
> seeing how MONITOR stares at an entire cacheline and LOAD/STORE 
> order is specified on location, whatever that means.

I think assuming that MONITOR is ordered as a load or better is a 
pretty safe one (and in fact the Intel documentation seems to say so) 
- I'd say MONITOR is in micro-code and essentially snoops on cache 
events on that specific cache line, and loads the cache line on a 
snoop hit?

Btw., what state is the cache line after a MONITOR instruction, is it 
loaded as shared, or as excusive? (exclusive would probably be better 
for performance.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:09               ` H. Peter Anvin
@ 2013-12-19 18:19                 ` H. Peter Anvin
  2013-12-19 18:23                   ` Ingo Molnar
  2013-12-19 18:19                 ` Ingo Molnar
  1 sibling, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 18:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 10:09 AM, H. Peter Anvin wrote:
> On 12/19/2013 09:07 AM, Ingo Molnar wrote:
>>
>> Likewise, having a barrier before the MONITOR looks sensible as well. 
>> Having it _after_ monitor looks weird and is probably wrong. [It might 
>> have been the effects of someone seeing the spurious wakeup problems 
>> with realizing the true source, or so.]
>>
> 
> Does anyone know the history of this barrier after the monitor?  I know
> Len is looking for a minimal patchset that can go into -stable, and it
> seems prudent to not preturb the code more than necessary, but going
> forward it would be nice to know...
> 

Hmm... it *looks* like it is intended to be part of the construct:

	smp_mb();
	if (!need_resched())
		...

I found a note in the HLT variant of the function saying:

/*
 * TS_POLLING-cleared state must be visible before we
 * test NEED_RESCHED:
 */

... which presumably has been copied elsewhere.

	-hpa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:09               ` H. Peter Anvin
  2013-12-19 18:19                 ` H. Peter Anvin
@ 2013-12-19 18:19                 ` Ingo Molnar
  2013-12-19 19:22                   ` H. Peter Anvin
  1 sibling, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:19 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> > 
> > Likewise, having a barrier before the MONITOR looks sensible as 
> > well. Having it _after_ monitor looks weird and is probably wrong. 
> > [It might have been the effects of someone seeing the spurious 
> > wakeup problems with realizing the true source, or so.]
> 
> Does anyone know the history of this barrier after the monitor?  I 
> know Len is looking for a minimal patchset that can go into -stable, 
> and it seems prudent to not preturb the code more than necessary, 
> but going forward it would be nice to know...

For the minimal fix I don't think we should change it - but for v3.14 
it looks like a speedup for the from-idle codepath, which is 
performance sensitive.

( It would also be nice to know whether MONITOR loads that cacheline 
  into the CPUs cache, and in what state it loads it. )

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:19                 ` H. Peter Anvin
@ 2013-12-19 18:23                   ` Ingo Molnar
       [not found]                     ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
  0 siblings, 1 reply; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:23 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> On 12/19/2013 10:09 AM, H. Peter Anvin wrote:
> > On 12/19/2013 09:07 AM, Ingo Molnar wrote:
> >>
> >> Likewise, having a barrier before the MONITOR looks sensible as well. 
> >> Having it _after_ monitor looks weird and is probably wrong. [It might 
> >> have been the effects of someone seeing the spurious wakeup problems 
> >> with realizing the true source, or so.]
> >>
> > 
> > Does anyone know the history of this barrier after the monitor?  I know
> > Len is looking for a minimal patchset that can go into -stable, and it
> > seems prudent to not preturb the code more than necessary, but going
> > forward it would be nice to know...
> > 
> 
> Hmm... it *looks* like it is intended to be part of the construct:
> 
> 	smp_mb();
> 	if (!need_resched())
> 		...
> 
> I found a note in the HLT variant of the function saying:
> 
> /*
>  * TS_POLLING-cleared state must be visible before we
>  * test NEED_RESCHED:
>  */

Yes, that makes sense: the need_resched test is a load, and MONITOR is 
a load as well. Can the two ever cross, or does the CPU guarantee that 
because it's the same address, the loads don't cross?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
       [not found]                     ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
@ 2013-12-19 18:43                       ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Mike Galbraith, Thomas Gleixner, Len Brown,
	linux-pm, x86, Peter Zijlstra, Len Brown, H. Peter Anvin, stable,
	linux-kernel


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The x86 memory rules are that two loads always execute in order (ie 
> rmb is a no-op).
> 
> So I see no reason for a memory barrier after the monitor. [...]

Yes, I'm leaning towards that interpretation as well, but the reason 
I'm a bit catious is the somewhat curious (to me!) wording of the 
MONITOR instruction:

  Sets up a linear address range to be monitored by hardware and 
  activates the monitor.

  ...

  The MONITOR instruction is ordered as a load operation with respect 
  to other memory trans­actions. The instruction can be used at all 
  privilege levels and is subject to all permission checking and 
  faults associated with a byte load. Like a load, the MONITOR 
  instruction sets the A-bit but not the D-bit in page tables.

Where apparently the 'range' means 'full cache line surrounding the 
memory address in question'.

We have no other load instructions that operate on such a large 
'range' of addresses, and I wanted to make sure it's a true (single 
byte) load for that specific address. The documentation does not 
appear to explicitly state that it's a load for that address - only 
that it's ordered as a load.

The reason I'm asking is because 'flags' itself might not be at the 
beginning of the cache line, as it's in the middle of thread_info:

 struct thread_info {
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   flags;          /* low level flags */

while 'MONITOR' appears to work on the cache line. So are all 
addresses within that cache line ordered? Only the specific address 
given to the instruction itself? Only the first word of the cacheline 
itself?

The documentation is a bit vague, at least in my reading, and 
depending on which actual word the instruction reads (if it reads any 
word at all ... it's probably just setting up an address for MWAIT) 
from that cacheline, its ordering properties might be surprising.

> [...] But both sides of clflush sounds sane, and as mentioned the 
> "go to sleep" side isn't as critical as the "wake up" side if the 
> monitor.

Yeah.

> Please let's just make that pre-monitor hack be a static key, and do 
> mfence explicitly around the clflush inside that conditional 
> section.

Agreed.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:19                 ` Ingo Molnar
@ 2013-12-19 19:22                   ` H. Peter Anvin
  2013-12-19 19:27                     ` Peter Zijlstra
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 19:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On 12/19/2013 10:19 AM, Ingo Molnar wrote:
> 
> ( It would also be nice to know whether MONITOR loads that cacheline 
>   into the CPUs cache, and in what state it loads it. )
> 

I would assume that is implementation-dependent.  However, one plausible
implementation is to load the cache line into the cache in shared state
and monitor for evictions.

	-hpa

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 19:22                   ` H. Peter Anvin
@ 2013-12-19 19:27                     ` Peter Zijlstra
  0 siblings, 0 replies; 27+ messages in thread
From: Peter Zijlstra @ 2013-12-19 19:27 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ingo Molnar, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov

On Thu, Dec 19, 2013 at 11:22:01AM -0800, H. Peter Anvin wrote:
> On 12/19/2013 10:19 AM, Ingo Molnar wrote:
> > 
> > ( It would also be nice to know whether MONITOR loads that cacheline 
> >   into the CPUs cache, and in what state it loads it. )
> > 
> 
> I would assume that is implementation-dependent.  However, one plausible
> implementation is to load the cache line into the cache in shared state
> and monitor for evictions.

I suppose the monitor part is the important part because certain C
states drop all cache, presumably including the cacheline we're actually
monitoring.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 18:18                   ` Ingo Molnar
@ 2013-12-19 21:05                     ` H. Peter Anvin
  2013-12-19 21:17                       ` Ingo Molnar
  0 siblings, 1 reply; 27+ messages in thread
From: H. Peter Anvin @ 2013-12-19 21:05 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: Len Brown, x86, linux-pm, linux-kernel, Len Brown, stable,
	Linus Torvalds, Thomas Gleixner, Mike Galbraith, Borislav Petkov

CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE according to the SDM.

Ingo Molnar <mingo@kernel.org> wrote:
>
>* Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Dec 19, 2013 at 06:25:35PM +0100, Peter Zijlstra wrote:
>> > On Thu, Dec 19, 2013 at 06:07:41PM +0100, Ingo Molnar wrote:
>> > > Likewise, having a barrier before the MONITOR looks sensible as
>well. 
>> > 
>> > I again have to disagree, one would expect monitor to flush all
>that is
>> > required to start the monitor -- and it actually does so. As is
>> > testified by this extra CLFLUSH being called a bug workaround.
>> 
>> SDM states that MONITOR is ordered like a LOAD, and a LOAD cannot 
>> pass a previous STORE to the same address.
>
>Yes ... but you could argue that CLFLUSH is neither a load nor a 
>store, it's a _cache sync_ operation, with its special ordering 
>properties.
>
>> That said; there's enough holes in there to swim a titanic through, 
>> seeing how MONITOR stares at an entire cacheline and LOAD/STORE 
>> order is specified on location, whatever that means.
>
>I think assuming that MONITOR is ordered as a load or better is a 
>pretty safe one (and in fact the Intel documentation seems to say so) 
>- I'd say MONITOR is in micro-code and essentially snoops on cache 
>events on that specific cache line, and loads the cache line on a 
>snoop hit?
>
>Btw., what state is the cache line after a MONITOR instruction, is it 
>loaded as shared, or as excusive? (exclusive would probably be better 
>for performance.)
>
>Thanks,
>
>	Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] x86 idle: repair large-server 50-watt idle-power regression
  2013-12-19 21:05                     ` H. Peter Anvin
@ 2013-12-19 21:17                       ` Ingo Molnar
  0 siblings, 0 replies; 27+ messages in thread
From: Ingo Molnar @ 2013-12-19 21:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, Len Brown, x86, linux-pm, linux-kernel, Len Brown,
	stable, Linus Torvalds, Thomas Gleixner, Mike Galbraith,
	Borislav Petkov


* H. Peter Anvin <hpa@zytor.com> wrote:

> CLFLUSH is only (guaranteed to be) ordered with respect to MFENCE 
> according to the SDM.

Yes, I quoted the relevant text a couple of hours ago, CLFLUSH is 
pretty special, so it's not ordered against atomics or serializing 
instructions for example - only ordered against explict MFENCE calls, 
which on x86 is mb()/smp_mb().

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-12-19 21:17 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-18 21:44 [PATCH] x86 idle: repair large-server 50-watt idle-power regression Len Brown
2013-12-19 12:22 ` Ingo Molnar
2013-12-19 14:40   ` H. Peter Anvin
2013-12-19 15:45     ` Borislav Petkov
2013-12-19 15:55   ` H. Peter Anvin
2013-12-19 16:02     ` Ingo Molnar
2013-12-19 16:09       ` H. Peter Anvin
2013-12-19 16:13       ` H. Peter Anvin
2013-12-19 16:21         ` Peter Zijlstra
2013-12-19 16:50           ` H. Peter Anvin
2013-12-19 17:07             ` Ingo Molnar
2013-12-19 17:25               ` Peter Zijlstra
2013-12-19 17:36                 ` Peter Zijlstra
2013-12-19 18:05                   ` H. Peter Anvin
2013-12-19 18:14                     ` Ingo Molnar
2013-12-19 17:50                 ` Peter Zijlstra
2013-12-19 18:18                   ` Ingo Molnar
2013-12-19 21:05                     ` H. Peter Anvin
2013-12-19 21:17                       ` Ingo Molnar
2013-12-19 18:10                 ` Ingo Molnar
2013-12-19 18:09               ` H. Peter Anvin
2013-12-19 18:19                 ` H. Peter Anvin
2013-12-19 18:23                   ` Ingo Molnar
     [not found]                     ` <CA+55aFzGxcML7j8CEvQPYzh0W81uVoAAVmGctMOUZ7CZ1yYd2A@mail.gmail.com>
2013-12-19 18:43                       ` Ingo Molnar
2013-12-19 18:19                 ` Ingo Molnar
2013-12-19 19:22                   ` H. Peter Anvin
2013-12-19 19:27                     ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox