public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [patch 0/2] x86: Make offline cpus to go to deepest idle state using mwait
@ 2009-05-22 23:19 venkatesh.pallipadi
  2009-05-22 23:19 ` [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs venkatesh.pallipadi
  2009-05-22 23:19 ` [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate venkatesh.pallipadi
  0 siblings, 2 replies; 14+ messages in thread
From: venkatesh.pallipadi @ 2009-05-22 23:19 UTC (permalink / raw)
  To: mingo, tglx, hpa
  Cc: linux-kernel, lenb, shaohua.li, svaidy, a.p.zijlstra,
	Venkatesh Pallipadi

cpu hotplug has this long standing bug, with offline cpus going into
hlt loop. That can result in offline CPU burning more power than while it is
in online state (as when online it can potentially go to deeper C-state).
Fix the bug using native mwait to enter deepest C-state in play_dead.

-- 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs
  2009-05-22 23:19 [patch 0/2] x86: Make offline cpus to go to deepest idle state using mwait venkatesh.pallipadi
@ 2009-05-22 23:19 ` venkatesh.pallipadi
  2009-05-23 10:44   ` Peter Zijlstra
  2009-05-22 23:19 ` [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate venkatesh.pallipadi
  1 sibling, 1 reply; 14+ messages in thread
From: venkatesh.pallipadi @ 2009-05-22 23:19 UTC (permalink / raw)
  To: mingo, tglx, hpa
  Cc: linux-kernel, lenb, shaohua.li, svaidy, a.p.zijlstra,
	Venkatesh Pallipadi

[-- Attachment #1: 0001-x86-Add-pm_play_dead-funcptr-to-power-efficiently-o.patch --]
[-- Type: text/plain, Size: 1411 bytes --]

Add a funcptr pm_play_dead (similar to pm_idle) that can take
the offline CPUs to the most power efficient idle state.

This patch just adds the func pointer. The pointer will get initialized
by patch that follows.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
 arch/x86/include/asm/smp.h |    2 ++
 arch/x86/kernel/smpboot.c  |    7 ++++++-
 2 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 19e0d88..0388b81 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -148,6 +148,8 @@ static inline int num_booting_cpus(void)
 {
 	return cpumask_weight(cpu_callout_mask);
 }
+
+extern void (*pm_play_dead)(void);
 #endif /* CONFIG_SMP */
 
 extern unsigned disabled_cpus __cpuinitdata;
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 58d24ef..4f8af6a 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1195,6 +1195,8 @@ __init void prefill_possible_map(void)
 	nr_cpu_ids = possible;
 }
 
+void (*pm_play_dead)(void) = NULL;
+
 #ifdef CONFIG_HOTPLUG_CPU
 
 static void remove_siblinginfo(int cpu)
@@ -1313,7 +1315,10 @@ void play_dead_common(void)
 void native_play_dead(void)
 {
 	play_dead_common();
-	wbinvd_halt();
+	if (pm_play_dead)
+		pm_play_dead();
+	else
+		wbinvd_halt();
 }
 
 #else /* ... !CONFIG_HOTPLUG_CPU */
-- 
1.6.0.6

-- 


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate
  2009-05-22 23:19 [patch 0/2] x86: Make offline cpus to go to deepest idle state using mwait venkatesh.pallipadi
  2009-05-22 23:19 ` [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs venkatesh.pallipadi
@ 2009-05-22 23:19 ` venkatesh.pallipadi
  2009-05-25  0:56   ` Shaohua Li
  2010-09-17 23:46   ` [tip:x86/idle] x86, hotplug: Use mwait to offline a processor, fix the legacy case tip-bot for H. Peter Anvin
  1 sibling, 2 replies; 14+ messages in thread
From: venkatesh.pallipadi @ 2009-05-22 23:19 UTC (permalink / raw)
  To: mingo, tglx, hpa
  Cc: linux-kernel, lenb, shaohua.li, svaidy, a.p.zijlstra,
	Venkatesh Pallipadi

[-- Attachment #1: 0002-x86-put-offline-CPUs-into-deepest-mwait-cstate_subc.patch --]
[-- Type: text/plain, Size: 2075 bytes --]

Offline CPUs can save power by going into deepest cstate, subcstate
instead of hlt loop.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
---
 arch/x86/kernel/acpi/cstate.c |   51 +++++++++++++++++++++++++++++++++++++++++
 1 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
index bbbe4bb..5b0988a 100644
--- a/arch/x86/kernel/acpi/cstate.c
+++ b/arch/x86/kernel/acpi/cstate.c
@@ -150,6 +150,54 @@ void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 }
 EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
 
+static unsigned long mwait_play_dead_eax;
+
+static void mwait_play_dead(void)
+{
+	if (boot_cpu_data.x86 >= 4)
+		wbinvd();
+
+	while (1) {
+		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		smp_mb();
+		__mwait(mwait_play_dead_eax, 0);
+	}
+}
+
+static void mwait_play_dead_init(void *unused)
+{
+	unsigned int eax, ebx, ecx, edx;
+	unsigned int highest_cstate = 0;
+	unsigned int highest_subcstate = 0;
+	int i;
+
+	if (!boot_cpu_has(X86_FEATURE_MWAIT))
+		return;
+	
+	if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return;
+
+	pm_play_dead = mwait_play_dead;
+
+	cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &edx);
+	/*
+	 * mwait_play_dead_eax will be 0 if EDX enumeration is not valid.
+	 * Initialized below to cstate, sub_cstate value when EDX is valid.
+	 */
+	if (!(ecx & 0x1))
+		return;
+		
+	edx >>= MWAIT_SUBSTATE_SIZE;
+	for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+		if (edx & MWAIT_SUBSTATE_MASK) {
+			highest_cstate = i;
+			highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+		}
+	}
+	mwait_play_dead_eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+			      (highest_subcstate - 1);
+}
+
 static int __init ffh_cstate_init(void)
 {
 	struct cpuinfo_x86 *c = &boot_cpu_data;
@@ -157,6 +205,9 @@ static int __init ffh_cstate_init(void)
 		return -1;
 
 	cpu_cstate_entry = alloc_percpu(struct cstate_entry);
+
+	smp_call_function_single(0, mwait_play_dead_init, NULL, 1);
+
 	return 0;
 }
 
-- 
1.6.0.6

-- 


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs
  2009-05-22 23:19 ` [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs venkatesh.pallipadi
@ 2009-05-23 10:44   ` Peter Zijlstra
  2009-05-23 15:07     ` Pallipadi, Venkatesh
  2009-06-22 17:25     ` Pallipadi, Venkatesh
  0 siblings, 2 replies; 14+ messages in thread
From: Peter Zijlstra @ 2009-05-23 10:44 UTC (permalink / raw)
  To: venkatesh.pallipadi
  Cc: mingo, tglx, hpa, linux-kernel, lenb, shaohua.li, svaidy

On Fri, 2009-05-22 at 16:19 -0700, venkatesh.pallipadi@intel.com wrote:
> plain text document attachment
> (0001-x86-Add-pm_play_dead-funcptr-to-power-efficiently-o.patch)
> Add a funcptr pm_play_dead (similar to pm_idle) that can take
> the offline CPUs to the most power efficient idle state.
> 
> This patch just adds the func pointer. The pointer will get initialized
> by patch that follows.

Since the pm_idle function pointer has given us so much grief, I don't
think its wise to repeat that particular disaster.

I'd much rather see a framework where idle functions can be registered,
and selected from based on criteria such as wakeup latency as provided
by the pm_qos stuff, and power saving.

This framework should be shared between hotplug-idle and the regular
idle routines. Hotplug would of course not care about things like wakeup
latency and might therefore pick another idle routine.

> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> ---
>  arch/x86/include/asm/smp.h |    2 ++
>  arch/x86/kernel/smpboot.c  |    7 ++++++-
>  2 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
> index 19e0d88..0388b81 100644
> --- a/arch/x86/include/asm/smp.h
> +++ b/arch/x86/include/asm/smp.h
> @@ -148,6 +148,8 @@ static inline int num_booting_cpus(void)
>  {
>  	return cpumask_weight(cpu_callout_mask);
>  }
> +
> +extern void (*pm_play_dead)(void);
>  #endif /* CONFIG_SMP */
>  
>  extern unsigned disabled_cpus __cpuinitdata;
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 58d24ef..4f8af6a 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1195,6 +1195,8 @@ __init void prefill_possible_map(void)
>  	nr_cpu_ids = possible;
>  }
>  
> +void (*pm_play_dead)(void) = NULL;
> +
>  #ifdef CONFIG_HOTPLUG_CPU
>  
>  static void remove_siblinginfo(int cpu)
> @@ -1313,7 +1315,10 @@ void play_dead_common(void)
>  void native_play_dead(void)
>  {
>  	play_dead_common();
> -	wbinvd_halt();
> +	if (pm_play_dead)
> +		pm_play_dead();
> +	else
> +		wbinvd_halt();
>  }
>  
>  #else /* ... !CONFIG_HOTPLUG_CPU */
> -- 
> 1.6.0.6
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs
  2009-05-23 10:44   ` Peter Zijlstra
@ 2009-05-23 15:07     ` Pallipadi, Venkatesh
  2009-06-22 17:25     ` Pallipadi, Venkatesh
  1 sibling, 0 replies; 14+ messages in thread
From: Pallipadi, Venkatesh @ 2009-05-23 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo@elte.hu, tglx@linutronix.de, hpa@zytor.com,
	linux-kernel@vger.kernel.org, lenb@kernel.org, Li, Shaohua,
	svaidy@linux.vnet.ibm.com

 

>-----Original Message-----
>From: Peter Zijlstra [mailto:a.p.zijlstra@chello.nl] 
>Sent: Saturday, May 23, 2009 3:45 AM
>To: Pallipadi, Venkatesh
>Cc: mingo@elte.hu; tglx@linutronix.de; hpa@zytor.com; 
>linux-kernel@vger.kernel.org; lenb@kernel.org; Li, Shaohua; 
>svaidy@linux.vnet.ibm.com
>Subject: Re: [patch 1/2] x86: Add pm_play_dead funcptr to 
>power-efficiently offline CPUs
>
>On Fri, 2009-05-22 at 16:19 -0700, venkatesh.pallipadi@intel.com wrote:
>> plain text document attachment
>> (0001-x86-Add-pm_play_dead-funcptr-to-power-efficiently-o.patch)
>> Add a funcptr pm_play_dead (similar to pm_idle) that can take
>> the offline CPUs to the most power efficient idle state.
>> 
>> This patch just adds the func pointer. The pointer will get 
>initialized
>> by patch that follows.
>
>Since the pm_idle function pointer has given us so much grief, I don't
>think its wise to repeat that particular disaster.
>
>I'd much rather see a framework where idle functions can be registered,
>and selected from based on criteria such as wakeup latency as provided
>by the pm_qos stuff, and power saving.
>
>This framework should be shared between hotplug-idle and the regular
>idle routines. Hotplug would of course not care about things 
>like wakeup
>latency and might therefore pick another idle routine.
>

Yes. Thought about that approach. There are issues with that approach however.
- It would be ugly as we have to add logic in various low level idle handlers we have today mwait_idle, default_idle, acpi_mwait_idle, acpi_ioport_idle to make them aware of offline idle case and disable interrupt and not enable interrupt while going idle and not to wake up on interrupt.
- There are limitations on what we can and cannot do when offline. We have had issues with calling acpi methods and/or IO port accesses, during suspend before, so CPU going on suspend only uses C1. Specifically, with the above approach we will have:
1) CPU 1 going offline and starts using regular ACPI IO port C-states.
2) CPU 0 starts suspend and after some points it is unsafe to access ACPI io port accesses, which can potentially result in crash/hangs.

So, I would like to keep ACPI out of play_dead. And keeping idle and play_dead separate seemed to be cleaner way to do it.

Thanks,
Venki

 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate
  2009-05-22 23:19 ` [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate venkatesh.pallipadi
@ 2009-05-25  0:56   ` Shaohua Li
  2009-05-26 21:17     ` Pallipadi, Venkatesh
  2010-09-17 23:46   ` [tip:x86/idle] x86, hotplug: Use mwait to offline a processor, fix the legacy case tip-bot for H. Peter Anvin
  1 sibling, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2009-05-25  0:56 UTC (permalink / raw)
  To: Pallipadi, Venkatesh
  Cc: mingo@elte.hu, tglx@linutronix.de, hpa@zytor.com,
	linux-kernel@vger.kernel.org, lenb@kernel.org,
	svaidy@linux.vnet.ibm.com, a.p.zijlstra@chello.nl

On Sat, May 23, 2009 at 07:19:42AM +0800, Pallipadi, Venkatesh wrote:
> Offline CPUs can save power by going into deepest cstate, subcstate
> instead of hlt loop.
> 
> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> ---
>  arch/x86/kernel/acpi/cstate.c |   51 +++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 51 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> index bbbe4bb..5b0988a 100644
> --- a/arch/x86/kernel/acpi/cstate.c
> +++ b/arch/x86/kernel/acpi/cstate.c
> @@ -150,6 +150,54 @@ void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
>  }
>  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
>  
> +static unsigned long mwait_play_dead_eax;
> +
> +static void mwait_play_dead(void)
> +{
> +	if (boot_cpu_data.x86 >= 4)
> +		wbinvd();
> +
> +	while (1) {
> +		__monitor((void *)&current_thread_info()->flags, 0, 0);
> +		smp_mb();
> +		__mwait(mwait_play_dead_eax, 0);
> +	}
> +}
CPU is dead, can current_thread_info() still be used? Maybe just monitor a never changed
address.

Looks the patch will always take the highest native C-state, is this safe, considering
BIOS usually limit C-state?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate
  2009-05-25  0:56   ` Shaohua Li
@ 2009-05-26 21:17     ` Pallipadi, Venkatesh
  0 siblings, 0 replies; 14+ messages in thread
From: Pallipadi, Venkatesh @ 2009-05-26 21:17 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: mingo@elte.hu, tglx@linutronix.de, hpa@zytor.com,
	linux-kernel@vger.kernel.org, lenb@kernel.org,
	svaidy@linux.vnet.ibm.com, a.p.zijlstra@chello.nl

On Sun, 2009-05-24 at 17:56 -0700, Li, Shaohua wrote:
> On Sat, May 23, 2009 at 07:19:42AM +0800, Pallipadi, Venkatesh wrote:
> > Offline CPUs can save power by going into deepest cstate, subcstate
> > instead of hlt loop.
> > 
> > Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
> > ---
> >  arch/x86/kernel/acpi/cstate.c |   51 +++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 51 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/acpi/cstate.c b/arch/x86/kernel/acpi/cstate.c
> > index bbbe4bb..5b0988a 100644
> > --- a/arch/x86/kernel/acpi/cstate.c
> > +++ b/arch/x86/kernel/acpi/cstate.c
> > @@ -150,6 +150,54 @@ void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
> >  }
> >  EXPORT_SYMBOL_GPL(acpi_processor_ffh_cstate_enter);
> >  
> > +static unsigned long mwait_play_dead_eax;
> > +
> > +static void mwait_play_dead(void)
> > +{
> > +	if (boot_cpu_data.x86 >= 4)
> > +		wbinvd();
> > +
> > +	while (1) {
> > +		__monitor((void *)&current_thread_info()->flags, 0, 0);
> > +		smp_mb();
> > +		__mwait(mwait_play_dead_eax, 0);
> > +	}
> > +}
> CPU is dead, can current_thread_info() still be used? Maybe just monitor a never changed
> address.
> 

We still execute in this while loop even for an offline CPU, in case CPU
just wakes up for whatever reason (HT sibling woke up etc). So, we
should have stack and current_thread_info() even when offline, and we
can use it here. I had a never changing variable in my earlier version
of this patch. But removed it as I don't see why we should have couple
of cache lines of data when we can avoid it.

> Looks the patch will always take the highest native C-state, is this safe, considering
> BIOS usually limit C-state?
> 

It may not be safe in terms of wakeup latency etc. So, we use BIOS CST
for normal C-states. When offline, we don't really care about latency
part. If CPU supports a C-state, we should be able to enter it and we
can save most power with deepest C-state. If there are functionality
issues with C-state, I am sure CPU feature (hw or microcode) will have
it disabled, instead of depending on BIOSes to disable it.

Thanks,
Venki


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs
  2009-05-23 10:44   ` Peter Zijlstra
  2009-05-23 15:07     ` Pallipadi, Venkatesh
@ 2009-06-22 17:25     ` Pallipadi, Venkatesh
  1 sibling, 0 replies; 14+ messages in thread
From: Pallipadi, Venkatesh @ 2009-06-22 17:25 UTC (permalink / raw)
  To: Pallipadi, Venkatesh, Peter Zijlstra
  Cc: mingo@elte.hu, tglx@linutronix.de, hpa@zytor.com,
	linux-kernel@vger.kernel.org, lenb@kernel.org, Li, Shaohua,
	svaidy@linux.vnet.ibm.com

 

>From: Pallipadi, Venkatesh 
>>From: Peter Zijlstra [mailto:a.p.zijlstra@chello.nl] 
>>On Fri, 2009-05-22 at 16:19 -0700, 
>venkatesh.pallipadi@intel.com wrote:
>>> plain text document attachment
>>> (0001-x86-Add-pm_play_dead-funcptr-to-power-efficiently-o.patch)
>>> Add a funcptr pm_play_dead (similar to pm_idle) that can take
>>> the offline CPUs to the most power efficient idle state.
>>> 
>>> This patch just adds the func pointer. The pointer will get 
>>initialized
>>> by patch that follows.
>>
>>Since the pm_idle function pointer has given us so much grief, I don't
>>think its wise to repeat that particular disaster.
>>
>>I'd much rather see a framework where idle functions can be 
>registered,
>>and selected from based on criteria such as wakeup latency as provided
>>by the pm_qos stuff, and power saving.
>>
>>This framework should be shared between hotplug-idle and the regular
>>idle routines. Hotplug would of course not care about things 
>>like wakeup
>>latency and might therefore pick another idle routine.
>>
>
>Yes. Thought about that approach. There are issues with that 
>approach however.
>- It would be ugly as we have to add logic in various low 
>level idle handlers we have today mwait_idle, default_idle, 
>acpi_mwait_idle, acpi_ioport_idle to make them aware of 
>offline idle case and disable interrupt and not enable 
>interrupt while going idle and not to wake up on interrupt.
>- There are limitations on what we can and cannot do when 
>offline. We have had issues with calling acpi methods and/or 
>IO port accesses, during suspend before, so CPU going on 
>suspend only uses C1. Specifically, with the above approach we 
>will have:
>1) CPU 1 going offline and starts using regular ACPI IO port C-states.
>2) CPU 0 starts suspend and after some points it is unsafe to 
>access ACPI io port accesses, which can potentially result in 
>crash/hangs.
>
>So, I would like to keep ACPI out of play_dead. And keeping 
>idle and play_dead separate seemed to be cleaner way to do it.
>

Peter,

Any more comments on this approach? As I mentioned above, from my POV,
it is simpler and cleaner to keep ACPI out of the CPU offline path.
Do you agree?

Thanks,
Venki

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [tip:x86/idle] x86, hotplug: Use mwait to offline a processor, fix the legacy case
  2009-05-22 23:19 ` [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate venkatesh.pallipadi
  2009-05-25  0:56   ` Shaohua Li
@ 2010-09-17 23:46   ` tip-bot for H. Peter Anvin
  2010-09-18  0:13     ` [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop tip-bot for H. Peter Anvin
  1 sibling, 1 reply; 14+ messages in thread
From: tip-bot for H. Peter Anvin @ 2010-09-17 23:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, arjan, venkatesh.pallipadi,
	a.p.zijlstra, lenb, tglx, hpa, venki

Commit-ID:  ea53069231f9317062910d6e772cca4ce93de8c8
Gitweb:     http://git.kernel.org/tip/ea53069231f9317062910d6e772cca4ce93de8c8
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Fri, 17 Sep 2010 15:39:11 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 17 Sep 2010 15:39:11 -0700

x86, hotplug: Use mwait to offline a processor, fix the legacy case

The code in native_play_dead() has a number of problems:

1. We should use MWAIT when available, to put ourselves into a deeper
   sleep state.
2. We use the existence of CLFLUSH to determine if WBINVD is safe, but
   that is totally bogus -- WBINVD is 486+, whereas CLFLUSH is a much
   later addition.
3. We should do WBINVD inside the loop, just in case of something like
   setting an A bit on page tables.  Pointed out by Arjan van de Ven.

This code is based in part of a previous patch by Venki Pallipadi, but
unlike that patch this one keeps all the detection code local instead
of pre-caching a bunch of information.  We're shutting down the CPU;
there is absolutely no hurry.

This patch moves all the code to C and deletes the global
wbinvd_halt() which is broken anyway.

Originally-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.hl>
LKML-Reference: <20090522232230.162239000@intel.com>
---
 arch/x86/include/asm/processor.h |   23 --------------
 arch/x86/kernel/smpboot.c        |   63 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 62 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 325b7bd..f358241 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -764,29 +764,6 @@ extern unsigned long		idle_halt;
 extern unsigned long		idle_nomwait;
 extern bool			c1e_detected;
 
-/*
- * on systems with caches, caches must be flashed as the absolute
- * last instruction before going into a suspended halt.  Otherwise,
- * dirty data can linger in the cache and become stale on resume,
- * leading to strange errors.
- *
- * perform a variety of operations to guarantee that the compiler
- * will not reorder instructions.  wbinvd itself is serializing
- * so the processor will not reorder.
- *
- * Systems without cache can just go into halt.
- */
-static inline void wbinvd_halt(void)
-{
-	mb();
-	/* check for clflush to determine if wbinvd is legal */
-	if (cpu_has_clflush)
-		asm volatile("cli; wbinvd; 1: hlt; jmp 1b" : : : "memory");
-	else
-		while (1)
-			halt();
-}
-
 extern void enable_sep_cpu(void);
 extern int sysenter_setup(void);
 
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8b3bfc4..07bf423 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -62,6 +62,7 @@
 #include <asm/pgtable.h>
 #include <asm/tlbflush.h>
 #include <asm/mtrr.h>
+#include <asm/mwait.h>
 #include <asm/vmi.h>
 #include <asm/apic.h>
 #include <asm/setup.h>
@@ -1383,11 +1384,71 @@ void play_dead_common(void)
 	local_irq_disable();
 }
 
+/*
+ * We need to flush the caches before going to sleep, lest we have
+ * dirty data in our caches when we come back up.
+ */
+static inline void mwait_play_dead(void)
+{
+	unsigned int eax, ebx, ecx, edx;
+	unsigned int highest_cstate = 0;
+	unsigned int highest_subcstate = 0;
+	int i;
+
+	if (!cpu_has(&current_cpu_data, X86_FEATURE_MWAIT))
+		return;
+	if (current_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+		return;
+
+	eax = CPUID_MWAIT_LEAF;
+	ecx = 0;
+	native_cpuid(&eax, &ebx, &ecx, &edx);
+
+	/*
+	 * eax will be 0 if EDX enumeration is not valid.
+	 * Initialized below to cstate, sub_cstate value when EDX is valid.
+	 */
+	if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED)) {
+		eax = 0;
+	} else {
+		edx >>= MWAIT_SUBSTATE_SIZE;
+		for (i = 0; i < 7 && edx; i++, edx >>= MWAIT_SUBSTATE_SIZE) {
+			if (edx & MWAIT_SUBSTATE_MASK) {
+				highest_cstate = i;
+				highest_subcstate = edx & MWAIT_SUBSTATE_MASK;
+			}
+		}
+		eax = (highest_cstate << MWAIT_SUBSTATE_SIZE) |
+			(highest_subcstate - 1);
+	}
+
+	while (1) {
+		mb();
+		wbinvd();
+		__monitor(&current_thread_info()->flags, 0, 0);
+		mb();
+		__mwait(eax, 0);
+	}
+}
+
+static inline void hlt_play_dead(void)
+{
+	while (1) {
+		mb();
+		if (current_cpu_data.x86 >= 4)
+			wbinvd();
+		mb();
+		native_halt();
+	}
+}
+
 void native_play_dead(void)
 {
 	play_dead_common();
 	tboot_shutdown(TB_SHUTDOWN_WFS);
-	wbinvd_halt();
+
+	mwait_play_dead();	/* Only returns on failure */
+	hlt_play_dead();
 }
 
 #else /* ... !CONFIG_HOTPLUG_CPU */

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop
  2010-09-17 23:46   ` [tip:x86/idle] x86, hotplug: Use mwait to offline a processor, fix the legacy case tip-bot for H. Peter Anvin
@ 2010-09-18  0:13     ` tip-bot for H. Peter Anvin
  2010-09-18  0:48       ` Venkatesh Pallipadi
  0 siblings, 1 reply; 14+ messages in thread
From: tip-bot for H. Peter Anvin @ 2010-09-18  0:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, asit.k.mallick, a.p.zijlstra, arjan,
	lenb, tglx, hpa, venki

Commit-ID:  a68e5c94f7d3dd64fef34dd5d97e365cae4bb42a
Gitweb:     http://git.kernel.org/tip/a68e5c94f7d3dd64fef34dd5d97e365cae4bb42a
Author:     H. Peter Anvin <hpa@linux.intel.com>
AuthorDate: Fri, 17 Sep 2010 17:06:46 -0700
Committer:  H. Peter Anvin <hpa@linux.intel.com>
CommitDate: Fri, 17 Sep 2010 17:10:23 -0700

x86, hotplug: Move WBINVD back outside the play_dead loop

On processors with hyperthreading, when only one thread is offlined
the other thread can cause a spurious wakeup on the idled thread.  We
do not want to re-WBINVD when that happens.

Ideally, we should simply skip WBINVD unless we're the last thread on
a particular core to shut down, but there might be similar issues
elsewhere in the system.

Thus, revert to previous behavior of only WBINVD outside the loop.
Partly as a result, remove the mb()'s around it: they are not
necessary since wbinvd() is a serializing instruction, but they were
intended to make sure the compiler didn't do any funny loop
optimizations.

Reported-by: Asit Mallick <asit.k.mallick@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.hl>
LKML-Reference: <tip-ea53069231f9317062910d6e772cca4ce93de8c8@git.kernel.org>
---
 arch/x86/kernel/smpboot.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 07bf423..55c80ffb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1422,9 +1422,9 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	wbinvd();
+
 	while (1) {
-		mb();
-		wbinvd();
 		__monitor(&current_thread_info()->flags, 0, 0);
 		mb();
 		__mwait(eax, 0);
@@ -1433,11 +1433,10 @@ static inline void mwait_play_dead(void)
 
 static inline void hlt_play_dead(void)
 {
+	if (current_cpu_data.x86 >= 4)
+		wbinvd();
+
 	while (1) {
-		mb();
-		if (current_cpu_data.x86 >= 4)
-			wbinvd();
-		mb();
 		native_halt();
 	}
 }

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop
  2010-09-18  0:13     ` [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop tip-bot for H. Peter Anvin
@ 2010-09-18  0:48       ` Venkatesh Pallipadi
  2010-09-20 18:20         ` H. Peter Anvin
  0 siblings, 1 reply; 14+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-18  0:48 UTC (permalink / raw)
  To: mingo, hpa, linux-kernel, asit.k.mallick, arjan, a.p.zijlstra,
	lenb, tglx, hpa, venki
  Cc: linux-tip-commits

On Fri, Sep 17, 2010 at 5:13 PM, tip-bot for H. Peter Anvin
<hpa@linux.intel.com> wrote:
> Commit-ID:  a68e5c94f7d3dd64fef34dd5d97e365cae4bb42a
> Gitweb:     http://git.kernel.org/tip/a68e5c94f7d3dd64fef34dd5d97e365cae4bb42a
> Author:     H. Peter Anvin <hpa@linux.intel.com>
> AuthorDate: Fri, 17 Sep 2010 17:06:46 -0700
> Committer:  H. Peter Anvin <hpa@linux.intel.com>
> CommitDate: Fri, 17 Sep 2010 17:10:23 -0700
>
> x86, hotplug: Move WBINVD back outside the play_dead loop
>
> On processors with hyperthreading, when only one thread is offlined
> the other thread can cause a spurious wakeup on the idled thread.  We
> do not want to re-WBINVD when that happens.
>
> Ideally, we should simply skip WBINVD unless we're the last thread on
> a particular core to shut down, but there might be similar issues
> elsewhere in the system.
>
> Thus, revert to previous behavior of only WBINVD outside the loop.
> Partly as a result, remove the mb()'s around it: they are not
> necessary since wbinvd() is a serializing instruction, but they were
> intended to make sure the compiler didn't do any funny loop
> optimizations.
>
> Reported-by: Asit Mallick <asit.k.mallick@intel.com>
> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
> Cc: Arjan van de Ven <arjan@linux.kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Venkatesh Pallipadi <venki@google.com>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.hl>
> LKML-Reference: <tip-ea53069231f9317062910d6e772cca4ce93de8c8@git.kernel.org>
> ---
>  arch/x86/kernel/smpboot.c |   11 +++++------
>  1 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 07bf423..55c80ffb 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1422,9 +1422,9 @@ static inline void mwait_play_dead(void)
>                        (highest_subcstate - 1);
>        }
>
> +       wbinvd();
> +
>        while (1) {
> -               mb();
> -               wbinvd();
>                __monitor(&current_thread_info()->flags, 0, 0);
>                mb();


Just one observation. There are some CPUs with errata that need
clflush before monitor. So, if that CPU wakesup spuriously it may have
problem reentering idle. Not sure whether that will be a problem as
that errata also depended on read happening on the flag. May be its
better to do monitor (0, 0, 0).

Thanks,
Venki

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop
  2010-09-18  0:48       ` Venkatesh Pallipadi
@ 2010-09-20 18:20         ` H. Peter Anvin
  2010-09-20 20:11           ` H. Peter Anvin
  0 siblings, 1 reply; 14+ messages in thread
From: H. Peter Anvin @ 2010-09-20 18:20 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: mingo, linux-kernel, asit.k.mallick, arjan, a.p.zijlstra, lenb,
	tglx, hpa, linux-tip-commits

On 09/17/2010 05:48 PM, Venkatesh Pallipadi wrote:
>>
>> +       wbinvd();
>> +
>>        while (1) {
>> -               mb();
>> -               wbinvd();
>>                __monitor(&current_thread_info()->flags, 0, 0);
>>                mb();
> 
> 
> Just one observation. There are some CPUs with errata that need
> clflush before monitor. So, if that CPU wakesup spuriously it may have
> problem reentering idle. Not sure whether that will be a problem as
> that errata also depended on read happening on the flag. May be its
> better to do monitor (0, 0, 0).
> 

It seems the easy way to deal with that would be to just add clflush
before monitor... it is *probably* redundant, but it should be safe to
do. It means depending on X86_FEATURE_CLFLUSH as well as
X86_FEATURE_MWAIT, but I don't think there is any x86 processor which
has MWAIT and not CLFLUSH, and I highly doubt there ever will be.

Does anyone see any downside?

	-hpa

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop
  2010-09-20 18:20         ` H. Peter Anvin
@ 2010-09-20 20:11           ` H. Peter Anvin
  2010-09-20 22:34             ` Venkatesh Pallipadi
  0 siblings, 1 reply; 14+ messages in thread
From: H. Peter Anvin @ 2010-09-20 20:11 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: mingo, linux-kernel, asit.k.mallick, arjan, a.p.zijlstra, lenb,
	tglx, hpa, linux-tip-commits, Andi Kleen

[-- Attachment #1: Type: text/plain, Size: 1074 bytes --]

On 09/20/2010 11:20 AM, H. Peter Anvin wrote:
> On 09/17/2010 05:48 PM, Venkatesh Pallipadi wrote:
>>>
>>> +       wbinvd();
>>> +
>>>        while (1) {
>>> -               mb();
>>> -               wbinvd();
>>>                __monitor(&current_thread_info()->flags, 0, 0);
>>>                mb();
>>
>>
>> Just one observation. There are some CPUs with errata that need
>> clflush before monitor. So, if that CPU wakesup spuriously it may have
>> problem reentering idle. Not sure whether that will be a problem as
>> that errata also depended on read happening on the flag. May be its
>> better to do monitor (0, 0, 0).
>>
> 
> It seems the easy way to deal with that would be to just add clflush
> before monitor... it is *probably* redundant, but it should be safe to
> do. It means depending on X86_FEATURE_CLFLUSH as well as
> X86_FEATURE_MWAIT, but I don't think there is any x86 processor which
> has MWAIT and not CLFLUSH, and I highly doubt there ever will be.
> 
> Does anyone see any downside?
> 

Patch for review... anyone see any problem with it?

	-hpa


[-- Attachment #2: 0001-x86-hotplug-In-the-MWAIT-case-of-play_dead-CLFLUSH-t.patch --]
[-- Type: text/x-patch, Size: 2265 bytes --]

>From 83c70ace9aad19ba7c41986f3c34a0808fbaf78f Mon Sep 17 00:00:00 2001
From: H. Peter Anvin <hpa@linux.intel.com>
Date: Mon, 20 Sep 2010 13:04:45 -0700
Subject: [PATCH] x86, hotplug: In the MWAIT case of play_dead, CLFLUSH the cache line

When we're using MWAIT for play_dead, explicitly CLFLUSH the cache
line before executing MONITOR.  This is a potential workaround for the
Xeon 7400 erratum AAI65 after having a spurious wakeup and returning
around the loop.  "Potential" here because it is not certain that that
erratum could actually trigger; however, the CLFLUSH should be
harmless.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan van de Ven <arjan@linux.kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Venkatesh Pallipadi <venki@google.com>
---
 arch/x86/kernel/smpboot.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 55c80ffb..fdccfe9 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1394,9 +1394,12 @@ static inline void mwait_play_dead(void)
 	unsigned int highest_cstate = 0;
 	unsigned int highest_subcstate = 0;
 	int i;
+	void *mwait_ptr;
 
 	if (!cpu_has(&current_cpu_data, X86_FEATURE_MWAIT))
 		return;
+	if (!cpu_has(&current_cpu_data, X86_FEATURE_CLFLSH))
+		return;
 	if (current_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
 		return;
 
@@ -1422,10 +1425,25 @@ static inline void mwait_play_dead(void)
 			(highest_subcstate - 1);
 	}
 
+	/*
+	 * This should be a memory location in a cache line which is
+	 * unlikely to be touched by other processors.  The actual
+	 * content is immaterial as it is not actually modified in any way.
+	 */
+	mwait_ptr = &current_thread_info()->flags;
+
 	wbinvd();
 
 	while (1) {
-		__monitor(&current_thread_info()->flags, 0, 0);
+		/*
+		 * The CLFLUSH is a workaround for erratum AAI65 for
+		 * the Xeon 7400 series.  It's not clear it is actually
+		 * needed, but it should be harmless in either case.
+		 * The WBINVD is insufficient due to the spurious-wakeup
+		 * case where we return around the loop.
+		 */
+		clflush(mwait_ptr);
+		__monitor(mwait_ptr, 0, 0);
 		mb();
 		__mwait(eax, 0);
 	}
-- 
1.7.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop
  2010-09-20 20:11           ` H. Peter Anvin
@ 2010-09-20 22:34             ` Venkatesh Pallipadi
  0 siblings, 0 replies; 14+ messages in thread
From: Venkatesh Pallipadi @ 2010-09-20 22:34 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: mingo, linux-kernel, asit.k.mallick, arjan, a.p.zijlstra, lenb,
	tglx, hpa, linux-tip-commits, Andi Kleen

On Mon, Sep 20, 2010 at 1:11 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 09/20/2010 11:20 AM, H. Peter Anvin wrote:
>> On 09/17/2010 05:48 PM, Venkatesh Pallipadi wrote:
>>>>
>>>> +       wbinvd();
>>>> +
>>>>        while (1) {
>>>> -               mb();
>>>> -               wbinvd();
>>>>                __monitor(&current_thread_info()->flags, 0, 0);
>>>>                mb();
>>>
>>>
>>> Just one observation. There are some CPUs with errata that need
>>> clflush before monitor. So, if that CPU wakesup spuriously it may have
>>> problem reentering idle. Not sure whether that will be a problem as
>>> that errata also depended on read happening on the flag. May be its
>>> better to do monitor (0, 0, 0).
>>>
>>
>> It seems the easy way to deal with that would be to just add clflush
>> before monitor... it is *probably* redundant, but it should be safe to
>> do. It means depending on X86_FEATURE_CLFLUSH as well as
>> X86_FEATURE_MWAIT, but I don't think there is any x86 processor which
>> has MWAIT and not CLFLUSH, and I highly doubt there ever will be.
>>
>> Does anyone see any downside?
>>
>
> Patch for review... anyone see any problem with it?
>

Looks good.

We can also close one of the longest standing bugs in
bugzilla.kernel.org  now :-)
https://bugzilla.kernel.org/show_bug.cgi?id=5471

Acked-by: Venkatesh Pallipadi <venki@google.com>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2010-09-20 22:39 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-05-22 23:19 [patch 0/2] x86: Make offline cpus to go to deepest idle state using mwait venkatesh.pallipadi
2009-05-22 23:19 ` [patch 1/2] x86: Add pm_play_dead funcptr to power-efficiently offline CPUs venkatesh.pallipadi
2009-05-23 10:44   ` Peter Zijlstra
2009-05-23 15:07     ` Pallipadi, Venkatesh
2009-06-22 17:25     ` Pallipadi, Venkatesh
2009-05-22 23:19 ` [patch 2/2] x86: put offline CPUs into deepest mwait cstate_subcstate venkatesh.pallipadi
2009-05-25  0:56   ` Shaohua Li
2009-05-26 21:17     ` Pallipadi, Venkatesh
2010-09-17 23:46   ` [tip:x86/idle] x86, hotplug: Use mwait to offline a processor, fix the legacy case tip-bot for H. Peter Anvin
2010-09-18  0:13     ` [tip:x86/idle] x86, hotplug: Move WBINVD back outside the play_dead loop tip-bot for H. Peter Anvin
2010-09-18  0:48       ` Venkatesh Pallipadi
2010-09-20 18:20         ` H. Peter Anvin
2010-09-20 20:11           ` H. Peter Anvin
2010-09-20 22:34             ` Venkatesh Pallipadi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox