linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH] powerpc, kexec: Fix race in kexec shutdown
@ 2010-05-11  6:28 Michael Neuling
  2010-05-14  3:57 ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
  2010-05-14  3:57 ` [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
  0 siblings, 2 replies; 8+ messages in thread
From: Michael Neuling @ 2010-05-11  6:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, kexec, Anton Blanchard

In kexec_prepare_cpus(), the primary CPU IPIs the secondary CPUs to
kexec_smp_down().  kexec_smp_down() calls kexec_smp_wait() which sets
the paca hw_cpu_id to -1.  The primary does this while leaving IRQs on,
which means the primary can take a timer interrupt which can lead to
the primary IPIing one of the secondary CPUs (say, for a scheduler
re-balance) but since the secondary CPU now has a hw_cpu_id = -1, we
IPI CPU -1... Kaboom!

We are hitting this case regularly on POWER7 machines.  

Also, the secondaries are clearing out any pending IPIs before
guaranteeing that no more will be received.  

This changes kexec_prepare_cpus() so that we turn off IRQs in the
primary CPU much earlier.  It adds a paca flag to say that the
secondaries have entered the kexec_smp_down() IPI and turned off IRQs,
rather than overloading hw_cpu_id with -1.

It also ensures that all CPUs have their IRQs off before we clear out
any pending IPI requests (in kexec_cpu_down()) to ensure there are no
trailing IPIs left unacknowledged.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---

 arch/powerpc/include/asm/paca.h        |    1 +
 arch/powerpc/kernel/machine_kexec_64.c |   28 ++++++++++++++++++++--------
 arch/powerpc/kernel/misc_64.S          |    3 ---
 3 files changed, 21 insertions(+), 11 deletions(-)

Index: linux-2.6-ozlabs/arch/powerpc/include/asm/paca.h
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/paca.h
+++ linux-2.6-ozlabs/arch/powerpc/include/asm/paca.h
@@ -82,6 +82,7 @@ struct paca_struct {
 	s16 hw_cpu_id;			/* Physical processor number */
 	u8 cpu_start;			/* At startup, processor spins until */
 					/* this becomes non-zero. */
+	u8 kexec_irqs_off;		/* set when kexec down has irqs off */
 #ifdef CONFIG_PPC_STD_MMU_64
 	struct slb_shadow *slb_shadow_ptr;
 
Index: linux-2.6-ozlabs/arch/powerpc/kernel/machine_kexec_64.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/machine_kexec_64.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/machine_kexec_64.c
@@ -155,16 +155,23 @@ void kexec_copy_flush(struct kimage *ima
 
 #ifdef CONFIG_SMP
 
-/* FIXME: we should schedule this function to be called on all cpus based
- * on calling the interrupts, but we would like to call it off irq level
- * so that the interrupt controller is clean.
- */
+static int kexec_all_irq_disabled = 0;
+
 static void kexec_smp_down(void *arg)
 {
+	local_irq_disable();
+	mb(); /* make sure our irqs are disabled before we say they are */
+	get_paca()->kexec_irqs_off = 1;
+	while(kexec_all_irq_disabled == 0)
+		cpu_relax();
+	mb(); /* make sure all irqs are disabled before this */
+	/*
+	 * Now every CPU has IRQs off, we can clear out any pending
+	 * IPIs and be sure that no more will come in after this.
+	 */
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 1);
 
-	local_irq_disable();
 	kexec_smp_wait();
 	/* NOTREACHED */
 }
@@ -174,14 +181,17 @@ static void kexec_prepare_cpus(void)
 	int my_cpu, i, notified=-1;
 
 	smp_call_function(kexec_smp_down, NULL, /* wait */0);
+	local_irq_disable();
+	mb(); /* make sure IRQs are disabled before we say they are */
+	get_paca()->kexec_irqs_off = 1;
 	my_cpu = get_cpu();
 
-	/* check the others cpus are now down (via paca hw cpu id == -1) */
+	/* check the others cpus are now down (via paca kexec_irqs_off == 1) */
 	for (i=0; i < NR_CPUS; i++) {
 		if (i == my_cpu)
 			continue;
 
-		while (paca[i].hw_cpu_id != -1) {
+		while (paca[i].kexec_irqs_off != 1) {
 			barrier();
 			if (!cpu_possible(i)) {
 				printk("kexec: cpu %d hw_cpu_id %d is not"
@@ -207,6 +217,9 @@ static void kexec_prepare_cpus(void)
 			}
 		}
 	}
+	mb();
+	/* we are sure every CPU has IRQs off at this point */
+	kexec_all_irq_disabled = 1;
 
 	/* after we tell the others to go down */
 	if (ppc_md.kexec_cpu_down)
@@ -214,7 +227,6 @@ static void kexec_prepare_cpus(void)
 
 	put_cpu();
 
-	local_irq_disable();
 }
 
 #else /* ! SMP */
Index: linux-2.6-ozlabs/arch/powerpc/kernel/misc_64.S
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/misc_64.S
+++ linux-2.6-ozlabs/arch/powerpc/kernel/misc_64.S
@@ -494,14 +494,11 @@ kexec_flag:
  * note: this is a terminal routine, it does not save lr
  *
  * get phys id from paca
- * set paca id to -1 to say we got here
  * switch to real mode
  * join other cpus in kexec_wait(phys_id)
  */
 _GLOBAL(kexec_smp_wait)
 	lhz	r3,PACAHWCPUID(r13)
-	li	r4,-1
-	sth	r4,PACAHWCPUID(r13)	/* let others know we left */
 	bl	real_mode
 	b	.kexec_wait
 

^ permalink raw reply	[flat|nested] 8+ messages in thread
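
The two-phase handshake described in the patch above (each CPU publishes
"my IRQs are off", the primary waits for all of them, then one global
flag releases everyone to clear pending IPIs) can be modelled outside the
kernel.  The following is a minimal single-threaded sketch, not kernel
code: NCPUS, the phase enum, reset_model() and step_cpu() are hypothetical
names invented for illustration, and each call to step_cpu() stands in for
one CPU making a little progress.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4	/* hypothetical CPU count; CPU 0 plays the primary */

enum phase { RUNNING, IRQS_OFF, IPIS_CLEARED };

static enum phase cpu_phase[NCPUS];	/* stands in for paca[i].kexec_irqs_off */
static bool all_irqs_off;		/* stands in for kexec_all_irq_disabled */

static void reset_model(void)
{
	for (int i = 0; i < NCPUS; i++)
		cpu_phase[i] = RUNNING;
	all_irqs_off = false;
}

/* True once every CPU has at least turned its IRQs off. */
static bool every_cpu_irqs_off(void)
{
	for (int i = 0; i < NCPUS; i++)
		if (cpu_phase[i] == RUNNING)
			return false;
	return true;
}

/*
 * Advance one CPU by a single step.  Returns false only if the CPU was
 * about to clear pending IPIs while another CPU could still send one,
 * which is the situation the handshake is meant to rule out.
 */
static bool step_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS);

	switch (cpu_phase[cpu]) {
	case RUNNING:
		cpu_phase[cpu] = IRQS_OFF;	/* local_irq_disable() + set flag */
		return true;
	case IRQS_OFF:
		if (cpu == 0 && !all_irqs_off) {
			/* the primary polls the flags, then releases the others */
			if (every_cpu_irqs_off())
				all_irqs_off = true;
			return true;
		}
		if (!all_irqs_off)
			return true;		/* secondary keeps spinning */
		if (!every_cpu_irqs_off())
			return false;		/* invariant would be broken */
		cpu_phase[cpu] = IPIS_CLEARED;	/* kexec_cpu_down() */
		return true;
	default:
		return true;			/* nothing left to do */
	}
}
```

Calling reset_model() and then step_cpu() in any interleaving (for
example CPUs 3, 3, 0, 1, 0, 2, 1, 0, ...) should never return false in
this model, because all_irqs_off is only set after every per-CPU flag has
been observed.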

* [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown
  2010-05-11  6:28 [PATCH] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
  2010-05-14  3:57 ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
@ 2010-05-14  3:57 ` Michael Neuling
  2010-05-14  5:40   ` Michael Neuling
  2010-05-14  5:40   ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
  1 sibling, 2 replies; 8+ messages in thread
From: Michael Neuling @ 2010-05-14  3:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, kexec, Anton Blanchard, jlarrew

In kexec_prepare_cpus(), the primary CPU IPIs the secondary CPUs to
kexec_smp_down().  kexec_smp_down() calls kexec_smp_wait() which sets
the paca hw_cpu_id to -1.  The primary does this while leaving IRQs on,
which means the primary can take a timer interrupt which can lead to
it IPIing one of the secondary CPUs (say, for a scheduler re-balance)
but since the secondary CPU now has a hw_cpu_id = -1, we IPI CPU
-1... Kaboom!

We are hitting this case regularly on POWER7 machines.  

There is also a second race, where the primary will tear down the MMU
mappings before knowing the secondaries have entered real mode.  

Also, the secondaries are clearing out any pending IPIs before
guaranteeing that no more will be received.  

This changes kexec_prepare_cpus() so that we turn off IRQs in the
primary CPU much earlier.  It adds a paca flag to say that the
secondaries have entered the kexec_smp_down() IPI and turned off IRQs,
rather than overloading hw_cpu_id with -1.  This new paca flag is
also used to indicate when the secondaries have entered real mode.

It also ensures that all CPUs have their IRQs off before we clear out
any pending IPI requests (in kexec_cpu_down()) to ensure there are no
trailing IPIs left unacknowledged.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---

 arch/powerpc/include/asm/kexec.h       |    4 ++
 arch/powerpc/include/asm/paca.h        |    1 
 arch/powerpc/kernel/asm-offsets.c      |    1 
 arch/powerpc/kernel/machine_kexec_64.c |   48 +++++++++++++++++++++++----------
 arch/powerpc/kernel/misc_64.S          |    8 +++--
 arch/powerpc/kernel/paca.c             |    2 +
 6 files changed, 47 insertions(+), 17 deletions(-)

Index: linux-2.6-ozlabs/arch/powerpc/include/asm/kexec.h
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/kexec.h
+++ linux-2.6-ozlabs/arch/powerpc/include/asm/kexec.h
@@ -31,6 +31,10 @@
 #define KEXEC_ARCH KEXEC_ARCH_PPC
 #endif
 
+#define KEXEC_STATE_NONE 0
+#define KEXEC_STATE_IRQS_OFF 1
+#define KEXEC_STATE_REAL_MODE 2
+
 #ifndef __ASSEMBLY__
 #include <linux/cpumask.h>
 #include <asm/reg.h>
Index: linux-2.6-ozlabs/arch/powerpc/include/asm/paca.h
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/paca.h
+++ linux-2.6-ozlabs/arch/powerpc/include/asm/paca.h
@@ -82,6 +82,7 @@ struct paca_struct {
 	s16 hw_cpu_id;			/* Physical processor number */
 	u8 cpu_start;			/* At startup, processor spins until */
 					/* this becomes non-zero. */
+	u8 kexec_state;		/* set when kexec down has irqs off */
 #ifdef CONFIG_PPC_STD_MMU_64
 	struct slb_shadow *slb_shadow_ptr;
 
Index: linux-2.6-ozlabs/arch/powerpc/kernel/asm-offsets.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/asm-offsets.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/asm-offsets.c
@@ -184,6 +184,7 @@ int main(void)
 #endif /* CONFIG_PPC_STD_MMU_64 */
 	DEFINE(PACAEMERGSP, offsetof(struct paca_struct, emergency_sp));
 	DEFINE(PACAHWCPUID, offsetof(struct paca_struct, hw_cpu_id));
+	DEFINE(PACAKEXECSTATE, offsetof(struct paca_struct, kexec_state));
 	DEFINE(PACA_STARTPURR, offsetof(struct paca_struct, startpurr));
 	DEFINE(PACA_STARTSPURR, offsetof(struct paca_struct, startspurr));
 	DEFINE(PACA_USER_TIME, offsetof(struct paca_struct, user_time));
Index: linux-2.6-ozlabs/arch/powerpc/kernel/machine_kexec_64.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/machine_kexec_64.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/machine_kexec_64.c
@@ -155,33 +155,38 @@ void kexec_copy_flush(struct kimage *ima
 
 #ifdef CONFIG_SMP
 
-/* FIXME: we should schedule this function to be called on all cpus based
- * on calling the interrupts, but we would like to call it off irq level
- * so that the interrupt controller is clean.
- */
+static int kexec_all_irq_disabled = 0;
+
 static void kexec_smp_down(void *arg)
 {
+	local_irq_disable();
+	mb(); /* make sure our irqs are disabled before we say they are */
+	get_paca()->kexec_state = KEXEC_STATE_IRQS_OFF;
+	while(kexec_all_irq_disabled == 0)
+		cpu_relax();
+	mb(); /* make sure all irqs are disabled before this */
+	/*
+	 * Now every CPU has IRQs off, we can clear out any pending
+	 * IPIs and be sure that no more will come in after this.
+	 */
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 1);
 
-	local_irq_disable();
 	kexec_smp_wait();
 	/* NOTREACHED */
 }
 
-static void kexec_prepare_cpus(void)
+static void kexec_prepare_cpus_wait(int wait_state)
 {
 	int my_cpu, i, notified=-1;
 
-	smp_call_function(kexec_smp_down, NULL, /* wait */0);
 	my_cpu = get_cpu();
-
-	/* check the others cpus are now down (via paca hw cpu id == -1) */
+	/* check the others cpus are now down (via paca kexec_irqs_off == 1) */
 	for (i=0; i < NR_CPUS; i++) {
 		if (i == my_cpu)
 			continue;
 
-		while (paca[i].hw_cpu_id != -1) {
+		while (paca[i].kexec_state < wait_state) {
 			barrier();
 			if (!cpu_possible(i)) {
 				printk("kexec: cpu %d hw_cpu_id %d is not"
@@ -201,20 +206,35 @@ static void kexec_prepare_cpus(void)
 			}
 			if (i != notified) {
 				printk( "kexec: waiting for cpu %d (physical"
-						" %d) to go down\n",
-						i, paca[i].hw_cpu_id);
+						" %d) to enter %i state\n",
+					i, paca[i].hw_cpu_id, wait_state);
 				notified = i;
 			}
 		}
 	}
+	mb();
+}
+
+static void kexec_prepare_cpus(void)
+{
+
+	smp_call_function(kexec_smp_down, NULL, /* wait */0);
+	local_irq_disable();
+	mb(); /* make sure IRQs are disabled before we say they are */
+	get_paca()->kexec_state = KEXEC_STATE_IRQS_OFF;
+
+	kexec_prepare_cpus_wait(KEXEC_STATE_IRQS_OFF);
+	/* we are sure every CPU has IRQs off at this point */
+	kexec_all_irq_disabled = 1;
 
 	/* after we tell the others to go down */
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 0);
 
-	put_cpu();
+	/* Before removing MMU mappings make sure all CPUs have entered real mode */
+	kexec_prepare_cpus_wait(KEXEC_STATE_REAL_MODE);
 
-	local_irq_disable();
+	put_cpu();
 }
 
 #else /* ! SMP */
Index: linux-2.6-ozlabs/arch/powerpc/kernel/misc_64.S
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/misc_64.S
+++ linux-2.6-ozlabs/arch/powerpc/kernel/misc_64.S
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/cputable.h>
 #include <asm/thread_info.h>
+#include <asm/kexec.h>
 
 	.text
 
@@ -471,6 +472,10 @@ _GLOBAL(kexec_wait)
 1:	mflr	r5
 	addi	r5,r5,kexec_flag-1b
 
+	li	r4,KEXEC_STATE_REAL_MODE
+	stb	r4,PACAKEXECSTATE(r13)
+	SYNC
+
 99:	HMT_LOW
 #ifdef CONFIG_KEXEC		/* use no memory without kexec */
 	lwz	r4,0(r5)
@@ -494,14 +499,11 @@ kexec_flag:
  * note: this is a terminal routine, it does not save lr
  *
  * get phys id from paca
- * set paca id to -1 to say we got here
  * switch to real mode
  * join other cpus in kexec_wait(phys_id)
  */
 _GLOBAL(kexec_smp_wait)
 	lhz	r3,PACAHWCPUID(r13)
-	li	r4,-1
-	sth	r4,PACAHWCPUID(r13)	/* let others know we left */
 	bl	real_mode
 	b	.kexec_wait
 
Index: linux-2.6-ozlabs/arch/powerpc/kernel/paca.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/paca.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/paca.c
@@ -18,6 +18,7 @@
 #include <asm/pgtable.h>
 #include <asm/iseries/lpar_map.h>
 #include <asm/iseries/hv_types.h>
+#include <asm/kexec.h>
 
 /* This symbol is provided by the linker - let it fill in the paca
  * field correctly */
@@ -97,6 +98,7 @@ void __init initialise_paca(struct paca_
 	new_paca->kernelbase = (unsigned long) _stext;
 	new_paca->kernel_msr = MSR_KERNEL;
 	new_paca->hw_cpu_id = 0xffff;
+	new_paca->kexec_state = KEXEC_STATE_NONE;
 	new_paca->__current = &init_task;
 #ifdef CONFIG_PPC_STD_MMU_64
 	new_paca->slb_shadow_ptr = &slb_shadow[cpu];

^ permalink raw reply	[flat|nested] 8+ messages in thread
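
The patch above waits with "kexec_state < wait_state" rather than testing
for equality, so the KEXEC_STATE_* values act as an ordered progress
counter: a CPU that has already reached real mode also satisfies a wait
for IRQS_OFF.  A minimal user-space sketch of that comparison follows.
The KEXEC_STATE_* values are copied from the patch; first_cpu_below() and
its parameters are hypothetical and exist only for illustration.

```c
#include <assert.h>

#define KEXEC_STATE_NONE      0
#define KEXEC_STATE_IRQS_OFF  1
#define KEXEC_STATE_REAL_MODE 2

/*
 * Return the first CPU (other than `self`) whose state is still below
 * `wait_state`, or -1 when every other CPU has reached at least that
 * state.  `state` plays the role of the per-CPU paca kexec_state bytes.
 */
static int first_cpu_below(const unsigned char *state, int ncpus,
			   int self, int wait_state)
{
	assert(ncpus > 0 && self >= 0 && self < ncpus);

	for (int i = 0; i < ncpus; i++) {
		if (i == self)
			continue;
		if (state[i] < wait_state)
			return i;
	}
	return -1;
}
```

With states {1, 2, 1, 2} and self = 0, a wait for KEXEC_STATE_IRQS_OFF is
already satisfied, while a wait for KEXEC_STATE_REAL_MODE is still held
up by CPU 2.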

* [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown
  2010-05-11  6:28 [PATCH] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
@ 2010-05-14  3:57 ` Michael Neuling
  2010-05-14  3:57 ` [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Neuling @ 2010-05-14  3:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, kexec, Anton Blanchard, jlarrew

When we are crashing, the crashing/primary CPU IPIs the secondaries to
turn off IRQs, go into real mode and wait in kexec_wait.  While this
is happening, the primary tears down all the MMU maps.  Unfortunately
the primary doesn't check to make sure the secondaries have entered
real mode before doing this.

On PHYP machines, the secondaries can take a long time shutting down
the IRQ controller as RTAS calls are needed.  These RTAS calls need to
be serialised which results in the secondaries contending in
lock_rtas() and hence taking a long time to shut down.

We've hit this on large POWER7 machines, where some secondaries are
still waiting in lock_rtas(), when the primary tears down the HPTEs.

This patch makes sure all secondaries are in real mode before the
primary tears down the MMU.  It uses the new kexec_state entry in the
paca.  It times out if the secondaries don't reach real mode within
10 seconds.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---

 arch/powerpc/kernel/crash.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Index: linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/crash.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
@@ -162,6 +162,33 @@ static void crash_kexec_prepare_cpus(int
 	/* Leave the IPI callback set */
 }
 
+/* wait for all the CPUs to hit real mode but timeout if they don't come in */
+static void crash_kexec_wait_realmode(int cpu)
+{
+	unsigned int msecs;
+	int i;
+
+	/* check the others cpus are now down (via paca kexec_irqs_off == 1) */
+	msecs = 10000;
+	for (i=0; i < NR_CPUS && msecs > 0; i++) {
+		if (i == cpu)
+			continue;
+
+		while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE) {
+			barrier();
+			if (!cpu_possible(i)) {
+				break;
+			}
+			if (!cpu_online(i)) {
+				break;
+			}
+			msecs--;
+			mdelay(1);
+		}
+	}
+	mb();
+}
+
 /*
  * This function will be called by secondary cpus or by kexec cpu
  * if soft-reset is activated to stop some CPUs.
@@ -419,6 +446,7 @@ void default_machine_crash_shutdown(stru
 	crash_kexec_prepare_cpus(crashing_cpu);
 	cpu_set(crashing_cpu, cpus_in_crash);
 	crash_kexec_stop_spus();
+	crash_kexec_wait_realmode(crashing_cpu);
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(1, 0);
 }

^ permalink raw reply	[flat|nested] 8+ messages in thread
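
The wait in the patch above shares a single 10000 ms budget across all
CPUs.  As posted, the inner while loop does not itself test msecs, so the
sketch below, which checks the budget inside the loop as well, is one
possible reading of the intent and not the code from the patch.  It is a
user-space model: wait_realmode(), tick_fn and the two example hooks are
hypothetical names, and the hook stands in for mdelay(1).

```c
#include <assert.h>

#define KEXEC_STATE_REAL_MODE 2

/* Hypothetical hook standing in for mdelay(1); it may advance CPU states. */
typedef void (*tick_fn)(unsigned char *state, int ncpus);

/*
 * Wait until every CPU except `self` reports real mode, spending at most
 * `budget_ms` ticks in total.  Returns the unused budget; 0 means the
 * budget ran out.
 */
static int wait_realmode(unsigned char *state, int ncpus, int self,
			 int budget_ms, tick_fn tick)
{
	assert(ncpus > 0 && budget_ms >= 0);

	for (int i = 0; i < ncpus && budget_ms > 0; i++) {
		if (i == self)
			continue;
		while (state[i] < KEXEC_STATE_REAL_MODE && budget_ms > 0) {
			budget_ms--;
			tick(state, ncpus);
		}
	}
	return budget_ms;
}

/* Example hook: nothing ever changes, like a CPU stuck in lock_rtas(). */
static void tick_stuck(unsigned char *state, int ncpus)
{
	(void)state;
	(void)ncpus;
}

/* Example hook: one more lagging CPU reaches real mode on every tick. */
static void tick_one_arrives(unsigned char *state, int ncpus)
{
	for (int i = 0; i < ncpus; i++) {
		if (state[i] < KEXEC_STATE_REAL_MODE) {
			state[i] = KEXEC_STATE_REAL_MODE;
			return;
		}
	}
}
```

With tick_stuck the call gives up once the budget is spent instead of
spinning indefinitely; with tick_one_arrives it returns early with most
of the budget left.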

* [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown
  2010-05-14  3:57 ` [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
@ 2010-05-14  5:40   ` Michael Neuling
  2010-05-14  5:40   ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
  1 sibling, 0 replies; 8+ messages in thread
From: Michael Neuling @ 2010-05-14  5:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, kexec, Anton Blanchard, jlarrew

In kexec_prepare_cpus(), the primary CPU IPIs the secondary CPUs to
kexec_smp_down().  kexec_smp_down() calls kexec_smp_wait() which sets
the paca hw_cpu_id to -1.  The primary does this while leaving IRQs on,
which means the primary can take a timer interrupt which can lead to
it IPIing one of the secondary CPUs (say, for a scheduler re-balance)
but since the secondary CPU now has a hw_cpu_id = -1, we IPI CPU
-1... Kaboom!

We are hitting this case regularly on POWER7 machines.  

There is also a second race, where the primary will tear down the MMU
mappings before knowing the secondaries have entered real mode.  

Also, the secondaries are clearing out any pending IPIs before
guaranteeing that no more will be received.  

This changes kexec_prepare_cpus() so that we turn off IRQs in the
primary CPU much earlier.  It adds a paca flag to say that the
secondaries have entered the kexec_smp_down() IPI and turned off IRQs,
rather than overloading hw_cpu_id with -1.  This new paca flag is
also used to indicate when the secondaries have entered real mode.

It also ensures that all CPUs have their IRQs off before we clear out
any pending IPI requests (in kexec_cpu_down()) to ensure there are no
trailing IPIs left unacknowledged.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---
Oops, missed quilt refresh in the last version...
---

 arch/powerpc/include/asm/kexec.h       |    4 ++
 arch/powerpc/include/asm/paca.h        |    1 
 arch/powerpc/kernel/asm-offsets.c      |    1 
 arch/powerpc/kernel/machine_kexec_64.c |   48 +++++++++++++++++++++++----------
 arch/powerpc/kernel/misc_64.S          |    8 +++--
 arch/powerpc/kernel/paca.c             |    2 +
 6 files changed, 47 insertions(+), 17 deletions(-)

Index: linux-2.6-ozlabs/arch/powerpc/include/asm/kexec.h
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/kexec.h
+++ linux-2.6-ozlabs/arch/powerpc/include/asm/kexec.h
@@ -31,6 +31,10 @@
 #define KEXEC_ARCH KEXEC_ARCH_PPC
 #endif
 
+#define KEXEC_STATE_NONE 0
+#define KEXEC_STATE_IRQS_OFF 1
+#define KEXEC_STATE_REAL_MODE 2
+
 #ifndef __ASSEMBLY__
 #include <linux/cpumask.h>
 #include <asm/reg.h>
Index: linux-2.6-ozlabs/arch/powerpc/include/asm/paca.h
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/include/asm/paca.h
+++ linux-2.6-ozlabs/arch/powerpc/include/asm/paca.h
@@ -82,6 +82,7 @@ struct paca_struct {
 	s16 hw_cpu_id;			/* Physical processor number */
 	u8 cpu_start;			/* At startup, processor spins until */
 					/* this becomes non-zero. */
+	u8 kexec_state;		/* set when kexec down has irqs off */
 #ifdef CONFIG_PPC_STD_MMU_64
 	struct slb_shadow *slb_shadow_ptr;
 
Index: linux-2.6-ozlabs/arch/powerpc/kernel/asm-offsets.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/asm-offsets.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/asm-offsets.c
@@ -184,6 +184,7 @@ int main(void)
 #endif /* CONFIG_PPC_STD_MMU_64 */
 	DEFINE(PACAEMERGSP, offsetof(struct paca_struct, emergency_sp));
 	DEFINE(PACAHWCPUID, offsetof(struct paca_struct, hw_cpu_id));
+	DEFINE(PACAKEXECSTATE, offsetof(struct paca_struct, kexec_state));
 	DEFINE(PACA_STARTPURR, offsetof(struct paca_struct, startpurr));
 	DEFINE(PACA_STARTSPURR, offsetof(struct paca_struct, startspurr));
 	DEFINE(PACA_USER_TIME, offsetof(struct paca_struct, user_time));
Index: linux-2.6-ozlabs/arch/powerpc/kernel/machine_kexec_64.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/machine_kexec_64.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/machine_kexec_64.c
@@ -155,33 +155,38 @@ void kexec_copy_flush(struct kimage *ima
 
 #ifdef CONFIG_SMP
 
-/* FIXME: we should schedule this function to be called on all cpus based
- * on calling the interrupts, but we would like to call it off irq level
- * so that the interrupt controller is clean.
- */
+static int kexec_all_irq_disabled = 0;
+
 static void kexec_smp_down(void *arg)
 {
+	local_irq_disable();
+	mb(); /* make sure our irqs are disabled before we say they are */
+	get_paca()->kexec_state = KEXEC_STATE_IRQS_OFF;
+	while(kexec_all_irq_disabled == 0)
+		cpu_relax();
+	mb(); /* make sure all irqs are disabled before this */
+	/*
+	 * Now every CPU has IRQs off, we can clear out any pending
+	 * IPIs and be sure that no more will come in after this.
+	 */
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 1);
 
-	local_irq_disable();
 	kexec_smp_wait();
 	/* NOTREACHED */
 }
 
-static void kexec_prepare_cpus(void)
+static void kexec_prepare_cpus_wait(int wait_state)
 {
 	int my_cpu, i, notified=-1;
 
-	smp_call_function(kexec_smp_down, NULL, /* wait */0);
 	my_cpu = get_cpu();
-
-	/* check the others cpus are now down (via paca hw cpu id == -1) */
+	/* Make sure each CPU has at least made it to the state we need */
 	for (i=0; i < NR_CPUS; i++) {
 		if (i == my_cpu)
 			continue;
 
-		while (paca[i].hw_cpu_id != -1) {
+		while (paca[i].kexec_state < wait_state) {
 			barrier();
 			if (!cpu_possible(i)) {
 				printk("kexec: cpu %d hw_cpu_id %d is not"
@@ -201,20 +206,35 @@ static void kexec_prepare_cpus(void)
 			}
 			if (i != notified) {
 				printk( "kexec: waiting for cpu %d (physical"
-						" %d) to go down\n",
-						i, paca[i].hw_cpu_id);
+						" %d) to enter %i state\n",
+					i, paca[i].hw_cpu_id, wait_state);
 				notified = i;
 			}
 		}
 	}
+	mb();
+}
+
+static void kexec_prepare_cpus(void)
+{
+
+	smp_call_function(kexec_smp_down, NULL, /* wait */0);
+	local_irq_disable();
+	mb(); /* make sure IRQs are disabled before we say they are */
+	get_paca()->kexec_state = KEXEC_STATE_IRQS_OFF;
+
+	kexec_prepare_cpus_wait(KEXEC_STATE_IRQS_OFF);
+	/* we are sure every CPU has IRQs off at this point */
+	kexec_all_irq_disabled = 1;
 
 	/* after we tell the others to go down */
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(0, 0);
 
-	put_cpu();
+	/* Before removing MMU mappings make sure all CPUs have entered real mode */
+	kexec_prepare_cpus_wait(KEXEC_STATE_REAL_MODE);
 
-	local_irq_disable();
+	put_cpu();
 }
 
 #else /* ! SMP */
Index: linux-2.6-ozlabs/arch/powerpc/kernel/misc_64.S
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/misc_64.S
+++ linux-2.6-ozlabs/arch/powerpc/kernel/misc_64.S
@@ -24,6 +24,7 @@
 #include <asm/asm-offsets.h>
 #include <asm/cputable.h>
 #include <asm/thread_info.h>
+#include <asm/kexec.h>
 
 	.text
 
@@ -471,6 +472,10 @@ _GLOBAL(kexec_wait)
 1:	mflr	r5
 	addi	r5,r5,kexec_flag-1b
 
+	li	r4,KEXEC_STATE_REAL_MODE
+	stb	r4,PACAKEXECSTATE(r13)
+	SYNC
+
 99:	HMT_LOW
 #ifdef CONFIG_KEXEC		/* use no memory without kexec */
 	lwz	r4,0(r5)
@@ -494,14 +499,11 @@ kexec_flag:
  * note: this is a terminal routine, it does not save lr
  *
  * get phys id from paca
- * set paca id to -1 to say we got here
  * switch to real mode
  * join other cpus in kexec_wait(phys_id)
  */
 _GLOBAL(kexec_smp_wait)
 	lhz	r3,PACAHWCPUID(r13)
-	li	r4,-1
-	sth	r4,PACAHWCPUID(r13)	/* let others know we left */
 	bl	real_mode
 	b	.kexec_wait
 
Index: linux-2.6-ozlabs/arch/powerpc/kernel/paca.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/paca.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/paca.c
@@ -18,6 +18,7 @@
 #include <asm/pgtable.h>
 #include <asm/iseries/lpar_map.h>
 #include <asm/iseries/hv_types.h>
+#include <asm/kexec.h>
 
 /* This symbol is provided by the linker - let it fill in the paca
  * field correctly */
@@ -97,6 +98,7 @@ void __init initialise_paca(struct paca_
 	new_paca->kernelbase = (unsigned long) _stext;
 	new_paca->kernel_msr = MSR_KERNEL;
 	new_paca->hw_cpu_id = 0xffff;
+	new_paca->kexec_state = KEXEC_STATE_NONE;
 	new_paca->__current = &init_task;
 #ifdef CONFIG_PPC_STD_MMU_64
 	new_paca->slb_shadow_ptr = &slb_shadow[cpu];

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown
  2010-05-14  3:57 ` [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
  2010-05-14  5:40   ` Michael Neuling
@ 2010-05-14  5:40   ` Michael Neuling
  2010-05-24 19:23     ` Kumar Gala
  1 sibling, 1 reply; 8+ messages in thread
From: Michael Neuling @ 2010-05-14  5:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, kexec, Anton Blanchard, jlarrew

When we are crashing, the crashing/primary CPU IPIs the secondaries to
turn off IRQs, go into real mode and wait in kexec_wait.  While this
is happening, the primary tears down all the MMU maps.  Unfortunately
the primary doesn't check to make sure the secondaries have entered
real mode before doing this.

On PHYP machines, the secondaries can take a long time shutting down
the IRQ controller as RTAS calls are needed.  These RTAS calls need to
be serialised which results in the secondaries contending in
lock_rtas() and hence taking a long time to shut down.

We've hit this on large POWER7 machines, where some secondaries are
still waiting in lock_rtas(), when the primary tears down the HPTEs.

This patch makes sure all secondaries are in real mode before the
primary tears down the MMU.  It uses the new kexec_state entry in the
paca.  It times out if the secondaries don't reach real mode within
10 seconds.

Signed-off-by: Michael Neuling <mikey@neuling.org>
---

 arch/powerpc/kernel/crash.c |   27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Index: linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
===================================================================
--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/crash.c
+++ linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
@@ -162,6 +162,32 @@ static void crash_kexec_prepare_cpus(int
 	/* Leave the IPI callback set */
 }
 
+/* wait for all the CPUs to hit real mode but timeout if they don't come in */
+static void crash_kexec_wait_realmode(int cpu)
+{
+	unsigned int msecs;
+	int i;
+
+	msecs = 10000;
+	for (i=0; i < NR_CPUS && msecs > 0; i++) {
+		if (i == cpu)
+			continue;
+
+		while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE) {
+			barrier();
+			if (!cpu_possible(i)) {
+				break;
+			}
+			if (!cpu_online(i)) {
+				break;
+			}
+			msecs--;
+			mdelay(1);
+		}
+	}
+	mb();
+}
+
 /*
  * This function will be called by secondary cpus or by kexec cpu
  * if soft-reset is activated to stop some CPUs.
@@ -412,6 +438,7 @@ void default_machine_crash_shutdown(stru
 	crash_kexec_prepare_cpus(crashing_cpu);
 	cpu_set(crashing_cpu, cpus_in_crash);
 	crash_kexec_stop_spus();
+	crash_kexec_wait_realmode(crashing_cpu);
 	if (ppc_md.kexec_cpu_down)
 		ppc_md.kexec_cpu_down(1, 0);
 }

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown
  2010-05-14  5:40   ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
@ 2010-05-24 19:23     ` Kumar Gala
  2010-05-24 19:29       ` Kumar Gala
  0 siblings, 1 reply; 8+ messages in thread
From: Kumar Gala @ 2010-05-24 19:23 UTC (permalink / raw)
  To: Michael Neuling; +Cc: jlarrew, kexec, Anton Blanchard, linuxppc-dev


On May 14, 2010, at 12:40 AM, Michael Neuling wrote:

> When we are crashing, the crashing/primary CPU IPIs the secondaries to
> turn off IRQs, go into real mode and wait in kexec_wait.  While this
> is happening, the primary tears down all the MMU maps.  Unfortunately
> the primary doesn't check to make sure the secondaries have entered
> real mode before doing this.
>
> On PHYP machines, the secondaries can take a long time shutting down
> the IRQ controller as RTAS calls are needed.  These RTAS calls need to
> be serialised which results in the secondaries contending in
> lock_rtas() and hence taking a long time to shut down.
>
> We've hit this on large POWER7 machines, where some secondaries are
> still waiting in lock_rtas(), when the primary tears down the HPTEs.
>
> This patch makes sure all secondaries are in real mode before the
> primary tears down the MMU.  It uses the new kexec_state entry in the
> paca.  It times out if the secondaries don't reach real mode within
> 10 seconds.
>
> Signed-off-by: Michael Neuling <mikey@neuling.org>
> ---
>
> arch/powerpc/kernel/crash.c |   27 +++++++++++++++++++++++++++
> 1 file changed, 27 insertions(+)
>
> Index: linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
> ===================================================================
> --- linux-2.6-ozlabs.orig/arch/powerpc/kernel/crash.c
> +++ linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
> @@ -162,6 +162,32 @@ static void crash_kexec_prepare_cpus(int
> 	/* Leave the IPI callback set */
> }
>
> +/* wait for all the CPUs to hit real mode but timeout if they don't come in */
> +static void crash_kexec_wait_realmode(int cpu)
> +{
> +	unsigned int msecs;
> +	int i;
> +
> +	msecs = 10000;
> +	for (i=0; i < NR_CPUS && msecs > 0; i++) {
> +		if (i == cpu)
> +			continue;
> +
> +		while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE) {
> +			barrier();
> +			if (!cpu_possible(i)) {
> +				break;
> +			}
> +			if (!cpu_online(i)) {
> +				break;
> +			}
> +			msecs--;
> +			mdelay(1);
> +		}
> +	}
> +	mb();
> +}
> +
> /*
>  * This function will be called by secondary cpus or by kexec cpu
>  * if soft-reset is activated to stop some CPUs.
> @@ -412,6 +438,7 @@ void default_machine_crash_shutdown(stru
> 	crash_kexec_prepare_cpus(crashing_cpu);
> 	cpu_set(crashing_cpu, cpus_in_crash);
> 	crash_kexec_stop_spus();

should this be

#ifdef CONFIG_PPC_STD_MMU

> +	crash_kexec_wait_realmode(crashing_cpu);

#endif

- k

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown
  2010-05-24 19:23     ` Kumar Gala
@ 2010-05-24 19:29       ` Kumar Gala
  2010-05-24 23:53         ` Michael Neuling
  0 siblings, 1 reply; 8+ messages in thread
From: Kumar Gala @ 2010-05-24 19:29 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev, Michael Neuling, kexec, jlarrew, Anton Blanchard


On May 24, 2010, at 2:23 PM, Kumar Gala wrote:

>
> On May 14, 2010, at 12:40 AM, Michael Neuling wrote:
>
>> When we are crashing, the crashing/primary CPU IPIs the secondaries to
>> turn off IRQs, go into real mode and wait in kexec_wait.  While this
>> is happening, the primary tears down all the MMU maps.  Unfortunately
>> the primary doesn't check to make sure the secondaries have entered
>> real mode before doing this.
>>
>> On PHYP machines, the secondaries can take a long time shutting down
>> the IRQ controller as RTAS calls are needed.  These RTAS calls need to
>> be serialised which results in the secondaries contending in
>> lock_rtas() and hence taking a long time to shut down.
>>
>> We've hit this on large POWER7 machines, where some secondaries are
>> still waiting in lock_rtas(), when the primary tears down the HPTEs.
>>
>> This patch makes sure all secondaries are in real mode before the
>> primary tears down the MMU.  It uses the new kexec_state entry in the
>> paca.  It times out if the secondaries don't reach real mode after
>> 10sec.
>>
>> Signed-off-by: Michael Neuling <mikey@neuling.org>
>> ---
>>
>> arch/powerpc/kernel/crash.c |   27 +++++++++++++++++++++++++++
>> 1 file changed, 27 insertions(+)
>>
>> Index: linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
>> ===================================================================
>> --- linux-2.6-ozlabs.orig/arch/powerpc/kernel/crash.c
>> +++ linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
>> @@ -162,6 +162,32 @@ static void crash_kexec_prepare_cpus(int
>> 	/* Leave the IPI callback set */
>> }
>>
>> +/* wait for all the CPUs to hit real mode but timeout if they don't come in */
>> +static void crash_kexec_wait_realmode(int cpu)
>> +{
>> +	unsigned int msecs;
>> +	int i;
>> +
>> +	msecs = 10000;
>> +	for (i=0; i < NR_CPUS && msecs > 0; i++) {
>> +		if (i == cpu)
>> +			continue;
>> +
>> +		while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE) {
>> +			barrier();
>> +			if (!cpu_possible(i)) {
>> +				break;
>> +			}
>> +			if (!cpu_online(i)) {
>> +				break;
>> +			}
>> +			msecs--;
>> +			mdelay(1);
>> +		}
>> +	}
>> +	mb();
>> +}
>> +
>> /*
>> * This function will be called by secondary cpus or by kexec cpu
>> * if soft-reset is activated to stop some CPUs.
>> @@ -412,6 +438,7 @@ void default_machine_crash_shutdown(stru
>> 	crash_kexec_prepare_cpus(crashing_cpu);
>> 	cpu_set(crashing_cpu, cpus_in_crash);
>> 	crash_kexec_stop_spus();
>
> should this be
>
> #ifdef CONFIG_PPC_STD_MMU
>
>> +	crash_kexec_wait_realmode(crashing_cpu);
>
> #endif

I'm going to make it CONFIG_PPC_STD_MMU_64 as part of a Kexec book-e patch

- k
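
The guard under discussion can also be expressed by compiling the function to an empty stub when the option is off, which keeps the call site free of #ifdef. A minimal sketch, assuming that approach (the thread does not say how the book-e patch will do it): `realmode_waits` and `crash_shutdown_model()` are hypothetical names used only for illustration, and the `#define` merely models a kernel built with the option set.

```c
#include <assert.h>

/* Assumption: models a build with the option enabled; remove to get the stub. */
#define CONFIG_PPC_STD_MMU_64 1

/* Hypothetical counter so the effect of the guard is observable. */
static int realmode_waits;

#ifdef CONFIG_PPC_STD_MMU_64
/* Stand-in for the real wait; here it only records that it ran. */
static void crash_kexec_wait_realmode(int cpu)
{
	(void)cpu;
	realmode_waits++;
}
#else
/* Option not set: nothing to wait for in this sketch, so compile to a no-op. */
static inline void crash_kexec_wait_realmode(int cpu)
{
	(void)cpu;
}
#endif

/* Hypothetical model of the call site in default_machine_crash_shutdown(). */
static void crash_shutdown_model(int crashing_cpu)
{
	assert(crashing_cpu >= 0);
	/* ...prepare the other CPUs and stop SPUs here... */
	crash_kexec_wait_realmode(crashing_cpu);
}
```

With the stub form, the caller reads the same on every configuration; wrapping the call itself in #ifdef, as first suggested, achieves the same effect at the cost of a conditional in the shutdown path.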


* Re: [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown
  2010-05-24 19:29       ` Kumar Gala
@ 2010-05-24 23:53         ` Michael Neuling
  0 siblings, 0 replies; 8+ messages in thread
From: Michael Neuling @ 2010-05-24 23:53 UTC (permalink / raw)
  To: Kumar Gala; +Cc: linuxppc-dev, kexec, jlarrew, Anton Blanchard



In message <04AC722A-97CD-4451-B6AB-F4AC37EFAB1D@kernel.crashing.org> you wrote:
> 
> On May 24, 2010, at 2:23 PM, Kumar Gala wrote:
> 
> >
> > On May 14, 2010, at 12:40 AM, Michael Neuling wrote:
> >
> >> When we are crashing, the crashing/primary CPU IPIs the secondaries to
> >> turn off IRQs, go into real mode and wait in kexec_wait.  While this
> >> is happening, the primary tears down all the MMU maps.  Unfortunately
> >> the primary doesn't check to make sure the secondaries have entered
> >> real mode before doing this.
> >>
> >> On PHYP machines, the secondaries can take a long time shutting down
> >> the IRQ controller as RTAS calls are needed.  These RTAS calls need to
> >> be serialised which results in the secondaries contending in
> >> lock_rtas() and hence taking a long time to shut down.
> >>
> >> We've hit this on large POWER7 machines, where some secondaries are
> >> still waiting in lock_rtas(), when the primary tears down the HPTEs.
> >>
> >> This patch makes sure all secondaries are in real mode before the
> >> primary tears down the MMU.  It uses the new kexec_state entry in the
> >> paca.  It times out if the secondaries don't reach real mode after
> >> 10sec.
> >>
> >> Signed-off-by: Michael Neuling <mikey@neuling.org>
> >> ---
> >>
> >> arch/powerpc/kernel/crash.c |   27 +++++++++++++++++++++++++++
> >> 1 file changed, 27 insertions(+)
> >>
> >> Index: linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
> >> ===================================================================
> >> --- linux-2.6-ozlabs.orig/arch/powerpc/kernel/crash.c
> >> +++ linux-2.6-ozlabs/arch/powerpc/kernel/crash.c
> >> @@ -162,6 +162,32 @@ static void crash_kexec_prepare_cpus(int
> >> 	/* Leave the IPI callback set */
> >> }
> >>
> >> +/* wait for all the CPUs to hit real mode but timeout if they don't come in */
> >> +static void crash_kexec_wait_realmode(int cpu)
> >> +{
> >> +	unsigned int msecs;
> >> +	int i;
> >> +
> >> +	msecs = 10000;
> >> +	for (i=0; i < NR_CPUS && msecs > 0; i++) {
> >> +		if (i == cpu)
> >> +			continue;
> >> +
> >> +		while (paca[i].kexec_state < KEXEC_STATE_REAL_MODE) {
> >> +			barrier();
> >> +			if (!cpu_possible(i)) {
> >> +				break;
> >> +			}
> >> +			if (!cpu_online(i)) {
> >> +				break;
> >> +			}
> >> +			msecs--;
> >> +			mdelay(1);
> >> +		}
> >> +	}
> >> +	mb();
> >> +}
> >> +
> >> /*
> >> * This function will be called by secondary cpus or by kexec cpu
> >> * if soft-reset is activated to stop some CPUs.
> >> @@ -412,6 +438,7 @@ void default_machine_crash_shutdown(stru
> >> 	crash_kexec_prepare_cpus(crashing_cpu);
> >> 	cpu_set(crashing_cpu, cpus_in_crash);
> >> 	crash_kexec_stop_spus();
> >
> > should this be
> >
> > #ifdef CONFIG_PPC_STD_MMU
> >
> >> +	crash_kexec_wait_realmode(crashing_cpu);
> >
> > #endif
> 
> I'm going to make it CONFIG_PPC_STD_MMU_64 as part of a Kexec book-e patch

Ok, thanks, I'll leave it up to you then

Mikey


Thread overview: 8+ messages
2010-05-11  6:28 [PATCH] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
2010-05-14  3:57 ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
2010-05-14  3:57 ` [PATCH 1/2] powerpc, kexec: Fix race in kexec shutdown Michael Neuling
2010-05-14  5:40   ` Michael Neuling
2010-05-14  5:40   ` [PATCH 2/2] powerpc, kdump: Fix race in kdump shutdown Michael Neuling
2010-05-24 19:23     ` Kumar Gala
2010-05-24 19:29       ` Kumar Gala
2010-05-24 23:53         ` Michael Neuling
