linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/mce: Don't unregister CPU hotplug notifier in error path
@ 2014-06-20 14:28 Boris Ostrovsky
  2014-06-20 15:23 ` Borislav Petkov
  2014-07-24 23:36 ` [tip:x86/urgent] x86, MCE: Robustify mcheck_init_device tip-bot for Borislav Petkov
  0 siblings, 2 replies; 17+ messages in thread
From: Boris Ostrovsky @ 2014-06-20 14:28 UTC (permalink / raw)
  To: bp, tony.luck; +Cc: linux-kernel, linux-edac, mattieu.souchaud, boris.ostrovsky

Commit 9c15a24b038f4d8da93a2bc2554731f8953a7c17 (x86/mce: Improve
mcheck_init_device() error handling) unregisters (or never registers)
MCE's hotplug notifier if an error is encountered.

Since unplugging a CPU would normally result in the notifier deleting
MCE timer we are now left with the timer running if a CPU is removed on
a system where mcheck_init_device() had failed.

If we later hotplug this CPU back we add this timer again in
mcheck_cpu_init()). Eventually the two timers start intefering with each
other, causing soft lockups or system hangs.

We should leave the notifier always on and, in fact, set it up early
during the boot.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 42 ++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38..0d2828a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1677,6 +1677,11 @@ static void unexpected_machine_check(struct pt_regs *regs, long error_code)
 void (*machine_check_vector)(struct pt_regs *, long error_code) =
 						unexpected_machine_check;
 
+static int
+mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu);
+static struct notifier_block mce_cpu_notifier = {
+	.notifier_call = mce_cpu_callback,
+};
 /*
  * Called for each booted CPU to set up machine checks.
  * Must be called with preempt off:
@@ -1704,6 +1709,9 @@ void mcheck_cpu_init(struct cpuinfo_x86 *c)
 	__mcheck_cpu_init_timer();
 	INIT_WORK(&__get_cpu_var(mce_work), mce_process_work);
 	init_irq_work(&__get_cpu_var(mce_irq_work), &mce_irq_work_cb);
+
+	if (c == &boot_cpu_data)
+		register_cpu_notifier(&mce_cpu_notifier); /* pre-SMP */
 }
 
 /*
@@ -1951,6 +1959,7 @@ static struct miscdevice mce_chrdev_device = {
 	"mcelog",
 	&mce_chrdev_ops,
 };
+static bool is_mce_chrdev_set;
 
 static void __mce_disable_bank(void *arg)
 {
@@ -2376,14 +2385,18 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 
 	switch (action & ~CPU_TASKS_FROZEN) {
 	case CPU_ONLINE:
-		mce_device_create(cpu);
-		if (threshold_cpu_callback)
-			threshold_cpu_callback(action, cpu);
+		if (is_mce_chrdev_set) {
+			mce_device_create(cpu);
+			if (threshold_cpu_callback)
+				threshold_cpu_callback(action, cpu);
+		}
 		break;
 	case CPU_DEAD:
-		if (threshold_cpu_callback)
-			threshold_cpu_callback(action, cpu);
-		mce_device_remove(cpu);
+		if (is_mce_chrdev_set) {
+			if (threshold_cpu_callback)
+				threshold_cpu_callback(action, cpu);
+			mce_device_remove(cpu);
+		}
 		mce_intel_hcpu_update(cpu);
 		break;
 	case CPU_DOWN_PREPARE:
@@ -2404,10 +2417,6 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 	return NOTIFY_OK;
 }
 
-static struct notifier_block mce_cpu_notifier = {
-	.notifier_call = mce_cpu_callback,
-};
-
 static __init void mce_init_banks(void)
 {
 	int i;
@@ -2447,18 +2456,12 @@ static __init int mcheck_init_device(void)
 	if (err)
 		goto err_out_mem;
 
-	cpu_notifier_register_begin();
 	for_each_online_cpu(i) {
 		err = mce_device_create(i);
-		if (err) {
-			cpu_notifier_register_done();
+		if (err)
 			goto err_device_create;
-		}
 	}
 
-	__register_hotcpu_notifier(&mce_cpu_notifier);
-	cpu_notifier_register_done();
-
 	register_syscore_ops(&mce_syscore_ops);
 
 	/* register character device /dev/mcelog */
@@ -2466,15 +2469,12 @@ static __init int mcheck_init_device(void)
 	if (err)
 		goto err_register;
 
+	is_mce_chrdev_set = true;
 	return 0;
 
 err_register:
 	unregister_syscore_ops(&mce_syscore_ops);
 
-	cpu_notifier_register_begin();
-	__unregister_hotcpu_notifier(&mce_cpu_notifier);
-	cpu_notifier_register_done();
-
 err_device_create:
 	/*
 	 * We didn't keep track of which devices were created above, but
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 17+ messages in thread
* Re: [PATCH] x86/mce: Don't unregister CPU hotplug notifier in error path
@ 2014-06-22 17:25 Boris Ostrovsky
  2014-06-24 10:25 ` Borislav Petkov
  0 siblings, 1 reply; 17+ messages in thread
From: Boris Ostrovsky @ 2014-06-22 17:25 UTC (permalink / raw)
  To: bp; +Cc: tony.luck, mattieu.souchaud, linux-edac, linux-kernel


----- bp@alien8.de wrote:

> On Fri, Jun 20, 2014 at 10:04:37PM -0400, Boris Ostrovsky wrote:
> > I'll try it later but this doesn't look sufficient to me: we might
> not
> > reach this point if subsys_system_register() or
> zalloc_cpumask_var()
> > fail.
> 
> If those fail, I'd say we have a much bigger problem than undeleted
> timers.


You can add 

Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

if you prefer to go with that version. I still think it's not 100% reliable (because of what I said above) but at least it fixes the current breakage.

> 
> > We could register the notifier as the first thing in this routine
> > (probably after mce_available() succeeds).
> 
> I guess...


Actually that won't work --- we need to register bus at sysfs first.

-boris

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2014-07-24 23:37 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-06-20 14:28 [PATCH] x86/mce: Don't unregister CPU hotplug notifier in error path Boris Ostrovsky
2014-06-20 15:23 ` Borislav Petkov
2014-06-20 15:41   ` Boris Ostrovsky
2014-06-20 15:58     ` Borislav Petkov
2014-06-20 16:16       ` Boris Ostrovsky
2014-06-20 17:52         ` Borislav Petkov
2014-06-20 19:39           ` Boris Ostrovsky
2014-06-20 20:03             ` Borislav Petkov
2014-06-20 20:16               ` Boris Ostrovsky
2014-06-20 20:29                 ` Borislav Petkov
2014-06-20 20:43                   ` Boris Ostrovsky
2014-06-20 21:11                     ` Borislav Petkov
2014-06-21  2:04                       ` Boris Ostrovsky
2014-06-21 10:08                         ` Borislav Petkov
2014-07-24 23:36 ` [tip:x86/urgent] x86, MCE: Robustify mcheck_init_device tip-bot for Borislav Petkov
  -- strict thread matches above, loose matches on Subject: below --
2014-06-22 17:25 [PATCH] x86/mce: Don't unregister CPU hotplug notifier in error path Boris Ostrovsky
2014-06-24 10:25 ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).