* [GIT PULL] RAS fix for 3.17
@ 2014-06-22 16:46 Borislav Petkov
2014-06-24 13:24 ` [GIT PULL] 2 RAS fixes for 3.17, refreshed Borislav Petkov
0 siblings, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2014-06-22 16:46 UTC (permalink / raw)
To: x86-ml; +Cc: Tony Luck, lkml
Hi guys,
please queue this for 3.17. We had it ready earlier but decided to delay
it for an extra testing period.
Thanks.
--
The following changes since commit a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee:
Linux 3.16-rc2 (2014-06-21 19:02:54 -1000)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/ras_for_3.17
for you to fetch changes up to 38356c1fbd8cd0f44a32ede2c97f0eb639d06613:
x86, MCE: Kill CPU_POST_DEAD (2014-06-22 18:36:39 +0200)
----------------------------------------------------------------
CPU_POST_DEAD is one of thorns in the path to getting CPU hotplug
seriously cleaned up. Kill its incarnation here in the MCE code.
----------------------------------------------------------------
Borislav Petkov (1):
x86, MCE: Kill CPU_POST_DEAD
arch/x86/kernel/cpu/mcheck/mce.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index bb92f38153b2..8fecdd34f2d2 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -2385,6 +2385,10 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
threshold_cpu_callback(action, cpu);
mce_device_remove(cpu);
mce_intel_hcpu_update(cpu);
+
+ /* intentionally ignoring frozen here */
+ if (!(action & CPU_TASKS_FROZEN))
+ cmci_rediscover();
break;
case CPU_DOWN_PREPARE:
smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
@@ -2396,11 +2400,6 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
break;
}
- if (action == CPU_POST_DEAD) {
- /* intentionally ignoring frozen here */
- cmci_rediscover();
- }
-
return NOTIFY_OK;
}
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
^ permalink raw reply related [flat|nested] 7+ messages in thread* [GIT PULL] 2 RAS fixes for 3.17, refreshed 2014-06-22 16:46 [GIT PULL] RAS fix for 3.17 Borislav Petkov @ 2014-06-24 13:24 ` Borislav Petkov 2014-06-24 13:27 ` Konrad Rzeszutek Wilk 0 siblings, 1 reply; 7+ messages in thread From: Borislav Petkov @ 2014-06-24 13:24 UTC (permalink / raw) To: x86-ml; +Cc: Tony Luck, lkml On Sun, Jun 22, 2014 at 06:46:03PM +0200, Borislav Petkov wrote: > Hi guys, > > please queue this for 3.17. We had it ready earlier but decided to delay > it for an extra testing period. Actually, ignore that one. Here's a new pull request adding a fix for an issue BorisO reported. All non-critical stuff for 3.17. Please pull, thanks. --- The following changes since commit a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee: Linux 3.16-rc2 (2014-06-21 19:02:54 -1000) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/ras_for_3.17 for you to fetch changes up to 27c934158c5be0bebfb2970da521b9d9efc0058b: x86, MCE: Robustify mcheck_init_device (2014-06-24 15:17:01 +0200) ---------------------------------------------------------------- CPU_POST_DEAD is one of thorns in the path to getting CPU hotplug seriously cleaned up. Kill its incarnation here in the MCE code. Also, robustify mcheck_init_device() wrt CPU hotplug. ---------------------------------------------------------------- Borislav Petkov (2): x86, MCE: Kill CPU_POST_DEAD x86, MCE: Robustify mcheck_init_device arch/x86/kernel/cpu/mcheck/mce.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index bb92f38153b2..4fc57975acc1 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -2385,6 +2385,10 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) threshold_cpu_callback(action, cpu); mce_device_remove(cpu); mce_intel_hcpu_update(cpu); + + /* intentionally ignoring frozen here */ + if (!(action & CPU_TASKS_FROZEN)) + cmci_rediscover(); break; case CPU_DOWN_PREPARE: smp_call_function_single(cpu, mce_disable_cpu, &action, 1); @@ -2396,11 +2400,6 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) break; } - if (action == CPU_POST_DEAD) { - /* intentionally ignoring frozen here */ - cmci_rediscover(); - } - return NOTIFY_OK; } @@ -2451,6 +2450,12 @@ static __init int mcheck_init_device(void) for_each_online_cpu(i) { err = mce_device_create(i); if (err) { + /* + * Register notifier anyway (and do not unreg it) so + * that we don't leave undeleted timers, see notifier + * callback above. + */ + __register_hotcpu_notifier(&mce_cpu_notifier); cpu_notifier_register_done(); goto err_device_create; } @@ -2471,10 +2476,6 @@ static __init int mcheck_init_device(void) err_register: unregister_syscore_ops(&mce_syscore_ops); - cpu_notifier_register_begin(); - __unregister_hotcpu_notifier(&mce_cpu_notifier); - cpu_notifier_register_done(); - err_device_create: /* * We didn't keep track of which devices were created above, but -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [GIT PULL] 2 RAS fixes for 3.17, refreshed 2014-06-24 13:24 ` [GIT PULL] 2 RAS fixes for 3.17, refreshed Borislav Petkov @ 2014-06-24 13:27 ` Konrad Rzeszutek Wilk 2014-06-27 15:01 ` Borislav Petkov 0 siblings, 1 reply; 7+ messages in thread From: Konrad Rzeszutek Wilk @ 2014-06-24 13:27 UTC (permalink / raw) To: Borislav Petkov; +Cc: x86-ml, Tony Luck, lkml On Tue, Jun 24, 2014 at 03:24:39PM +0200, Borislav Petkov wrote: > On Sun, Jun 22, 2014 at 06:46:03PM +0200, Borislav Petkov wrote: > > Hi guys, > > > > please queue this for 3.17. We had it ready earlier but decided to delay > > it for an extra testing period. > > Actually, ignore that one. Here's a new pull request adding a fix for an > issue BorisO reported. All non-critical stuff for 3.17. Isn't that one a regression that was introduced in 3.16? Shouldn't it be going against 3.16 then? Thanks. > > Please pull, > thanks. > > --- > The following changes since commit a497c3ba1d97fc69c1e78e7b96435ba8c2cb42ee: > > Linux 3.16-rc2 (2014-06-21 19:02:54 -1000) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/ras_for_3.17 > > for you to fetch changes up to 27c934158c5be0bebfb2970da521b9d9efc0058b: > > x86, MCE: Robustify mcheck_init_device (2014-06-24 15:17:01 +0200) > > ---------------------------------------------------------------- > CPU_POST_DEAD is one of thorns in the path to getting CPU hotplug > seriously cleaned up. Kill its incarnation here in the MCE code. > > Also, robustify mcheck_init_device() wrt CPU hotplug. > > ---------------------------------------------------------------- > Borislav Petkov (2): > x86, MCE: Kill CPU_POST_DEAD > x86, MCE: Robustify mcheck_init_device > > arch/x86/kernel/cpu/mcheck/mce.c | 19 ++++++++++--------- > 1 file changed, 10 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c > index bb92f38153b2..4fc57975acc1 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce.c > +++ b/arch/x86/kernel/cpu/mcheck/mce.c > @@ -2385,6 +2385,10 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) > threshold_cpu_callback(action, cpu); > mce_device_remove(cpu); > mce_intel_hcpu_update(cpu); > + > + /* intentionally ignoring frozen here */ > + if (!(action & CPU_TASKS_FROZEN)) > + cmci_rediscover(); > break; > case CPU_DOWN_PREPARE: > smp_call_function_single(cpu, mce_disable_cpu, &action, 1); > @@ -2396,11 +2400,6 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) > break; > } > > - if (action == CPU_POST_DEAD) { > - /* intentionally ignoring frozen here */ > - cmci_rediscover(); > - } > - > return NOTIFY_OK; > } > > @@ -2451,6 +2450,12 @@ static __init int mcheck_init_device(void) > for_each_online_cpu(i) { > err = mce_device_create(i); > if (err) { > + /* > + * Register notifier anyway (and do not unreg it) so > + * that we don't leave undeleted timers, see notifier > + * callback above. > + */ > + __register_hotcpu_notifier(&mce_cpu_notifier); > cpu_notifier_register_done(); > goto err_device_create; > } > @@ -2471,10 +2476,6 @@ static __init int mcheck_init_device(void) > err_register: > unregister_syscore_ops(&mce_syscore_ops); > > - cpu_notifier_register_begin(); > - __unregister_hotcpu_notifier(&mce_cpu_notifier); > - cpu_notifier_register_done(); > - > err_device_create: > /* > * We didn't keep track of which devices were created above, but > > -- > Regards/Gruss, > Boris. > > Sent from a fat crate under my desk. Formatting is fine. > -- > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [GIT PULL] 2 RAS fixes for 3.17, refreshed 2014-06-24 13:27 ` Konrad Rzeszutek Wilk @ 2014-06-27 15:01 ` Borislav Petkov 2014-06-27 15:12 ` Boris Ostrovsky 0 siblings, 1 reply; 7+ messages in thread From: Borislav Petkov @ 2014-06-27 15:01 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: Konrad Rzeszutek Wilk, x86-ml, Tony Luck, lkml On Tue, Jun 24, 2014 at 09:27:01AM -0400, Konrad Rzeszutek Wilk wrote: > Isn't that one a regression that was introduced in 3.16? Hrrm, BorisO, you said misc_register would often fail in xen, is that correct? Because if so, we added the error check to misc_register in 3.16 so the mcheck_init_device fix would have to go in now. Right? Thanks. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [GIT PULL] 2 RAS fixes for 3.17, refreshed 2014-06-27 15:01 ` Borislav Petkov @ 2014-06-27 15:12 ` Boris Ostrovsky 2014-06-27 16:08 ` Borislav Petkov 0 siblings, 1 reply; 7+ messages in thread From: Boris Ostrovsky @ 2014-06-27 15:12 UTC (permalink / raw) To: Borislav Petkov; +Cc: Konrad Rzeszutek Wilk, x86-ml, Tony Luck, lkml On 06/27/2014 11:01 AM, Borislav Petkov wrote: > On Tue, Jun 24, 2014 at 09:27:01AM -0400, Konrad Rzeszutek Wilk wrote: >> Isn't that one a regression that was introduced in 3.16? > Hrrm, BorisO, you said misc_register would often fail in xen, is that > correct? Because if so, we added the error check to misc_register in > 3.16 so the mcheck_init_device fix would have to go in now. Yes, it fails because xen_late_init_mcelog() registers /dev/mcelog and (I think) it happens before mcheck_init_device(). In other words, misc_register() expected to fail in mcheck/mce.c on (privileged?) PV guests (provided right CONFIG_XEN_* is set). -boris ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [GIT PULL] 2 RAS fixes for 3.17, refreshed 2014-06-27 15:12 ` Boris Ostrovsky @ 2014-06-27 16:08 ` Borislav Petkov 2014-06-27 16:59 ` Boris Ostrovsky 0 siblings, 1 reply; 7+ messages in thread From: Borislav Petkov @ 2014-06-27 16:08 UTC (permalink / raw) To: Boris Ostrovsky; +Cc: Konrad Rzeszutek Wilk, x86-ml, Tony Luck, lkml On Fri, Jun 27, 2014 at 11:12:59AM -0400, Boris Ostrovsky wrote: > Yes, it fails because xen_late_init_mcelog() registers /dev/mcelog and (I > think) it happens before mcheck_init_device(). Yes, mcheck_init_device is device_initcall_sync() while xen_late_init_mcelog() is device_initcall(). > In other words, misc_register() expected to fail in mcheck/mce.c on > (privileged?) PV guests (provided right CONFIG_XEN_* is set). So cef12ee52b05 ("xen/mce: Add mcelog support for Xen platform") made it this way so that xen's init routine runs first. So it is not the case that misc_register() fails often on xen but it is *supposed* to fail by design, when running in dom0. And *then* you need the notifier *not* unregistered on the error path so that the timers do get deleted properly. Ok, I see it now. Frankly, I'm not really sure I want to rush this in now because it might break something else, Who TF knows what. Right now my gut feeling tells me we should still queue it for 3.17 and have it run for a while in linux-next. We can backport it to stable later after some testing... Hmmm. -- Regards/Gruss, Boris. Sent from a fat crate under my desk. Formatting is fine. -- ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [GIT PULL] 2 RAS fixes for 3.17, refreshed 2014-06-27 16:08 ` Borislav Petkov @ 2014-06-27 16:59 ` Boris Ostrovsky 0 siblings, 0 replies; 7+ messages in thread From: Boris Ostrovsky @ 2014-06-27 16:59 UTC (permalink / raw) To: Borislav Petkov; +Cc: Konrad Rzeszutek Wilk, x86-ml, Tony Luck, lkml On 06/27/2014 12:08 PM, Borislav Petkov wrote: > On Fri, Jun 27, 2014 at 11:12:59AM -0400, Boris Ostrovsky wrote: >> Yes, it fails because xen_late_init_mcelog() registers /dev/mcelog and (I >> think) it happens before mcheck_init_device(). > Yes, mcheck_init_device is device_initcall_sync() while > xen_late_init_mcelog() is device_initcall(). > >> In other words, misc_register() expected to fail in mcheck/mce.c on >> (privileged?) PV guests (provided right CONFIG_XEN_* is set). > So > > cef12ee52b05 ("xen/mce: Add mcelog support for Xen platform") > > made it this way so that xen's init routine runs first. > > So it is not the case that misc_register() fails often on xen but it is > *supposed* to fail by design, when running in dom0. And *then* you need > the notifier *not* unregistered on the error path so that the timers do > get deleted properly. > > Ok, I see it now. Frankly, I'm not really sure I want to rush this in > now because it might break something else, Who TF knows what. > > Right now my gut feeling tells me we should still queue it for 3.17 and > have it run for a while in linux-next. We can backport it to stable > later after some testing... I don't have a problem with having it soak in linux-next for a while but I am not too crazy about releasing 3.16 with this bug (even knowing that there will be a backport later). When we hit this problem the results are rather unpleasant in that it's not immediately clear what's happened. We are still at rc2 so we have 3-4 weeks before 3.16 goes out. -boris ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-06-27 16:58 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-06-22 16:46 [GIT PULL] RAS fix for 3.17 Borislav Petkov 2014-06-24 13:24 ` [GIT PULL] 2 RAS fixes for 3.17, refreshed Borislav Petkov 2014-06-24 13:27 ` Konrad Rzeszutek Wilk 2014-06-27 15:01 ` Borislav Petkov 2014-06-27 15:12 ` Boris Ostrovsky 2014-06-27 16:08 ` Borislav Petkov 2014-06-27 16:59 ` Boris Ostrovsky
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox