public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [Patch-next] Remove notify_die in do_machine_check functioin
@ 2010-05-27  2:40 Jin Dongming
  2010-05-27  3:21 ` Huang Ying
  2010-05-27  6:58 ` Andi Kleen
  0 siblings, 2 replies; 8+ messages in thread
From: Jin Dongming @ 2010-05-27  2:40 UTC (permalink / raw)
  To: LKLM; +Cc: Andi Kleen, Huang Ying, Hidetoshi Seto

This patch fixes do_machine_check() failure caused by DIE_NMI.

I do MCE tests on my machine. When I inject Uncorrected Error(UE) into
kernel, the messages of test failure are always gotten. This problem
is caused by the notification of DIE_NMI in the front of do_machine_check().
Because there are some notifications used DIE_NMI, and when they finish their
own work and return NOTIFY_STOP as a result. The result makes
do_machine_check() return at that time.

So we decide to delete the notification of DIE_NMI. It is because when UE error
happens, if one of the cpu is down caused by the error of hook function of
DIE_NMI, the error type of UE may be different with the real one. For example,

        CPU0                                  CPU1
UE      do_machine_check()                    do_machine_check()
        |                                     |
        cpu down(hook error of DIE_NMI)       cpu OK(no hook error of DIE_NMI)
                                              |
                                              wait CPU0 timeout
                                              |
                                              Fatal Error
                                              (Timeout synchronizing machine
                                               check over CPUs)

And I test this patch on x86_64. It works well.

Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 18cc425..5ed9df3 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -955,9 +955,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 
 	percpu_inc(mce_exception_count);
 
-	if (notify_die(DIE_NMI, "machine check", regs, error_code,
-			   18, SIGKILL) == NOTIFY_STOP)
-		goto out;
 	if (!banks)
 		goto out;
 
-- 
1.7.0.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27  2:40 [Patch-next] Remove notify_die in do_machine_check functioin Jin Dongming
@ 2010-05-27  3:21 ` Huang Ying
  2010-05-27  6:06   ` Hidetoshi Seto
  2010-05-27  6:54   ` Andi Kleen
  2010-05-27  6:58 ` Andi Kleen
  1 sibling, 2 replies; 8+ messages in thread
From: Huang Ying @ 2010-05-27  3:21 UTC (permalink / raw)
  To: Jin Dongming; +Cc: LKLM, Andi Kleen, Hidetoshi Seto

On Thu, 2010-05-27 at 10:40 +0800, Jin Dongming wrote:
> This patch fixes do_machine_check() failure caused by DIE_NMI.
> 
> I do MCE tests on my machine. When I inject Uncorrected Error(UE) into
> kernel, the messages of test failure are always gotten. This problem
> is caused by the notification of DIE_NMI in the front of do_machine_check().
> Because there are some notifications used DIE_NMI, and when they finish their
> own work and return NOTIFY_STOP as a result. The result makes
> do_machine_check() return at that time.
> 
> So we decide to delete the notification of DIE_NMI. It is because when UE error
> happens, if one of the cpu is down caused by the error of hook function of
> DIE_NMI, the error type of UE may be different with the real one. For example,
> 
>         CPU0                                  CPU1
> UE      do_machine_check()                    do_machine_check()
>         |                                     |
>         cpu down(hook error of DIE_NMI)       cpu OK(no hook error of DIE_NMI)
>                                               |
>                                               wait CPU0 timeout
>                                               |
>                                               Fatal Error
>                                               (Timeout synchronizing machine
>                                                check over CPUs)

Fatal error will only occur if tolerant = 0, which is not the common
case.

But I think the notify_die can be an issue here. For example UE is on
CPU0, and the MCE is consumed by notify_die; MCE on CPU1 will detect
nothing.

I have heard about that on some machine, some hardware error output pin
of chipset may be linked with some input pin of CPU which can cause MCE.
That is, MCE is used to report some chipset errors too. I think that is
why notify_die is called in do_machine_check. Simply removing notify_die
is not good for these machines.

Maybe we should fix the notifier user instead. Which notifier user
consumes the DIE_NMI notification?

Best Regards,
Huang Ying



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27  3:21 ` Huang Ying
@ 2010-05-27  6:06   ` Hidetoshi Seto
  2010-05-27  6:57     ` Andi Kleen
  2010-05-27  6:54   ` Andi Kleen
  1 sibling, 1 reply; 8+ messages in thread
From: Hidetoshi Seto @ 2010-05-27  6:06 UTC (permalink / raw)
  To: Huang Ying; +Cc: Jin Dongming, LKLM, Andi Kleen

(2010/05/27 12:21), Huang Ying wrote:
> I have heard about that on some machine, some hardware error output pin
> of chipset may be linked with some input pin of CPU which can cause MCE.
> That is, MCE is used to report some chipset errors too. I think that is
> why notify_die is called in do_machine_check. Simply removing notify_die
> is not good for these machines.

Hum, it sounds like "notify_die here is hook for proprietary chipset
driver".  Anyone who have such machine and driver in real?

But if my understanding is correct the notify_die here will call all
registered callbacks and let them process if "DIE_NMI" is an event
what the callback really interested in.

Problems are (1) many callbacks will behave wrongly since they don't
aware that DIE_NMI event can be posted from Machine Check, and (2)
if the machine is not such special hardware it is just waste of time
in critical context where quick page-poisoning might be required.

> Maybe we should fix the notifier user instead. Which notifier user
> consumes the DIE_NMI notification?

What I found at a glance is:

[arch/x86/kernel/cpu/perf_event.c]
   1183 static int __kprobes
   1184 perf_event_nmi_handler(struct notifier_block *self,
   1185                          unsigned long cmd, void *__args)
   1186 {
   1187         struct die_args *args = __args;
   1188         struct pt_regs *regs;
   1189
   1190         if (!atomic_read(&active_events))
   1191                 return NOTIFY_DONE;
   1192
   1193         switch (cmd) {
   1194         case DIE_NMI:
   1195         case DIE_NMI_IPI:
   1196                 break;
   1197
   1198         default:
   1199                 return NOTIFY_DONE;
   1200         }
   1201
   1202         regs = args->regs;
   1203
   1204         apic_write(APIC_LVTPC, APIC_DM_NMI);
   1205         /*
   1206          * Can't rely on the handled return value to say it was our NMI, two
   1207          * events could trigger 'simultaneously' raising two back-to-back NMIs.
   1208          *
   1209          * If the first NMI handles both, the latter will be empty and daze
   1210          * the CPU.
   1211          */
   1212         x86_pmu.handle_irq(regs);
   1213
   1214         return NOTIFY_STOP;
   1215 }

However I think fixing the notifier users is wrong direction.
(At least I have no idea how many ISVs will be affected)

One quick alternative is define "DIE_MCE" and use it instead, but
if special hook like this is really required, I suppose we should
invent some special interface for external plug-in like a chipset's
LLHEH (low-level hardware error handler) etc., to allow additional
platform-specific error handling in critical context.

So I think simply removing it is good to start.
If there are no complaints and no users in these days, we are done.
Otherwise we will get fresh real requirement and will be able to do
proper things.


Thanks,
H.Seto


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27  3:21 ` Huang Ying
  2010-05-27  6:06   ` Hidetoshi Seto
@ 2010-05-27  6:54   ` Andi Kleen
  2010-05-27 11:04     ` Alan Cox
  1 sibling, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2010-05-27  6:54 UTC (permalink / raw)
  To: Huang Ying; +Cc: Jin Dongming, LKLM, Hidetoshi Seto


> I have heard about that on some machine, some hardware error output pin
> of chipset may be linked with some input pin of CPU which can cause MCE.

Yes that happens.

> That is, MCE is used to report some chipset errors too. I think that is
> why notify_die is called in do_machine_check. Simply removing notify_die
> is not good for these machines.

In general deciding what to do on a MCE is rather complicated
and probably too much for any die handler.

> Maybe we should fix the notifier user instead. Which notifier user
> consumes the DIE_NMI notification?

Yes. It would be good to find out which user it is. Perhaps gdb?

One approach would be to give it a different type (DIE_MCE)

But today we don't really need it. notify_die() is primarily for debuggers
of all kinds, and I never liked the idea to call a debugger on a machine
check.

So I would be ok with just removing the call.

-Andi


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27  6:06   ` Hidetoshi Seto
@ 2010-05-27  6:57     ` Andi Kleen
  0 siblings, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2010-05-27  6:57 UTC (permalink / raw)
  To: Hidetoshi Seto; +Cc: Huang Ying, Jin Dongming, LKLM

, Hidetoshi Seto wrote:
> (2010/05/27 12:21), Huang Ying wrote:
>> I have heard about that on some machine, some hardware error output pin
>> of chipset may be linked with some input pin of CPU which can cause MCE.
>> That is, MCE is used to report some chipset errors too. I think that is
>> why notify_die is called in do_machine_check. Simply removing notify_die
>> is not good for these machines.
>
> Hum, it sounds like "notify_die here is hook for proprietary chipset
> driver".  Anyone who have such machine and driver in real?

No, the die hook was to be compatible with the old KDB patchkit
which hooked into MCE too.

> Problems are (1) many callbacks will behave wrongly since they don't
> aware that DIE_NMI event can be posted from Machine Check, and (2)
> if the machine is not such special hardware it is just waste of time
> in critical context where quick page-poisoning might be required.

Yes the best action is probably to just remove it right now.

> One quick alternative is define "DIE_MCE" and use it instead, but
> if special hook like this is really required, I suppose we should
> invent some special interface for external plug-in like a chipset's
> LLHEH (low-level hardware error handler) etc., to allow additional
> platform-specific error handling in critical context.

I don't think we need or want that.

-Andi



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27  2:40 [Patch-next] Remove notify_die in do_machine_check functioin Jin Dongming
  2010-05-27  3:21 ` Huang Ying
@ 2010-05-27  6:58 ` Andi Kleen
  1 sibling, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2010-05-27  6:58 UTC (permalink / raw)
  To: Jin Dongming; +Cc: LKLM, Huang Ying, Hidetoshi Seto


> And I test this patch on x86_64. It works well.
> 
> Signed-off-by: Jin Dongming<jin.dongming@np.css.fujitsu.com>
> Signed-off-by: Hidetoshi Seto<seto.hidetoshi@jp.fujitsu.com>

Reviewed-by: Andi Kleen <ak@linux.intel.com>

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27  6:54   ` Andi Kleen
@ 2010-05-27 11:04     ` Alan Cox
  2010-05-27 13:15       ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Cox @ 2010-05-27 11:04 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Huang Ying, Jin Dongming, LKLM, Hidetoshi Seto

> In general deciding what to do on a MCE is rather complicated
> and probably too much for any die handler.

True enough

> But today we don't really need it. notify_die() is primarily for debuggers
> of all kinds, and I never liked the idea to call a debugger on a machine
> check.

That would be because you don't do driver work I suspect. If you are
doing driver work then its extremely useful ending up in the debugger
when you get an MCE because some random bit of hardware on the bus
decided to throw a tantrum.

This is particularly the case with AMD/ATI and AMD/Nvidia chipset systems
which tend to throw this kind of error if you prod some of the chipset
controllers (eg the Nvidia SATA) in them in just the wrong way.

So NAK simply removing it. As a driver writer I want to end up in the
debugger when this happens so I can work out what led up to the MCE.

Alan

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch-next] Remove notify_die in do_machine_check functioin
  2010-05-27 11:04     ` Alan Cox
@ 2010-05-27 13:15       ` Andi Kleen
  0 siblings, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2010-05-27 13:15 UTC (permalink / raw)
  To: Alan Cox; +Cc: Huang Ying, Jin Dongming, LKLM, Hidetoshi Seto

Hi Alan,

> That would be because you don't do driver work I suspect. If you are
> doing driver work then its extremely useful ending up in the debugger
> when you get an MCE because some random bit of hardware on the bus
> decided to throw a tantrum.
>
> This is particularly the case with AMD/ATI and AMD/Nvidia chipset systems
> which tend to throw this kind of error if you prod some of the chipset
> controllers (eg the Nvidia SATA) in them in just the wrong way.
>
> So NAK simply removing it. As a driver writer I want to end up in the
> debugger when this happens so I can work out what led up to the MCE.

Have you ever tried that? It does not sound like it to be honest :)

You have no chance to figure out why the MCE happened
either, unless you run through the handler first.

Unless you want to do all the work the MCE handler does manually
somehow in the debugger (reading all banks on all CPUs, parsing
all the bits, doing all the other work). I wrote the code
to do that and even I am a bit scared of doing all the manually.

Also if the MCE is recoverable you'll just get a log entry
with all the information and if it's not recoverable you
get a panic which ends up entering the debugger anyways.

In addition you won't get a single debugger entry, but a parallel
entry on all CPUs because a MCE is broadcast.

So overall I still think handling MCEs in debuggers does not
make sense.

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2010-05-27 13:15 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-05-27  2:40 [Patch-next] Remove notify_die in do_machine_check functioin Jin Dongming
2010-05-27  3:21 ` Huang Ying
2010-05-27  6:06   ` Hidetoshi Seto
2010-05-27  6:57     ` Andi Kleen
2010-05-27  6:54   ` Andi Kleen
2010-05-27 11:04     ` Alan Cox
2010-05-27 13:15       ` Andi Kleen
2010-05-27  6:58 ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox