[git pull] machine check recovery fix

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [git pull] machine check recovery fix
@ 2012-05-17 17:10 Luck, Tony
  2012-05-17 22:45 ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Luck, Tony @ 2012-05-17 17:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

Linus: Sent this to you on Monday as a patch:
	https://lkml.org/lkml/2012/5/14/350
perhaps it got lost. Here it is again in handy "git pull" form:

The following changes since commit d48b97b403d23f6df0b990cee652bdf9a52337a3:

  Linux 3.4-rc6 (2012-05-06 15:07:32 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/linus-mce-fix

for you to fetch changes up to dad1743e5993f19b3d7e7bd0fb35dc45b5326626:

  x86/mce: Only restart instruction after machine check recovery if it is safe (2012-05-14 15:07:48 -0700)

----------------------------------------------------------------
Fix machine check recovery

----------------------------------------------------------------
Tony Luck (1):
      x86/mce: Only restart instruction after machine check recovery if it is safe

 arch/x86/kernel/cpu/mcheck/mce.c |   14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-17 17:10 [git pull] machine check recovery fix Luck, Tony
@ 2012-05-17 22:45 ` Linus Torvalds
  2012-05-18  0:14   ` Tony Luck
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2012-05-17 22:45 UTC (permalink / raw)
  To: Luck, Tony; +Cc: linux-kernel

On Thu, May 17, 2012 at 10:10 AM, Luck, Tony <tony.luck@intel.com> wrote:
> Linus: Sent this to you on Monday as a patch:

So I really didn't like the patch.

I'm not entirely sure why I dislike it so much, but I don't like how
it seems to mix up the software rules and the hardware rules. They are
two totally separate things.

Also, the whole "nonrestartable state flag" means - if I understood
things correctly - that you really cannot do the "iret" even from the
NMI handler. So trying to push this into the whole process
notification seems entirely incorrect, because that still requires
that we return from the NMI - using the very machine state that we're
not supposed to use.

So I seriously believe the patch is wrong.

What I think *could* be right is something that says

 - if the "can't restart" flag is set *AND* the state saved is
user-space, then we can treat the NMI as a regular interrupt (because
we're clearly not interrupting kernel mode), and we can kill the
process directly.

 - if "can't restart" is set, and we're in kernel mode, we need to
panic (or, perhaps, just say "screw it, we don't have any choice,
we're going to try to restart anyway")

I guess the notify bit kind of emulates that "if the NMI happened in
user space" thing, but it seems to really do that more by mistake than
by design. Or at least it doesn't seem to be explicitly documented as
being intentional.

I dunno. I'm just very uncomfortable with the patch.

                   Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-17 22:45 ` Linus Torvalds
@ 2012-05-18  0:14   ` Tony Luck
  2012-05-18  0:25     ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Tony Luck @ 2012-05-18  0:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, May 17, 2012 at 3:45 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Also, the whole "nonrestartable state flag" means - if I understood
> things correctly - that you really cannot do the "iret" even from the
> NMI handler.

Not quite ... we can "iret" ... but not back to the instruction that was
executing when the machine check occurred. We need to go some
place else .... hence we send a signal that will either kill the process,
or take them to their signal handler. [Their signal handler might try
to return, and fall into nowhere ... but that's a user programming error.
If bad things happen to the process and it gets the wrong answer, that
is their own fault].

It's the "PCC" bit (processor context corrupt) that says that we can't
iret at all. We

-Tony

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-18  0:14   ` Tony Luck
@ 2012-05-18  0:25     ` Linus Torvalds
  2012-05-18  2:37       ` Tony Luck
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2012-05-18  0:25 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel

On Thu, May 17, 2012 at 5:14 PM, Tony Luck <tony.luck@intel.com> wrote:
> On Thu, May 17, 2012 at 3:45 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> Also, the whole "nonrestartable state flag" means - if I understood
>> things correctly - that you really cannot do the "iret" even from the
>> NMI handler.
>
> Not quite ... we can "iret" ... but not back to the instruction that was
> executing when the machine check occurred. We need to go some
> place else .... hence we send a signal that will either kill the process

Tony, I don't think you understand.

If the machine check happened in kernel space, we currently *are*
returning to the instruction that executed. With or without your
patch. That's my argument.

Your _TIF_MCE_NOTIFY games do *nothing*, because they only get tested
at return to user space - not on return to the MC faulting kernel
space instruction.

This is what I was talking about - the thing looks to work entirely
*accidentally* - and only for the user-space case.

                        Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-18  0:25     ` Linus Torvalds
@ 2012-05-18  2:37       ` Tony Luck
  2012-05-18  3:33         ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Tony Luck @ 2012-05-18  2:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Thu, May 17, 2012 at 5:25 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> If the machine check happened in kernel space, we currently *are*
> returning to the instruction that executed. With or without your
> patch. That's my argument.

When we assign the severity for the error we check for kernel vs. user
space. If we took the machine check in kernel space it will get assigned
MCE_PANIC_SEVERITY severity ... and we won;t try to do any
recovery.  We only play the games with TIF_MCE_NOTIFY if we
see severity MCE_AR_SEVERITY ... which we will only do if the
machine check happened in user mode.

So current recovery code only tries to deal with the user case.

Machine checks in kernel space are a future project ... and I agree
will be a monster pain because we'll have to figure out everything
in machine check context.

-Tony

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-18  2:37       ` Tony Luck
@ 2012-05-18  3:33         ` Linus Torvalds
  2012-05-18 16:46           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Linus Torvalds @ 2012-05-18  3:33 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel

On Thu, May 17, 2012 at 7:37 PM, Tony Luck <tony.luck@intel.com> wrote:
>
> When we assign the severity for the error we check for kernel vs. user
> space. If we took the machine check in kernel space it will get assigned
> MCE_PANIC_SEVERITY severity ... and we won;t try to do any
> recovery.  We only play the games with TIF_MCE_NOTIFY if we
> see severity MCE_AR_SEVERITY ... which we will only do if the
> machine check happened in user mode.

Ok, this code is a rats nest. I'm looking at mce_severity(), and I'm
just going "that's unreadable, and totally not understandable".

And it's wrong.

> So current recovery code only tries to deal with the user case.

Ugh. And it mixes up the EIPV bit checking both in "error_context()"
_and_ in the logic in the "severities" array, both of which can check
that bit in two different ways.

The code is crazy. It's an unreadable mess. Not surprising that it
then also is buggy.

Who seriously thinks that you get a *more* reliable machine by
executing code that seems to do these kinds of random things?

Looking at the code, I don't think it has been written by a human. I'm
guessing here, but did somebody stare at some MCE document flow-chart
while writing this? It feels like the way things happened was:

 - an engineer told a technical writer what he thought the flow should be
 - a technical writer wrote a piece of document
 - another unrelated engineer then took that document and tried to codify it
 - nobody actually talked to each other and asked each other "does
this make sense"

Some of that code is clearly pure garbage. For example, if I as a user
can trigger a machine check (say, I can make the CPU overheat by
looping on all CPU's), I can fool that code to say that it happened
IN_KERNEL by just making "ip" be zero. Which is quite trivially done
even if I cannot mmap the virtual NULL address: just create a new code
segment, put a "jmp 0" at the zero address, and jump to that. Voila -
guaranteed zero ip.

Which then makes the MCE code say "oh, that must be kernel mode". For
the great (NOT!) reason that some less than giften individual (see - I
can avoid the word "moron" if I really try!) thought that "zero means
not available". Even if zero is also a possible valid value!

That "is it in kernel mode" check also seems to not know about vm86
mode. Let's hope those MCE's can never happen on an instruction in
vm86 mode, because then the CS check is crap too.

In fact, it's *all* crap. Because it shouldn't check "m->cs" and
"m->ip" at all, because what matters is not which instruction caused
the MCE, but whether the *return* address is in kernel mode or not!
Maybe the error that triggered the MCE happened in user mode, but
asynchronously, so the return address is in kernel mode. So the whole
"error_context()" thing is testing entirely the wrong thing.

What we should worry about is whether RIPV is set, and we're returning
to kernel mode or not. THAT is the case that the kernel cares about,
because it cannot fix it up.

So I'm looking at mce-severity.c, and I think the problems there go
*much* deeper than your little patch.

There's two totally independent bugs in just *one* line of code in the
error_context() function.

                         Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-18  3:33         ` Linus Torvalds
@ 2012-05-18 16:46           ` Linus Torvalds
  2012-05-18 16:57             ` Luck, Tony
  2012-05-18 17:40           ` Borislav Petkov
  2012-05-21 23:32           ` Andi Kleen
  2 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2012-05-18 16:46 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel

Ok, having thought it over some more, I decided to do the pull.

I still think MCE is confusing and actively buggy in many respects,
and your RIPV check is much much too late, but with your patch it's no
worse than it used to be.

                       Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [git pull] machine check recovery fix
  2012-05-18 16:46           ` Linus Torvalds
@ 2012-05-18 16:57             ` Luck, Tony
  0 siblings, 0 replies; 11+ messages in thread
From: Luck, Tony @ 2012-05-18 16:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel@vger.kernel.org

> Ok, having thought it over some more, I decided to do the pull.

Thanks.

> I still think MCE is confusing and actively buggy in many respects,
> and your RIPV check is much much too late, but with your patch it's no
> worse than it used to be.

Yes, it is confusing (and buggy ... VM86 mode check needs to be added
for one thing ... the special treatment of "ip == 0" is also bad). The
lookup table in mce-severity.c is there so that you can take the ".o"
from the kernel tree and link with some user mode sanity checking
tools in the mce-test toolset. Which I think is a step up from following
the pseudo-code in the SDM (which have their own bugs).

I'll try to make it better - but it is never going to be beautiful.

-Tony

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-18  3:33         ` Linus Torvalds
  2012-05-18 16:46           ` Linus Torvalds
@ 2012-05-18 17:40           ` Borislav Petkov
  2012-05-21 23:32           ` Andi Kleen
  2 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2012-05-18 17:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tony Luck, linux-kernel

On Thu, May 17, 2012 at 08:33:44PM -0700, Linus Torvalds wrote:
> On Thu, May 17, 2012 at 7:37 PM, Tony Luck <tony.luck@intel.com> wrote:
> >
> > When we assign the severity for the error we check for kernel vs. user
> > space. If we took the machine check in kernel space it will get assigned
> > MCE_PANIC_SEVERITY severity ... and we won;t try to do any
> > recovery.  We only play the games with TIF_MCE_NOTIFY if we
> > see severity MCE_AR_SEVERITY ... which we will only do if the
> > machine check happened in user mode.
> 
> Ok, this code is a rats nest. I'm looking at mce_severity(), and I'm
> just going "that's unreadable, and totally not understandable".
> 
> And it's wrong.

Tell me about it. And it needs to be per-vendor.

> > So current recovery code only tries to deal with the user case.
> 
> Ugh. And it mixes up the EIPV bit checking both in "error_context()"
> _and_ in the logic in the "severities" array, both of which can check
> that bit in two different ways.
> 
> The code is crazy. It's an unreadable mess. Not surprising that it
> then also is buggy.
> 
> Who seriously thinks that you get a *more* reliable machine by
> executing code that seems to do these kinds of random things?
> 
> Looking at the code, I don't think it has been written by a human. I'm
> guessing here, but did somebody stare at some MCE document flow-chart
> while writing this? It feels like the way things happened was:
> 
>  - an engineer told a technical writer what he thought the flow should be
>  - a technical writer wrote a piece of document
>  - another unrelated engineer then took that document and tried to codify it
>  - nobody actually talked to each other and asked each other "does
> this make sense"

:-)

> Some of that code is clearly pure garbage. For example, if I as a user
> can trigger a machine check (say, I can make the CPU overheat by
> looping on all CPU's), I can fool that code to say that it happened
> IN_KERNEL by just making "ip" be zero. Which is quite trivially done
> even if I cannot mmap the virtual NULL address: just create a new code
> segment, put a "jmp 0" at the zero address, and jump to that. Voila -
> guaranteed zero ip.
> 
> Which then makes the MCE code say "oh, that must be kernel mode". For
> the great (NOT!) reason that some less than giften individual (see - I
> can avoid the word "moron" if I really try!) thought that "zero means
> not available". Even if zero is also a possible valid value!
> 
> That "is it in kernel mode" check also seems to not know about vm86
> mode. Let's hope those MCE's can never happen on an instruction in
> vm86 mode, because then the CS check is crap too.
> 
> In fact, it's *all* crap. Because it shouldn't check "m->cs" and
> "m->ip" at all, because what matters is not which instruction caused
> the MCE, but whether the *return* address is in kernel mode or not!
>
> Maybe the error that triggered the MCE happened in user mode, but
> asynchronously, so the return address is in kernel mode. So the whole
> "error_context()" thing is testing entirely the wrong thing.
>
> What we should worry about is whether RIPV is set, and we're returning
> to kernel mode or not. THAT is the case that the kernel cares about,
> because it cannot fix it up.

Ok, this is just nasty but you're right. When EIPV is set, the saved CS
and rIP point to the instruction that caused the MCE, otherwise not.

Maybe for the return-to-kernel-space case we need to decode instructions
in the #MC handler to know what happens, like kvm? Hmmm...

OTOH, this probably needs verification with CPU guys but shouldn't we
have raised the #MC before doing the context switch? What I'm saying is,
I don't think the #MC is raised across contexts (or raised as late as
when a %cr3 write happens) and getting an #MC in userspace should mean
some insn in userspace caused it, and getting an #MC in the NMI handler
would mean, some insn of the NMI handler caused it... Hmm, I hope I'm
making sense.

I'll ask around.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-18  3:33         ` Linus Torvalds
  2012-05-18 16:46           ` Linus Torvalds
  2012-05-18 17:40           ` Borislav Petkov
@ 2012-05-21 23:32           ` Andi Kleen
  2012-05-21 23:43             ` Linus Torvalds
  2 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2012-05-21 23:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Tony Luck, linux-kernel

Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> In fact, it's *all* crap. Because it shouldn't check "m->cs" and
> "m->ip" at all, because what matters is not which instruction caused
> the MCE, but whether the *return* address is in kernel mode or not!

No it matters which instruction caused the error, because it's the
one which saw data corruption. If that was not in kernel you
can safely just return because the kernel is completely fine
and the instruction can be restarted. It's just like a interrupt.

In the cases where this cannot be determined the MCE code
only uses the address and does not use this.

> Maybe the error that triggered the MCE happened in user mode, but
> asynchronously, so the return address is in kernel mode. So the whole
> "error_context()" thing is testing entirely the wrong thing.

EIPV==1 means the error IP is valid.

The asynchronous cases never handle this.

Yes the logic is rather hairy, but mainly because the whole problem
is very.

> That "is it in kernel mode" check also seems to not know about vm86
> mode. Let's hope those MCE's can never happen on an instruction in
> vm86 mode, because then the CS check is crap too.

I fixed the VM86 thing a long time ago, but it was never merged
unfortunately. Not that it matters much, because the systems which
have recoverable machine checks usually have far too much memory
for 32bit kernels.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [git pull] machine check recovery fix
  2012-05-21 23:32           ` Andi Kleen
@ 2012-05-21 23:43             ` Linus Torvalds
  0 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2012-05-21 23:43 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tony Luck, linux-kernel

On Mon, May 21, 2012 at 4:32 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> No it matters which instruction caused the error, because it's the
> one which saw data corruption. If that was not in kernel you
> can safely just return because the kernel is completely fine
> and the instruction can be restarted. It's just like a interrupt.

Wrong.

The MCE interface doesn't even *give* that information, so you're just
full of it.

There's no way to know whether the EIP that you read from the MSR
happened in real mode, in kernel mode, or whatever. You need to check
eflags and the code segment, neither of which - as far as I know, and
certainly not as far as the current code knows - even exists.

So you *have* to check the return information, not the "MCE" information.

Checking the MCE data is stupid and wrong. Stop doing it, and stop
making idiotic excuses for it.

The only thing that actually has the relevant information is the
return stack. Seriously. As such, only the RIPV bit can *possibly* be
the one that you can use. Any time you use "mce->ip" for anything at
all that isn't just reporting, you are just doing moronic things.

Stop doing stupid things.

                  Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-05-21 23:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-17 17:10 [git pull] machine check recovery fix Luck, Tony
2012-05-17 22:45 ` Linus Torvalds
2012-05-18  0:14   ` Tony Luck
2012-05-18  0:25     ` Linus Torvalds
2012-05-18  2:37       ` Tony Luck
2012-05-18  3:33         ` Linus Torvalds
2012-05-18 16:46           ` Linus Torvalds
2012-05-18 16:57             ` Luck, Tony
2012-05-18 17:40           ` Borislav Petkov
2012-05-21 23:32           ` Andi Kleen
2012-05-21 23:43             ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox