* [git pull] machine check recovery fix
@ 2012-05-17 17:10 Luck, Tony
2012-05-17 22:45 ` Linus Torvalds
0 siblings, 1 reply; 11+ messages in thread
From: Luck, Tony @ 2012-05-17 17:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-kernel
Linus: Sent this to you on Monday as a patch:
https://lkml.org/lkml/2012/5/14/350
perhaps it got lost. Here it is again in handy "git pull" form:
The following changes since commit d48b97b403d23f6df0b990cee652bdf9a52337a3:
Linux 3.4-rc6 (2012-05-06 15:07:32 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/linus-mce-fix
for you to fetch changes up to dad1743e5993f19b3d7e7bd0fb35dc45b5326626:
x86/mce: Only restart instruction after machine check recovery if it is safe (2012-05-14 15:07:48 -0700)
----------------------------------------------------------------
Fix machine check recovery
----------------------------------------------------------------
Tony Luck (1):
x86/mce: Only restart instruction after machine check recovery if it is safe
arch/x86/kernel/cpu/mcheck/mce.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [git pull] machine check recovery fix 2012-05-17 17:10 [git pull] machine check recovery fix Luck, Tony @ 2012-05-17 22:45 ` Linus Torvalds 2012-05-18 0:14 ` Tony Luck 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2012-05-17 22:45 UTC (permalink / raw) To: Luck, Tony; +Cc: linux-kernel On Thu, May 17, 2012 at 10:10 AM, Luck, Tony <tony.luck@intel.com> wrote: > Linus: Sent this to you on Monday as a patch: So I really didn't like the patch. I'm not entirely sure why I dislike it so much, but I don't like how it seems to mix up the software rules and the hardware rules. They are two totally separate things. Also, the whole "nonrestartable state flag" means - if I understood things correctly - that you really cannot do the "iret" even from the NMI handler. So trying to push this into the whole process notification seems entirely incorrect, because that still requires that we return from the NMI - using the very machine state that we're not supposed to use. So I seriously believe the patch is wrong. What I think *could* be right is something that says - if the "can't restart" flag is set *AND* the state saved is user-space, then we can treat the NMI as a regular interrupt (because we're clearly not interrupting kernel mode), and we can kill the process directly. - if "can't restart" is set, and we're in kernel mode, we need to panic (or, perhaps, just say "screw it, we don't have any choice, we're going to try to restart anyway") I guess the notify bit kind of emulates that "if the NMI happened in user space" thing, but it seems to really do that more by mistake than by design. Or at least it doesn't seem to be explicitly documented as being intentional. I dunno. I'm just very uncomfortable with the patch. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-17 22:45 ` Linus Torvalds @ 2012-05-18 0:14 ` Tony Luck 2012-05-18 0:25 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Tony Luck @ 2012-05-18 0:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, May 17, 2012 at 3:45 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Also, the whole "nonrestartable state flag" means - if I understood > things correctly - that you really cannot do the "iret" even from the > NMI handler. Not quite ... we can "iret" ... but not back to the instruction that was executing when the machine check occurred. We need to go some place else .... hence we send a signal that will either kill the process, or take them to their signal handler. [Their signal handler might try to return, and fall into nowhere ... but that's a user programming error. If bad things happen to the process and it gets the wrong answer, that is their own fault]. It's the "PCC" bit (processor context corrupt) that says that we can't iret at all. We -Tony ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-18 0:14 ` Tony Luck @ 2012-05-18 0:25 ` Linus Torvalds 2012-05-18 2:37 ` Tony Luck 0 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2012-05-18 0:25 UTC (permalink / raw) To: Tony Luck; +Cc: linux-kernel On Thu, May 17, 2012 at 5:14 PM, Tony Luck <tony.luck@intel.com> wrote: > On Thu, May 17, 2012 at 3:45 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> Also, the whole "nonrestartable state flag" means - if I understood >> things correctly - that you really cannot do the "iret" even from the >> NMI handler. > > Not quite ... we can "iret" ... but not back to the instruction that was > executing when the machine check occurred. We need to go some > place else .... hence we send a signal that will either kill the process Tony, I don't think you understand. If the machine check happened in kernel space, we currently *are* returning to the instruction that executed. With or without your patch. That's my argument. Your _TIF_MCE_NOTIFY games do *nothing*, because they only get tested at return to user space - not on return to the MC faulting kernel space instruction. This is what I was talking about - the thing looks to work entirely *accidentally* - and only for the user-space case. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-18 0:25 ` Linus Torvalds @ 2012-05-18 2:37 ` Tony Luck 2012-05-18 3:33 ` Linus Torvalds 0 siblings, 1 reply; 11+ messages in thread From: Tony Luck @ 2012-05-18 2:37 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel On Thu, May 17, 2012 at 5:25 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > If the machine check happened in kernel space, we currently *are* > returning to the instruction that executed. With or without your > patch. That's my argument. When we assign the severity for the error we check for kernel vs. user space. If we took the machine check in kernel space it will get assigned MCE_PANIC_SEVERITY severity ... and we won;t try to do any recovery. We only play the games with TIF_MCE_NOTIFY if we see severity MCE_AR_SEVERITY ... which we will only do if the machine check happened in user mode. So current recovery code only tries to deal with the user case. Machine checks in kernel space are a future project ... and I agree will be a monster pain because we'll have to figure out everything in machine check context. -Tony ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-18 2:37 ` Tony Luck @ 2012-05-18 3:33 ` Linus Torvalds 2012-05-18 16:46 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 11+ messages in thread From: Linus Torvalds @ 2012-05-18 3:33 UTC (permalink / raw) To: Tony Luck; +Cc: linux-kernel On Thu, May 17, 2012 at 7:37 PM, Tony Luck <tony.luck@intel.com> wrote: > > When we assign the severity for the error we check for kernel vs. user > space. If we took the machine check in kernel space it will get assigned > MCE_PANIC_SEVERITY severity ... and we won;t try to do any > recovery. We only play the games with TIF_MCE_NOTIFY if we > see severity MCE_AR_SEVERITY ... which we will only do if the > machine check happened in user mode. Ok, this code is a rats nest. I'm looking at mce_severity(), and I'm just going "that's unreadable, and totally not understandable". And it's wrong. > So current recovery code only tries to deal with the user case. Ugh. And it mixes up the EIPV bit checking both in "error_context()" _and_ in the logic in the "severities" array, both of which can check that bit in two different ways. The code is crazy. It's an unreadable mess. Not surprising that it then also is buggy. Who seriously thinks that you get a *more* reliable machine by executing code that seems to do these kinds of random things? Looking at the code, I don't think it has been written by a human. I'm guessing here, but did somebody stare at some MCE document flow-chart while writing this? It feels like the way things happened was: - an engineer told a technical writer what he thought the flow should be - a technical writer wrote a piece of document - another unrelated engineer then took that document and tried to codify it - nobody actually talked to each other and asked each other "does this make sense" Some of that code is clearly pure garbage. For example, if I as a user can trigger a machine check (say, I can make the CPU overheat by looping on all CPU's), I can fool that code to say that it happened IN_KERNEL by just making "ip" be zero. Which is quite trivially done even if I cannot mmap the virtual NULL address: just create a new code segment, put a "jmp 0" at the zero address, and jump to that. Voila - guaranteed zero ip. Which then makes the MCE code say "oh, that must be kernel mode". For the great (NOT!) reason that some less than giften individual (see - I can avoid the word "moron" if I really try!) thought that "zero means not available". Even if zero is also a possible valid value! That "is it in kernel mode" check also seems to not know about vm86 mode. Let's hope those MCE's can never happen on an instruction in vm86 mode, because then the CS check is crap too. In fact, it's *all* crap. Because it shouldn't check "m->cs" and "m->ip" at all, because what matters is not which instruction caused the MCE, but whether the *return* address is in kernel mode or not! Maybe the error that triggered the MCE happened in user mode, but asynchronously, so the return address is in kernel mode. So the whole "error_context()" thing is testing entirely the wrong thing. What we should worry about is whether RIPV is set, and we're returning to kernel mode or not. THAT is the case that the kernel cares about, because it cannot fix it up. So I'm looking at mce-severity.c, and I think the problems there go *much* deeper than your little patch. There's two totally independent bugs in just *one* line of code in the error_context() function. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-18 3:33 ` Linus Torvalds @ 2012-05-18 16:46 ` Linus Torvalds 2012-05-18 16:57 ` Luck, Tony 2012-05-18 17:40 ` Borislav Petkov 2012-05-21 23:32 ` Andi Kleen 2 siblings, 1 reply; 11+ messages in thread From: Linus Torvalds @ 2012-05-18 16:46 UTC (permalink / raw) To: Tony Luck; +Cc: linux-kernel Ok, having thought it over some more, I decided to do the pull. I still think MCE is confusing and actively buggy in many respects, and your RIPV check is much much too late, but with your patch it's no worse than it used to be. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [git pull] machine check recovery fix 2012-05-18 16:46 ` Linus Torvalds @ 2012-05-18 16:57 ` Luck, Tony 0 siblings, 0 replies; 11+ messages in thread From: Luck, Tony @ 2012-05-18 16:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: linux-kernel@vger.kernel.org > Ok, having thought it over some more, I decided to do the pull. Thanks. > I still think MCE is confusing and actively buggy in many respects, > and your RIPV check is much much too late, but with your patch it's no > worse than it used to be. Yes, it is confusing (and buggy ... VM86 mode check needs to be added for one thing ... the special treatment of "ip == 0" is also bad). The lookup table in mce-severity.c is there so that you can take the ".o" from the kernel tree and link with some user mode sanity checking tools in the mce-test toolset. Which I think is a step up from following the pseudo-code in the SDM (which have their own bugs). I'll try to make it better - but it is never going to be beautiful. -Tony ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-18 3:33 ` Linus Torvalds 2012-05-18 16:46 ` Linus Torvalds @ 2012-05-18 17:40 ` Borislav Petkov 2012-05-21 23:32 ` Andi Kleen 2 siblings, 0 replies; 11+ messages in thread From: Borislav Petkov @ 2012-05-18 17:40 UTC (permalink / raw) To: Linus Torvalds; +Cc: Tony Luck, linux-kernel On Thu, May 17, 2012 at 08:33:44PM -0700, Linus Torvalds wrote: > On Thu, May 17, 2012 at 7:37 PM, Tony Luck <tony.luck@intel.com> wrote: > > > > When we assign the severity for the error we check for kernel vs. user > > space. If we took the machine check in kernel space it will get assigned > > MCE_PANIC_SEVERITY severity ... and we won;t try to do any > > recovery. We only play the games with TIF_MCE_NOTIFY if we > > see severity MCE_AR_SEVERITY ... which we will only do if the > > machine check happened in user mode. > > Ok, this code is a rats nest. I'm looking at mce_severity(), and I'm > just going "that's unreadable, and totally not understandable". > > And it's wrong. Tell me about it. And it needs to be per-vendor. > > So current recovery code only tries to deal with the user case. > > Ugh. And it mixes up the EIPV bit checking both in "error_context()" > _and_ in the logic in the "severities" array, both of which can check > that bit in two different ways. > > The code is crazy. It's an unreadable mess. Not surprising that it > then also is buggy. > > Who seriously thinks that you get a *more* reliable machine by > executing code that seems to do these kinds of random things? > > Looking at the code, I don't think it has been written by a human. I'm > guessing here, but did somebody stare at some MCE document flow-chart > while writing this? It feels like the way things happened was: > > - an engineer told a technical writer what he thought the flow should be > - a technical writer wrote a piece of document > - another unrelated engineer then took that document and tried to codify it > - nobody actually talked to each other and asked each other "does > this make sense" :-) > Some of that code is clearly pure garbage. For example, if I as a user > can trigger a machine check (say, I can make the CPU overheat by > looping on all CPU's), I can fool that code to say that it happened > IN_KERNEL by just making "ip" be zero. Which is quite trivially done > even if I cannot mmap the virtual NULL address: just create a new code > segment, put a "jmp 0" at the zero address, and jump to that. Voila - > guaranteed zero ip. > > Which then makes the MCE code say "oh, that must be kernel mode". For > the great (NOT!) reason that some less than giften individual (see - I > can avoid the word "moron" if I really try!) thought that "zero means > not available". Even if zero is also a possible valid value! > > That "is it in kernel mode" check also seems to not know about vm86 > mode. Let's hope those MCE's can never happen on an instruction in > vm86 mode, because then the CS check is crap too. > > In fact, it's *all* crap. Because it shouldn't check "m->cs" and > "m->ip" at all, because what matters is not which instruction caused > the MCE, but whether the *return* address is in kernel mode or not! > > Maybe the error that triggered the MCE happened in user mode, but > asynchronously, so the return address is in kernel mode. So the whole > "error_context()" thing is testing entirely the wrong thing. > > What we should worry about is whether RIPV is set, and we're returning > to kernel mode or not. THAT is the case that the kernel cares about, > because it cannot fix it up. Ok, this is just nasty but you're right. When EIPV is set, the saved CS and rIP point to the instruction that caused the MCE, otherwise not. Maybe for the return-to-kernel-space case we need to decode instructions in the #MC handler to know what happens, like kvm? Hmmm... OTOH, this probably needs verification with CPU guys but shouldn't we have raised the #MC before doing the context switch? What I'm saying is, I don't think the #MC is raised across contexts (or raised as late as when a %cr3 write happens) and getting an #MC in userspace should mean some insn in userspace caused it, and getting an #MC in the NMI handler would mean, some insn of the NMI handler caused it... Hmm, I hope I'm making sense. I'll ask around. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-18 3:33 ` Linus Torvalds 2012-05-18 16:46 ` Linus Torvalds 2012-05-18 17:40 ` Borislav Petkov @ 2012-05-21 23:32 ` Andi Kleen 2012-05-21 23:43 ` Linus Torvalds 2 siblings, 1 reply; 11+ messages in thread From: Andi Kleen @ 2012-05-21 23:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: Tony Luck, linux-kernel Linus Torvalds <torvalds@linux-foundation.org> writes: > > In fact, it's *all* crap. Because it shouldn't check "m->cs" and > "m->ip" at all, because what matters is not which instruction caused > the MCE, but whether the *return* address is in kernel mode or not! No it matters which instruction caused the error, because it's the one which saw data corruption. If that was not in kernel you can safely just return because the kernel is completely fine and the instruction can be restarted. It's just like a interrupt. In the cases where this cannot be determined the MCE code only uses the address and does not use this. > Maybe the error that triggered the MCE happened in user mode, but > asynchronously, so the return address is in kernel mode. So the whole > "error_context()" thing is testing entirely the wrong thing. EIPV==1 means the error IP is valid. The asynchronous cases never handle this. Yes the logic is rather hairy, but mainly because the whole problem is very. > That "is it in kernel mode" check also seems to not know about vm86 > mode. Let's hope those MCE's can never happen on an instruction in > vm86 mode, because then the CS check is crap too. I fixed the VM86 thing a long time ago, but it was never merged unfortunately. Not that it matters much, because the systems which have recoverable machine checks usually have far too much memory for 32bit kernels. -Andi -- ak@linux.intel.com -- Speaking for myself only ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [git pull] machine check recovery fix 2012-05-21 23:32 ` Andi Kleen @ 2012-05-21 23:43 ` Linus Torvalds 0 siblings, 0 replies; 11+ messages in thread From: Linus Torvalds @ 2012-05-21 23:43 UTC (permalink / raw) To: Andi Kleen; +Cc: Tony Luck, linux-kernel On Mon, May 21, 2012 at 4:32 PM, Andi Kleen <andi@firstfloor.org> wrote: > > No it matters which instruction caused the error, because it's the > one which saw data corruption. If that was not in kernel you > can safely just return because the kernel is completely fine > and the instruction can be restarted. It's just like a interrupt. Wrong. The MCE interface doesn't even *give* that information, so you're just full of it. There's no way to know whether the EIP that you read from the MSR happened in real mode, in kernel mode, or whatever. You need to check eflags and the code segment, neither of which - as far as I know, and certainly not as far as the current code knows - even exists. So you *have* to check the return information, not the "MCE" information. Checking the MCE data is stupid and wrong. Stop doing it, and stop making idiotic excuses for it. The only thing that actually has the relevant information is the return stack. Seriously. As such, only the RIPV bit can *possibly* be the one that you can use. Any time you use "mce->ip" for anything at all that isn't just reporting, you are just doing moronic things. Stop doing stupid things. Linus ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2012-05-21 23:44 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-05-17 17:10 [git pull] machine check recovery fix Luck, Tony 2012-05-17 22:45 ` Linus Torvalds 2012-05-18 0:14 ` Tony Luck 2012-05-18 0:25 ` Linus Torvalds 2012-05-18 2:37 ` Tony Luck 2012-05-18 3:33 ` Linus Torvalds 2012-05-18 16:46 ` Linus Torvalds 2012-05-18 16:57 ` Luck, Tony 2012-05-18 17:40 ` Borislav Petkov 2012-05-21 23:32 ` Andi Kleen 2012-05-21 23:43 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox