* Dealing with the NMI mess
@ 2015-07-23 20:21 Andy Lutomirski
2015-07-23 20:38 ` Linus Torvalds
` (3 more replies)
0 siblings, 4 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 20:21 UTC (permalink / raw)
To: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds,
Steven Rostedt, Brian Gerst
[moved to a new thread, cc list trimmed]
Hi all-
We've considered two approaches to dealing with NMIs:
1. Allow nesting. We know quite well how messy that is.
2. Forbid IRET inside NMIs. Doable but maybe not that pretty.
We haven't considered:
3. Forbid faults (other than MCE) inside NMI.
Option 3 is almost easy. There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB. #DB is easy to
fix (e.g. with my patches or Peter's patches).
What if we went all out and forbade page faults in NMI as well. There
are two reasons that I can think of that we might page fault inside an
NMI:
a) vmalloc fault. I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.
b) User memory access faults.
The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space. What if we did it differently? Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode. That would let us capture user state
*correctly*, which we currently can't really do. There's a
never-ending series of minor bugs in which we try to guess the user
register state from NMI context, and it sort of works. In
prepare_exit_to_usermode, we really truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state. (Realistically, without
CFI data, we're not going to be guaranteed to get the right state
anyway.)
To make this work, we'd have to teach NMI-from-userspace to call the
callback itself. It would look like:
prepare_exit_to_usermode() {
...
while (blah blah blah) {
if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
perf_capture_user_state();
...
}
...
}
and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunsitic sysret on exit
from NMI. (Why not? Because NMIs are still masked, and we don't want
to pay for double-IRET to unmask them, so we really want to leave IRQs
off and IRET straight back to user mode.)
There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set. In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we
capture the wrong state if we're on our way to an IRET fault), or we
could just check on entry if the flag is still set and, if so, admit
defeat.
Peter, can this be done without breaking the perf ABI? If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.
If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best. Maybe we should go one step
farther and ban all DR7 breakpoints period. Sure, it will slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.
--Andy
^ permalink raw reply [flat|nested] 85+ messages in thread* Re: Dealing with the NMI mess 2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski @ 2015-07-23 20:38 ` Linus Torvalds 2015-07-23 20:49 ` Andy Lutomirski ` (2 more replies) 2015-07-23 21:17 ` Peter Zijlstra ` (2 subsequent siblings) 3 siblings, 3 replies; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 20:38 UTC (permalink / raw) To: Andy Lutomirski Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty. > > We haven't considered: > > 3. Forbid faults (other than MCE) inside NMI. I'd really prefer #2. #3 depends on us getting many things right, and never introducing new cases in the future. #2, in contrast, seems to be fairly localized. Yes, RF is an issue, but returning to user space with RF clear doesn't really seem to be all that problematic. The point of RF is to make forward progress in the face of debug register faults, but I don't see what was wrong with the whole "disable any debug events that happen with interrupts disabled". And no, I do *not* believe that we should disable debug faults ahead of time. We should take them, disable them, and return with 'ret'. No complex "you can't put breakpoints in this region" crap, no magic rules, no subtle issues. I really think your "disallow #DB" is pointless. I think your "prevent instruction breakpoints in NMI" is wrong. Let them happen. Take them and disable them. Return with RT clear. Go on with your life. And the "take them and disable them" is really simple. No "am I in an NMI contect" thing (because that leads to the whole question about "what is NMI context"). That's not the real rule anyway. No, make it very simple and straightforward. Make the test be "uhhuh, I got a #DB in kernel mode, and interrupts were disabled - I know I'm going to return with "ret", so I'm just going to have to disable this breakpoint". Nothing clever. Nothing subtle. Nothing that needs "this range of instructions is magical". No. Just a very simple rule: if the context we return to is kernel mode and interrupts are disabled, we're using 'ret', so we cannot suppress debug faults. Did I miss something? There were a lot of emails flying around, but I *thought* I saw them all.. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:38 ` Linus Torvalds @ 2015-07-23 20:49 ` Andy Lutomirski 2015-07-23 21:08 ` Linus Torvalds 2015-07-23 20:52 ` Willy Tarreau 2015-07-23 21:20 ` Peter Zijlstra 2 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 20:49 UTC (permalink / raw) To: Linus Torvalds Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 1:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> >> 2. Forbid IRET inside NMIs. Doable but maybe not that pretty. >> >> We haven't considered: >> >> 3. Forbid faults (other than MCE) inside NMI. > > I'd really prefer #2. #3 depends on us getting many things right, and > never introducing new cases in the future. > > #2, in contrast, seems to be fairly localized. Yes, RF is an issue, > but returning to user space with RF clear doesn't really seem to be > all that problematic. > > The point of RF is to make forward progress in the face of debug > register faults, but I don't see what was wrong with the whole > "disable any debug events that happen with interrupts disabled". > > And no, I do *not* believe that we should disable debug faults ahead > of time. We should take them, disable them, and return with 'ret'. No > complex "you can't put breakpoints in this region" crap, no magic > rules, no subtle issues. > > I really think your "disallow #DB" is pointless. I think your "prevent > instruction breakpoints in NMI" is wrong. Let them happen. Take them > and disable them. Return with RT clear. Go on with your life. > > And the "take them and disable them" is really simple. No "am I in an > NMI contect" thing (because that leads to the whole question about > "what is NMI context"). That's not the real rule anyway. > > No, make it very simple and straightforward. Make the test be "uhhuh, > I got a #DB in kernel mode, and interrupts were disabled - I know I'm > going to return with "ret", so I'm just going to have to disable this > breakpoint". > > Nothing clever. Nothing subtle. Nothing that needs "this range of > instructions is magical". No. Just a very simple rule: if the context > we return to is kernel mode and interrupts are disabled, we're using > 'ret', so we cannot suppress debug faults. There are some subtleties in here. Issue A: to return with RF clear, we need to disarm the breakpoint. If it's limited to the duration of the NMI, that's easy. If not, when do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti flags during context switch may target the wrong task. Issue B: single-step exception after SYSENTER. The patches I just sent fix that, though. Issue C: #DB with invalid stack pointer (can happen due to watchpoints during SYSCALL entry or SYSRET exit). I guess we need to ban such watchpoints. Issue D: debug exception inside EFI (especially mixed-mode EFI). We can't return using RET, so we need to catch that case. These issues mostly go away if we preemptively disarm DR7 early in NMI processing and rearm it at the end. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:49 ` Andy Lutomirski @ 2015-07-23 21:08 ` Linus Torvalds 2015-07-23 21:31 ` Steven Rostedt 0 siblings, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 21:08 UTC (permalink / raw) To: Andy Lutomirski Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > Issue A: to return with RF clear, we need to disarm the breakpoint. > If it's limited to the duration of the NMI, that's easy. If not, when > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti > flags during context switch may target the wrong task. We don't re-arm it. We can entertain the notion *eventually* to do something clever, but for now, just say: stability and simplicity is more important. People can use tracepoints in interrupts-off code (they get rewritten with 'int3', that's fine), but not instruction breakpoints. > Issue C: #DB with invalid stack pointer (can happen due to watchpoints > during SYSCALL entry or SYSRET exit). I guess we need to ban such > watchpoints. .. but this is unrelated, to NMI, just "syscall is a nasty interface". Don't we already ban them? > Issue D: debug exception inside EFI (especially mixed-mode EFI). We > can't return using RET, so we need to catch that case. If NMI code calls EFI code, then it's broken. > These issues mostly go away if we preemptively disarm DR7 early in NMI > processing and rearm it at the end. I'm not *violently* opposed to that, but it's just a band-aid. It doesn't *fix* anything. You aren't protecting against random DB exceptions just because somebody put a data breakpoint on the NMI stack, for example. You still get page faults. Etc etc. So I thinkt he whole "use ret instead" is a pretty simple approach. Make that "just work". Then, if you want to play with dr7 inside NMI to make it more likely that you can have breakpoints live in irq-off situation, I think that's a magic special case. It shouldn't be part of the design. Things should work without it. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:08 ` Linus Torvalds @ 2015-07-23 21:31 ` Steven Rostedt 2015-07-23 21:46 ` Willy Tarreau 2015-07-23 21:48 ` Linus Torvalds 0 siblings, 2 replies; 85+ messages in thread From: Steven Rostedt @ 2015-07-23 21:31 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, 23 Jul 2015 14:08:59 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > Issue A: to return with RF clear, we need to disarm the breakpoint. > > If it's limited to the duration of the NMI, that's easy. If not, when > > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti > > flags during context switch may target the wrong task. > > We don't re-arm it. > Let me get this straight. The idea is in the #DB handler to detect that it was triggered in NMI context, and if so, simply disarm that breakpoint permanently, right? Nothing should be adding hw breakpoints to NMI code anyway. Sounds perfectly reasonable to me. Of course, how we tell we are in NMI brings back all the races as we had in the nesting code. We can check the per-cpu variable that is set with nmi_enter() and cleared at nmi_exit() but what happens if the breakpoint is outside those calls. We can check the stack pointer, but then we are back to userspace fooling us. Maybe add the DF trick again? -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:31 ` Steven Rostedt @ 2015-07-23 21:46 ` Willy Tarreau 2015-07-23 21:46 ` Andy Lutomirski 2015-07-23 21:48 ` Linus Torvalds 1 sibling, 1 reply; 85+ messages in thread From: Willy Tarreau @ 2015-07-23 21:46 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, Jul 23, 2015 at 05:31:05PM -0400, Steven Rostedt wrote: > On Thu, 23 Jul 2015 14:08:59 -0700 > Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > > > Issue A: to return with RF clear, we need to disarm the breakpoint. > > > If it's limited to the duration of the NMI, that's easy. If not, when > > > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti > > > flags during context switch may target the wrong task. > > > > We don't re-arm it. > > > > Let me get this straight. The idea is in the #DB handler to detect that > it was triggered in NMI context, and if so, simply disarm that > breakpoint permanently, right? > > Nothing should be adding hw breakpoints to NMI code anyway. Sounds > perfectly reasonable to me. Of course, how we tell we are in NMI > brings back all the races as we had in the nesting code. We can check > the per-cpu variable that is set with nmi_enter() and cleared at > nmi_exit() but what happens if the breakpoint is outside those calls. > We can check the stack pointer, but then we are back to userspace > fooling us. Maybe add the DF trick again? Can't the back link of the TSS tell us where we come from ? At least it should not be manipulable from user-space. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:46 ` Willy Tarreau @ 2015-07-23 21:46 ` Andy Lutomirski 2015-07-23 21:50 ` Willy Tarreau 0 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 21:46 UTC (permalink / raw) To: Willy Tarreau Cc: Steven Rostedt, Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau <w@1wt.eu> wrote: > On Thu, Jul 23, 2015 at 05:31:05PM -0400, Steven Rostedt wrote: >> On Thu, 23 Jul 2015 14:08:59 -0700 >> Linus Torvalds <torvalds@linux-foundation.org> wrote: >> >> > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> > > >> > > Issue A: to return with RF clear, we need to disarm the breakpoint. >> > > If it's limited to the duration of the NMI, that's easy. If not, when >> > > do we re-arm? New prepare_exit_to_usermode hook? Hmm, setting ti >> > > flags during context switch may target the wrong task. >> > >> > We don't re-arm it. >> > >> >> Let me get this straight. The idea is in the #DB handler to detect that >> it was triggered in NMI context, and if so, simply disarm that >> breakpoint permanently, right? >> >> Nothing should be adding hw breakpoints to NMI code anyway. Sounds >> perfectly reasonable to me. Of course, how we tell we are in NMI >> brings back all the races as we had in the nesting code. We can check >> the per-cpu variable that is set with nmi_enter() and cleared at >> nmi_exit() but what happens if the breakpoint is outside those calls. >> We can check the stack pointer, but then we are back to userspace >> fooling us. Maybe add the DF trick again? > > Can't the back link of the TSS tell us where we come from ? At least > it should not be manipulable from user-space. Not on 64-bit -- there are no tasks :) --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:46 ` Andy Lutomirski @ 2015-07-23 21:50 ` Willy Tarreau 0 siblings, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-23 21:50 UTC (permalink / raw) To: Andy Lutomirski Cc: Steven Rostedt, Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, Jul 23, 2015 at 02:46:49PM -0700, Andy Lutomirski wrote: > On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau <w@1wt.eu> wrote: > > Can't the back link of the TSS tell us where we come from ? At least > > it should not be manipulable from user-space. > > Not on 64-bit -- there are no tasks :) Ah crap, sorry for the noise then! Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:31 ` Steven Rostedt 2015-07-23 21:46 ` Willy Tarreau @ 2015-07-23 21:48 ` Linus Torvalds 2015-07-23 21:50 ` Andy Lutomirski 1 sibling, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 21:48 UTC (permalink / raw) To: Steven Rostedt Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > Let me get this straight. The idea is in the #DB handler to detect that > it was triggered in NMI context, and if so, simply disarm that > breakpoint permanently, right? No, for simplicity, I'd make it cover not just NMI code, but any "kernel code with interrupts disabled". Because that's the test we'd use for "use ret instead of iret". And that wider test is exactly because it's so damn hard to get the exact instruction boundaries right. Let's *not* go down the path (again) of having to get the whole %rip range and "magic stack pointer values" etc. Make it simple and completely unambiguous. The rule really would be: - if we return to kernel space and interrupts are disabled, we will use "ret" rather than "iret" Hard rule. Simple. Straightforward. No random %rip values. No random %rsp values. NO CRAP. - but because we use "ret" rather than "iret" we can't get RF semantics, it means that #DB is special. RF is supposed to make us make forward progress So for that reason, #DB just says "if the breakpoint happened during that interrupts-ff reghion, I will clear %dr7 to guarantee forward progress" So those would be the two main rules. Very simple, and avoiding all nasty cases. Now, I'd be willing to then hide the "oops, we clear dr7 very agrressively" issue by having a few additional _heuristics_. But I call them "heuristics" because unlike the current NMI nesting games, they aren't about core stability. They are about "ok, maybe somebody wants to trigger those faults, and we'll be _nice_ and try to make it easy for them", but nothing more. So for example, if that "#DB clears %dr7" happened, it sounds easy to set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a cached value, so that if we disabled things because of some user stack trace access, it will be re-enabled by the time we return to user space. I think that sounds reasonable, but it's not something the core low-level entry x86 assembly code needs to even care about. It's not that level of "core", it's just being polite. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:48 ` Linus Torvalds @ 2015-07-23 21:50 ` Andy Lutomirski 2015-07-23 21:59 ` Linus Torvalds 0 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 21:50 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, Jul 23, 2015 at 2:48 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote: >> >> Let me get this straight. The idea is in the #DB handler to detect that >> it was triggered in NMI context, and if so, simply disarm that >> breakpoint permanently, right? > > No, for simplicity, I'd make it cover not just NMI code, but any > "kernel code with interrupts disabled". > > Because that's the test we'd use for "use ret instead of iret". > > And that wider test is exactly because it's so damn hard to get the > exact instruction boundaries right. Let's *not* go down the path > (again) of having to get the whole %rip range and "magic stack pointer > values" etc. > > Make it simple and completely unambiguous. The rule really would be: > > - if we return to kernel space and interrupts are disabled, we will > use "ret" rather than "iret" > > Hard rule. Simple. Straightforward. No random %rip values. No > random %rsp values. NO CRAP. > > - but because we use "ret" rather than "iret" we can't get RF > semantics, it means that #DB is special. RF is supposed to make us > make forward progress > > So for that reason, #DB just says "if the breakpoint happened > during that interrupts-ff reghion, I will clear %dr7 to guarantee > forward progress" What if we relax it slightly: "if the breakpoint happened during that interrupts-off region, I will clear all *kernel breakpoints* in %dr7 to guarantee forward progress"? Watchpoints don't need RF to make forward progress, and, by leaving watchpoints alone, we avoid breaking gdb. > > So those would be the two main rules. Very simple, and avoiding all nasty cases. > > Now, I'd be willing to then hide the "oops, we clear dr7 very > agrressively" issue by having a few additional _heuristics_. But I > call them "heuristics" because unlike the current NMI nesting games, > they aren't about core stability. They are about "ok, maybe somebody > wants to trigger those faults, and we'll be _nice_ and try to make it > easy for them", but nothing more. > > So for example, if that "#DB clears %dr7" happened, it sounds easy to > set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a > cached value, so that if we disabled things because of some user stack > trace access, it will be re-enabled by the time we return to user > space. I think that sounds reasonable, but it's not something the core > low-level entry x86 assembly code needs to even care about. It's not > that level of "core", it's just being polite. Once we limit it to instruction breakpoints, I don't think re-enabling before returning to userspace matters. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:50 ` Andy Lutomirski @ 2015-07-23 21:59 ` Linus Torvalds 2015-07-24 8:13 ` Peter Zijlstra 0 siblings, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 21:59 UTC (permalink / raw) To: Andy Lutomirski Cc: Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Brian Gerst On Thu, Jul 23, 2015 at 2:50 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > What if we relax it slightly: "if the breakpoint happened during that > interrupts-off region, I will clear all *kernel breakpoints* in %dr7 > to guarantee forward progress"? > > Watchpoints don't need RF to make forward progress, and, by leaving > watchpoints alone, we avoid breaking gdb. Hmmm. I thought watchpoints were "before the instruction" too, but that's just because I haven't used them in ages, and I didn't remember the details. I just looked it up. You're right - the memory watchpoints trigger after the instruction has executed, so RF isn't an issue. So yes, the only issue is instruction breakpoints, and those are the only ones we need to clear. And that makes it really easy. So yes, I agree. We only need to clear all kernel breakpoints. So we don't even need that _TIF_USER_WORK_MASK thing, because user space isn't setting kernel code breakpoints, it's just kgdb. Sounds good to me. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:59 ` Linus Torvalds @ 2015-07-24 8:13 ` Peter Zijlstra 2015-07-24 9:02 ` Willy Tarreau ` (2 more replies) 0 siblings, 3 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 8:13 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote: > Hmmm. I thought watchpoints were "before the instruction" too, but > that's just because I haven't used them in ages, and I didn't remember > the details. I just looked it up. > > You're right - the memory watchpoints trigger after the instruction > has executed, so RF isn't an issue. So yes, the only issue is > instruction breakpoints, and those are the only ones we need to clear. > > And that makes it really easy. > > So yes, I agree. We only need to clear all kernel breakpoints. But but but, we can access userspace with !IF, imagine someone doing: local_irq_disable(); copy_from_user_inatomic(); and as luck would have it, there's a breakpoint on the user memory we just touched. And we go and disable a user breakpoint. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 8:13 ` Peter Zijlstra @ 2015-07-24 9:02 ` Willy Tarreau 2015-07-24 11:58 ` Steven Rostedt 2015-07-24 15:48 ` Andy Lutomirski 2 siblings, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 9:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andy Lutomirski, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 10:13:26AM +0200, Peter Zijlstra wrote: > On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote: > > Hmmm. I thought watchpoints were "before the instruction" too, but > > that's just because I haven't used them in ages, and I didn't remember > > the details. I just looked it up. > > > > You're right - the memory watchpoints trigger after the instruction > > has executed, so RF isn't an issue. So yes, the only issue is > > instruction breakpoints, and those are the only ones we need to clear. > > > > And that makes it really easy. > > > > So yes, I agree. We only need to clear all kernel breakpoints. > > But but but, we can access userspace with !IF, imagine someone doing: > > local_irq_disable(); > copy_from_user_inatomic(); > > and as luck would have it, there's a breakpoint on the user memory we > just touched. And we go and disable a user breakpoint. Then shouldn't we use !IF && RSP matches NMI's stack ? User-space cannot control the two at once. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 8:13 ` Peter Zijlstra 2015-07-24 9:02 ` Willy Tarreau @ 2015-07-24 11:58 ` Steven Rostedt 2015-07-24 12:43 ` Peter Zijlstra 2015-07-24 15:48 ` Andy Lutomirski 2 siblings, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 11:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 10:13:26 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote: > > Hmmm. I thought watchpoints were "before the instruction" too, but > > that's just because I haven't used them in ages, and I didn't remember > > the details. I just looked it up. > > > > You're right - the memory watchpoints trigger after the instruction > > has executed, so RF isn't an issue. So yes, the only issue is > > instruction breakpoints, and those are the only ones we need to clear. > > > > And that makes it really easy. > > > > So yes, I agree. We only need to clear all kernel breakpoints. > > But but but, we can access userspace with !IF, imagine someone doing: > > local_irq_disable(); > copy_from_user_inatomic(); > > and as luck would have it, there's a breakpoint on the user memory we > just touched. And we go and disable a user breakpoint. Where does the kernel do that to user text? I would think that user data would only have watchpoints, and Andy and Linus said that those would not be disabled (I'm guessing because they don't have the RF flag set, and forward progress can proceed). If the kernel does the above to user code and there's a breakpoint there, would it even trigger? I'm not too familiar with how to use hw breakpoints, but I'm guessing (correct me if I'm wrong) that breakpoints on code that trigger when executed, but watchpoints on data trigger when accessed. Then copy_from_user_inatomic() would only trigger on watchpoints (it's not executing that code, at least I hope it isn't!), and those wont bother us. Or am I totally off base here? -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 11:58 ` Steven Rostedt @ 2015-07-24 12:43 ` Peter Zijlstra 2015-07-24 13:03 ` Steven Rostedt 0 siblings, 1 reply; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 12:43 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 07:58:41AM -0400, Steven Rostedt wrote: > On Fri, 24 Jul 2015 10:13:26 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote: > > > Hmmm. I thought watchpoints were "before the instruction" too, but > > > that's just because I haven't used them in ages, and I didn't remember > > > the details. I just looked it up. > > > > > > You're right - the memory watchpoints trigger after the instruction > > > has executed, so RF isn't an issue. So yes, the only issue is > > > instruction breakpoints, and those are the only ones we need to clear. > > > > > > And that makes it really easy. > > > > > > So yes, I agree. We only need to clear all kernel breakpoints. > > > > But but but, we can access userspace with !IF, imagine someone doing: > > > > local_irq_disable(); > > copy_from_user_inatomic(); > > > > and as luck would have it, there's a breakpoint on the user memory we > > just touched. And we go and disable a user breakpoint. > > Where does the kernel do that to user text? I would think that user > data would only have watchpoints, and Andy and Linus said that those > would not be disabled (I'm guessing because they don't have the RF flag > set, and forward progress can proceed). If the kernel does the above to > user code and there's a breakpoint there, would it even trigger? > > I'm not too familiar with how to use hw breakpoints, but I'm guessing > (correct me if I'm wrong) that breakpoints on code that trigger when > executed, but watchpoints on data trigger when accessed. Then > copy_from_user_inatomic() would only trigger on watchpoints (it's not > executing that code, at least I hope it isn't!), and those wont bother > us. These things can be: RW, W, X. Sure, hitting a user X watchpoint is going to be 'interesting', but its fairly easy to hit a RW one. Just watch an on-stack variable and get perf to copy a huge chunk of stack (like it does for the dwarf stuff). ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 12:43 ` Peter Zijlstra @ 2015-07-24 13:03 ` Steven Rostedt 2015-07-24 13:21 ` Willy Tarreau 0 siblings, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 13:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 14:43:04 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > > I'm not too familiar with how to use hw breakpoints, but I'm guessing > > (correct me if I'm wrong) that breakpoints on code that trigger when > > executed, but watchpoints on data trigger when accessed. Then > > copy_from_user_inatomic() would only trigger on watchpoints (it's not > > executing that code, at least I hope it isn't!), and those wont bother > > us. > > These things can be: RW, W, X. > > Sure, hitting a user X watchpoint is going to be 'interesting', but its > fairly easy to hit a RW one. But do we care if we do hit one? The return from the #DB handler can use a RET. Right? -- Steve > > Just watch an on-stack variable and get perf to copy a huge chunk of > stack (like it does for the dwarf stuff). ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 13:03 ` Steven Rostedt @ 2015-07-24 13:21 ` Willy Tarreau 2015-07-24 13:30 ` Peter Zijlstra 2015-07-24 14:31 ` Steven Rostedt 0 siblings, 2 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 13:21 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 09:03:42AM -0400, Steven Rostedt wrote: > On Fri, 24 Jul 2015 14:43:04 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > I'm not too familiar with how to use hw breakpoints, but I'm guessing > > > (correct me if I'm wrong) that breakpoints on code that trigger when > > > executed, but watchpoints on data trigger when accessed. Then > > > copy_from_user_inatomic() would only trigger on watchpoints (it's not > > > executing that code, at least I hope it isn't!), and those wont bother > > > us. > > > > These things can be: RW, W, X. > > > > Sure, hitting a user X watchpoint is going to be 'interesting', but its > > fairly easy to hit a RW one. > > But do we care if we do hit one? The return from the #DB handler can > use a RET. Right? My understanding is that by using RET we can't set the RF flag and #DB will immediately strike again when the operation is attempted again. Thus we have to completely disable the breakpoints on leaving after the first one strikes, resulting in some userland breakpoints being missed. Maybe it can be accepted as a limitation when perf is running. I don't know if the output of perf is that relevant when a debugger is present BTW. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 13:21 ` Willy Tarreau @ 2015-07-24 13:30 ` Peter Zijlstra 2015-07-24 13:33 ` Peter Zijlstra 2015-07-24 14:31 ` Steven Rostedt 1 sibling, 1 reply; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 13:30 UTC (permalink / raw) To: Willy Tarreau Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 03:21:28PM +0200, Willy Tarreau wrote: > On Fri, Jul 24, 2015 at 09:03:42AM -0400, Steven Rostedt wrote: > > On Fri, 24 Jul 2015 14:43:04 +0200 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > I'm not too familiar with how to use hw breakpoints, but I'm guessing > > > > (correct me if I'm wrong) that breakpoints on code that trigger when > > > > executed, but watchpoints on data trigger when accessed. Then > > > > copy_from_user_inatomic() would only trigger on watchpoints (it's not > > > > executing that code, at least I hope it isn't!), and those wont bother > > > > us. > > > > > > These things can be: RW, W, X. > > > > > > Sure, hitting a user X watchpoint is going to be 'interesting', but its > > > fairly easy to hit a RW one. > > > > But do we care if we do hit one? The return from the #DB handler can > > use a RET. Right? Look at do_debug(), it has lovely bits like: preempt_conditional_sti(); in it, we do _NOT_ want to be re-enabling interrupts if we're called from an !IF context, that'd be _bad_. > My understanding is that by using RET we can't set the RF flag and #DB > will immediately strike again when the operation is attempted again. Thus > we have to completely disable the breakpoints on leaving after the first > one strikes, resulting in some userland breakpoints being missed. Maybe > it can be accepted as a limitation when perf is running. I don't know if > the output of perf is that relevant when a debugger is present BTW. The patch I posted will re-enable the breakpoints before returning to userspace. So userspace will only 'miss' events generated by the kernel. Missing reads from the kernel is not a problem -- and maybe even expected, but certainly unavoidable. Missing updates from the kernel might be a problem, you'd get a variable change content even though you have a W watchpoint on it, that'd be surprising. Then again, I suppose we can argue the variable changed through another mapping and watchpoints work on the virtual address, so tough cookies or somesuch -- the kernel could in fact do this on highmem kernel anyway. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 13:30 ` Peter Zijlstra @ 2015-07-24 13:33 ` Peter Zijlstra 0 siblings, 0 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 13:33 UTC (permalink / raw) To: Willy Tarreau Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 03:30:13PM +0200, Peter Zijlstra wrote: > > > But do we care if we do hit one? The return from the #DB handler can > > > use a RET. Right? > > Look at do_debug(), it has lovely bits like: > > preempt_conditional_sti(); > > in it, we do _NOT_ want to be re-enabling interrupts if we're called > from an !IF context, that'd be _bad_. Ah, I forgot the conditional thing was the STI depending on regs->flags & IF.. In any case, better safe than sorry and simply not do #DB ever if !IF. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 13:21 ` Willy Tarreau 2015-07-24 13:30 ` Peter Zijlstra @ 2015-07-24 14:31 ` Steven Rostedt 2015-07-24 14:59 ` Willy Tarreau 1 sibling, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 14:31 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 15:21:28 +0200 Willy Tarreau <w@1wt.eu> wrote: > My understanding is that by using RET we can't set the RF flag and #DB But the RF flag is only set for instruction (executing) breakpoints. It is not set for data (RW) ones. -- Steve > will immediately strike again when the operation is attempted again. Thus > we have to completely disable the breakpoints on leaving after the first > one strikes, resulting in some userland breakpoints being missed. Maybe > it can be accepted as a limitation when perf is running. I don't know if > the output of perf is that relevant when a debugger is present BTW. > > Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 14:31 ` Steven Rostedt @ 2015-07-24 14:59 ` Willy Tarreau 2015-07-24 15:16 ` Steven Rostedt 0 siblings, 1 reply; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 14:59 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote: > On Fri, 24 Jul 2015 15:21:28 +0200 > Willy Tarreau <w@1wt.eu> wrote: > > > My understanding is that by using RET we can't set the RF flag and #DB > > But the RF flag is only set for instruction (executing) breakpoints. It > is not set for data (RW) ones. True but these also are the most complicated to deal with. The data accesses can always be emulated (not what I'm suggesting here) while instructions are much harder to emulate. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 14:59 ` Willy Tarreau @ 2015-07-24 15:16 ` Steven Rostedt 2015-07-24 15:26 ` Willy Tarreau 0 siblings, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 15:16 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 16:59:01 +0200 Willy Tarreau <w@1wt.eu> wrote: > On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote: > > On Fri, 24 Jul 2015 15:21:28 +0200 > > Willy Tarreau <w@1wt.eu> wrote: > > > > > My understanding is that by using RET we can't set the RF flag and #DB > > > > But the RF flag is only set for instruction (executing) breakpoints. It > > is not set for data (RW) ones. > > True but these also are the most complicated to deal with. The data > accesses can always be emulated (not what I'm suggesting here) while > instructions are much harder to emulate. The point is, if we trigger a #DB on an instruction breakpoint while !IF, then we simply disable that breakpoint and do the RET. What emulation is needed? -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:16 ` Steven Rostedt @ 2015-07-24 15:26 ` Willy Tarreau 2015-07-24 15:30 ` Peter Zijlstra 2015-07-24 15:34 ` Steven Rostedt 0 siblings, 2 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 15:26 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 11:16:21AM -0400, Steven Rostedt wrote: > On Fri, 24 Jul 2015 16:59:01 +0200 > Willy Tarreau <w@1wt.eu> wrote: > > > On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote: > > > On Fri, 24 Jul 2015 15:21:28 +0200 > > > Willy Tarreau <w@1wt.eu> wrote: > > > > > > > My understanding is that by using RET we can't set the RF flag and #DB > > > > > > But the RF flag is only set for instruction (executing) breakpoints. It > > > is not set for data (RW) ones. > > > > True but these also are the most complicated to deal with. The data > > accesses can always be emulated (not what I'm suggesting here) while > > instructions are much harder to emulate. > > The point is, if we trigger a #DB on an instruction breakpoint > while !IF, then we simply disable that breakpoint and do the RET. Yes but the breakpoint remains disabled then. Or I'm missing something. > What emulation is needed? I was speaking about redoing the operation with BP disabled before re-enabling it. But that's not the point here anyway. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:26 ` Willy Tarreau @ 2015-07-24 15:30 ` Peter Zijlstra 2015-07-24 15:33 ` Willy Tarreau 2015-07-24 18:29 ` Linus Torvalds 2015-07-24 15:34 ` Steven Rostedt 1 sibling, 2 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 15:30 UTC (permalink / raw) To: Willy Tarreau Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote: > > > > The point is, if we trigger a #DB on an instruction breakpoint > > while !IF, then we simply disable that breakpoint and do the RET. > > Yes but the breakpoint remains disabled then. Or I'm missing > something. http://marc.info/?l=linux-kernel&m=143773601130974 We re-enable before going back to userspace. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:30 ` Peter Zijlstra @ 2015-07-24 15:33 ` Willy Tarreau 2015-07-24 18:29 ` Linus Torvalds 1 sibling, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 15:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 05:30:54PM +0200, Peter Zijlstra wrote: > On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote: > > > > > > The point is, if we trigger a #DB on an instruction breakpoint > > > while !IF, then we simply disable that breakpoint and do the RET. > > > > Yes but the breakpoint remains disabled then. Or I'm missing > > something. > > http://marc.info/?l=linux-kernel&m=143773601130974 > > We re-enable before going back to userspace. Ah OK thanks Peter. I'm sorry if I'm adding more noise than anything here, it's hard to follow and it becomes a bit complex. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:30 ` Peter Zijlstra 2015-07-24 15:33 ` Willy Tarreau @ 2015-07-24 18:29 ` Linus Torvalds 2015-07-24 18:41 ` Linus Torvalds 2015-07-24 19:55 ` Peter Zijlstra 1 sibling, 2 replies; 85+ messages in thread From: Linus Torvalds @ 2015-07-24 18:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 8:30 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote: >> > >> > The point is, if we trigger a #DB on an instruction breakpoint >> > while !IF, then we simply disable that breakpoint and do the RET. >> >> Yes but the breakpoint remains disabled then. Or I'm missing >> something. > > http://marc.info/?l=linux-kernel&m=143773601130974 > > We re-enable before going back to userspace. Actually, Andy had a good argument that we don't even need this. We just don't ever need to disable data breakpoints. Even if we end up doing cli(); copy_from_user_inatomic(); that actually works fine. If there are data breakpoints, we will have (a) things will run slow as hell anyway. Intel CPU's slow down to a relative crawl. (b) let's say we have a data breakpoint on the data we're reading above (c) we take a #DB fault after the instruction has completed, so we make forward progress even if we return with RF clear (d) even if the data breakpoint is unaligned and triggers multiple times, it's going to be a "small number" of multiple times. And see (a). This never happens in practice, and the much bigger slowdown is how data breakpoints tend to slow things down in general. (e) yes, the string instructions may hit the data breakpoint multilpe times for the "same" instruction, but the forward progress part is still true even for the string instructions. In fact, it's actually likely <i>more</i> true for string instructions, because they are documented to be less exact, and may trigger the data watchpoint only after a "group of iterations". so I think we just leave data breakpoint alone. The only debug conditions that are *faults* rather than traps are the instruction breakpoints, and we can detect and disable those by just saying "oh, we're in kernel mode". So in the #DB handler, we would basically only clear instruction breakpoints, and only when they trigger. If we have a data breakpoint that triggers (even in kernel mode, and with interrupts disabled), let it trigger and return with "ret" anyway. No biggie. (Ok, so the "General detect fault" is also a fault rather than a trap, but that's the "write to debug registers when it's disabled" thing, very different) Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 18:29 ` Linus Torvalds @ 2015-07-24 18:41 ` Linus Torvalds 2015-07-24 19:05 ` Steven Rostedt 2015-07-24 19:55 ` Peter Zijlstra 1 sibling, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-24 18:41 UTC (permalink / raw) To: Peter Zijlstra Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So in the #DB handler, we would basically only clear instruction > breakpoints, and only when they trigger. If we have a data breakpoint > that triggers (even in kernel mode, and with interrupts disabled), let > it trigger and return with "ret" anyway. No biggie. So we'd not only look at "which breakpoint triggered", we'd also look at the actual debug register and check that "R/Wn == 0", and only disable it for that case. So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether it triggerd (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7) is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7. Something like unsigned long mask = 0; unsigned int dr6 = debug_read(6); unsigned int dr7 = debug_read(7) int i; for (i = 0; i < 4; i++) { if ((dr6 >> i) & 1) { if (!((dr7 >> (4*i+16)) & 3)) mask |= 3 << (i*2); } } if (mask) debug_write(dr7 & ~mask, 7); (yeah, I could easily have screwed that up) But the above should only clear bits in dr7 that are actually associated with the instruction breakpoint that triggered, and since it's a _kernel_ instruction breakpoint, not a user one, we can clear it and forget it. No need to re-enable at all. Hmm? Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 18:41 ` Linus Torvalds @ 2015-07-24 19:05 ` Steven Rostedt 0 siblings, 0 replies; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 19:05 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Willy Tarreau, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 11:41:55 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So in the #DB handler, we would basically only clear instruction > > breakpoints, and only when they trigger. If we have a data breakpoint > > that triggers (even in kernel mode, and with interrupts disabled), let > > it trigger and return with "ret" anyway. No biggie. > > So we'd not only look at "which breakpoint triggered", we'd also look > at the actual debug register and check that "R/Wn == 0", and only > disable it for that case. > > So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether > it triggerd (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7) > is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7. > > Something like > > unsigned long mask = 0; > unsigned int dr6 = debug_read(6); > unsigned int dr7 = debug_read(7) > int i; > > for (i = 0; i < 4; i++) { > if ((dr6 >> i) & 1) { > if (!((dr7 >> (4*i+16)) & 3)) > mask |= 3 << (i*2); > } > } > > if (mask) > debug_write(dr7 & ~mask, 7); Macros would be nice for readability. for (i = 0; i < 4; i++) { if ((dr6 >> i) & 1) { int shift = DR_CONTROL_SIZE * i + DR_CONTROL_SHIFT; if (!((dr7 >> shift) & DR_RW_READ)) mask |= (DR_LOCAL_ENABLE|DR_GLOBAL_ENABLE) << (i * DR_ENABLE_SIZE); } } -- Steve > > (yeah, I could easily have screwed that up) > > But the above should only clear bits in dr7 that are actually > associated with the instruction breakpoint that triggered, and since > it's a _kernel_ instruction breakpoint, not a user one, we can clear > it and forget it. No need to re-enable at all. > > Hmm? > > Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 18:29 ` Linus Torvalds 2015-07-24 18:41 ` Linus Torvalds @ 2015-07-24 19:55 ` Peter Zijlstra 2015-07-24 20:22 ` Linus Torvalds 1 sibling, 1 reply; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 19:55 UTC (permalink / raw) To: Linus Torvalds Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 11:29:29AM -0700, Linus Torvalds wrote: > On Fri, Jul 24, 2015 at 8:30 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote: > >> > > >> > The point is, if we trigger a #DB on an instruction breakpoint > >> > while !IF, then we simply disable that breakpoint and do the RET. > >> > >> Yes but the breakpoint remains disabled then. Or I'm missing > >> something. > > > > http://marc.info/?l=linux-kernel&m=143773601130974 > > > > We re-enable before going back to userspace. > > Actually, Andy had a good argument that we don't even need this. > > We just don't ever need to disable data breakpoints. Even if we end up doing > > cli(); > copy_from_user_inatomic(); > > that actually works fine. If there are data breakpoints, we will have I worry that we'll end up running the do_debug() handlers from effective NMI context. The NMI might have preempted locks which these handlers require etc.. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 19:55 ` Peter Zijlstra @ 2015-07-24 20:22 ` Linus Torvalds 2015-07-24 20:51 ` Peter Zijlstra 0 siblings, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-24 20:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > I worry that we'll end up running the do_debug() handlers from effective > NMI context. > > The NMI might have preempted locks which these handlers require etc.. If #DB takes any locks like that, then #DB is broken. Pretty much by definition, a data breakpoint can happen on pretty much absolutely any code. This is in no way NMI-specific as far as I can tell. Do we really take locks in the #DB handler? Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 20:22 ` Linus Torvalds @ 2015-07-24 20:51 ` Peter Zijlstra 2015-07-24 21:07 ` Steven Rostedt ` (2 more replies) 0 siblings, 3 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 20:51 UTC (permalink / raw) To: Linus Torvalds Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 01:22:11PM -0700, Linus Torvalds wrote: > On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > > > I worry that we'll end up running the do_debug() handlers from effective > > NMI context. > > > > The NMI might have preempted locks which these handlers require etc.. > > If #DB takes any locks like that, then #DB is broken. > > Pretty much by definition, a data breakpoint can happen on pretty much > absolutely any code. This is in no way NMI-specific as far as I can > tell. > > Do we really take locks in the #DB handler? do_debug() send_sigtrap() force_sig_info() spin_lock_irqsave() Now, I don't pretend to understand the condition before send_sigtrap(), so it _might_ be ok, but it sure as heck could do with a comment. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 20:51 ` Peter Zijlstra @ 2015-07-24 21:07 ` Steven Rostedt 2015-07-24 21:08 ` Andy Lutomirski 2015-07-24 23:53 ` Linus Torvalds 2 siblings, 0 replies; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 21:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Willy Tarreau, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 22:51:19 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > > Do we really take locks in the #DB handler? > > do_debug() > send_sigtrap() > force_sig_info() > spin_lock_irqsave() > > Now, I don't pretend to understand the condition before send_sigtrap(), > so it _might_ be ok, but it sure as heck could do with a comment. Or that force_sig_info() in send_sigtrap() looks like it can easily be change to use an irq work queue. -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 20:51 ` Peter Zijlstra 2015-07-24 21:07 ` Steven Rostedt @ 2015-07-24 21:08 ` Andy Lutomirski 2015-07-30 15:41 ` Paolo Bonzini 2015-07-24 23:53 ` Linus Torvalds 2 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-24 21:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 1:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Jul 24, 2015 at 01:22:11PM -0700, Linus Torvalds wrote: >> On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> > >> > I worry that we'll end up running the do_debug() handlers from effective >> > NMI context. >> > >> > The NMI might have preempted locks which these handlers require etc.. >> >> If #DB takes any locks like that, then #DB is broken. >> >> Pretty much by definition, a data breakpoint can happen on pretty much >> absolutely any code. This is in no way NMI-specific as far as I can >> tell. >> >> Do we really take locks in the #DB handler? > > do_debug() > send_sigtrap() > force_sig_info() > spin_lock_irqsave() > > Now, I don't pretend to understand the condition before send_sigtrap(), > so it _might_ be ok, but it sure as heck could do with a comment. Let's try to decode it. user_icebp is set if int $0x01 happens, except it isn't because user code can't actually do that -- it'll cause #GP instead. user_icebp is also set if the user has a bloody in-circuit emulator, given the name. But who on Earth has one of those on a system new enough to run Linux and, even if they have one, why on Earth are they using it to send SIGTRAP. In any event, user_icebp is only set if user_mode(regs), so it's safe locking-wise. But please let's delete it. Otherwise, we do send_sigtrap if we got a single-step exception from user mode (because we suppress single-step exceptions from kernel mode a couple lines above, but we should really BUG on those except for the single case of SYSENTER with TF set) or if we get a breakpoint exception that wasn't eaten by perf. For *#&!'s sake. we should rewrite this pile of crap. // before kprobes and notify_die #if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION) if (!user_mode(regs) && regs->ip == sysenter_target) { fix it up and return; } notify_die, etc. preempt_conditional_sti(regs); do_trap(X86_TRAP_DB, SIGTRAP, "debug", regs, error_code, NULL); preempt_conditional_cli(regs); except we should do something to disallow fixup_exception here. Or we could open-code if(user_mode) send_sigtrap() else die() here. I really don't think that we should be sending signals to userspace due to user address watchpoints that hit in kernel mode. Or, if we do think we should send signals for those, then, as Steven said, we should make that explicit and use IRQ work for that. As it stands, this is probably an exploitable DoS -- just point a watchpoint down the stack a little bit from yourself and call raise(). --Andy -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 21:08 ` Andy Lutomirski @ 2015-07-30 15:41 ` Paolo Bonzini 2015-07-30 21:22 ` Andy Lutomirski 0 siblings, 1 reply; 85+ messages in thread From: Paolo Bonzini @ 2015-07-30 15:41 UTC (permalink / raw) To: Andy Lutomirski, Peter Zijlstra Cc: Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On 24/07/2015 23:08, Andy Lutomirski wrote: > user_icebp is set if int $0x01 happens, except it isn't because user > code can't actually do that -- it'll cause #GP instead. > > user_icebp is also set if the user has a bloody in-circuit emulator, > given the name. But who on Earth has one of those on a system new > enough to run Linux and, even if they have one, why on Earth are they > using it to send SIGTRAP. You do not need either "int $0x01" or an ICE to set user_icebp = 1. You can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB instead of #BP. The historical name is ICEBP because in-circuit emulators used it for software breakpoints, just like your usual debugger used 0xcc aka int3. And just like 0xcc it's unprivileged, so you can actually get a SIGTRAP with asm(".byte 0xf1"). So... > In any event, user_icebp is only set if user_mode(regs), so it's safe > locking-wise. But please let's delete it. ... it's safe, but it has some use (!). Paolo ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-30 15:41 ` Paolo Bonzini @ 2015-07-30 21:22 ` Andy Lutomirski 2015-07-30 21:58 ` Brian Gerst ` (2 more replies) 0 siblings, 3 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-07-30 21:22 UTC (permalink / raw) To: Paolo Bonzini Cc: Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > On 24/07/2015 23:08, Andy Lutomirski wrote: >> user_icebp is set if int $0x01 happens, except it isn't because user >> code can't actually do that -- it'll cause #GP instead. >> >> user_icebp is also set if the user has a bloody in-circuit emulator, >> given the name. But who on Earth has one of those on a system new >> enough to run Linux and, even if they have one, why on Earth are they >> using it to send SIGTRAP. > > You do not need either "int $0x01" or an ICE to set user_icebp = 1. You > can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB > instead of #BP. Great. There's an opcode that invokes an interrupt gate that's not marked as allowing unprivileged access, and that opcode doesn't appear in the SDM. It appears in the APM opcode map with no explanation at all. Thanks, CPU vendors. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-30 21:22 ` Andy Lutomirski @ 2015-07-30 21:58 ` Brian Gerst 2015-07-30 22:59 ` Thomas Gleixner 2015-07-31 4:22 ` Borislav Petkov 2 siblings, 0 replies; 85+ messages in thread From: Brian Gerst @ 2015-07-30 21:58 UTC (permalink / raw) To: Andy Lutomirski Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner On Thu, Jul 30, 2015 at 5:22 PM, Andy Lutomirski <luto@amacapital.net> wrote: > On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: >> >> >> On 24/07/2015 23:08, Andy Lutomirski wrote: >>> user_icebp is set if int $0x01 happens, except it isn't because user >>> code can't actually do that -- it'll cause #GP instead. >>> >>> user_icebp is also set if the user has a bloody in-circuit emulator, >>> given the name. But who on Earth has one of those on a system new >>> enough to run Linux and, even if they have one, why on Earth are they >>> using it to send SIGTRAP. >> >> You do not need either "int $0x01" or an ICE to set user_icebp = 1. You >> can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB >> instead of #BP. > > Great. There's an opcode that invokes an interrupt gate that's not > marked as allowing unprivileged access, and that opcode doesn't appear > in the SDM. It appears in the APM opcode map with no explanation at > all. > > Thanks, CPU vendors. > > --Andy Some Windows programs (running in Wine) use this opcode for anti-debugging code. See commit a1e80fafc9f0742a1776a0490258cb64912411b0. -- Brian Gerst ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-30 21:22 ` Andy Lutomirski 2015-07-30 21:58 ` Brian Gerst @ 2015-07-30 22:59 ` Thomas Gleixner 2015-07-31 4:22 ` Borislav Petkov 2 siblings, 0 replies; 85+ messages in thread From: Thomas Gleixner @ 2015-07-30 22:59 UTC (permalink / raw) To: Andy Lutomirski Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Brian Gerst On Thu, 30 Jul 2015, Andy Lutomirski wrote: > On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > > > > On 24/07/2015 23:08, Andy Lutomirski wrote: > >> user_icebp is set if int $0x01 happens, except it isn't because user > >> code can't actually do that -- it'll cause #GP instead. > >> > >> user_icebp is also set if the user has a bloody in-circuit emulator, > >> given the name. But who on Earth has one of those on a system new > >> enough to run Linux and, even if they have one, why on Earth are they > >> using it to send SIGTRAP. > > > > You do not need either "int $0x01" or an ICE to set user_icebp = 1. You > > can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB > > instead of #BP. > > Great. There's an opcode that invokes an interrupt gate that's not > marked as allowing unprivileged access, and that opcode doesn't appear > in the SDM. It appears in the APM opcode map with no explanation at > all. The only SDM reference I found is: "The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64 and IA-32 architectures. These opcodes, even though undefined, do not generate an invalid opcode exception." D6 is actually something useful: if (carry flag set) AL = FF else AL = 0 It's been there since i386. It has been conveniant for return code magic from ASM to C. I haven't thought of it for at least a decade :) So all we need to worry about is F1, but thats bad enough :( Thanks, tglx ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-30 21:22 ` Andy Lutomirski 2015-07-30 21:58 ` Brian Gerst 2015-07-30 22:59 ` Thomas Gleixner @ 2015-07-31 4:22 ` Borislav Petkov 2015-07-31 5:11 ` Andy Lutomirski 2 siblings, 1 reply; 85+ messages in thread From: Borislav Petkov @ 2015-07-31 4:22 UTC (permalink / raw) To: Andy Lutomirski Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Thu, Jul 30, 2015 at 02:22:06PM -0700, Andy Lutomirski wrote: > Great. There's an opcode that invokes an interrupt gate that's not > marked as allowing unprivileged access, and that opcode doesn't appear > in the SDM. It appears in the APM opcode map with no explanation at > all. > > Thanks, CPU vendors. Here's something better: http://www.rcollins.org/secrets/opcodes/ICEBP.html -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 4:22 ` Borislav Petkov @ 2015-07-31 5:11 ` Andy Lutomirski 2015-07-31 7:51 ` Paolo Bonzini 2015-07-31 8:03 ` Borislav Petkov 0 siblings, 2 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-07-31 5:11 UTC (permalink / raw) To: Borislav Petkov Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Thu, Jul 30, 2015 at 9:22 PM, Borislav Petkov <bp@alien8.de> wrote: > On Thu, Jul 30, 2015 at 02:22:06PM -0700, Andy Lutomirski wrote: >> Great. There's an opcode that invokes an interrupt gate that's not >> marked as allowing unprivileged access, and that opcode doesn't appear >> in the SDM. It appears in the APM opcode map with no explanation at >> all. >> >> Thanks, CPU vendors. > > Here's something better: > > http://www.rcollins.org/secrets/opcodes/ICEBP.html This instruction is awesome. Binutils can disassemble it (it's called "icebp") but it can't assemble it. KVM has special handling for it on VMX and actually reports it to QEMU on SVM (complete with a defined ABI). We have an asm macro so we can assemble it for 32-bit but not 64-bit, despite the fact that it works on 64-bit. The kernel instruction decoder can't decode it. Fortunately, it looks like the vm86 case is correct (or as correct as any of the vm86 junk can be), although I haven't tested it. I bet that icebp is like int3 in that it punches through vm86 mode instead of sending #GP. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 5:11 ` Andy Lutomirski @ 2015-07-31 7:51 ` Paolo Bonzini 2015-07-31 8:03 ` Borislav Petkov 1 sibling, 0 replies; 85+ messages in thread From: Paolo Bonzini @ 2015-07-31 7:51 UTC (permalink / raw) To: Andy Lutomirski, Borislav Petkov Cc: Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On 31/07/2015 07:11, Andy Lutomirski wrote: > This instruction is awesome. Binutils can disassemble it (it's called > "icebp") but it can't assemble it. KVM has special handling for it on > VMX and actually reports it to QEMU on SVM (complete with a defined > ABI). FWIW it's not reported to QEMU, it's only reported to a nested hypervisor. So the ABI is simply the SVM spec. It's not surprising that VMX support was provided by the Wine guys... Paolo > We have an asm macro so we can assemble it for 32-bit but not > 64-bit, despite the fact that it works on 64-bit. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 5:11 ` Andy Lutomirski 2015-07-31 7:51 ` Paolo Bonzini @ 2015-07-31 8:03 ` Borislav Petkov 2015-07-31 9:27 ` Paolo Bonzini 2015-09-07 5:39 ` Maciej W. Rozycki 1 sibling, 2 replies; 85+ messages in thread From: Borislav Petkov @ 2015-07-31 8:03 UTC (permalink / raw) To: Andy Lutomirski Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Thu, Jul 30, 2015 at 10:11:40PM -0700, Andy Lutomirski wrote: > This instruction is awesome. Binutils can disassemble it (it's called > "icebp") but it can't assemble it. KVM has special handling for it on > VMX and actually reports it to QEMU on SVM (complete with a defined > ABI). Fun. > We have an asm macro so we can assemble it for 32-bit but not > 64-bit, despite the fact that it works on 64-bit. > > The kernel instruction decoder can't decode it. Yeah, the kernel insn decoder needs to be fixed. Even my decoder can decode it: $ echo "0xf1" | ./x86d - 0: f1 icebp Big deal. :-) Let's do some fun and games: $ cat icebp.c int main() { asm volatile(".byte 0xf1"); return 0; } $ gcc -Wall -o icebp{,.c} $ objdump -d icebp ... 00000000004004ac <main>: 4004ac: 55 push %rbp 4004ad: 48 89 e5 mov %rsp,%rbp 4004b0: f1 icebp 4004b1: b8 00 00 00 00 mov $0x0,%eax 4004b6: 5d pop %rbp 4004b7: c3 retq 4004b8: 90 nop ... $ ./icebp Trace/breakpoint trap ^ this in qemu. On baremetal it gets a SIGTRAP with TRAP_BRKPT. Looks like signal handling knows about it... $ strace /tmp/icebp execve("/tmp/icebp", ["/tmp/icebp"], [/* 27 vars */]) = 0 brk(0) = 0x1680000 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e243d000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=127070, ...}) = 0 mmap(NULL, 127070, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f71e241d000 close(3) = 0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1729984, ...}) = 0 mmap(NULL, 3836448, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f71e1e76000 mprotect(0x7f71e2015000, 2097152, PROT_NONE) = 0 mmap(0x7f71e2215000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19f000) = 0x7f71e2215000 mmap(0x7f71e221b000, 14880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f71e221b000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241c000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241b000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241a000 arch_prctl(ARCH_SET_FS, 0x7f71e241b700) = 0 mprotect(0x7f71e2215000, 16384, PROT_READ) = 0 mprotect(0x7f71e243f000, 4096, PROT_READ) = 0 munmap(0x7f71e241d000, 127070) = 0 --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=4195505, si_uid=0} --- +++ killed by SIGTRAP +++ Trace/breakpoint trap > Fortunately, it looks like the vm86 case is correct (or as correct as > any of the vm86 junk can be), although I haven't tested it. I bet > that icebp is like int3 in that it punches through vm86 mode instead > of sending #GP. Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing. But why do you say it is special - it simply raises #DB, i.e. vector 1. Web page seems to say so when interrupt redirection is disabled. It sounds like a nice and quick way to generate a breakpoint. You can do that with INT 01, i.e., the CD opcode, too. If I'd had to guess, it isn't documented because of the proprietary ICE aspect. And no one uses ICEs anymore so it is going to be forgotten with people popping off and on and asking about the undocumented opcode. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 8:03 ` Borislav Petkov @ 2015-07-31 9:27 ` Paolo Bonzini 2015-07-31 10:25 ` Borislav Petkov 2015-09-07 5:39 ` Maciej W. Rozycki 1 sibling, 1 reply; 85+ messages in thread From: Paolo Bonzini @ 2015-07-31 9:27 UTC (permalink / raw) To: Borislav Petkov, Andy Lutomirski Cc: Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On 31/07/2015 10:03, Borislav Petkov wrote: > $ ./icebp > Trace/breakpoint trap > > ^ this in qemu. Is the strace different between KVM and baremetal? QEMU dynamic translation is broken I think, but KVM should be the same as baremetal. >> Fortunately, it looks like the vm86 case is correct (or as correct as >> any of the vm86 junk can be), although I haven't tested it. I bet >> that icebp is like int3 in that it punches through vm86 mode instead >> of sending #GP. > > Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing. No, it sends #GP. > But why do you say it is special - it simply raises #DB, i.e. vector 1. > Web page seems to say so when interrupt redirection is disabled. It > sounds like a nice and quick way to generate a breakpoint. You can do > that with INT 01, i.e., the CD opcode, too. > > If I'd had to guess, it isn't documented because of the proprietary ICE > aspect. And no one uses ICEs anymore so it is going to be forgotten with > people popping off and on and asking about the undocumented opcode. The reason why it isn't documented is probably hidden within Intel. Besides ICEBP, which is a bit fringe, there's no reason not to document SALC which Thomas mentioned. SALC all has been there since the 8086, and has been undocumented for thirty-odd years. The AAM/AAD variants with immediates other than 10 also have been undocumented for fifteen years or so (an instruction doing a division by 10 where the second byte of the opcode is 10? oh, certainly no one is going to try changing the second byte...) Paolo ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 9:27 ` Paolo Bonzini @ 2015-07-31 10:25 ` Borislav Petkov 2015-07-31 10:26 ` Paolo Bonzini 0 siblings, 1 reply; 85+ messages in thread From: Borislav Petkov @ 2015-07-31 10:25 UTC (permalink / raw) To: Paolo Bonzini Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Fri, Jul 31, 2015 at 11:27:13AM +0200, Paolo Bonzini wrote: > Is the strace different between KVM and baremetal? Yes, the signal part is missing from kvm: $ strace ./icebp execve("./icebp", ["./icebp"], [/* 20 vars */]) = 0 brk(0) = 0x601000 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7ff6000 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory) open("/etc/ld.so.cache", O_RDONLY) = 3 fstat(3, {st_mode=S_IFREG|0644, st_size=95207, ...}) = 0 mmap(NULL, 95207, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffff7fde000 close(3) = 0 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory) open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3 read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\357\1\0\0\0\0\0"..., 832) = 832 fstat(3, {st_mode=S_IFREG|0755, st_size=1595408, ...}) = 0 mmap(NULL, 3709016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffff7a53000 mprotect(0x7ffff7bd3000, 2097152, PROT_NONE) = 0 mmap(0x7ffff7dd3000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x180000) = 0x7ffff7dd3000 mmap(0x7ffff7dd8000, 18520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffff7dd8000 close(3) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdd000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdc000 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdb000 arch_prctl(ARCH_SET_FS, 0x7ffff7fdc700) = 0 mprotect(0x7ffff7dd3000, 16384, PROT_READ) = 0 mprotect(0x7ffff7ffc000, 4096, PROT_READ) = 0 munmap(0x7ffff7fde000, 95207) = 0 exit_group(0) = ? > No, it sends #GP. True story: [ 697.707990] traps: icebp[3537] general protection ip:4004b0 sp:7fffffffe610 error:a in icebp[400000+1000] but why? I guess our IDT entry at 1 is funny... Too lazy to check. > The reason why it isn't documented is probably hidden within Intel. > Besides ICEBP, which is a bit fringe, there's no reason not to document > SALC which Thomas mentioned. SALC all has been there since the 8086, > and has been undocumented for thirty-odd years. That one is invalid (on an IVB): [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000] AMD APM documents it as invalid too. > The AAM/AAD variants with immediates other than 10 also have been > undocumented for fifteen years or so (an instruction doing a division > by 10 where the second byte of the opcode is 10? oh, certainly no one > is going to try changing the second byte...) There's this in the AMD APM: "In most modern assemblers, the AAM instruction adjusts to base-10 values. However, by coding the instruction directly in binary, it can adjust to any base specified by the immediate byte value (ib) suffixed onto the D4h opcode. For example, code D408h for octal, D40Ah for decimal, and D40Ch for duodecimal (base 12)." -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 10:25 ` Borislav Petkov @ 2015-07-31 10:26 ` Paolo Bonzini 2015-07-31 10:32 ` Borislav Petkov 0 siblings, 1 reply; 85+ messages in thread From: Paolo Bonzini @ 2015-07-31 10:26 UTC (permalink / raw) To: Borislav Petkov Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On 31/07/2015 12:25, Borislav Petkov wrote: >> > The reason why it isn't documented is probably hidden within Intel. >> > Besides ICEBP, which is a bit fringe, there's no reason not to document >> > SALC which Thomas mentioned. SALC all has been there since the 8086, >> > and has been undocumented for thirty-odd years. > That one is invalid (on an IVB): > > [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000] > > AMD APM documents it as invalid too. It's valid in 32-bit. Paolo ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 10:26 ` Paolo Bonzini @ 2015-07-31 10:32 ` Borislav Petkov 0 siblings, 0 replies; 85+ messages in thread From: Borislav Petkov @ 2015-07-31 10:32 UTC (permalink / raw) To: Paolo Bonzini Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Fri, Jul 31, 2015 at 12:26:34PM +0200, Paolo Bonzini wrote: > > > On 31/07/2015 12:25, Borislav Petkov wrote: > >> > The reason why it isn't documented is probably hidden within Intel. > >> > Besides ICEBP, which is a bit fringe, there's no reason not to document > >> > SALC which Thomas mentioned. SALC all has been there since the 8086, > >> > and has been undocumented for thirty-odd years. > > That one is invalid (on an IVB): > > > > [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000] > > > > AMD APM documents it as invalid too. > > It's valid in 32-bit. Yap, no invalid opcode there. I guess there's another bug in the APM's opcode table then. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-31 8:03 ` Borislav Petkov 2015-07-31 9:27 ` Paolo Bonzini @ 2015-09-07 5:39 ` Maciej W. Rozycki 2015-09-07 7:42 ` Ingo Molnar 1 sibling, 1 reply; 85+ messages in thread From: Maciej W. Rozycki @ 2015-09-07 5:39 UTC (permalink / raw) To: Borislav Petkov Cc: Andy Lutomirski, Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Fri, 31 Jul 2015, Borislav Petkov wrote: > Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing. > > But why do you say it is special - it simply raises #DB, i.e. vector 1. > Web page seems to say so when interrupt redirection is disabled. It > sounds like a nice and quick way to generate a breakpoint. You can do > that with INT 01, i.e., the CD opcode, too. > > If I'd had to guess, it isn't documented because of the proprietary ICE > aspect. And no one uses ICEs anymore so it is going to be forgotten with > people popping off and on and asking about the undocumented opcode. FYI, it's actually still in use with modern hardware, as a software breakpoint (and hence it has to be a single byte INT1 instruction rather than a multiple-byte regular INT 1 encoding) with JTAG probe hardware used for bare-metal debugging. E.g. Intel Atom supports it and boards have been available with a JTAG connector, which Intel calls XDP aka Extended Debug Port, e.g. the D945GCLF board (aka Crown Beach IIRC) had one. By fiddling with some bits in the CPU, which are only accessible through JTAG, probe firmware takes control over #DB making it trap into the debug mode rather than into the kernel. As noted above INT1 is used rather than INT3 (which still traps into the kernel with #BP as usually) for software breakpoints, but all the other DR0-7 resources are also available to the probe and the General Detect fault is used to prevent the kernel from fiddling with them. Similarly single-stepping traps into probe firmware. Debug mode transitions are completely transparent to any kernel-mode software run. I did some work on this a few years ago, including emulating DR0-7 accesses in software down the JTAG handler upon a General Detect fault to keep the kernel both happy and away from real debug registers. ;) Yes, you can debug any software with this stuff, including the Linux kernel: set instruction and data breakpoints, single-step it, poke at all hardware registers, including descriptor registers not otherwise accessible (you can set funny modes for segments, also in the 64-bit mode), etc. One complication though is you operate on physical addresses when poking at memory, you can't ask the CPU's MMU to remap them for you (you can walk page tables manually of course, just as the MMU would). I hope this clears things a bit around this stuff. :) You might be able to find some more by issuing a query for "Extended Debug Port" with your favourite Internet search engine. It's been a while since this discussion, but I thought I'd chime in as you might find it interesting. I'm actually a bit surprised the knowledge about this is so poor among x86 experts. Maciej ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 5:39 ` Maciej W. Rozycki @ 2015-09-07 7:42 ` Ingo Molnar 2015-09-07 8:19 ` Maciej W. Rozycki 0 siblings, 1 reply; 85+ messages in thread From: Ingo Molnar @ 2015-09-07 7:42 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Borislav Petkov, Andy Lutomirski, Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst * Maciej W. Rozycki <macro@linux-mips.org> wrote: > I did some work on this a few years ago, including emulating DR0-7 accesses in > software down the JTAG handler upon a General Detect fault to keep the kernel > both happy and away from real debug registers. ;) Yes, you can debug any > software with this stuff, including the Linux kernel: set instruction and data > breakpoints, single-step it, poke at all hardware registers, including > descriptor registers not otherwise accessible (you can set funny modes for > segments, also in the 64-bit mode), etc. One complication though is you operate > on physical addresses when poking at memory, you can't ask the CPU's MMU to > remap them for you (you can walk page tables manually of course, just as the MMU > would). Essentially the ICE breakpoint instruction enters SMM mode? Thanks, Ingo ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 7:42 ` Ingo Molnar @ 2015-09-07 8:19 ` Maciej W. Rozycki 2015-09-07 10:19 ` Paolo Bonzini 0 siblings, 1 reply; 85+ messages in thread From: Maciej W. Rozycki @ 2015-09-07 8:19 UTC (permalink / raw) To: Ingo Molnar Cc: Borislav Petkov, Andy Lutomirski, Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Mon, 7 Sep 2015, Ingo Molnar wrote: > > I did some work on this a few years ago, including emulating DR0-7 accesses in > > software down the JTAG handler upon a General Detect fault to keep the kernel > > both happy and away from real debug registers. ;) Yes, you can debug any > > software with this stuff, including the Linux kernel: set instruction and data > > breakpoints, single-step it, poke at all hardware registers, including > > descriptor registers not otherwise accessible (you can set funny modes for > > segments, also in the 64-bit mode), etc. One complication though is you operate > > on physical addresses when poking at memory, you can't ask the CPU's MMU to > > remap them for you (you can walk page tables manually of course, just as the MMU > > would). > > Essentially the ICE breakpoint instruction enters SMM mode? I didn't do stuff at the probe firmware level so I can't say for sure, but my gut feeling is the debug mode is indeed very close if not the same as SMM. I think duplicating the logic would be an unnecessary waste of silicon. And obviously it's any cause of #DB that enters this mode. The probe can also request it right at the exit from the reset state, so that you can debug software (e.g BIOS startup) right from the reset vector. You don't need working RAM for that. Maciej ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 8:19 ` Maciej W. Rozycki @ 2015-09-07 10:19 ` Paolo Bonzini 2015-09-07 17:01 ` Maciej W. Rozycki 0 siblings, 1 reply; 85+ messages in thread From: Paolo Bonzini @ 2015-09-07 10:19 UTC (permalink / raw) To: Maciej W. Rozycki, Ingo Molnar Cc: Borislav Petkov, Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On 07/09/2015 10:19, Maciej W. Rozycki wrote: >> > Essentially the ICE breakpoint instruction enters SMM mode? > I didn't do stuff at the probe firmware level so I can't say for sure, > but my gut feeling is the debug mode is indeed very close if not the same > as SMM. I think duplicating the logic would be an unnecessary waste of > silicon. I researched SMM a bit recently in order to implement it in KVM, and the best source of folklore seems to be http://www.rcollins.org/ddj (which I also have on paper :)). The author there says that SMM design was roughly based on the 386's probe/ICE mode design, but it's actually separate. Most notably, on the 386 the state save areas almost mirror each other, but when I say mirror... I do mean mirror: directions are reversed, and what is on top for probe mode is on bottom for SMM. :) In addition, AMD tried reusing ICE mode for SMM, and was sued by Intel who actually won the lawsuit. I couldn't find more information about the lawsuit. It's probably diverged more and more over time, for example because SMM is now considered security-sensitive while probe mode isn't. In addition, the same DDJ article says that Pentium JTAG probe mode "doesn't resemble SMM at all, doesn't use a state save map, or even execute any code of its own", whatever that means. Paolo > And obviously it's any cause of #DB that enters this mode. The probe can > also request it right at the exit from the reset state, so that you can > debug software (e.g BIOS startup) right from the reset vector. You don't > need working RAM for that. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 10:19 ` Paolo Bonzini @ 2015-09-07 17:01 ` Maciej W. Rozycki 2015-09-07 17:22 ` Andy Lutomirski 0 siblings, 1 reply; 85+ messages in thread From: Maciej W. Rozycki @ 2015-09-07 17:01 UTC (permalink / raw) To: Paolo Bonzini Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Mon, 7 Sep 2015, Paolo Bonzini wrote: > > I didn't do stuff at the probe firmware level so I can't say for sure, > > but my gut feeling is the debug mode is indeed very close if not the same > > as SMM. I think duplicating the logic would be an unnecessary waste of > > silicon. > > I researched SMM a bit recently in order to implement it in KVM, and the > best source of folklore seems to be http://www.rcollins.org/ddj (which I > also have on paper :)). Robert did an excellent job figuring it all, but his stuff is a bit dated, things may have changed since, especially as JTAG debugging has since become ubiquitous in the embedded world and consequently better developed. > The author there says that SMM design was roughly based on the 386's > probe/ICE mode design, but it's actually separate. Most notably, on the > 386 the state save areas almost mirror each other, but when I say > mirror... I do mean mirror: directions are reversed, and what is on top > for probe mode is on bottom for SMM. :) That might be a minor implementation detail, needed for whatever reason. > In addition, AMD tried reusing ICE mode for SMM, and was sued by Intel > who actually won the lawsuit. I couldn't find more information about > the lawsuit. > > It's probably diverged more and more over time, for example because SMM > is now considered security-sensitive while probe mode isn't. In > addition, the same DDJ article says that Pentium JTAG probe mode > "doesn't resemble SMM at all, doesn't use a state save map, or even > execute any code of its own", whatever that means. At least I am fairly sure the RSM instruction is used to quit the debug mode just like with SMM, so I'd be surprised if they bothered implementing separate state save area structures for the two modes. Some control bits in the CPU may well be set differently between the two modes though, addressing issues like security sensitivity you mentioned. A state save/restore approach is definitely used (unlike with some other processors that expose internal registers through JTAG directly) as you cannot switch between operation modes (e.g. real vs protected) on the fly while in the debug mode. You actually need to return to the regular mode (e.g. ask to single-step a NOP) for a mode change to take effect. Ditto about other registers -- any read-only bits are only masked out in the register state once a regular-mode instruction has executed. The use of RSM also prompts a question whether you can nest debug mode in SMM (to debug SMM code) -- this is actually similar to the NMI vs IRET issue considered in this thread -- or nest debug mode in debug mode, e.g. by taking a #DB exception from an INT1 instruction while in either mode. I don't know. Some other processors (MIPS) that implement a JTAG debug mode allow such nesting and care has to be taken in probe firmware to handle it correctly and ensure the context to return to is not clobbered if such a situation is to be arranged. And also -- as you may have expected -- the debug mode return instruction has to be avoided in the nested handler. These are all implementation-specific details, including the INT1 instruction, which is why I am not at all surprised that they are omitted from architecture manuals. The JTAG debug mode itself is no rocket science though, everybody seems to have it these days. Though for cost and power consumption saving reasons the RTL block implementing the debug module may obviously be omitted from production silicon known, perhaps by definition, to never require one. Maciej ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 17:01 ` Maciej W. Rozycki @ 2015-09-07 17:22 ` Andy Lutomirski 2015-09-07 19:30 ` Maciej W. Rozycki 0 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-09-07 17:22 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Mon, Sep 7, 2015 at 10:01 AM, Maciej W. Rozycki <macro@linux-mips.org> wrote: > These are all implementation-specific details, including the INT1 > instruction, which is why I am not at all surprised that they are omitted > from architecture manuals. That bit is BS, though. The INT1 instruction, executed in user mode (CPL3) with no hardware debugger attached, will enter the kernel through a gate at vector 1, *even if that gate has DPL == 0*. If there's an instruction that bypasses hardware protection mechanisms, then Intel should document it rather than relying on OS writers to know enough folklore to get it right. Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the interrupt or trap gate only if an exception or interrupt is generated with an INT n, INT 3, or INTO instruction." It does not say "the processor does not check the DPL of the interrupt or trap gate if the exception or interrupt is generated with the undocumented ICEBP instruction." --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 17:22 ` Andy Lutomirski @ 2015-09-07 19:30 ` Maciej W. Rozycki 2015-09-07 21:56 ` Andy Lutomirski 0 siblings, 1 reply; 85+ messages in thread From: Maciej W. Rozycki @ 2015-09-07 19:30 UTC (permalink / raw) To: Andy Lutomirski Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Mon, 7 Sep 2015, Andy Lutomirski wrote: > > These are all implementation-specific details, including the INT1 > > instruction, which is why I am not at all surprised that they are omitted > > from architecture manuals. > > That bit is BS, though. The INT1 instruction, executed in user mode > (CPL3) with no hardware debugger attached, will enter the kernel > through a gate at vector 1, *even if that gate has DPL == 0*. > > If there's an instruction that bypasses hardware protection > mechanisms, then Intel should document it rather than relying on OS > writers to know enough folklore to get it right. > > Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the > interrupt or trap gate only if an exception or interrupt is generated > with an INT n, INT 3, or INTO instruction." It does not say "the > processor does not check the DPL of the interrupt or trap gate if the > exception or interrupt is generated with the undocumented ICEBP > instruction." It does not have to be mentioned, because it's implied by how the #DB exception is propagated: regardless of its origin it never checks the DPL. And user-mode software may well use POPF at any time to set the TF bit in the flags register to the same effect, so the OS needs to be prepared for a #DB exception it hasn't scheduled itself anyway. Maciej ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 19:30 ` Maciej W. Rozycki @ 2015-09-07 21:56 ` Andy Lutomirski 2015-09-08 16:21 ` Maciej W. Rozycki 0 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-09-07 21:56 UTC (permalink / raw) To: Maciej W. Rozycki Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Mon, Sep 7, 2015 at 12:30 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote: > On Mon, 7 Sep 2015, Andy Lutomirski wrote: > >> > These are all implementation-specific details, including the INT1 >> > instruction, which is why I am not at all surprised that they are omitted >> > from architecture manuals. >> >> That bit is BS, though. The INT1 instruction, executed in user mode >> (CPL3) with no hardware debugger attached, will enter the kernel >> through a gate at vector 1, *even if that gate has DPL == 0*. >> >> If there's an instruction that bypasses hardware protection >> mechanisms, then Intel should document it rather than relying on OS >> writers to know enough folklore to get it right. >> >> Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the >> interrupt or trap gate only if an exception or interrupt is generated >> with an INT n, INT 3, or INTO instruction." It does not say "the >> processor does not check the DPL of the interrupt or trap gate if the >> exception or interrupt is generated with the undocumented ICEBP >> instruction." > > It does not have to be mentioned, because it's implied by how the #DB > exception is propagated: regardless of its origin it never checks the DPL. > And user-mode software may well use POPF at any time to set the TF bit in > the flags register to the same effect, so the OS needs to be prepared for > a #DB exception it hasn't scheduled itself anyway. Not really. int $1 checks DPL. Setting TF results in saved TF set and the corresponding bit in DR6 set as well. Triggering a #DB using the debug registers requires active OS help. So operating systems need to handle a #DB without no indicated cause without spewing warnings or crashing, and there is no indication whatsoever in the SDM or APM that this is the case. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-09-07 21:56 ` Andy Lutomirski @ 2015-09-08 16:21 ` Maciej W. Rozycki 0 siblings, 0 replies; 85+ messages in thread From: Maciej W. Rozycki @ 2015-09-08 16:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst On Mon, 7 Sep 2015, Andy Lutomirski wrote: > > It does not have to be mentioned, because it's implied by how the #DB > > exception is propagated: regardless of its origin it never checks the DPL. > > And user-mode software may well use POPF at any time to set the TF bit in > > the flags register to the same effect, so the OS needs to be prepared for > > a #DB exception it hasn't scheduled itself anyway. > > Not really. > > int $1 checks DPL. Setting TF results in saved TF set and the > corresponding bit in DR6 set as well. Triggering a #DB using the > debug registers requires active OS help. INT $1 is a software interrupt instruction, it does not trigger a #DB. Similarly INT $13 checks DPL while #GP does not. Or maybe INT $6 vs UD2 is a better analogy; the latter is as much INT6 as the 0xf1 encoding is INT1. Yes, you'll get a DR6 status with no new bits set. So what? You can ignore it and IRET with no adverse effects. You can print diagnostics if you're pedantic. You can kill the offending user program, but that's no harm, because it already did the undefined. None of these is an issue, and certainly not one for security. Panicking OTOH would be, but that would IMHO be a silly choice and a bad OS design. You never need to crash due to a user-mode exception, even an unknown one. What if you run on a new CPU which has a new user-mode exception unknown at the time the OS binary was compiled? That's an analogous situation for an architecture like x86 where strict backwards compatibility is maintained. A reasonable #DB handler will do something like: { int dr6 = read_dr6(); write_dr6(0); if (dr6 & DR6_MASK_X) handle_dr6_x(); if (dr6 & DR6_MASK_Y) handle_dr6_y(); /* Etc... */ return; } and will work just fine where invoked with no bits set in DR6. > So operating systems need to handle a #DB without no indicated cause > without spewing warnings or crashing, and there is no indication > whatsoever in the SDM or APM that this is the case. Strictly speaking the SDM does not state that at least one status bit shall be set in DR6 either. FAOD I'm not saying of course that documenting INT1 as a model-specific instruction encoding reserved for #DB generation or stating something to the effect that the OS is required to handle (e.g. discard) a #DB exception seen with no status bits set in DR6 would be bad. No, it would certainly be nice. But I maintain that I don't see it as strictly necessary. Pester Intel if you disagree, I'm not the right person to complain about it anyway. ;) Maciej ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 20:51 ` Peter Zijlstra 2015-07-24 21:07 ` Steven Rostedt 2015-07-24 21:08 ` Andy Lutomirski @ 2015-07-24 23:53 ` Linus Torvalds 2 siblings, 0 replies; 85+ messages in thread From: Linus Torvalds @ 2015-07-24 23:53 UTC (permalink / raw) To: Peter Zijlstra Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 1:51 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > do_debug() > send_sigtrap() > force_sig_info() > spin_lock_irqsave() > > Now, I don't pretend to understand the condition before send_sigtrap(), > so it _might_ be ok, but it sure as heck could do with a comment. Ugh. As Andy said, I think that's ok, because it's actually the single-step case, and won't trigger for kernel mode. So we should be ok. Although the code I agree is not good. I'd personally be more worried about the usual crazy "notify_die()" crap. I absoluely detest those notifier chain things. They are hooks for random crap that shouldn't be hooked into, but whatever. It's not a problem in practice, it's just a sign of a certain kind of diseased mind. On the whole I think we're ok. I'd love to get rid of things, and yes, I think we should probably explicitly handle the in-kernel case first and just return without doing anything, just to make the code more obviously safe. But it doesn't look like a fundamental problem spot. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:26 ` Willy Tarreau 2015-07-24 15:30 ` Peter Zijlstra @ 2015-07-24 15:34 ` Steven Rostedt 2015-07-24 15:49 ` Willy Tarreau 1 sibling, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 15:34 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 17:26:37 +0200 Willy Tarreau <w@1wt.eu> wrote: > > The point is, if we trigger a #DB on an instruction breakpoint > > while !IF, then we simply disable that breakpoint and do the RET. > > Yes but the breakpoint remains disabled then. Or I'm missing > something. Do we care? If it was an instruction breakpoint with !IF set, then it had to have happened in the kernel. And kgdb or whatever added it there needs to deal with that. There should be no instances in the kernel where we execute userspace code with interrupts disabled. -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:34 ` Steven Rostedt @ 2015-07-24 15:49 ` Willy Tarreau 0 siblings, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 15:49 UTC (permalink / raw) To: Steven Rostedt Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 11:34:26AM -0400, Steven Rostedt wrote: > On Fri, 24 Jul 2015 17:26:37 +0200 > Willy Tarreau <w@1wt.eu> wrote: > > > > > The point is, if we trigger a #DB on an instruction breakpoint > > > while !IF, then we simply disable that breakpoint and do the RET. > > > > Yes but the breakpoint remains disabled then. Or I'm missing > > something. > > Do we care? If it was an instruction breakpoint with !IF set, then it > had to have happened in the kernel. And kgdb or whatever added it there > needs to deal with that. I was concerned that an RW BP would remain disabled when returning to user space but Peter cleared that out by pointing me to the discussion where it was explained that they are re-enabled when returning to user space. So no problem here for me. Thanks, Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 8:13 ` Peter Zijlstra 2015-07-24 9:02 ` Willy Tarreau 2015-07-24 11:58 ` Steven Rostedt @ 2015-07-24 15:48 ` Andy Lutomirski 2015-07-24 16:02 ` Steven Rostedt ` (3 more replies) 2 siblings, 4 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-07-24 15:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 1:13 AM, Peter Zijlstra <peterz@infradead.org> wrote: > On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote: >> Hmmm. I thought watchpoints were "before the instruction" too, but >> that's just because I haven't used them in ages, and I didn't remember >> the details. I just looked it up. >> >> You're right - the memory watchpoints trigger after the instruction >> has executed, so RF isn't an issue. So yes, the only issue is >> instruction breakpoints, and those are the only ones we need to clear. >> >> And that makes it really easy. >> >> So yes, I agree. We only need to clear all kernel breakpoints. > > But but but, we can access userspace with !IF, imagine someone doing: > > local_irq_disable(); > copy_from_user_inatomic(); > > and as luck would have it, there's a breakpoint on the user memory we > just touched. And we go and disable a user breakpoint. > The Intel SDM says: 17.3.1.2 Data Memory and I/O Breakpoint Exception Conditions Data memory and I/O breakpoints are reported when the processor attempts to access a memory or I/O address specified in a breakpoint-address register (DR0 through DR3) that has been set up to detect data or I/O accesses (R/W flag is set to 1, 2, or 3). The processor generates the exception after it executes the instruction that made the access, so these breakpoint condition causes a trap-class exception to be generated. So by the time we detect that we've hit a watchpoint, the instruction that tripped it is done and we don't need RF. Furthermore, after reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if we hit a watchpoint. So this might be as simple as: if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs)) for (i = 0; i < 4; i++) if (dr6 & (DR_TRAP0<<i)) { /* hit a kernel breakpoint with IF clear */ dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT)); } I'm not saying that your code is wrong, but I think this is simpler and avoids poking at yet more per-cpu state from NMI context, which is kind of nice. If you don't like the RF games above, it would also be straightforward to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a breakpoint. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:48 ` Andy Lutomirski @ 2015-07-24 16:02 ` Steven Rostedt 2015-07-24 16:08 ` Willy Tarreau 2015-07-24 16:06 ` Steven Rostedt ` (2 subsequent siblings) 3 siblings, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 16:02 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 08:48:57 -0700 Andy Lutomirski <luto@amacapital.net> wrote: > So by the time we detect that we've hit a watchpoint, the instruction > that tripped it is done and we don't need RF. Furthermore, after > reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if > we hit a watchpoint. So this might be as simple as: > > if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | Um, isn't 0xf * DR_TRAP0 same as a constant "true"? -- Steve > X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs)) > for (i = 0; i < 4; i++) > if (dr6 & (DR_TRAP0<<i)) { > /* hit a kernel breakpoint with IF clear */ > dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT)); > } > > I'm not saying that your code is wrong, but I think this is simpler > and avoids poking at yet more per-cpu state from NMI context, which is > kind of nice. > > If you don't like the RF games above, it would also be straightforward > to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a > breakpoint. > > --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 16:02 ` Steven Rostedt @ 2015-07-24 16:08 ` Willy Tarreau 2015-07-24 16:31 ` Steven Rostedt 0 siblings, 1 reply; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 16:08 UTC (permalink / raw) To: Steven Rostedt Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 12:02:49PM -0400, Steven Rostedt wrote: > On Fri, 24 Jul 2015 08:48:57 -0700 > Andy Lutomirski <luto@amacapital.net> wrote: > > > So by the time we detect that we've hit a watchpoint, the instruction > > that tripped it is done and we don't need RF. Furthermore, after > > reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if > > we hit a watchpoint. So this might be as simple as: > > > > if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | > > Um, isn't 0xf * DR_TRAP0 same as a constant "true"? For me it's a typo, it should have been : if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | (the purpose was to read all 4 lower bits at once). Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 16:08 ` Willy Tarreau @ 2015-07-24 16:31 ` Steven Rostedt 0 siblings, 0 replies; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 16:31 UTC (permalink / raw) To: Willy Tarreau Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 18:08:06 +0200 Willy Tarreau <w@1wt.eu> wrote: > > Um, isn't 0xf * DR_TRAP0 same as a constant "true"? > > For me it's a typo, it should have been : > > if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | > > (the purpose was to read all 4 lower bits at once). I figured that after I sent it, but the 0xf * DR_TRAP0 is also confusing to me. Actually, why not use proper naming: dr6 & DR_TRAP_BITS -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:48 ` Andy Lutomirski 2015-07-24 16:02 ` Steven Rostedt @ 2015-07-24 16:06 ` Steven Rostedt 2015-07-24 16:25 ` Willy Tarreau 2015-07-24 17:10 ` Willy Tarreau 3 siblings, 0 replies; 85+ messages in thread From: Steven Rostedt @ 2015-07-24 16:06 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, 24 Jul 2015 08:48:57 -0700 Andy Lutomirski <luto@amacapital.net> wrote: > So by the time we detect that we've hit a watchpoint, the instruction > that tripped it is done and we don't need RF. Furthermore, after > reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if > we hit a watchpoint. So this might be as simple as: > > if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | > X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs)) Also, wouldn't !(regs->X86_EFLAGS_IF) && !user_mode(regs) be a bug? When do we allow coming from userspace with interrupts disabled? -- Steve > for (i = 0; i < 4; i++) > if (dr6 & (DR_TRAP0<<i)) { > /* hit a kernel breakpoint with IF clear */ > dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT)); > } > > I'm not saying that your code is wrong, but I think this is simpler > and avoids poking at yet more per-cpu state from NMI context, which is > kind of nice. > > If you don't like the RF games above, it would also be straightforward > to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a > breakpoint. > > --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:48 ` Andy Lutomirski 2015-07-24 16:02 ` Steven Rostedt 2015-07-24 16:06 ` Steven Rostedt @ 2015-07-24 16:25 ` Willy Tarreau 2015-07-24 17:21 ` Andy Lutomirski 2015-07-24 17:10 ` Willy Tarreau 3 siblings, 1 reply; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 16:25 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote: > So by the time we detect that we've hit a watchpoint, the instruction > that tripped it is done and we don't need RF. Furthermore, after > reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if > we hit a watchpoint. Apparently after reading 17.3.1.1, it seems like RF can still be set if a data breakpoint triggers in a repeated string instruction before the last iteration. However I don't think we care because as long as we return to the string instruction, since the data location was already visited it won't trigger again so the loss of the flag should be safe. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 16:25 ` Willy Tarreau @ 2015-07-24 17:21 ` Andy Lutomirski 0 siblings, 0 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-07-24 17:21 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 9:25 AM, Willy Tarreau <w@1wt.eu> wrote: > On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote: >> So by the time we detect that we've hit a watchpoint, the instruction >> that tripped it is done and we don't need RF. Furthermore, after >> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if >> we hit a watchpoint. > > Apparently after reading 17.3.1.1, it seems like RF can still be set > if a data breakpoint triggers in a repeated string instruction before > the last iteration. However I don't think we care because as long as > we return to the string instruction, since the data location was already > visited it won't trigger again so the loss of the flag should be safe. > Oh, right. So my proposal is wrong: it'll clear a watchpoint incorrectly if we hit it in the middle of a string operation. So we should either parse dr0..dr3 (whichever one triggered) or do Peter's think and clear dr7 entirely. I still prefer just clearing the breakpoint that triggered. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 15:48 ` Andy Lutomirski ` (2 preceding siblings ...) 2015-07-24 16:25 ` Willy Tarreau @ 2015-07-24 17:10 ` Willy Tarreau 2015-07-24 17:20 ` Andy Lutomirski 2015-07-24 17:21 ` Willy Tarreau 3 siblings, 2 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 17:10 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote: > So by the time we detect that we've hit a watchpoint, the instruction > that tripped it is done and we don't need RF. Furthermore, after > reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if > we hit a watchpoint. So this might be as simple as: > > if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | > X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs)) > for (i = 0; i < 4; i++) > if (dr6 & (DR_TRAP0<<i)) { > /* hit a kernel breakpoint with IF clear */ > dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT)); > } > > I'm not saying that your code is wrong, but I think this is simpler > and avoids poking at yet more per-cpu state from NMI context, which is > kind of nice. > > If you don't like the RF games above, it would also be straightforward > to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a > breakpoint. Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET to emulate IRET, and then possibly simplify the flags processing. It says that it takes the CPL3 code segment but nowhere it says that the target is validated for effectively being userland, and further it suggests that it doesn't validate anything : "It is the responsibility of the OS to ensure the descriptors in the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET (consistent with the base, limit, and attribute values forced by the instructions)." The OS has to set the RSP by itself before doing SYSRET, which opens a race between "mov rsp" and "sysret", but if we only take that path once we figure we come from NMI (using just IF+RSP), we know that IRQs and NMIs are still disabled and cannot strike at this instant. Maybe MCEs can, but they would execute within the NMI's stack just as if they were triggered inside the NMI as well so I don't see a problem here. I tried to imagine a case where kernel page faults, then NMI comes in, then debug strikes and we have to return from debug to NMI then to fault handler and I don't think we break the chain. Of course there are many subtleties I can't grab because I don't understand all the details. Do you think that could simplify things or that it's another stupid idea ? Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 17:10 ` Willy Tarreau @ 2015-07-24 17:20 ` Andy Lutomirski 2015-07-30 15:54 ` Paolo Bonzini 2015-07-24 17:21 ` Willy Tarreau 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-24 17:20 UTC (permalink / raw) To: Willy Tarreau Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 10:10 AM, Willy Tarreau <w@1wt.eu> wrote: > On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote: >> So by the time we detect that we've hit a watchpoint, the instruction >> that tripped it is done and we don't need RF. Furthermore, after >> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if >> we hit a watchpoint. So this might be as simple as: >> >> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF | >> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs)) >> for (i = 0; i < 4; i++) >> if (dr6 & (DR_TRAP0<<i)) { >> /* hit a kernel breakpoint with IF clear */ >> dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT)); >> } >> >> I'm not saying that your code is wrong, but I think this is simpler >> and avoids poking at yet more per-cpu state from NMI context, which is >> kind of nice. >> >> If you don't like the RF games above, it would also be straightforward >> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a >> breakpoint. > > Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET > to emulate IRET, and then possibly simplify the flags processing. It says > that it takes the CPL3 code segment but nowhere it says that the target is > validated for effectively being userland, and further it suggests that it > doesn't validate anything : > > "It is the responsibility of the OS to ensure the descriptors in > the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET > (consistent with the base, limit, and attribute values forced by > the instructions)." You are an evil bastard. I seriously doubt that this will work. SYSRET goes to CPL3 no matter what. Also, I don't think you want to start poking at MSRs to return. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 17:20 ` Andy Lutomirski @ 2015-07-30 15:54 ` Paolo Bonzini 0 siblings, 0 replies; 85+ messages in thread From: Paolo Bonzini @ 2015-07-30 15:54 UTC (permalink / raw) To: Andy Lutomirski, Willy Tarreau Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On 24/07/2015 19:20, Andy Lutomirski wrote: > > Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET > > to emulate IRET, and then possibly simplify the flags processing. It says > > that it takes the CPL3 code segment but nowhere it says that the target is > > validated for effectively being userland, and further it suggests that it > > doesn't validate anything : > > > > "It is the responsibility of the OS to ensure the descriptors in > > the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET > > (consistent with the base, limit, and attribute values forced by > > the instructions)." > You are an evil bastard. I seriously doubt that this will work. > SYSRET goes to CPL3 no matter what. Also, I don't think you want to > start poking at MSRs to return. On Intel the bottom two bits of the selector are forced to 11. The pseudocode of SYSRET in the SDM has an explicit CS.Selector ← (IA32_STAR[63:48]+ either 0 or 16) OR 3; ... SS.Selector ← (IA32_STAR[63:48]+8) OR 3; On AMD it's even worse, because you get a weird state with CS.DPL=CS.RPL=SS.DPL=SS.RPL=0 but still the CPL is 3. This is seriously messed up because the CPL is always SS.DPL except in this case. AMD even had to add a separate field for the CPL to their VM control block, just to account for this case. Intel more sanely uses SS.DPL. Paolo ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-24 17:10 ` Willy Tarreau 2015-07-24 17:20 ` Andy Lutomirski @ 2015-07-24 17:21 ` Willy Tarreau 1 sibling, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-24 17:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Brian Gerst On Fri, Jul 24, 2015 at 07:10:18PM +0200, Willy Tarreau wrote: > The OS has to set the RSP by itself before doing SYSRET, which opens a > race between "mov rsp" and "sysret", but if we only take that path once > we figure we come from NMI (using just IF+RSP), we know that IRQs and > NMIs are still disabled and cannot strike at this instant. Maybe MCEs > can, but they would execute within the NMI's stack just as if they were > triggered inside the NMI as well so I don't see a problem here. OK too bad I just found the response here in the code :-( * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET, * restoring TF results in a trap from userspace immediately after * SYSRET. This would cause an infinite loop whenever #DB happens * with register state that satisfies the opportunistic SYSRET * conditions. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:38 ` Linus Torvalds 2015-07-23 20:49 ` Andy Lutomirski @ 2015-07-23 20:52 ` Willy Tarreau 2015-07-23 20:53 ` Andy Lutomirski 2015-07-23 21:13 ` Linus Torvalds 2015-07-23 21:20 ` Peter Zijlstra 2 siblings, 2 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-23 20:52 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote: > On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty. > > > > We haven't considered: > > > > 3. Forbid faults (other than MCE) inside NMI. > > I'd really prefer #2. #3 depends on us getting many things right, and > never introducing new cases in the future. > > #2, in contrast, seems to be fairly localized. Yes, RF is an issue, > but returning to user space with RF clear doesn't really seem to be > all that problematic. What's the worst case that can happen with RF cleared when returing to user space ? My understanding is that it's just that we risk to break again on an instruction that had a break point set and which already triggered the breakpoint, right ? If so the problem probably is whether there's a risk of looping again without ever getting a chance to execute this instruction normally. But if the NMIs don't bomb as fast as we can process them, at some point the instruction should get a chance to be executed, so the problem doesn't seem dramatic. That makes me think that I have no idea what happens if we try to step-trace "int 2", I don't even know if we pass through the NMI handler. Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:52 ` Willy Tarreau @ 2015-07-23 20:53 ` Andy Lutomirski 2015-07-23 21:07 ` Willy Tarreau 2015-07-23 21:13 ` Linus Torvalds 1 sibling, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 20:53 UTC (permalink / raw) To: Willy Tarreau Cc: Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote: > On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote: >> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> > >> > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty. >> > >> > We haven't considered: >> > >> > 3. Forbid faults (other than MCE) inside NMI. >> >> I'd really prefer #2. #3 depends on us getting many things right, and >> never introducing new cases in the future. >> >> #2, in contrast, seems to be fairly localized. Yes, RF is an issue, >> but returning to user space with RF clear doesn't really seem to be >> all that problematic. > > What's the worst case that can happen with RF cleared when returing > to user space ? My understanding is that it's just that we risk to > break again on an instruction that had a break point set and which > already triggered the breakpoint, right ? I assume Linus meant returning to kernel space with RF clear. Returns to userspace have their own fancy logic here, and it's survived for a couple of releases, including through an explicit test of RF handling :) --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:53 ` Andy Lutomirski @ 2015-07-23 21:07 ` Willy Tarreau 0 siblings, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-23 21:07 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 01:53:34PM -0700, Andy Lutomirski wrote: > On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote: > > On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote: > >> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote: > >> > > >> > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty. > >> > > >> > We haven't considered: > >> > > >> > 3. Forbid faults (other than MCE) inside NMI. > >> > >> I'd really prefer #2. #3 depends on us getting many things right, and > >> never introducing new cases in the future. > >> > >> #2, in contrast, seems to be fairly localized. Yes, RF is an issue, > >> but returning to user space with RF clear doesn't really seem to be > >> all that problematic. > > > > What's the worst case that can happen with RF cleared when returing > > to user space ? My understanding is that it's just that we risk to > > break again on an instruction that had a break point set and which > > already triggered the breakpoint, right ? > > I assume Linus meant returning to kernel space with RF clear. Returns > to userspace have their own fancy logic here, and it's survived for a > couple of releases, including through an explicit test of RF handling > :) Ah you must be right, got it. Yes you want to break into the NMI handler and you either disable all breakpoints/single-step until the NMI's iret by clearing DR7, or you loop over and over on the same instruction if you try to restart the stopped instruction with RF clear. That makes sense. Thanks, Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:52 ` Willy Tarreau 2015-07-23 20:53 ` Andy Lutomirski @ 2015-07-23 21:13 ` Linus Torvalds 2015-07-23 21:18 ` Willy Tarreau 1 sibling, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 21:13 UTC (permalink / raw) To: Willy Tarreau Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote: > > What's the worst case that can happen with RF cleared when returing > to user space ? Not a good idea. We are fine breaking breakpoints on the kernel ("use the tracing infrastructure instead"). Breaking it in user space is not really an option. And we really don't need to. We'd only use 'ret' when returning to kernel code. And not even for the usual case, only for the "interrupts are off" case. If somebody tries to put a breakpoint on something that is used in an irq-off situation, they are doing something very specialized, and we cna tell them: "sorry, we had to break your use case because it's crazy any other way". Those kind of people are by definition not "users". They are mucking with kernel internals. Breaking them is not a regression. Btw, we should still ask Intel for that "fast iret that doesn't re-enable NMI". So for possible future CPU's we might let people do crazy things again. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:13 ` Linus Torvalds @ 2015-07-23 21:18 ` Willy Tarreau 0 siblings, 0 replies; 85+ messages in thread From: Willy Tarreau @ 2015-07-23 21:18 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 02:13:16PM -0700, Linus Torvalds wrote: > On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote: > > > > What's the worst case that can happen with RF cleared when returing > > to user space ? > > Not a good idea. We are fine breaking breakpoints on the kernel ("use > the tracing infrastructure instead"). Breaking it in user space is not > really an option. But that wouldn't disable the breakpoint, just make it strike again, so the user would not be hurt. > And we really don't need to. We'd only use 'ret' when returning to > kernel code. And not even for the usual case, only for the "interrupts > are off" case. If somebody tries to put a breakpoint on something > that is used in an irq-off situation, they are doing something very > specialized, and we cna tell them: "sorry, we had to break your use > case because it's crazy any other way". > > Those kind of people are by definition not "users". They are mucking > with kernel internals. Breaking them is not a regression. > > Btw, we should still ask Intel for that "fast iret that doesn't > re-enable NMI". So for possible future CPU's we might let people do > crazy things again. I'm just thinking that there should be an option for this : task switching. You can store the EFLAGS in the TSS, so by preparing a dummy task with everything needed to emulate iret, we might be able to do it without the iret instruction. Or is this a stupid idea ? At least now I've well understood that ugliness is not an excuse for not proposing something :-) Willy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:38 ` Linus Torvalds 2015-07-23 20:49 ` Andy Lutomirski 2015-07-23 20:52 ` Willy Tarreau @ 2015-07-23 21:20 ` Peter Zijlstra 2015-07-23 21:35 ` Linus Torvalds 2 siblings, 1 reply; 85+ messages in thread From: Peter Zijlstra @ 2015-07-23 21:20 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote: > And the "take them and disable them" is really simple. No "am I in an > NMI contect" thing (because that leads to the whole question about > "what is NMI context"). That's not the real rule anyway. > > No, make it very simple and straightforward. Make the test be "uhhuh, > I got a #DB in kernel mode, and interrupts were disabled - I know I'm > going to return with "ret", so I'm just going to have to disable this > breakpoint". > > Nothing clever. Nothing subtle. Nothing that needs "this range of > instructions is magical". No. Just a very simple rule: if the context > we return to is kernel mode and interrupts are disabled, we're using > 'ret', so we cannot suppress debug faults. > > Did I miss something? There were a lot of emails flying around, but I > *thought* I saw them all.. So the NMI could trigger userspace debug register faults, and simply disabling them would make the whole debug register thing entirely unreliable. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:20 ` Peter Zijlstra @ 2015-07-23 21:35 ` Linus Torvalds 2015-07-23 21:45 ` Andy Lutomirski 0 siblings, 1 reply; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 21:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 2:20 PM, Peter Zijlstra <peterz@infradead.org> wrote: > > So the NMI could trigger userspace debug register faults, and simply > disabling them would make the whole debug register thing entirely > unreliable. We could easily set something to re-enable them for when we actually return to user space. I'd be ok with just setting the _TIF_USER_WORK_MASK. But even that should not be a requirement for the basic stability and core integrity of the kernel. Not like the current horrid mess with NMI nesting and ESP fixing etc. And realistically, nobody will ever even notice. So the whole "ok, we can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail that is more like cleaning up things, not a core issue. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:35 ` Linus Torvalds @ 2015-07-23 21:45 ` Andy Lutomirski 2015-07-23 21:54 ` Linus Torvalds 0 siblings, 1 reply; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 21:45 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 2:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jul 23, 2015 at 2:20 PM, Peter Zijlstra <peterz@infradead.org> wrote: >> >> So the NMI could trigger userspace debug register faults, and simply >> disabling them would make the whole debug register thing entirely >> unreliable. > > We could easily set something to re-enable them for when we actually > return to user space. I'd be ok with just setting the > _TIF_USER_WORK_MASK. > > But even that should not be a requirement for the basic stability and > core integrity of the kernel. Not like the current horrid mess with > NMI nesting and ESP fixing etc. > > And realistically, nobody will ever even notice. So the whole "ok, we > can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail > that is more like cleaning up things, not a core issue. > Or we just re-enable them on the way out of NMI (i.e. the very last thing we do in the NMI handler). I don't want to break regular userspace gdb when perf is running. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:45 ` Andy Lutomirski @ 2015-07-23 21:54 ` Linus Torvalds 2015-07-23 21:59 ` Andy Lutomirski 2015-07-24 11:06 ` Peter Zijlstra 0 siblings, 2 replies; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 21:54 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > Or we just re-enable them on the way out of NMI (i.e. the very last > thing we do in the NMI handler). I don't want to break regular > userspace gdb when perf is running. I'd really prefer it if we don't touch NMI code in those kinds of ways. The NMI code is fragile as hell. All the problems we have with it is exactly due to "where is the boundary" issues. That's why I *don't* want NMI code to do magic crap. Anything that says "disable this during this magic window" is broken. The problems we've had are exactly about atomicity of the entry/exit conditions, and there is no really good way to get them right. I'd be much happier with a _TIF_USER_WORK_MASK approach exactly because it's so *obvious* that it's not a boundary condition. I dislike the "disable and re-enable dr7 in the NMI handler" exactly because it smells like "we can only handle faults in _this_ region". It may be true, but it's also what I want us to get away from. I'd much rather have the "big picture" be that we can take faults anywhere at all (*), and that none of the core code really cares. Then we "fix up" user space. Linus (*) And yes, sysenter and not having a stack at all is very special, and I think we will *always* have to have that magical special case of the first few instructions there. But that's a separate hardware limitation we can't get around. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:54 ` Linus Torvalds @ 2015-07-23 21:59 ` Andy Lutomirski 2015-07-23 22:03 ` Linus Torvalds 2015-07-24 10:28 ` Peter Zijlstra 2015-07-24 11:06 ` Peter Zijlstra 1 sibling, 2 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 21:59 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 2:54 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@amacapital.net> wrote: >> >> Or we just re-enable them on the way out of NMI (i.e. the very last >> thing we do in the NMI handler). I don't want to break regular >> userspace gdb when perf is running. > > I'd really prefer it if we don't touch NMI code in those kinds of > ways. The NMI code is fragile as hell. All the problems we have with > it is exactly due to "where is the boundary" issues. > > That's why I *don't* want NMI code to do magic crap. Anything that > says "disable this during this magic window" is broken. The problems > we've had are exactly about atomicity of the entry/exit conditions, > and there is no really good way to get them right. > > I'd be much happier with a _TIF_USER_WORK_MASK approach exactly > because it's so *obvious* that it's not a boundary condition. > > I dislike the "disable and re-enable dr7 in the NMI handler" exactly > because it smells like "we can only handle faults in _this_ region". > It may be true, but it's also what I want us to get away from. I'd > much rather have the "big picture" be that we can take faults anywhere > at all (*), and that none of the core code really cares. Then we "fix > up" user space. OK, new proposal: In do_debug, if we trip an instruction breakpoint while !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm *that breakpoint*. Why? It only looks at hardware state (dr6 and dr7), and it can't break gdb, because gdb can't set a breakpoint that will cause this problem. All the other variants of this either need cached state or break gdb watchpoints on stack variables with perf running. --Andy > > Linus > > (*) And yes, sysenter and not having a stack at all is very special, > and I think we will *always* have to have that magical special case of > the first few instructions there. But that's a separate hardware > limitation we can't get around. -- Andy Lutomirski AMA Capital Management, LLC ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:59 ` Andy Lutomirski @ 2015-07-23 22:03 ` Linus Torvalds 2015-07-24 10:28 ` Peter Zijlstra 1 sibling, 0 replies; 85+ messages in thread From: Linus Torvalds @ 2015-07-23 22:03 UTC (permalink / raw) To: Andy Lutomirski Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 2:59 PM, Andy Lutomirski <luto@amacapital.net> wrote: > OK, new proposal: > > In do_debug, if we trip an instruction breakpoint while > !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm > *that breakpoint*. Ack. The more targeted we can make this while still guaranteeing forward progress, the better. So that sounds really good. Linus ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:59 ` Andy Lutomirski 2015-07-23 22:03 ` Linus Torvalds @ 2015-07-24 10:28 ` Peter Zijlstra 1 sibling, 0 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 10:28 UTC (permalink / raw) To: Andy Lutomirski Cc: Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 02:59:46PM -0700, Andy Lutomirski wrote: > OK, new proposal: > > In do_debug, if we trip an instruction breakpoint while > !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm > *that breakpoint*. Doesn't !IF already imply that it must be kernel space? AFAIK user space cannot clear IF. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:54 ` Linus Torvalds 2015-07-23 21:59 ` Andy Lutomirski @ 2015-07-24 11:06 ` Peter Zijlstra 1 sibling, 0 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-24 11:06 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 02:54:54PM -0700, Linus Torvalds wrote: > On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@amacapital.net> wrote: > > > > Or we just re-enable them on the way out of NMI (i.e. the very last > > thing we do in the NMI handler). I don't want to break regular > > userspace gdb when perf is running. > > I'd really prefer it if we don't touch NMI code in those kinds of > ways. The NMI code is fragile as hell. All the problems we have with > it is exactly due to "where is the boundary" issues. > > That's why I *don't* want NMI code to do magic crap. Anything that > says "disable this during this magic window" is broken. The problems > we've had are exactly about atomicity of the entry/exit conditions, > and there is no really good way to get them right. > > I'd be much happier with a _TIF_USER_WORK_MASK approach exactly > because it's so *obvious* that it's not a boundary condition. > > I dislike the "disable and re-enable dr7 in the NMI handler" exactly > because it smells like "we can only handle faults in _this_ region". > It may be true, but it's also what I want us to get away from. I'd > much rather have the "big picture" be that we can take faults anywhere > at all (*), and that none of the core code really cares. Then we "fix > up" user space. A wee bit something like so? We need the intermediate self-IPI because NMI/MCE etc do not deal with TIF flags. I further cleared all of DR7 in an attempt at reducing the amount of state tracked. And it doesn't distinguish between kernel/user watchpoints because the kernel can touch both from !IF. --- arch/x86/kernel/traps.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 8e65d8a9b8db..e8308e9c2b1e 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -570,6 +570,33 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s) NOKPROBE_SYMBOL(fixup_bad_iret); #endif +struct do_debug_state { + unsigned long dr7; + struct irq_work irq_work; + struct callback_head task_work; +}; + +static void __debug_irq_trampoline(struct irq_work *work) +{ + struct do_debug_state *dds = + container_of(work, struct do_debug_state, irq_work); + + task_work_add(current, &dds->task_work, true); +} + +static void __debug_restore_dr7(struct callback_head *work) +{ + struct do_debug_state *dds = + container_of(work, struct do_debug_state, task_work); + + set_debugreg(dds->dr7, 7); +} + +static DEFINE_PER_CPU(struct do_debug_state, do_debug_state) = { + .irq_work = { .func = __debug_irq_trampoline, }, + .task_work = { .func = __debug_restore_dr7, }, +}; + /* * Our handling of the processor debug registers is non-trivial. * We do not clear them on entry and exit from the kernel. Therefore @@ -603,6 +630,16 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code) ist_enter(regs); + if (arch_irqs_disabled_flags(regs->flags)) { + struct do_debug_state *dds = this_cpu_ptr(&do_debug_state); + + get_debugreg(dds->dr7, 7); + set_debugreg(0, 7); + irq_work_queue(&dds->irq_work); + + goto exit; + } + get_debugreg(dr6, 6); /* Filter out all the reserved bits which are preset to 1 */ ^ permalink raw reply related [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski 2015-07-23 20:38 ` Linus Torvalds @ 2015-07-23 21:17 ` Peter Zijlstra 2015-07-23 21:20 ` Steven Rostedt 2015-07-24 16:33 ` Raymond Jennings 3 siblings, 0 replies; 85+ messages in thread From: Peter Zijlstra @ 2015-07-23 21:17 UTC (permalink / raw) To: Andy Lutomirski Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Linus Torvalds, Steven Rostedt, Brian Gerst On Thu, Jul 23, 2015 at 01:21:16PM -0700, Andy Lutomirski wrote: > 3. Forbid faults (other than MCE) inside NMI. > > Option 3 is almost easy. There are really only two kinds of faults > that can legitimately nest inside NMI: #PF and #DB. #DB is easy to > fix (e.g. with my patches or Peter's patches). > > What if we went all out and forbade page faults in NMI as well. There > are two reasons that I can think of that we might page fault inside an > NMI: > > b) User memory access faults. > > The reason we access user state in general from an NMI is to allow > perf to capture enough user stack data to let the tooling backtrace > back to user space. What if we did it differently? Instead of > capturing this data in NMI context, capture it in > prepare_exit_to_usermode. > Peter, can this be done without breaking the perf ABI? If we were > designing all of this stuff from scratch right now, I'd suggest doing > it this way, but I'm not sure whether it makes sense to try to > retrofit it in. Not really; but also almost :/ So the thing is that we currently attach the user backtrace to all events -- and there can be many before we return to userspace again. So none of those events would have a userspace stack, I'm sure that's going to confuse the tooling. OTOH, userspace stacks are a best effort thing, we bail at the first sign of trouble (eg. the stack page is not there). Now realistically this 'never' happens, and it would result in consistently truncated user traces, where your proposal would result in a whole bunch of events with no user traces and then an 'extra' event with a one. ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski 2015-07-23 20:38 ` Linus Torvalds 2015-07-23 21:17 ` Peter Zijlstra @ 2015-07-23 21:20 ` Steven Rostedt 2015-07-23 21:46 ` Andy Lutomirski 2015-07-24 16:33 ` Raymond Jennings 3 siblings, 1 reply; 85+ messages in thread From: Steven Rostedt @ 2015-07-23 21:20 UTC (permalink / raw) To: Andy Lutomirski Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds, Brian Gerst On Thu, 23 Jul 2015 13:21:16 -0700 Andy Lutomirski <luto@amacapital.net> wrote: > 3. Forbid faults (other than MCE) inside NMI. > > Option 3 is almost easy. There are really only two kinds of faults > that can legitimately nest inside NMI: #PF and #DB. #DB is easy to > fix (e.g. with my patches or Peter's patches). What about int3? Which is needed to make ftrace work. This was a requirement to get rid of stomp-machine when updating ftrace functions, as well as the rational for doing the whole NMI nesting work in the first place. > > What if we went all out and forbade page faults in NMI as well. There > are two reasons that I can think of that we might page fault inside an > NMI: > > a) vmalloc fault. I think Ingo already half-implemented a rework to > eliminate vmalloc faults entirely. > > b) User memory access faults. c) stack tracing faults I would have NMIs debug deadlocks with printing stack traces. The stack tracer can page fault, and before the NMI nesting code, while debugging machines, these stack dumps would randomly reboot the box. While writing the NMI nesting code I realized why those reboots happened, and that was due to the stack trace faulting, and the printk from NMI was slow enough to have another NMI go off and stomp over the outer NMIs stack. Which lead to triple faults and such. -- Steve ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 21:20 ` Steven Rostedt @ 2015-07-23 21:46 ` Andy Lutomirski 0 siblings, 0 replies; 85+ messages in thread From: Andy Lutomirski @ 2015-07-23 21:46 UTC (permalink / raw) To: Steven Rostedt Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds, Brian Gerst On Thu, Jul 23, 2015 at 2:20 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > On Thu, 23 Jul 2015 13:21:16 -0700 > Andy Lutomirski <luto@amacapital.net> wrote: > >> 3. Forbid faults (other than MCE) inside NMI. >> >> Option 3 is almost easy. There are really only two kinds of faults >> that can legitimately nest inside NMI: #PF and #DB. #DB is easy to >> fix (e.g. with my patches or Peter's patches). > > What about int3? Which is needed to make ftrace work. This was a > requirement to get rid of stomp-machine when updating ftrace functions, > as well as the rational for doing the whole NMI nesting work in the > first place. OK, I'm convinced. So I'll keep working on fixing up int3 to be less magical. Patches coming eventually. --Andy ^ permalink raw reply [flat|nested] 85+ messages in thread
* Re: Dealing with the NMI mess 2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski ` (2 preceding siblings ...) 2015-07-23 21:20 ` Steven Rostedt @ 2015-07-24 16:33 ` Raymond Jennings 3 siblings, 0 replies; 85+ messages in thread From: Raymond Jennings @ 2015-07-24 16:33 UTC (permalink / raw) To: Andy Lutomirski Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds, Steven Rostedt, Brian Gerst On Thu, 2015-07-23 at 13:21 -0700, Andy Lutomirski wrote: > [moved to a new thread, cc list trimmed] > > Hi all- > > We've considered two approaches to dealing with NMIs: > > 1. Allow nesting. We know quite well how messy that is. This might be a stupid question, but 1. What exactly does the NMI handler handle 2. Is it possible for the NMI handler to just increment a counter and return if it nests, and let the outer handler notice and rerun itself. > 2. Forbid IRET inside NMIs. Doable but maybe not that pretty. > > We haven't considered: > > 3. Forbid faults (other than MCE) inside NMI. > > Option 3 is almost easy. There are really only two kinds of faults > that can legitimately nest inside NMI: #PF and #DB. #DB is easy to > fix (e.g. with my patches or Peter's patches). > > What if we went all out and forbade page faults in NMI as well. There > are two reasons that I can think of that we might page fault inside an > NMI: > > a) vmalloc fault. I think Ingo already half-implemented a rework to > eliminate vmalloc faults entirely. > > b) User memory access faults. > > The reason we access user state in general from an NMI is to allow > perf to capture enough user stack data to let the tooling backtrace > back to user space. What if we did it differently? Instead of > capturing this data in NMI context, capture it in > prepare_exit_to_usermode. That would let us capture user state > *correctly*, which we currently can't really do. There's a > never-ending series of minor bugs in which we try to guess the user > register state from NMI context, and it sort of works. In > prepare_exit_to_usermode, we really truly know the user state. > There's a race where an NMI hits during or after > prepare_exit_to_usermode, but maybe that's okay -- just admit defeat > in that case and don't show the user state. (Realistically, without > CFI data, we're not going to be guaranteed to get the right state > anyway.) > > To make this work, we'd have to teach NMI-from-userspace to call the > callback itself. It would look like: > > prepare_exit_to_usermode() { > ... > while (blah blah blah) { > if (cached_flags & TIF_PERF_CAPTURE_USER_STATE) > perf_capture_user_state(); > ... > } > ... > } > > and then, on NMI exit, we'd call perf_capture_user_state directly, > since we don't want to enable IRQs or do opportunsitic sysret on exit > from NMI. (Why not? Because NMIs are still masked, and we don't want > to pay for double-IRET to unmask them, so we really want to leave IRQs > off and IRET straight back to user mode.) > > There's an unavoidable race in which we enter user mode with > TIF_PERF_CAPTURE_USER_STATE still set. In principle, we could > IPI-to-self from the NMI handler to cover that case (mostly -- we > capture the wrong state if we're on our way to an IRET fault), or we > could just check on entry if the flag is still set and, if so, admit > defeat. > > Peter, can this be done without breaking the perf ABI? If we were > designing all of this stuff from scratch right now, I'd suggest doing > it this way, but I'm not sure whether it makes sense to try to > retrofit it in. > > > If we decide to stick with option 2, then I've now convinced myself > that banning all kernel breakpoints and watchpoints during NMI > processing is probably for the best. Maybe we should go one step > farther and ban all DR7 breakpoints period. Sure, it will slow down > perf if there are user breakpoints or watchpoints set, but, having > looked at the asm, returning from #DB using RET is, while doable, > distinctly ugly. > > --Andy > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 85+ messages in thread
end of thread, other threads:[~2015-09-08 16:21 UTC | newest] Thread overview: 85+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski 2015-07-23 20:38 ` Linus Torvalds 2015-07-23 20:49 ` Andy Lutomirski 2015-07-23 21:08 ` Linus Torvalds 2015-07-23 21:31 ` Steven Rostedt 2015-07-23 21:46 ` Willy Tarreau 2015-07-23 21:46 ` Andy Lutomirski 2015-07-23 21:50 ` Willy Tarreau 2015-07-23 21:48 ` Linus Torvalds 2015-07-23 21:50 ` Andy Lutomirski 2015-07-23 21:59 ` Linus Torvalds 2015-07-24 8:13 ` Peter Zijlstra 2015-07-24 9:02 ` Willy Tarreau 2015-07-24 11:58 ` Steven Rostedt 2015-07-24 12:43 ` Peter Zijlstra 2015-07-24 13:03 ` Steven Rostedt 2015-07-24 13:21 ` Willy Tarreau 2015-07-24 13:30 ` Peter Zijlstra 2015-07-24 13:33 ` Peter Zijlstra 2015-07-24 14:31 ` Steven Rostedt 2015-07-24 14:59 ` Willy Tarreau 2015-07-24 15:16 ` Steven Rostedt 2015-07-24 15:26 ` Willy Tarreau 2015-07-24 15:30 ` Peter Zijlstra 2015-07-24 15:33 ` Willy Tarreau 2015-07-24 18:29 ` Linus Torvalds 2015-07-24 18:41 ` Linus Torvalds 2015-07-24 19:05 ` Steven Rostedt 2015-07-24 19:55 ` Peter Zijlstra 2015-07-24 20:22 ` Linus Torvalds 2015-07-24 20:51 ` Peter Zijlstra 2015-07-24 21:07 ` Steven Rostedt 2015-07-24 21:08 ` Andy Lutomirski 2015-07-30 15:41 ` Paolo Bonzini 2015-07-30 21:22 ` Andy Lutomirski 2015-07-30 21:58 ` Brian Gerst 2015-07-30 22:59 ` Thomas Gleixner 2015-07-31 4:22 ` Borislav Petkov 2015-07-31 5:11 ` Andy Lutomirski 2015-07-31 7:51 ` Paolo Bonzini 2015-07-31 8:03 ` Borislav Petkov 2015-07-31 9:27 ` Paolo Bonzini 2015-07-31 10:25 ` Borislav Petkov 2015-07-31 10:26 ` Paolo Bonzini 2015-07-31 10:32 ` Borislav Petkov 2015-09-07 5:39 ` Maciej W. Rozycki 2015-09-07 7:42 ` Ingo Molnar 2015-09-07 8:19 ` Maciej W. Rozycki 2015-09-07 10:19 ` Paolo Bonzini 2015-09-07 17:01 ` Maciej W. Rozycki 2015-09-07 17:22 ` Andy Lutomirski 2015-09-07 19:30 ` Maciej W. Rozycki 2015-09-07 21:56 ` Andy Lutomirski 2015-09-08 16:21 ` Maciej W. Rozycki 2015-07-24 23:53 ` Linus Torvalds 2015-07-24 15:34 ` Steven Rostedt 2015-07-24 15:49 ` Willy Tarreau 2015-07-24 15:48 ` Andy Lutomirski 2015-07-24 16:02 ` Steven Rostedt 2015-07-24 16:08 ` Willy Tarreau 2015-07-24 16:31 ` Steven Rostedt 2015-07-24 16:06 ` Steven Rostedt 2015-07-24 16:25 ` Willy Tarreau 2015-07-24 17:21 ` Andy Lutomirski 2015-07-24 17:10 ` Willy Tarreau 2015-07-24 17:20 ` Andy Lutomirski 2015-07-30 15:54 ` Paolo Bonzini 2015-07-24 17:21 ` Willy Tarreau 2015-07-23 20:52 ` Willy Tarreau 2015-07-23 20:53 ` Andy Lutomirski 2015-07-23 21:07 ` Willy Tarreau 2015-07-23 21:13 ` Linus Torvalds 2015-07-23 21:18 ` Willy Tarreau 2015-07-23 21:20 ` Peter Zijlstra 2015-07-23 21:35 ` Linus Torvalds 2015-07-23 21:45 ` Andy Lutomirski 2015-07-23 21:54 ` Linus Torvalds 2015-07-23 21:59 ` Andy Lutomirski 2015-07-23 22:03 ` Linus Torvalds 2015-07-24 10:28 ` Peter Zijlstra 2015-07-24 11:06 ` Peter Zijlstra 2015-07-23 21:17 ` Peter Zijlstra 2015-07-23 21:20 ` Steven Rostedt 2015-07-23 21:46 ` Andy Lutomirski 2015-07-24 16:33 ` Raymond Jennings
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).