Dealing with the NMI mess

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Dealing with the NMI mess
@ 2015-07-23 20:21 Andy Lutomirski
  2015-07-23 20:38 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 20:21 UTC (permalink / raw)
  To: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, Brian Gerst

[moved to a new thread, cc list trimmed]

Hi all-

We've considered two approaches to dealing with NMIs:

1. Allow nesting.  We know quite well how messy that is.

2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.

We haven't considered:

3. Forbid faults (other than MCE) inside NMI.

Option 3 is almost easy.  There are really only two kinds of faults
that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
fix (e.g. with my patches or Peter's patches).

What if we went all out and forbade page faults in NMI as well.  There
are two reasons that I can think of that we might page fault inside an
NMI:

a) vmalloc fault.  I think Ingo already half-implemented a rework to
eliminate vmalloc faults entirely.

b) User memory access faults.

The reason we access user state in general from an NMI is to allow
perf to capture enough user stack data to let the tooling backtrace
back to user space.  What if we did it differently?  Instead of
capturing this data in NMI context, capture it in
prepare_exit_to_usermode.  That would let us capture user state
*correctly*, which we currently can't really do.  There's a
never-ending series of minor bugs in which we try to guess the user
register state from NMI context, and it sort of works.  In
prepare_exit_to_usermode, we really truly know the user state.
There's a race where an NMI hits during or after
prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
in that case and don't show the user state.  (Realistically, without
CFI data, we're not going to be guaranteed to get the right state
anyway.)

To make this work, we'd have to teach NMI-from-userspace to call the
callback itself.  It would look like:

prepare_exit_to_usermode() {
  ...
  while (blah blah blah) {
    if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
      perf_capture_user_state();
    ...
  }
  ...
}

and then, on NMI exit, we'd call perf_capture_user_state directly,
since we don't want to enable IRQs or do opportunsitic sysret on exit
from NMI.  (Why not?  Because NMIs are still masked, and we don't want
to pay for double-IRET to unmask them, so we really want to leave IRQs
off and IRET straight back to user mode.)

There's an unavoidable race in which we enter user mode with
TIF_PERF_CAPTURE_USER_STATE still set.  In principle, we could
IPI-to-self from the NMI handler to cover that case (mostly -- we
capture the wrong state if we're on our way to an IRET fault), or we
could just check on entry if the flag is still set and, if so, admit
defeat.

Peter, can this be done without breaking the perf ABI?  If we were
designing all of this stuff from scratch right now, I'd suggest doing
it this way, but I'm not sure whether it makes sense to try to
retrofit it in.

If we decide to stick with option 2, then I've now convinced myself
that banning all kernel breakpoints and watchpoints during NMI
processing is probably for the best.  Maybe we should go one step
farther and ban all DR7 breakpoints period.  Sure, it will slow down
perf if there are user breakpoints or watchpoints set, but, having
looked at the asm, returning from #DB using RET is, while doable,
distinctly ugly.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski
@ 2015-07-23 20:38 ` Linus Torvalds
  2015-07-23 20:49   ` Andy Lutomirski
                     ` (2 more replies)
  2015-07-23 21:17 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 20:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
>
> We haven't considered:
>
> 3. Forbid faults (other than MCE) inside NMI.

I'd really prefer #2. #3 depends on us getting many things right, and
never introducing new cases in the future.

#2, in contrast, seems to be fairly localized. Yes, RF is an issue,
but returning to user space with RF clear doesn't really seem to be
all that problematic.

The point of RF is to make forward progress in the face of debug
register faults, but I don't see what was wrong with the whole
"disable any debug events that happen with interrupts disabled".

And no, I do *not* believe that we should disable debug faults ahead
of time. We should take them, disable them, and return with 'ret'. No
complex "you can't put breakpoints in this region" crap, no magic
rules, no subtle issues.

I really think your "disallow #DB" is pointless. I think your "prevent
instruction breakpoints in NMI" is wrong. Let them happen. Take them
and disable them. Return with RT clear. Go on with your life.

And the "take them and disable them" is really simple. No "am I in an
NMI contect" thing (because that leads to the whole question about
"what is NMI context"). That's not the real rule anyway.

No, make it very simple and straightforward. Make the test be "uhhuh,
I got a #DB in kernel mode, and interrupts were disabled - I know I'm
going to return with "ret", so I'm just going to have to disable this
breakpoint".

Nothing clever. Nothing subtle. Nothing that needs "this range of
instructions is magical". No.  Just a very simple rule: if the context
we return to is kernel mode and interrupts are disabled, we're using
'ret', so we cannot suppress debug faults.

Did I miss something? There were a lot of emails flying around, but I
*thought* I saw them all..

              Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:38 ` Linus Torvalds
@ 2015-07-23 20:49   ` Andy Lutomirski
  2015-07-23 21:08     ` Linus Torvalds
  2015-07-23 20:52   ` Willy Tarreau
  2015-07-23 21:20   ` Peter Zijlstra
  2 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 20:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 1:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
>>
>> We haven't considered:
>>
>> 3. Forbid faults (other than MCE) inside NMI.
>
> I'd really prefer #2. #3 depends on us getting many things right, and
> never introducing new cases in the future.
>
> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
> but returning to user space with RF clear doesn't really seem to be
> all that problematic.
>
> The point of RF is to make forward progress in the face of debug
> register faults, but I don't see what was wrong with the whole
> "disable any debug events that happen with interrupts disabled".
>
> And no, I do *not* believe that we should disable debug faults ahead
> of time. We should take them, disable them, and return with 'ret'. No
> complex "you can't put breakpoints in this region" crap, no magic
> rules, no subtle issues.
>
> I really think your "disallow #DB" is pointless. I think your "prevent
> instruction breakpoints in NMI" is wrong. Let them happen. Take them
> and disable them. Return with RT clear. Go on with your life.
>
> And the "take them and disable them" is really simple. No "am I in an
> NMI contect" thing (because that leads to the whole question about
> "what is NMI context"). That's not the real rule anyway.
>
> No, make it very simple and straightforward. Make the test be "uhhuh,
> I got a #DB in kernel mode, and interrupts were disabled - I know I'm
> going to return with "ret", so I'm just going to have to disable this
> breakpoint".
>
> Nothing clever. Nothing subtle. Nothing that needs "this range of
> instructions is magical". No.  Just a very simple rule: if the context
> we return to is kernel mode and interrupts are disabled, we're using
> 'ret', so we cannot suppress debug faults.

There are some subtleties in here.

Issue A: to return with RF clear, we need to disarm the breakpoint.
If it's limited to the duration of the NMI, that's easy.  If not, when
do we re-arm?  New prepare_exit_to_usermode hook?  Hmm, setting ti
flags during context switch may target the wrong task.

Issue B: single-step exception after SYSENTER.  The patches I just
sent fix that, though.

Issue C: #DB with invalid stack pointer (can happen due to watchpoints
during SYSCALL entry or SYSRET exit).  I guess we need to ban such
watchpoints.

Issue D: debug exception inside EFI (especially mixed-mode EFI).  We
can't return using RET, so we need to catch that case.

These issues mostly go away if we preemptively disarm DR7 early in NMI
processing and rearm it at the end.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:49   ` Andy Lutomirski
@ 2015-07-23 21:08     ` Linus Torvalds
  2015-07-23 21:31       ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 21:08 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Issue A: to return with RF clear, we need to disarm the breakpoint.
> If it's limited to the duration of the NMI, that's easy.  If not, when
> do we re-arm?  New prepare_exit_to_usermode hook?  Hmm, setting ti
> flags during context switch may target the wrong task.

We don't re-arm it.

We can entertain the notion *eventually* to do something clever, but
for now, just say: stability and simplicity is more important.

People can use tracepoints in interrupts-off code (they get rewritten
with 'int3', that's fine), but not instruction breakpoints.

> Issue C: #DB with invalid stack pointer (can happen due to watchpoints
> during SYSCALL entry or SYSRET exit).  I guess we need to ban such
> watchpoints.

.. but this is unrelated, to NMI, just "syscall is a nasty interface".
Don't we already ban them?

> Issue D: debug exception inside EFI (especially mixed-mode EFI).  We
> can't return using RET, so we need to catch that case.

If NMI code calls EFI code, then it's broken.

> These issues mostly go away if we preemptively disarm DR7 early in NMI
> processing and rearm it at the end.

I'm not *violently* opposed to that, but it's just a band-aid. It
doesn't *fix* anything. You aren't protecting against random DB
exceptions just because somebody put a data breakpoint on the NMI
stack, for example. You still get page faults. Etc etc.

So I thinkt he whole "use ret instead" is a pretty simple approach.
Make that "just work".

Then, if you want to play with dr7 inside NMI to make it more likely
that you can have breakpoints live in irq-off situation, I think
that's a magic special case. It shouldn't be part of the design.
Things should work without it.

             Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:08     ` Linus Torvalds
@ 2015-07-23 21:31       ` Steven Rostedt
  2015-07-23 21:46         ` Willy Tarreau
  2015-07-23 21:48         ` Linus Torvalds
  0 siblings, 2 replies; 85+ messages in thread
From: Steven Rostedt @ 2015-07-23 21:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Brian Gerst

On Thu, 23 Jul 2015 14:08:59 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Issue A: to return with RF clear, we need to disarm the breakpoint.
> > If it's limited to the duration of the NMI, that's easy.  If not, when
> > do we re-arm?  New prepare_exit_to_usermode hook?  Hmm, setting ti
> > flags during context switch may target the wrong task.
> 
> We don't re-arm it.
> 

Let me get this straight. The idea is in the #DB handler to detect that
it was triggered in NMI context, and if so, simply disarm that
breakpoint permanently, right?

Nothing should be adding hw breakpoints to NMI code anyway. Sounds
perfectly reasonable to me. Of course, how we tell we are in NMI
brings back all the races as we had in the nesting code. We can check
the per-cpu variable that is set with nmi_enter() and cleared at
nmi_exit() but what happens if the breakpoint is outside those calls.
We can check the stack pointer, but then we are back to userspace
fooling us. Maybe add the DF trick again?

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:31       ` Steven Rostedt
@ 2015-07-23 21:46         ` Willy Tarreau
  2015-07-23 21:46           ` Andy Lutomirski
  2015-07-23 21:48         ` Linus Torvalds
  1 sibling, 1 reply; 85+ messages in thread
From: Willy Tarreau @ 2015-07-23 21:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Peter Zijlstra, Brian Gerst

On Thu, Jul 23, 2015 at 05:31:05PM -0400, Steven Rostedt wrote:
> On Thu, 23 Jul 2015 14:08:59 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > Issue A: to return with RF clear, we need to disarm the breakpoint.
> > > If it's limited to the duration of the NMI, that's easy.  If not, when
> > > do we re-arm?  New prepare_exit_to_usermode hook?  Hmm, setting ti
> > > flags during context switch may target the wrong task.
> > 
> > We don't re-arm it.
> > 
> 
> Let me get this straight. The idea is in the #DB handler to detect that
> it was triggered in NMI context, and if so, simply disarm that
> breakpoint permanently, right?
> 
> Nothing should be adding hw breakpoints to NMI code anyway. Sounds
> perfectly reasonable to me. Of course, how we tell we are in NMI
> brings back all the races as we had in the nesting code. We can check
> the per-cpu variable that is set with nmi_enter() and cleared at
> nmi_exit() but what happens if the breakpoint is outside those calls.
> We can check the stack pointer, but then we are back to userspace
> fooling us. Maybe add the DF trick again?

Can't the back link of the TSS tell us where we come from ? At least
it should not be manipulable from user-space.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:46         ` Willy Tarreau
@ 2015-07-23 21:46           ` Andy Lutomirski
  2015-07-23 21:50             ` Willy Tarreau
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 21:46 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Steven Rostedt, Linus Torvalds, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Peter Zijlstra, Brian Gerst

On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau <w@1wt.eu> wrote:
> On Thu, Jul 23, 2015 at 05:31:05PM -0400, Steven Rostedt wrote:
>> On Thu, 23 Jul 2015 14:08:59 -0700
>> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> > On Thu, Jul 23, 2015 at 1:49 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> > >
>> > > Issue A: to return with RF clear, we need to disarm the breakpoint.
>> > > If it's limited to the duration of the NMI, that's easy.  If not, when
>> > > do we re-arm?  New prepare_exit_to_usermode hook?  Hmm, setting ti
>> > > flags during context switch may target the wrong task.
>> >
>> > We don't re-arm it.
>> >
>>
>> Let me get this straight. The idea is in the #DB handler to detect that
>> it was triggered in NMI context, and if so, simply disarm that
>> breakpoint permanently, right?
>>
>> Nothing should be adding hw breakpoints to NMI code anyway. Sounds
>> perfectly reasonable to me. Of course, how we tell we are in NMI
>> brings back all the races as we had in the nesting code. We can check
>> the per-cpu variable that is set with nmi_enter() and cleared at
>> nmi_exit() but what happens if the breakpoint is outside those calls.
>> We can check the stack pointer, but then we are back to userspace
>> fooling us. Maybe add the DF trick again?
>
> Can't the back link of the TSS tell us where we come from ? At least
> it should not be manipulable from user-space.

Not on 64-bit -- there are no tasks :)

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:46           ` Andy Lutomirski
@ 2015-07-23 21:50             ` Willy Tarreau
  0 siblings, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-23 21:50 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, Linus Torvalds, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Peter Zijlstra, Brian Gerst

On Thu, Jul 23, 2015 at 02:46:49PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 2:46 PM, Willy Tarreau <w@1wt.eu> wrote:
> > Can't the back link of the TSS tell us where we come from ? At least
> > it should not be manipulable from user-space.
> 
> Not on 64-bit -- there are no tasks :)

Ah crap, sorry for the noise then!

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:31       ` Steven Rostedt
  2015-07-23 21:46         ` Willy Tarreau
@ 2015-07-23 21:48         ` Linus Torvalds
  2015-07-23 21:50           ` Andy Lutomirski
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 21:48 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Let me get this straight. The idea is in the #DB handler to detect that
> it was triggered in NMI context, and if so, simply disarm that
> breakpoint permanently, right?

No, for simplicity, I'd make it cover not just NMI code, but any
"kernel code with interrupts disabled".

Because that's the test we'd use for "use ret instead of iret".

And that wider test is exactly because it's so damn hard to get the
exact instruction boundaries right. Let's *not* go down the path
(again) of having to get the whole %rip range and "magic stack pointer
values" etc.

Make it simple and completely unambiguous. The rule really would be:

 - if we return to kernel space and interrupts are disabled, we will
use "ret" rather than "iret"

   Hard rule. Simple. Straightforward. No random %rip values. No
random %rsp values. NO CRAP.

 - but because we use "ret" rather than "iret" we can't get RF
semantics, it means that #DB is special. RF is supposed to make us
make forward progress

   So for that reason, #DB just says "if the breakpoint happened
during that interrupts-ff reghion, I will clear %dr7 to guarantee
forward progress"

So those would be the two main rules. Very simple, and avoiding all nasty cases.

Now, I'd be willing to then hide the "oops, we clear dr7 very
agrressively" issue by having a few additional _heuristics_. But I
call them "heuristics" because unlike the current NMI nesting games,
they aren't about core stability. They are about "ok, maybe somebody
wants to trigger those faults, and we'll be _nice_ and try to make it
easy for them", but nothing more.

So for example, if that "#DB clears %dr7" happened, it sounds easy to
set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
cached value, so that if we disabled things because of some user stack
trace access, it will be re-enabled by the time we return to user
space. I think that sounds reasonable, but it's not something the core
low-level entry x86 assembly code needs to even care about. It's not
that level of "core", it's just being polite.

                 Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:48         ` Linus Torvalds
@ 2015-07-23 21:50           ` Andy Lutomirski
  2015-07-23 21:59             ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 21:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:48 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 23, 2015 at 2:31 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> Let me get this straight. The idea is in the #DB handler to detect that
>> it was triggered in NMI context, and if so, simply disarm that
>> breakpoint permanently, right?
>
> No, for simplicity, I'd make it cover not just NMI code, but any
> "kernel code with interrupts disabled".
>
> Because that's the test we'd use for "use ret instead of iret".
>
> And that wider test is exactly because it's so damn hard to get the
> exact instruction boundaries right. Let's *not* go down the path
> (again) of having to get the whole %rip range and "magic stack pointer
> values" etc.
>
> Make it simple and completely unambiguous. The rule really would be:
>
>  - if we return to kernel space and interrupts are disabled, we will
> use "ret" rather than "iret"
>
>    Hard rule. Simple. Straightforward. No random %rip values. No
> random %rsp values. NO CRAP.
>
>  - but because we use "ret" rather than "iret" we can't get RF
> semantics, it means that #DB is special. RF is supposed to make us
> make forward progress
>
>    So for that reason, #DB just says "if the breakpoint happened
> during that interrupts-ff reghion, I will clear %dr7 to guarantee
> forward progress"

What if we relax it slightly: "if the breakpoint happened during that
interrupts-off region, I will clear all *kernel breakpoints* in %dr7
to guarantee forward progress"?

Watchpoints don't need RF to make forward progress, and, by leaving
watchpoints alone, we avoid breaking gdb.

>
> So those would be the two main rules. Very simple, and avoiding all nasty cases.
>
> Now, I'd be willing to then hide the "oops, we clear dr7 very
> agrressively" issue by having a few additional _heuristics_. But I
> call them "heuristics" because unlike the current NMI nesting games,
> they aren't about core stability. They are about "ok, maybe somebody
> wants to trigger those faults, and we'll be _nice_ and try to make it
> easy for them", but nothing more.
>
> So for example, if that "#DB clears %dr7" happened, it sounds easy to
> set _TIF_USER_WORK_MASK, and just force %dr7 to be re-loaded from a
> cached value, so that if we disabled things because of some user stack
> trace access, it will be re-enabled by the time we return to user
> space. I think that sounds reasonable, but it's not something the core
> low-level entry x86 assembly code needs to even care about. It's not
> that level of "core", it's just being polite.

Once we limit it to instruction breakpoints, I don't think re-enabling
before returning to userspace matters.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:50           ` Andy Lutomirski
@ 2015-07-23 21:59             ` Linus Torvalds
  2015-07-24  8:13               ` Peter Zijlstra
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 21:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Peter Zijlstra,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:50 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> What if we relax it slightly: "if the breakpoint happened during that
> interrupts-off region, I will clear all *kernel breakpoints* in %dr7
> to guarantee forward progress"?
>
> Watchpoints don't need RF to make forward progress, and, by leaving
> watchpoints alone, we avoid breaking gdb.

Hmmm. I thought watchpoints were "before the instruction" too, but
that's just because I haven't used them in ages, and I didn't remember
the details. I just looked it up.

You're right - the memory watchpoints trigger after the instruction
has executed, so RF isn't an issue. So yes, the only issue is
instruction breakpoints, and those are the only ones we need to clear.

And that makes it really easy.

So yes, I agree. We only need to clear all kernel breakpoints.

So we don't even need that _TIF_USER_WORK_MASK thing, because user
space isn't setting kernel code breakpoints, it's just kgdb.

Sounds good to me.

                Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:59             ` Linus Torvalds
@ 2015-07-24  8:13               ` Peter Zijlstra
  2015-07-24  9:02                 ` Willy Tarreau
                                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24  8:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> Hmmm. I thought watchpoints were "before the instruction" too, but
> that's just because I haven't used them in ages, and I didn't remember
> the details. I just looked it up.
> 
> You're right - the memory watchpoints trigger after the instruction
> has executed, so RF isn't an issue. So yes, the only issue is
> instruction breakpoints, and those are the only ones we need to clear.
> 
> And that makes it really easy.
> 
> So yes, I agree. We only need to clear all kernel breakpoints.

But but but, we can access userspace with !IF, imagine someone doing:

  local_irq_disable();
  copy_from_user_inatomic();

and as luck would have it, there's a breakpoint on the user memory we
just touched. And we go and disable a user breakpoint.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24  8:13               ` Peter Zijlstra
@ 2015-07-24  9:02                 ` Willy Tarreau
  2015-07-24 11:58                 ` Steven Rostedt
  2015-07-24 15:48                 ` Andy Lutomirski
  2 siblings, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24  9:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 10:13:26AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> > Hmmm. I thought watchpoints were "before the instruction" too, but
> > that's just because I haven't used them in ages, and I didn't remember
> > the details. I just looked it up.
> > 
> > You're right - the memory watchpoints trigger after the instruction
> > has executed, so RF isn't an issue. So yes, the only issue is
> > instruction breakpoints, and those are the only ones we need to clear.
> > 
> > And that makes it really easy.
> > 
> > So yes, I agree. We only need to clear all kernel breakpoints.
> 
> But but but, we can access userspace with !IF, imagine someone doing:
> 
>   local_irq_disable();
>   copy_from_user_inatomic();
> 
> and as luck would have it, there's a breakpoint on the user memory we
> just touched. And we go and disable a user breakpoint.

Then shouldn't we use !IF && RSP matches NMI's stack ?
User-space cannot control the two at once.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24  8:13               ` Peter Zijlstra
  2015-07-24  9:02                 ` Willy Tarreau
@ 2015-07-24 11:58                 ` Steven Rostedt
  2015-07-24 12:43                   ` Peter Zijlstra
  2015-07-24 15:48                 ` Andy Lutomirski
  2 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Fri, 24 Jul 2015 10:13:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> > Hmmm. I thought watchpoints were "before the instruction" too, but
> > that's just because I haven't used them in ages, and I didn't remember
> > the details. I just looked it up.
> > 
> > You're right - the memory watchpoints trigger after the instruction
> > has executed, so RF isn't an issue. So yes, the only issue is
> > instruction breakpoints, and those are the only ones we need to clear.
> > 
> > And that makes it really easy.
> > 
> > So yes, I agree. We only need to clear all kernel breakpoints.
> 
> But but but, we can access userspace with !IF, imagine someone doing:
> 
>   local_irq_disable();
>   copy_from_user_inatomic();
> 
> and as luck would have it, there's a breakpoint on the user memory we
> just touched. And we go and disable a user breakpoint.

Where does the kernel do that to user text? I would think that user
data would only have watchpoints, and Andy and Linus said that those
would not be disabled (I'm guessing because they don't have the RF flag
set, and forward progress can proceed). If the kernel does the above to
user code and there's a breakpoint there, would it even trigger?

I'm not too familiar with how to use hw breakpoints, but I'm guessing
(correct me if I'm wrong) that breakpoints on code that trigger when
executed, but watchpoints on data trigger when accessed. Then
copy_from_user_inatomic() would only trigger on watchpoints (it's not
executing that code, at least I hope it isn't!), and those wont bother
us.

Or am I totally off base here?

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 11:58                 ` Steven Rostedt
@ 2015-07-24 12:43                   ` Peter Zijlstra
  2015-07-24 13:03                     ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 12:43 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Fri, Jul 24, 2015 at 07:58:41AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 10:13:26 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
> > > Hmmm. I thought watchpoints were "before the instruction" too, but
> > > that's just because I haven't used them in ages, and I didn't remember
> > > the details. I just looked it up.
> > > 
> > > You're right - the memory watchpoints trigger after the instruction
> > > has executed, so RF isn't an issue. So yes, the only issue is
> > > instruction breakpoints, and those are the only ones we need to clear.
> > > 
> > > And that makes it really easy.
> > > 
> > > So yes, I agree. We only need to clear all kernel breakpoints.
> > 
> > But but but, we can access userspace with !IF, imagine someone doing:
> > 
> >   local_irq_disable();
> >   copy_from_user_inatomic();
> > 
> > and as luck would have it, there's a breakpoint on the user memory we
> > just touched. And we go and disable a user breakpoint.
> 
> Where does the kernel do that to user text? I would think that user
> data would only have watchpoints, and Andy and Linus said that those
> would not be disabled (I'm guessing because they don't have the RF flag
> set, and forward progress can proceed). If the kernel does the above to
> user code and there's a breakpoint there, would it even trigger?
> 
> I'm not too familiar with how to use hw breakpoints, but I'm guessing
> (correct me if I'm wrong) that breakpoints on code that trigger when
> executed, but watchpoints on data trigger when accessed. Then
> copy_from_user_inatomic() would only trigger on watchpoints (it's not
> executing that code, at least I hope it isn't!), and those wont bother
> us.

These things can be: RW, W, X.

Sure, hitting a user X watchpoint is going to be 'interesting', but its
fairly easy to hit a RW one.

Just watch an on-stack variable and get perf to copy a huge chunk of
stack (like it does for the dwarf stuff).


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 12:43                   ` Peter Zijlstra
@ 2015-07-24 13:03                     ` Steven Rostedt
  2015-07-24 13:21                       ` Willy Tarreau
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Fri, 24 Jul 2015 14:43:04 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

 
> > I'm not too familiar with how to use hw breakpoints, but I'm guessing
> > (correct me if I'm wrong) that breakpoints on code that trigger when
> > executed, but watchpoints on data trigger when accessed. Then
> > copy_from_user_inatomic() would only trigger on watchpoints (it's not
> > executing that code, at least I hope it isn't!), and those wont bother
> > us.
> 
> These things can be: RW, W, X.
> 
> Sure, hitting a user X watchpoint is going to be 'interesting', but its
> fairly easy to hit a RW one.

But do we care if we do hit one? The return from the #DB handler can
use a RET. Right?

-- Steve


> 
> Just watch an on-stack variable and get perf to copy a huge chunk of
> stack (like it does for the dwarf stuff).


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 13:03                     ` Steven Rostedt
@ 2015-07-24 13:21                       ` Willy Tarreau
  2015-07-24 13:30                         ` Peter Zijlstra
  2015-07-24 14:31                         ` Steven Rostedt
  0 siblings, 2 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 13:21 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 09:03:42AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 14:43:04 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
>  
> > > I'm not too familiar with how to use hw breakpoints, but I'm guessing
> > > (correct me if I'm wrong) that breakpoints on code that trigger when
> > > executed, but watchpoints on data trigger when accessed. Then
> > > copy_from_user_inatomic() would only trigger on watchpoints (it's not
> > > executing that code, at least I hope it isn't!), and those wont bother
> > > us.
> > 
> > These things can be: RW, W, X.
> > 
> > Sure, hitting a user X watchpoint is going to be 'interesting', but its
> > fairly easy to hit a RW one.
> 
> But do we care if we do hit one? The return from the #DB handler can
> use a RET. Right?

My understanding is that by using RET we can't set the RF flag and #DB
will immediately strike again when the operation is attempted again. Thus
we have to completely disable the breakpoints on leaving after the first
one strikes, resulting in some userland breakpoints being missed. Maybe
it can be accepted as a limitation when perf is running. I don't know if
the output of perf is that relevant when a debugger is present BTW.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 13:21                       ` Willy Tarreau
@ 2015-07-24 13:30                         ` Peter Zijlstra
  2015-07-24 13:33                           ` Peter Zijlstra
  2015-07-24 14:31                         ` Steven Rostedt
  1 sibling, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 13:30 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 03:21:28PM +0200, Willy Tarreau wrote:
> On Fri, Jul 24, 2015 at 09:03:42AM -0400, Steven Rostedt wrote:
> > On Fri, 24 Jul 2015 14:43:04 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >  
> > > > I'm not too familiar with how to use hw breakpoints, but I'm guessing
> > > > (correct me if I'm wrong) that breakpoints on code that trigger when
> > > > executed, but watchpoints on data trigger when accessed. Then
> > > > copy_from_user_inatomic() would only trigger on watchpoints (it's not
> > > > executing that code, at least I hope it isn't!), and those wont bother
> > > > us.
> > > 
> > > These things can be: RW, W, X.
> > > 
> > > Sure, hitting a user X watchpoint is going to be 'interesting', but its
> > > fairly easy to hit a RW one.
> > 
> > But do we care if we do hit one? The return from the #DB handler can
> > use a RET. Right?

Look at do_debug(), it has lovely bits like:

	preempt_conditional_sti();

in it, we do _NOT_ want to be re-enabling interrupts if we're called
from an !IF context, that'd be _bad_.

> My understanding is that by using RET we can't set the RF flag and #DB
> will immediately strike again when the operation is attempted again. Thus
> we have to completely disable the breakpoints on leaving after the first
> one strikes, resulting in some userland breakpoints being missed. Maybe
> it can be accepted as a limitation when perf is running. I don't know if
> the output of perf is that relevant when a debugger is present BTW.

The patch I posted will re-enable the breakpoints before returning to
userspace. So userspace will only 'miss' events generated by the kernel.

Missing reads from the kernel is not a problem -- and maybe even
expected, but certainly unavoidable.

Missing updates from the kernel might be a problem, you'd get a variable
change content even though you have a W watchpoint on it, that'd be
surprising.

Then again, I suppose we can argue the variable changed through another
mapping and watchpoints work on the virtual address, so tough cookies or
somesuch -- the kernel could in fact do this on highmem kernel anyway.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 13:30                         ` Peter Zijlstra
@ 2015-07-24 13:33                           ` Peter Zijlstra
  0 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 13:33 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 03:30:13PM +0200, Peter Zijlstra wrote:
> > > But do we care if we do hit one? The return from the #DB handler can
> > > use a RET. Right?
> 
> Look at do_debug(), it has lovely bits like:
> 
> 	preempt_conditional_sti();
> 
> in it, we do _NOT_ want to be re-enabling interrupts if we're called
> from an !IF context, that'd be _bad_.

Ah, I forgot the conditional thing was the STI depending on regs->flags
& IF..

In any case, better safe than sorry and simply not do #DB ever if !IF.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 13:21                       ` Willy Tarreau
  2015-07-24 13:30                         ` Peter Zijlstra
@ 2015-07-24 14:31                         ` Steven Rostedt
  2015-07-24 14:59                           ` Willy Tarreau
  1 sibling, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 14:31 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, 24 Jul 2015 15:21:28 +0200
Willy Tarreau <w@1wt.eu> wrote:

> My understanding is that by using RET we can't set the RF flag and #DB

But the RF flag is only set for instruction (executing) breakpoints. It
is not set for data (RW) ones.

-- Steve

> will immediately strike again when the operation is attempted again. Thus
> we have to completely disable the breakpoints on leaving after the first
> one strikes, resulting in some userland breakpoints being missed. Maybe
> it can be accepted as a limitation when perf is running. I don't know if
> the output of perf is that relevant when a debugger is present BTW.
> 
> Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 14:31                         ` Steven Rostedt
@ 2015-07-24 14:59                           ` Willy Tarreau
  2015-07-24 15:16                             ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 14:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 15:21:28 +0200
> Willy Tarreau <w@1wt.eu> wrote:
> 
> > My understanding is that by using RET we can't set the RF flag and #DB
> 
> But the RF flag is only set for instruction (executing) breakpoints. It
> is not set for data (RW) ones.

True but these also are the most complicated to deal with. The data
accesses can always be emulated (not what I'm suggesting here) while
instructions are much harder to emulate.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 14:59                           ` Willy Tarreau
@ 2015-07-24 15:16                             ` Steven Rostedt
  2015-07-24 15:26                               ` Willy Tarreau
  0 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 15:16 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, 24 Jul 2015 16:59:01 +0200
Willy Tarreau <w@1wt.eu> wrote:

> On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote:
> > On Fri, 24 Jul 2015 15:21:28 +0200
> > Willy Tarreau <w@1wt.eu> wrote:
> > 
> > > My understanding is that by using RET we can't set the RF flag and #DB
> > 
> > But the RF flag is only set for instruction (executing) breakpoints. It
> > is not set for data (RW) ones.
> 
> True but these also are the most complicated to deal with. The data
> accesses can always be emulated (not what I'm suggesting here) while
> instructions are much harder to emulate.

The point is, if we trigger a #DB on an instruction breakpoint
while !IF, then we simply disable that breakpoint and do the RET. What
emulation is needed?

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:16                             ` Steven Rostedt
@ 2015-07-24 15:26                               ` Willy Tarreau
  2015-07-24 15:30                                 ` Peter Zijlstra
  2015-07-24 15:34                                 ` Steven Rostedt
  0 siblings, 2 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 15:26 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 11:16:21AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 16:59:01 +0200
> Willy Tarreau <w@1wt.eu> wrote:
> 
> > On Fri, Jul 24, 2015 at 10:31:27AM -0400, Steven Rostedt wrote:
> > > On Fri, 24 Jul 2015 15:21:28 +0200
> > > Willy Tarreau <w@1wt.eu> wrote:
> > > 
> > > > My understanding is that by using RET we can't set the RF flag and #DB
> > > 
> > > But the RF flag is only set for instruction (executing) breakpoints. It
> > > is not set for data (RW) ones.
> > 
> > True but these also are the most complicated to deal with. The data
> > accesses can always be emulated (not what I'm suggesting here) while
> > instructions are much harder to emulate.
> 
> The point is, if we trigger a #DB on an instruction breakpoint
> while !IF, then we simply disable that breakpoint and do the RET.

Yes but the breakpoint remains disabled then. Or I'm missing
something.

> What emulation is needed?

I was speaking about redoing the operation with BP disabled before
re-enabling it. But that's not the point here anyway.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:26                               ` Willy Tarreau
@ 2015-07-24 15:30                                 ` Peter Zijlstra
  2015-07-24 15:33                                   ` Willy Tarreau
  2015-07-24 18:29                                   ` Linus Torvalds
  2015-07-24 15:34                                 ` Steven Rostedt
  1 sibling, 2 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 15:30 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
> > 
> > The point is, if we trigger a #DB on an instruction breakpoint
> > while !IF, then we simply disable that breakpoint and do the RET.
> 
> Yes but the breakpoint remains disabled then. Or I'm missing
> something.

http://marc.info/?l=linux-kernel&m=143773601130974

We re-enable before going back to userspace.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:30                                 ` Peter Zijlstra
@ 2015-07-24 15:33                                   ` Willy Tarreau
  2015-07-24 18:29                                   ` Linus Torvalds
  1 sibling, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Steven Rostedt, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 05:30:54PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
> > > 
> > > The point is, if we trigger a #DB on an instruction breakpoint
> > > while !IF, then we simply disable that breakpoint and do the RET.
> > 
> > Yes but the breakpoint remains disabled then. Or I'm missing
> > something.
> 
> http://marc.info/?l=linux-kernel&m=143773601130974
> 
> We re-enable before going back to userspace.

Ah OK thanks Peter. I'm sorry if I'm adding more noise than
anything here, it's hard to follow and it becomes a bit complex.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:30                                 ` Peter Zijlstra
  2015-07-24 15:33                                   ` Willy Tarreau
@ 2015-07-24 18:29                                   ` Linus Torvalds
  2015-07-24 18:41                                     ` Linus Torvalds
  2015-07-24 19:55                                     ` Peter Zijlstra
  1 sibling, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2015-07-24 18:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 8:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
>> >
>> > The point is, if we trigger a #DB on an instruction breakpoint
>> > while !IF, then we simply disable that breakpoint and do the RET.
>>
>> Yes but the breakpoint remains disabled then. Or I'm missing
>> something.
>
> http://marc.info/?l=linux-kernel&m=143773601130974
>
> We re-enable before going back to userspace.

Actually, Andy had a good argument that we don't even need this.

We just don't ever need to disable data breakpoints. Even if we end up doing

        cli();
        copy_from_user_inatomic();

that actually works fine. If there are data breakpoints, we will have

 (a) things will run slow as hell anyway. Intel CPU's slow down to a
relative crawl.

 (b) let's say we have a data breakpoint on the data we're reading above

 (c) we take a #DB fault after the instruction has completed, so we
make forward progress even if we return with RF clear

 (d) even if the data breakpoint is unaligned and triggers multiple
times, it's going to be a "small number" of multiple times. And see
(a). This never happens in practice, and the much bigger slowdown is
how data breakpoints tend to slow things down in general.

 (e) yes, the string instructions may hit the data breakpoint multilpe
times for the "same" instruction, but the forward progress part is
still true even for the string instructions. In fact, it's actually
likely <i>more</i> true for string instructions, because they are
documented to be less exact, and may trigger the data watchpoint only
after a "group of iterations".

so I think we just leave data breakpoint alone. The only debug
conditions that are *faults* rather than traps are the instruction
breakpoints, and we can detect and disable those by just saying "oh,
we're in kernel mode".

So in the #DB handler, we would basically only clear instruction
breakpoints, and only when they trigger. If we have a data breakpoint
that triggers (even in kernel mode, and with interrupts disabled), let
it trigger and return with "ret" anyway. No biggie.

(Ok, so the "General detect fault" is also a fault rather than a trap,
but that's the "write to debug registers when it's disabled" thing,
very different)

                  Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 18:29                                   ` Linus Torvalds
@ 2015-07-24 18:41                                     ` Linus Torvalds
  2015-07-24 19:05                                       ` Steven Rostedt
  2015-07-24 19:55                                     ` Peter Zijlstra
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-24 18:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So in the #DB handler, we would basically only clear instruction
> breakpoints, and only when they trigger. If we have a data breakpoint
> that triggers (even in kernel mode, and with interrupts disabled), let
> it trigger and return with "ret" anyway. No biggie.

So we'd not only look at "which breakpoint triggered", we'd also look
at the actual debug register and check that "R/Wn == 0", and only
disable it for that case.

So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether
it triggerd (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7)
is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7.

Something like

        unsigned long mask = 0;
        unsigned int dr6 = debug_read(6);
        unsigned int dr7 = debug_read(7)
        int i;

        for (i = 0; i < 4; i++) {
                if ((dr6 >> i) & 1) {
                        if (!((dr7 >> (4*i+16)) & 3))
                                mask |= 3 << (i*2);
                }
        }

        if (mask)
                debug_write(dr7 & ~mask, 7);

(yeah, I could easily have screwed that up)

But the above should only clear bits in dr7 that are actually
associated with the instruction breakpoint that triggered, and since
it's a _kernel_ instruction breakpoint, not a user one, we can clear
it and forget it. No need to re-enable at all.

Hmm?

                       Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 18:41                                     ` Linus Torvalds
@ 2015-07-24 19:05                                       ` Steven Rostedt
  0 siblings, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 19:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Willy Tarreau, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, 24 Jul 2015 11:41:55 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Jul 24, 2015 at 11:29 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So in the #DB handler, we would basically only clear instruction
> > breakpoints, and only when they trigger. If we have a data breakpoint
> > that triggers (even in kernel mode, and with interrupts disabled), let
> > it trigger and return with "ret" anyway. No biggie.
> 
> So we'd not only look at "which breakpoint triggered", we'd also look
> at the actual debug register and check that "R/Wn == 0", and only
> disable it for that case.
> 
> So you'd read %dr6 and %dr7, and then iterate 0..3 and check whether
> it triggerd (bit #n in %dr6), and that R/Wn (bits 16-17+n*4 of %dr7)
> is zero, and if so, clear LGn bits (bits 0-1+n*2) in %dr7.
> 
> Something like
> 
>         unsigned long mask = 0;
>         unsigned int dr6 = debug_read(6);
>         unsigned int dr7 = debug_read(7)
>         int i;
> 
>         for (i = 0; i < 4; i++) {
>                 if ((dr6 >> i) & 1) {
>                         if (!((dr7 >> (4*i+16)) & 3))
>                                 mask |= 3 << (i*2);
>                 }
>         }
> 
>         if (mask)
>                 debug_write(dr7 & ~mask, 7);

Macros would be nice for readability.

	for (i = 0; i < 4; i++) {
		if ((dr6 >> i) & 1) {
			int shift = DR_CONTROL_SIZE * i + DR_CONTROL_SHIFT;
			if (!((dr7 >> shift) & DR_RW_READ))
				mask |= (DR_LOCAL_ENABLE|DR_GLOBAL_ENABLE) << (i * DR_ENABLE_SIZE);
		}
	}

-- Steve

> 
> (yeah, I could easily have screwed that up)
> 
> But the above should only clear bits in dr7 that are actually
> associated with the instruction breakpoint that triggered, and since
> it's a _kernel_ instruction breakpoint, not a user one, we can clear
> it and forget it. No need to re-enable at all.
> 
> Hmm?
> 
>                        Linus


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 18:29                                   ` Linus Torvalds
  2015-07-24 18:41                                     ` Linus Torvalds
@ 2015-07-24 19:55                                     ` Peter Zijlstra
  2015-07-24 20:22                                       ` Linus Torvalds
  1 sibling, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 19:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 11:29:29AM -0700, Linus Torvalds wrote:
> On Fri, Jul 24, 2015 at 8:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Jul 24, 2015 at 05:26:37PM +0200, Willy Tarreau wrote:
> >> >
> >> > The point is, if we trigger a #DB on an instruction breakpoint
> >> > while !IF, then we simply disable that breakpoint and do the RET.
> >>
> >> Yes but the breakpoint remains disabled then. Or I'm missing
> >> something.
> >
> > http://marc.info/?l=linux-kernel&m=143773601130974
> >
> > We re-enable before going back to userspace.
> 
> Actually, Andy had a good argument that we don't even need this.
> 
> We just don't ever need to disable data breakpoints. Even if we end up doing
> 
>         cli();
>         copy_from_user_inatomic();
> 
> that actually works fine. If there are data breakpoints, we will have

I worry that we'll end up running the do_debug() handlers from effective
NMI context.

The NMI might have preempted locks which these handlers require etc..

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 19:55                                     ` Peter Zijlstra
@ 2015-07-24 20:22                                       ` Linus Torvalds
  2015-07-24 20:51                                         ` Peter Zijlstra
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-24 20:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> I worry that we'll end up running the do_debug() handlers from effective
> NMI context.
>
> The NMI might have preempted locks which these handlers require etc..

If #DB takes any locks like that, then #DB is broken.

Pretty much by definition, a data breakpoint can happen on pretty much
absolutely any code. This is in no way NMI-specific as far as I can
tell.

Do we really take locks in the #DB handler?

                     Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 20:22                                       ` Linus Torvalds
@ 2015-07-24 20:51                                         ` Peter Zijlstra
  2015-07-24 21:07                                           ` Steven Rostedt
                                                             ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 20:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 01:22:11PM -0700, Linus Torvalds wrote:
> On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > I worry that we'll end up running the do_debug() handlers from effective
> > NMI context.
> >
> > The NMI might have preempted locks which these handlers require etc..
> 
> If #DB takes any locks like that, then #DB is broken.
> 
> Pretty much by definition, a data breakpoint can happen on pretty much
> absolutely any code. This is in no way NMI-specific as far as I can
> tell.
> 
> Do we really take locks in the #DB handler?

do_debug()
  send_sigtrap()
    force_sig_info()
      spin_lock_irqsave()

Now, I don't pretend to understand the condition before send_sigtrap(),
so it _might_ be ok, but it sure as heck could do with a comment.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 20:51                                         ` Peter Zijlstra
@ 2015-07-24 21:07                                           ` Steven Rostedt
  2015-07-24 21:08                                           ` Andy Lutomirski
  2015-07-24 23:53                                           ` Linus Torvalds
  2 siblings, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 21:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Willy Tarreau, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, 24 Jul 2015 22:51:19 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

 
> > Do we really take locks in the #DB handler?
> 
> do_debug()
>   send_sigtrap()
>     force_sig_info()
>       spin_lock_irqsave()
> 
> Now, I don't pretend to understand the condition before send_sigtrap(),
> so it _might_ be ok, but it sure as heck could do with a comment.

Or that force_sig_info() in send_sigtrap() looks like it can easily be
change to use an irq work queue.

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 20:51                                         ` Peter Zijlstra
  2015-07-24 21:07                                           ` Steven Rostedt
@ 2015-07-24 21:08                                           ` Andy Lutomirski
  2015-07-30 15:41                                             ` Paolo Bonzini
  2015-07-24 23:53                                           ` Linus Torvalds
  2 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-24 21:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 1:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Jul 24, 2015 at 01:22:11PM -0700, Linus Torvalds wrote:
>> On Fri, Jul 24, 2015 at 12:55 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > I worry that we'll end up running the do_debug() handlers from effective
>> > NMI context.
>> >
>> > The NMI might have preempted locks which these handlers require etc..
>>
>> If #DB takes any locks like that, then #DB is broken.
>>
>> Pretty much by definition, a data breakpoint can happen on pretty much
>> absolutely any code. This is in no way NMI-specific as far as I can
>> tell.
>>
>> Do we really take locks in the #DB handler?
>
> do_debug()
>   send_sigtrap()
>     force_sig_info()
>       spin_lock_irqsave()
>
> Now, I don't pretend to understand the condition before send_sigtrap(),
> so it _might_ be ok, but it sure as heck could do with a comment.

Let's try to decode it.

user_icebp is set if int $0x01 happens, except it isn't because user
code can't actually do that -- it'll cause #GP instead.

user_icebp is also set if the user has a bloody in-circuit emulator,
given the name.  But who on Earth has one of those on a system new
enough to run Linux and, even if they have one, why on Earth are they
using it to send SIGTRAP.

In any event, user_icebp is only set if user_mode(regs), so it's safe
locking-wise.  But please let's delete it.

Otherwise, we do send_sigtrap if we got a single-step exception from
user mode (because we suppress single-step exceptions from kernel mode
a couple lines above, but we should really BUG on those except for the
single case of SYSENTER with TF set) or if we get a breakpoint
exception that wasn't eaten by perf.

For *#&!'s sake. we should rewrite this pile of crap.

// before kprobes and notify_die
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
if (!user_mode(regs) && regs->ip == sysenter_target) {
    fix it up and return;
}

notify_die, etc.

preempt_conditional_sti(regs);
do_trap(X86_TRAP_DB, SIGTRAP, "debug", regs, error_code, NULL);
preempt_conditional_cli(regs);

except we should do something to disallow fixup_exception here.  Or we
could open-code if(user_mode) send_sigtrap() else die() here.

I really don't think that we should be sending signals to userspace
due to user address watchpoints that hit in kernel mode.  Or, if we do
think we should send signals for those, then, as Steven said, we
should make that explicit and use IRQ work for that.

As it stands, this is probably an exploitable DoS -- just point a
watchpoint down the stack a little bit from yourself and call raise().

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 21:08                                           ` Andy Lutomirski
@ 2015-07-30 15:41                                             ` Paolo Bonzini
  2015-07-30 21:22                                               ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Paolo Bonzini @ 2015-07-30 15:41 UTC (permalink / raw)
  To: Andy Lutomirski, Peter Zijlstra
  Cc: Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst



On 24/07/2015 23:08, Andy Lutomirski wrote:
> user_icebp is set if int $0x01 happens, except it isn't because user
> code can't actually do that -- it'll cause #GP instead.
> 
> user_icebp is also set if the user has a bloody in-circuit emulator,
> given the name.  But who on Earth has one of those on a system new
> enough to run Linux and, even if they have one, why on Earth are they
> using it to send SIGTRAP.

You do not need either "int $0x01" or an ICE to set user_icebp = 1.  You
can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
instead of #BP.

The historical name is ICEBP because in-circuit emulators used it for
software breakpoints, just like your usual debugger used 0xcc aka int3.
 And just like 0xcc it's unprivileged, so you can actually get a SIGTRAP
with asm(".byte 0xf1").

So...

> In any event, user_icebp is only set if user_mode(regs), so it's safe
> locking-wise.  But please let's delete it.

... it's safe, but it has some use (!).

Paolo

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-30 15:41                                             ` Paolo Bonzini
@ 2015-07-30 21:22                                               ` Andy Lutomirski
  2015-07-30 21:58                                                 ` Brian Gerst
                                                                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-30 21:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt,
	X86 ML, linux-kernel@vger.kernel.org, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 24/07/2015 23:08, Andy Lutomirski wrote:
>> user_icebp is set if int $0x01 happens, except it isn't because user
>> code can't actually do that -- it'll cause #GP instead.
>>
>> user_icebp is also set if the user has a bloody in-circuit emulator,
>> given the name.  But who on Earth has one of those on a system new
>> enough to run Linux and, even if they have one, why on Earth are they
>> using it to send SIGTRAP.
>
> You do not need either "int $0x01" or an ICE to set user_icebp = 1.  You
> can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
> instead of #BP.

Great.  There's an opcode that invokes an interrupt gate that's not
marked as allowing unprivileged access, and that opcode doesn't appear
in the SDM.  It appears in the APM opcode map with no explanation at
all.

Thanks, CPU vendors.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-30 21:22                                               ` Andy Lutomirski
@ 2015-07-30 21:58                                                 ` Brian Gerst
  2015-07-30 22:59                                                 ` Thomas Gleixner
  2015-07-31  4:22                                                 ` Borislav Petkov
  2 siblings, 0 replies; 85+ messages in thread
From: Brian Gerst @ 2015-07-30 21:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Thomas Gleixner

On Thu, Jul 30, 2015 at 5:22 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 24/07/2015 23:08, Andy Lutomirski wrote:
>>> user_icebp is set if int $0x01 happens, except it isn't because user
>>> code can't actually do that -- it'll cause #GP instead.
>>>
>>> user_icebp is also set if the user has a bloody in-circuit emulator,
>>> given the name.  But who on Earth has one of those on a system new
>>> enough to run Linux and, even if they have one, why on Earth are they
>>> using it to send SIGTRAP.
>>
>> You do not need either "int $0x01" or an ICE to set user_icebp = 1.  You
>> can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
>> instead of #BP.
>
> Great.  There's an opcode that invokes an interrupt gate that's not
> marked as allowing unprivileged access, and that opcode doesn't appear
> in the SDM.  It appears in the APM opcode map with no explanation at
> all.
>
> Thanks, CPU vendors.
>
> --Andy

Some Windows programs (running in Wine) use this opcode for
anti-debugging code.  See commit
a1e80fafc9f0742a1776a0490258cb64912411b0.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-30 21:22                                               ` Andy Lutomirski
  2015-07-30 21:58                                                 ` Brian Gerst
@ 2015-07-30 22:59                                                 ` Thomas Gleixner
  2015-07-31  4:22                                                 ` Borislav Petkov
  2 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2015-07-30 22:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Brian Gerst



On Thu, 30 Jul 2015, Andy Lutomirski wrote:

> On Thu, Jul 30, 2015 at 8:41 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >
> >
> > On 24/07/2015 23:08, Andy Lutomirski wrote:
> >> user_icebp is set if int $0x01 happens, except it isn't because user
> >> code can't actually do that -- it'll cause #GP instead.
> >>
> >> user_icebp is also set if the user has a bloody in-circuit emulator,
> >> given the name.  But who on Earth has one of those on a system new
> >> enough to run Linux and, even if they have one, why on Earth are they
> >> using it to send SIGTRAP.
> >
> > You do not need either "int $0x01" or an ICE to set user_icebp = 1.  You
> > can use the 0xf1 opcode, which is kinda like 0xcc but generates #DB
> > instead of #BP.
> 
> Great.  There's an opcode that invokes an interrupt gate that's not
> marked as allowing unprivileged access, and that opcode doesn't appear
> in the SDM.  It appears in the APM opcode map with no explanation at
> all.

The only SDM reference I found is:

  "The opcodes D6 and F1 are undefined opcodes reserved by the Intel 64
   and IA-32 architectures. These opcodes, even though undefined, do
   not generate an invalid opcode exception."

D6 is actually something useful:

   if (carry flag set)
      AL = FF
   else
      AL = 0

It's been there since i386. It has been conveniant for return code
magic from ASM to C. I haven't thought of it for at least a decade :)

So all we need to worry about is F1, but thats bad enough :(

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-30 21:22                                               ` Andy Lutomirski
  2015-07-30 21:58                                                 ` Brian Gerst
  2015-07-30 22:59                                                 ` Thomas Gleixner
@ 2015-07-31  4:22                                                 ` Borislav Petkov
  2015-07-31  5:11                                                   ` Andy Lutomirski
  2 siblings, 1 reply; 85+ messages in thread
From: Borislav Petkov @ 2015-07-31  4:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Thomas Gleixner, Brian Gerst

On Thu, Jul 30, 2015 at 02:22:06PM -0700, Andy Lutomirski wrote:
> Great.  There's an opcode that invokes an interrupt gate that's not
> marked as allowing unprivileged access, and that opcode doesn't appear
> in the SDM.  It appears in the APM opcode map with no explanation at
> all.
> 
> Thanks, CPU vendors.

Here's something better:

http://www.rcollins.org/secrets/opcodes/ICEBP.html

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31  4:22                                                 ` Borislav Petkov
@ 2015-07-31  5:11                                                   ` Andy Lutomirski
  2015-07-31  7:51                                                     ` Paolo Bonzini
  2015-07-31  8:03                                                     ` Borislav Petkov
  0 siblings, 2 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-31  5:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Thomas Gleixner, Brian Gerst

On Thu, Jul 30, 2015 at 9:22 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Jul 30, 2015 at 02:22:06PM -0700, Andy Lutomirski wrote:
>> Great.  There's an opcode that invokes an interrupt gate that's not
>> marked as allowing unprivileged access, and that opcode doesn't appear
>> in the SDM.  It appears in the APM opcode map with no explanation at
>> all.
>>
>> Thanks, CPU vendors.
>
> Here's something better:
>
> http://www.rcollins.org/secrets/opcodes/ICEBP.html

This instruction is awesome.  Binutils can disassemble it (it's called
"icebp") but it can't assemble it.  KVM has special handling for it on
VMX and actually reports it to QEMU on SVM (complete with a defined
ABI).  We have an asm macro so we can assemble it for 32-bit but not
64-bit, despite the fact that it works on 64-bit.

The kernel instruction decoder can't decode it.

Fortunately, it looks like the vm86 case is correct (or as correct as
any of the vm86 junk can be), although I haven't tested it.  I bet
that icebp is like int3 in that it punches through vm86 mode instead
of sending #GP.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31  5:11                                                   ` Andy Lutomirski
@ 2015-07-31  7:51                                                     ` Paolo Bonzini
  2015-07-31  8:03                                                     ` Borislav Petkov
  1 sibling, 0 replies; 85+ messages in thread
From: Paolo Bonzini @ 2015-07-31  7:51 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov
  Cc: Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt,
	X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner,
	Brian Gerst



On 31/07/2015 07:11, Andy Lutomirski wrote:
> This instruction is awesome.  Binutils can disassemble it (it's called
> "icebp") but it can't assemble it.  KVM has special handling for it on
> VMX and actually reports it to QEMU on SVM (complete with a defined
> ABI).

FWIW it's not reported to QEMU, it's only reported to a nested
hypervisor.  So the ABI is simply the SVM spec.

It's not surprising that VMX support was provided by the Wine guys...

Paolo

> We have an asm macro so we can assemble it for 32-bit but not
> 64-bit, despite the fact that it works on 64-bit.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31  5:11                                                   ` Andy Lutomirski
  2015-07-31  7:51                                                     ` Paolo Bonzini
@ 2015-07-31  8:03                                                     ` Borislav Petkov
  2015-07-31  9:27                                                       ` Paolo Bonzini
  2015-09-07  5:39                                                       ` Maciej W. Rozycki
  1 sibling, 2 replies; 85+ messages in thread
From: Borislav Petkov @ 2015-07-31  8:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Thomas Gleixner, Brian Gerst

On Thu, Jul 30, 2015 at 10:11:40PM -0700, Andy Lutomirski wrote:
> This instruction is awesome.  Binutils can disassemble it (it's called
> "icebp") but it can't assemble it.  KVM has special handling for it on
> VMX and actually reports it to QEMU on SVM (complete with a defined
> ABI).

Fun.

> We have an asm macro so we can assemble it for 32-bit but not
> 64-bit, despite the fact that it works on 64-bit.
> 
> The kernel instruction decoder can't decode it.

Yeah, the kernel insn decoder needs to be fixed. Even my decoder can
decode it:

$ echo "0xf1" | ./x86d -
0:       f1                      icebp

Big deal. :-)

Let's do some fun and games:

$ cat icebp.c
int main()
{
        asm volatile(".byte 0xf1");

        return 0;
}

$ gcc -Wall -o icebp{,.c}
$ objdump -d icebp

...

00000000004004ac <main>:
  4004ac:       55                      push   %rbp
  4004ad:       48 89 e5                mov    %rsp,%rbp
  4004b0:       f1                      icebp  
  4004b1:       b8 00 00 00 00          mov    $0x0,%eax
  4004b6:       5d                      pop    %rbp
  4004b7:       c3                      retq   
  4004b8:       90                      nop
...

$ ./icebp
Trace/breakpoint trap

^ this in qemu.

On baremetal it gets a SIGTRAP with TRAP_BRKPT. Looks like signal
handling knows about it...

$ strace /tmp/icebp
execve("/tmp/icebp", ["/tmp/icebp"], [/* 27 vars */]) = 0
brk(0)                                  = 0x1680000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e243d000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=127070, ...}) = 0
mmap(NULL, 127070, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f71e241d000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1729984, ...}) = 0
mmap(NULL, 3836448, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f71e1e76000
mprotect(0x7f71e2015000, 2097152, PROT_NONE) = 0
mmap(0x7f71e2215000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19f000) = 0x7f71e2215000
mmap(0x7f71e221b000, 14880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f71e221b000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241b000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f71e241a000
arch_prctl(ARCH_SET_FS, 0x7f71e241b700) = 0
mprotect(0x7f71e2215000, 16384, PROT_READ) = 0
mprotect(0x7f71e243f000, 4096, PROT_READ) = 0
munmap(0x7f71e241d000, 127070)          = 0
--- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=4195505, si_uid=0} ---
+++ killed by SIGTRAP +++
Trace/breakpoint trap

> Fortunately, it looks like the vm86 case is correct (or as correct as
> any of the vm86 junk can be), although I haven't tested it.  I bet
> that icebp is like int3 in that it punches through vm86 mode instead
> of sending #GP.

Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.

But why do you say it is special - it simply raises #DB, i.e. vector 1.
Web page seems to say so when interrupt redirection is disabled. It
sounds like a nice and quick way to generate a breakpoint. You can do
that with INT 01, i.e., the CD opcode, too.

If I'd had to guess, it isn't documented because of the proprietary ICE
aspect. And no one uses ICEs anymore so it is going to be forgotten with
people popping off and on and asking about the undocumented opcode.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31  8:03                                                     ` Borislav Petkov
@ 2015-07-31  9:27                                                       ` Paolo Bonzini
  2015-07-31 10:25                                                         ` Borislav Petkov
  2015-09-07  5:39                                                       ` Maciej W. Rozycki
  1 sibling, 1 reply; 85+ messages in thread
From: Paolo Bonzini @ 2015-07-31  9:27 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski
  Cc: Peter Zijlstra, Linus Torvalds, Willy Tarreau, Steven Rostedt,
	X86 ML, linux-kernel@vger.kernel.org, Thomas Gleixner,
	Brian Gerst



On 31/07/2015 10:03, Borislav Petkov wrote:
> $ ./icebp
> Trace/breakpoint trap
> 
> ^ this in qemu.

Is the strace different between KVM and baremetal?  QEMU dynamic
translation is broken I think, but KVM should be the same as baremetal.

>> Fortunately, it looks like the vm86 case is correct (or as correct as
>> any of the vm86 junk can be), although I haven't tested it.  I bet
>> that icebp is like int3 in that it punches through vm86 mode instead
>> of sending #GP.
> 
> Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.

No, it sends #GP.

> But why do you say it is special - it simply raises #DB, i.e. vector 1.
> Web page seems to say so when interrupt redirection is disabled. It
> sounds like a nice and quick way to generate a breakpoint. You can do
> that with INT 01, i.e., the CD opcode, too.
> 
> If I'd had to guess, it isn't documented because of the proprietary ICE
> aspect. And no one uses ICEs anymore so it is going to be forgotten with
> people popping off and on and asking about the undocumented opcode.

The reason why it isn't documented is probably hidden within Intel.
Besides ICEBP, which is a bit fringe, there's no reason not to document
SALC which Thomas mentioned.  SALC all has been there since the 8086,
and has been undocumented for thirty-odd years.

The AAM/AAD variants with immediates other than 10 also have been
undocumented for fifteen years or so (an instruction doing a division by
10 where the second byte of the opcode is 10? oh, certainly no one is
going to try changing the second byte...)

Paolo

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31  9:27                                                       ` Paolo Bonzini
@ 2015-07-31 10:25                                                         ` Borislav Petkov
  2015-07-31 10:26                                                           ` Paolo Bonzini
  0 siblings, 1 reply; 85+ messages in thread
From: Borislav Petkov @ 2015-07-31 10:25 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Thomas Gleixner, Brian Gerst

On Fri, Jul 31, 2015 at 11:27:13AM +0200, Paolo Bonzini wrote:
> Is the strace different between KVM and baremetal?

Yes, the signal part is missing from kvm:

$ strace ./icebp
execve("./icebp", ["./icebp"], [/* 20 vars */]) = 0
brk(0)                                  = 0x601000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7ff6000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=95207, ...}) = 0
mmap(NULL, 95207, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ffff7fde000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300\357\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1595408, ...}) = 0
mmap(NULL, 3709016, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7ffff7a53000
mprotect(0x7ffff7bd3000, 2097152, PROT_NONE) = 0
mmap(0x7ffff7dd3000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x180000) = 0x7ffff7dd3000
mmap(0x7ffff7dd8000, 18520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7ffff7dd8000
close(3)                                = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffff7fdb000
arch_prctl(ARCH_SET_FS, 0x7ffff7fdc700) = 0
mprotect(0x7ffff7dd3000, 16384, PROT_READ) = 0
mprotect(0x7ffff7ffc000, 4096, PROT_READ) = 0
munmap(0x7ffff7fde000, 95207)           = 0
exit_group(0)                           = ?

> No, it sends #GP.

True story:

[  697.707990] traps: icebp[3537] general protection ip:4004b0 sp:7fffffffe610 error:a in icebp[400000+1000]

but why? I guess our IDT entry at 1 is funny... Too lazy to check.

> The reason why it isn't documented is probably hidden within Intel.
> Besides ICEBP, which is a bit fringe, there's no reason not to document
> SALC which Thomas mentioned.  SALC all has been there since the 8086,
> and has been undocumented for thirty-odd years.

That one is invalid (on an IVB):

[ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]

AMD APM documents it as invalid too.

> The AAM/AAD variants with immediates other than 10 also have been
> undocumented for fifteen years or so (an instruction doing a division
> by 10 where the second byte of the opcode is 10? oh, certainly no one
> is going to try changing the second byte...)

There's this in the AMD APM:

"In most modern assemblers, the AAM instruction adjusts to base-10
values. However, by coding the instruction directly in binary, it can
adjust to any base specified by the immediate byte value (ib) suffixed
onto the D4h opcode. For example, code D408h for octal, D40Ah for
decimal, and D40Ch for duodecimal (base 12)."

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31 10:25                                                         ` Borislav Petkov
@ 2015-07-31 10:26                                                           ` Paolo Bonzini
  2015-07-31 10:32                                                             ` Borislav Petkov
  0 siblings, 1 reply; 85+ messages in thread
From: Paolo Bonzini @ 2015-07-31 10:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Thomas Gleixner, Brian Gerst



On 31/07/2015 12:25, Borislav Petkov wrote:
>> > The reason why it isn't documented is probably hidden within Intel.
>> > Besides ICEBP, which is a bit fringe, there's no reason not to document
>> > SALC which Thomas mentioned.  SALC all has been there since the 8086,
>> > and has been undocumented for thirty-odd years.
> That one is invalid (on an IVB):
> 
> [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
> 
> AMD APM documents it as invalid too.

It's valid in 32-bit.

Paolo

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31 10:26                                                           ` Paolo Bonzini
@ 2015-07-31 10:32                                                             ` Borislav Petkov
  0 siblings, 0 replies; 85+ messages in thread
From: Borislav Petkov @ 2015-07-31 10:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, Willy Tarreau,
	Steven Rostedt, X86 ML, linux-kernel@vger.kernel.org,
	Thomas Gleixner, Brian Gerst

On Fri, Jul 31, 2015 at 12:26:34PM +0200, Paolo Bonzini wrote:
> 
> 
> On 31/07/2015 12:25, Borislav Petkov wrote:
> >> > The reason why it isn't documented is probably hidden within Intel.
> >> > Besides ICEBP, which is a bit fringe, there's no reason not to document
> >> > SALC which Thomas mentioned.  SALC all has been there since the 8086,
> >> > and has been undocumented for thirty-odd years.
> > That one is invalid (on an IVB):
> > 
> > [ 1306.231408] traps: icebp[3783] trap invalid opcode ip:4004b0 sp:7fffffffe610 error:0 in icebp[400000+1000]
> > 
> > AMD APM documents it as invalid too.
> 
> It's valid in 32-bit.

Yap, no invalid opcode there. I guess there's another bug in the APM's
opcode table then.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-31  8:03                                                     ` Borislav Petkov
  2015-07-31  9:27                                                       ` Paolo Bonzini
@ 2015-09-07  5:39                                                       ` Maciej W. Rozycki
  2015-09-07  7:42                                                         ` Ingo Molnar
  1 sibling, 1 reply; 85+ messages in thread
From: Maciej W. Rozycki @ 2015-09-07  5:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Paolo Bonzini, Peter Zijlstra, Linus Torvalds,
	Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Fri, 31 Jul 2015, Borislav Petkov wrote:

> Yeah, INT 1. I wonder whether INT 1, i.e. CD imm8 does the same thing.
> 
> But why do you say it is special - it simply raises #DB, i.e. vector 1.
> Web page seems to say so when interrupt redirection is disabled. It
> sounds like a nice and quick way to generate a breakpoint. You can do
> that with INT 01, i.e., the CD opcode, too.
> 
> If I'd had to guess, it isn't documented because of the proprietary ICE
> aspect. And no one uses ICEs anymore so it is going to be forgotten with
> people popping off and on and asking about the undocumented opcode.

 FYI, it's actually still in use with modern hardware, as a software 
breakpoint (and hence it has to be a single byte INT1 instruction rather 
than a multiple-byte regular INT 1 encoding) with JTAG probe hardware used 
for bare-metal debugging.  E.g. Intel Atom supports it and boards have 
been available with a JTAG connector, which Intel calls XDP aka Extended 
Debug Port, e.g. the D945GCLF board (aka Crown Beach IIRC) had one.

 By fiddling with some bits in the CPU, which are only accessible through 
JTAG, probe firmware takes control over #DB making it trap into the debug 
mode rather than into the kernel.  As noted above INT1 is used rather than 
INT3 (which still traps into the kernel with #BP as usually) for software 
breakpoints, but all the other DR0-7 resources are also available to the 
probe and the General Detect fault is used to prevent the kernel from 
fiddling with them.  Similarly single-stepping traps into probe firmware.  
Debug mode transitions are completely transparent to any kernel-mode 
software run.

 I did some work on this a few years ago, including emulating DR0-7 
accesses in software down the JTAG handler upon a General Detect fault to 
keep the kernel both happy and away from real debug registers. ;)  Yes, 
you can debug any software with this stuff, including the Linux kernel: 
set instruction and data breakpoints, single-step it, poke at all hardware 
registers, including descriptor registers not otherwise accessible (you 
can set funny modes for segments, also in the 64-bit mode), etc.  One 
complication though is you operate on physical addresses when poking at 
memory, you can't ask the CPU's MMU to remap them for you (you can walk 
page tables manually of course, just as the MMU would).

 I hope this clears things a bit around this stuff. :)  You might be able 
to find some more by issuing a query for "Extended Debug Port" with your 
favourite Internet search engine.

 It's been a while since this discussion, but I thought I'd chime in as 
you might find it interesting.  I'm actually a bit surprised the knowledge 
about this is so poor among x86 experts.

  Maciej

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07  5:39                                                       ` Maciej W. Rozycki
@ 2015-09-07  7:42                                                         ` Ingo Molnar
  2015-09-07  8:19                                                           ` Maciej W. Rozycki
  0 siblings, 1 reply; 85+ messages in thread
From: Ingo Molnar @ 2015-09-07  7:42 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Borislav Petkov, Andy Lutomirski, Paolo Bonzini, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

>  I did some work on this a few years ago, including emulating DR0-7 accesses in 
> software down the JTAG handler upon a General Detect fault to keep the kernel 
> both happy and away from real debug registers. ;) Yes, you can debug any 
> software with this stuff, including the Linux kernel: set instruction and data 
> breakpoints, single-step it, poke at all hardware registers, including 
> descriptor registers not otherwise accessible (you can set funny modes for 
> segments, also in the 64-bit mode), etc.  One complication though is you operate 
> on physical addresses when poking at memory, you can't ask the CPU's MMU to 
> remap them for you (you can walk page tables manually of course, just as the MMU 
> would).

Essentially the ICE breakpoint instruction enters SMM mode?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07  7:42                                                         ` Ingo Molnar
@ 2015-09-07  8:19                                                           ` Maciej W. Rozycki
  2015-09-07 10:19                                                             ` Paolo Bonzini
  0 siblings, 1 reply; 85+ messages in thread
From: Maciej W. Rozycki @ 2015-09-07  8:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Andy Lutomirski, Paolo Bonzini, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Mon, 7 Sep 2015, Ingo Molnar wrote:

> >  I did some work on this a few years ago, including emulating DR0-7 accesses in 
> > software down the JTAG handler upon a General Detect fault to keep the kernel 
> > both happy and away from real debug registers. ;) Yes, you can debug any 
> > software with this stuff, including the Linux kernel: set instruction and data 
> > breakpoints, single-step it, poke at all hardware registers, including 
> > descriptor registers not otherwise accessible (you can set funny modes for 
> > segments, also in the 64-bit mode), etc.  One complication though is you operate 
> > on physical addresses when poking at memory, you can't ask the CPU's MMU to 
> > remap them for you (you can walk page tables manually of course, just as the MMU 
> > would).
> 
> Essentially the ICE breakpoint instruction enters SMM mode?

 I didn't do stuff at the probe firmware level so I can't say for sure, 
but my gut feeling is the debug mode is indeed very close if not the same 
as SMM.  I think duplicating the logic would be an unnecessary waste of 
silicon.

 And obviously it's any cause of #DB that enters this mode.  The probe can
also request it right at the exit from the reset state, so that you can 
debug software (e.g BIOS startup) right from the reset vector.  You don't 
need working RAM for that.

  Maciej

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07  8:19                                                           ` Maciej W. Rozycki
@ 2015-09-07 10:19                                                             ` Paolo Bonzini
  2015-09-07 17:01                                                               ` Maciej W. Rozycki
  0 siblings, 1 reply; 85+ messages in thread
From: Paolo Bonzini @ 2015-09-07 10:19 UTC (permalink / raw)
  To: Maciej W. Rozycki, Ingo Molnar
  Cc: Borislav Petkov, Andy Lutomirski, Peter Zijlstra, Linus Torvalds,
	Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On 07/09/2015 10:19, Maciej W. Rozycki wrote:
>> > Essentially the ICE breakpoint instruction enters SMM mode?
>  I didn't do stuff at the probe firmware level so I can't say for sure, 
> but my gut feeling is the debug mode is indeed very close if not the same 
> as SMM.  I think duplicating the logic would be an unnecessary waste of 
> silicon.

I researched SMM a bit recently in order to implement it in KVM, and the
best source of folklore seems to be http://www.rcollins.org/ddj (which I
also have on paper :)).

The author there says that SMM design was roughly based on the 386's
probe/ICE mode design, but it's actually separate.  Most notably, on the
386 the state save areas almost mirror each other, but when I say
mirror... I do mean mirror: directions are reversed, and what is on top
for probe mode is on bottom for SMM. :)

In addition, AMD tried reusing ICE mode for SMM, and was sued by Intel
who actually won the lawsuit.  I couldn't find more information about
the lawsuit.

It's probably diverged more and more over time, for example because SMM
is now considered security-sensitive while probe mode isn't.  In
addition, the same DDJ article says that Pentium JTAG probe mode
"doesn't resemble SMM at all, doesn't use a state save map, or even
execute any code of its own", whatever that means.

Paolo

>  And obviously it's any cause of #DB that enters this mode.  The probe can
> also request it right at the exit from the reset state, so that you can 
> debug software (e.g BIOS startup) right from the reset vector.  You don't 
> need working RAM for that.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07 10:19                                                             ` Paolo Bonzini
@ 2015-09-07 17:01                                                               ` Maciej W. Rozycki
  2015-09-07 17:22                                                                 ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Maciej W. Rozycki @ 2015-09-07 17:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Ingo Molnar, Borislav Petkov, Andy Lutomirski, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Mon, 7 Sep 2015, Paolo Bonzini wrote:

> >  I didn't do stuff at the probe firmware level so I can't say for sure, 
> > but my gut feeling is the debug mode is indeed very close if not the same 
> > as SMM.  I think duplicating the logic would be an unnecessary waste of 
> > silicon.
> 
> I researched SMM a bit recently in order to implement it in KVM, and the
> best source of folklore seems to be http://www.rcollins.org/ddj (which I
> also have on paper :)).

 Robert did an excellent job figuring it all, but his stuff is a bit 
dated, things may have changed since, especially as JTAG debugging has 
since become ubiquitous in the embedded world and consequently better 
developed.

> The author there says that SMM design was roughly based on the 386's
> probe/ICE mode design, but it's actually separate.  Most notably, on the
> 386 the state save areas almost mirror each other, but when I say
> mirror... I do mean mirror: directions are reversed, and what is on top
> for probe mode is on bottom for SMM. :)

 That might be a minor implementation detail, needed for whatever reason. 

> In addition, AMD tried reusing ICE mode for SMM, and was sued by Intel
> who actually won the lawsuit.  I couldn't find more information about
> the lawsuit.
> 
> It's probably diverged more and more over time, for example because SMM
> is now considered security-sensitive while probe mode isn't.  In
> addition, the same DDJ article says that Pentium JTAG probe mode
> "doesn't resemble SMM at all, doesn't use a state save map, or even
> execute any code of its own", whatever that means.

 At least I am fairly sure the RSM instruction is used to quit the debug 
mode just like with SMM, so I'd be surprised if they bothered implementing 
separate state save area structures for the two modes.  Some control bits 
in the CPU may well be set differently between the two modes though, 
addressing issues like security sensitivity you mentioned.

 A state save/restore approach is definitely used (unlike with some other 
processors that expose internal registers through JTAG directly) as you 
cannot switch between operation modes (e.g. real vs protected) on the fly 
while in the debug mode.  You actually need to return to the regular mode 
(e.g. ask to single-step a NOP) for a mode change to take effect.  Ditto 
about other registers -- any read-only bits are only masked out in the 
register state once a regular-mode instruction has executed.

 The use of RSM also prompts a question whether you can nest debug mode in 
SMM (to debug SMM code) -- this is actually similar to the NMI vs IRET 
issue considered in this thread -- or nest debug mode in debug mode, e.g. 
by taking a #DB exception from an INT1 instruction while in either mode.  
I don't know.  Some other processors (MIPS) that implement a JTAG debug 
mode allow such nesting and care has to be taken in probe firmware to 
handle it correctly and ensure the context to return to is not clobbered 
if such a situation is to be arranged.  And also -- as you may have 
expected -- the debug mode return instruction has to be avoided in the 
nested handler.

 These are all implementation-specific details, including the INT1 
instruction, which is why I am not at all surprised that they are omitted 
from architecture manuals.  The JTAG debug mode itself is no rocket 
science though, everybody seems to have it these days.  Though for cost 
and power consumption saving reasons the RTL block implementing the debug 
module may obviously be omitted from production silicon known, perhaps by 
definition, to never require one.

  Maciej

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07 17:01                                                               ` Maciej W. Rozycki
@ 2015-09-07 17:22                                                                 ` Andy Lutomirski
  2015-09-07 19:30                                                                   ` Maciej W. Rozycki
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-09-07 17:22 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Mon, Sep 7, 2015 at 10:01 AM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
>  These are all implementation-specific details, including the INT1
> instruction, which is why I am not at all surprised that they are omitted
> from architecture manuals.

That bit is BS, though.  The INT1 instruction, executed in user mode
(CPL3) with no hardware debugger attached, will enter the kernel
through a gate at vector 1, *even if that gate has DPL == 0*.

If there's an instruction that bypasses hardware protection
mechanisms, then Intel should document it rather than relying on OS
writers to know enough folklore to get it right.

Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the
interrupt or trap gate only if an exception or interrupt is generated
with an INT n, INT 3, or INTO instruction."  It does not say "the
processor does not check the DPL of the interrupt or trap gate if the
exception or interrupt is generated with the undocumented ICEBP
instruction."

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07 17:22                                                                 ` Andy Lutomirski
@ 2015-09-07 19:30                                                                   ` Maciej W. Rozycki
  2015-09-07 21:56                                                                     ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Maciej W. Rozycki @ 2015-09-07 19:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Mon, 7 Sep 2015, Andy Lutomirski wrote:

> >  These are all implementation-specific details, including the INT1
> > instruction, which is why I am not at all surprised that they are omitted
> > from architecture manuals.
> 
> That bit is BS, though.  The INT1 instruction, executed in user mode
> (CPL3) with no hardware debugger attached, will enter the kernel
> through a gate at vector 1, *even if that gate has DPL == 0*.
> 
> If there's an instruction that bypasses hardware protection
> mechanisms, then Intel should document it rather than relying on OS
> writers to know enough folklore to get it right.
> 
> Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the
> interrupt or trap gate only if an exception or interrupt is generated
> with an INT n, INT 3, or INTO instruction."  It does not say "the
> processor does not check the DPL of the interrupt or trap gate if the
> exception or interrupt is generated with the undocumented ICEBP
> instruction."

 It does not have to be mentioned, because it's implied by how the #DB 
exception is propagated: regardless of its origin it never checks the DPL.  
And user-mode software may well use POPF at any time to set the TF bit in 
the flags register to the same effect, so the OS needs to be prepared for 
a #DB exception it hasn't scheduled itself anyway.

  Maciej

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07 19:30                                                                   ` Maciej W. Rozycki
@ 2015-09-07 21:56                                                                     ` Andy Lutomirski
  2015-09-08 16:21                                                                       ` Maciej W. Rozycki
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-09-07 21:56 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Mon, Sep 7, 2015 at 12:30 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Mon, 7 Sep 2015, Andy Lutomirski wrote:
>
>> >  These are all implementation-specific details, including the INT1
>> > instruction, which is why I am not at all surprised that they are omitted
>> > from architecture manuals.
>>
>> That bit is BS, though.  The INT1 instruction, executed in user mode
>> (CPL3) with no hardware debugger attached, will enter the kernel
>> through a gate at vector 1, *even if that gate has DPL == 0*.
>>
>> If there's an instruction that bypasses hardware protection
>> mechanisms, then Intel should document it rather than relying on OS
>> writers to know enough folklore to get it right.
>>
>> Heck, SDM Volume 3 6.12.1.1 says "The processor checks the DPL of the
>> interrupt or trap gate only if an exception or interrupt is generated
>> with an INT n, INT 3, or INTO instruction."  It does not say "the
>> processor does not check the DPL of the interrupt or trap gate if the
>> exception or interrupt is generated with the undocumented ICEBP
>> instruction."
>
>  It does not have to be mentioned, because it's implied by how the #DB
> exception is propagated: regardless of its origin it never checks the DPL.
> And user-mode software may well use POPF at any time to set the TF bit in
> the flags register to the same effect, so the OS needs to be prepared for
> a #DB exception it hasn't scheduled itself anyway.

Not really.

int $1 checks DPL.  Setting TF results in saved TF set and the
corresponding bit in DR6 set as well.  Triggering a #DB using the
debug registers requires active OS help.

So operating systems need to handle a #DB without no indicated cause
without spewing warnings or crashing, and there is no indication
whatsoever in the SDM or APM that this is the case.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-09-07 21:56                                                                     ` Andy Lutomirski
@ 2015-09-08 16:21                                                                       ` Maciej W. Rozycki
  0 siblings, 0 replies; 85+ messages in thread
From: Maciej W. Rozycki @ 2015-09-08 16:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Peter Zijlstra,
	Linus Torvalds, Willy Tarreau, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Thomas Gleixner, Brian Gerst

On Mon, 7 Sep 2015, Andy Lutomirski wrote:

> >  It does not have to be mentioned, because it's implied by how the #DB
> > exception is propagated: regardless of its origin it never checks the DPL.
> > And user-mode software may well use POPF at any time to set the TF bit in
> > the flags register to the same effect, so the OS needs to be prepared for
> > a #DB exception it hasn't scheduled itself anyway.
> 
> Not really.
> 
> int $1 checks DPL.  Setting TF results in saved TF set and the
> corresponding bit in DR6 set as well.  Triggering a #DB using the
> debug registers requires active OS help.

 INT $1 is a software interrupt instruction, it does not trigger a #DB.  
Similarly INT $13 checks DPL while #GP does not.  Or maybe INT $6 vs UD2 
is a better analogy; the latter is as much INT6 as the 0xf1 encoding is 
INT1.

 Yes, you'll get a DR6 status with no new bits set.  So what?  You can 
ignore it and IRET with no adverse effects.  You can print diagnostics if 
you're pedantic.  You can kill the offending user program, but that's no 
harm, because it already did the undefined.  None of these is an issue, 
and certainly not one for security.

 Panicking OTOH would be, but that would IMHO be a silly choice and a bad 
OS design.  You never need to crash due to a user-mode exception, even an 
unknown one.  What if you run on a new CPU which has a new user-mode 
exception unknown at the time the OS binary was compiled?  That's an 
analogous situation for an architecture like x86 where strict backwards 
compatibility is maintained.

 A reasonable #DB handler will do something like:

{
	int dr6 = read_dr6();

	write_dr6(0);
	if (dr6 & DR6_MASK_X)
		handle_dr6_x();
	if (dr6 & DR6_MASK_Y)
		handle_dr6_y();
	/* Etc... */

	return;
}

and will work just fine where invoked with no bits set in DR6.

> So operating systems need to handle a #DB without no indicated cause
> without spewing warnings or crashing, and there is no indication
> whatsoever in the SDM or APM that this is the case.

 Strictly speaking the SDM does not state that at least one status bit 
shall be set in DR6 either.

 FAOD I'm not saying of course that documenting INT1 as a model-specific 
instruction encoding reserved for #DB generation or stating something to 
the effect that the OS is required to handle (e.g. discard) a #DB 
exception seen with no status bits set in DR6 would be bad.  No, it would 
certainly be nice.  But I maintain that I don't see it as strictly 
necessary.

 Pester Intel if you disagree, I'm not the right person to complain about 
it anyway. ;)

  Maciej

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 20:51                                         ` Peter Zijlstra
  2015-07-24 21:07                                           ` Steven Rostedt
  2015-07-24 21:08                                           ` Andy Lutomirski
@ 2015-07-24 23:53                                           ` Linus Torvalds
  2 siblings, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2015-07-24 23:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Willy Tarreau, Steven Rostedt, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 1:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> do_debug()
>   send_sigtrap()
>     force_sig_info()
>       spin_lock_irqsave()
>
> Now, I don't pretend to understand the condition before send_sigtrap(),
> so it _might_ be ok, but it sure as heck could do with a comment.

Ugh. As Andy said, I think that's ok, because it's actually the
single-step case, and won't trigger for kernel mode. So we should be
ok. Although the code I agree is not good.

I'd personally be more worried about the usual crazy "notify_die()"
crap. I absoluely detest those notifier chain things. They are hooks
for random crap that shouldn't be hooked into, but whatever. It's not
a problem in practice, it's just a sign of a certain kind of diseased
mind.

On the whole I think we're ok. I'd love to get rid of things, and yes,
I think we should probably explicitly handle the in-kernel case first
and just return without doing anything, just to make the code more
obviously safe. But it doesn't look like a fundamental problem spot.

                  Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:26                               ` Willy Tarreau
  2015-07-24 15:30                                 ` Peter Zijlstra
@ 2015-07-24 15:34                                 ` Steven Rostedt
  2015-07-24 15:49                                   ` Willy Tarreau
  1 sibling, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 15:34 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, 24 Jul 2015 17:26:37 +0200
Willy Tarreau <w@1wt.eu> wrote:

 
> > The point is, if we trigger a #DB on an instruction breakpoint
> > while !IF, then we simply disable that breakpoint and do the RET.
> 
> Yes but the breakpoint remains disabled then. Or I'm missing
> something.

Do we care? If it was an instruction breakpoint with !IF set, then it
had to have happened in the kernel. And kgdb or whatever added it there
needs to deal with that.

There should be no instances in the kernel where we execute userspace
code with interrupts disabled.

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:34                                 ` Steven Rostedt
@ 2015-07-24 15:49                                   ` Willy Tarreau
  0 siblings, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 15:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Andy Lutomirski, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 11:34:26AM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 17:26:37 +0200
> Willy Tarreau <w@1wt.eu> wrote:
> 
>  
> > > The point is, if we trigger a #DB on an instruction breakpoint
> > > while !IF, then we simply disable that breakpoint and do the RET.
> > 
> > Yes but the breakpoint remains disabled then. Or I'm missing
> > something.
> 
> Do we care? If it was an instruction breakpoint with !IF set, then it
> had to have happened in the kernel. And kgdb or whatever added it there
> needs to deal with that.

I was concerned that an RW BP would remain disabled when returning to
user space but Peter cleared that out by pointing me to the discussion
where it was explained that they are re-enabled when returning to user
space.

So no problem here for me.

Thanks,
Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24  8:13               ` Peter Zijlstra
  2015-07-24  9:02                 ` Willy Tarreau
  2015-07-24 11:58                 ` Steven Rostedt
@ 2015-07-24 15:48                 ` Andy Lutomirski
  2015-07-24 16:02                   ` Steven Rostedt
                                     ` (3 more replies)
  2 siblings, 4 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-24 15:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Fri, Jul 24, 2015 at 1:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jul 23, 2015 at 02:59:56PM -0700, Linus Torvalds wrote:
>> Hmmm. I thought watchpoints were "before the instruction" too, but
>> that's just because I haven't used them in ages, and I didn't remember
>> the details. I just looked it up.
>>
>> You're right - the memory watchpoints trigger after the instruction
>> has executed, so RF isn't an issue. So yes, the only issue is
>> instruction breakpoints, and those are the only ones we need to clear.
>>
>> And that makes it really easy.
>>
>> So yes, I agree. We only need to clear all kernel breakpoints.
>
> But but but, we can access userspace with !IF, imagine someone doing:
>
>   local_irq_disable();
>   copy_from_user_inatomic();
>
> and as luck would have it, there's a breakpoint on the user memory we
> just touched. And we go and disable a user breakpoint.
>

The Intel SDM says:

17.3.1.2 Data Memory and I/O Breakpoint Exception Conditions

Data memory and I/O breakpoints are reported when the processor
attempts to access a memory or I/O address
specified in a breakpoint-address register (DR0 through DR3) that has
been set up to detect data or I/O accesses
(R/W flag is set to 1, 2, or 3). The processor generates the exception
after it executes the instruction that made the
access, so these breakpoint condition causes a trap-class exception to
be generated.

So by the time we detect that we've hit a watchpoint, the instruction
that tripped it is done and we don't need RF.  Furthermore, after
reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
we hit a watchpoint.  So this might be as simple as:

if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
  for (i = 0; i < 4; i++)
    if (dr6 & (DR_TRAP0<<i)) {
      /* hit a kernel breakpoint with IF clear */
      dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
    }

I'm not saying that your code is wrong, but I think this is simpler
and avoids poking at yet more per-cpu state from NMI context, which is
kind of nice.

If you don't like the RF games above, it would also be straightforward
to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
breakpoint.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:48                 ` Andy Lutomirski
@ 2015-07-24 16:02                   ` Steven Rostedt
  2015-07-24 16:08                     ` Willy Tarreau
  2015-07-24 16:06                   ` Steven Rostedt
                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 16:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Linus Torvalds, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Fri, 24 Jul 2015 08:48:57 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF.  Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
> we hit a watchpoint.  So this might be as simple as:
> 
> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |

Um, isn't 0xf * DR_TRAP0 same as a constant "true"?

-- Steve


> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
>   for (i = 0; i < 4; i++)
>     if (dr6 & (DR_TRAP0<<i)) {
>       /* hit a kernel breakpoint with IF clear */
>       dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
>     }
> 
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
> 
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.
> 
> --Andy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 16:02                   ` Steven Rostedt
@ 2015-07-24 16:08                     ` Willy Tarreau
  2015-07-24 16:31                       ` Steven Rostedt
  0 siblings, 1 reply; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 16:08 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 12:02:49PM -0400, Steven Rostedt wrote:
> On Fri, 24 Jul 2015 08:48:57 -0700
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > So by the time we detect that we've hit a watchpoint, the instruction
> > that tripped it is done and we don't need RF.  Furthermore, after
> > reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
> > we hit a watchpoint.  So this might be as simple as:
> > 
> > if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
> 
> Um, isn't 0xf * DR_TRAP0 same as a constant "true"?

For me it's a typo, it should have been :

 if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |

(the purpose was to read all 4 lower bits at once).

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 16:08                     ` Willy Tarreau
@ 2015-07-24 16:31                       ` Steven Rostedt
  0 siblings, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 16:31 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andy Lutomirski, Peter Zijlstra, Linus Torvalds, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, 24 Jul 2015 18:08:06 +0200
Willy Tarreau <w@1wt.eu> wrote:


> > Um, isn't 0xf * DR_TRAP0 same as a constant "true"?
> 
> For me it's a typo, it should have been :
> 
>  if ((dr6 & (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
> 
> (the purpose was to read all 4 lower bits at once).

I figured that after I sent it, but the 0xf * DR_TRAP0 is also
confusing to me. Actually, why not use proper naming:

  dr6 & DR_TRAP_BITS

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:48                 ` Andy Lutomirski
  2015-07-24 16:02                   ` Steven Rostedt
@ 2015-07-24 16:06                   ` Steven Rostedt
  2015-07-24 16:25                   ` Willy Tarreau
  2015-07-24 17:10                   ` Willy Tarreau
  3 siblings, 0 replies; 85+ messages in thread
From: Steven Rostedt @ 2015-07-24 16:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Linus Torvalds, X86 ML,
	linux-kernel@vger.kernel.org, Willy Tarreau, Borislav Petkov,
	Thomas Gleixner, Brian Gerst

On Fri, 24 Jul 2015 08:48:57 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF.  Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
> we hit a watchpoint.  So this might be as simple as:
> 
> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))

Also, wouldn't !(regs->X86_EFLAGS_IF) && !user_mode(regs) be a bug?
When do we allow coming from userspace with interrupts disabled?

-- Steve

>   for (i = 0; i < 4; i++)
>     if (dr6 & (DR_TRAP0<<i)) {
>       /* hit a kernel breakpoint with IF clear */
>       dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
>     }
> 
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
> 
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.
> 
> --Andy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:48                 ` Andy Lutomirski
  2015-07-24 16:02                   ` Steven Rostedt
  2015-07-24 16:06                   ` Steven Rostedt
@ 2015-07-24 16:25                   ` Willy Tarreau
  2015-07-24 17:21                     ` Andy Lutomirski
  2015-07-24 17:10                   ` Willy Tarreau
  3 siblings, 1 reply; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 16:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF.  Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
> we hit a watchpoint.

Apparently after reading 17.3.1.1, it seems like RF can still be set
if a data breakpoint triggers in a repeated string instruction before
the last iteration. However I don't think we care because as long as
we return to the string instruction, since the data location was already
visited it won't trigger again so the loss of the flag should be safe.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 16:25                   ` Willy Tarreau
@ 2015-07-24 17:21                     ` Andy Lutomirski
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-24 17:21 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 9:25 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
>> So by the time we detect that we've hit a watchpoint, the instruction
>> that tripped it is done and we don't need RF.  Furthermore, after
>> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
>> we hit a watchpoint.
>
> Apparently after reading 17.3.1.1, it seems like RF can still be set
> if a data breakpoint triggers in a repeated string instruction before
> the last iteration. However I don't think we care because as long as
> we return to the string instruction, since the data location was already
> visited it won't trigger again so the loss of the flag should be safe.
>

Oh, right.  So my proposal is wrong: it'll clear a watchpoint
incorrectly if we hit it in the middle of a string operation.

So we should either parse dr0..dr3 (whichever one triggered) or do
Peter's think and clear dr7 entirely.  I still prefer just clearing
the breakpoint that triggered.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 15:48                 ` Andy Lutomirski
                                     ` (2 preceding siblings ...)
  2015-07-24 16:25                   ` Willy Tarreau
@ 2015-07-24 17:10                   ` Willy Tarreau
  2015-07-24 17:20                     ` Andy Lutomirski
  2015-07-24 17:21                     ` Willy Tarreau
  3 siblings, 2 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 17:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
> So by the time we detect that we've hit a watchpoint, the instruction
> that tripped it is done and we don't need RF.  Furthermore, after
> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
> we hit a watchpoint.  So this might be as simple as:
> 
> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
>   for (i = 0; i < 4; i++)
>     if (dr6 & (DR_TRAP0<<i)) {
>       /* hit a kernel breakpoint with IF clear */
>       dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
>     }
> 
> I'm not saying that your code is wrong, but I think this is simpler
> and avoids poking at yet more per-cpu state from NMI context, which is
> kind of nice.
> 
> If you don't like the RF games above, it would also be straightforward
> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
> breakpoint.

Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
to emulate IRET, and then possibly simplify the flags processing. It says
that it takes the CPL3 code segment but nowhere it says that the target is
validated for effectively being userland, and further it suggests that it
doesn't validate anything :

  "It is the responsibility of the OS to ensure the descriptors in
   the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
   (consistent with the base, limit, and attribute values forced by
   the instructions)."

The OS has to set the RSP by itself before doing SYSRET, which opens a
race between "mov rsp" and "sysret", but if we only take that path once
we figure we come from NMI (using just IF+RSP), we know that IRQs and
NMIs are still disabled and cannot strike at this instant. Maybe MCEs
can, but they would execute within the NMI's stack just as if they were
triggered inside the NMI as well so I don't see a problem here.

I tried to imagine a case where kernel page faults, then NMI comes in,
then debug strikes and we have to return from debug to NMI then to fault
handler and I don't think we break the chain. Of course there are many
subtleties I can't grab because I don't understand all the details.

Do you think that could simplify things or that it's another stupid idea ?

Willy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 17:10                   ` Willy Tarreau
@ 2015-07-24 17:20                     ` Andy Lutomirski
  2015-07-30 15:54                       ` Paolo Bonzini
  2015-07-24 17:21                     ` Willy Tarreau
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-24 17:20 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 10:10 AM, Willy Tarreau <w@1wt.eu> wrote:
> On Fri, Jul 24, 2015 at 08:48:57AM -0700, Andy Lutomirski wrote:
>> So by the time we detect that we've hit a watchpoint, the instruction
>> that tripped it is done and we don't need RF.  Furthermore, after
>> reading 17.3.1.1: I *think* that regs->flags withh have RF *clear* if
>> we hit a watchpoint.  So this might be as simple as:
>>
>> if ((dr6 && (0xf * DR_TRAP0) && (regs->flags & (X86_EFLAGS_RF |
>> X86_EFLAGS_IF)) == X86_EFLAGS_RF && !user_mode(regs))
>>   for (i = 0; i < 4; i++)
>>     if (dr6 & (DR_TRAP0<<i)) {
>>       /* hit a kernel breakpoint with IF clear */
>>       dr7 &= ~(DR_GLOBAL_ENABLE << (i * DR_ENABLE_SHIFT));
>>     }
>>
>> I'm not saying that your code is wrong, but I think this is simpler
>> and avoids poking at yet more per-cpu state from NMI context, which is
>> kind of nice.
>>
>> If you don't like the RF games above, it would also be straightforward
>> to parse dr0..dr3 for each DR_TRAP bit that's set and see if it's a
>> breakpoint.
>
> Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
> to emulate IRET, and then possibly simplify the flags processing. It says
> that it takes the CPL3 code segment but nowhere it says that the target is
> validated for effectively being userland, and further it suggests that it
> doesn't validate anything :
>
>   "It is the responsibility of the OS to ensure the descriptors in
>    the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
>    (consistent with the base, limit, and attribute values forced by
>    the instructions)."

You are an evil bastard.  I seriously doubt that this will work.
SYSRET goes to CPL3 no matter what.  Also, I don't think you want to
start poking at MSRs to return.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 17:20                     ` Andy Lutomirski
@ 2015-07-30 15:54                       ` Paolo Bonzini
  0 siblings, 0 replies; 85+ messages in thread
From: Paolo Bonzini @ 2015-07-30 15:54 UTC (permalink / raw)
  To: Andy Lutomirski, Willy Tarreau
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst



On 24/07/2015 19:20, Andy Lutomirski wrote:
> > Andy, section 5.8 of the SDM makes me think we could possibly abuse SYSRET
> > to emulate IRET, and then possibly simplify the flags processing. It says
> > that it takes the CPL3 code segment but nowhere it says that the target is
> > validated for effectively being userland, and further it suggests that it
> > doesn't validate anything :
> >
> >   "It is the responsibility of the OS to ensure the descriptors in
> >    the GDT/LDT correspond to the selectors loaded by SYSCALL/SYSRET
> >    (consistent with the base, limit, and attribute values forced by
> >    the instructions)."
> You are an evil bastard.  I seriously doubt that this will work.
> SYSRET goes to CPL3 no matter what.  Also, I don't think you want to
> start poking at MSRs to return.

On Intel the bottom two bits of the selector are forced to 11.  The
pseudocode of SYSRET in the SDM has an explicit

	CS.Selector ← (IA32_STAR[63:48]+ either 0 or 16) OR 3;
	...
	SS.Selector ← (IA32_STAR[63:48]+8) OR 3;

On AMD it's even worse, because you get a weird state with
CS.DPL=CS.RPL=SS.DPL=SS.RPL=0 but still the CPL is 3.  This is seriously
messed up because the CPL is always SS.DPL except in this case.  AMD
even had to add a separate field for the CPL to their VM control block,
just to account for this case.  Intel more sanely uses SS.DPL.

Paolo

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-24 17:10                   ` Willy Tarreau
  2015-07-24 17:20                     ` Andy Lutomirski
@ 2015-07-24 17:21                     ` Willy Tarreau
  1 sibling, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-24 17:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, X86 ML,
	linux-kernel@vger.kernel.org, Borislav Petkov, Thomas Gleixner,
	Brian Gerst

On Fri, Jul 24, 2015 at 07:10:18PM +0200, Willy Tarreau wrote:
> The OS has to set the RSP by itself before doing SYSRET, which opens a
> race between "mov rsp" and "sysret", but if we only take that path once
> we figure we come from NMI (using just IF+RSP), we know that IRQs and
> NMIs are still disabled and cannot strike at this instant. Maybe MCEs
> can, but they would execute within the NMI's stack just as if they were
> triggered inside the NMI as well so I don't see a problem here.

OK too bad I just found the response here in the code :-(

     * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
     * restoring TF results in a trap from userspace immediately after
     * SYSRET.  This would cause an infinite loop whenever #DB happens
     * with register state that satisfies the opportunistic SYSRET
     * conditions.

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:38 ` Linus Torvalds
  2015-07-23 20:49   ` Andy Lutomirski
@ 2015-07-23 20:52   ` Willy Tarreau
  2015-07-23 20:53     ` Andy Lutomirski
  2015-07-23 21:13     ` Linus Torvalds
  2015-07-23 21:20   ` Peter Zijlstra
  2 siblings, 2 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-23 20:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
> >
> > We haven't considered:
> >
> > 3. Forbid faults (other than MCE) inside NMI.
> 
> I'd really prefer #2. #3 depends on us getting many things right, and
> never introducing new cases in the future.
> 
> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
> but returning to user space with RF clear doesn't really seem to be
> all that problematic.

What's the worst case that can happen with RF cleared when returing
to user space ? My understanding is that it's just that we risk to
break again on an instruction that had a break point set and which
already triggered the breakpoint, right ?

If so the problem probably is whether there's a risk of looping again
without ever getting a chance to execute this instruction normally.
But if the NMIs don't bomb as fast as we can process them, at some
point the instruction should get a chance to be executed, so the
problem doesn't seem dramatic.

That makes me think that I have no idea what happens if we try to
step-trace "int 2", I don't even know if we pass through the NMI
handler.

Willy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:52   ` Willy Tarreau
@ 2015-07-23 20:53     ` Andy Lutomirski
  2015-07-23 21:07       ` Willy Tarreau
  2015-07-23 21:13     ` Linus Torvalds
  1 sibling, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 20:53 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote:
> On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
>> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> >
>> > 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
>> >
>> > We haven't considered:
>> >
>> > 3. Forbid faults (other than MCE) inside NMI.
>>
>> I'd really prefer #2. #3 depends on us getting many things right, and
>> never introducing new cases in the future.
>>
>> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
>> but returning to user space with RF clear doesn't really seem to be
>> all that problematic.
>
> What's the worst case that can happen with RF cleared when returing
> to user space ? My understanding is that it's just that we risk to
> break again on an instruction that had a break point set and which
> already triggered the breakpoint, right ?

I assume Linus meant returning to kernel space with RF clear.  Returns
to userspace have their own fancy logic here, and it's survived for a
couple of releases, including through an explicit test of RF handling
:)

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:53     ` Andy Lutomirski
@ 2015-07-23 21:07       ` Willy Tarreau
  0 siblings, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-23 21:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 01:53:34PM -0700, Andy Lutomirski wrote:
> On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote:
> > On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:
> >> On Thu, Jul 23, 2015 at 1:21 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >> >
> >> > 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
> >> >
> >> > We haven't considered:
> >> >
> >> > 3. Forbid faults (other than MCE) inside NMI.
> >>
> >> I'd really prefer #2. #3 depends on us getting many things right, and
> >> never introducing new cases in the future.
> >>
> >> #2, in contrast, seems to be fairly localized. Yes, RF is an issue,
> >> but returning to user space with RF clear doesn't really seem to be
> >> all that problematic.
> >
> > What's the worst case that can happen with RF cleared when returing
> > to user space ? My understanding is that it's just that we risk to
> > break again on an instruction that had a break point set and which
> > already triggered the breakpoint, right ?
> 
> I assume Linus meant returning to kernel space with RF clear.  Returns
> to userspace have their own fancy logic here, and it's survived for a
> couple of releases, including through an explicit test of RF handling
> :)

Ah you must be right, got it. Yes you want to break into the NMI handler
and you either disable all breakpoints/single-step until the NMI's iret
by clearing DR7, or you loop over and over on the same instruction if
you try to restart the stopped instruction with RF clear. That makes
sense.

Thanks,
Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:52   ` Willy Tarreau
  2015-07-23 20:53     ` Andy Lutomirski
@ 2015-07-23 21:13     ` Linus Torvalds
  2015-07-23 21:18       ` Willy Tarreau
  1 sibling, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 21:13 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote:
>
> What's the worst case that can happen with RF cleared when returing
> to user space ?

Not a good idea. We are fine breaking breakpoints on the kernel ("use
the tracing infrastructure instead"). Breaking it in user space is not
really an option.

And we really don't need to. We'd only use 'ret' when returning to
kernel code. And not even for the usual case, only for the "interrupts
are off" case.  If somebody tries to put a breakpoint on something
that is used in an irq-off situation, they are doing something very
specialized, and we cna tell them: "sorry, we had to break your use
case because it's crazy any other way".

Those kind of people are by definition not "users". They are mucking
with kernel internals. Breaking them is not a regression.

Btw, we should still ask Intel for that "fast iret that doesn't
re-enable NMI". So for possible future CPU's we might let people do
crazy things again.

                  Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:13     ` Linus Torvalds
@ 2015-07-23 21:18       ` Willy Tarreau
  0 siblings, 0 replies; 85+ messages in thread
From: Willy Tarreau @ 2015-07-23 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 02:13:16PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 1:52 PM, Willy Tarreau <w@1wt.eu> wrote:
> >
> > What's the worst case that can happen with RF cleared when returing
> > to user space ?
> 
> Not a good idea. We are fine breaking breakpoints on the kernel ("use
> the tracing infrastructure instead"). Breaking it in user space is not
> really an option.

But that wouldn't disable the breakpoint, just make it strike again,
so the user would not be hurt.

> And we really don't need to. We'd only use 'ret' when returning to
> kernel code. And not even for the usual case, only for the "interrupts
> are off" case.  If somebody tries to put a breakpoint on something
> that is used in an irq-off situation, they are doing something very
> specialized, and we cna tell them: "sorry, we had to break your use
> case because it's crazy any other way".
> 
> Those kind of people are by definition not "users". They are mucking
> with kernel internals. Breaking them is not a regression.
> 
> Btw, we should still ask Intel for that "fast iret that doesn't
> re-enable NMI". So for possible future CPU's we might let people do
> crazy things again.

I'm just thinking that there should be an option for this : task switching.
You can store the EFLAGS in the TSS, so by preparing a dummy task with
everything needed to emulate iret, we might be able to do it without the
iret instruction. Or is this a stupid idea ? At least now I've well
understood that ugliness is not an excuse for not proposing something :-)

Willy


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:38 ` Linus Torvalds
  2015-07-23 20:49   ` Andy Lutomirski
  2015-07-23 20:52   ` Willy Tarreau
@ 2015-07-23 21:20   ` Peter Zijlstra
  2015-07-23 21:35     ` Linus Torvalds
  2 siblings, 1 reply; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-23 21:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 01:38:33PM -0700, Linus Torvalds wrote:

> And the "take them and disable them" is really simple. No "am I in an
> NMI contect" thing (because that leads to the whole question about
> "what is NMI context"). That's not the real rule anyway.
> 
> No, make it very simple and straightforward. Make the test be "uhhuh,
> I got a #DB in kernel mode, and interrupts were disabled - I know I'm
> going to return with "ret", so I'm just going to have to disable this
> breakpoint".
> 
> Nothing clever. Nothing subtle. Nothing that needs "this range of
> instructions is magical". No.  Just a very simple rule: if the context
> we return to is kernel mode and interrupts are disabled, we're using
> 'ret', so we cannot suppress debug faults.
> 
> Did I miss something? There were a lot of emails flying around, but I
> *thought* I saw them all..

So the NMI could trigger userspace debug register faults, and simply
disabling them would make the whole debug register thing entirely
unreliable.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:20   ` Peter Zijlstra
@ 2015-07-23 21:35     ` Linus Torvalds
  2015-07-23 21:45       ` Andy Lutomirski
  0 siblings, 1 reply; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 21:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So the NMI could trigger userspace debug register faults, and simply
> disabling them would make the whole debug register thing entirely
> unreliable.

We could easily set something to re-enable them for when we actually
return to user space. I'd be ok with just setting the
_TIF_USER_WORK_MASK.

But even that should not be a requirement for the basic stability and
core integrity of the kernel. Not like the current horrid mess with
NMI nesting and ESP fixing etc.

And realistically, nobody will ever even notice. So the whole "ok, we
can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail
that is more like cleaning up things, not a core issue.

               Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:35     ` Linus Torvalds
@ 2015-07-23 21:45       ` Andy Lutomirski
  2015-07-23 21:54         ` Linus Torvalds
  0 siblings, 1 reply; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 21:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:35 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 23, 2015 at 2:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> So the NMI could trigger userspace debug register faults, and simply
>> disabling them would make the whole debug register thing entirely
>> unreliable.
>
> We could easily set something to re-enable them for when we actually
> return to user space. I'd be ok with just setting the
> _TIF_USER_WORK_MASK.
>
> But even that should not be a requirement for the basic stability and
> core integrity of the kernel. Not like the current horrid mess with
> NMI nesting and ESP fixing etc.
>
> And realistically, nobody will ever even notice. So the whole "ok, we
> can use _TIF_USER_WORK_MASK to re-enable dr7" is a tiny tiny detail
> that is more like cleaning up things, not a core issue.
>

Or we just re-enable them on the way out of NMI (i.e. the very last
thing we do in the NMI handler).  I don't want to break regular
userspace gdb when perf is running.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:45       ` Andy Lutomirski
@ 2015-07-23 21:54         ` Linus Torvalds
  2015-07-23 21:59           ` Andy Lutomirski
  2015-07-24 11:06           ` Peter Zijlstra
  0 siblings, 2 replies; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 21:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Or we just re-enable them on the way out of NMI (i.e. the very last
> thing we do in the NMI handler).  I don't want to break regular
> userspace gdb when perf is running.

I'd really prefer it if we don't touch NMI code in those kinds of
ways. The NMI code is fragile as hell. All the problems we have with
it is exactly due to "where is the boundary" issues.

That's why I *don't* want NMI code to do magic crap. Anything that
says "disable this during this magic window" is broken. The problems
we've had are exactly about atomicity of the entry/exit conditions,
and there is no really good way to get them right.

I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
because it's so *obvious* that it's not a boundary condition.

I dislike the "disable and re-enable dr7 in the NMI handler" exactly
because it smells like "we can only handle faults in _this_ region".
It may be true, but it's also what I want us to get away from. I'd
much rather have the "big picture" be that we can take faults anywhere
at all (*), and that none of the core code really cares. Then we "fix
up" user space.

                   Linus

(*) And yes, sysenter and not having a stack at all is very special,
and I think we will *always* have to have that magical special case of
the first few instructions there. But that's a separate hardware
limitation we can't get around.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:54         ` Linus Torvalds
@ 2015-07-23 21:59           ` Andy Lutomirski
  2015-07-23 22:03             ` Linus Torvalds
  2015-07-24 10:28             ` Peter Zijlstra
  2015-07-24 11:06           ` Peter Zijlstra
  1 sibling, 2 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:54 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Or we just re-enable them on the way out of NMI (i.e. the very last
>> thing we do in the NMI handler).  I don't want to break regular
>> userspace gdb when perf is running.
>
> I'd really prefer it if we don't touch NMI code in those kinds of
> ways. The NMI code is fragile as hell. All the problems we have with
> it is exactly due to "where is the boundary" issues.
>
> That's why I *don't* want NMI code to do magic crap. Anything that
> says "disable this during this magic window" is broken. The problems
> we've had are exactly about atomicity of the entry/exit conditions,
> and there is no really good way to get them right.
>
> I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
> because it's so *obvious* that it's not a boundary condition.
>
> I dislike the "disable and re-enable dr7 in the NMI handler" exactly
> because it smells like "we can only handle faults in _this_ region".
> It may be true, but it's also what I want us to get away from. I'd
> much rather have the "big picture" be that we can take faults anywhere
> at all (*), and that none of the core code really cares. Then we "fix
> up" user space.

OK, new proposal:

In do_debug, if we trip an instruction breakpoint while
!user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
*that breakpoint*.

Why?  It only looks at hardware state (dr6 and dr7), and it can't
break gdb, because gdb can't set a breakpoint that will cause this
problem.

All the other variants of this either need cached state or break gdb
watchpoints on stack variables with perf running.

--Andy

>
>                    Linus
>
> (*) And yes, sysenter and not having a stack at all is very special,
> and I think we will *always* have to have that magical special case of
> the first few instructions there. But that's a separate hardware
> limitation we can't get around.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:59           ` Andy Lutomirski
@ 2015-07-23 22:03             ` Linus Torvalds
  2015-07-24 10:28             ` Peter Zijlstra
  1 sibling, 0 replies; 85+ messages in thread
From: Linus Torvalds @ 2015-07-23 22:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:59 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> OK, new proposal:
>
> In do_debug, if we trip an instruction breakpoint while
> !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
> *that breakpoint*.

Ack.  The more targeted we can make this while still guaranteeing
forward progress, the better. So that sounds really good.

              Linus

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:59           ` Andy Lutomirski
  2015-07-23 22:03             ` Linus Torvalds
@ 2015-07-24 10:28             ` Peter Zijlstra
  1 sibling, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 10:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 02:59:46PM -0700, Andy Lutomirski wrote:
> OK, new proposal:
> 
> In do_debug, if we trip an instruction breakpoint while
> !user_mode(regs) && ((regs->flags & X86_EFLAGS_IF) == 0), then disarm
> *that breakpoint*.

Doesn't !IF already imply that it must be kernel space? AFAIK user space
cannot clear IF.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:54         ` Linus Torvalds
  2015-07-23 21:59           ` Andy Lutomirski
@ 2015-07-24 11:06           ` Peter Zijlstra
  1 sibling, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-24 11:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, X86 ML, linux-kernel@vger.kernel.org,
	Willy Tarreau, Borislav Petkov, Thomas Gleixner, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 02:54:54PM -0700, Linus Torvalds wrote:
> On Thu, Jul 23, 2015 at 2:45 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Or we just re-enable them on the way out of NMI (i.e. the very last
> > thing we do in the NMI handler).  I don't want to break regular
> > userspace gdb when perf is running.
> 
> I'd really prefer it if we don't touch NMI code in those kinds of
> ways. The NMI code is fragile as hell. All the problems we have with
> it is exactly due to "where is the boundary" issues.
> 
> That's why I *don't* want NMI code to do magic crap. Anything that
> says "disable this during this magic window" is broken. The problems
> we've had are exactly about atomicity of the entry/exit conditions,
> and there is no really good way to get them right.
> 
> I'd be much happier with a _TIF_USER_WORK_MASK approach exactly
> because it's so *obvious* that it's not a boundary condition.
> 
> I dislike the "disable and re-enable dr7 in the NMI handler" exactly
> because it smells like "we can only handle faults in _this_ region".
> It may be true, but it's also what I want us to get away from. I'd
> much rather have the "big picture" be that we can take faults anywhere
> at all (*), and that none of the core code really cares. Then we "fix
> up" user space.

A wee bit something like so?

We need the intermediate self-IPI because NMI/MCE etc do not deal with
TIF flags.

I further cleared all of DR7 in an attempt at reducing the amount of
state tracked. And it doesn't distinguish between kernel/user
watchpoints because the kernel can touch both from !IF.

---
 arch/x86/kernel/traps.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8e65d8a9b8db..e8308e9c2b1e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,6 +570,33 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 NOKPROBE_SYMBOL(fixup_bad_iret);
 #endif
 
+struct do_debug_state {
+	unsigned long dr7;
+	struct irq_work irq_work;
+	struct callback_head task_work;
+};
+
+static void __debug_irq_trampoline(struct irq_work *work)
+{
+	struct do_debug_state *dds =
+		container_of(work, struct do_debug_state, irq_work);
+
+	task_work_add(current, &dds->task_work, true);
+}
+
+static void __debug_restore_dr7(struct callback_head *work)
+{
+	struct do_debug_state *dds =
+		container_of(work, struct do_debug_state, task_work);
+
+	set_debugreg(dds->dr7, 7);
+}
+
+static DEFINE_PER_CPU(struct do_debug_state, do_debug_state) = {
+	.irq_work = { .func = __debug_irq_trampoline, },
+	.task_work = { .func = __debug_restore_dr7, },
+};
+
 /*
  * Our handling of the processor debug registers is non-trivial.
  * We do not clear them on entry and exit from the kernel. Therefore
@@ -603,6 +630,16 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 
 	ist_enter(regs);
 
+	if (arch_irqs_disabled_flags(regs->flags)) {
+		struct do_debug_state *dds = this_cpu_ptr(&do_debug_state);
+
+		get_debugreg(dds->dr7, 7);
+		set_debugreg(0, 7);
+		irq_work_queue(&dds->irq_work);
+
+		goto exit;
+	}
+
 	get_debugreg(dr6, 6);
 
 	/* Filter out all the reserved bits which are preset to 1 */

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski
  2015-07-23 20:38 ` Linus Torvalds
@ 2015-07-23 21:17 ` Peter Zijlstra
  2015-07-23 21:20 ` Steven Rostedt
  2015-07-24 16:33 ` Raymond Jennings
  3 siblings, 0 replies; 85+ messages in thread
From: Peter Zijlstra @ 2015-07-23 21:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Brian Gerst

On Thu, Jul 23, 2015 at 01:21:16PM -0700, Andy Lutomirski wrote:
> 3. Forbid faults (other than MCE) inside NMI.
> 
> Option 3 is almost easy.  There are really only two kinds of faults
> that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
> fix (e.g. with my patches or Peter's patches).
> 
> What if we went all out and forbade page faults in NMI as well.  There
> are two reasons that I can think of that we might page fault inside an
> NMI:
> 
> b) User memory access faults.
> 
> The reason we access user state in general from an NMI is to allow
> perf to capture enough user stack data to let the tooling backtrace
> back to user space.  What if we did it differently?  Instead of
> capturing this data in NMI context, capture it in
> prepare_exit_to_usermode. 

> Peter, can this be done without breaking the perf ABI?  If we were
> designing all of this stuff from scratch right now, I'd suggest doing
> it this way, but I'm not sure whether it makes sense to try to
> retrofit it in.

Not really; but also almost :/

So the thing is that we currently attach the user backtrace to all
events -- and there can be many before we return to userspace again.

So none of those events would have a userspace stack, I'm sure that's
going to confuse the tooling.

OTOH, userspace stacks are a best effort thing, we bail at the first
sign of trouble (eg. the stack page is not there).

Now realistically this 'never' happens, and it would result in
consistently truncated user traces, where your proposal would result in
a whole bunch of events with no user traces and then an 'extra' event
with a one.



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski
  2015-07-23 20:38 ` Linus Torvalds
  2015-07-23 21:17 ` Peter Zijlstra
@ 2015-07-23 21:20 ` Steven Rostedt
  2015-07-23 21:46   ` Andy Lutomirski
  2015-07-24 16:33 ` Raymond Jennings
  3 siblings, 1 reply; 85+ messages in thread
From: Steven Rostedt @ 2015-07-23 21:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds,
	Brian Gerst

On Thu, 23 Jul 2015 13:21:16 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> 3. Forbid faults (other than MCE) inside NMI.
> 
> Option 3 is almost easy.  There are really only two kinds of faults
> that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
> fix (e.g. with my patches or Peter's patches).

What about int3? Which is needed to make ftrace work. This was a
requirement to get rid of stomp-machine when updating ftrace functions,
as well as the rational for doing the whole NMI nesting work in the
first place.

> 
> What if we went all out and forbade page faults in NMI as well.  There
> are two reasons that I can think of that we might page fault inside an
> NMI:
> 
> a) vmalloc fault.  I think Ingo already half-implemented a rework to
> eliminate vmalloc faults entirely.
> 
> b) User memory access faults.

c) stack tracing faults

I would have NMIs debug deadlocks with printing stack traces. The stack
tracer can page fault, and before the NMI nesting code, while debugging
machines, these stack dumps would randomly reboot the box. While
writing the NMI nesting code I realized why those reboots happened, and
that was due to the stack trace faulting, and the printk from NMI was
slow enough to have another NMI go off and stomp over the outer NMIs
stack. Which lead to triple faults and such.

-- Steve

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 21:20 ` Steven Rostedt
@ 2015-07-23 21:46   ` Andy Lutomirski
  0 siblings, 0 replies; 85+ messages in thread
From: Andy Lutomirski @ 2015-07-23 21:46 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds,
	Brian Gerst

On Thu, Jul 23, 2015 at 2:20 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 23 Jul 2015 13:21:16 -0700
> Andy Lutomirski <luto@amacapital.net> wrote:
>
>> 3. Forbid faults (other than MCE) inside NMI.
>>
>> Option 3 is almost easy.  There are really only two kinds of faults
>> that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
>> fix (e.g. with my patches or Peter's patches).
>
> What about int3? Which is needed to make ftrace work. This was a
> requirement to get rid of stomp-machine when updating ftrace functions,
> as well as the rational for doing the whole NMI nesting work in the
> first place.

OK, I'm convinced.

So I'll keep working on fixing up int3 to be less magical.  Patches
coming eventually.

--Andy

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: Dealing with the NMI mess
  2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski
                   ` (2 preceding siblings ...)
  2015-07-23 21:20 ` Steven Rostedt
@ 2015-07-24 16:33 ` Raymond Jennings
  3 siblings, 0 replies; 85+ messages in thread
From: Raymond Jennings @ 2015-07-24 16:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel@vger.kernel.org, Willy Tarreau,
	Borislav Petkov, Thomas Gleixner, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, Brian Gerst

On Thu, 2015-07-23 at 13:21 -0700, Andy Lutomirski wrote:
> [moved to a new thread, cc list trimmed]
> 
> Hi all-
> 
> We've considered two approaches to dealing with NMIs:
> 
> 1. Allow nesting.  We know quite well how messy that is.

This might be a stupid question, but

1.  What exactly does the NMI handler handle
2.  Is it possible for the NMI handler to just increment a counter and
return if it nests, and let the outer handler notice and rerun itself.

> 2. Forbid IRET inside NMIs.  Doable but maybe not that pretty.
> 
> We haven't considered:
> 
> 3. Forbid faults (other than MCE) inside NMI.
> 
> Option 3 is almost easy.  There are really only two kinds of faults
> that can legitimately nest inside NMI: #PF and #DB.  #DB is easy to
> fix (e.g. with my patches or Peter's patches).
> 
> What if we went all out and forbade page faults in NMI as well.  There
> are two reasons that I can think of that we might page fault inside an
> NMI:
> 
> a) vmalloc fault.  I think Ingo already half-implemented a rework to
> eliminate vmalloc faults entirely.
> 
> b) User memory access faults.
> 
> The reason we access user state in general from an NMI is to allow
> perf to capture enough user stack data to let the tooling backtrace
> back to user space.  What if we did it differently?  Instead of
> capturing this data in NMI context, capture it in
> prepare_exit_to_usermode.  That would let us capture user state
> *correctly*, which we currently can't really do.  There's a
> never-ending series of minor bugs in which we try to guess the user
> register state from NMI context, and it sort of works.  In
> prepare_exit_to_usermode, we really truly know the user state.
> There's a race where an NMI hits during or after
> prepare_exit_to_usermode, but maybe that's okay -- just admit defeat
> in that case and don't show the user state.  (Realistically, without
> CFI data, we're not going to be guaranteed to get the right state
> anyway.)
> 
> To make this work, we'd have to teach NMI-from-userspace to call the
> callback itself.  It would look like:
> 
> prepare_exit_to_usermode() {
>   ...
>   while (blah blah blah) {
>     if (cached_flags & TIF_PERF_CAPTURE_USER_STATE)
>       perf_capture_user_state();
>     ...
>   }
>   ...
> }
> 
> and then, on NMI exit, we'd call perf_capture_user_state directly,
> since we don't want to enable IRQs or do opportunsitic sysret on exit
> from NMI.  (Why not?  Because NMIs are still masked, and we don't want
> to pay for double-IRET to unmask them, so we really want to leave IRQs
> off and IRET straight back to user mode.)
> 
> There's an unavoidable race in which we enter user mode with
> TIF_PERF_CAPTURE_USER_STATE still set.  In principle, we could
> IPI-to-self from the NMI handler to cover that case (mostly -- we
> capture the wrong state if we're on our way to an IRET fault), or we
> could just check on entry if the flag is still set and, if so, admit
> defeat.
> 
> Peter, can this be done without breaking the perf ABI?  If we were
> designing all of this stuff from scratch right now, I'd suggest doing
> it this way, but I'm not sure whether it makes sense to try to
> retrofit it in.
> 
> 
> If we decide to stick with option 2, then I've now convinced myself
> that banning all kernel breakpoints and watchpoints during NMI
> processing is probably for the best.  Maybe we should go one step
> farther and ban all DR7 breakpoints period.  Sure, it will slow down
> perf if there are user breakpoints or watchpoints set, but, having
> looked at the asm, returning from #DB using RET is, while doable,
> distinctly ugly.
> 
> --Andy
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2015-09-08 16:21 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-07-23 20:21 Dealing with the NMI mess Andy Lutomirski
2015-07-23 20:38 ` Linus Torvalds
2015-07-23 20:49   ` Andy Lutomirski
2015-07-23 21:08     ` Linus Torvalds
2015-07-23 21:31       ` Steven Rostedt
2015-07-23 21:46         ` Willy Tarreau
2015-07-23 21:46           ` Andy Lutomirski
2015-07-23 21:50             ` Willy Tarreau
2015-07-23 21:48         ` Linus Torvalds
2015-07-23 21:50           ` Andy Lutomirski
2015-07-23 21:59             ` Linus Torvalds
2015-07-24  8:13               ` Peter Zijlstra
2015-07-24  9:02                 ` Willy Tarreau
2015-07-24 11:58                 ` Steven Rostedt
2015-07-24 12:43                   ` Peter Zijlstra
2015-07-24 13:03                     ` Steven Rostedt
2015-07-24 13:21                       ` Willy Tarreau
2015-07-24 13:30                         ` Peter Zijlstra
2015-07-24 13:33                           ` Peter Zijlstra
2015-07-24 14:31                         ` Steven Rostedt
2015-07-24 14:59                           ` Willy Tarreau
2015-07-24 15:16                             ` Steven Rostedt
2015-07-24 15:26                               ` Willy Tarreau
2015-07-24 15:30                                 ` Peter Zijlstra
2015-07-24 15:33                                   ` Willy Tarreau
2015-07-24 18:29                                   ` Linus Torvalds
2015-07-24 18:41                                     ` Linus Torvalds
2015-07-24 19:05                                       ` Steven Rostedt
2015-07-24 19:55                                     ` Peter Zijlstra
2015-07-24 20:22                                       ` Linus Torvalds
2015-07-24 20:51                                         ` Peter Zijlstra
2015-07-24 21:07                                           ` Steven Rostedt
2015-07-24 21:08                                           ` Andy Lutomirski
2015-07-30 15:41                                             ` Paolo Bonzini
2015-07-30 21:22                                               ` Andy Lutomirski
2015-07-30 21:58                                                 ` Brian Gerst
2015-07-30 22:59                                                 ` Thomas Gleixner
2015-07-31  4:22                                                 ` Borislav Petkov
2015-07-31  5:11                                                   ` Andy Lutomirski
2015-07-31  7:51                                                     ` Paolo Bonzini
2015-07-31  8:03                                                     ` Borislav Petkov
2015-07-31  9:27                                                       ` Paolo Bonzini
2015-07-31 10:25                                                         ` Borislav Petkov
2015-07-31 10:26                                                           ` Paolo Bonzini
2015-07-31 10:32                                                             ` Borislav Petkov
2015-09-07  5:39                                                       ` Maciej W. Rozycki
2015-09-07  7:42                                                         ` Ingo Molnar
2015-09-07  8:19                                                           ` Maciej W. Rozycki
2015-09-07 10:19                                                             ` Paolo Bonzini
2015-09-07 17:01                                                               ` Maciej W. Rozycki
2015-09-07 17:22                                                                 ` Andy Lutomirski
2015-09-07 19:30                                                                   ` Maciej W. Rozycki
2015-09-07 21:56                                                                     ` Andy Lutomirski
2015-09-08 16:21                                                                       ` Maciej W. Rozycki
2015-07-24 23:53                                           ` Linus Torvalds
2015-07-24 15:34                                 ` Steven Rostedt
2015-07-24 15:49                                   ` Willy Tarreau
2015-07-24 15:48                 ` Andy Lutomirski
2015-07-24 16:02                   ` Steven Rostedt
2015-07-24 16:08                     ` Willy Tarreau
2015-07-24 16:31                       ` Steven Rostedt
2015-07-24 16:06                   ` Steven Rostedt
2015-07-24 16:25                   ` Willy Tarreau
2015-07-24 17:21                     ` Andy Lutomirski
2015-07-24 17:10                   ` Willy Tarreau
2015-07-24 17:20                     ` Andy Lutomirski
2015-07-30 15:54                       ` Paolo Bonzini
2015-07-24 17:21                     ` Willy Tarreau
2015-07-23 20:52   ` Willy Tarreau
2015-07-23 20:53     ` Andy Lutomirski
2015-07-23 21:07       ` Willy Tarreau
2015-07-23 21:13     ` Linus Torvalds
2015-07-23 21:18       ` Willy Tarreau
2015-07-23 21:20   ` Peter Zijlstra
2015-07-23 21:35     ` Linus Torvalds
2015-07-23 21:45       ` Andy Lutomirski
2015-07-23 21:54         ` Linus Torvalds
2015-07-23 21:59           ` Andy Lutomirski
2015-07-23 22:03             ` Linus Torvalds
2015-07-24 10:28             ` Peter Zijlstra
2015-07-24 11:06           ` Peter Zijlstra
2015-07-23 21:17 ` Peter Zijlstra
2015-07-23 21:20 ` Steven Rostedt
2015-07-23 21:46   ` Andy Lutomirski
2015-07-24 16:33 ` Raymond Jennings

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).