[RFC] x86_64: A real proposal for iret-less return to kernel

* [RFC] x86_64: A real proposal for iret-less return to kernel
@ 2014-05-21  0:53 Andy Lutomirski
  2014-05-21  2:27 ` Steven Rostedt
                   ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21  0:53 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel@vger.kernel.org
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar, Thomas Gleixner,
	Borislav Petkov, Andi Kleen

Here's a real proposal for iret-less return.  If this is correct, then
NMIs will never nest, which will probably delete a lot more scariness
than is added by the code I'm describing.

The rest of this email is valid markdown :)  If I end up implementing
this, this text will go straight into Documentation/x86/x86_64.

tl;dr: The only particularly tricky cases are exit from #DB, #BP, and
#MC.  I think they're not so bad, though.

FWIW, if there's a way to read the NMI masking bit, this would be a
lot simpler.  I don't know of any way to do that, though.

`IRET`-less return
==================

There are at least two ways that we can return from a trap entry:
`IRET` and `RET`.  They have a few important differences.

  * `IRET` is very slow on all current (2014) CPUs -- it seems to
    take hundreds of cycles.  `RET` is fast.

  * `IRET` unconditionally unmasks NMIs.  `RET` never unmasks NMIs.

  * `IRET` can change `CS`, `RSP`, `SS`, `RIP`, and `RFLAGS`
    atomically.  `RET` can't; it requires a return address on the
    stack, and it can't apply anything other than a small offset to
    the stack pointer.  It can, in theory, change `CS`, but this
    seems unlikely to be helpful.
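
As a rough illustration of the last point, here is what each
instruction pops off the stack, written as C structs.  The struct and
field names are made up for this sketch; only the slot order (five
8-byte slots for a 64-bit `IRET`, with `RIP` at the lowest address) is
architectural.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: the frame a 64-bit IRET consumes. */
struct iret_frame {
	uint64_t rip;		/* lowest address; RSP points here */
	uint64_t cs;
	uint64_t rflags;
	uint64_t rsp;
	uint64_t ss;
};

/* Hypothetical name: a near RET pops only the return address. */
struct ret_frame {
	uint64_t rip;
};

/* Five slots versus one: RET has nowhere to carry RSP, SS or RFLAGS. */
_Static_assert(sizeof(struct iret_frame) == 5 * sizeof(uint64_t),
	       "IRET frame is five 8-byte slots");
_Static_assert(sizeof(struct ret_frame) == sizeof(uint64_t),
	       "RET frame is one 8-byte slot");
```

So a `RET`-based return has to restore `RFLAGS` and `RSP` by other
means before the final `RET` executes.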

Times when we must use `IRET`
=============================

  * If we're returning to a different `CS` (i.e. if firmware is
    doing something funny or if we're returning to userspace), then
    `RET` won't help; we need to use `IRET` unless we're willing to
    play fragile games with `SYSEXIT` or `SYSRET`.

  * If we are changing stacks, then we need to be extremely careful
    about using `RET`: using `RET` requires that we put the target
    `RIP` on the target stack, so the target stack must be valid.
    This means that we cannot use `RET` if, for example, a `SYSCALL`
    just happened.

  * If we're returning from NMI, then `IRET` is mandatory: we need to
    unmask NMIs, and `IRET` is the only way to do that.

Note that, if `RFLAGS.IF` is set, then interrupts were enabled when
we trapped, so `RET` is safe.

Times when we must use `RET`
============================

If there's an NMI on the stack, we must use `RET` until we're ready
to re-enable NMIs.

Assumptions
===========

  * Neither the NMI, the MCE handler, nor anything that nests inside
    them will ever change `CS` or run with an invalid stack.

  * Interrupts will never be enabled with an NMI on the stack.

  * We explicitly do not assume that we can reliably determine
    whether we were on user `GS` or kernel `GS` when a trap happens.
    In current (3.15) kernels we can tell, but if we ever enable
    `WRGSBASE` then we will lose that ability.

  * The IST interrupts are: #DB, #BP, NMI, #DF, #SS, and #MC.

  * We can add a per-cpu variable `nmi_mce_nest_count` that is nonzero
    whenever an NMI or MCE is on the stack.  We'll increment it at the
    very beginning of the NMI handler and clear it at the very end.
    We will also increment it in `do_machine_check` before doing
    anything that can cause an interrupt.  The result is that the only
    interrupt that can happen with `nmi_mce_nest_count == 0` in NMI
    context is an MCE at the beginning or end of the NMI handler.
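
A minimal sketch of that bookkeeping, in plain userspace C.  All
function names are hypothetical, a plain global stands in for the
per-cpu variable, and the matching decrement on the MCE side is an
assumption: the text above only says where `do_machine_check`
increments.

```c
#include <stdbool.h>

/* Stand-in for the proposed per-cpu variable (models one CPU only). */
static unsigned int nmi_mce_nest_count;

/* Very first thing the NMI handler does. */
static void nmi_entry_sketch(void)
{
	nmi_mce_nest_count++;
}

/* Very last thing the NMI handler does: clear, not decrement. */
static void nmi_exit_sketch(void)
{
	nmi_mce_nest_count = 0;
}

/* In do_machine_check, before anything that can cause an interrupt. */
static void mce_entry_sketch(void)
{
	nmi_mce_nest_count++;
}

/* Assumption: the MCE side undoes its own increment on the way out. */
static void mce_exit_sketch(void)
{
	nmi_mce_nest_count--;
}

/* Nonzero count means an NMI or MCE is somewhere on the stack. */
static bool nmi_or_mce_on_stack(void)
{
	return nmi_mce_nest_count != 0;
}
```

The windows mentioned above are the instructions before
`nmi_entry_sketch()` and after `nmi_exit_sketch()`: an MCE landing
there sees a count of zero even though it is in NMI context.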


The algorithm
=============

  1. If the target `CS` is not the standard 64-bit kernel CPL0
     selector, then never use `RET`.  This is safe: this will never
     happen with an NMI on the stack.

  2. If we are returning from a non-IST interrupt, then use `RET`.
     Non-IST interrupts use the interrupted code's stack, so the
     stack is always valid.

  3. If we are returning from NMI, then use `IRET`.

  4. If we are returning from #DF or #SS, then use `IRET`.  These
     interrupts cannot occur inside an NMI, or, at the very least,
     if they do happen, then they are not recoverable.

  5. If we are returning from #DB or #BP, then use `RET` if
     `nmi_mce_nest_count != 0` and `IRET` otherwise.

  6. If we are returning from #MC, use `IRET`, unless the return address is
     to the NMI entry or exit code, in which case we use `RET`.
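
The six steps can be sketched as one decision function.  This is only
an illustration of the list above, not proposed kernel code: the enum,
struct and function names are all hypothetical, step 3 is read as
"return from NMI", and the two boolean inputs are assumed to be
computed by the caller.

```c
#include <stdbool.h>

enum trap_kind {
	TRAP_NON_IST,	/* anything that does not use an IST stack */
	TRAP_NMI,
	TRAP_DF,
	TRAP_SS,
	TRAP_DB,
	TRAP_BP,
	TRAP_MC,
};

enum ret_insn { USE_IRET, USE_RET };

struct return_ctx {
	enum trap_kind kind;
	bool target_cs_is_kernel64;	/* standard 64-bit CPL0 selector? */
	unsigned int nmi_mce_nest_count;
	bool target_rip_in_nmi_entry_exit; /* return lands in NMI asm? */
};

static enum ret_insn choose_return(const struct return_ctx *c)
{
	/* Step 1: any other CS means IRET, unconditionally. */
	if (!c->target_cs_is_kernel64)
		return USE_IRET;

	switch (c->kind) {
	case TRAP_NON_IST:	/* step 2: interrupted stack is valid */
		return USE_RET;
	case TRAP_NMI:		/* step 3: must unmask NMIs */
		return USE_IRET;
	case TRAP_DF:		/* step 4 */
	case TRAP_SS:
		return USE_IRET;
	case TRAP_DB:		/* step 5 */
	case TRAP_BP:
		return c->nmi_mce_nest_count != 0 ? USE_RET : USE_IRET;
	case TRAP_MC:		/* step 6 */
		return c->target_rip_in_nmi_entry_exit ? USE_RET : USE_IRET;
	}
	return USE_IRET;	/* unknown kind: fall back to IRET */
}
```

For example, a #BP taken inside an NMI handler (count nonzero, kernel
`CS`) comes out as `USE_RET`, while the same #BP taken from ordinary
kernel code comes out as `USE_IRET`.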

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  0:53 [RFC] x86_64: A real proposal for iret-less return to kernel Andy Lutomirski
@ 2014-05-21  2:27 ` Steven Rostedt
  2014-05-21  2:33   ` H. Peter Anvin
  2014-05-21  2:39   ` Andy Lutomirski
  2014-05-21 18:11 ` Andy Lutomirski
  2014-05-21 22:25 ` Andi Kleen
  2 siblings, 2 replies; 68+ messages in thread
From: Steven Rostedt @ 2014-05-21  2:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel@vger.kernel.org, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andi Kleen

On Tue, 2014-05-20 at 17:53 -0700, Andy Lutomirski wrote:
> Here's a real proposal for iret-less return.  If this is correct, then
> NMIs will never nest, which will probably delete a lot more scariness
> than is added by the code I'm describing.

Perhaps we can add this for one window release before we rip out the NMI
nesting code. Perhaps we can add a BUG() if we detect a NMI nest?

> 
> The rest of this email is valid markdown :)  If I end up implementing
> this, this text will go straight into Documentation/x86/x86_64.
> 
> tl;dr: The only particularly tricky cases are exit from #DB, #BP, and
> #MC.  I think they're not so bad, though.
> 
> FWIW, if there's a way to read the NMI masking bit, this would be a
> lot simpler.  I don't know of any way to do that, though.

Is there such a thing on all x86?

> 
> `IRET`-less return
> ==================
> 
> There are at least two ways that we can return from a trap entry:
> `IRET` and `RET`.  They have a few important differences.
> 
>   * `IRET` is very slow on all current (2014) CPUs -- it seems to
>     take hundreds of cycles.  `RET` is fast.

s/fast/faster/ or s/fast/much faster/

> 
>   * `IRET` unconditionally unmasks NMIs.  `RET` never unmasks NMIs.
> 
>   * `IRET` can change `CS`, `RSP`, `SS`, `RIP`, and `RFLAGS`
>     atomically.  `RET` can't; it requires a return address on the
>     stack, and it can't apply anything other than a small offset to
>     the stack pointer.  It can, in theory, change `CS`, but this
>     seems unlikely to be helpful.
> 
> Times when we must use `IRET`
> =============================
> 
>   * If we're returning to a different `CS` (i.e. if firmware is
>     doing something funny or if we're returning to userspace), then
>     `RET` won't help; we need to use `IRET` unless we're willing to
>     play fragile games with `SYSEXIT` or `SYSRET`.
> 
>   * If we are changing stacks, the we need to be extremely careful

s/the we/then we/

>     about using `RET`: using `RET` requires that we put the target
>     `RIP` on the target stack, so the target stack must be valid.
>     This means that we cannot use `RET` if, for example, a `SYSCALL`
>     just happened.
> 
>   * If we're returning from NMI, we `IRET` is mandatory: we need to

s/we/then/

>     unmask NMIs, and `IRET` is the only way to do that.
> 
> Note that, if `RFLAGS.IF` is set, then interrupts were enabled when
> we trapped, so `RET` is safe.

Is it? You mean if IF is set *and* we are in the kernel?

> 
> Times when we must use `RET`
> ============================
> 
> If there's an NMI on the stack, we must use `RET` until we're ready
> to re-enabled NMIs.

I'm a little confused by NMI on the stack. Do you mean NMI on the target
stack? If so, please state that.


> 
> Assumptions
> ===========
> 
>   * Neither the NMI, the MCE handler, nor anything that nests inside
>     them will ever change `CS` or run with an invalid stack.
> 
>   * Interrupts will never be enabled with an NMI on the stack

target stack?

> .
> 
>   * We explicitly do not assume that we can reliably determine
>     whether we were on user `GS` or kernel `GS` when a trap happens.
>     In current (3.15) kernels we can tell, but if we ever enable
>     `WRGSBASE` then we will lose that ability.
> 
>   * The IST interrupts are: #DB #BP #NM #DF #SS, and #MC.
> 
>   * We can add a per-cpu variable `nmi_mce_nest_count` that is nonzero
>     whenever an NMI or MCE is on the stack.  We'll increment it at the
>     very beginning of the NMI handler and clear it at the very end.
>     We will also increment it in `do_machine_check` before doing
>     anything that can cause an interrupt.  The result is that the only
>     interrupt that can happen with `nmi_mce_nest_count == 0` in NMI
>     context is an MCE at the beginning or end of the NMI handler.

Just note that this will probably be done in the C code, as NMI has
issues with gs being safe.

Also, should we call it "nmi" specifically? Perhaps
"ist_stack_nest_count", stating that the stack is ist to match
do_machine_check as well? Maybe that's not a good name either. Someone
else can come up with something that's a little more generic than NMI?

> 
> 
> The algorithm
> =============
> 
>   1. If the target `CS` is not the standard 64-bit kernel CPL0
>      selector, then never use `RET`.  This is safe: this will never
>      happen with an NMI on the stack.

target stack?

> 
>   2. If we are returning from a non-IST interrupt, then use `RET`.
>      Non-IST interrupts use the interrupted code's stack, so the
>      stack is always valid.
> 
>   3. If we are returning from #NM, then use `IRET`.
> 
>   4. If we are returning from #DF or #SS, then use `IRET`.  These
>      interrupts cannot occur inside an NMI, or, at the very least,
>      if they do happen, then they are not recoverable.
> 
>   5. If we are returning from #DB or #BP, then use `RET` if
>      `nmi_mce_nest_count != 0` and `IRET` otherwise.
> 
>   6. If we are returning from #MC, use `IRET`, unless the return address is
>      to the NMI entry or exit code, in which case we use `RET`.

Seems interesting.

-- Steve




* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  2:27 ` Steven Rostedt
@ 2014-05-21  2:33   ` H. Peter Anvin
  2014-05-21  2:39   ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: H. Peter Anvin @ 2014-05-21  2:33 UTC (permalink / raw)
  To: Steven Rostedt, Andy Lutomirski
  Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Ingo Molnar,
	Thomas Gleixner, Borislav Petkov, Andi Kleen

On 05/20/2014 07:27 PM, Steven Rostedt wrote:
>>
>> FWIW, if there's a way to read the NMI masking bit, this would be a
>> lot simpler.  I don't know of any way to do that, though.
> 
> Is there such a thing on all x86?
> 

It is not possible to read this bit without the assistance of SMM to the
best of my knowledge.

I'm going to do a detailed review of this proposal as soon as possible,
hopefully later tonight.

	-hpa



* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  2:27 ` Steven Rostedt
  2014-05-21  2:33   ` H. Peter Anvin
@ 2014-05-21  2:39   ` Andy Lutomirski
  2014-05-21  9:46     ` Borislav Petkov
  2014-05-21 12:51     ` Jiri Kosina
  1 sibling, 2 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21  2:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel@vger.kernel.org, H. Peter Anvin, Linus Torvalds,
	Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andi Kleen

On Tue, May 20, 2014 at 7:27 PM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Tue, 2014-05-20 at 17:53 -0700, Andy Lutomirski wrote:
>>
>> If there's an NMI on the stack, we must use `RET` until we're ready
>> to re-enabled NMIs.
>
> I'm a little confused by NMI on the stack. Do you mean NMI on the target
> stack? If so, please state that.

I mean that we're in an NMI handler or in anything nested inside it.


>>   * We can add a per-cpu variable `nmi_mce_nest_count` that is nonzero
>>     whenever an NMI or MCE is on the stack.  We'll increment it at the
>>     very beginning of the NMI handler and clear it at the very end.
>>     We will also increment it in `do_machine_check` before doing
>>     anything that can cause an interrupt.  The result is that the only
>>     interrupt that can happen with `nmi_mce_nest_count == 0` in NMI
>>     context is an MCE at the beginning or end of the NMI handler.
>
> Just note that this will probably be done in the C code, as NMI has
> issues with gs being safe.
>
> Also, should we call it "nmi" specifically. Perhaps
> "ist_stack_nest_count", stating that the stack is ist to match
> do_machine_check as well? Maybe that's not a good name either. Someone
> else can come up with something that's a little more generic than NMI?

So the issue here is that we can have an NMI followed immediately by
an MCE.  The MCE code can call force_sig, which could plausibly result
in a kprobe or something similar happening.  The return from that
needs to use IRET.

Since I don't see a clean way to reliably detect that we're inside an
NMI, I propose instead detecting when we're in *either* NMI or MCE,
hence the name.  As long as we mark do_machine_check and whatever asm
code calls it __kprobes, I think we'll be okay.

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  2:39   ` Andy Lutomirski
@ 2014-05-21  9:46     ` Borislav Petkov
  2014-05-21 15:21       ` Andy Lutomirski
  2014-05-21 12:51     ` Jiri Kosina
  1 sibling, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21  9:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Andi Kleen

On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> So the issue here is that we can have an NMI followed immediately by
> an MCE.

That part might need clarification for me: #MC is higher prio interrupt
than NMI so a machine check exception can interrupt the NMI handler at
any point.

But you're talking only about the small window when nmi_mce_nest_count
hasn't been incremented yet, right? I.e., this:

"The result is that the only interrupt that can happen with
`nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or
end of the NMI handler."

Correct?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  9:46     ` Borislav Petkov
@ 2014-05-21 15:21       ` Andy Lutomirski
  2014-05-21 16:30         ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 15:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Steven Rostedt, Ingo Molnar

On May 21, 2014 2:46 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > So the issue here is that we can have an NMI followed immediately by
> > an MCE.
>
> That part might need clarification for me: #MC is higher prio interrupt
> than NMI so a machine check exception can interrupt the NMI handler at
> any point.

Except that NMI can interrupt #MC at any point as well, I think.

>
> But you're talking only about the small window when nmi_mce_nest_count
> hasn't been incremented yet, right? I.e., this:
>
> "The result is that the only interrupt that can happen with
> `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or
> end of the NMI handler."
>
> Correct?

Exactly.

>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 15:21       ` Andy Lutomirski
@ 2014-05-21 16:30         ` Borislav Petkov
  2014-05-21 17:52           ` Andy Lutomirski
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Steven Rostedt, Ingo Molnar

On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
> On May 21, 2014 2:46 AM, "Borislav Petkov" <bp@alien8.de> wrote:
> >
> > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
> > > So the issue here is that we can have an NMI followed immediately by
> > > an MCE.
> >
> > That part might need clarification for me: #MC is higher prio interrupt
> > than NMI so a machine check exception can interrupt the NMI handler at
> > any point.
> 
> Except that NMI can interrupt #MC at any point as well, I think.

No, #MC is higher prio than NMI, actually even the highest along with
RESET#. And come to think of it, all exceptions which have a higher prio
than NMI should touch that nmi_mce_nest_count thing.

See Table 8-8 here:

http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf

That's the prios before 3, i.e. the NMI one.

HOWEVER, this all is spoken with the assumption that higher prio
interrupts can interrupt the NMI handler too at the first instruction
boundary they've been recognized.

The text is talking about simultaneous interrupts and not about
interrupt handler preemption.

But it must be because Steve wouldn't be dealing with exceptions in the
NMI handler and nested NMIs otherwise...

Hmmm.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 16:30         ` Borislav Petkov
@ 2014-05-21 17:52           ` Andy Lutomirski
  2014-05-21 18:07             ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 17:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Steven Rostedt, Ingo Molnar

On Wed, May 21, 2014 at 9:30 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
>> On May 21, 2014 2:46 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>> >
>> > On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
>> > > So the issue here is that we can have an NMI followed immediately by
>> > > an MCE.
>> >
>> > That part might need clarification for me: #MC is higher prio interrupt
>> > than NMI so a machine check exception can interrupt the NMI handler at
>> > any point.
>>
>> Except that NMI can interrupt #MC at any point as well, I think.
>
> No, #MC is higher prio than NMI, actually even the highest along with
> RESET#. And come to think of it, all exceptions which have a higher prio
> than NMI should touch that nmi_mce_nest_count thing.
>
> See Table 8-8 here:
>
> http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf
>
> That's the prios before 3, i.e. the NMI one.
>
> HOWEVER, this all is spoken with the assumption that higher prio
> interrupts can interrupt the NMI handler too at the first instruction
> boundary they've been recognized.
>
> The text is talking about simultaneous interrupts and not about
> interrupt handler preemption.
>
> But it must be because Steve wouldn't be dealing with exceptions in the
> NMI handler and nested NMIs otherwise...

I think that some of these exceptions are synchronous things (e.g.
int3 or page faults) that happen because the kernel caused them.

Anyway, going through the list:

Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

SMI is already supposedly correct wrt nesting inside NMI.

Debug register stuff should be handled in my outline.  Hopefully
correctly :)  We need to make sure that no breakpoints trip before the
nmi count is incremented, but that should be straightforward as long
as we don't do ridiculous things like poking at userspace addresses.
I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
kernel address (e.g. the nesting count) or enables single-stepping,
we'll mess up.

It may pay to bump the nesting count inside the #DB and #BP handlers
and to check the RIP that we're returning to, but that starts to look
ugly, and we have to be careful about NMI, immediate breakpoint, and
then immediate MCE.  I'd rather just be able to say that there are
some very short windows in which a debug or breakpoint exception will
never happen.

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 17:52           ` Andy Lutomirski
@ 2014-05-21 18:07             ` Borislav Petkov
  0 siblings, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 18:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Thomas Gleixner, Linus Torvalds, Steven Rostedt, Ingo Molnar

On Wed, May 21, 2014 at 10:52:01AM -0700, Andy Lutomirski wrote:
> I think that some of these exceptions are synchronous things (e.g.
> int3 or page faults) that happen because the kernel caused them.
> 
> Anyway, going through the list:
> 
> Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

Yeah.

> SMI is already supposedly correct wrt nesting inside NMI.

It better be. :)

> Debug register stuff should be handled in my outline.  Hopefully
> correctly :)  We need to make sure that no breakpoints trip before the
> nmi count is incremented, but that should be straightforward as long
> as we don't do ridiculous things like poking at userspace addresses.
> I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
> kernel address (e.g. the nesting count) or enables single-stepping,
> we'll mess up.
> 
> 
> It may pay to bump the nesting count inside the #DB and #BP handlers
> and to check the RIP that we're returning to,

Right, at a first glance, all those higher prio exceptions' nesting
count could be nicely dealt with in those paranoidzeroentry* macros.

> but that starts to look ugly, and we have to be careful about NMI,
> immediate breakpoint, and them immediate MCE.

Btw, hpa just confirmed that exceptions are never deferred and thus can
happen while the NMI handler runs. Which means, we should defensively
prepare for NMI handlers being interrupted at any point.

> I'd rather just be able to say that there are some very short windows
> in which a debug or breakpoint exception will never happen.

Sounds perfectly fine to me.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  2:39   ` Andy Lutomirski
  2014-05-21  9:46     ` Borislav Petkov
@ 2014-05-21 12:51     ` Jiri Kosina
  2014-05-21 15:21       ` Andy Lutomirski
  1 sibling, 1 reply; 68+ messages in thread
From: Jiri Kosina @ 2014-05-21 12:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Andi Kleen

On Tue, 20 May 2014, Andy Lutomirski wrote:

> So the issue here is that we can have an NMI followed immediately by
> an MCE.  The MCE code can call force_sig

This is interesting by itself. force_sig() takes siglock spinlock. This 
really looks like a deadlock sitting there waiting to happen.

-- 
Jiri Kosina
SUSE Labs



* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 12:51     ` Jiri Kosina
@ 2014-05-21 15:21       ` Andy Lutomirski
  2014-05-21 16:33         ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 15:21 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Borislav Petkov, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

On May 21, 2014 5:51 AM, "Jiri Kosina" <jkosina@suse.cz> wrote:
>
> On Tue, 20 May 2014, Andy Lutomirski wrote:
>
> > So the issue here is that we can have an NMI followed immediately by
> > an MCE.  The MCE code can call force_sig
>
> This is interesting by itself. force_sig() takes siglock spinlock. This
> really looks like a deadlock sitting there waiting to happen.

ISTM the do_machine_check code ought to consider any kill-worthy MCE
from kernel space to be non-recoverable, but I want to keep the scope
of these patches under control.

That being said, if an MCE that came from CPL0 never tried to return,
this would be simpler.  I don't know enough about the machine check
architecture to know whether that's a reasonable thing to do.

--Andy

>
> --
> Jiri Kosina
> SUSE Labs
>


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 15:21       ` Andy Lutomirski
@ 2014-05-21 16:33         ` Borislav Petkov
  2014-05-21 21:25           ` Jiri Kosina
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 16:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

On Wed, May 21, 2014 at 08:21:57AM -0700, Andy Lutomirski wrote:
> ISTM the do_machine_check code ought to consider any kill-worthy MCE
> from kernel space to be non-recoverable, but I want to keep the scope
> of these patches under control.

MCA has a bit called RIPV which, if set, signals that RIP is valid and
it is safe to return provided we've taken proper care of handling even
non-correctable errors (memory poisoning, etc).

If RIPV is not set, we panic anyway.

> That being said, if an MCE that came from CPL0 never tried to return,
> this would be simpler.  I don't know enough about the machine check
> architecture to know whether that's a reasonable thing to do.

Yeah, there are cases where MCE can return, see above.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 16:33         ` Borislav Petkov
@ 2014-05-21 21:25           ` Jiri Kosina
  2014-05-21 21:35             ` Andy Lutomirski
  2014-05-21 21:37             ` Linus Torvalds
  0 siblings, 2 replies; 68+ messages in thread
From: Jiri Kosina @ 2014-05-21 21:25 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

On Wed, 21 May 2014, Borislav Petkov wrote:

> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
> > from kernel space to be non-recoverable, but I want to keep the scope
> > of these patches under control.
> 
> MCA has a bit called RIPV which, if set, signals that RIP is valid and
> it is safe to return provided we've taken proper care of handling even
> non-correctable errors (memory poisoning, etc).

Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered 
at the time the CPU was already holding sighand->siglock for that 
particular task, it'll deadlock against itself.

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:25           ` Jiri Kosina
@ 2014-05-21 21:35             ` Andy Lutomirski
  2014-05-21 21:48               ` Borislav Petkov
  2014-05-21 21:37             ` Linus Torvalds
  1 sibling, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 21:35 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Borislav Petkov, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina <jkosina@suse.cz> wrote:
> On Wed, 21 May 2014, Borislav Petkov wrote:
>
>> > ISTM the do_machine_check code ought to consider any kill-worthy MCE
>> > from kernel space to be non-recoverable, but I want to keep the scope
>> > of these patches under control.
>>
>> MCA has a bit called RIPV which, if set, signals that RIP is valid and
>> it is safe to return provided we've taken proper care of handling even
>> non-correctable errors (memory poisoning, etc).
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.
>

If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
like the right solution anyway.

Are there any machine check exceptions for which it makes sense to
continue right where we left off without a signal?  Is CMIC such a
beast?  Can CMIC be delivered when interrupts are off?

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:35             ` Andy Lutomirski
@ 2014-05-21 21:48               ` Borislav Petkov
  2014-05-21 21:52                 ` Andy Lutomirski
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 21:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar, Tony Luck

On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
> like the right solution anyway.
>
> Are there any machine check exceptions for which it makes sense to
> continue right where we left off without a signal?  Is CMIC such a
> beast?  Can CMIC be delivered when interrupts are off?

I think you mean CMCI and that's not even reported with a MCE exception
- there's a separate APIC interrupt for that.

I think this signal thing is for killing processes which have poisoned
memory but this memory can be contained within that process and the
physical page frame can be poisoned so that it doesn't get used ever
again. In any case, this is an example for an uncorrectable error which
needs action from us but doesn't necessarily have to kill the whole
machine.

This is supposed to be more graceful instead of consuming the corrupted
data and sending it out to disk.

But sending signals from #MC context is definitely a bad idea. I think
we had addressed this with irq_work at some point but my memory is very
hazy.

@Tony: this is something we need to take a look at soonish.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:48               ` Borislav Petkov
@ 2014-05-21 21:52                 ` Andy Lutomirski
  2014-05-21 21:55                   ` Borislav Petkov
  2014-05-21 22:01                   ` Luck, Tony
  0 siblings, 2 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 21:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar, Tony Luck

On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
>> If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
>> like the right solution anyway.
>>
>> Are there any machine check exceptions for which it makes sense to
>> continue right where we left off without a signal?  Is CMIC such a
>> beast?  Can CMIC be delivered when interrupts are off?
>
> I think you mean CMCI and that's not even reported with an MCE exception
> - there's a separate APIC interrupt for that.
>
> I think this signal thing is for killing processes which have poisoned
> memory but this memory can be contained within that process and the
> physical page frame can be poisoned so that it doesn't get used ever
> again. In any case, this is an example of an uncorrectable error which
> needs action from us but doesn't necessarily have to kill the whole
> machine.
>
> This is supposed to be more graceful instead of consuming the corrupted
> data and sending it out to disk.
>
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:52                 ` Andy Lutomirski
@ 2014-05-21 21:55                   ` Borislav Petkov
  2014-05-21 21:59                     ` Jiri Kosina
  2014-05-21 21:59                     ` Andy Lutomirski
  2014-05-21 22:01                   ` Luck, Tony
  1 sibling, 2 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 21:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar, Tony Luck

On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
> Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

Let me quote Jiri:

(1) task sends signal to itself
(2) it acquires sighand->siglock so that it's able to queue the signal
(3) MCE triggers
(4) it tries to send a signal to the same task
(5) it tries to acquire sighand->siglock and loops forever
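
A minimal user-space model of that sequence, assuming a plain
non-recursive spinlock. `siglock` here is a C11 `atomic_flag` standing in
for sighand->siglock, and every function name is hypothetical. The
handler uses a trylock so the model can report the deadlock instead of
actually spinning:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for sighand->siglock: a non-recursive spinlock. */
static atomic_flag siglock = ATOMIC_FLAG_INIT;

static bool siglock_trylock(void)
{
	/* test_and_set returns the previous value; false means we got it. */
	return !atomic_flag_test_and_set(&siglock);
}

static void siglock_unlock(void)
{
	atomic_flag_clear(&siglock);
}

/*
 * Models steps (4)-(5): the handler runs on top of the interrupted task.
 * A real spinlock would spin here; the model only reports whether it would.
 */
static bool mce_handler_would_deadlock(void)
{
	if (siglock_trylock()) {
		siglock_unlock();
		return false;
	}
	/* The owner is the context we interrupted, so it can never unlock. */
	return true;
}

/* Models steps (1)-(3): take the lock, then get hit by the MCE. */
static bool send_signal_to_self_interrupted_by_mce(void)
{
	bool deadlock;

	(void)siglock_trylock();			/* (2) task holds siglock */
	deadlock = mce_handler_would_deadlock();	/* (3)-(5) */
	siglock_unlock();
	return deadlock;
}
```

The same handler is harmless when nobody holds the lock, which is the
case when the MCE interrupted user mode.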

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:55                   ` Borislav Petkov
@ 2014-05-21 21:59                     ` Jiri Kosina
  2014-05-21 21:59                     ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: Jiri Kosina @ 2014-05-21 21:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar, Tony Luck

On Wed, 21 May 2014, Borislav Petkov wrote:

> > Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
> > a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
> 
> Let me quote Jiri:
> 
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers
> (4) it tries to send a signal to the same task
> (5) it tries to acquire sighand->siglock and loops forever

Ah, alright, but due to what mce_severity() does, this can't happen, 
because if the current CPU is in the kernel (which is obviously implied by 
holding a spinlock), it never proceeds to send the signal, because 
no_way_out gets set and mce_panic() invoked.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:55                   ` Borislav Petkov
  2014-05-21 21:59                     ` Jiri Kosina
@ 2014-05-21 21:59                     ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 21:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar, Tony Luck

On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
>> Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
>> a signal a remotely reasonable thing to do if !user_mode_vm(regs)?
>
> Let me quote Jiri:
>
> (1) task sends signal to itself
> (2) it acquires sighand->siglock so that it's able to queue the signal
> (3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should
presumably just panic instead of trying to send a signal.

I missed the IN_KERNEL thing because I didn't realize that ->cs was
copied to struct mce.
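
For reference, a rough user-space sketch of the check I mean, assuming
the context is derived from the low two bits (the privilege level) of
the saved CS. `struct mce_model` and `may_signal()` are made up for
illustration and are not the kernel's definitions:

```c
#include <stdint.h>

enum context { IN_KERNEL, IN_USER };

/* Hypothetical, trimmed-down stand-in for struct mce: only the saved CS. */
struct mce_model {
	uint16_t cs;
};

/* The low two bits of a selector are the privilege level; 3 is user mode. */
static enum context error_context(const struct mce_model *m)
{
	return (m->cs & 3) == 3 ? IN_USER : IN_KERNEL;
}

/* 1 if a signal is worth considering, 0 if panicking is the only option. */
static int may_signal(const struct mce_model *m)
{
	return error_context(m) == IN_USER;
}
```

With the usual x86_64 selectors (0x10 for kernel code, 0x33 for user
code) the first case is IN_KERNEL and the second is IN_USER.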

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:52                 ` Andy Lutomirski
  2014-05-21 21:55                   ` Borislav Petkov
@ 2014-05-21 22:01                   ` Luck, Tony
  2014-05-21 22:13                     ` Andy Lutomirski
  1 sibling, 1 reply; 68+ messages in thread
From: Luck, Tony @ 2014-05-21 22:01 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1243 bytes --]

> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.

We added code for recoverable errors to get out of the MC context
before trying to lookup the page and send the signal.  Bottom of
do_machine_check():

        if (cfg->tolerant < 3) {
                if (no_way_out)
                        mce_panic("Fatal machine check on current CPU", &m, msg);
                if (worst == MCE_AR_SEVERITY) {
                        /* schedule action before return to userland */
                        mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
                        set_thread_flag(TIF_MCE_NOTIFY);
                } else if (kill_it) {
                        force_sig(SIGBUS, current);
                }
        }

That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().

The "force_sig()" there is legacy code - and perhaps should just move off to mce_notify_process()
too (need to save "worst" so it will know what to do).
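
The deferral pattern can be modelled in user space roughly like this.
It is only a sketch: `struct task_model` and both functions are
hypothetical, and the real work in mce_notify_process() is reduced to a
counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task state; a model, not the kernel's structures. */
struct task_model {
	bool tif_mce_notify;	/* stands in for TIF_MCE_NOTIFY */
	uint64_t saved_addr;	/* stands in for what mce_save_info() records */
	int signals_sent;
};

/* #MC context: record what happened and set the flag, nothing else. */
static void machine_check_model(struct task_model *t, uint64_t bad_addr)
{
	t->saved_addr = bad_addr;
	t->tif_mce_notify = true;
}

/* Ordinary task context, on the way back to user mode: locks are fine. */
static void exit_to_user_model(struct task_model *t)
{
	if (!t->tif_mce_notify)
		return;
	t->tif_mce_notify = false;
	/* The page lookup and poisoning would happen here, then the signal. */
	t->signals_sent++;
}
```

Nothing is sent from the handler itself, and the flag is consumed
exactly once on the exit path.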

-Tony

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:01                   ` Luck, Tony
@ 2014-05-21 22:13                     ` Andy Lutomirski
  2014-05-21 22:17                       ` Borislav Petkov
  2014-05-21 22:18                       ` Luck, Tony
  0 siblings, 2 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:13 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 3:01 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> But sending signals from #MC context is definitely a bad idea. I think
>> we had addressed this with irq_work at some point but my memory is very
>> hazy.
>
> We added code for recoverable errors to get out of the MC context
> before trying to lookup the page and send the signal.  Bottom of
> do_machine_check():
>
>         if (cfg->tolerant < 3) {
>                 if (no_way_out)
>                         mce_panic("Fatal machine check on current CPU", &m, msg);
>                 if (worst == MCE_AR_SEVERITY) {
>                         /* schedule action before return to userland */
>                         mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
>                         set_thread_flag(TIF_MCE_NOTIFY);
>                 } else if (kill_it) {
>                         force_sig(SIGBUS, current);
>                 }
>         }
>
> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().

Why is this necessary?

If the MCE hit kernel code, then we're going to die anyway.  If the
MCE hit user code, then we should be in a completely sensible context
and we can just send the signal.

--Andy

>
> The "force_sig()" there is legacy code - and perhaps should just move off to mce_notify_process()
> too (need to save "worst" so it will know what to do).
>
> -Tony



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:13                     ` Andy Lutomirski
@ 2014-05-21 22:17                       ` Borislav Petkov
  2014-05-21 22:20                         ` Andy Lutomirski
  2014-05-21 22:18                       ` Luck, Tony
  1 sibling, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 22:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Luck, Tony, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
> Why is this necessary?
> 
> If the MCE hit kernel code, then we're going to die anyway.  If the
> MCE hit user code, then we should be in a completely sensible context
> and we can just send the signal.

Are we guaranteed that the first thing the process will execute when
scheduled back in are the signal handlers?

And besides, maybe we don't even want to allow to do the switch_to() but
kill it while it is sleeping.

(I know, we're that nasty :-))

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:17                       ` Borislav Petkov
@ 2014-05-21 22:20                         ` Andy Lutomirski
  2014-05-21 22:36                           ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
>> Why is this necessary?
>>
>> If the MCE hit kernel code, then we're going to die anyway.  If the
>> MCE hit user code, then we should be in a completely sensible context
>> and we can just send the signal.
>
> Are we guaranteed that the first thing the process will execute when
> scheduled back in are the signal handlers?

It's not even scheduled out, right?  This should be just like a signal
from a failed page fault, I think.

>
> And besides, maybe we don't even want to allow to do the switch_to() but
> kill it while it is sleeping.

What switch_to?

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:20                         ` Andy Lutomirski
@ 2014-05-21 22:36                           ` Borislav Petkov
  0 siblings, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 22:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Luck, Tony, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 03:20:50PM -0700, Andy Lutomirski wrote:
> It's not even scheduled out, right?

Right.

> This should be just like a signal from a failed page fault, I think.

Right, but there is this additional work that needs to be done
(mce_notify_process()) before sending the signal. So you want to do this
after the MCE handler is done but before you return to the process.

> What switch_to?

Nevermind, that was bollocks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:13                     ` Andy Lutomirski
  2014-05-21 22:17                       ` Borislav Petkov
@ 2014-05-21 22:18                       ` Luck, Tony
  2014-05-21 22:24                         ` Andy Lutomirski
  1 sibling, 1 reply; 68+ messages in thread
From: Luck, Tony @ 2014-05-21 22:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 468 bytes --]

>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().
>
> Why is this necessary?

The recovery path has to do more than just send a signal - it needs to walk processes and
"mm"s to see which have mapped the physical address that the h/w told us has gone bad.

-Tony

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:18                       ` Luck, Tony
@ 2014-05-21 22:24                         ` Andy Lutomirski
  2014-05-21 22:32                           ` Luck, Tony
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 3:18 PM, Luck, Tony <tony.luck@intel.com> wrote:
>>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process().
>>
>> Why is this necessary?
>
> The recovery path has to do more than just send a signal - it needs to walk processes and
> "mm"s to see which have mapped the physical address that the h/w told us has gone bad.

I still feel like I'm missing something.  If we interrupted user space
code, then the context we're in should be identical to the context
we'll get when we're about to return to userspace.

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:24                         ` Andy Lutomirski
@ 2014-05-21 22:32                           ` Luck, Tony
  2014-05-21 22:39                             ` Andy Lutomirski
  0 siblings, 1 reply; 68+ messages in thread
From: Luck, Tony @ 2014-05-21 22:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1289 bytes --]

>> The recovery path has to do more than just send a signal - it needs to walk processes and
>> "mm"s to see which have mapped the physical address that the h/w told us has gone bad.
>
> I still feel like I'm missing something.  If we interrupted user space
> code, then the context we're in should be identical to the context
> we'll get when we're about to return to userspace.

True. And this far along in do_machine_check() we have set all the other cpus
free, so they are heading back to whatever context we interrupted them in. So
we might be able to do all that other stuff inline here ... we interrupted user
mode, so we know we don't hold any locks. Other cpus are running, so they can
complete what they are doing to release any locks we might need.

But it will take a while (to scan all those processes). And we haven't yet
cleared MCG_STATUS ... so another machine check before we do that
would be fatal (x86 doesn't allow nesting).  Even if we moved the work
after the clear of MCG_STATUS we'd still be vulnerable to a new machine
check on x86_64 because we are sitting on the one & only machine check
stack.
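
A toy model of the first hazard, assuming only the documented behaviour
of the MCIP bit (bit 2 of MCG_STATUS): a machine check delivered while
it is still set shuts the CPU down. `struct cpu_model` and the function
names are hypothetical, and the shared-stack problem is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_MCIP (1ULL << 2)	/* machine check in progress */

struct cpu_model {
	uint64_t mcg_status;
	bool shut_down;
};

/* Hardware side: delivering #MC while MCIP is still set is fatal. */
static void deliver_machine_check(struct cpu_model *cpu)
{
	if (cpu->mcg_status & MCG_STATUS_MCIP) {
		cpu->shut_down = true;
		return;
	}
	cpu->mcg_status |= MCG_STATUS_MCIP;
}

/* Software side: the handler clears MCG_STATUS when it is done. */
static void handler_clear_status(struct cpu_model *cpu)
{
	cpu->mcg_status = 0;
}
```

The longer the handler runs before clearing the status, the wider the
window in which a second delivery takes the shutdown branch.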

-Tony

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:32                           ` Luck, Tony
@ 2014-05-21 22:39                             ` Andy Lutomirski
  2014-05-21 22:48                               ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 3:32 PM, Luck, Tony <tony.luck@intel.com> wrote:
>>> The recovery path has to do more than just send a signal - it needs to walk processes and
>>> "mm"s to see which have mapped the physical address that the h/w told us has gone bad.
>>
>> I still feel like I'm missing something.  If we interrupted user space
>> code, then the context we're in should be identical to the context
>> we'll get when we're about to return to userspace.
>
> True. And this far along in do_machine_check() we have set all the other cpus
> free, so they are heading back to whatever context we interrupted them in. So
> we might be able to do all that other stuff inline here ... we interrupted user
> mode, so we know we don't hold any locks. Other cpus are running, so they can
> complete what they are doing to release any locks we might need.
>
> But it will take a while (to scan all those processes). And we haven't yet
> cleared MCG_STATUS ... so another machine check before we do that
> would be fatal (x86 doesn't allow nesting).  Even if we moved the work
> after the clear of MCG_STATUS we'd still be vulnerable to a new machine
> check on x86_64 because we are sitting on the one & only machine check
> stack.

But if we get a new MCE in here, it will be an MCE from kernel context
and it's fatal.  So, yes, we'll clobber the stack, but we'll never
return (unless tolerant is set to something insane), so who cares?

Anyway, I care less about this now that I don't have to worry about it
re: IRET :)

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:39                             ` Andy Lutomirski
@ 2014-05-21 22:48                               ` Borislav Petkov
  2014-05-21 22:52                                 ` Andy Lutomirski
  2014-05-21 23:05                                 ` Luck, Tony
  0 siblings, 2 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 22:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Luck, Tony, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Ok, but we still have to do the work before returning to the process. So
if not mce_notify_process() how else are you suggesting we do this?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:48                               ` Borislav Petkov
@ 2014-05-21 22:52                                 ` Andy Lutomirski
  2014-05-21 23:02                                   ` Borislav Petkov
  2014-05-21 23:05                                 ` Luck, Tony
  1 sibling, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Ok, but we still have to do the work before returning to the process. So
> if not mce_notify_process() how else are you suggesting we do this?

I'm suggesting that you re-enable interrupts and do the work in
do_machine_check.  I think it'll just work.  It might pay to set a
flag so that you panic very loudly if do_machine_check recurses.
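
Something like the following user-space sketch is what I have in mind
for the flag. All names are hypothetical, the flag would presumably be
per-CPU in a real implementation, and panic is reduced to a counter so
the model can be exercised:

```c
#include <stdbool.h>
#include <stddef.h>

static bool in_machine_check;	/* hypothetical; per-CPU in real life */
static int panics;

static void panic_model(const char *msg)
{
	(void)msg;
	panics++;
}

/* slow_work models the long-running recovery done inside the handler. */
static void do_machine_check_model(void (*slow_work)(void))
{
	if (in_machine_check) {
		panic_model("do_machine_check recursed");
		return;
	}
	in_machine_check = true;
	if (slow_work)
		slow_work();
	in_machine_check = false;
}

/* A second machine check arriving while the first is still being handled. */
static void nested_mce(void)
{
	do_machine_check_model(NULL);
}
```

Calling do_machine_check_model(nested_mce) trips the guard once, while
a plain do_machine_check_model(NULL) does not.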

I suspect that, if the hardware is generating machine checks while
doing memory poisoning, the hardware is broken enough that even
panicking might not work, though :)

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:52                                 ` Andy Lutomirski
@ 2014-05-21 23:02                                   ` Borislav Petkov
  0 siblings, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 23:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Luck, Tony, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 03:52:16PM -0700, Andy Lutomirski wrote:
> I'm suggesting that you re-enable interrupts and do the work in
> do_machine_check. I think it'll just work. It might pay to set a flag
> so that you panic very loudly if do_machine_check recurses.

And that is quite likely to happen if we're trying to poison a page which
is shared by a couple of processes' mm's and some process on some cpu
starts touching it.

So keeping all cpus in a holding pattern is much safer, IMO. (#MC is
broadcast on Intel, I'm sure you know).

And even if it made sense, why go to the trouble? To shorten the time we're
in the MCE handler? Well, if we spend too much time in it, then the box
is dying anyway. On a normal, healthy hw, do_machine_check doesn't run.

:-)

> I suspect that, if the hardware is generating machine checks while
> doing memory poisoning, the hardware is broken enough that even
> panicking might not work, though :)

Yeah, in such cases, they tend to escalate to fatal errors very fast so
we panic right on the spot.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:48                               ` Borislav Petkov
  2014-05-21 22:52                                 ` Andy Lutomirski
@ 2014-05-21 23:05                                 ` Luck, Tony
  2014-05-21 23:07                                   ` Andy Lutomirski
  1 sibling, 1 reply; 68+ messages in thread
From: Luck, Tony @ 2014-05-21 23:05 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski
  Cc: Jiri Kosina, Thomas Gleixner, Linus Torvalds, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1070 bytes --]

On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?

Remember that machine checks are broadcast.  So some other cpu
can hit a recoverable machine check in user mode ... but that int#18
goes everywhere.  Other cpus are innocent bystanders ... they will
see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
in any of their machine check banks.

But if we are still finishing off processing the previous machine check,
this will be a nested one - and BOOM, we are dead.
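
In code, the bystander test looks roughly like this. It is a sketch
only: the bit positions are the architectural ones (RIPV is bit 0 and
EIPV is bit 1 of MCG_STATUS), but `looks_like_bystander()` and its
`any_bank_has_error` argument are hypothetical and stand in for a scan
of the banks.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_RIPV (1ULL << 0)	/* restart IP valid */
#define MCG_STATUS_EIPV (1ULL << 1)	/* error IP valid */

/* True when this CPU only saw the broadcast and has nothing to report. */
static bool looks_like_bystander(uint64_t mcg_status, bool any_bank_has_error)
{
	return (mcg_status & MCG_STATUS_RIPV) &&
	       !(mcg_status & MCG_STATUS_EIPV) &&
	       !any_bank_has_error;
}
```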

-Tony

[If you peer closely at the latest edition of the SDM - you'll see the
bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
MCG_STATUS .... but currently shipping silicon doesn't use that]

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:05                                 ` Luck, Tony
@ 2014-05-21 23:07                                   ` Andy Lutomirski
  2014-05-21 23:19                                     ` Luck, Tony
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 23:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 4:05 PM, Luck, Tony <tony.luck@intel.com> wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Remember that machine checks are broadcast.  So some other cpu
> can hit a recoverable machine check in user mode ... but that int#18
> goes everywhere.  Other cpus are innocent bystanders ... they will
> see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
> in any of their machine check banks.
>
> But if we are still finishing off processing the previous machine check,
> this will be a nested one - and BOOM, we are dead.

Oh.  Well, crap.

FWIW, this means that there really is a problem if one of these #MC
errors hits an innocent bystander who just happens to be handling an
NMI, at least if we delete the nested NMI code.  But I think my
simplified proposal gets this right.

>
> -Tony
>
> [If you peer closely at the latest edition of the SDM - you'll see the
> bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
> MCG_STATUS .... but currently shipping silicon doesn't use that]



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:07                                   ` Andy Lutomirski
@ 2014-05-21 23:19                                     ` Luck, Tony
  2014-05-21 23:30                                       ` Linus Torvalds
  0 siblings, 1 reply; 68+ messages in thread
From: Luck, Tony @ 2014-05-21 23:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Jiri Kosina, Thomas Gleixner, Linus Torvalds,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 707 bytes --]

> FWIW, this means that there really is a problem if one of these #MC
> errors hits an innocent bystander who just happens to be handling an
> NMI, at least if we delete the nested NMI code.  But I think my
> simplified proposal gets this right.

Yes. Bystander broadcast machine checks can and will hit processors
that are in NMI context ... and we must not make that fatal. Peek
harder at your proposal so you can state confidently that you get
this right.  "I think ... gets this right" is a bit too wishy-washy for
mission critical :-)

-Tony

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:19                                     ` Luck, Tony
@ 2014-05-21 23:30                                       ` Linus Torvalds
  2014-05-21 23:40                                         ` Luck, Tony
  2014-05-21 23:51                                         ` Borislav Petkov
  0 siblings, 2 replies; 68+ messages in thread
From: Linus Torvalds @ 2014-05-21 23:30 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Borislav Petkov, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Thu, May 22, 2014 at 8:19 AM, Luck, Tony <tony.luck@intel.com> wrote:
>
> Yes. Bystander broadcast machine checks can and will hit processors
> that are in NMI context ... and we must not make that fatal.

.. and this, btw, is just another example of why MCE hardware
designers are f*cking morons that should be given extensive education
about birth control and how not to procreate.

MCE is frankly misdesigned. It's a piece of shit, and any of the
hardware designers that claim that what they do is for system
stability are out to lunch. This is a prime example of what *NOT* to
do, and how you can actually spread what was potentially a localized
and recoverable error, and make it global and unrecoverable.

Can we please get these designers either fired, or re-educated?
Because this shit has been going on too long. I complained about this
to Tony many years ago, and nothing was ever fixed.

Synchronous MCE's are fine for synchronous errors, but then trying to
turn them "synchronous" for other CPU's (where they *weren't*
synchronous errors) is a major mistake. External errors punching
through irq context is wrong, punching through NMI is just
inexcusable.

If the OS then decides to take down the whole machine, the OS - not
the hardware - can choose to do something that will punch through
other CPU's NMI blocking (notably, init/reset), but the hardware doing
this on its own is just broken if true.

Anyway, I repeat: I refuse to fix hardware bugs. As far as we are
concerned, this is "best effort", and the hardware designers should
take a long deep look at their idiotic schemes. If something punches
through NMI, it's deadly. It's that simple.

                 Linus

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:30                                       ` Linus Torvalds
@ 2014-05-21 23:40                                         ` Luck, Tony
  2014-05-21 23:51                                         ` Borislav Petkov
  1 sibling, 0 replies; 68+ messages in thread
From: Luck, Tony @ 2014-05-21 23:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Borislav Petkov, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

> MCE is frankly misdesigned. It's a piece of shit, and any of the
> hardware designers that claim that what they do is for system
> stability are out to lunch. This is a prime example of what *NOT* to
> do, and how you can actually spread what was potentially a localized
> and recoverable error, and make it global and unrecoverable.

Latest SDM (version 050 from late February this year) describes how
this is going to be fixed. Recoverable machine checks are going to be
thread local. But current silicon still has the broadcast behavior ...
silicon development pipeline is very long :-(

-Tony

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:30                                       ` Linus Torvalds
  2014-05-21 23:40                                         ` Luck, Tony
@ 2014-05-21 23:51                                         ` Borislav Petkov
  2014-05-22  0:03                                           ` Linus Torvalds
  2014-05-22  0:05                                           ` Andy Lutomirski
  1 sibling, 2 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 23:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luck, Tony, Andy Lutomirski, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
> If the OS then decides to take down the whole machine, the OS - not
> the hardware - can choose to do something that will punch through
> other CPU's NMI blocking (notably, init/reset), but the hardware doing
> this on its own is just broken if true.

Not that it is any consolation but MCE is not broadcast on AMD.

Regardless, exceptions like MCE cannot be held pending and do pierce the
NMI handler on both.

Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
while running, we're simply going to panic as we're in kernel space
anyway.

The only problem is if the NMI handler gets interrupted while running
on a bystander CPU. And I think we could deal with this because the
bystander would not see an MCE and will return safely. We just need
to make sure that it returns back to the said NMI handler and not to
userspace. Unless I'm missing something ...

Oh yeah, fun :-\

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:51                                         ` Borislav Petkov
@ 2014-05-22  0:03                                           ` Linus Torvalds
  2014-05-22  8:50                                             ` Borislav Petkov
  2014-05-22  0:05                                           ` Andy Lutomirski
  1 sibling, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2014-05-22  0:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Andy Lutomirski, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Thu, May 22, 2014 at 8:51 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> Regardless, exceptions like MCE cannot be held pending and do pierce the
> NMI handler on both.

No, that's fine, if it's a thread-synchronous thing (ie a memory load
that causes errors). But for NMI handlers, that is irrelevant: if the
NMI code itself gets memory errors, the machine really is dead. Let's
face it, we're going to panic and reboot, there's no other real
alternative (other than the "just log it, pray, and continue in
unstable mode", which is actually a perfectly valid alternative in
many cases, since people don't necessarily care deeply and have
written their distributed algorithms to not rely on any particular
thread too much, and will verify the end results anyway).

The problem is literally the non-synchronous things (like another CPU
having problems) where things like broadcast will actually turn a
non-thread-synchronous thing into problems for other CPU's. Then, a
user-mode memory access error (that we *can* recover from, perhaps by
killing the process and isolating the page) can turn into an
unrecoverable error on another CPU because it got interrupted at a
point where it really couldn't afford to be interrupted.

It appears Intel is fixing their braindamage.

                      Linus

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-22  0:03                                           ` Linus Torvalds
@ 2014-05-22  8:50                                             ` Borislav Petkov
  0 siblings, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-22  8:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Luck, Tony, Andy Lutomirski, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Thu, May 22, 2014 at 09:03:34AM +0900, Linus Torvalds wrote:
> No, that's fine, if it's a thread-synchronous thing (ie a memory load
> that causes errors). But for NMI handlers, that is irrelevant: if
> the NMI code itself gets memory errors, the machine really is dead.
> Let's face it, we're going to panic and reboot, there's no other
> real alternative (other than the "just log it, pray, and continue
> in unstable mode", which is actually a perfectly valid alternative
> in many cases, since people don't necessarily care deeply and have
> written their distributed algorithms to not rely on any particular
> thread too much, and will verify the end results anyway).

Oh, definitely.

In fact, we'll panic on uncorrectable errors in any unmovable memory,
i.e. kernel code and data because we simply can't recover from it.
Anything that happens in the NMI handler most probably falls in that
category so...

I was simply pointing out the fact that Andy's algo needs to pay
attention to MCEs and other higher prio exceptions happening.

> The problem is literally the non-synchronous things (like another
> CPU having problems) where things like broadcast will actually turn
> a non-thread-synchronous thing into problems for other CPU's. Then,
> a user-mode memory access error (that we *can* recover from, perhaps
> by killing the process and isolating the page) can turn into an
> unrecoverable error on another CPU because it got interrupted at a
> point where it really couldn't afford to be interrupted.

That definitely sounds like a nasty thing, sure.

Although, there's at least one problem I've been thinking about wrt the
non-broadcast MCE: it is pretty hard to handle an uncorrectable memory
error in a page which is shared by multiple threads running on multiple
cores.

So normally one of the cores will detect it, raise an MCE and deal with
it but there's nothing stopping the other cores from touching that data.

One of the possible things which could happen is, if the other cores
consume that data, they will trigger an MCE too and will have to see
that the first core which detected the error is about to poison that
page so their job in the MCE handler is done and they have to exit.

I'm not saying this is undoable but it is a bit tricky and some
scenarios would need to be played out first to know better.

So, to a certain extent, broadcasting the MCE and keeping the cores in a
holding pattern, not touching any userspace stuff might've been one way
to deal with situations like that. It certainly makes things easier for
that particular scenario.

I'm not saying it was a good idea due to the point you're making - maybe
they should've talked to software people first. I'm basically trying to
explain to myself what the reasoning behind that broadcasting might be.
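
A minimal sketch of the hand-off described above, in which the first
core to detect the error claims the page and any core that later
consumes the same data finds the claim and leaves its handler. This is
a user-space illustration under stated assumptions, not kernel code:
`struct fake_page`, `fake_page_init()` and `fake_claim_poison()` are
hypothetical names, and a single C11 compare-and-swap stands in for
whatever synchronization a real handler would need.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-page state; this is not a kernel structure. */
struct fake_page {
	atomic_int poison_owner;	/* -1 while no core has claimed the page */
};

void fake_page_init(struct fake_page *page)
{
	atomic_init(&page->poison_owner, -1);
}

/*
 * Called from each core's (simulated) machine check handler.
 *
 * Returns true if this core won the race and is now responsible for
 * poisoning the page. Returns false if another core claimed the page
 * first, in which case this core has no poisoning work left to do.
 */
bool fake_claim_poison(struct fake_page *page, int cpu)
{
	int expected = -1;

	/* Succeeds for exactly one caller: the first to swap -1 for its id. */
	return atomic_compare_exchange_strong(&page->poison_owner,
					      &expected, cpu);
}
```

A core that gets `false` back knows the page is already being dealt
with. What it would still have to do before returning is the part that
would need to be played out, and it is not modeled here.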

> It appears Intel is fixing their braindamage.

Yep, we'd still need to deal with the existing systems but we don't have
a choice anyway.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:51                                         ` Borislav Petkov
  2014-05-22  0:03                                           ` Linus Torvalds
@ 2014-05-22  0:05                                           ` Andy Lutomirski
  1 sibling, 0 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-22  0:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Luck, Tony, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar

On Wed, May 21, 2014 at 4:51 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
>> If the OS then decides to take down the whole machine, the OS - not
>> the hardware - can choose to do something that will punch through
>> other CPU's NMI blocking (notably, init/reset), but the hardware doing
>> this on its own is just broken if true.
>
> Not that it is any consolation but MCE is not broadcast on AMD.
>
> Regardless, exceptions like MCE cannot be held pending and do pierce the
> NMI handler on both.
>
> Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
> while running, we're simply going to panic as we're in kernel space
> anyway.
>
> The only problem is if the NMI handler gets interrupted while running
> on a bystander CPU. And I think we could deal with this because the
> bystander would not see an MCE and will return safely. We just need
> to make sure that it returns back to the said NMI handler and not to
> userspace. Unless I'm missing something ...

Under my "always RET unless returning from IST to weird CS or to
specific known-invalid-stack regions" proposal this should work fine.
In the current code it'll also work fine *unless* it hits really early
in the NMI, in which case a second NMI can kill us.

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:25           ` Jiri Kosina
  2014-05-21 21:35             ` Andy Lutomirski
@ 2014-05-21 21:37             ` Linus Torvalds
  2014-05-21 21:43               ` Borislav Petkov
  1 sibling, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2014-05-21 21:37 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: Borislav Petkov, Andy Lutomirski, Thomas Gleixner, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

On Thu, May 22, 2014 at 6:25 AM, Jiri Kosina <jkosina@suse.cz> wrote:
>
> Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
> at the time the CPU was already holding sighand->siglock for that
> particular task, it'll deadlock against itself.

Don't worry too much about the MCE's. The hardware is f*cking broken,
and nobody sane ever thought that synchronous MCE's were a good idea.

Proof: look at Itanium.

The truly nonmaskable synchronous MCE's are a fatal error. It's that
simple. Anybody who thinks anything else is simply wrong, and has
probably talked to too many hardware engineers that don't actually
understand the bigger picture.

Sane hardware handles anything that *can* be handled in hardware, and
then reports (later) to software about the errors with a regular
non-critical MCE that doesn't punch through NMI or even regular
interrupt disabling.

So the true "MCE punches through even NMI protection" case is
relegated purely to the "hardware is broken and needs to be replaced"
situation, and our only worry as kernel people is to try to be as
graceful as possible about it - but that "as graceful as possible"
does *not* include bending over and worrying about random possible
deadlocks or other crazy situations. It's purely a "best effort" kind
of thing where we try to do whatever logging etc that is easy to do.

Seriously. If an NMI is interrupted by an MCE, you might as well
consider the machine dead. Don't worry about it. We may or may not
recover, but it is *not* our problem.

                Linus

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:37             ` Linus Torvalds
@ 2014-05-21 21:43               ` Borislav Petkov
  2014-05-21 21:45                 ` H. Peter Anvin
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 21:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jiri Kosina, Andy Lutomirski, Thomas Gleixner, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Ingo Molnar

On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
> Seriously. If an NMI is interrupted by an MCE, you might as well
> consider the machine dead. Don't worry about it. We may or may not
> recover, but it is *not* our problem.

I certainly like this way of handling it. We can even issue a nice
banner saying something like "You're f*cked - go change hw."

:-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:43               ` Borislav Petkov
@ 2014-05-21 21:45                 ` H. Peter Anvin
  2014-05-21 21:47                   ` Andy Lutomirski
  2014-05-21 21:50                   ` [RFC] x86_64: A real proposal for iret-less return to kernel Jiri Kosina
  0 siblings, 2 replies; 68+ messages in thread
From: H. Peter Anvin @ 2014-05-21 21:45 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds
  Cc: Jiri Kosina, Andy Lutomirski, Thomas Gleixner, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, Ingo Molnar, Luck, Tony

Adding Tony.

On 05/21/2014 02:43 PM, Borislav Petkov wrote:
> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
>> Seriously. If an NMI is interrupted by an MCE, you might as well
>> consider the machine dead. Don't worry about it. We may or may not
>> recover, but it is *not* our problem.
> 
> I certainly like this way of handling it. We can even issue a nice
> banner saying something like "You're f*cked - go change hw."
> 

Actually, it would be a lot better to panic than deadlock (HA systems
tend to have something in place to catch the panic and/or reboot).  Any
way we can see if the CPU is already holding that lock and panic in that
case?

	-hpa



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:45                 ` H. Peter Anvin
@ 2014-05-21 21:47                   ` Andy Lutomirski
  2014-05-21 21:54                     ` Borislav Petkov
  2014-05-21 21:50                   ` [RFC] x86_64: A real proposal for iret-less return to kernel Jiri Kosina
  1 sibling, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 21:47 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Linus Torvalds, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> Adding Tony.
>
> On 05/21/2014 02:43 PM, Borislav Petkov wrote:
>> On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
>>> Seriously. If an NMI is interrupted by an MCE, you might as well
>>> consider the machine dead. Don't worry about it. We may or may not
>>> recover, but it is *not* our problem.
>>
>> I certainly like this way of handling it. We can even issue a nice
>> banner saying something like "You're f*cked - go change hw."
>>
>
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot).  Any
> way we can see if the CPU is already holding that lock and panic in that
> case?
>

Is there anything actually wrong with just panicking if
!user_mode_vm(regs)?  That would make this a lot more sane.

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:47                   ` Andy Lutomirski
@ 2014-05-21 21:54                     ` Borislav Petkov
  2014-05-21 22:00                       ` H. Peter Anvin
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 21:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Linus Torvalds, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
> Is there anything actually wrong with just panicking if
> !user_mode_vm(regs)?  That would make this a lot more sane.

It does that already - mce_severity().
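
As a rough illustration of the policy being discussed (panic whenever
an uncorrected machine check interrupted kernel context, try to recover
only when it hit user mode), here is a self-contained sketch. It is not
how mce_severity() is implemented: `struct fake_regs`,
`fake_user_mode()` and `fake_grade()` are made-up names, and the
privilege check only looks at the low two bits of CS, a simplified
stand-in for user_mode_vm() that ignores details such as virtual-8086
mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for the saved register frame. */
struct fake_regs {
	uint16_t cs;	/* code segment selector at the time of the exception */
};

enum fake_severity {
	FAKE_SEV_RECOVER,	/* log, or limit damage to the current process */
	FAKE_SEV_PANIC,		/* no way out: take the machine down */
};

/* The low two bits of CS hold the privilege level; nonzero means not ring 0. */
bool fake_user_mode(const struct fake_regs *regs)
{
	return (regs->cs & 3) != 0;
}

/*
 * Assumed policy: corrected errors never panic; uncorrected errors
 * panic unless the exception was taken from user mode.
 */
enum fake_severity fake_grade(const struct fake_regs *regs, bool uncorrected)
{
	if (!uncorrected)
		return FAKE_SEV_RECOVER;

	return fake_user_mode(regs) ? FAKE_SEV_RECOVER : FAKE_SEV_PANIC;
}
```

With a frame whose CS is 0x10 (the usual x86_64 Linux kernel code
selector) an uncorrected error grades as FAKE_SEV_PANIC; with 0x33 (the
usual 64-bit user code selector) it grades as FAKE_SEV_RECOVER.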

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:54                     ` Borislav Petkov
@ 2014-05-21 22:00                       ` H. Peter Anvin
  2014-05-21 22:11                         ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: H. Peter Anvin @ 2014-05-21 22:00 UTC (permalink / raw)
  To: Borislav Petkov, Andy Lutomirski
  Cc: Linus Torvalds, Jiri Kosina, Thomas Gleixner, Steven Rostedt,
	Andi Kleen, linux-kernel@vger.kernel.org, Ingo Molnar, Luck, Tony

On 05/21/2014 02:54 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 02:47:03PM -0700, Andy Lutomirski wrote:
>> Is there anything actually wrong with just panicking if
>> !user_mode_vm(regs)?  That would make this a lot more sane.
> 
> It does that already - mce_severity().
> 

So this is not a problem then.

	-hpa


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:00                       ` H. Peter Anvin
@ 2014-05-21 22:11                         ` Borislav Petkov
  2014-05-21 22:13                           ` H. Peter Anvin
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 22:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Linus Torvalds, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
> So this is not a problem then.

Yeah, f'get it - it is all good on that front. :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:11                         ` Borislav Petkov
@ 2014-05-21 22:13                           ` H. Peter Anvin
  2014-05-21 22:21                             ` Borislav Petkov
  2014-05-26 10:18                             ` [PATCH] x86, MCE: Flesh out when to panic comment Borislav Petkov
  0 siblings, 2 replies; 68+ messages in thread
From: H. Peter Anvin @ 2014-05-21 22:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Linus Torvalds, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On 05/21/2014 03:11 PM, Borislav Petkov wrote:
> On Wed, May 21, 2014 at 03:00:18PM -0700, H. Peter Anvin wrote:
>> So this is not a problem then.
> 
> Yeah, f'get it - it is all good on that front. :-)
> 

Seems like a comment would be in order, though.

	-hpa


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:13                           ` H. Peter Anvin
@ 2014-05-21 22:21                             ` Borislav Petkov
  2014-05-26 10:18                             ` [PATCH] x86, MCE: Flesh out when to panic comment Borislav Petkov
  1 sibling, 0 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-21 22:21 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Linus Torvalds, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> Seems like a comment would be in order, though.

Sure, I'll do a nice one once this discussion quiets down. :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-21 22:13                           ` H. Peter Anvin
  2014-05-21 22:21                             ` Borislav Petkov
@ 2014-05-26 10:18                             ` Borislav Petkov
  2014-05-26 10:51                               ` Jiri Kosina
  1 sibling, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-26 10:18 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Linus Torvalds, Jiri Kosina, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> Seems like a comment would be in order, though.

---
From: Borislav Petkov <bp@suse.de>
Subject: [PATCH] x86, MCE: Flesh out when to panic comment

Recent discussion (link below) showed that it is not really clear what
appropriate recovery actions we're taking when in a machine check
exception. Flesh out the comment which was explaining that with more
detail.

Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/CALCETrXudJ8BkNF_M-r4O40XLN%2BPnZ5TOZw0P7N4kqo3qngzyg@mail.gmail.com
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68317c80de7f..9f070339b09f 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1151,10 +1151,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
 		no_way_out = worst >= MCE_PANIC_SEVERITY;
 
 	/*
-	 * At insane "tolerant" levels we take no action. Otherwise
-	 * we only die if we have no other choice. For less serious
-	 * issues we try to recover, or limit damage to the current
-	 * process.
+	 * At insane "tolerant" levels we take no action. Otherwise we only die
+	 * if we have no other choice. Which means, we're definitely going to
+	 * panic on unrecoverable, uncontainable errors which would otherwise
+	 * influence machine state and/or cause any type of corruption. The
+	 * decision what to do is made by mce_severity().
+	 *
+	 * For less serious issues we try to recover, or limit damage to the
+	 * current process.
 	 */
 	if (cfg->tolerant < 3) {
 		if (no_way_out)
-- 
1.9.0

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-26 10:18                             ` [PATCH] x86, MCE: Flesh out when to panic comment Borislav Petkov
@ 2014-05-26 10:51                               ` Jiri Kosina
  2014-05-26 11:06                                 ` Borislav Petkov
  0 siblings, 1 reply; 68+ messages in thread
From: Jiri Kosina @ 2014-05-26 10:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Andy Lutomirski, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Mon, 26 May 2014, Borislav Petkov wrote:

> On Wed, May 21, 2014 at 03:13:54PM -0700, H. Peter Anvin wrote:
> > Seems like a comment would be in order, though.
> 
> ---
> From: Borislav Petkov <bp@suse.de>
> Subject: [PATCH] x86, MCE: Flesh out when to panic comment
> 
> Recent discussion (link below) showed that it is not really clear what
> appropriate recovery actions we're taking when in a machine check
> exception. Flesh out the comment which was explaining that with more
> detail.
> 
> Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Link: http://lkml.kernel.org/r/CALCETrXudJ8BkNF_M-r4O40XLN%2BPnZ5TOZw0P7N4kqo3qngzyg@mail.gmail.com
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 68317c80de7f..9f070339b09f 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1151,10 +1151,14 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		no_way_out = worst >= MCE_PANIC_SEVERITY;
>  
>  	/*
> -	 * At insane "tolerant" levels we take no action. Otherwise
> -	 * we only die if we have no other choice. For less serious
> -	 * issues we try to recover, or limit damage to the current
> -	 * process.
> +	 * At insane "tolerant" levels we take no action. Otherwise we only die
> +	 * if we have no other choice. Which means, we're definitely going to
> +	 * panic on unrecoverable, uncontainable errors which would otherwise
> +	 * influence machine state and/or cause any type of corruption. The
> +	 * decision what to do is made by mce_severity().
> +	 *
> +	 * For less serious issues we try to recover, or limit damage to the
> +	 * current process.
>  	 */

I think the comment is still not explaining the big part of what the 
discussion was about -- i.e. if it was in kernel context, we always panic.

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-26 10:51                               ` Jiri Kosina
@ 2014-05-26 11:06                                 ` Borislav Petkov
  2014-05-26 16:47                                   ` Andy Lutomirski
  2014-05-27 21:53                                   ` Luck, Tony
  0 siblings, 2 replies; 68+ messages in thread
From: Borislav Petkov @ 2014-05-26 11:06 UTC (permalink / raw)
  To: Jiri Kosina
  Cc: H. Peter Anvin, Andy Lutomirski, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Mon, May 26, 2014 at 12:51:10PM +0200, Jiri Kosina wrote:
> I think the comment is still not explaining the big part of what the
> discussion was about -- i.e. if it was in kernel context, we always
> panic.

I thought the pointer to mce_severity was enough? People should open an
editor and look at the function and at its gory insanity. :-P

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-26 11:06                                 ` Borislav Petkov
@ 2014-05-26 16:47                                   ` Andy Lutomirski
  2014-05-26 17:51                                     ` Borislav Petkov
  2014-05-27 21:53                                   ` Luck, Tony
  1 sibling, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-26 16:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jiri Kosina, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Mon, May 26, 2014 at 4:06 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, May 26, 2014 at 12:51:10PM +0200, Jiri Kosina wrote:
>> I think the comment is still not explaining the big part of what the
>> discussion was about -- i.e. if it was in kernel context, we always
>> panic.
>
> I thought the pointer to mce_severity was enough? People should open an
> editor and look at the function and at its gory insanity. :-P

It may be worth at least pointing out that mce_severity looks at
whether we faulted from kernel context.  I missed that the first time
around because mce_severity doesn't take a pt_regs pointer.

--Andy

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-26 16:47                                   ` Andy Lutomirski
@ 2014-05-26 17:51                                     ` Borislav Petkov
  2014-05-26 17:59                                       ` Andy Lutomirski
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-26 17:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jiri Kosina, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Mon, May 26, 2014 at 09:47:38AM -0700, Andy Lutomirski wrote:
> It may be worth at least pointing out that mce_severity looks at
> whether we faulted from kernel context. I missed that the first time
> around because mce_severity doesn't take a pt_regs pointer.

Right, but next time we talk about a different aspect which isn't
commented on in the handler, we'd have to add to it again, until we've
rewritten the whole function in pseudo code.

I think simply pointing to the function which decides the fate of the
machine based on the MCE severity is enough - people can then go and
stare at it, albeit with some struggle.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-26 17:51                                     ` Borislav Petkov
@ 2014-05-26 17:59                                       ` Andy Lutomirski
  0 siblings, 0 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-26 17:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jiri Kosina, H. Peter Anvin, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Mon, May 26, 2014 at 10:51 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, May 26, 2014 at 09:47:38AM -0700, Andy Lutomirski wrote:
>> It may be worth at least pointing out that mce_severity looks at
>> whether we faulted from kernel context. I missed that the first time
>> around because mce_severity doesn't take a pt_regs pointer.
>
> Right, but next time we talk about a different aspect which isn't
> commented on in the handler, we'd have to add to it again, until we've
> rewritten the whole function in pseudo code.
>
> I think simply pointing to the function which decides the fate of the
> machine based on the MCE severity is enough - people can then go and
> stare at it, albeit with some struggle.

Fair enough.

>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 68+ messages in thread

* RE: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-26 11:06                                 ` Borislav Petkov
  2014-05-26 16:47                                   ` Andy Lutomirski
@ 2014-05-27 21:53                                   ` Luck, Tony
  2014-05-27 22:24                                     ` Borislav Petkov
  1 sibling, 1 reply; 68+ messages in thread
From: Luck, Tony @ 2014-05-27 21:53 UTC (permalink / raw)
  To: Borislav Petkov, Jiri Kosina
  Cc: H. Peter Anvin, Andy Lutomirski, Linus Torvalds, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar

>> I think the comment is still not explaining the big part of what the
>> discussion was about -- i.e. if it was in kernel context, we always
>> panic.
>
> I thought the pointer to mce_severity was enough? People should open an
> editor and look at the function and at its gory insanity. :-P

It is far from obvious that mce_severity() will always say that an error
detected inside the kernel will be fatal.

-Tony


* Re: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-27 21:53                                   ` Luck, Tony
@ 2014-05-27 22:24                                     ` Borislav Petkov
  2014-05-27 22:33                                       ` Luck, Tony
  0 siblings, 1 reply; 68+ messages in thread
From: Borislav Petkov @ 2014-05-27 22:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Jiri Kosina, H. Peter Anvin, Andy Lutomirski, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, Andi Kleen,
	linux-kernel@vger.kernel.org, Ingo Molnar

On Tue, May 27, 2014 at 09:53:56PM +0000, Luck, Tony wrote:
> It is far from obvious that mce_severity() will always say that an
> error detected inside the kernel will be fatal.

Oh yeah, it needs a good cleansing rewrite, that's for sure.

And this tolerant check looks fishy to me:

                if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
                        if (panic_on_oops || tolerant < 1)
                                return MCE_PANIC_SEVERITY;
                }

since we set it to 1 by default. But I'll look again on a clear head
tomorrow - it is too late here.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* RE: [PATCH] x86, MCE: Flesh out when to panic comment
  2014-05-27 22:24                                     ` Borislav Petkov
@ 2014-05-27 22:33                                       ` Luck, Tony
  0 siblings, 0 replies; 68+ messages in thread
From: Luck, Tony @ 2014-05-27 22:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Jiri Kosina, H. Peter Anvin, Andy Lutomirski, Linus Torvalds,
	Thomas Gleixner, Steven Rostedt, Andi Kleen,
	linux-kernel@vger.kernel.org, Ingo Molnar

> And this tolerant check looks fishy to me:
>
>                if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
>                        if (panic_on_oops || tolerant < 1)
>                                return MCE_PANIC_SEVERITY;
>                }
>
> since we set it to 1 by default. But I'll look again on a clear head
> tomorrow - it is too late here.

tolerant level 0 exists - but is somewhat crazy in the opposite direction
from the large values.  Look at the comment in mce.c ... level 0
means always panic if you see a UC error
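
To make the default case concrete, the check quoted above can be modelled
in a few lines of C. This is a sketch only: the enum values,
`sketch_escalate` and its parameter list are hypothetical stand-ins chosen
for the example, and the real mce_severity() does considerably more than
this one test.

```c
#include <stdbool.h>

/* Hypothetical, reduced severity scale; only the ordering matters here. */
enum sketch_sev {
	SKETCH_NO_SEVERITY,
	SKETCH_UC_SEVERITY,
	SKETCH_PANIC_SEVERITY,
};

/* Hypothetical context flag: where the CPU was when the error hit. */
enum sketch_ctx {
	SKETCH_IN_KERNEL,
	SKETCH_IN_USER,
};

/*
 * Model of the quoted check: an uncorrected error taken in kernel
 * context is escalated to panic only if panic_on_oops is set or
 * tolerant is below 1. Otherwise the severity is returned unchanged.
 */
static enum sketch_sev sketch_escalate(enum sketch_sev sev,
				       enum sketch_ctx ctx,
				       bool panic_on_oops, int tolerant)
{
	if (sev >= SKETCH_UC_SEVERITY && ctx == SKETCH_IN_KERNEL) {
		if (panic_on_oops || tolerant < 1)
			return SKETCH_PANIC_SEVERITY;
	}
	return sev;
}
```

In this model, with tolerant at its default of 1 and panic_on_oops clear,
an in-kernel UC severity passes through this check unchanged; only
tolerant level 0 or panic_on_oops escalates it.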

-Tony


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 21:45                 ` H. Peter Anvin
  2014-05-21 21:47                   ` Andy Lutomirski
@ 2014-05-21 21:50                   ` Jiri Kosina
  1 sibling, 0 replies; 68+ messages in thread
From: Jiri Kosina @ 2014-05-21 21:50 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Borislav Petkov, Linus Torvalds, Andy Lutomirski, Thomas Gleixner,
	Steven Rostedt, Andi Kleen, linux-kernel@vger.kernel.org,
	Ingo Molnar, Luck, Tony

On Wed, 21 May 2014, H. Peter Anvin wrote:

> > I certainly like this way of handling it. We can even issue a nice
> > banner saying something like "You're f*cked - go change hw."
> 
> Actually, it would be a lot better to panic than deadlock (HA systems
> tend to have something in place to catch the panic and/or reboot).  Any
> way we can see if the CPU is already holding that lock and panic in that
> case?

Well, spin_trylock() and then either spin_unlock() and proceed sending 
the signal, otherwise panic().
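
As a rough illustration of that suggestion, here is a minimal userspace
sketch in C. It is not kernel code: `sketch_lock`, `sketch_trylock`,
`sketch_unlock` and `sketch_decide` are hypothetical names, and a C11
atomic_flag stands in for the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel spinlock; not the real spinlock_t. */
struct sketch_lock {
	atomic_flag held;
};

enum sketch_action {
	SKETCH_SEND_SIGNAL,
	SKETCH_PANIC,
};

/* Try once to take the lock; never spins. Returns true on success. */
static bool sketch_trylock(struct sketch_lock *l)
{
	/* test_and_set returns the previous value: false means we got it. */
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void sketch_unlock(struct sketch_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * If the lock can be taken, drop it again and proceed with sending
 * the signal; if it is already held, report that we should panic
 * instead of risking a deadlock.
 */
static enum sketch_action sketch_decide(struct sketch_lock *l)
{
	if (!sketch_trylock(l))
		return SKETCH_PANIC;
	sketch_unlock(l);
	return SKETCH_SEND_SIGNAL;
}
```

Note that a plain trylock cannot tell whether this CPU or another one
holds the lock, so the sketch chooses panic whenever the lock is held at
all.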

-- 
Jiri Kosina
SUSE Labs


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  0:53 [RFC] x86_64: A real proposal for iret-less return to kernel Andy Lutomirski
  2014-05-21  2:27 ` Steven Rostedt
@ 2014-05-21 18:11 ` Andy Lutomirski
  2014-05-21 22:36   ` H. Peter Anvin
  2014-05-21 22:25 ` Andi Kleen
  2 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 18:11 UTC (permalink / raw)
  To: Steven Rostedt, linux-kernel@vger.kernel.org
  Cc: H. Peter Anvin, Linus Torvalds, Ingo Molnar, Thomas Gleixner,
	Borislav Petkov, Andi Kleen

On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Here's a real proposal for iret-less return.  If this is correct, then
> NMIs will never nest, which will probably delete a lot more scariness
> than is added by the code I'm describing.

OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
16-bit stack.  The return from NMI goes through the espfix code.
Something interrupts while on the espfix stack.  Boom!  Neither return
style is particularly good.

More generally, if we got interrupted while on the espfix stack, we
need to return back there using IRET.  Fortunately, re-enabling NMIs
there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have
some RIP ranges on which the kernel stack might be invalid (which
includes the espfix code and some of the syscall code) and use IRET
only on return from NMI, return to nonstandard CS, and return to these
special ranges.  The NMI code just needs to never do any of this stuff
unless it switches off the NMI stack first.

For this to work reliably, we'll probably have to change CS before
calling into EFI code.  That should be straightforward.
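
The turned-around logic above can be sketched roughly as follows. This is
a minimal illustration, not kernel code: `sketch_frame`, `sketch_range`,
`sketch_pick_return` and the `SKETCH_KERNEL_CS` value are hypothetical
names chosen for the example, and the real decision would live in the
assembly return path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_ret {
	SKETCH_USE_RET,
	SKETCH_USE_IRET,
};

/* Half-open RIP range [start, end) on which the stack may be invalid. */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical summary of the interrupted context. */
struct sketch_frame {
	uint64_t rip;
	uint16_t cs;
	bool from_nmi;	/* true if we are returning from an NMI */
};

/* Placeholder selector for the standard kernel CS; assumed value. */
#define SKETCH_KERNEL_CS 0x10

/*
 * Use IRET only on return from NMI, on return to a nonstandard CS,
 * or on return into one of the special RIP ranges. Everything else
 * takes the RET path.
 */
static enum sketch_ret sketch_pick_return(const struct sketch_frame *f,
					  const struct sketch_range *unsafe,
					  size_t n)
{
	size_t i;

	if (f->from_nmi)
		return SKETCH_USE_IRET;
	if (f->cs != SKETCH_KERNEL_CS)
		return SKETCH_USE_IRET;
	for (i = 0; i < n; i++) {
		if (f->rip >= unsafe[i].start && f->rip < unsafe[i].end)
			return SKETCH_USE_IRET;
	}
	return SKETCH_USE_RET;
}
```

The table of ranges would have to cover the espfix code and the relevant
parts of the syscall code; how those ranges get generated is left open
here.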

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 18:11 ` Andy Lutomirski
@ 2014-05-21 22:36   ` H. Peter Anvin
  2014-05-21 22:41     ` Andy Lutomirski
  0 siblings, 1 reply; 68+ messages in thread
From: H. Peter Anvin @ 2014-05-21 22:36 UTC (permalink / raw)
  To: Andy Lutomirski, Steven Rostedt, linux-kernel@vger.kernel.org
  Cc: Linus Torvalds, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Andi Kleen

On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Here's a real proposal for iret-less return.  If this is correct, then
>> NMIs will never nest, which will probably delete a lot more scariness
>> than is added by the code I'm describing.
> 
> OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
> 16-bit stack.  The return from NMI goes through the espfix code.
> Something interrupts while on the espfix stack.  Boom!  Neither return
> style is particularly good.
> 
> More generally, if we got interrupted while on the espfix stack, we
> need to return back there using IRET.  Fortunately, re-enabling NMIs
> there is harmless, since we've already switched off the NMI stack.
> 
> This makes me think that maybe the logic should be turned around: have
> some RIP ranges on which the kernel stack might be invalid (which
> includes the espfix code and some of the syscall code) and use IRET
> only on return from NMI, return to nonstandard CS, and return to these
> special ranges.  The NMI code just needs to never do any of this stuff
> unless it switches off the NMI stack first.
> 
> For this to work reliably, we'll probably have to change CS before
> calling into EFI code.  That should be straightforward.
> 

I think you are onto something here.

In particular, the key observation here is that inside the kernel, we
can never *both* have an invalid stack *and* be inside an NMI, #MC or
#DB handler, even if nested.

Now, does this prevent us from using RET in the common case?  I'm not
sure it is a huge loss since kernel-to-kernel is relatively rare.

	-hpa



* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:36   ` H. Peter Anvin
@ 2014-05-21 22:41     ` Andy Lutomirski
  2014-05-21 23:03       ` H. Peter Anvin
  0 siblings, 1 reply; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org, Linus Torvalds,
	Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andi Kleen

On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
>> On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>> Here's a real proposal for iret-less return.  If this is correct, then
>>> NMIs will never nest, which will probably delete a lot more scariness
>>> than is added by the code I'm describing.
>>
>> OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
>> 16-bit stack.  The return from NMI goes through the espfix code.
>> Something interrupts while on the espfix stack.  Boom!  Neither return
>> style is particularly good.
>>
>> More generally, if we got interrupted while on the espfix stack, we
>> need to return back there using IRET.  Fortunately, re-enabling NMIs
>> there is harmless, since we've already switched off the NMI stack.
>>
>> This makes me think that maybe the logic should be turned around: have
>> some RIP ranges on which the kernel stack might be invalid (which
>> includes the espfix code and some of the syscall code) and use IRET
>> only on return from NMI, return to nonstandard CS, and return to these
>> special ranges.  The NMI code just needs to never do any of this stuff
>> unless it switches off the NMI stack first.
>>
>> For this to work reliably, we'll probably have to change CS before
>> calling into EFI code.  That should be straightforward.
>>
>
> I think you are onto something here.
>
> In particular, the key observation here is that inside the kernel, we
> can never *both* have an invalid stack *and* be inside an NMI, #MC or
> #DB handler, even if nested.

Except for espfix :)

>
> Now, does this prevent us from using RET in the common case?  I'm not
> sure it is a huge loss since kernel-to-kernel is relatively rare.

I don't think so.  The most common case should be plain old interrupts
and I suspect that #PF is a distant second.

In any event, plain old interrupts and #PF are non-IST interrupts and
they should be unconditionally safe for RET.

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:41     ` Andy Lutomirski
@ 2014-05-21 23:03       ` H. Peter Anvin
  0 siblings, 0 replies; 68+ messages in thread
From: H. Peter Anvin @ 2014-05-21 23:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org, Linus Torvalds,
	Ingo Molnar, Thomas Gleixner, Borislav Petkov, Andi Kleen

On 05/21/2014 03:41 PM, Andy Lutomirski wrote:
>>
>> I think you are onto something here.
>>
>> In particular, the key observation here is that inside the kernel, we
>> can never *both* have an invalid stack *and* be inside an NMI, #MC or
>> #DB handler, even if nested.
> 
> Except for espfix :)

Argh.  Yes, I got that wrong... it isn't really about being inside NMI,
#MC or #DB, but rather being on those respective stacks.  If you are on
the espfix stack you are on your way back to userspace OR (and this gets
really, really ugly) you took an NMI/MC/DB after a SYSCALL executed in
16-bit mode, but even then you are in the kernel entry/exit code and
re-enabling NMI is fine.

>> Now, does this prevent us from using RET in the common case?  I'm not
>> sure it is a huge loss since kernel-to-kernel is relatively rare.
> 
> I don't think so.  The most common case should be plain old interrupts
> and I suspect that #PF is a distant second.
> 
> In any event, plain old interrupts and #PF are non-IST interrupts and
> they should be unconditionally safe for RET

	-hpa




* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21  0:53 [RFC] x86_64: A real proposal for iret-less return to kernel Andy Lutomirski
  2014-05-21  2:27 ` Steven Rostedt
  2014-05-21 18:11 ` Andy Lutomirski
@ 2014-05-21 22:25 ` Andi Kleen
  2014-05-21 22:32   ` Andy Lutomirski
  2014-05-21 22:33   ` Linus Torvalds
  2 siblings, 2 replies; 68+ messages in thread
From: Andi Kleen @ 2014-05-21 22:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Borislav Petkov,
	Andi Kleen

Seems like a lot of effort and risk to essentially only optimize in kernel
interrupt handlers.

AFAIK the most interesting cases (like user page faults) are not
affected at all. Usually most workloads don't spend all that much time
in the kernel, so it won't help most interrupts.

I suspect the only case that's really interesting here is interrupting
idle. Maybe it would be possible to do some fast path in this case only.

However idle currently has so much overhead that I suspect that there 
are lower hanging fruit elsewhere.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:25 ` Andi Kleen
@ 2014-05-21 22:32   ` Andy Lutomirski
  2014-05-21 22:33   ` Linus Torvalds
  1 sibling, 0 replies; 68+ messages in thread
From: Andy Lutomirski @ 2014-05-21 22:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Steven Rostedt, linux-kernel@vger.kernel.org, H. Peter Anvin,
	Linus Torvalds, Ingo Molnar, Thomas Gleixner, Borislav Petkov

On Wed, May 21, 2014 at 3:25 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Seems like a lot of effort and risk to essentially only optimize in kernel
> interrupt handlers.

The idea is that it might allow us to remove a bunch of scary nested
NMI code as well as speeding things up.

>
> AFAIK the most interesting cases (like user page faults) are not
> affected at all. Usually most workloads don't spend all that much time
> in the kernel, so it won't help most interrupts.
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.
>
> However idle currently has so much overhead that I suspect that there
> are lower hanging fruit elsewhere.

I will gladly buy a meal or beverage for whomever fixes the ttwu stuff
to stop sending IPIs to idle CPUs, which will help a lot.

--Andy


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:25 ` Andi Kleen
  2014-05-21 22:32   ` Andy Lutomirski
@ 2014-05-21 22:33   ` Linus Torvalds
  2014-05-21 23:23     ` Andi Kleen
  1 sibling, 1 reply; 68+ messages in thread
From: Linus Torvalds @ 2014-05-21 22:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Steven Rostedt, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Borislav Petkov

On Thu, May 22, 2014 at 7:25 AM, Andi Kleen <andi@firstfloor.org> wrote:
>
> I suspect the only case that's really interesting here is interrupting
> idle. Maybe it would be possible to do some fast path in this case only.

Hardware-interrupts during kernel are actually fairly common under
network-intensive loads, even outside of idle (but idle is admittedly
likely *the* most common one). Many network loads are fairly
kernel-intensive.

Also, from a kernel perspective, idle isn't really any different from
most other kernel code. Using "ret" to return to the idle handler
would be *more* of a special case than using "ret" to return to just
generic kernel context.

So I disagree vehemently. Do *not* special-case idle. It makes the
code more complex and less generic.

                Linus


* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 22:33   ` Linus Torvalds
@ 2014-05-21 23:23     ` Andi Kleen
  2014-05-21 23:34       ` Linus Torvalds
  0 siblings, 1 reply; 68+ messages in thread
From: Andi Kleen @ 2014-05-21 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Andy Lutomirski, Steven Rostedt,
	linux-kernel@vger.kernel.org, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, Borislav Petkov

> Hardware-interrupts during kernel are actually fairly common under
> network-intensive loads, even outside of idle (but idle is admittedly
> likely *the* most common one). Many network loads are fairly
> kernel-intensive.

For network workloads we can arbitrarily coalesce interrupts or just use NAPI
to lower the costs.  No need to optimize network interrupts too much.

-Andi



* Re: [RFC] x86_64: A real proposal for iret-less return to kernel
  2014-05-21 23:23     ` Andi Kleen
@ 2014-05-21 23:34       ` Linus Torvalds
  0 siblings, 0 replies; 68+ messages in thread
From: Linus Torvalds @ 2014-05-21 23:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Steven Rostedt, linux-kernel@vger.kernel.org,
	H. Peter Anvin, Ingo Molnar, Thomas Gleixner, Borislav Petkov

On Thu, May 22, 2014 at 8:23 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> Hardware-interrupts during kernel are actually fairly common under
>> network-intensive loads, even outside of idle (but idle is admittedly
>> likely *the* most common one). Many network loads are fairly
>> kernel-intensive.
>
> For network workloads we can arbitrarily coalesce interrupts or just use NAPI
> to lower the costs.  No need to optimize network interrupts too much.

BS. Lots of network loads are latency-critical, to the point that
people sometimes actually turn off coalescing. But even with
coalescing, it doesn't do crap for ping-pong kinds of loads that are
not "interrupt storm from tons and tons of separate packets", but
"lots of individual packets that are data-dependent", so you don't
have new ones coming in while processing old ones.

Ask Andy L. He had numbers. Interrupt overhead was quite big for him.

And you ignored the real issue: special-casing idle is *stupid*. It's
more complicated, and gives fewer cases where it helps. It's simply
fundamentally stupid and wrong.

         Linus


end of thread, other threads:[~2014-05-27 22:33 UTC | newest]

Thread overview: 68+ messages
2014-05-21  0:53 [RFC] x86_64: A real proposal for iret-less return to kernel Andy Lutomirski
2014-05-21  2:27 ` Steven Rostedt
2014-05-21  2:33   ` H. Peter Anvin
2014-05-21  2:39   ` Andy Lutomirski
2014-05-21  9:46     ` Borislav Petkov
2014-05-21 15:21       ` Andy Lutomirski
2014-05-21 16:30         ` Borislav Petkov
2014-05-21 17:52           ` Andy Lutomirski
2014-05-21 18:07             ` Borislav Petkov
2014-05-21 12:51     ` Jiri Kosina
2014-05-21 15:21       ` Andy Lutomirski
2014-05-21 16:33         ` Borislav Petkov
2014-05-21 21:25           ` Jiri Kosina
2014-05-21 21:35             ` Andy Lutomirski
2014-05-21 21:48               ` Borislav Petkov
2014-05-21 21:52                 ` Andy Lutomirski
2014-05-21 21:55                   ` Borislav Petkov
2014-05-21 21:59                     ` Jiri Kosina
2014-05-21 21:59                     ` Andy Lutomirski
2014-05-21 22:01                   ` Luck, Tony
2014-05-21 22:13                     ` Andy Lutomirski
2014-05-21 22:17                       ` Borislav Petkov
2014-05-21 22:20                         ` Andy Lutomirski
2014-05-21 22:36                           ` Borislav Petkov
2014-05-21 22:18                       ` Luck, Tony
2014-05-21 22:24                         ` Andy Lutomirski
2014-05-21 22:32                           ` Luck, Tony
2014-05-21 22:39                             ` Andy Lutomirski
2014-05-21 22:48                               ` Borislav Petkov
2014-05-21 22:52                                 ` Andy Lutomirski
2014-05-21 23:02                                   ` Borislav Petkov
2014-05-21 23:05                                 ` Luck, Tony
2014-05-21 23:07                                   ` Andy Lutomirski
2014-05-21 23:19                                     ` Luck, Tony
2014-05-21 23:30                                       ` Linus Torvalds
2014-05-21 23:40                                         ` Luck, Tony
2014-05-21 23:51                                         ` Borislav Petkov
2014-05-22  0:03                                           ` Linus Torvalds
2014-05-22  8:50                                             ` Borislav Petkov
2014-05-22  0:05                                           ` Andy Lutomirski
2014-05-21 21:37             ` Linus Torvalds
2014-05-21 21:43               ` Borislav Petkov
2014-05-21 21:45                 ` H. Peter Anvin
2014-05-21 21:47                   ` Andy Lutomirski
2014-05-21 21:54                     ` Borislav Petkov
2014-05-21 22:00                       ` H. Peter Anvin
2014-05-21 22:11                         ` Borislav Petkov
2014-05-21 22:13                           ` H. Peter Anvin
2014-05-21 22:21                             ` Borislav Petkov
2014-05-26 10:18                             ` [PATCH] x86, MCE: Flesh out when to panic comment Borislav Petkov
2014-05-26 10:51                               ` Jiri Kosina
2014-05-26 11:06                                 ` Borislav Petkov
2014-05-26 16:47                                   ` Andy Lutomirski
2014-05-26 17:51                                     ` Borislav Petkov
2014-05-26 17:59                                       ` Andy Lutomirski
2014-05-27 21:53                                   ` Luck, Tony
2014-05-27 22:24                                     ` Borislav Petkov
2014-05-27 22:33                                       ` Luck, Tony
2014-05-21 21:50                   ` [RFC] x86_64: A real proposal for iret-less return to kernel Jiri Kosina
2014-05-21 18:11 ` Andy Lutomirski
2014-05-21 22:36   ` H. Peter Anvin
2014-05-21 22:41     ` Andy Lutomirski
2014-05-21 23:03       ` H. Peter Anvin
2014-05-21 22:25 ` Andi Kleen
2014-05-21 22:32   ` Andy Lutomirski
2014-05-21 22:33   ` Linus Torvalds
2014-05-21 23:23     ` Andi Kleen
2014-05-21 23:34       ` Linus Torvalds
