From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>,
Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>,
Steven Rostedt <rostedt@rostedt.homelinux.com>,
Frederic Weisbecker <fweisbec@gmail.com>,
Thomas Gleixner <tglx@linutronix.de>,
Christoph Hellwig <hch@lst.de>, Li Zefan <lizf@cn.fujitsu.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Johannes Berg <johannes.berg@intel.com>,
Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>,
Arnaldo Carvalho de Melo <acme@infradead.org>,
Tom Zanussi <tzanussi@gmail.com>,
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Andi Kleen <andi@firstfloor.org>,
"H. Peter Anvin" <hpa@zytor.com>,
Jeremy Fitzhardinge <jeremy@goop.org>,
"Frank Ch. Eigler" <fche@redhat.com>,
Tejun Heo <htejun@gmail.com>
Subject: Re: [patch 1/2] x86_64 page fault NMI-safe
Date: Wed, 14 Jul 2010 18:21:15 -0400 [thread overview]
Message-ID: <20100714222115.GA30122@Krystal> (raw)
In-Reply-To: <AANLkTinM0eFnreNB5l_6f20oSGBchqAtYPb57tM01kUB@mail.gmail.com>
* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> On Wed, Jul 14, 2010 at 1:39 PM, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
> >
> >> - load percpu NMI stack frame pointer
> >> - if non-zero we know we're nested, and should ignore this NMI:
> >> - we're returning to kernel mode, so return immediately by using
> >> "popf/ret", which also keeps NMI's disabled in the hardware until the
> >> "real" NMI iret happens.
> >
> > Maybe incrementing a per-cpu missed NMIs count could be appropriate here so we
> > know how many NMIs should be replayed at iret ?
>
> No. As mentioned, there is no such counter in real hardware either.
>
> Look at what happens for the not-nested case:
>
> - NMI1 triggers. The CPU takes a fault, and runs the NMI handler with
> NMI's disabled
>
> - NMI2 triggers. Nothing happens, the NMI's are disabled.
>
> - NMI3 triggers. Again, nothing happens, the NMI's are still disabled
>
> - the NMI handler returns.
>
> - What happens now?
>
> How many NMI interrupts do you get? ONE. Exactly like my "emulate it
> in software" approach. The hardware doesn't have any counters for
> pending NMI's either. Why should the software emulation have them?
So I figure that given Maciej's response, we can get at most 2 nested nmis, no
more, no less. So I was probably going too far with the counter, but we need 2.
However, failure to deliver the second NMI in this case would not match the
hardware expectations (see below).
>
> >> - before the popf/iret, use the NMI stack pointer to make the NMI
> >> return stack be invalid and cause a fault
> >
> > I assume you mean "popf/ret" here.
>
> Yes, that was as typo. The whole point of using popf was obviously to
> _avoid_ the iret ;)
>
> > So assuming we use a frame copy, we should
> > change the nmi stack pointer in the nesting 0 nmi stack copy, so the nesting 0
> > NMI iret will trigger the fault
> >
> >> - set the NMI stack pointer to the current stack pointer
> >
> > That would mean bringing back the NMI stack pointer to the (nesting - 1) nmi
> > stack copy.
>
> I think you're confused. Or I am by your question.
>
> The NMI code would literally just do:
>
> - check if the NMI was nested, by looking at whether the percpu
> nmi-stack-pointer is non-NULL
>
> - if it was nested, do nothing, an return with a popf/ret. The only
> stack this sequence might needs is to save/restore the register that
> we use for the percpu value (although maybe we can just co a "cmpl
> $0,%_percpu_seg:nmi_stack_ptr" and not even need that), and it's
> atomic because at this point we know that NMI's are disabled (we've
> not _yet_ taken any nested faults)
>
> - if it's a regular (non-nesting) NMI, we'd basically do
>
> 6* pushq 48(%rsp)
>
> to copy the five words that the NMI pushed (ss/esp/eflags/cs/eip)
> and the one we saved ourselves (if we needed any, maybe we can make do
> with just 5 words).
Ah, right, you only need to do the copy and use the copy for the nesting level 0
NMI handler. The nested NMI can work on the "real" nmi stack because we never
expect it to fault.
>
> - then we just save that new stack pointer to the percpu thing with a simple
>
> movq %rsp,%__percpu_seg:nmi_stack_ptr
>
> and we're all done. The final "iret" will do the right thing (either
> fault or return), and there are no races that I can see exactly
> because we use a single nmi-atomic instruction (the "iret" itself) to
> either re-enable NMI's _or_ test whether we should re-do an NMI.
>
> There is a single-instruction window that is interestign in the return
> path, which is the window between the two final instructions:
>
> movl $0,%__percpu_seg:nmi_stack_ptr
> iret
>
> where I wonder what happens if we have re-enabled NMI (due to a fault
> in the NMI handler), but we haven't actually taken the NMI itself yet,
> so now we _will_ re-use the stack. Hmm. I suspect we need another of
> those horrible "if the NMI happens at this particular %rip" cases that
> we already have for the sysenter code on x86-32 for the NMI/DEBUG trap
> case of fixing up the stack pointer.
Yes, this was this exact instruction window I was worried about. I see another
possible failure mode:
- NMI
- page fault
- iret
- NMI
- set nmi_stack_ptr to 0, popf/lret.
- page fault (yep, another one!)
- iret
- movl $0,%__percpu_seg:nmi_stack_ptr
- iret
So in this case, movl/iret are executed with NMIs enabled. So if an NMI comes in
after the movl instruction, it will not detect that it is nested and will re-use
the percpu "nmi_stack_ptr" stack, squashing the "faulty" stack ptr with a brand
new one which won't trigger a fault. I'm afraid that in this case, the last NMI
handler will iret to the "nesting 0" handler at the iret instruction, which will
in turn return to itself, breaking all hell loose in the meantime (endless iret
loop).
So this also calls for special-casing an NMI nested on top of the following iret
- movl $0,%__percpu_seg:nmi_stack_ptr
- iret <-----
instruction. At the beginning of the NMI handler, we could detect if we are
either nested over an NMI (checking nmi_stack_ptr != NULL) or if we are at this
specifica %rip, and assume we are nested in both cases.
>
> And maybe I missed something else. But it does look reasonably simple.
> Subtle, but not a lot of code. And the code is all very much about the
> NMI itself, not about other random sequences. No?
If we can find a clean way to handle this NMI vs iret problem outside of the
entry_*.S code, within NMI-specific code, I'm indeed all for it. entry_*.s is
already complicated enough as it is. I think checking the %rip at NMI entry
could work out.
Thanks!
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
next prev parent reply other threads:[~2010-07-14 22:21 UTC|newest]
Thread overview: 163+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
2010-07-14 15:49 ` [patch 1/2] x86_64 page fault NMI-safe Mathieu Desnoyers
2010-07-14 16:28 ` Linus Torvalds
2010-07-14 17:06 ` Mathieu Desnoyers
2010-07-14 18:10 ` Linus Torvalds
2010-07-14 18:46 ` Ingo Molnar
2010-07-14 19:14 ` Linus Torvalds
2010-07-14 19:36 ` Frederic Weisbecker
2010-07-14 19:54 ` Linus Torvalds
2010-07-14 20:17 ` Mathieu Desnoyers
2010-07-14 20:55 ` Linus Torvalds
2010-07-14 21:18 ` Ingo Molnar
2010-07-14 22:14 ` Frederic Weisbecker
2010-07-14 22:31 ` Mathieu Desnoyers
2010-07-14 22:48 ` Frederic Weisbecker
2010-07-14 23:11 ` Mathieu Desnoyers
2010-07-14 23:38 ` Frederic Weisbecker
2010-07-15 16:26 ` Mathieu Desnoyers
2010-08-03 17:18 ` Peter Zijlstra
2010-08-03 18:25 ` Mathieu Desnoyers
2010-08-04 6:46 ` Peter Zijlstra
2010-08-04 7:14 ` Ingo Molnar
2010-08-04 14:45 ` Mathieu Desnoyers
2010-08-04 14:56 ` Peter Zijlstra
2010-08-06 1:49 ` Mathieu Desnoyers
2010-08-06 9:51 ` Peter Zijlstra
2010-08-06 13:46 ` Mathieu Desnoyers
2010-08-06 6:18 ` Masami Hiramatsu
2010-08-06 9:50 ` Peter Zijlstra
2010-08-06 13:37 ` Mathieu Desnoyers
2010-08-07 9:51 ` Masami Hiramatsu
2010-08-09 16:53 ` Frederic Weisbecker
2010-08-03 18:56 ` Linus Torvalds
2010-08-03 19:45 ` Mathieu Desnoyers
2010-08-03 20:02 ` Linus Torvalds
2010-08-03 20:10 ` Ingo Molnar
2010-08-03 20:21 ` Ingo Molnar
2010-08-03 21:16 ` Mathieu Desnoyers
2010-08-03 20:54 ` Mathieu Desnoyers
2010-08-04 6:27 ` Peter Zijlstra
2010-08-04 14:06 ` Mathieu Desnoyers
2010-08-04 14:50 ` Peter Zijlstra
2010-08-06 1:42 ` Mathieu Desnoyers
2010-08-06 10:11 ` Peter Zijlstra
2010-08-06 11:14 ` Peter Zijlstra
2010-08-06 14:15 ` Mathieu Desnoyers
2010-08-06 14:13 ` Mathieu Desnoyers
2010-08-11 14:44 ` Steven Rostedt
2010-08-11 14:34 ` Steven Rostedt
2010-08-15 13:35 ` Mathieu Desnoyers
2010-08-15 16:33 ` Avi Kivity
2010-08-15 16:44 ` Mathieu Desnoyers
2010-08-15 16:51 ` Avi Kivity
2010-08-15 18:31 ` Mathieu Desnoyers
2010-08-16 10:49 ` Avi Kivity
2010-08-16 11:29 ` Avi Kivity
2010-08-04 6:46 ` Dave Chinner
2010-08-04 7:21 ` Ingo Molnar
2010-07-14 23:40 ` Steven Rostedt
2010-07-14 19:41 ` Linus Torvalds
2010-07-14 19:56 ` Andi Kleen
2010-07-14 20:05 ` Mathieu Desnoyers
2010-07-14 20:07 ` Andi Kleen
2010-07-14 20:08 ` H. Peter Anvin
2010-07-14 23:32 ` Tejun Heo
2010-07-14 22:31 ` Frederic Weisbecker
2010-07-14 22:56 ` Linus Torvalds
2010-07-14 23:09 ` Andi Kleen
2010-07-14 23:22 ` Linus Torvalds
2010-07-15 14:11 ` Frederic Weisbecker
2010-07-15 14:35 ` Andi Kleen
2010-07-16 11:21 ` Frederic Weisbecker
2010-07-15 14:46 ` Steven Rostedt
2010-07-16 10:47 ` Frederic Weisbecker
2010-07-16 11:43 ` Steven Rostedt
2010-07-15 14:51 ` Linus Torvalds
2010-07-15 15:38 ` Linus Torvalds
2010-07-16 12:00 ` Frederic Weisbecker
2010-07-16 12:54 ` Steven Rostedt
2010-07-14 20:39 ` Mathieu Desnoyers
2010-07-14 21:23 ` Linus Torvalds
2010-07-14 21:45 ` Maciej W. Rozycki
2010-07-14 21:52 ` Linus Torvalds
2010-07-14 22:31 ` Maciej W. Rozycki
2010-07-14 22:21 ` Mathieu Desnoyers [this message]
2010-07-14 22:37 ` Linus Torvalds
2010-07-14 22:51 ` Jeremy Fitzhardinge
2010-07-14 23:02 ` Linus Torvalds
2010-07-14 23:54 ` Jeremy Fitzhardinge
2010-07-15 1:23 ` Linus Torvalds
2010-07-15 1:45 ` Linus Torvalds
2010-07-15 18:31 ` Mathieu Desnoyers
2010-07-15 18:43 ` Linus Torvalds
2010-07-15 18:48 ` Linus Torvalds
2010-07-15 22:01 ` Mathieu Desnoyers
2010-07-15 22:16 ` Linus Torvalds
2010-07-15 22:24 ` H. Peter Anvin
2010-07-15 22:26 ` Linus Torvalds
2010-07-15 22:46 ` H. Peter Anvin
2010-07-15 22:58 ` Andi Kleen
2010-07-15 23:20 ` H. Peter Anvin
2010-07-15 23:23 ` Linus Torvalds
2010-07-15 23:41 ` H. Peter Anvin
2010-07-15 23:44 ` Linus Torvalds
2010-07-15 23:46 ` H. Peter Anvin
2010-07-15 23:48 ` Andi Kleen
2010-07-15 22:30 ` Mathieu Desnoyers
2010-07-16 19:13 ` Mathieu Desnoyers
2010-07-15 16:44 ` Mathieu Desnoyers
2010-07-15 16:49 ` Linus Torvalds
2010-07-15 17:38 ` Mathieu Desnoyers
2010-07-15 20:44 ` H. Peter Anvin
2010-07-18 11:03 ` Avi Kivity
2010-07-18 17:36 ` Linus Torvalds
2010-07-18 18:04 ` Avi Kivity
2010-07-18 18:22 ` Linus Torvalds
2010-07-19 7:32 ` Avi Kivity
2010-07-18 18:17 ` Linus Torvalds
2010-07-18 18:43 ` Steven Rostedt
2010-07-18 19:26 ` Linus Torvalds
2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
2010-07-14 16:42 ` Maciej W. Rozycki
2010-07-14 18:12 ` Mathieu Desnoyers
2010-07-14 19:21 ` Maciej W. Rozycki
2010-07-14 19:58 ` Mathieu Desnoyers
2010-07-14 20:36 ` Maciej W. Rozycki
2010-07-16 12:28 ` Avi Kivity
2010-07-16 14:49 ` Mathieu Desnoyers
2010-07-16 15:34 ` Andi Kleen
2010-07-16 15:40 ` Mathieu Desnoyers
2010-07-16 16:47 ` Avi Kivity
2010-07-16 16:58 ` Mathieu Desnoyers
2010-07-16 17:54 ` Avi Kivity
2010-07-16 18:05 ` H. Peter Anvin
2010-07-16 18:15 ` Avi Kivity
2010-07-16 18:17 ` H. Peter Anvin
2010-07-16 18:28 ` Avi Kivity
2010-07-16 18:37 ` Linus Torvalds
2010-07-16 19:26 ` Avi Kivity
2010-07-16 21:39 ` Linus Torvalds
2010-07-16 22:07 ` Andi Kleen
2010-07-16 22:26 ` Linus Torvalds
2010-07-16 22:41 ` Andi Kleen
2010-07-17 1:15 ` Linus Torvalds
2010-07-16 22:40 ` Mathieu Desnoyers
2010-07-18 9:23 ` Avi Kivity
2010-07-16 18:22 ` Mathieu Desnoyers
2010-07-16 18:32 ` Avi Kivity
2010-07-16 19:29 ` H. Peter Anvin
2010-07-16 19:39 ` Avi Kivity
2010-07-16 19:32 ` Andi Kleen
2010-07-16 18:25 ` Linus Torvalds
2010-07-16 19:30 ` Andi Kleen
2010-07-18 9:26 ` Avi Kivity
2010-07-16 19:28 ` Andi Kleen
2010-07-16 19:32 ` Avi Kivity
2010-07-16 19:34 ` Andi Kleen
2010-08-04 9:46 ` Peter Zijlstra
2010-08-04 20:23 ` H. Peter Anvin
2010-07-14 17:06 ` [patch 0/2] x86: NMI-safe trap handlers Andi Kleen
2010-07-14 17:08 ` Mathieu Desnoyers
2010-07-14 18:56 ` Andi Kleen
2010-07-14 23:29 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20100714222115.GA30122@Krystal \
--to=mathieu.desnoyers@efficios.com \
--cc=acme@infradead.org \
--cc=akpm@linux-foundation.org \
--cc=andi@firstfloor.org \
--cc=fche@redhat.com \
--cc=fweisbec@gmail.com \
--cc=hch@lst.de \
--cc=hpa@zytor.com \
--cc=htejun@gmail.com \
--cc=jeremy@goop.org \
--cc=johannes.berg@intel.com \
--cc=kosaki.motohiro@jp.fujitsu.com \
--cc=laijs@cn.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lizf@cn.fujitsu.com \
--cc=masami.hiramatsu.pt@hitachi.com \
--cc=mingo@elte.hu \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=rostedt@rostedt.homelinux.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=tzanussi@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox