* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
@ 2010-07-16 22:02 Jeffrey Merkey
  2010-07-16 22:22 ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:02 UTC (permalink / raw)
  To: linux-kernel

Date: Fri, 16 Jul 2010 14:39:50 -0700
From: Linus Torvalds

Linus Torvalds wrote:
> > But we're not talking about non-NMI code.
>
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>
> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>
> Linus

So Linus, my understanding of Intel's processor design is that the
processor will NEVER signal a nested NMI until it sees an iret from the
first NMI exception. At least that's how the processors were working when
I started this, unless this behavior has changed. Just put a gate on the
exception that uses its own stack (which I think we do anyway).

Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:02 [patch 2/2] x86 NMI-safe INT3 and Page Fault Jeffrey Merkey
@ 2010-07-16 22:22 ` Linus Torvalds
  2010-07-16 22:48   ` Jeffrey Merkey
  2010-07-16 22:50   ` Jeffrey Merkey
  0 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2010-07-16 22:22 UTC (permalink / raw)
  To: Jeffrey Merkey; +Cc: linux-kernel

On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>
> So Linus, my understanding of Intel's processor design is that the
> processor will NEVER signal a nested NMI until it sees an iret from
> the first NMI exception.

Wrong.

I like x86, but it has warts. The NMI blocking is one of them.

The NMI's will be blocked until the _next_ "iret", but it has no nesting.
So if you take a fault during the NMI (debug, page table fixup, whatever),
the iret in the fault handler will re-enable NMI's even though we're still
busy with the original NMI. There is no nesting, or any way to say that
"this is an NMI-releasing iret". They could even do it still - make a new
"iret that doesn't clear NMI" by adding a segment override prefix to iret
or whatever. But it's not going to happen, and it's just one of those ugly
special cases that has various historical reasons (recursive faults during
NMI sure as hell didn't make sense back in the real-mode 8086 days).

So we have to handle it in software. Or not ever trap at all inside the
NMI handler.

The original patch - and the patch I detest - is to make the normal fault
paths use a "popf + ret" to emulate iret, but without the NMI release.

Now, I could live with that if it's the only solution, but it _is_ pretty
damn ugly.

If somebody shows that it's actually faster to do "popf + ret" when
returning to kernel space (a poor man's special-case iret), maybe it would
be worth it, but the really critical code sequence is actually not "return
to kernel space", but the "return to user space" case that really wants
the iret.
And I just think it's disgusting to add extra tests to that path.

The other alternative would be to just make the rule be "NMI can never
take traps". It's possible to do that, but quite frankly, it's a pain.
It's a pain for page faults due to the whole vmalloc thing, and it's a
pain if you ever want to debug an NMI in any way (or put a breakpoint on
anything that is accessed from an NMI, which could potentially be quite a
lot of things).

If it was just the debug issue, I'd say "neener neener, debuggers are for
wimps", but it's clearly not just about debug. It's a whole lot of other
things. Random per-cpu data structures used for tracing, kernel pointer
verification code, yadda yadda.

		Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread
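Linus' description of the iret wart can be condensed into a toy state machine. The following C model is purely illustrative - the struct, field and function names are invented, and the real latch lives in hardware - but it shows why an iret executed by a fault handler nested inside an NMI re-opens the NMI window early, while the popf+ret emulation does not:

```c
#include <stdbool.h>

/*
 * Toy model of the x86 NMI latch discussed in this thread (names are
 * invented, not kernel code).  Hardware sets the latch when an NMI is
 * taken and clears it on the *next* iret -- whichever handler runs it.
 */
struct cpu_model {
	bool nmi_blocked;	/* set at NMI entry, cleared by any iret */
};

static void take_nmi(struct cpu_model *c)
{
	c->nmi_blocked = true;
}

/* Any iret clears the latch, even one run by a page-fault handler. */
static void do_iret(struct cpu_model *c)
{
	c->nmi_blocked = false;
}

/* popf+ret leaves the latch alone -- the point of the patchset. */
static void do_popf_ret(struct cpu_model *c)
{
	(void)c;		/* flags restored, latch untouched */
}

/* Returns true if a second NMI could now nest over the first. */
static bool fault_inside_nmi(struct cpu_model *c, bool use_iret)
{
	take_nmi(c);		/* NMI arrives, latch set */
	/* handler touches vmalloc'd memory -> page fault -> return: */
	if (use_iret)
		do_iret(c);	/* latch dropped too early */
	else
		do_popf_ret(c);
	return !c->nmi_blocked;
}
```

In the model, do_iret() is the only operation that clears the latch, which matches the behavior described above: the CPU cannot tell a page-fault handler's iret apart from the NMI handler's own iret.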
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:22 ` Linus Torvalds
@ 2010-07-16 22:48   ` Jeffrey Merkey
  2010-07-16 22:53     ` Jeffrey Merkey
  1 sibling, 1 reply; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Jul 16, 2010 at 4:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>>
>> So Linus, my understanding of Intel's processor design is that the
>> processor will NEVER signal a nested NMI until it sees an iret from
>> the first NMI exception.
>
> Wrong.
>
> I like x86, but it has warts. The NMI blocking is one of them.
>
> The NMI's will be blocked until the _next_ "iret", but it has no
> nesting. So if you take a fault during the NMI (debug, page table
> fixup, whatever), the iret in the fault handler will re-enable NMI's
> even though we're still busy with the original NMI. There is no
> nesting, or any way to say that "this is an NMI-releasing iret". They
> could even do it still - make a new "iret that doesn't clear NMI" by
> adding a segment override prefix to iret or whatever. But it's not
> going to happen, and it's just one of those ugly special cases that
> has various historical reasons (recursive faults during NMI sure as
> hell didn't make sense back in the real-mode 8086 days).
>
> So we have to handle it in software. Or not ever trap at all inside
> the NMI handler.
>
> The original patch - and the patch I detest - is to make the normal
> fault paths use a "popf + ret" to emulate iret, but without the NMI
> release.
>
> Now, I could live with that if it's the only solution, but it _is_
> pretty damn ugly.
>
> If somebody shows that it's actually faster to do "popf + ret" when
> returning to kernel space (a poor man's special-case iret), maybe it
> would be worth it, but the really critical code sequence is actually
> not "return to kernel space", but the "return to user space" case that
> really wants the iret. And I just think it's disgusting to add extra
> tests to that path.
>
> The other alternative would be to just make the rule be "NMI can never
> take traps". It's possible to do that, but quite frankly, it's a pain.
> It's a pain for page faults due to the whole vmalloc thing, and it's a
> pain if you ever want to debug an NMI in any way (or put a breakpoint
> on anything that is accessed from an NMI, which could potentially be
> quite a lot of things).
>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other things. Random per-cpu data structures used for tracing, kernel
> pointer verification code, yadda yadda.
>
> Linus

Well, the way I handled this problem on NetWare SMP and that other kernel
was to create a pool of TSS descriptors and reload each during the
exception to swap stacks before any handlers were called. This allowed
nesting until I ran out of TSS descriptors (64 levels). Not sure that's
the way to go here, though, but it worked in that case.

Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:48 ` Jeffrey Merkey
@ 2010-07-16 22:53   ` Jeffrey Merkey
  0 siblings, 0 replies; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

> Well, the way I handled this problem on NetWare SMP and that other
> kernel was to create a pool of TSS descriptors and reload each during
> the exception to swap stacks before any handlers were called. Allowed
> it to nest until I ran out of TSS descriptors (64 levels). Not sure
> that's the way to go here though but it worked in that case.
>
> Jeff

Here is where that old dusty code lives these days - it deals with this
problem.

http://open-source-netware.googlecode.com/files/manos-06-26-2010.tar.gz

The file to look at is startup.386.

;
;   nmi entry code
;

nmi_entry macro
        cli
        push    ebx
        push    ebp
        mov     ebp, esp
        sub     ebp, SIZE TaskStateSegment
        mov     ebx, ebp
        mov     [ebp].tSS, ss
        mov     [ebp].tGS, gs            ; save segment registers
        mov     [ebp].tFS, fs
        mov     [ebp].tES, es
        mov     [ebp].tDS, ds
        pop     [ebp].tEBP
        mov     [ebp].tEDI, edi
        mov     [ebp].tESI, esi
        mov     [ebp].tEDX, edx
        mov     [ebp].tECX, ecx
        pop     [ebp].tEBX
        mov     [ebp].tEAX, eax
        pop     [ebp].tEIP               ; remove return address
        pop     eax
        mov     [ebp].tCS, ax
        pop     [ebp].tSystemFlags       ; get flags into TSS
        mov     [ebp].tESP, esp          ; save true stack address
        mov     esp, ebx                 ; cover stack frame
        mov     eax, CR0
        and     eax, 0FFFFFFF7h          ; clear task switch bit in CR0 to
        mov     CR0, eax                 ; avoid NPX exceptions
        xor     eax, eax
        mov     dr7, eax                 ; disable breakpoints
        mov     eax, CR3
        mov     [ebp].tCR3, eax
        mov     eax, DebuggerPDE
        mov     CR3, eax
        ;
        ; if we do not clear the NESTED_TASK_FLAG, then the IRET
        ; at the end of this function will cause an invalid TSS
        ; exception to be generated because the task busy bit was
        ; cleared earlier
        ;
        pushfd
        and     dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
        or      dword ptr [esp], RESUME_FLAG
        popfd
        mov     eax, 0FFFFFFFFh          ; mark as a non-pooled TSS exception
        push    eax
        push    0
        push    0
        push    ebp
endm

;
;   TSS entry code
;
task_entry macro
        LOCAL   @TSSNotNested, @NoLTR
        LOCAL   @UsedDefaultSegment
        LOCAL   @UsedPooledSegment
        LOCAL   @EnterTheDebugger

        cli
        xor     eax, eax
        str     ax
        mov     esi, offset SystemGDTTable
        mov     esi, dword ptr [esi + 2]
        lea     ebx, [esi + eax]
        mov     al, [ebx].TSSBase2
        mov     ah, [ebx].TSSBase3
        shl     eax, 16
        mov     ax, [ebx].TSSBase1
        ;
        ; eax -> TSS Segment (Current)
        ; ebx -> TSS Descriptor (Current)
        ;
        movzx   ecx, word ptr [eax].tBackLink
        or      ecx, ecx
        jz      @TSSNotNested

        mov     esi, offset SystemGDTTable
        mov     esi, dword ptr [esi + 2]
        lea     edx, [esi + ecx]
        mov     cl, [edx].TSSBase2
        mov     ch, [edx].TSSBase3
        shl     ecx, 16
        mov     cx, [edx].TSSBase1
        mov     ebp, ecx
        ;
        ; edx -> TSS Descriptor (Previous)
        ; ebp -> TSS Segment (Previous)
        ;
        ; clear busy state and reset TSS
        ;
        mov     [edx].TSSType, 10001001b

@TSSNotNested:
        mov     [ebx].TSSType, 10001001b
        lgdt    ds: SystemGDTTable       ; reset GDT TSS Busy bit
        movzx   eax, word ptr [eax].tBackLink
        or      eax, eax
        jz      @NoLTR
        ltr     ax

@NoLTR:
        mov     eax, CR0
        and     eax, 0FFFFFFF7h          ; clear task switch bit in CR0 to
        mov     CR0, eax                 ; avoid NPX exceptions
        xor     eax, eax
        mov     dr7, eax                 ; disable breakpoints

        pushfd
        and     dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
        or      dword ptr [esp], RESUME_FLAG
        popfd

        push    ebp
        call    AllocPooledResource
        pop     ebp
        or      eax, eax
        jz      @UsedDefaultSegment
        lea     ebp, [eax].TSSSegment
        mov     esp, [eax].StackTop
        push    eax                      ; push address of pooled resource
        jmp     @UsedPooledSegment

@UsedDefaultSegment:
        mov     eax, 0FFFFFFFFh          ; push non-pooled marker onto the stack
        push    eax

@UsedPooledSegment:
        push    0
        mov     eax, CR2                 ; get fault address
        push    eax
        push    ebp                      ; pass the TSS
endm

;
;   TSS exit code
;

Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread
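The TSS-pool scheme above can be sketched in C. This is a loose, hypothetical model of the idea (a fixed pool of per-exception contexts handed out on entry and returned on exit), not a translation of the NetWare assembly; all names and sizes are invented:

```c
#include <stddef.h>

/*
 * Sketch of the fixed-pool idea: 64 pre-allocated exception contexts,
 * one handed out per nesting level and returned on exit.  When the pool
 * is exhausted, nesting stops -- mirroring the "64 levels" limit
 * mentioned above.  Illustrative only; names are invented.
 */
#define POOL_DEPTH 64

struct exc_context {
	unsigned char stack[256];	/* private stack for this level */
};

static struct exc_context exc_pool[POOL_DEPTH];
static int exc_pool_top;		/* contexts currently in use */

static struct exc_context *alloc_exc_context(void)
{
	if (exc_pool_top >= POOL_DEPTH)
		return NULL;		/* nesting limit reached */
	return &exc_pool[exc_pool_top++];
}

static void free_exc_context(void)
{
	if (exc_pool_top > 0)
		exc_pool_top--;
}

/* Drain the pool to measure the maximum nesting depth, then restore it. */
static int max_nesting(void)
{
	int depth = 0;

	while (alloc_exc_context() != NULL)
		depth++;
	while (exc_pool_top > 0)
		free_exc_context();
	return depth;
}
```

The design trade-off versus the popf/ret approach is visible even in the sketch: the pool bounds nesting explicitly and needs per-level storage, while the mainline proposal keeps a single NMI stack and only changes how handlers return.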
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:22 ` Linus Torvalds
  2010-07-16 22:48 ` Jeffrey Merkey
@ 2010-07-16 22:50 ` Jeffrey Merkey
  1 sibling, 0 replies; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other things. Random per-cpu data structures used for tracing, kernel
> pointer verification code, yadda yadda.
>
> Linus

I guess I am a wimp then ... :-)

Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread
* [patch 0/2] x86: NMI-safe trap handlers
@ 2010-07-14 15:49 Mathieu Desnoyers
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 15:49 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen

Hi,

There seems to have been some churn regarding perf problems with per-cpu
memory allocation which uses vmalloc. Long story short: faulting NMIs
reactivate NMIs sooner than expected, because x86 re-enables NMIs at the
first iret encountered, which leads to nested NMIs.

x86_32 cannot use vmalloc_sync_all() to synchronize the TLBs for every
process, because the vmalloc area is mapped in a different address space
for each process on this architecture.

A second alternative is to duplicate the per-cpu allocation API to have a
variant using kmalloc only. This would lead to code and API duplication
and should probably be kept as a last resort.

A third solution to this problem is to make the page fault handler aware
of NMIs and ensure it can be called from this context. This third
solution is proposed by this patchset.

So I'm respinning this patchset, which has been sitting for a while, used
for about 1-2 years in the LTTng tree without problems, and already
tested in a -tip sub-branch in the past. It uses a ret/popf instruction
pair instead of iret when it detects that a trap handler is nested over
an NMI. A second patch takes care of making the page fault handler
NMI-safe by using the cr3 register rather than accessing ->current, which
could be in the middle of being changed by a context switch.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
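The detection the patchset relies on - "am I nested over an NMI?" - can be illustrated with a small C model. The NMI_MASK constant matches the value the patch defines in entry_32.S/entry_64.S; the helper functions are invented stand-ins for the kernel's NMI entry/exit accounting, simplified to a single flag bit:

```c
/*
 * Model of how the patch detects "nested over an NMI": an NMI bit is
 * kept inside the per-thread preempt count, and the exception-return
 * path tests it.  NMI_MASK matches the patch; the helpers are
 * illustrative stand-ins, not kernel code.
 */
#define NMI_MASK 0x04000000u

static unsigned int model_preempt_count;

static void nmi_enter_model(void)
{
	model_preempt_count |= NMI_MASK;	/* entering NMI handler */
}

static void nmi_exit_model(void)
{
	model_preempt_count &= ~NMI_MASK;	/* leaving NMI handler */
}

/*
 * Mirrors "testl $NMI_MASK, TI_preempt_count(%ebp)" in the patch:
 * nonzero means a trap handler must return with popf/ret, not iret.
 */
static int must_use_nmi_safe_return(void)
{
	return (model_preempt_count & NMI_MASK) != 0;
}
```

Note that the test only has to run on the return-to-kernel path: a trap taken from user space cannot be nested over an NMI, which is why the cover letter can promise no slowdown on the user-space return paths.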
* [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
@ 2010-07-14 15:49 ` Mathieu Desnoyers
  2010-07-14 16:42   ` Maciej W. Rozycki
  2010-07-16 12:28   ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 15:49 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

[-- Attachment #1: x86-nmi-safe-int3-and-page-fault.patch --]
[-- Type: text/plain, Size: 21100 bytes --]

Implements an alternative to iret - a popf and return sequence - so trap
and exception handlers can return to the NMI handler without issuing
iret. iret would cause NMIs to be re-enabled prematurely. x86_32 uses
popf and a far return. x86_64 has to copy the return instruction pointer
to the top of the previous stack, issue a popf, load the previous esp and
issue a near return (ret).

It allows placing dynamically patched static jumps (asm gotos), which
will be used for optimized tracepoints, in NMI code, since returning from
a breakpoint becomes valid. Accessing vmalloc'd memory also becomes
valid, which allows executing module code or accessing vmapped or
vmalloc'd areas from NMI context. This is very useful to tracers like
LTTng.

This patch makes all faults, traps and exceptions safe to be called from
NMI context *except* single-stepping, which requires iret to restore the
TF (trap flag) and jump to the return address in a single instruction.
Sorry, no kprobes support in NMI handlers because of this limitation.
This cannot be emulated with popf/lret, because lret would be
single-stepped.
It does not apply to "immediate values" because they do not use
single-stepping. This code detects if the TF flag is set and uses the
iret path for single-stepping, even if it reactivates NMIs prematurely.

The test to detect nesting under an NMI handler is only done upon return
from a trap/exception to the kernel, which is not frequent. The other
return paths (return from trap/exception to userspace, return from
interrupt) keep the exact same behavior (no slowdown).

alpha and avr32 use active count bit 31. This patch moves them to bit 28.

TODO : test alpha and avr32 active count modification
TODO : test with lguest, xen, kvm.

Tested on x86_32 (tests implemented in a separate patch) :
- instrumented the return path to export the EIP, CS and EFLAGS values
  when taken, so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
  activated. Runs well through the return path.
- tested vmalloc faults in the NMI handler by placing a non-optimized
  marker in the NMI handler (so no breakpoint is executed) and connecting
  a probe which touches every page of a 20MB vmalloc'd buffer. It
  executes through the return path without problem.
- tested with and without preemption.

Tested on x86_64 :
- instrumented the return path to export the EIP, CS and EFLAGS values
  when taken, so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
  activated. Runs well through the return path.

To test on x86_64 :
- test without preemption
- test vmalloc faults
- test on Intel 64-bit CPUs (AMD64 was fine)

Changelog since v1 :
- x86_64 fixes.

Changelog since v2 :
- fix paravirt build

Changelog since v3 :
- include modifications suggested by Jeremy

Changelog since v4 :
- including hardirq.h in entry_32/64.S is a bad idea (non-ifndef'd C
  code); define NMI_MASK in the .S files directly.

Changelog since v5 :
- add NMI_MASK to irq_count() and make die() more verbose for NMIs.
Changelog since v7 :
- Implement paravirtualized nmi_return.

Changelog since v8 :
- refreshed the patch for asm-offsets. Those were left out of v8.
- now depends on "Stringify support commas" patch.

Changelog since v9 :
- Only test the nmi nested preempt count flag upon return from
  exceptions, not on return from interrupts. Only the kernel return path
  has this test.
- Add Xen, VMI, lguest support. Use their iret paravirt ops in lieu of
  nmi_return.
- update for 2.6.30-rc1: follow NMI_MASK bits merged in mainline.
- update for 2.6.35-rc4-tip

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
 arch/alpha/include/asm/thread_info.h  |    2 -
 arch/avr32/include/asm/thread_info.h  |    2 -
 arch/x86/include/asm/irqflags.h       |   56 ++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/paravirt.h       |    4 ++
 arch/x86/include/asm/paravirt_types.h |    1 
 arch/x86/kernel/asm-offsets_32.c      |    1 
 arch/x86/kernel/asm-offsets_64.c      |    1 
 arch/x86/kernel/dumpstack.c           |    2 +
 arch/x86/kernel/entry_32.S            |   30 ++++++++++++++++++
 arch/x86/kernel/entry_64.S            |   33 ++++++++++++++++++--
 arch/x86/kernel/paravirt.c            |    3 +
 arch/x86/kernel/paravirt_patch_32.c   |    6 +++
 arch/x86/kernel/paravirt_patch_64.c   |    6 +++
 arch/x86/kernel/vmi_32.c              |    2 +
 arch/x86/lguest/boot.c                |    1 
 arch/x86/xen/enlighten.c              |    1 
 16 files changed, 145 insertions(+), 6 deletions(-)

Index: linux.trees.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_32.S	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_32.S	2010-07-14 08:02:11.000000000 -0400
@@ -80,6 +80,8 @@
 
 #define nr_syscalls ((syscall_table_size)/4)
 
+#define NMI_MASK 0x04000000
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
@@ -348,8 +350,32 @@ END(ret_from_fork)
 	# userspace resumption stub bypassing syscall exit tracing
 	ALIGN
 	RING0_PTREGS_FRAME
+
 ret_from_exception:
 	preempt_stop(CLBR_ANY)
+	GET_THREAD_INFO(%ebp)
+	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
+	movb PT_CS(%esp), %al
+	andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
+	cmpl $USER_RPL, %eax
+	jae resume_userspace		# returning to v8086 or userspace
+	testl $NMI_MASK,TI_preempt_count(%ebp)
+	jz resume_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF, PT_EFLAGS(%esp)
+	jnz resume_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	TRACE_IRQS_IRET
+	RESTORE_REGS
+	addl $4, %esp			# skip orig_eax/error_code
+	CFI_ADJUST_CFA_OFFSET -4
+	INTERRUPT_RETURN_NMI_SAFE
+
 ret_from_intr:
 	GET_THREAD_INFO(%ebp)
 check_userspace:
@@ -949,6 +975,10 @@ ENTRY(native_iret)
 .previous
 END(native_iret)
 
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE # Should we deal with popf exception ?
+END(native_nmi_return)
+
 ENTRY(native_irq_enable_sysexit)
 	sti
 	sysexit
Index: linux.trees.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_64.S	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_64.S	2010-07-14 08:02:11.000000000 -0400
@@ -163,6 +163,8 @@ GLOBAL(return_to_handler)
 
 #endif
 
+#define NMI_MASK 0x04000000
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif
@@ -875,6 +877,9 @@ ENTRY(native_iret)
 	.section __ex_table,"a"
 	.quad native_iret, bad_iret
 	.previous
+
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 #endif
 
 	.section .fixup,"ax"
@@ -929,6 +934,24 @@ retint_signal:
 	GET_THREAD_INFO(%rcx)
 	jmp retint_with_reschedule
 
+	/* Returning to kernel space from exception. */
+	/* rcx: threadinfo. interrupts off. */
+ENTRY(retexc_kernel)
+	testl $NMI_MASK,TI_preempt_count(%rcx)
+	jz retint_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-ARGOFFSET(%rsp)	/* trap flag? */
+	jnz retint_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	RESTORE_ARGS 0,8,0
+	TRACE_IRQS_IRETQ
+	INTERRUPT_RETURN_NMI_SAFE
+
 #ifdef CONFIG_PREEMPT
 	/* Returning to kernel space. Check if we need preemption */
 	/* rcx: threadinfo. interrupts off. */
@@ -1375,12 +1398,18 @@ ENTRY(paranoid_exit)
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
+paranoid_restore_no_nmi:
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_restore:
+	GET_THREAD_INFO(%rcx)
 	TRACE_IRQS_IRETQ 0
+	testl $NMI_MASK,TI_preempt_count(%rcx)
+	jz paranoid_restore_no_nmi	/* Nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-0(%rsp)	/* trap flag? */
+	jnz paranoid_restore_no_nmi
 	RESTORE_ALL 8
-	jmp irq_return
+	INTERRUPT_RETURN_NMI_SAFE
paranoid_userspace:
	GET_THREAD_INFO(%rcx)
	movl TI_flags(%rcx),%ebx
@@ -1479,7 +1508,7 @@ ENTRY(error_exit)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
 	testl %eax,%eax
-	jne retint_kernel
+	jne retexc_kernel
 	LOCKDEP_SYS_EXIT_IRQ
 	movl TI_flags(%rcx),%edx
 	movl $_TIF_WORK_MASK,%edi
Index: linux.trees.git/arch/x86/include/asm/irqflags.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/irqflags.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/irqflags.h	2010-07-14 08:02:11.000000000 -0400
@@ -56,6 +56,61 @@ static inline void native_halt(void)
 
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Only returns from a trap or exception to a NMI context (intra-privilege
+ * level near return) to the same SS and CS segments. Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued. It takes care of modifying the eflags, rsp and returning to the
+ * previous function.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(rsp)  RIP
+ * 8(rsp)  CS
+ * 16(rsp) EFLAGS
+ * 24(rsp) RSP
+ * 32(rsp) SS
+ *
+ * Upon execution :
+ * Copy EIP to the top of the return stack
+ * Update top of return stack address
+ * Pop eflags into the eflags register
+ * Make the return stack current
+ * Near return (popping the return address from the return stack)
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushq %rax;		\
+					movq %rsp, %rax;		\
+					movq 24+8(%rax), %rsp;		\
+					pushq 0+8(%rax);		\
+					pushq 16+8(%rax);		\
+					movq (%rax), %rax;		\
+					popfq;				\
+					ret
+#else
+/*
+ * Protected mode only, no V8086. Implies that protected mode must
+ * be entered before NMIs or MCEs are enabled. Only returns from a trap or
+ * exception to a NMI context (intra-privilege level far return). Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(esp) EIP
+ * 4(esp) CS
+ * 8(esp) EFLAGS
+ *
+ * Upon execution :
+ * Copy the stack eflags to top of stack
+ * Pop eflags into the eflags register
+ * Far return: pop EIP and CS into their register, and additionally pop EFLAGS.
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushl 8(%esp);	\
+					popfl;			\
+					lret $4
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -114,6 +169,7 @@ static inline unsigned long __raw_local_
 
 #define ENABLE_INTERRUPTS(x)	sti
 #define DISABLE_INTERRUPTS(x)	cli
+#define INTERRUPT_RETURN_NMI_SAFE	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 
 #ifdef CONFIG_X86_64
 #define SWAPGS	swapgs
Index: linux.trees.git/arch/alpha/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/alpha/include/asm/thread_info.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/alpha/include/asm/thread_info.h	2010-07-14 08:02:11.000000000 -0400
@@ -56,7 +56,7 @@ register struct thread_info *__current_t
 #define THREAD_SIZE_ORDER 1
 #define THREAD_SIZE (2*PAGE_SIZE)
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags:
Index: linux.trees.git/arch/avr32/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/avr32/include/asm/thread_info.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/avr32/include/asm/thread_info.h	2010-07-14 08:02:11.000000000 -0400
@@ -66,7 +66,7 @@ static inline struct thread_info *curren
 
 #endif /* !__ASSEMBLY__ */
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags
Index: linux.trees.git/arch/x86/include/asm/paravirt.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt.h	2010-07-14 08:02:11.000000000 -0400
@@ -943,6 +943,10 @@ extern void default_banner(void);
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,	\
 		  jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))
 
+#define INTERRUPT_RETURN_NMI_SAFE					\
+	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_nmi_return), CLBR_NONE,	\
+		  jmp *%cs:pv_cpu_ops+PV_CPU_nmi_return)
+
 #define DISABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers,	\
 		  PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);		\
Index: linux.trees.git/arch/x86/kernel/paravirt.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt.c	2010-07-14 08:02:11.000000000 -0400
@@ -156,6 +156,7 @@ unsigned paravirt_patch_default(u8 type,
 		ret = paravirt_patch_ident_64(insnbuf, len);
 
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+		 type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret64))
@@ -204,6 +205,7 @@ static void native_flush_tlb_single(unsi
 
 /* These are in entry.S */
 extern void native_iret(void);
+extern void native_nmi_return(void);
 extern void native_irq_enable_sysexit(void);
 extern void native_usergs_sysret32(void);
 extern void native_usergs_sysret64(void);
@@ -373,6 +375,7 @@ struct pv_cpu_ops pv_cpu_ops = {
 	.usergs_sysret64 = native_usergs_sysret64,
 #endif
 	.iret = native_iret,
+	.nmi_return = native_nmi_return,
 	.swapgs = native_swapgs,
 
 	.set_iopl_mask = native_set_iopl_mask,
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_32.c	2010-07-14 08:02:11.000000000 -0400
@@ -1,10 +1,13 @@
-#include <asm/paravirt.h>
+#include <linux/stringify.h>
+#include <linux/irqflags.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
@@ -41,6 +44,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, restore_fl);
 		PATCH_SITE(pv_irq_ops, save_fl);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
 		PATCH_SITE(pv_mmu_ops, read_cr2);
 		PATCH_SITE(pv_mmu_ops, read_cr3);
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_64.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_64.c	2010-07-14 08:02:11.000000000 -0400
@@ -1,12 +1,15 @@
+#include <linux/irqflags.h>
+#include <linux/stringify.h>
 #include <asm/paravirt.h>
 #include <asm/asm-offsets.h>
-#include <linux/stringify.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
@@ -51,6 +54,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, irq_enable);
 		PATCH_SITE(pv_irq_ops, irq_disable);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
 		PATCH_SITE(pv_cpu_ops, usergs_sysret32);
 		PATCH_SITE(pv_cpu_ops, usergs_sysret64);
Index: linux.trees.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_32.c	2010-07-14 08:02:11.000000000 -0400
@@ -113,6 +113,7 @@ void foo(void)
 	OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
 	OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
 #endif
Index: linux.trees.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_64.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_64.c	2010-07-14 08:02:11.000000000 -0400
@@ -58,6 +58,7 @@ int main(void)
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
 	OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
 	OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
Index: linux.trees.git/arch/x86/xen/enlighten.c
===================================================================
--- linux.trees.git.orig/arch/x86/xen/enlighten.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/xen/enlighten.c	2010-07-14 08:02:12.000000000 -0400
@@ -953,6 +953,7 @@ static const struct pv_cpu_ops xen_cpu_o
 	.read_pmc = native_read_pmc,
 
 	.iret = xen_iret,
+	.nmi_return = xen_iret,
 	.irq_enable_sysexit = xen_sysexit,
 #ifdef CONFIG_X86_64
 	.usergs_sysret32 = xen_sysret32,
Index: linux.trees.git/arch/x86/kernel/vmi_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/vmi_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/vmi_32.c	2010-07-14 08:02:12.000000000 -0400
@@ -154,6 +154,8 @@ static unsigned vmi_patch(u8 type, u16 c
 					      insns, ip);
 	case PARAVIRT_PATCH(pv_cpu_ops.iret):
 		return patch_internal(VMI_CALL_IRET, len, insns, ip);
+	case PARAVIRT_PATCH(pv_cpu_ops.nmi_return):
+		return patch_internal(VMI_CALL_IRET, len, insns, ip);
 	case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit):
 		return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
 	default:
Index: linux.trees.git/arch/x86/lguest/boot.c
===================================================================
--- linux.trees.git.orig/arch/x86/lguest/boot.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/lguest/boot.c	2010-07-14 08:02:12.000000000 -0400
@@ -1270,6 +1270,7 @@ __init void lguest_init(void)
 	pv_cpu_ops.cpuid = lguest_cpuid;
 	pv_cpu_ops.load_idt = lguest_load_idt;
 	pv_cpu_ops.iret = lguest_iret;
+	pv_cpu_ops.nmi_return = lguest_iret;
 	pv_cpu_ops.load_sp0 = lguest_load_sp0;
 	pv_cpu_ops.load_tr_desc = lguest_load_tr_desc;
 	pv_cpu_ops.set_ldt = lguest_set_ldt;
Index: linux.trees.git/arch/x86/kernel/dumpstack.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/dumpstack.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/dumpstack.c	2010-07-14 08:02:12.000000000 -0400
@@ -258,6 +258,8 @@ void __kprobes oops_end(unsigned long fl
 
 	if (!signr)
 		return;
+	if (in_nmi())
+		panic("Fatal exception in non-maskable interrupt");
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
 	if (panic_on_oops)
Index: linux.trees.git/arch/x86/include/asm/paravirt_types.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt_types.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt_types.h	2010-07-14 08:02:12.000000000 -0400
@@ -181,6 +181,7 @@ struct pv_cpu_ops {
 	/* Normal iret.  Jump to this with the standard iret stack
 	   frame set up. */
 	void (*iret)(void);
+	void (*nmi_return)(void);
 
 	void (*swapgs)(void);

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
@ 2010-07-14 16:42   ` Maciej W. Rozycki
  2010-07-14 18:12     ` Mathieu Desnoyers
  2010-07-16 12:28   ` Avi Kivity
  1 sibling, 1 reply; 44+ messages in thread

From: Maciej W. Rozycki @ 2010-07-14 16:42 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> This patch makes all faults, traps and exceptions safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes

 Watch out for the RF flag too, that is not set correctly by POPFD -- that
may be important for faulting instructions that also have a hardware
breakpoint set at their address.

> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.

 What about the VM flag for VM86 tasks? It cannot be changed by POPFD
either.

 How about only using the special return path when a nested exception is
about to return to the NMI handler? You'd avoid all the odd cases then
that do not happen in the NMI context.

  Maciej

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 16:42 ` Maciej W. Rozycki
@ 2010-07-14 18:12   ` Mathieu Desnoyers
  2010-07-14 19:21     ` Maciej W. Rozycki
  0 siblings, 1 reply; 44+ messages in thread

From: Mathieu Desnoyers @ 2010-07-14 18:12 UTC (permalink / raw)
To: Maciej W. Rozycki
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

* Maciej W. Rozycki (macro@linux-mips.org) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
>
> > This patch makes all faults, traps and exceptions safe to be called from NMI
> > context *except* single-stepping, which requires iret to restore the TF (trap
> > flag) and jump to the return address in a single instruction. Sorry, no kprobes
>
>  Watch out for the RF flag too, that is not set correctly by POPFD -- that
> may be important for faulting instructions that also have a hardware
> breakpoint set at their address.
>
> > support in NMI handlers because of this limitation. This cannot be emulated
> > with popf/lret, because lret would be single-stepped. It does not apply to
> > "immediate values" because they do not use single-stepping. This code detects if
> > the TF flag is set and uses the iret path for single-stepping, even if it
> > reactivates NMIs prematurely.
>
>  What about the VM flag for VM86 tasks? It cannot be changed by POPFD
> either.
>
>  How about only using the special return path when a nested exception is
> about to return to the NMI handler? You'd avoid all the odd cases then
> that do not happen in the NMI context.

This is exactly what this patch does :-)

It selects the return path with

+	testl $NMI_MASK,TI_preempt_count(%ebp)
+	jz resume_kernel		/* Not nested over NMI ? */

In addition, about int3 breakpoint use in the kernel, AFAIK the handler does
not explicitly set the RF flag, and the breakpoint instruction (int3) appears
not to set it. (from my understanding of Intel's
Intel Architecture Software Developer's Manual Volume 3: System Programming,
15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

So it should be safe to set an int3 breakpoint in an NMI handler with this
patch.

It's just the "single-stepping" feature of kprobes which is problematic.
Luckily, only int3 is needed for code patching bypass.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
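The return-path selection described above can be modelled in a few lines of plain user-space C. This is a sketch, not the patch's assembly: the bit layout mirrors the `preempt_count`/`NMI_MASK` convention of kernels of that era, but the exact shift values here are illustrative assumptions.

```c
#include <assert.h>

/* Assumed preempt_count layout (illustrative): 8 preempt bits,
 * 8 softirq bits, 10 hardirq bits, then a single NMI bit. */
#define HARDIRQ_SHIFT	16
#define HARDIRQ_BITS	10
#define NMI_SHIFT	(HARDIRQ_SHIFT + HARDIRQ_BITS)
#define NMI_MASK	(1UL << NMI_SHIFT)

#define X86_EFLAGS_TF	0x100UL	/* trap flag */

/* Decide how a trap/fault handler should return, following the logic
 * discussed in the thread: plain iret unless we are nested over an NMI,
 * and even then iret if TF is set, because popf/ret cannot single-step
 * correctly (the ret itself would be single-stepped). */
static int must_use_iret(unsigned long preempt_count, unsigned long eflags)
{
	if (!(preempt_count & NMI_MASK))
		return 1;		/* not nested over an NMI */
	if (eflags & X86_EFLAGS_TF)
		return 1;		/* single-stepping really needs iret */
	return 0;			/* take the NMI-safe popf/ret path */
}
```

A fault taken inside an NMI with TF clear is the only case that takes the NMI-safe path; every other return behaves exactly as before.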
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 18:12 ` Mathieu Desnoyers
@ 2010-07-14 19:21   ` Maciej W. Rozycki
  2010-07-14 19:58     ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread

From: Maciej W. Rozycki @ 2010-07-14 19:21 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> > How about only using the special return path when a nested exception is
> > about to return to the NMI handler? You'd avoid all the odd cases then
> > that do not happen in the NMI context.
>
> This is exactly what this patch does :-)

 Ah, OK then -- I understood you actually tested the value of TF in the
image to be restored.

> It selects the return path with
>
> +	testl $NMI_MASK,TI_preempt_count(%ebp)
> +	jz resume_kernel		/* Not nested over NMI ? */
>
> In addition, about int3 breakpoint use in the kernel, AFAIK the handler does
> not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> not to set it. (from my understanding of Intel's
> Intel Architecture Software Developer's Manual Volume 3: System Programming,
> 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

 The CPU only sets RF itself in the image saved in certain cases -- you'd
see it set in the page fault handler for example, so that once the handler
has finished any instruction breakpoint does not hit (presumably again,
because the instruction breakpoint debug exception has the highest
priority). You mentioned the need to handle these faults.

> So it should be safe to set an int3 breakpoint in an NMI handler with this
> patch.
>
> It's just the "single-stepping" feature of kprobes which is problematic.
> Luckily, only int3 is needed for code patching bypass.

 Actually, the breakpoint exception handler should probably set RF
explicitly, but that depends on the exact debugging scenario, so I can't
comment on it further. I don't know how INT3 is used in this context, so
I'm just noting this may be a danger zone.

  Maciej

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 19:21 ` Maciej W. Rozycki
@ 2010-07-14 19:58   ` Mathieu Desnoyers
  2010-07-14 20:36     ` Maciej W. Rozycki
  0 siblings, 1 reply; 44+ messages in thread

From: Mathieu Desnoyers @ 2010-07-14 19:58 UTC (permalink / raw)
To: Maciej W. Rozycki
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

* Maciej W. Rozycki (macro@linux-mips.org) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
>
> > > How about only using the special return path when a nested exception is
> > > about to return to the NMI handler? You'd avoid all the odd cases then
> > > that do not happen in the NMI context.
> >
> > This is exactly what this patch does :-)
>
>  Ah, OK then -- I understood you actually tested the value of TF in the
> image to be restored.

It tests it too. When it detects that the return path is about to return to
an NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
really needed, because TF can only single-step an instruction when set by
"iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
follows popf. Anyway, single-stepping is really discouraged in NMI handlers,
because there is no way to go around the iret.

> > It selects the return path with
> >
> > +	testl $NMI_MASK,TI_preempt_count(%ebp)
> > +	jz resume_kernel		/* Not nested over NMI ? */
> >
> > In addition, about int3 breakpoint use in the kernel, AFAIK the handler does
> > not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> > not to set it. (from my understanding of Intel's
> > Intel Architecture Software Developer's Manual Volume 3: System Programming,
> > 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)
>
>  The CPU only sets RF itself in the image saved in certain cases -- you'd
> see it set in the page fault handler for example, so that once the handler
> has finished any instruction breakpoint does not hit (presumably again,
> because the instruction breakpoint debug exception has the highest
> priority). You mentioned the need to handle these faults.

Well, the only case where I think it might make sense to allow a breakpoint in
NMI handler code would be to temporarily replace a static branch, which should
in no way be able to trigger any other fault.

> > So it should be safe to set an int3 breakpoint in an NMI handler with this
> > patch.
> >
> > It's just the "single-stepping" feature of kprobes which is problematic.
> > Luckily, only int3 is needed for code patching bypass.
>
>  Actually, the breakpoint exception handler should probably set RF
> explicitly, but that depends on the exact debugging scenario, so I can't
> comment on it further. I don't know how INT3 is used in this context, so
> I'm just noting this may be a danger zone.

In the case of temporary bypass, the int3 is only there to divert the
instruction execution flow to somewhere else, and we come back to the original
code at the address following the instruction which has the breakpoint. So
basically, we never come back to the original instruction, ever. We might as
well just clear the RF flag from the EFLAGS image before popf.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 19:58 ` Mathieu Desnoyers
@ 2010-07-14 20:36   ` Maciej W. Rozycki
  0 siblings, 0 replies; 44+ messages in thread

From: Maciej W. Rozycki @ 2010-07-14 20:36 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> It tests it too. When it detects that the return path is about to return to
> an NMI handler, it checks if the TF flag is set. If it is set, then "iret" is
> really needed, because TF can only single-step an instruction when set by
> "iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
> follows popf. Anyway, single-stepping is really discouraged in NMI handlers,
> because there is no way to go around the iret.

 Hmm, with Pentium Pro and more recent processors there is actually a nasty
hack that will let you get away with POPF/RET and TF set. ;) You can try it
if you like and can arrange for an appropriate scenario.

> In the case of temporary bypass, the int3 is only there to divert the
> instruction execution flow to somewhere else, and we come back to the original
> code at the address following the instruction which has the breakpoint. So
> basically, we never come back to the original instruction, ever. We might as
> well just clear the RF flag from the EFLAGS image before popf.

 Yes, if you return to elsewhere, then that's actually quite desirable IMHO.
This RF flag is quite complicated to handle and there are some errata
involved too. If I understand it correctly, all fault-class exception
handlers are expected to set it manually in the image to be restored if they
return to the original faulting instruction (that includes the debug
exception handler if it was invoked as a fault, i.e. in response to an
instruction breakpoint). Then all trap-class exception handlers are expected
to clear the flag (and that includes the debug exception handler if it was
invoked as a trap, e.g. in response to a data breakpoint or a single step).
I haven't checked if Linux gets these bits right, but it may be worth doing
so.

 For the record -- GDB hardly cares, because it removes any instruction
breakpoints before it is asked to resume execution of an instruction that
has a breakpoint set, single-steps the instruction with all the other
threads locked out, and then reinserts the breakpoints so that they can hit
again. Then it proceeds with whatever should be done next to fulfil the
execution request.

  Maciej

^ permalink raw reply	[flat|nested] 44+ messages in thread
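Maciej's fault-class versus trap-class rule for RF can be written down as a tiny helper over the saved EFLAGS image. This is a user-space sketch: the constant is the architectural RF bit, but the helper and its name are illustrative, not kernel code.

```c
#include <assert.h>

#define X86_EFLAGS_RF	0x10000UL	/* resume flag, bit 16 of EFLAGS */

/* Fault-class handlers returning to the faulting instruction set RF in
 * the saved image so a pending instruction breakpoint does not hit
 * again on resume; trap-class handlers (data breakpoint, single step)
 * clear it. */
static unsigned long fixup_rf(unsigned long eflags, int fault_class)
{
	if (fault_class)
		return eflags | X86_EFLAGS_RF;
	return eflags & ~X86_EFLAGS_RF;
}
```

The point of the sketch is only that the decision depends on the class of the exception, not on which handler happens to be running; other bits in the image pass through untouched.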
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
  2010-07-14 16:42 ` Maciej W. Rozycki
@ 2010-07-16 12:28   ` Avi Kivity
  2010-07-16 14:49     ` Mathieu Desnoyers
  1 sibling, 1 reply; 44+ messages in thread

From: Avi Kivity @ 2010-07-16 12:28 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
> Implements an alternative iret with popf and return so trap and exception
> handlers can return to the NMI handler without issuing iret. iret would cause
> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has
> to copy the return instruction pointer to the top of the previous stack,
> issue a popf, load the previous esp and issue a near return (ret).
>
> It allows placing dynamically patched static jumps in asm gotos, which will be
> used for optimized tracepoints, in NMI code since returning from a breakpoint
> would be valid. Accessing vmalloc'd memory, which allows executing module code
> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
> This is very useful to tracers like LTTng.
>
> This patch makes all faults, traps and exceptions safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes
> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.

You need to save/restore cr2 in addition, otherwise the following hits you:

- page fault
- processor writes cr2, enters fault handler
- nmi
- page fault
- cr2 overwritten

I guess you would usually not notice the corruption since you'd just see
a spurious fault on the page the NMI handler touched, but if the first
fault happened in a kvm guest, then we'd corrupt the guest's cr2.

But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
wrong with using a per-cpu pointer to a kmalloc() area?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 44+ messages in thread
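The clobber sequence Avi describes can be modelled with an ordinary variable standing in for cr2. This is a toy user-space model, not kernel code; the function names are illustrative.

```c
#include <assert.h>

static unsigned long cr2;		/* stands in for the real %cr2 */

/* On a page fault the processor writes the faulting address to cr2
 * before the handler runs; modelled here as a plain store. */
static void take_page_fault(unsigned long addr)
{
	cr2 = addr;
}

/* An NMI handler that may itself fault. Without the save/restore, the
 * interrupted fault handler later reads the NMI's fault address
 * instead of its own. */
static void nmi_handler(unsigned long nmi_fault_addr, int save_restore)
{
	unsigned long saved_cr2 = cr2;	/* save on NMI entry */

	take_page_fault(nmi_fault_addr);	/* nested fault inside NMI */

	if (save_restore)
		cr2 = saved_cr2;	/* proposed fix: restore on exit */
}
```

With `save_restore` off, the outer fault's address is lost exactly as in the kvm-guest scenario above; with it on, the NMI is transparent to the interrupted fault handler.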
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 12:28 ` Avi Kivity
@ 2010-07-16 14:49   ` Mathieu Desnoyers
  2010-07-16 15:34     ` Andi Kleen
  2010-07-16 16:47     ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread

From: Mathieu Desnoyers @ 2010-07-16 14:49 UTC (permalink / raw)
To: Avi Kivity
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
>> Implements an alternative iret with popf and return so trap and exception
>> handlers can return to the NMI handler without issuing iret. iret would cause
>> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has
>> to copy the return instruction pointer to the top of the previous stack,
>> issue a popf, load the previous esp and issue a near return (ret).
>>
>> It allows placing dynamically patched static jumps in asm gotos, which will be
>> used for optimized tracepoints, in NMI code since returning from a breakpoint
>> would be valid. Accessing vmalloc'd memory, which allows executing module code
>> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
>> This is very useful to tracers like LTTng.
>>
>> This patch makes all faults, traps and exceptions safe to be called from NMI
>> context *except* single-stepping, which requires iret to restore the TF (trap
>> flag) and jump to the return address in a single instruction. Sorry, no kprobes
>> support in NMI handlers because of this limitation. This cannot be emulated
>> with popf/lret, because lret would be single-stepped. It does not apply to
>> "immediate values" because they do not use single-stepping. This code detects if
>> the TF flag is set and uses the iret path for single-stepping, even if it
>> reactivates NMIs prematurely.
>
> You need to save/restore cr2 in addition, otherwise the following hits you:
>
> - page fault
> - processor writes cr2, enters fault handler
> - nmi
> - page fault
> - cr2 overwritten
>
> I guess you would usually not notice the corruption since you'd just see
> a spurious fault on the page the NMI handler touched, but if the first
> fault happened in a kvm guest, then we'd corrupt the guest's cr2.

OK, just to make sure: you mean we'd have to save/restore the cr2 register
at the beginning/end of the NMI handler execution, right? Then shouldn't we
save/restore cr3 too?

> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
> wrong with using a per-cpu pointer to a kmalloc() area?

Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
much more than perf) can potentially cause large latencies, which could be
squashed by allowing page faults in NMI handlers. This looks like a stronger
argument to me.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 14:49 ` Mathieu Desnoyers
@ 2010-07-16 15:34   ` Andi Kleen
  2010-07-16 15:40     ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread

From: Andi Kleen @ 2010-07-16 15:34 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Avi Kivity, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
    Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
    Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
    Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
    Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be

You need to fix all other code too that walks task lists to avoid all those.

% gid for_each_process | wc -l

In fact the mm-struct walk is cheaper than a task-list walk because there
are always fewer MMs than tasks.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 15:34 ` Andi Kleen
@ 2010-07-16 15:40   ` Mathieu Desnoyers
  0 siblings, 0 replies; 44+ messages in thread

From: Mathieu Desnoyers @ 2010-07-16 15:40 UTC (permalink / raw)
To: Andi Kleen
Cc: Avi Kivity, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
    Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
    Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
    Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
    Tom Zanussi, KOSAKI Motohiro, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

* Andi Kleen (andi@firstfloor.org) wrote:
> > Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> > much more than perf) can potentially cause large latencies, which could be
>
> You need to fix all other code too that walks task lists to avoid all those.
>
> % gid for_each_process | wc -l

This can very well be done incrementally. And I agree, these should eventually
be targeted too, especially those which hold locks. We've already started
hearing about tasklist lock live-locks in the past year, so I think we're
pretty much at the point where it should be looked at.

Thanks,

Mathieu

> In fact the mm-struct walk is cheaper than a task-list walk because there
> are always fewer MMs than tasks.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 14:49 ` Mathieu Desnoyers
  2010-07-16 15:34 ` Andi Kleen
@ 2010-07-16 16:47   ` Avi Kivity
  2010-07-16 16:58     ` Mathieu Desnoyers
  1 sibling, 1 reply; 44+ messages in thread

From: Avi Kivity @ 2010-07-16 16:47 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>
>> You need to save/restore cr2 in addition, otherwise the following hits you:
>>
>> - page fault
>> - processor writes cr2, enters fault handler
>> - nmi
>> - page fault
>> - cr2 overwritten
>>
>> I guess you would usually not notice the corruption since you'd just see
>> a spurious fault on the page the NMI handler touched, but if the first
>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>
> OK, just to make sure: you mean we'd have to save/restore the cr2 register
> at the beginning/end of the NMI handler execution, right?

Yes.

> Then shouldn't we save/restore cr3 too?

No, faults should not change cr3.

>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>> wrong with using a per-cpu pointer to a kmalloc() area?
>
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be
> squashed by allowing page faults in NMI handlers. This looks like a stronger
> argument to me.

Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
which cannot take vmalloc faults, why bother? If not, why not?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 16:47 ` Avi Kivity
@ 2010-07-16 16:58   ` Mathieu Desnoyers
  2010-07-16 17:54     ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread

From: Mathieu Desnoyers @ 2010-07-16 16:58 UTC (permalink / raw)
To: Avi Kivity
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>>
>>> You need to save/restore cr2 in addition, otherwise the following hits you:
>>>
>>> - page fault
>>> - processor writes cr2, enters fault handler
>>> - nmi
>>> - page fault
>>> - cr2 overwritten
>>>
>>> I guess you would usually not notice the corruption since you'd just see
>>> a spurious fault on the page the NMI handler touched, but if the first
>>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>>
>> OK, just to make sure: you mean we'd have to save/restore the cr2 register
>> at the beginning/end of the NMI handler execution, right?
>
> Yes.

OK

>> Then shouldn't we save/restore cr3 too?
>
> No, faults should not change cr3.

Ah, right.

>>> But the whole thing strikes me as overkill. If it's 8k per-cpu, what's
>>> wrong with using a per-cpu pointer to a kmalloc() area?
>>>
>> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
>> much more than perf) can potentially cause large latencies, which could be
>> squashed by allowing page faults in NMI handlers. This looks like a stronger
>> argument to me.
>
> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
> which cannot take vmalloc faults, why bother? If not, why not?

Modules come as yet another example of stuff that is loaded in vmalloc'd space
and can be accessed from NMI context. That would include oprofile, tracers, and
probably others I'm forgetting about.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 16:58 ` Mathieu Desnoyers
@ 2010-07-16 17:54   ` Avi Kivity
  2010-07-16 18:05     ` H. Peter Anvin
  0 siblings, 1 reply; 44+ messages in thread

From: Avi Kivity @ 2010-07-16 17:54 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
    Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
    Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
    Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
    KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>
>> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
>> which cannot take vmalloc faults, why bother? If not, why not?
>
> Modules come as yet another example of stuff that is loaded in vmalloc'd space
> and can be accessed from NMI context. That would include oprofile, tracers, and
> probably others I'm forgetting about.

Module loading can certainly take a vmalloc_sync_all() (though I agree
it's unpleasant). Anything else?

Note perf is not modular at this time, but could be made so with
preempt/sched notifiers to hook the context switch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 17:54 ` Avi Kivity
@ 2010-07-16 18:05   ` H. Peter Anvin
  2010-07-16 18:15     ` Avi Kivity
    ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread

From: H. Peter Anvin @ 2010-07-16 18:05 UTC (permalink / raw)
To: Avi Kivity
Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
    Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
    Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
    Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
    Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:54 AM, Avi Kivity wrote:
> On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>>
>>> Why is that kernel code calling vmalloc_sync_all()? If it is only NMI
>>> which cannot take vmalloc faults, why bother? If not, why not?
>>
>> Modules come as yet another example of stuff that is loaded in vmalloc'd space
>> and can be accessed from NMI context. That would include oprofile, tracers, and
>> probably others I'm forgetting about.
>
> Module loading can certainly take a vmalloc_sync_all() (though I agree
> it's unpleasant). Anything else?
>
> Note perf is not modular at this time, but could be made so with
> preempt/sched notifiers to hook the context switch.

Actually, module loading is already a performance problem; a lot of
distros load sometimes hundreds of modules on startup, and it's heavily
serialized, so I can see this being desirable to skip.

I really hope no one ever gets the idea of touching user space from an
NMI handler, though, and expecting it to work...

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05 ` H. Peter Anvin
@ 2010-07-16 18:15   ` Avi Kivity
  2010-07-16 18:17     ` H. Peter Anvin
    ` (2 more replies)
  2010-07-16 19:28   ` Andi Kleen
  2010-08-04  9:46   ` Peter Zijlstra
  2 siblings, 3 replies; 44+ messages in thread

From: Avi Kivity @ 2010-07-16 18:15 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
    Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
    Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
    Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
    Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm,
    Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>
>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>> it's unpleasant). Anything else?
>>
>> Note perf is not modular at this time, but could be made so with
>> preempt/sched notifiers to hook the context switch.
>
> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily
> serialized, so I can see this being desirable to skip.

There aren't that many processes at this time (or there shouldn't be,
don't know how fork-happy udev is at this stage), so the sync should be
pretty fast. In any case, we can sync only modules that contain NMI
handlers.

> I really hope no one ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

I think the concern here is about an NMI handler's code running in
vmalloc space, or is it something else?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:15 ` Avi Kivity @ 2010-07-16 18:17 ` H. Peter Anvin 2010-07-16 18:28 ` Avi Kivity 2010-07-16 18:22 ` Mathieu Desnoyers 2010-07-16 18:25 ` Linus Torvalds 2 siblings, 1 reply; 44+ messages in thread From: H. Peter Anvin @ 2010-07-16 18:17 UTC (permalink / raw) To: Avi Kivity Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 11:15 AM, Avi Kivity wrote: > > There aren't that many processes at this time (or there shouldn't be, > don't know how fork-happy udev is at this stage), so the sync should be > pretty fast. In any case, we can sync only modules that contain NMI > handlers. > >> I really hope noone ever gets the idea of touching user space from an >> NMI handler, though, and expecting it to work... >> > > I think the concern here is about an NMI handler's code running in > vmalloc space, or is it something else? > Code or data, yes; including percpu data. -hpa ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:17 ` H. Peter Anvin @ 2010-07-16 18:28 ` Avi Kivity 2010-07-16 18:37 ` Linus Torvalds 0 siblings, 1 reply; 44+ messages in thread From: Avi Kivity @ 2010-07-16 18:28 UTC (permalink / raw) To: H. Peter Anvin Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 09:17 PM, H. Peter Anvin wrote: > >> I think the concern here is about an NMI handler's code running in >> vmalloc space, or is it something else? >> >> > Code or data, yes; including percpu data. > Use kmalloc and percpu pointers, it's not that onerous. Oh, and you can access vmalloc space by switching cr3 temporarily to init_mm's, no? Obviously not a very performant solution, at least without PCIDs, but can be used if needed. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:28 ` Avi Kivity @ 2010-07-16 18:37 ` Linus Torvalds 2010-07-16 19:26 ` Avi Kivity 0 siblings, 1 reply; 44+ messages in thread From: Linus Torvalds @ 2010-07-16 18:37 UTC (permalink / raw) To: Avi Kivity Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity <avi@redhat.com> wrote: > > Use kmalloc and percpu pointers, it's not that onerous. What people don't seem to understand is that WE SHOULD NOT MAKE NMI FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING WHAT-SO-EVER TO DO WITH NMI. I'm shouting, because this point seems to have been continually missed. It was missed in the original patches, and it's been missed in the discussions. Non-NMI code should simply never have to even _think_ about NMI's. Why should it? It should just do whatever comes "natural" within its own context. This is why I've been pushing for the "let's just fix NMI" approach. Not adding random hacks to other code sequences that have nothing what-so-ever to do with NMI. So don't add NMI code to the page fault code. Not to the debug code, or to the module loading code. Don't say "use special allocations because the NMI code may care about these particular data structures". Because that way lies crap and unmaintainability. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:37 ` Linus Torvalds @ 2010-07-16 19:26 ` Avi Kivity 2010-07-16 21:39 ` Linus Torvalds 0 siblings, 1 reply; 44+ messages in thread From: Avi Kivity @ 2010-07-16 19:26 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 09:37 PM, Linus Torvalds wrote: > On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity<avi@redhat.com> wrote: > >> Use kmalloc and percpu pointers, it's not that onerous. >> > What people don't seem to understand is that WE SHOULD NOT MAKE NMI > FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING > WHAT-SO-EVER TO DO WITH NMI. > > I'm shouting, because this point seems to have been continually > missed. It was missed in the original patches, and it's been missed in > the discussions. > > Non-NMI code should simply never have to even _think_ about NMI's. Why > should it? It should just do whatever comes "natural" within its own > context. > > But we're not talking about non-NMI code. The 8k referred to in the original patch are buffers used by NMI stack recording. The module-code vmalloc_sync_all() is only needed by code that is executed during NMI, hence must be NMI aware. > This is why I've been pushing for the "let's just fix NMI" approach. > Not adding random hacks to other code sequences that have nothing > what-so-ever to do with NMI. > "fixing NMI" will result in code that is understandable by maybe three people after long and hard thinking. NMI can happen in too many semi-defined contexts, so there will be plenty of edge cases. I'm not sure we can ever trust such trickery. > So don't add NMI code to the page fault code. 
Not to the debug code, > or to the module loading code. Don't say "use special allocations > because the NMI code may care about these particular data structures". > Because that way lies crap and unmaintainability. > If NMI code can call random hooks and access random data, yes. But I don't think we're at that point yet. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 19:26 ` Avi Kivity @ 2010-07-16 21:39 ` Linus Torvalds 2010-07-16 22:07 ` Andi Kleen 2010-07-18 9:23 ` Avi Kivity 0 siblings, 2 replies; 44+ messages in thread From: Linus Torvalds @ 2010-07-16 21:39 UTC (permalink / raw) To: Avi Kivity Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity <avi@redhat.com> wrote: > On 07/16/2010 09:37 PM, Linus Torvalds wrote: >> >> Non-NMI code should simply never have to even _think_ about NMI's. Why >> should it? It should just do whatever comes "natural" within its own >> context. > > But we're not talking about non-NMI code. Yes, we are. We're talking about breakpoints (look at the subject line), and you are very much talking about things like that _idiotic_ vmalloc_sync_all() by module loading code etc etc. Every _single_ "solution" I have seen - apart from my suggestion - has been about making code "special" because some other code might run in an NMI. Module init sequences having to do idiotic things just because they have data structures that might get accessed by NMI. And the thing is, if we just do NMI's correctly, and allow nesting, ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do stupid things elsewhere. In other words, why the hell are you arguing? Help Mathieu write the low-level NMI handler right, and remove that idiotic "vmalloc_sync_all()" that is fundamentally broken and should not exist. Rather than talk about adding more of that kind of crap. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 21:39 ` Linus Torvalds @ 2010-07-16 22:07 ` Andi Kleen 2010-07-16 22:26 ` Linus Torvalds 2010-07-16 22:40 ` Mathieu Desnoyers 2010-07-18 9:23 ` Avi Kivity 1 sibling, 2 replies; 44+ messages in thread From: Andi Kleen @ 2010-07-16 22:07 UTC (permalink / raw) To: Linus Torvalds Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler > And the thing is, if we just do NMI's correctly, and allow nesting, > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do > stupid things elsewhere. One issue I have with nesting NMIs is that you need a nesting limit, otherwise you'll overflow the NMI stack. We just got rid of nesting for normal interrupts because of this stack overflow problem which hit in real situations. In some cases you can get quite high NMI frequencies, e.g. with performance counters. Now the current performance counter handlers do not nest by themselves of course, but they might nest with other longer running NMI users. I think none of the current handlers are likely to nest for very long, but there's more and more NMI code all the time, so it's definitely a concern. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 22:07 ` Andi Kleen @ 2010-07-16 22:26 ` Linus Torvalds 2010-07-16 22:41 ` Andi Kleen 2010-07-16 22:40 ` Mathieu Desnoyers 1 sibling, 1 reply; 44+ messages in thread From: Linus Torvalds @ 2010-07-16 22:26 UTC (permalink / raw) To: Andi Kleen Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <andi@firstfloor.org> wrote: > > One issue I have with nesting NMIs is that you need > a nesting limit, otherwise you'll overflow the NMI stack. Have you actually looked at the suggestion I (and now Mathieu) suggested code for? The nesting is very limited. NMI's would nest just once, and when that happens, the nested NMI would never use more than something like a hundred bytes of stack (most of which is what the CPU pushes directly). And there would be no device interrupts that nest, and practically the faults that nest obviously aren't going to be complex faults either (ie the page fault would be the simple case that never calls to 'handle_vm_fault()', but handles it all in arch/x86/mm/fault.c. IOW, there is absolutely _no_ issues with nesting. It's two levels deep, and a much smaller stack footprint than our regular exception nesting for those two levels too. And your argument that there would be more and more NMI usage only makes it more important that we handle NMI's without going crazy. Just handle them cleanly instead of making them something totally special. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 22:26 ` Linus Torvalds @ 2010-07-16 22:41 ` Andi Kleen 2010-07-17 1:15 ` Linus Torvalds 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2010-07-16 22:41 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 03:26:32PM -0700, Linus Torvalds wrote: > On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <andi@firstfloor.org> wrote: > > > > One issue I have with nesting NMIs is that you need > > a nesting limit, otherwise you'll overflow the NMI stack. > > Have you actually looked at the suggestion I (and now Mathieu) > suggested code for? Maybe I'm misunderstanding everything (and it has been a lot of emails in the thread), but the case I was thinking of would be if the second NMI faults too, and then another one comes in after the IRET etc. -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 22:41 ` Andi Kleen @ 2010-07-17 1:15 ` Linus Torvalds 0 siblings, 0 replies; 44+ messages in thread From: Linus Torvalds @ 2010-07-17 1:15 UTC (permalink / raw) To: Andi Kleen Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 3:41 PM, Andi Kleen <andi@firstfloor.org> wrote: > > Maybe I'm misunderstanding everything (and it has been a lot of emails > in the thread), but the case I was thinking of would be if the second NMI > faults too, and then another one comes in after the IRET etc. No, the nested NMI cannot fault, because it never even enters C code. It literally just returns immediately after having noticed it is nested (and corrupted the stack of the original one, so that the original NMI will re-do itself at return).. So the nested NMI will use some few tens of bytes of stack. In fact, it will use the stack "above" the stack that the original NMI handler is using, because it will reset the stack pointer back to the top of the NMI stack. So in a very real sense, it is not even extending the stack, it is just re-using a small part of the same stack that the original NMI used (and that we copied away so that it doesn't matter that it gets re-used) As to another small but important detail: the _nested_ NMI actually returns using "popf+ret", leaving NMI's blocked again. Thus guaranteeing forward progress and lack of NMI storms. To summarize: - the "original" (first-level) NMI can take faults (like the page fault to fill in vmalloc pages lazily, or debug faults). That will actually cause two stack frames (or three, if you debug a page fault that happened while NMI was active). 
So there is certainly exception nesting going on, but we're talking _much_ less stack than normal stack usage where the nesting can be deep and in complex routines. - any "nested" NMI's will not actually use any more stack at all than a non-nested one, because we've pre-reserved space for them (and we _had_ to reserve space for them due to IST) - even if we get NMI's during the execution of the original NMI, there can be only one such "spurious" NMI per nested exception. So if we take a single page fault, that exception will re-enable NMI (because it returns with "iret"), and as a result we may take a _single_ new nested NMI until we disable NMI's again. In other words, the approach is not all that different from doing "lazy irq disable" like powerpc does for regular interrupts. For NMI's, we do it because it's impossible (on x86) to disable NMI's without actually taking one. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 22:07 ` Andi Kleen 2010-07-16 22:26 ` Linus Torvalds @ 2010-07-16 22:40 ` Mathieu Desnoyers 1 sibling, 0 replies; 44+ messages in thread From: Mathieu Desnoyers @ 2010-07-16 22:40 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Avi Kivity, H. Peter Anvin, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler * Andi Kleen (andi@firstfloor.org) wrote: > > And the thing is, if we just do NMI's correctly, and allow nesting, > > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do > > stupid things elsewhere. > > One issue I have with nesting NMIs is that you need > a nesting limit, otherwise you'll overflow the NMI stack. > > We just got rid of nesting for normal interrupts because > of this stack overflow problem which hit in real situations. > > In some cases you can get quite high NMI frequencies, e.g. with > performance counters. Now the current performance counter handlers > do not nest by themselves of course, but they might nest > with other longer running NMI users. > > I think none of the current handlers are likely to nest > for very long, but there's more and more NMI coded all the time, > so it's definitely a concern. We're not proposing to actually "nest" NMIs per se. We copy the stack at the beginning of the NMI handler (and then use the copy) to permit nesting of faults over NMI handlers. Following NMIs that would "try" to nest over the NMI handler would see their regular execution postponed until the end of the currently running NMI handler. It's OK for these "nested" NMI handlers to use the bottom of NMI stack because the NMI handler on which they are trying to nest is only using the stack copy. 
These "nested" handlers return to the original NMI handler very early just after setting a "pending nmi" flag. There is more to it (e.g. handling NMI handler exit atomically with respect to incoming NMIs); please refer to the last assembly code snippet I sent to Linus a little earlier today for details. Thanks, Mathieu > > -Andi > > -- > ak@linux.intel.com -- Speaking for myself only. -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 21:39 ` Linus Torvalds 2010-07-16 22:07 ` Andi Kleen @ 2010-07-18 9:23 ` Avi Kivity 1 sibling, 0 replies; 44+ messages in thread From: Avi Kivity @ 2010-07-18 9:23 UTC (permalink / raw) To: Linus Torvalds Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/17/2010 12:39 AM, Linus Torvalds wrote: > On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity<avi@redhat.com> wrote: > >> On 07/16/2010 09:37 PM, Linus Torvalds wrote: >> >>> Non-NMI code should simply never have to even _think_ about NMI's. Why >>> should it? It should just do whatever comes "natural" within its own >>> context. >>> >> But we're not talking about non-NMI code. >> > Yes, we are. We're talking about breakpoints (look at the subject > line), and you are very much talking about things like that _idiotic_ > vmalloc_sync_all() by module loading code etc etc. > Well, I'd put it in the nmi handler registration code, but you're right. A user placing breakpoints can't even tell whether the breakpoint will be hit by NMI code, especially data breakpoints. > Every _single_ "solution" I have seen - apart from my suggestion - has > been about making code "special" because some other code might run in > an NMI. Module init sequences having to do idiotic things just because > they have data structures that might get accessed by NMI. > > And the thing is, if we just do NMI's correctly, and allow nesting, > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do > stupid things elsewhere. > > In other words, why the hell are you arguing? 
Help Mathieu write the > low-level NMI handler right, and remove that idiotic > "vmalloc_sync_all()" that is fundamentally broken and should not > exist. Rather than talk about adding more of that kind of crap. > Well, at least we'll get a good test case for kvm's nmi blocking emulation (it's tricky since if we fault on an iret sometimes nmis get unblocked even though the instruction did not complete). -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:15 ` Avi Kivity 2010-07-16 18:17 ` H. Peter Anvin @ 2010-07-16 18:22 ` Mathieu Desnoyers 2010-07-16 18:32 ` Avi Kivity 2010-07-16 18:25 ` Linus Torvalds 2 siblings, 1 reply; 44+ messages in thread From: Mathieu Desnoyers @ 2010-07-16 18:22 UTC (permalink / raw) To: Avi Kivity Cc: H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler * Avi Kivity (avi@redhat.com) wrote: > On 07/16/2010 09:05 PM, H. Peter Anvin wrote: >> >>> Module loading can certainly take a vmalloc_sync_all() (though I agree >>> it's unpleasant). Anything else? >>> >>> Note perf is not modular at this time, but could be made so with >>> preempt/sched notifiers to hook the context switch. >>> >>> >> Actually, module loading is already a performance problem; a lot of >> distros load sometimes hundreds of modules on startup, and it's heavily >> serialized, so I can see this being desirable to skip. >> > > There aren't that many processes at this time (or there shouldn't be, > don't know how fork-happy udev is at this stage), so the sync should be > pretty fast. In any case, we can sync only modules that contain NMI > handlers. USB hotplug is a use-case happening randomly after the system is well there and running; I'm afraid this does not fit in your module loading expectations. It triggers tons of events, many of these actually load modules. Thanks, Mathieu > >> I really hope noone ever gets the idea of touching user space from an >> NMI handler, though, and expecting it to work... >> > > I think the concern here is about an NMI handler's code running in > vmalloc space, or is it something else? 
> > -- > I have a truly marvellous patch that fixes the bug which this > signature is too narrow to contain. > -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:22 ` Mathieu Desnoyers @ 2010-07-16 18:32 ` Avi Kivity 2010-07-16 19:29 ` H. Peter Anvin 2010-07-16 19:32 ` Andi Kleen 0 siblings, 2 replies; 44+ messages in thread From: Avi Kivity @ 2010-07-16 18:32 UTC (permalink / raw) To: Mathieu Desnoyers Cc: H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote: > >> There aren't that many processes at this time (or there shouldn't be, >> don't know how fork-happy udev is at this stage), so the sync should be >> pretty fast. In any case, we can sync only modules that contain NMI >> handlers. >> > USB hotplug is a use-case happening randomly after the system is well there and > running; I'm afraid this does not fit in your module loading expectations. It > triggers tons of events, many of these actually load modules. > How long would vmalloc_sync_all() take with a few thousand mm_structs? We share the pmds, yes? So it's a few thousand memory accesses. The direct impact is probably negligible, compared to actually loading the module from disk. All we need is to make sure the locking doesn't slow down unrelated stuff. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:32 ` Avi Kivity @ 2010-07-16 19:29 ` H. Peter Anvin 2010-07-16 19:39 ` Avi Kivity 2010-07-16 19:32 ` Andi Kleen 1 sibling, 1 reply; 44+ messages in thread From: H. Peter Anvin @ 2010-07-16 19:29 UTC (permalink / raw) To: Avi Kivity Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 11:32 AM, Avi Kivity wrote: > > How long would vmalloc_sync_all take with a few thousand mm_struct take? > > We share the pmds, yes? So it's a few thousand memory accesses. The > direct impact is probably negligible, compared to actually loading the > module from disk. All we need is to make sure the locking doesn't slow > down unrelated stuff. > It's not the memory accesses, it's the need to synchronize all the CPUs. -hpa ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 19:29 ` H. Peter Anvin @ 2010-07-16 19:39 ` Avi Kivity 0 siblings, 0 replies; 44+ messages in thread From: Avi Kivity @ 2010-07-16 19:39 UTC (permalink / raw) To: H. Peter Anvin Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 10:29 PM, H. Peter Anvin wrote: > On 07/16/2010 11:32 AM, Avi Kivity wrote: > >> How long would vmalloc_sync_all take with a few thousand mm_struct take? >> >> We share the pmds, yes? So it's a few thousand memory accesses. The >> direct impact is probably negligible, compared to actually loading the >> module from disk. All we need is to make sure the locking doesn't slow >> down unrelated stuff. >> >> > It's not the memory accesses, it's the need to synchronize all the CPUs. > I'm missing something. Why do we need to sync all cpus? the vmalloc_sync_all() I'm reading doesn't. Even if we do an on_each_cpu() somewhere, it isn't the end of the world. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:32 ` Avi Kivity 2010-07-16 19:29 ` H. Peter Anvin @ 2010-07-16 19:32 ` Andi Kleen 1 sibling, 0 replies; 44+ messages in thread From: Andi Kleen @ 2010-07-16 19:32 UTC (permalink / raw) To: Avi Kivity Cc: Mathieu Desnoyers, H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 09:32:00PM +0300, Avi Kivity wrote: > On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote: > > > >>There aren't that many processes at this time (or there shouldn't be, > >>don't know how fork-happy udev is at this stage), so the sync should be > >>pretty fast. In any case, we can sync only modules that contain NMI > >>handlers. > >USB hotplug is a use-case happening randomly after the system is well there and > >running; I'm afraid this does not fit in your module loading expectations. It > >triggers tons of events, many of these actually load modules. > > How long would vmalloc_sync_all take with a few thousand mm_struct take? > > We share the pmds, yes? So it's a few thousand memory accesses. > The direct impact is probably negligible, compared to actually > loading the module from disk. All we need is to make sure the > locking doesn't slow down unrelated stuff. Also you have to remember that vmalloc_sync_all() only does something when the top level page is actually updated. That is very rare. (in many cases it should happen at most once per boot) Most mapping changes update lower levels, and those are already shared. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:15 ` Avi Kivity 2010-07-16 18:17 ` H. Peter Anvin 2010-07-16 18:22 ` Mathieu Desnoyers @ 2010-07-16 18:25 ` Linus Torvalds 2010-07-16 19:30 ` Andi Kleen 2 siblings, 1 reply; 44+ messages in thread From: Linus Torvalds @ 2010-07-16 18:25 UTC (permalink / raw) To: Avi Kivity Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <avi@redhat.com> wrote: > > I think the concern here is about an NMI handler's code running in vmalloc > space, or is it something else? I think the concern was also potentially doing things like backtraces etc that may need access to the module data structures (I think the ELF headers end up all being in vmalloc space too, for example). The whole debugging thing is also an issue. Now, I obviously am not a big fan of remote debuggers, but everybody tells me I'm wrong. And putting a breakpoint on NMI is certainly not insane if you are doing debugging in the first place. So it's not necessarily always about the page faults. Linus ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:25 ` Linus Torvalds @ 2010-07-16 19:30 ` Andi Kleen 2010-07-18 9:26 ` Avi Kivity 0 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2010-07-16 19:30 UTC (permalink / raw) To: Linus Torvalds Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 11:25:19AM -0700, Linus Torvalds wrote: > On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <avi@redhat.com> wrote: > > > > I think the concern here is about an NMI handler's code running in vmalloc > > space, or is it something else? > > I think the concern was also potentially doing things like backtraces > etc that may need access to the module data structures (I think the > ELF headers end up all being in vmalloc space too, for example). > > The whole debugging thing is also an issue. Now, I obviously am not a > big fan of remote debuggers, but everybody tells me I'm wrong. And > putting a breakpoint on NMI is certainly not insane if you are doing > debugging in the first place. So it's not necessarily always about the > page faults. We already have infrastructure for kprobes to prevent breakpoints on critical code (the __kprobes section). In principle kgdb/kdb could be taught about honoring those too. That wouldn't help for truly external JTAG debuggers, but I would assume those generally can (should) handle any contexts anyways. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 19:30 ` Andi Kleen @ 2010-07-18 9:26 ` Avi Kivity 0 siblings, 0 replies; 44+ messages in thread From: Avi Kivity @ 2010-07-18 9:26 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 10:30 PM, Andi Kleen wrote: > We already have infrastructure for kprobes to prevent breakpoints > on critical code (the __kprobes section). In principle kgdb/kdb > could be taught about honoring those too. > > It doesn't help with NMI code calling other functions, or with data breakpoints. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:05 ` H. Peter Anvin 2010-07-16 18:15 ` Avi Kivity @ 2010-07-16 19:28 ` Andi Kleen 2010-07-16 19:32 ` Avi Kivity 2010-08-04 9:46 ` Peter Zijlstra 2 siblings, 1 reply; 44+ messages in thread From: Andi Kleen @ 2010-07-16 19:28 UTC (permalink / raw) To: H. Peter Anvin Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler > Actually, module loading is already a performance problem; a lot of > distros load sometimes hundreds of modules on startup, and it's heavily On startup you don't have many processes. If there's a problem it's surely not the fault of vmalloc_sync_all(). BTW, in my experience one reason module loading was traditionally slow was that it did a stop_machine(). I think(?) that has been fixed at some point. But even with that, it's more an issue on larger systems. > I really hope noone ever gets the idea of touching user space from an > NMI handler, though, and expecting it to work... It can make sense for a backtrace in a profiler. In fact perf is nearly doing it, I believe, but moves it to the self IPI handler in most cases. -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 19:28 ` Andi Kleen @ 2010-07-16 19:32 ` Avi Kivity 2010-07-16 19:34 ` Andi Kleen 0 siblings, 1 reply; 44+ messages in thread From: Avi Kivity @ 2010-07-16 19:32 UTC (permalink / raw) To: Andi Kleen Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On 07/16/2010 10:28 PM, Andi Kleen wrote: > >> I really hope noone ever gets the idea of touching user space from an >> NMI handler, though, and expecting it to work... >> > It can make sense for a backtrace in a profiler. > > In fact perf is nearly doing it I believe, but moves > it to the self IPI handler in most cases. > Interesting, is the self IPI guaranteed to execute synchronously after the NMI's IRET? Or can the core IRET faster than the APIC and so we get the backtrace at the wrong place? (and does it matter? the NMI itself is not always accurate) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 19:32 ` Avi Kivity @ 2010-07-16 19:34 ` Andi Kleen 0 siblings, 0 replies; 44+ messages in thread From: Andi Kleen @ 2010-07-16 19:34 UTC (permalink / raw) To: Avi Kivity Cc: Andi Kleen, H. Peter Anvin, Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler On Fri, Jul 16, 2010 at 10:32:13PM +0300, Avi Kivity wrote: > On 07/16/2010 10:28 PM, Andi Kleen wrote: > > > >>I really hope noone ever gets the idea of touching user space from an > >>NMI handler, though, and expecting it to work... > >It can make sense for a backtrace in a profiler. > > > >In fact perf is nearly doing it I believe, but moves > >it to the self IPI handler in most cases. > > Interesting, is the self IPI guaranteed to execute synchronously > after the NMI's IRET? Or can the core IRET faster than the APIC and > so we get the backtrace at the wrong place? > > (and does it matter? the NMI itself is not always accurate) self ipi runs after the next STI (or POPF) -Andi -- ak@linux.intel.com -- Speaking for myself only. ^ permalink raw reply [flat|nested] 44+ messages in thread
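The self-IPI trick discussed above — do almost nothing in the NMI itself, and let the real work run once the self-IPI is delivered after the next STI or POPF — can be sketched with a userspace analogy: a signal handler (restricted in what it may safely do, like an NMI) only records that work is pending, and the work runs later from an ordinary context. This is an analogy, not kernel code; all names are made up:

```c
#include <signal.h>

static volatile sig_atomic_t work_pending;
int deferred_runs;              /* counts how often the deferred work ran */

/* NMI-like context: setting a flag is the only safe thing to do here. */
static void nmi_like_handler(int sig)
{
    (void)sig;
    work_pending = 1;
}

/* Safe context (stands in for the self-IPI handler running after STI). */
void run_pending_work(void)
{
    if (work_pending) {
        work_pending = 0;
        deferred_runs++;        /* stands in for the real, handler-unsafe work */
    }
}

/* Drives the scenario once: raise the "NMI", then let the safe
 * context pick up the deferred work; returns deferred_runs. */
int demo(void)
{
    signal(SIGUSR1, nmi_like_handler);
    raise(SIGUSR1);
    run_pending_work();
    return deferred_runs;
}
```

As Avi notes, the deferral costs accuracy: by the time the safe context runs, the interrupted state may have moved on.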
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-07-16 18:05 ` H. Peter Anvin 2010-07-16 18:15 ` Avi Kivity 2010-07-16 19:28 ` Andi Kleen @ 2010-08-04 9:46 ` Peter Zijlstra 2010-08-04 20:23 ` H. Peter Anvin 2 siblings, 1 reply; 44+ messages in thread From: Peter Zijlstra @ 2010-08-04 9:46 UTC (permalink / raw) To: H. Peter Anvin Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler, David Howells On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote: > > I really hope noone ever gets the idea of touching user space from an > NMI handler, though, and expecting it to work... Perf actually already does that to unwind user-space stacks... ;-) See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users. What we do is a manual page table walk (using __get_user_pages_fast) and simply bail when the page is not available. That said, I think that the thing that started the whole per-cpu-per-context temp stack-frame storage story also means that that function is now broken and can lead to kmap_atomic corruption. I really should brush up that stack based kmap_atomic thing, last time I got stuck on FRV wanting things. Linus should I refresh that whole series and give a FRV a slow but working implementation and then let David Howells sort out things if he cares about that? ^ permalink raw reply [flat|nested] 44+ messages in thread
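The copy_from_user_nmi() behavior Peter describes boils down to "copy page by page, and bail instead of faulting when the walk says the page is absent." The following userspace sketch keeps only that structure — a toy three-page address space and a presence table stand in for the real __get_user_pages_fast() page-table walk, and every name here is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Toy "address space": three small pages, the third unmapped. */
#define DEMO_PAGE 16UL
char demo_mem[3 * DEMO_PAGE];
static const int demo_present[3] = { 1, 1, 0 };

/* Stand-in for the page-table walk: is the page holding this
 * offset mapped? */
static int page_present(size_t off)
{
    size_t idx = off / DEMO_PAGE;
    return idx < 3 && demo_present[idx];
}

/* Copy up to n bytes starting at src_off, bailing (not faulting)
 * at the first unmapped page; returns the number of bytes copied. */
size_t copy_bail(char *dst, size_t src_off, size_t n)
{
    size_t copied = 0;
    while (copied < n) {
        size_t off = src_off + copied;
        if (off >= sizeof(demo_mem) || !page_present(off))
            break;                       /* would have faulted: stop here */
        /* Copy at most to the end of the current page. */
        size_t in_page = DEMO_PAGE - off % DEMO_PAGE;
        size_t chunk = in_page < n - copied ? in_page : n - copied;
        memcpy(dst + copied, demo_mem + off, chunk);
        copied += chunk;
    }
    return copied;
}

/* Runs one copy scenario and reports how many bytes made it. */
size_t demo_copy(size_t src_off, size_t n)
{
    char buf[3 * DEMO_PAGE];
    if (n > sizeof(buf))
        n = sizeof(buf);
    return copy_bail(buf, src_off, n);
}
```

A partial result is fine for a profiler backtrace: the unwinder simply gets a shorter stack when a page is swapped out.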
* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault 2010-08-04 9:46 ` Peter Zijlstra @ 2010-08-04 20:23 ` H. Peter Anvin 0 siblings, 0 replies; 44+ messages in thread From: H. Peter Anvin @ 2010-08-04 20:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Steven Rostedt, Steven Rostedt, Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler, David Howells On 08/04/2010 02:46 AM, Peter Zijlstra wrote: > On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote: >> >> I really hope noone ever gets the idea of touching user space from an >> NMI handler, though, and expecting it to work... > > Perf actually already does that to unwind user-space stacks... ;-) > > See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users. > > What we do is a manual page table walk (using __get_user_pages_fast) and > simply bail when the page is not available. > That's not really "touching user space", though. -hpa ^ permalink raw reply [flat|nested] 44+ messages in thread