linux-kernel.vger.kernel.org archive mirror
* [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
@ 2010-07-14 15:49 ` Mathieu Desnoyers
  2010-07-14 16:42   ` Maciej W. Rozycki
  2010-07-16 12:28   ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 15:49 UTC (permalink / raw)
  To: LKML
  Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Mathieu Desnoyers, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Mathieu Desnoyers, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

[-- Attachment #1: x86-nmi-safe-int3-and-page-fault.patch --]
[-- Type: text/plain, Size: 21100 bytes --]

Implements an alternative to iret, built from popf and a return, so trap and
exception handlers can return to the NMI handler without issuing iret. iret
would cause NMIs to be re-enabled prematurely. x86_32 uses popf and a far
return. x86_64 has to copy the return instruction pointer to the top of the
previous stack, issue a popf, load the previous rsp and issue a near return
(ret).
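Conceptually, the x86_64 sequence amounts to the following C simulation
(illustrative only; the function and names are mine, the real code is the
NATIVE_INTERRUPT_RETURN_NMI_SAFE assembly macro in this patch):

```c
#include <assert.h>
#include <stdint.h>

/* Saved exception frame as the CPU lays it out on the trap stack:
 * 0(rsp) RIP, 8(rsp) CS, 16(rsp) EFLAGS, 24(rsp) RSP, 32(rsp) SS. */
struct frame {
	uint64_t rip, cs, rflags, rsp, ss;
};

/*
 * Simulate the iret-free return: copy the return RIP to the top of the
 * *previous* stack, restore EFLAGS, switch to the previous stack, and
 * let a near "ret" pop the RIP.  NMIs stay masked throughout because
 * no iret is ever executed.
 */
static uint64_t nmi_safe_return(const struct frame *f, uint64_t *rflags)
{
	uint64_t *prev_sp = (uint64_t *)f->rsp; /* previous stack top */

	*--prev_sp = f->rip;	/* push return RIP on previous stack */
	*rflags = f->rflags;	/* popfq */
	return *prev_sp;	/* near ret pops RIP and jumps there */
}
```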

This makes it valid to place dynamically patched static jumps (asm gotos),
which will be used for optimized tracepoints, in NMI code, since returning from
the breakpoint becomes safe. It also makes it valid to access vmalloc'd memory
from NMI context, which permits executing module code or touching vmapped or
vmalloc'd areas from NMI handlers. This is very useful to tracers like LTTng.

This patch makes all faults, traps and exceptions safe to call from NMI
context *except* single-stepping, which requires iret to restore the TF (trap
flag) and jump to the return address in a single instruction. Sorry, no kprobes
support in NMI handlers because of this limitation. Single-stepping cannot be
emulated with popf/lret, because the lret itself would be single-stepped. The
limitation does not apply to "immediate values" because they do not use
single-stepping. This code detects when the TF flag is set and falls back to
the iret path for single-stepping, even though that reactivates NMIs
prematurely.

The test that detects nesting under an NMI handler is only done upon return
from a trap/exception to the kernel, which is not a frequent path. The other
return paths (return from trap/exception to userspace, return from interrupt)
keep exactly the same behavior (no slowdown).
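The kernel return path's decision can be sketched in C (the helper name and
standalone form are illustrative; the real logic lives in the entry_32.S and
entry_64.S assembly in this patch):

```c
#include <assert.h>
#include <stdbool.h>

#define NMI_MASK	0x04000000	/* nested-over-NMI bit in preempt_count */
#define X86_EFLAGS_TF	0x00000100	/* trap flag */

/*
 * Decide whether a trap/exception return to kernel space may use the
 * popf + ret sequence instead of iret.  Only when nested over an NMI
 * *and* not single-stepping is the iret-free path taken; otherwise the
 * normal iret path is used.
 */
static bool use_nmi_safe_return(unsigned int preempt_count,
				unsigned long eflags)
{
	if (!(preempt_count & NMI_MASK))
		return false;	/* not nested over NMI: normal iret */
	if (eflags & X86_EFLAGS_TF)
		return false;	/* single-stepping: iret is required */
	return true;		/* popf/ret, keeping NMIs masked */
}
```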

alpha and avr32 use bit 30 (0x40000000) for PREEMPT_ACTIVE. This patch moves it
to bit 28 (0x10000000).

TODO : test alpha and avr32 active count modification
TODO : test with lguest, xen, kvm.

tested on x86_32 (tests implemented in a separate patch) :
- instrumented the return path to export the EIP, CS and EFLAGS values when
  taken so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
  activated. Runs well through the return path.
- tested vmalloc faults in NMI handler by placing a non-optimized marker in the
  NMI handler (so no breakpoint is executed) and connecting a probe which
  touches every page of a 20MB vmalloc'd buffer. It executes through the return
  path without problem.
- Tested with and without preemption

tested on x86_64
- instrumented the return path to export the EIP, CS and EFLAGS values when
  taken so we know the return path code has been executed.
- trace_mark, using immediate values, with 10ms delay with the breakpoint
  activated. Runs well through the return path.

To test on x86_64 :
- Test without preemption
- Test vmalloc faults
- Test on Intel 64 bits CPUs. (AMD64 was fine)

Changelog since v1 :
- x86_64 fixes.
Changelog since v2 :
- fix paravirt build
Changelog since v3 :
- Include modifications suggested by Jeremy
Changelog since v4 :
- including hardirq.h in entry_32/64.S is a bad idea (it contains C code not
  ifdef'd out for assembly); define NMI_MASK in the .S files directly.
Changelog since v5 :
- Add NMI_MASK to irq_count() and make die() more verbose for NMIs.
Changelog since v7 :
- Implement paravirtualized nmi_return.
Changelog since v8 :
- refreshed the patch for asm-offsets. Those were left out of v8.
- now depends on "Stringify support commas" patch.
Changelog since v9 :
- Only test the nmi nested preempt count flag upon return from exceptions, not
  on return from interrupts. Only the kernel return path has this test.
- Add Xen, VMI, lguest support. Use their iret paravirt ops in lieu of
  nmi_return.

- update for 2.6.30-rc1; follow the NMI_MASK bits merged in mainline.

- update for 2.6.35-rc4-tip

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: akpm@osdl.org
CC: mingo@elte.hu
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: "Frank Ch. Eigler" <fche@redhat.com>
---
 arch/alpha/include/asm/thread_info.h  |    2 -
 arch/avr32/include/asm/thread_info.h  |    2 -
 arch/x86/include/asm/irqflags.h       |   56 ++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/paravirt.h       |    4 ++
 arch/x86/include/asm/paravirt_types.h |    1 
 arch/x86/kernel/asm-offsets_32.c      |    1 
 arch/x86/kernel/asm-offsets_64.c      |    1 
 arch/x86/kernel/dumpstack.c           |    2 +
 arch/x86/kernel/entry_32.S            |   30 ++++++++++++++++++
 arch/x86/kernel/entry_64.S            |   33 ++++++++++++++++++--
 arch/x86/kernel/paravirt.c            |    3 +
 arch/x86/kernel/paravirt_patch_32.c   |    6 +++
 arch/x86/kernel/paravirt_patch_64.c   |    6 +++
 arch/x86/kernel/vmi_32.c              |    2 +
 arch/x86/lguest/boot.c                |    1 
 arch/x86/xen/enlighten.c              |    1 
 16 files changed, 145 insertions(+), 6 deletions(-)

Index: linux.trees.git/arch/x86/kernel/entry_32.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_32.S	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_32.S	2010-07-14 08:02:11.000000000 -0400
@@ -80,6 +80,8 @@
 
 #define nr_syscalls ((syscall_table_size)/4)
 
+#define NMI_MASK 0x04000000
+
 #ifdef CONFIG_PREEMPT
 #define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
@@ -348,8 +350,32 @@ END(ret_from_fork)
 	# userspace resumption stub bypassing syscall exit tracing
 	ALIGN
 	RING0_PTREGS_FRAME
+
 ret_from_exception:
 	preempt_stop(CLBR_ANY)
+	GET_THREAD_INFO(%ebp)
+	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS and CS
+	movb PT_CS(%esp), %al
+	andl $(X86_EFLAGS_VM | SEGMENT_RPL_MASK), %eax
+	cmpl $USER_RPL, %eax
+	jae resume_userspace	# returning to v8086 or userspace
+	testl $NMI_MASK,TI_preempt_count(%ebp)
+	jz resume_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF, PT_EFLAGS(%esp)
+	jnz resume_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	TRACE_IRQS_IRET
+	RESTORE_REGS
+	addl $4, %esp			# skip orig_eax/error_code
+	CFI_ADJUST_CFA_OFFSET -4
+	INTERRUPT_RETURN_NMI_SAFE
+
 ret_from_intr:
 	GET_THREAD_INFO(%ebp)
 check_userspace:
@@ -949,6 +975,10 @@ ENTRY(native_iret)
 .previous
 END(native_iret)
 
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE # Should we deal with popf exception ?
+END(native_nmi_return)
+
 ENTRY(native_irq_enable_sysexit)
 	sti
 	sysexit
Index: linux.trees.git/arch/x86/kernel/entry_64.S
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/entry_64.S	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/entry_64.S	2010-07-14 08:02:11.000000000 -0400
@@ -163,6 +163,8 @@ GLOBAL(return_to_handler)
 #endif
 
 
+#define NMI_MASK 0x04000000
+
 #ifndef CONFIG_PREEMPT
 #define retint_kernel retint_restore_args
 #endif
@@ -875,6 +877,9 @@ ENTRY(native_iret)
 	.section __ex_table,"a"
 	.quad native_iret, bad_iret
 	.previous
+
+ENTRY(native_nmi_return)
+	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 #endif
 
 	.section .fixup,"ax"
@@ -929,6 +934,24 @@ retint_signal:
 	GET_THREAD_INFO(%rcx)
 	jmp retint_with_reschedule
 
+	/* Returning to kernel space from exception. */
+	/* rcx:	 threadinfo. interrupts off. */
+ENTRY(retexc_kernel)
+	testl $NMI_MASK,TI_preempt_count(%rcx)
+	jz retint_kernel		/* Not nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-ARGOFFSET(%rsp)	/* trap flag? */
+	jnz retint_kernel		/*
+					 * If single-stepping an NMI handler,
+					 * use the normal iret path instead of
+					 * the popf/lret because lret would be
+					 * single-stepped. It should not
+					 * happen : it will reactivate NMIs
+					 * prematurely.
+					 */
+	RESTORE_ARGS 0,8,0
+	TRACE_IRQS_IRETQ
+	INTERRUPT_RETURN_NMI_SAFE
+
 #ifdef CONFIG_PREEMPT
 	/* Returning to kernel space. Check if we need preemption */
 	/* rcx:	 threadinfo. interrupts off. */
@@ -1375,12 +1398,18 @@ ENTRY(paranoid_exit)
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
 	SWAPGS_UNSAFE_STACK
+paranoid_restore_no_nmi:
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_restore:
+	GET_THREAD_INFO(%rcx)
 	TRACE_IRQS_IRETQ 0
+	testl $NMI_MASK,TI_preempt_count(%rcx)
+	jz paranoid_restore_no_nmi              /* Nested over NMI ? */
+	testw $X86_EFLAGS_TF,EFLAGS-0(%rsp)     /* trap flag? */
+	jnz paranoid_restore_no_nmi
 	RESTORE_ALL 8
-	jmp irq_return
+	INTERRUPT_RETURN_NMI_SAFE
 paranoid_userspace:
 	GET_THREAD_INFO(%rcx)
 	movl TI_flags(%rcx),%ebx
@@ -1479,7 +1508,7 @@ ENTRY(error_exit)
 	TRACE_IRQS_OFF
 	GET_THREAD_INFO(%rcx)
 	testl %eax,%eax
-	jne retint_kernel
+	jne retexc_kernel
 	LOCKDEP_SYS_EXIT_IRQ
 	movl TI_flags(%rcx),%edx
 	movl $_TIF_WORK_MASK,%edi
Index: linux.trees.git/arch/x86/include/asm/irqflags.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/irqflags.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/irqflags.h	2010-07-14 08:02:11.000000000 -0400
@@ -56,6 +56,61 @@ static inline void native_halt(void)
 
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Only returns from a trap or exception to a NMI context (intra-privilege
+ * level near return) to the same SS and CS segments. Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued. It takes care of modifying the eflags, rsp and returning to the
+ * previous function.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(rsp)  RIP
+ * 8(rsp)  CS
+ * 16(rsp) EFLAGS
+ * 24(rsp) RSP
+ * 32(rsp) SS
+ *
+ * Upon execution :
+ * Copy EIP to the top of the return stack
+ * Update top of return stack address
+ * Pop eflags into the eflags register
+ * Make the return stack current
+ * Near return (popping the return address from the return stack)
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushq %rax;		\
+						movq %rsp, %rax;	\
+						movq 24+8(%rax), %rsp;	\
+						pushq 0+8(%rax);	\
+						pushq 16+8(%rax);	\
+						movq (%rax), %rax;	\
+						popfq;			\
+						ret
+#else
+/*
+ * Protected mode only, no V8086. Implies that protected mode must
+ * be entered before NMIs or MCEs are enabled. Only returns from a trap or
+ * exception to a NMI context (intra-privilege level far return). Should be used
+ * upon trap or exception return when nested over a NMI context so no iret is
+ * issued.
+ *
+ * The stack, at that point, looks like :
+ *
+ * 0(esp) EIP
+ * 4(esp) CS
+ * 8(esp) EFLAGS
+ *
+ * Upon execution :
+ * Copy the stack eflags to top of stack
+ * Pop eflags into the eflags register
+ * Far return: pop EIP and CS into their register, and additionally pop EFLAGS.
+ */
+#define NATIVE_INTERRUPT_RETURN_NMI_SAFE	pushl 8(%esp);	\
+						popfl;		\
+						lret $4
+#endif
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -114,6 +169,7 @@ static inline unsigned long __raw_local_
 
 #define ENABLE_INTERRUPTS(x)	sti
 #define DISABLE_INTERRUPTS(x)	cli
+#define INTERRUPT_RETURN_NMI_SAFE	NATIVE_INTERRUPT_RETURN_NMI_SAFE
 
 #ifdef CONFIG_X86_64
 #define SWAPGS	swapgs
Index: linux.trees.git/arch/alpha/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/alpha/include/asm/thread_info.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/alpha/include/asm/thread_info.h	2010-07-14 08:02:11.000000000 -0400
@@ -56,7 +56,7 @@ register struct thread_info *__current_t
 #define THREAD_SIZE_ORDER 1
 #define THREAD_SIZE (2*PAGE_SIZE)
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags:
Index: linux.trees.git/arch/avr32/include/asm/thread_info.h
===================================================================
--- linux.trees.git.orig/arch/avr32/include/asm/thread_info.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/avr32/include/asm/thread_info.h	2010-07-14 08:02:11.000000000 -0400
@@ -66,7 +66,7 @@ static inline struct thread_info *curren
 
 #endif /* !__ASSEMBLY__ */
 
-#define PREEMPT_ACTIVE		0x40000000
+#define PREEMPT_ACTIVE		0x10000000
 
 /*
  * Thread information flags
Index: linux.trees.git/arch/x86/include/asm/paravirt.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt.h	2010-07-14 08:02:11.000000000 -0400
@@ -943,6 +943,10 @@ extern void default_banner(void);
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE,	\
 		  jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))
 
+#define INTERRUPT_RETURN_NMI_SAFE					\
+	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_nmi_return), CLBR_NONE,	\
+		  jmp *%cs:pv_cpu_ops+PV_CPU_nmi_return)
+
 #define DISABLE_INTERRUPTS(clobbers)					\
 	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
 		  PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);		\
Index: linux.trees.git/arch/x86/kernel/paravirt.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt.c	2010-07-14 08:02:11.000000000 -0400
@@ -156,6 +156,7 @@ unsigned paravirt_patch_default(u8 type,
 		ret = paravirt_patch_ident_64(insnbuf, len);
 
 	else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
+		 type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
 		 type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret64))
@@ -204,6 +205,7 @@ static void native_flush_tlb_single(unsi
 
 /* These are in entry.S */
 extern void native_iret(void);
+extern void native_nmi_return(void);
 extern void native_irq_enable_sysexit(void);
 extern void native_usergs_sysret32(void);
 extern void native_usergs_sysret64(void);
@@ -373,6 +375,7 @@ struct pv_cpu_ops pv_cpu_ops = {
 	.usergs_sysret64 = native_usergs_sysret64,
 #endif
 	.iret = native_iret,
+	.nmi_return = native_nmi_return,
 	.swapgs = native_swapgs,
 
 	.set_iopl_mask = native_set_iopl_mask,
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_32.c	2010-07-14 08:02:11.000000000 -0400
@@ -1,10 +1,13 @@
-#include <asm/paravirt.h>
+#include <linux/stringify.h>
+#include <linux/irqflags.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "push %eax; popf");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushf; pop %eax");
 DEF_NATIVE(pv_cpu_ops, iret, "iret");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit");
 DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
@@ -41,6 +44,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, restore_fl);
 		PATCH_SITE(pv_irq_ops, save_fl);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
 		PATCH_SITE(pv_mmu_ops, read_cr2);
 		PATCH_SITE(pv_mmu_ops, read_cr3);
Index: linux.trees.git/arch/x86/kernel/paravirt_patch_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/paravirt_patch_64.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/paravirt_patch_64.c	2010-07-14 08:02:11.000000000 -0400
@@ -1,12 +1,15 @@
+#include <linux/irqflags.h>
+#include <linux/stringify.h>
 #include <asm/paravirt.h>
 #include <asm/asm-offsets.h>
-#include <linux/stringify.h>
 
 DEF_NATIVE(pv_irq_ops, irq_disable, "cli");
 DEF_NATIVE(pv_irq_ops, irq_enable, "sti");
 DEF_NATIVE(pv_irq_ops, restore_fl, "pushq %rdi; popfq");
 DEF_NATIVE(pv_irq_ops, save_fl, "pushfq; popq %rax");
 DEF_NATIVE(pv_cpu_ops, iret, "iretq");
+DEF_NATIVE(pv_cpu_ops, nmi_return,
+	__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
@@ -51,6 +54,7 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_irq_ops, irq_enable);
 		PATCH_SITE(pv_irq_ops, irq_disable);
 		PATCH_SITE(pv_cpu_ops, iret);
+		PATCH_SITE(pv_cpu_ops, nmi_return);
 		PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
 		PATCH_SITE(pv_cpu_ops, usergs_sysret32);
 		PATCH_SITE(pv_cpu_ops, usergs_sysret64);
Index: linux.trees.git/arch/x86/kernel/asm-offsets_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_32.c	2010-07-14 08:02:11.000000000 -0400
@@ -113,6 +113,7 @@ void foo(void)
 	OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
 	OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
 #endif
Index: linux.trees.git/arch/x86/kernel/asm-offsets_64.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/asm-offsets_64.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/asm-offsets_64.c	2010-07-14 08:02:11.000000000 -0400
@@ -58,6 +58,7 @@ int main(void)
 	OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
 	OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
 	OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
+	OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
 	OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
 	OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
 	OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
Index: linux.trees.git/arch/x86/xen/enlighten.c
===================================================================
--- linux.trees.git.orig/arch/x86/xen/enlighten.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/xen/enlighten.c	2010-07-14 08:02:12.000000000 -0400
@@ -953,6 +953,7 @@ static const struct pv_cpu_ops xen_cpu_o
 	.read_pmc = native_read_pmc,
 
 	.iret = xen_iret,
+	.nmi_return = xen_iret,
 	.irq_enable_sysexit = xen_sysexit,
 #ifdef CONFIG_X86_64
 	.usergs_sysret32 = xen_sysret32,
Index: linux.trees.git/arch/x86/kernel/vmi_32.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/vmi_32.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/vmi_32.c	2010-07-14 08:02:12.000000000 -0400
@@ -154,6 +154,8 @@ static unsigned vmi_patch(u8 type, u16 c
 					      insns, ip);
 		case PARAVIRT_PATCH(pv_cpu_ops.iret):
 			return patch_internal(VMI_CALL_IRET, len, insns, ip);
+		case PARAVIRT_PATCH(pv_cpu_ops.nmi_return):
+			return patch_internal(VMI_CALL_IRET, len, insns, ip);
 		case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit):
 			return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
 		default:
Index: linux.trees.git/arch/x86/lguest/boot.c
===================================================================
--- linux.trees.git.orig/arch/x86/lguest/boot.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/lguest/boot.c	2010-07-14 08:02:12.000000000 -0400
@@ -1270,6 +1270,7 @@ __init void lguest_init(void)
 	pv_cpu_ops.cpuid = lguest_cpuid;
 	pv_cpu_ops.load_idt = lguest_load_idt;
 	pv_cpu_ops.iret = lguest_iret;
+	pv_cpu_ops.nmi_return = lguest_iret;
 	pv_cpu_ops.load_sp0 = lguest_load_sp0;
 	pv_cpu_ops.load_tr_desc = lguest_load_tr_desc;
 	pv_cpu_ops.set_ldt = lguest_set_ldt;
Index: linux.trees.git/arch/x86/kernel/dumpstack.c
===================================================================
--- linux.trees.git.orig/arch/x86/kernel/dumpstack.c	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/kernel/dumpstack.c	2010-07-14 08:02:12.000000000 -0400
@@ -258,6 +258,8 @@ void __kprobes oops_end(unsigned long fl
 
 	if (!signr)
 		return;
+	if (in_nmi())
+		panic("Fatal exception in non-maskable interrupt");
 	if (in_interrupt())
 		panic("Fatal exception in interrupt");
 	if (panic_on_oops)
Index: linux.trees.git/arch/x86/include/asm/paravirt_types.h
===================================================================
--- linux.trees.git.orig/arch/x86/include/asm/paravirt_types.h	2010-07-09 00:10:14.000000000 -0400
+++ linux.trees.git/arch/x86/include/asm/paravirt_types.h	2010-07-14 08:02:12.000000000 -0400
@@ -181,6 +181,7 @@ struct pv_cpu_ops {
 	/* Normal iret.  Jump to this with the standard iret stack
 	   frame set up. */
 	void (*iret)(void);
+	void (*nmi_return)(void);
 
 	void (*swapgs)(void);
 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
@ 2010-07-14 16:42   ` Maciej W. Rozycki
  2010-07-14 18:12     ` Mathieu Desnoyers
  2010-07-16 12:28   ` Avi Kivity
  1 sibling, 1 reply; 44+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 16:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers, akpm,
	H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> This patch makes all faults, traps and exception safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes

 Watch out for the RF flag too; it is not set correctly by POPFD -- that 
may be important for faulting instructions that also have a hardware 
breakpoint set at their address.

> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.

 What about the VM flag for VM86 tasks?  It cannot be changed by POPFD 
either.

 How about only using the special return path when a nested exception is 
about to return to the NMI handler?  You'd avoid all the odd cases then 
that do not happen in the NMI context.

  Maciej


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 16:42   ` Maciej W. Rozycki
@ 2010-07-14 18:12     ` Mathieu Desnoyers
  2010-07-14 19:21       ` Maciej W. Rozycki
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 18:12 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Maciej W. Rozycki (macro@linux-mips.org) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
> 
> > This patch makes all faults, traps and exception safe to be called from NMI
> > context *except* single-stepping, which requires iret to restore the TF (trap
> > flag) and jump to the return address in a single instruction. Sorry, no kprobes
> 
>  Watch out for the RF flag too, that is not set correctly by POPFD -- that 
> may be important for faulting instructions that also have a hardware 
> breakpoint set at their address.
> 
> > support in NMI handlers because of this limitation. This cannot be emulated
> > with popf/lret, because lret would be single-stepped. It does not apply to
> > "immediate values" because they do not use single-stepping. This code detects if
> > the TF flag is set and uses the iret path for single-stepping, even if it
> > reactivates NMIs prematurely.
> 
>  What about the VM flag for VM86 tasks?  It cannot be changed by POPFD 
> either.
> 
>  How about only using the special return path when a nested exception is 
> about to return to the NMI handler?  You'd avoid all the odd cases then 
> that do not happen in the NMI context.

This is exactly what this patch does :-)

It selects the return path with

+       testl $NMI_MASK,TI_preempt_count(%ebp)
+       jz resume_kernel                /* Not nested over NMI ? */

In addition, regarding int3 breakpoint use in the kernel: AFAIK the handler does
not explicitly set the RF flag, and the breakpoint instruction (int3) appears
not to set it. (From my understanding of Intel's
Intel Architecture Software Developer’s Manual Volume 3: System Programming,
15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

So it should be safe to set an int3 breakpoint in an NMI handler with this patch.
It's just the "single-stepping" feature of kprobes which is problematic.
Luckily, only int3 is needed for the code-patching bypass.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 18:12     ` Mathieu Desnoyers
@ 2010-07-14 19:21       ` Maciej W. Rozycki
  2010-07-14 19:58         ` Mathieu Desnoyers
  0 siblings, 1 reply; 44+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 19:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> >  How about only using the special return path when a nested exception is 
> > about to return to the NMI handler?  You'd avoid all the odd cases then 
> > that do not happen in the NMI context.
> 
> This is exactly what this patch does :-)

 Ah, OK then -- I understood you actually tested the value of TF in the 
image to be restored.

> It selects the return path with
> 
> +       testl $NMI_MASK,TI_preempt_count(%ebp)
> +       jz resume_kernel                /* Not nested over NMI ? */
> 
> In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> not to set it. (from my understanding of Intel's
> Intel Architecture Software Developer’s Manual Volume 3: System Programming
> 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)

 The CPU only sets RF itself in the image saved in certain cases -- you'd 
see it set in the page fault handler for example, so that once the handler 
has finished any instruction breakpoint does not hit (presumably again, 
because the instruction breakpoint debug exception has the highest 
priority).  You mentioned the need to handle these faults.

> So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
> 
> It's just the "single-stepping" feature of kprobes which is problematic.
> Luckily, only int3 is needed for code patching bypass.

 Actually the breakpoint exception handler should probably set RF 
explicitly, but that depends on the exact debugging scenario, so I can't 
comment on it further.  I don't know how INT3 is used in this context, so 
I'm just noting this may be a danger zone.

  Maciej


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 19:21       ` Maciej W. Rozycki
@ 2010-07-14 19:58         ` Mathieu Desnoyers
  2010-07-14 20:36           ` Maciej W. Rozycki
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-14 19:58 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Maciej W. Rozycki (macro@linux-mips.org) wrote:
> On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:
> 
> > >  How about only using the special return path when a nested exception is 
> > > about to return to the NMI handler?  You'd avoid all the odd cases then 
> > > that do not happen in the NMI context.
> > 
> > This is exactly what this patch does :-)
> 
>  Ah, OK then -- I understood you actually tested the value of TF in the 
> image to be restored.

It tests it too. When it detects that the return path is about to return to an
NMI handler, it checks whether the TF flag is set. If it is set, then "iret" is
really needed, because TF can only single-step an instruction when set by
"iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
follows popf. Anyway, single-stepping is really discouraged in NMI handlers,
because there is no way to get around the iret.

> 
> > It selects the return path with
> > 
> > +       testl $NMI_MASK,TI_preempt_count(%ebp)
> > +       jz resume_kernel                /* Not nested over NMI ? */
> > 
> > In addition, about int3 breakpoints use in the kernel, AFAIK the handler does
> > not explicitly set the RF flag, and the breakpoint instruction (int3) appears
> > not to set it. (from my understanding of Intel's
> > Intel Architecture Software Developer’s Manual Volume 3: System Programming
> > 15.3.1.1. INSTRUCTION-BREAKPOINT EXCEPTION C)
> 
>  The CPU only sets RF itself in the image saved in certain cases -- you'd 
> see it set in the page fault handler for example, so that once the handler 
> has finished any instruction breakpoint does not hit (presumably again, 
> because the instruction breakpoint debug exception has the highest 
> priority).  You mentioned the need to handle these faults.

Well, the only case where I think it might make sense to allow a breakpoint in
NMI handler code would be to temporarily replace a static branch, which should
in no way be able to trigger any other fault.

> 
> > So it should be safe to set a int3 breakpoint in a NMI handler with this patch.
> > 
> > It's just the "single-stepping" feature of kprobes which is problematic.
> > Luckily, only int3 is needed for code patching bypass.
> 
>  Actually the breakpoint exception handler should probably set RF 
> explicitly, but that depends on the exact debugging scenario, so I can't 
> comment on it further.  I don't know how INT3 is used in this context, so 
> I'm just noting this may be a danger zone.

In the case of temporary bypass, the int3 is only there to divert the
instruction execution flow to somewhere else, and we come back to the original
code at the address following the instruction which has the breakpoint. So
basically, we never come back to the original instruction, ever. We might as
well just clear the RF flag from the EFLAGS image before popf.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 19:58         ` Mathieu Desnoyers
@ 2010-07-14 20:36           ` Maciej W. Rozycki
  0 siblings, 0 replies; 44+ messages in thread
From: Maciej W. Rozycki @ 2010-07-14 20:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On Wed, 14 Jul 2010, Mathieu Desnoyers wrote:

> It tests it too. When it detects that the return path is about to return to an
> NMI handler, it checks whether the TF flag is set. If it is, then "iret" is
> really needed, because TF can only single-step an instruction when set by
> "iret". The popf/ret scheme would otherwise trap at the "ret" instruction that
> follows popf. Anyway, single-stepping is really discouraged in NMI handlers,
> because there is no way to avoid the iret.

 Hmm, with Pentium Pro and more recent processors there is actually a 
nasty hack that will let you get away with POPF/RET and TF set. ;)  You 
can try it if you like and can arrange for an appropriate scenario.

> In the case of temporary bypass, the int3 is only there to divert the
> instruction execution flow to somewhere else, and we come back to the original
> code at the address following the instruction which has the breakpoint. So
> basically, we never come back to the original instruction, ever. We might as
> well just clear the RF flag from the EFLAGS image before popf.

 Yes, if you return to elsewhere, then that's actually quite desirable 
IMHO.

 This RF flag is quite complicated to handle and there are some errata 
involved too.  If I understand it correctly, all fault-class exception 
handlers are expected to set it manually in the image to be restored if 
they return to the original faulting instruction (that includes the debug 
exception handler if it was invoked as a fault, i.e. in response to an 
instruction breakpoint).  Then all trap-class exception handlers are 
expected to clear the flag (and that includes the debug exception handler 
if it was invoked as a trap, e.g. in response to a data breakpoint or a 
single step).  I haven't checked if Linux gets these bits right, but it 
may be worth doing so.

 For the record -- GDB hardly cares, because it removes any instruction 
breakpoints before it is asked to resume execution of an instruction that 
has a breakpoint set on it, single-steps the instruction with all the other 
threads locked out and then reinserts the breakpoints so that they can hit 
again.  Then it proceeds with whatever should be done next to fulfil the 
execution request.

  Maciej


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
  2010-07-14 16:42   ` Maciej W. Rozycki
@ 2010-07-16 12:28   ` Avi Kivity
  2010-07-16 14:49     ` Mathieu Desnoyers
  1 sibling, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 12:28 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, Mathieu Desnoyers, akpm,
	H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
> Implements an alternative iret with popf and return so trap and exception
> handlers can return to the NMI handler without issuing iret. iret would cause
> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
> copy the return instruction pointer to the top of the previous stack, issue a
> popf, load the previous esp and issue a near return (ret).
>
> It allows placing dynamically patched static jumps in asm gotos, which will be
> used for optimized tracepoints, in NMI code since returning from a breakpoint
> would be valid. Accessing vmalloc'd memory, which allows executing module code
> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
> This is very useful to tracers like LTTng.
>
> This patch makes all faults, traps and exceptions safe to be called from NMI
> context *except* single-stepping, which requires iret to restore the TF (trap
> flag) and jump to the return address in a single instruction. Sorry, no kprobes
> support in NMI handlers because of this limitation. This cannot be emulated
> with popf/lret, because lret would be single-stepped. It does not apply to
> "immediate values" because they do not use single-stepping. This code detects if
> the TF flag is set and uses the iret path for single-stepping, even if it
> reactivates NMIs prematurely.
>    

You need to save/restore cr2 in addition, otherwise the following hits you

- page fault
- processor writes cr2, enters fault handler
- nmi
- page fault
- cr2 overwritten

I guess you would usually not notice the corruption since you'd just see 
a spurious fault on the page the NMI handler touched, but if the first 
fault happened in a kvm guest, then we'd corrupt the guest's cr2.

But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's 
wrong with using a per-cpu pointer to a kmalloc() area?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 12:28   ` Avi Kivity
@ 2010-07-16 14:49     ` Mathieu Desnoyers
  2010-07-16 15:34       ` Andi Kleen
  2010-07-16 16:47       ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 14:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/14/2010 06:49 PM, Mathieu Desnoyers wrote:
>> Implements an alternative iret with popf and return so trap and exception
>> handlers can return to the NMI handler without issuing iret. iret would cause
>> NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
>> copy the return instruction pointer to the top of the previous stack, issue a
>> popf, load the previous esp and issue a near return (ret).
>>
>> It allows placing dynamically patched static jumps in asm gotos, which will be
>> used for optimized tracepoints, in NMI code since returning from a breakpoint
>> would be valid. Accessing vmalloc'd memory, which allows executing module code
>> or accessing vmapped or vmalloc'd areas from NMI context, would also be valid.
>> This is very useful to tracers like LTTng.
>>
>> This patch makes all faults, traps and exceptions safe to be called from NMI
>> context *except* single-stepping, which requires iret to restore the TF (trap
>> flag) and jump to the return address in a single instruction. Sorry, no kprobes
>> support in NMI handlers because of this limitation. This cannot be emulated
>> with popf/lret, because lret would be single-stepped. It does not apply to
>> "immediate values" because they do not use single-stepping. This code detects if
>> the TF flag is set and uses the iret path for single-stepping, even if it
>> reactivates NMIs prematurely.
>>    
>
> You need to save/restore cr2 in addition, otherwise the following hits you
>
> - page fault
> - processor writes cr2, enters fault handler
> - nmi
> - page fault
> - cr2 overwritten
>
> I guess you would usually not notice the corruption since you'd just see  
> a spurious fault on the page the NMI handler touched, but if the first  
> fault happened in a kvm guest, then we'd corrupt the guest's cr2.

OK, just to make sure: you mean we'd have to save/restore the cr2 register
at the beginning/end of the NMI handler execution, right? Then shouldn't we
save/restore cr3 too?

> But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's  
> wrong with using a per-cpu pointer to a kmalloc() area?

Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
much more than perf) can potentially cause large latencies, which could be
squashed by allowing page faults in NMI handlers. This looks like a stronger
argument to me.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 14:49     ` Mathieu Desnoyers
@ 2010-07-16 15:34       ` Andi Kleen
  2010-07-16 15:40         ` Mathieu Desnoyers
  2010-07-16 16:47       ` Avi Kivity
  1 sibling, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 15:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Avi Kivity, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, H. Peter Anvin, Jeremy Fitzhardinge,
	Frank Ch. Eigler

> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be

You need to fix all other code too that walks task lists to avoid all those.

% gid for_each_process | wc -l

In fact the mm-struct walk is cheaper than a task-list walk because there
are always fewer mm_structs than tasks.

-Andi


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 15:34       ` Andi Kleen
@ 2010-07-16 15:40         ` Mathieu Desnoyers
  0 siblings, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 15:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro, akpm,
	H. Peter Anvin, Jeremy Fitzhardinge, Frank Ch. Eigler

* Andi Kleen (andi@firstfloor.org) wrote:
> > Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> > much more than perf) can potentially cause large latencies, which could be
> 
> You need to fix all other code too that walks task lists to avoid all those.
> 
> % gid for_each_process | wc -l

This can very well be done incrementally. And I agree, these should eventually
be targeted too, especially those which hold locks. We've already started hearing
about tasklist lock live-locks in the past year, so I think we're pretty much at
the point where it should be looked at.

Thanks,

Mathieu

> 
> In fact the mm-struct walk is cheaper than a task-list walk because there
> are always fewer mm_structs than tasks.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 14:49     ` Mathieu Desnoyers
  2010-07-16 15:34       ` Andi Kleen
@ 2010-07-16 16:47       ` Avi Kivity
  2010-07-16 16:58         ` Mathieu Desnoyers
  1 sibling, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 16:47 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>
>> You need to save/restore cr2 in addition, otherwise the following hits you
>>
>> - page fault
>> - processor writes cr2, enters fault handler
>> - nmi
>> - page fault
>> - cr2 overwritten
>>
>> I guess you would usually not notice the corruption since you'd just see
>> a spurious fault on the page the NMI handler touched, but if the first
>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>      
> OK, just to make sure: you mean we'd have to save/restore the cr2 register
> at the beginning/end of the NMI handler execution, right ?

Yes.

> Then shouldn't we
> save/restore cr3 too?
>
>    

No, faults should not change cr3.

>> But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's
>> wrong with using a per-cpu pointer to a kmalloc() area?
>>      
> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
> much more than perf) can potentially cause large latencies, which could be
> squashed by allowing page faults in NMI handlers. This looks like a stronger
> argument to me.

Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI 
which cannot take vmalloc faults, why bother?  If not, why not?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 16:47       ` Avi Kivity
@ 2010-07-16 16:58         ` Mathieu Desnoyers
  2010-07-16 17:54           ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 16:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/16/2010 05:49 PM, Mathieu Desnoyers wrote:
>>
>>> You need to save/restore cr2 in addition, otherwise the following hits you
>>>
>>> - page fault
>>> - processor writes cr2, enters fault handler
>>> - nmi
>>> - page fault
>>> - cr2 overwritten
>>>
>>> I guess you would usually not notice the corruption since you'd just see
>>> a spurious fault on the page the NMI handler touched, but if the first
>>> fault happened in a kvm guest, then we'd corrupt the guest's cr2.
>>>      
>> OK, just to make sure: you mean we'd have to save/restore the cr2 register
>> at the beginning/end of the NMI handler execution, right ?
>
> Yes.

OK

>
>> Then shouldn't we
>> save/restore cr3 too?
>>
>>    
>
> No, faults should not change cr3.

Ah, right.

>
>>> But the whole thing strikes me as overkill.  If it's 8k per-cpu, what's
>>> wrong with using a per-cpu pointer to a kmalloc() area?
>>>      
>> Well, it seems like all the kernel code calling "vmalloc_sync_all()" (which is
>> much more than perf) can potentially cause large latencies, which could be
>> squashed by allowing page faults in NMI handlers. This looks like a stronger
>> argument to me.
>
> Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI  
> which cannot take vmalloc faults, why bother?  If not, why not?

Modules come as yet another example of stuff that is loaded in vmalloc'd space
and can be accessed from NMI context. That would include oprofile, tracers, and
probably others I'm forgetting about.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 16:58         ` Mathieu Desnoyers
@ 2010-07-16 17:54           ` Avi Kivity
  2010-07-16 18:05             ` H. Peter Anvin
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 17:54 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, Andi Kleen, akpm, H. Peter Anvin,
	Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>
>> Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI
>> which cannot take vmalloc faults, why bother?  If not, why not?
>>      
> Modules come as yet another example of stuff that is loaded in vmalloc'd space
> and can be accessed from NMI context. That would include oprofile, tracers, and
> probably others I'm forgetting about.
>    

Module loading can certainly take a vmalloc_sync_all() (though I agree 
it's unpleasant).  Anything else?

Note perf is not modular at this time, but could be made so with 
preempt/sched notifiers to hook the context switch.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 17:54           ` Avi Kivity
@ 2010-07-16 18:05             ` H. Peter Anvin
  2010-07-16 18:15               ` Avi Kivity
                                 ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: H. Peter Anvin @ 2010-07-16 18:05 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:54 AM, Avi Kivity wrote:
> On 07/16/2010 07:58 PM, Mathieu Desnoyers wrote:
>>
>>> Why is that kernel code calling vmalloc_sync_all()?  If it is only NMI
>>> which cannot take vmalloc faults, why bother?  If not, why not?
>>>      
>> Modules come as yet another example of stuff that is loaded in vmalloc'd space
>> and can be accessed from NMI context. That would include oprofile, tracers, and
>> probably others I'm forgetting about.
>>    
> 
> Module loading can certainly take a vmalloc_sync_all() (though I agree 
> it's unpleasant).  Anything else?
> 
> Note perf is not modular at this time, but could be made so with 
> preempt/sched notifiers to hook the context switch.
> 

Actually, module loading is already a performance problem; a lot of
distros load sometimes hundreds of modules on startup, and it's heavily
serialized, so I can see this being desirable to skip.

I really hope no one ever gets the idea of touching user space from an
NMI handler, though, and expecting it to work...

	-hpa



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05             ` H. Peter Anvin
@ 2010-07-16 18:15               ` Avi Kivity
  2010-07-16 18:17                 ` H. Peter Anvin
                                   ` (2 more replies)
  2010-07-16 19:28               ` Andi Kleen
  2010-08-04  9:46               ` Peter Zijlstra
  2 siblings, 3 replies; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 18:15 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>
>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>> it's unpleasant).  Anything else?
>>
>> Note perf is not modular at this time, but could be made so with
>> preempt/sched notifiers to hook the context switch.
>>
>>      
> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily
> serialized, so I can see this being desirable to skip.
>    

There aren't that many processes at this time (or there shouldn't be, 
don't know how fork-happy udev is at this stage), so the sync should be 
pretty fast.  In any case, we can sync only modules that contain NMI 
handlers.

> I really hope no one ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...
>    

I think the concern here is about an NMI handler's code running in 
vmalloc space, or is it something else?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:15               ` Avi Kivity
@ 2010-07-16 18:17                 ` H. Peter Anvin
  2010-07-16 18:28                   ` Avi Kivity
  2010-07-16 18:22                 ` Mathieu Desnoyers
  2010-07-16 18:25                 ` Linus Torvalds
  2 siblings, 1 reply; 44+ messages in thread
From: H. Peter Anvin @ 2010-07-16 18:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 11:15 AM, Avi Kivity wrote:
> 
> There aren't that many processes at this time (or there shouldn't be, 
> don't know how fork-happy udev is at this stage), so the sync should be 
> pretty fast.  In any case, we can sync only modules that contain NMI 
> handlers.
> 
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>    
> 
> I think the concern here is about an NMI handler's code running in 
> vmalloc space, or is it something else?
> 

Code or data, yes; including percpu data.

	-hpa


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:15               ` Avi Kivity
  2010-07-16 18:17                 ` H. Peter Anvin
@ 2010-07-16 18:22                 ` Mathieu Desnoyers
  2010-07-16 18:32                   ` Avi Kivity
  2010-07-16 18:25                 ` Linus Torvalds
  2 siblings, 1 reply; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 18:22 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

* Avi Kivity (avi@redhat.com) wrote:
> On 07/16/2010 09:05 PM, H. Peter Anvin wrote:
>>
>>> Module loading can certainly take a vmalloc_sync_all() (though I agree
>>> it's unpleasant).  Anything else?
>>>
>>> Note perf is not modular at this time, but could be made so with
>>> preempt/sched notifiers to hook the context switch.
>>>
>>>      
>> Actually, module loading is already a performance problem; a lot of
>> distros load sometimes hundreds of modules on startup, and it's heavily
>> serialized, so I can see this being desirable to skip.
>>    
>
> There aren't that many processes at this time (or there shouldn't be,  
> don't know how fork-happy udev is at this stage), so the sync should be  
> pretty fast.  In any case, we can sync only modules that contain NMI  
> handlers.

USB hotplug is a use-case that happens randomly long after the system is up and
running; I'm afraid it does not fit your module-loading expectations. It
triggers tons of events, many of which actually load modules.

Thanks,

Mathieu

>
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>    
>
> I think the concern here is about an NMI handler's code running in  
> vmalloc space, or is it something else?
>
> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.
>

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:15               ` Avi Kivity
  2010-07-16 18:17                 ` H. Peter Anvin
  2010-07-16 18:22                 ` Mathieu Desnoyers
@ 2010-07-16 18:25                 ` Linus Torvalds
  2010-07-16 19:30                   ` Andi Kleen
  2 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2010-07-16 18:25 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <avi@redhat.com> wrote:
>
> I think the concern here is about an NMI handler's code running in vmalloc
> space, or is it something else?

I think the concern was also potentially doing things like backtraces
etc that may need access to the module data structures (I think the
ELF headers end up all being in vmalloc space too, for example).

The whole debugging thing is also an issue. Now, I obviously am not a
big fan of remote debuggers, but everybody tells me I'm wrong. And
putting a breakpoint on NMI is certainly not insane if you are doing
debugging in the first place. So it's not necessarily always about the
page faults.

                               Linus


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:17                 ` H. Peter Anvin
@ 2010-07-16 18:28                   ` Avi Kivity
  2010-07-16 18:37                     ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 18:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:17 PM, H. Peter Anvin wrote:
>
>> I think the concern here is about an NMI handler's code running in
>> vmalloc space, or is it something else?
>>
>>      
> Code or data, yes; including percpu data.
>    

Use kmalloc and percpu pointers, it's not that onerous.

Oh, and you can access vmalloc space by switching cr3 temporarily to 
init_mm's, no?  Obviously not a very performant solution, at least 
without PCIDs, but can be used if needed.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:22                 ` Mathieu Desnoyers
@ 2010-07-16 18:32                   ` Avi Kivity
  2010-07-16 19:29                     ` H. Peter Anvin
  2010-07-16 19:32                     ` Andi Kleen
  0 siblings, 2 replies; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 18:32 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, LKML, Linus Torvalds, Andrew Morton, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
>
>> There aren't that many processes at this time (or there shouldn't be,
>> don't know how fork-happy udev is at this stage), so the sync should be
>> pretty fast.  In any case, we can sync only modules that contain NMI
>> handlers.
>>      
> USB hotplug is a use-case happening randomly after the system is well there and
> running; I'm afraid this does not fit in your module loading expectations. It
> triggers tons of events, many of these actually load modules.
>    

How long would vmalloc_sync_all take with a few thousand mm_structs?

We share the pmds, yes?  So it's a few thousand memory accesses.  The 
direct impact is probably negligible, compared to actually loading the 
module from disk.  All we need is to make sure the locking doesn't slow 
down unrelated stuff.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:28                   ` Avi Kivity
@ 2010-07-16 18:37                     ` Linus Torvalds
  2010-07-16 19:26                       ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2010-07-16 18:37 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity <avi@redhat.com> wrote:
>
> Use kmalloc and percpu pointers, it's not that onerous.

What people don't seem to understand is that WE SHOULD NOT MAKE NMI
FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
WHAT-SO-EVER TO DO WITH NMI.

I'm shouting, because this point seems to have been continually
missed. It was missed in the original patches, and it's been missed in
the discussions.

Non-NMI code should simply never have to even _think_ about NMI's. Why
should it? It should just do whatever comes "natural" within its own
context.

This is why I've been pushing for the "let's just fix NMI" approach.
Not adding random hacks to other code sequences that have nothing
what-so-ever to do with NMI.

So don't add NMI code to the page fault code. Not to the debug code,
or to the module loading code. Don't say "use special allocations
because the NMI code may care about these particular data structures".
Because that way lies crap and unmaintainability.

                      Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:37                     ` Linus Torvalds
@ 2010-07-16 19:26                       ` Avi Kivity
  2010-07-16 21:39                         ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 19:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 09:37 PM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:28 AM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> Use kmalloc and percpu pointers, it's not that onerous.
>>      
> What people don't seem to understand is that WE SHOULD NOT MAKE NMI
> FORCE US TO DO "STRANGE" CODE IN CODE-PATHS THAT HAVE NOTHING
> WHAT-SO-EVER TO DO WITH NMI.
>
> I'm shouting, because this point seems to have been continually
> missed. It was missed in the original patches, and it's been missed in
> the discussions.
>
> Non-NMI code should simply never have to even _think_ about NMI's. Why
> should it? It should just do whatever comes "natural" within its own
> context.
>
>    

But we're not talking about non-NMI code.  The 8k referred to in the 
original patch are buffers used by NMI stack recording.  The 
vmalloc_sync_all() in module code is only needed by code that is 
executed during NMI, and hence must be NMI aware.

> This is why I've been pushing for the "let's just fix NMI" approach.
> Not adding random hacks to other code sequences that have nothing
> what-so-ever to do with NMI.
>    

"fixing NMI" will result in code that is understandable by maybe three 
people after long and hard thinking.  NMI can happen in too many 
semi-defined contexts, so there will be plenty of edge cases.  I'm not 
sure we can ever trust such trickery.

> So don't add NMI code to the page fault code. Not to the debug code,
> or to the module loading code. Don't say "use special allocations
> because the NMI code may care about these particular data structures".
> Because that way lies crap and unmaintainability.
>    

If NMI code can call random hooks and access random data, yes.  But I 
don't think we're at that point yet.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05             ` H. Peter Anvin
  2010-07-16 18:15               ` Avi Kivity
@ 2010-07-16 19:28               ` Andi Kleen
  2010-07-16 19:32                 ` Avi Kivity
  2010-08-04  9:46               ` Peter Zijlstra
  2 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 19:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge,
	Frank Ch. Eigler

> Actually, module loading is already a performance problem; a lot of
> distros load sometimes hundreds of modules on startup, and it's heavily

On startup you don't have many processes.  If there's a problem
it's surely not the fault of vmalloc_sync_all().

BTW in my experience one reason module loading was traditionally slow was
that it did a stop_machine(). I think(?) that has been fixed
at some point. But even then it's more an issue on larger
systems.

> I really hope noone ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work...

It can make sense for a backtrace in a profiler.

In fact perf is nearly doing it I believe, but moves
it to the self IPI handler in most cases.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:32                   ` Avi Kivity
@ 2010-07-16 19:29                     ` H. Peter Anvin
  2010-07-16 19:39                       ` Avi Kivity
  2010-07-16 19:32                     ` Andi Kleen
  1 sibling, 1 reply; 44+ messages in thread
From: H. Peter Anvin @ 2010-07-16 19:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 11:32 AM, Avi Kivity wrote:
> 
> How long would vmalloc_sync_all take with a few thousand mm_structs?
> 
> We share the pmds, yes?  So it's a few thousand memory accesses.  The 
> direct impact is probably negligible, compared to actually loading the 
> module from disk.  All we need is to make sure the locking doesn't slow 
> down unrelated stuff.
> 

It's not the memory accesses, it's the need to synchronize all the CPUs.

	-hpa

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:25                 ` Linus Torvalds
@ 2010-07-16 19:30                   ` Andi Kleen
  2010-07-18  9:26                     ` Avi Kivity
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge,
	Frank Ch. Eigler

On Fri, Jul 16, 2010 at 11:25:19AM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 11:15 AM, Avi Kivity <avi@redhat.com> wrote:
> >
> > I think the concern here is about an NMI handler's code running in vmalloc
> > space, or is it something else?
> 
> I think the concern was also potentially doing things like backtraces
> etc that may need access to the module data structures (I think the
> ELF headers end up all being in vmalloc space too, for example).
> 
> The whole debugging thing is also an issue. Now, I obviously am not a
> big fan of remote debuggers, but everybody tells me I'm wrong. And
> putting a breakpoint on NMI is certainly not insane if you are doing
> debugging in the first place. So it's not necessarily always about the
> page faults.

We already have infrastructure for kprobes to prevent breakpoints
on critical code (the __kprobes section). In principle kgdb/kdb
could be taught about honoring those too.

That wouldn't help for truly external JTAG debuggers, but I would assume
those generally can (should) handle any contexts anyways.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:28               ` Andi Kleen
@ 2010-07-16 19:32                 ` Avi Kivity
  2010-07-16 19:34                   ` Andi Kleen
  0 siblings, 1 reply; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 19:32 UTC (permalink / raw)
  To: Andi Kleen
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:28 PM, Andi Kleen wrote:
>
>> I really hope noone ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work...
>>      
> It can make sense for a backtrace in a profiler.
>
> In fact perf is nearly doing it I believe, but moves
> it to the self IPI handler in most cases.
>    

Interesting, is the self IPI guaranteed to execute synchronously after 
the NMI's IRET?  Or can the core IRET faster than the APIC and so we get 
the backtrace at the wrong place?

(and does it matter? the NMI itself is not always accurate)

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:32                   ` Avi Kivity
  2010-07-16 19:29                     ` H. Peter Anvin
@ 2010-07-16 19:32                     ` Andi Kleen
  1 sibling, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 19:32 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Mathieu Desnoyers, H. Peter Anvin, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, akpm, Jeremy Fitzhardinge,
	Frank Ch. Eigler

On Fri, Jul 16, 2010 at 09:32:00PM +0300, Avi Kivity wrote:
> On 07/16/2010 09:22 PM, Mathieu Desnoyers wrote:
> >
> >>There aren't that many processes at this time (or there shouldn't be,
> >>don't know how fork-happy udev is at this stage), so the sync should be
> >>pretty fast.  In any case, we can sync only modules that contain NMI
> >>handlers.
> >USB hotplug is a use-case happening randomly after the system is well there and
> >running; I'm afraid this does not fit in your module loading expectations. It
> >triggers tons of events, many of these actually load modules.
> 
> How long would vmalloc_sync_all take with a few thousand mm_structs?
> 
> We share the pmds, yes?  So it's a few thousand memory accesses.
> The direct impact is probably negligible, compared to actually
> loading the module from disk.  All we need is to make sure the
> locking doesn't slow down unrelated stuff.

Also you have to remember that vmalloc_sync_all() only does something
when the top-level page table is actually updated. That is very rare.
(in many cases it should happen at most once per boot)
Most mapping changes update lower levels, and those are already
shared.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:32                 ` Avi Kivity
@ 2010-07-16 19:34                   ` Andi Kleen
  0 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 19:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Andi Kleen, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Linus Torvalds, Andrew Morton, Ingo Molnar, Peter Zijlstra,
	Steven Rostedt, Steven Rostedt, Frederic Weisbecker,
	Thomas Gleixner, Christoph Hellwig, Li Zefan, Lai Jiangshan,
	Johannes Berg, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Tom Zanussi, KOSAKI Motohiro, akpm, Jeremy Fitzhardinge,
	Frank Ch. Eigler

On Fri, Jul 16, 2010 at 10:32:13PM +0300, Avi Kivity wrote:
> On 07/16/2010 10:28 PM, Andi Kleen wrote:
> >
> >>I really hope noone ever gets the idea of touching user space from an
> >>NMI handler, though, and expecting it to work...
> >It can make sense for a backtrace in a profiler.
> >
> >In fact perf is nearly doing it I believe, but moves
> >it to the self IPI handler in most cases.
> 
> Interesting, is the self IPI guaranteed to execute synchronously
> after the NMI's IRET?  Or can the core IRET faster than the APIC and
> so we get the backtrace at the wrong place?
> 
> (and does it matter? the NMI itself is not always accurate)

The self IPI runs after the next STI (or a POPF that sets IF).

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:29                     ` H. Peter Anvin
@ 2010-07-16 19:39                       ` Avi Kivity
  0 siblings, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2010-07-16 19:39 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Mathieu Desnoyers, LKML, Linus Torvalds, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:29 PM, H. Peter Anvin wrote:
> On 07/16/2010 11:32 AM, Avi Kivity wrote:
>    
>> How long would vmalloc_sync_all take with a few thousand mm_structs?
>>
>> We share the pmds, yes?  So it's a few thousand memory accesses.  The
>> direct impact is probably negligible, compared to actually loading the
>> module from disk.  All we need is to make sure the locking doesn't slow
>> down unrelated stuff.
>>
>>      
> It's not the memory accesses, it's the need to synchronize all the CPUs.
>    

I'm missing something.  Why do we need to sync all cpus?  The 
vmalloc_sync_all() I'm reading doesn't.

Even if we do an on_each_cpu() somewhere, it isn't the end of the world.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:26                       ` Avi Kivity
@ 2010-07-16 21:39                         ` Linus Torvalds
  2010-07-16 22:07                           ` Andi Kleen
  2010-07-18  9:23                           ` Avi Kivity
  0 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2010-07-16 21:39 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity <avi@redhat.com> wrote:
> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>
>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>> should it? It should just do whatever comes "natural" within its own
>> context.
>
> But we're not talking about non-NMI code.

Yes, we are. We're talking about breakpoints (look at the subject
line), and you are very much talking about things like that _idiotic_
vmalloc_sync_all() by module loading code etc etc.

Every _single_ "solution" I have seen - apart from my suggestion - has
been about making code "special" because some other code might run in
an NMI. Module init sequences having to do idiotic things just because
they have data structures that might get accessed by NMI.

And the thing is, if we just do NMI's correctly, and allow nesting,
ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
stupid things elsewhere.

In other words, why the hell are you arguing? Help Mathieu write the
low-level NMI handler right, and remove that idiotic
"vmalloc_sync_all()" that is fundamentally broken and should not
exist. Rather than talk about adding more of that kind of crap.

                   Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
@ 2010-07-16 22:02 Jeffrey Merkey
  2010-07-16 22:22 ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:02 UTC (permalink / raw)
  To: linux-kernel

Date Fri, 16 Jul 2010 14:39:50 -0700
From Linus Torvalds

> Linus Torvalds wrote:
> But we're not talking about non-NMI code. Yes, we are. We're talking
> about breakpoints (look at the subject line), and you are very much
> talking about things like that _idiotic_ vmalloc_sync_all() by module
> loading code etc etc. Every _single_ "solution" I have seen - apart
> from my suggestion - has been about making code "special" because some
> other code might run in an NMI. Module init sequences having to do
> idiotic things just because they have data structures that might get
> accessed by NMI. And the thing is, if we just do NMI's correctly, and
> allow nesting, ALL THOSE PROBLEMS GO AWAY. And there is no reason
> what-so-ever to do stupid things elsewhere. In other words, why the
> hell are you arguing? Help Mathieu write the low-level NMI handler
> right, and remove that idiotic "vmalloc_sync_all()" that is
> fundamentally broken and should not exist. Rather than talk about
> adding more of that kind of crap.
>
> Linus

So Linus, my understanding of Intel's processor design is that the
processor will NEVER signal a nested NMI until it sees an iret from
the first NMI exception.  At least that's how the processors were
working when I started this, unless this behavior has changed.  Just
put a gate on the exception that uses its own stack (which I think we
do anyway).

Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 21:39                         ` Linus Torvalds
@ 2010-07-16 22:07                           ` Andi Kleen
  2010-07-16 22:26                             ` Linus Torvalds
  2010-07-16 22:40                             ` Mathieu Desnoyers
  2010-07-18  9:23                           ` Avi Kivity
  1 sibling, 2 replies; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 22:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Andi Kleen, Jeremy Fitzhardinge,
	Frank Ch. Eigler

> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.

One issue I have with nesting NMIs is that you need 
a nesting limit, otherwise you'll overflow the NMI stack.

We just got rid of nesting for normal interrupts because
of this stack overflow problem which hit in real situations.

In some cases you can get quite high NMI frequencies, e.g. with
performance counters. Now the current performance counter handlers
do not nest by themselves of course, but they might nest 
with other longer running NMI users.

I think none of the current handlers are likely to nest
for very long, but there's more and more NMI coded all the time,
so it's definitely a concern.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:02 [patch 2/2] x86 NMI-safe INT3 and Page Fault Jeffrey Merkey
@ 2010-07-16 22:22 ` Linus Torvalds
  2010-07-16 22:48   ` Jeffrey Merkey
  2010-07-16 22:50   ` Jeffrey Merkey
  0 siblings, 2 replies; 44+ messages in thread
From: Linus Torvalds @ 2010-07-16 22:22 UTC (permalink / raw)
  To: Jeffrey Merkey; +Cc: linux-kernel

On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>
> So Linus, my understanding of Intel's processor design is that the
> processor will NEVER signal a nested NMI until it sees an iret from
> the first NMI exception.

Wrong.

I like x86, but it has warts. The NMI blocking is one of them.

The NMI's will be blocked until the _next_ "iret", but there is no
nesting. So if you take a fault during the NMI (debug, page table
fixup, whatever), the iret in the faulthandler will re-enable NMI's
even though we're still busy with the original NMI. There is no
nesting, or any way to say that "this is a NMI-releasing iret". They
could even do it still - make a new "iret that doesn't clear NMI" by
adding a segment override prefix to iret or whatever. But it's not
going to happen, and it's just one of those ugly special cases that
has various historical reasons (recursive faults during NMI sure as
hell didn't make sense back in the real-mode 8086 days).

So we have to handle it in software. Or not ever trap at all inside
the NMI handler.

The original patch - and the patch I detest - is to make the normal
fault paths use a "popf + ret" to emulate iret, but without the NMI
release.
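Per the patch description, the emulation copies the return address to the previous stack, restores flags with popf, and returns with a near ret. The x86_64 flavor looks roughly like this (register use and frame offsets are illustrative, not the literal patch; scratch-register save/restore is omitted):

```asm
	/* Trap frame on entry: RIP, CS, RFLAGS, RSP, SS (CPU-pushed). */
	movq	0(%rsp), %rax		/* saved RIP */
	movq	24(%rsp), %rdx		/* saved RSP (previous stack) */
	subq	$8, %rdx
	movq	%rax, (%rdx)		/* park the return address there */
	pushq	16(%rsp)		/* saved RFLAGS */
	popfq				/* restore flags; NMI latch untouched */
	movq	%rdx, %rsp		/* switch back to the previous stack */
	ret				/* near return: unlike iret, NMIs stay blocked */
```

Unlike iret, this cannot restore RIP, RFLAGS, and RSP atomically, which is also why single-stepping (restoring TF) cannot be emulated this way.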

Now, I could live with that if it's the only solution, but it _is_
pretty damn ugly.

If somebody shows that it's actually faster to do "popf + ret" when
returning to kernel space (a poor man's special-case iret), maybe it
would be worth it, but the really critical code sequence is actually
not "return to kernel space", but the "return to user space" case that
really wants the iret. And I just think it's disgusting to add extra
tests to that path.

The other alternative would be to just make the rule be "NMI can never
take traps". It's possible to do that, but quite frankly, it's a pain.
It's a pain for page faults due to the whole vmalloc thing, and it's a
pain if you ever want to debug an NMI in any way (or put a breakpoint
on anything that is accessed from an NMI, which could potentially be
quite a lot of things).

If it was just the debug issue, I'd say "neener neener, debuggers are
for wimps", but it's clearly not just about debug. It's a whole lot of
other things. Random percpu data structures used for tracing, kernel
pointer verification code, yadda yadda.

                  Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:07                           ` Andi Kleen
@ 2010-07-16 22:26                             ` Linus Torvalds
  2010-07-16 22:41                               ` Andi Kleen
  2010-07-16 22:40                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 44+ messages in thread
From: Linus Torvalds @ 2010-07-16 22:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> One issue I have with nesting NMIs is that you need
> a nesting limit, otherwise you'll overflow the NMI stack.

Have you actually looked at the suggestion I (and now Mathieu)
suggested code for?

The nesting is very limited. NMI's would nest just once, and when that
happens, the nested NMI would never use more than something like a
hundred bytes of stack (most of which is what the CPU pushes
directly). And there would be no device interrupts that nest, and
practically the faults that nest obviously aren't going to be complex
faults either (ie the page fault would be the simple case that never
calls into 'handle_mm_fault()', but handles it all in
arch/x86/mm/fault.c).

IOW, there is absolutely _no_ issues with nesting. It's two levels
deep, and a much smaller stack footprint than our regular exception
nesting for those two levels too.

And your argument that there would be more and more NMI usage only
makes it more important that we handle NMI's without going crazy. Just
handle them cleanly instead of making them something totally special.

               Linus

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:07                           ` Andi Kleen
  2010-07-16 22:26                             ` Linus Torvalds
@ 2010-07-16 22:40                             ` Mathieu Desnoyers
  1 sibling, 0 replies; 44+ messages in thread
From: Mathieu Desnoyers @ 2010-07-16 22:40 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Avi Kivity, H. Peter Anvin, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Jeremy Fitzhardinge, Frank Ch. Eigler

* Andi Kleen (andi@firstfloor.org) wrote:
> > And the thing is, if we just do NMI's correctly, and allow nesting,
> > ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> > stupid things elsewhere.
> 
> One issue I have with nesting NMIs is that you need 
> a nesting limit, otherwise you'll overflow the NMI stack.
> 
> We just got rid of nesting for normal interrupts because
> of this stack overflow problem which hit in real situations.
> 
> In some cases you can get quite high NMI frequencies, e.g. with
> performance counters. Now the current performance counter handlers
> do not nest by themselves of course, but they might nest 
> with other longer running NMI users.
> 
> I think none of the current handlers are likely to nest
> for very long, but there's more and more NMI coded all the time,
> so it's definitely a concern.

We're not proposing to actually "nest" NMIs per se. We copy the stack at the
beginning of the NMI handler (and then use the copy) to permit nesting of faults
over NMI handlers. Following NMIs that would "try" to nest over the NMI handler
would see their regular execution postponed until the end of the currently
running NMI handler. It's OK for these "nested" NMI handlers to use the bottom
of NMI stack because the NMI handler on which they are trying to nest is only
using the stack copy. These "nested" handlers return to the original NMI handler
very early just after setting a "pending nmi" flag. There is more to it (e.g.
handling NMI handler exit atomically with respect to incoming NMIs); please
refer to the last assembly code snippet I sent to Linus a little earlier today
for details.

Thanks,

Mathieu


> 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:26                             ` Linus Torvalds
@ 2010-07-16 22:41                               ` Andi Kleen
  2010-07-17  1:15                                 ` Linus Torvalds
  0 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2010-07-16 22:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 03:26:32PM -0700, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 3:07 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >
> > One issue I have with nesting NMIs is that you need
> > a nesting limit, otherwise you'll overflow the NMI stack.
> 
> Have you actually looked at the suggestion I (and now Mathieu)
> suggested code for?

Maybe I'm misunderstanding everything (and it has been a lot of emails
in the thread), but the case I was thinking of would be if the second NMI 
faults too, and then another one comes in after the IRET etc.

-Andi

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:22 ` Linus Torvalds
@ 2010-07-16 22:48   ` Jeffrey Merkey
  2010-07-16 22:53     ` Jeffrey Merkey
  2010-07-16 22:50   ` Jeffrey Merkey
  1 sibling, 1 reply; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

On Fri, Jul 16, 2010 at 4:22 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Jul 16, 2010 at 3:02 PM, Jeffrey Merkey <jeffmerkey@gmail.com> wrote:
>>
>> So Linus, my understanding of Intel's processor design is that the
>> processor will NEVER singal a nested NMI until it sees an iret from
>> the first NMI exception.
>
> Wrong.
>
> I like x86, but it has warts. The NMI blocking is one of them.
>
> The NMI's will be blocked until the _next_ "iret", but there is no
> nesting. So if you take a fault during the NMI (debug, page table
> fixup, whatever), the iret in the faulthandler will re-enable NMI's
> even though we're still busy with the original NMI. There is no
> nesting, or any way to say that "this is a NMI-releasing iret". They
> could even do it still - make a new "iret that doesn't clear NMI" by
> adding a segment override prefix to iret or whatever. But it's not
> going to happen, and it's just one of those ugly special cases that
> has various historical reasons (recursive faults during NMI sure as
> hell didn't make sense back in the real-mode 8086 days).
>
> So we have to handle it in software. Or not ever trap at all inside
> the NMI handler.
>
> The original patch - and the patch I detest - is to make the normal
> fault paths use a "popf + ret" to emulate iret, but without the NMI
> release.
>
> Now, I could live with that if it's the only solution, but it _is_
> pretty damn ugly.
>
> If somebody shows that it's actually faster to do "popf + ret" when
> returning to kernel space (a poor man's special-case iret), maybe it
> would be worth it, but the really critical code sequence is actually
> not "return to kernel space", but the "return to user space" case that
> really wants the iret. And I just think it's disgusting to add extra
> tests to that path.
>
> The other alternative would be to just make the rule be "NMI can never
> take traps". It's possible to do that, but quite frankly, it's a pain.
> It's a pain for page faults due to the whole vmalloc thing, and it's a
> pain if you ever want to debug an NMI in any way (or put a breakpoint
> on anything that is accessed from an NMI, which could potentially be
> quite a lot of things).
>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other things. Random percpu data structures used for tracing, kernel
> pointer verification code, yadda yadda.
>
>                  Linus
>

Well, the way I handled this problem on NetWare SMP and that other
kernel was to create a pool of TSS descriptors and reload each during
the exception to swap stacks before any handlers were called.  Allowed
it to nest until I ran out of TSS descriptors (64 levels).  Not sure
that's the way to go here, but it worked in that case.

Jeff

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:22 ` Linus Torvalds
  2010-07-16 22:48   ` Jeffrey Merkey
@ 2010-07-16 22:50   ` Jeffrey Merkey
  1 sibling, 0 replies; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:50 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>
> If it was just the debug issue, I'd say "neener neener, debuggers are
> for wimps", but it's clearly not just about debug. It's a whole lot of
> other things. Random percpu data structures used for tracing, kernel
> pointer verification code, yadda yadda.
>
>                  Linus
>

I guess I am a wimp then ... :-)

Jeff

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:48   ` Jeffrey Merkey
@ 2010-07-16 22:53     ` Jeffrey Merkey
  0 siblings, 0 replies; 44+ messages in thread
From: Jeffrey Merkey @ 2010-07-16 22:53 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-kernel

>
> Well, the way I handled this problem on NetWare SMP and that other
> kernel was to create a pool of TSS descriptors and reload each during
> the exception to swap stacks before any handlers were called.  That allowed
> it to nest until I ran out of TSS descriptors (64 levels).  Not sure
> that's the way to go here, but it worked in that case.
>
> Jeff
>

Here is where that old dusty code lives these days - it deals with this problem.

http://open-source-netware.googlecode.com/files/manos-06-26-2010.tar.gz

file to look at is startup.386

;
;   nmi entry code
;

nmi_entry     macro
	cli
	push    ebx
	push    ebp
	mov     ebp, esp
	sub     ebp, SIZE TaskStateSegment
	mov     ebx, ebp

	mov     [ebp].tSS, ss
	mov     [ebp].tGS, gs         ; save segment registers
	mov     [ebp].tFS, fs
	mov     [ebp].tES, es
	mov     [ebp].tDS, ds
	pop     [ebp].tEBP
	mov     [ebp].tEDI, edi
	mov     [ebp].tESI, esi
	mov     [ebp].tEDX, edx
	mov     [ebp].tECX, ecx
	pop     [ebp].tEBX
	mov     [ebp].tEAX, eax

	pop     [ebp].tEIP            ; remove return address
	pop     eax
	mov     [ebp].tCS, ax
	pop     [ebp].tSystemFlags    ; get flags into TSS

	mov     [ebp].tESP, esp       ; save true stack address
	mov     esp, ebx            ; cover stack frame

	mov     eax, CR0
	and     eax, 0FFFFFFF7h     ; clear task switch bit in CR0 to
	mov     CR0, eax            ; avoid NPX exceptions

	xor	eax, eax
	mov	dr7, eax            ; disable breakpoints

	mov     eax, CR3            ;
	mov     [ebp].tCR3, eax     ;
	mov     eax, DebuggerPDE
	mov     CR3, eax

	;
	;   if we do not clear the NESTED_TASK_FLAG, then the IRET
	;   at the end of this function will cause
	;   an invalid TSS exception to be generated because the
	;   task busy bit was cleared earlier
	;

	pushfd
	and	dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
	or	dword ptr [esp], RESUME_FLAG
	popfd

	mov     eax, 0FFFFFFFFh    ; mark as a non-pooled TSS exception
	push    eax

	push    0
	push    0
	push    ebp

	endm

;
;   TSS entry code
;


task_entry     macro
	LOCAL   @TSSNotNested, @NoLTR
	LOCAL   @UsedDefaultSegment
	LOCAL   @UsedPooledSegment
	LOCAL   @EnterTheDebugger

	cli
	xor    eax, eax
	str    ax
	mov    esi, offset SystemGDTTable
	mov    esi, dword ptr [esi + 2]
	lea    ebx, [esi + eax]
	mov    al, [ebx].TSSBase2
	mov    ah, [ebx].TSSBase3
	shl    eax, 16
	mov    ax, [ebx].TSSBase1

	;
	;  eax -> TSS Segment (Current)
	;  ebx -> TSS Descriptor (Current)
	;

	movzx  ecx, word ptr [eax].tBackLink
	or     ecx, ecx
	jz     @TSSNotNested

	mov    esi, offset SystemGDTTable
	mov    esi, dword ptr [esi + 2]
	lea    edx, [esi + ecx]
	mov    cl, [edx].TSSBase2
	mov    ch, [edx].TSSBase3
	shl    ecx, 16
	mov    cx, [edx].TSSBase1

	mov    ebp, ecx

	;
	;  edx -> TSS Descriptor (Previous)
	;  ebp -> TSS Segment (Previous)
	;
	;  clear busy state and reset TSS
	;

	mov     [edx].TSSType, 10001001b

@TSSNotNested:
	mov     [ebx].TSSType, 10001001b

	lgdt    ds: SystemGDTTable     ; reset GDT TSS Busy bit

	movzx   eax, word ptr [eax].tBackLink
	or      eax, eax
	jz      @NoLTR

	ltr     ax

@NoLTR:

	mov     eax, CR0
	and     eax, 0FFFFFFF7h     ; clear task switch bit in CR0 to
	mov     CR0, eax            ; avoid NPX exceptions

	xor	eax, eax
	mov	dr7, eax            ; disable breakpoints

	pushfd
	and	dword ptr [esp], NOT (NESTED_TASK_FLAG OR SINGLE_STEP_FLAG)
	or	dword ptr [esp], RESUME_FLAG
	popfd

	push    ebp
	call    AllocPooledResource
	pop     ebp

	or      eax, eax
	jz      @UsedDefaultSegment

	lea     ebp, [eax].TSSSegment
	mov     esp, [eax].StackTop

	push    eax                   ; push address of pooled resource
	jmp     @UsedPooledSegment

@UsedDefaultSegment:
	mov     eax, 0FFFFFFFFh       ; push non-pooled marker onto the stack
	push    eax

@UsedPooledSegment:

	push    0
	mov     eax, CR2    ; get fault address
	push    eax
	push    ebp         ;  pass the TSS

	endm

;
;  TSS exit code
;

Jeff

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 22:41                               ` Andi Kleen
@ 2010-07-17  1:15                                 ` Linus Torvalds
  0 siblings, 0 replies; 44+ messages in thread
From: Linus Torvalds @ 2010-07-17  1:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Avi Kivity, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On Fri, Jul 16, 2010 at 3:41 PM, Andi Kleen <andi@firstfloor.org> wrote:
>
> Maybe I'm misunderstanding everything (and it has been a lot of emails
> in the thread), but the case I was thinking of would be if the second NMI
> faults too, and then another one comes in after the IRET etc.

No, the nested NMI cannot fault, because it never even enters C code.
It literally just returns immediately after having noticed it is
nested (and corrupted the stack of the original one, so that the
original NMI will re-do itself at return).

So the nested NMI will use only a few tens of bytes of stack. In fact,
it will use the stack "above" the stack that the original NMI handler
is using, because it will reset the stack pointer back to the top of
the NMI stack. So in a very real sense, it is not even extending the
stack, it is just re-using a small part of the same stack that the
original NMI used (and that we copied away so that it doesn't matter
that it gets re-used).

As to another small but important detail: the _nested_ NMI actually
returns using "popf+ret", leaving NMI's blocked again. Thus
guaranteeing forward progress and lack of NMI storms.

To summarize:

 - the "original" (first-level) NMI can take faults (like the page
fault to fill in vmalloc pages lazily, or debug faults). That will
actually cause two stack frames (or three, if you debug a page fault
that happened while NMI was active). So there is certainly exception
nesting going on, but we're talking _much_ less stack than normal
stack usage where the nesting can be deep and in complex routines.

 - any "nested" NMI's will not actually use any more stack at all than
a non-nested one, because we've pre-reserved space for them (and we
_had_ to reserve space for them due to IST)

 - even if we get NMI's during the execution of the original NMI,
there can be only one such "spurious" NMI per nested exception. So if
we take a single page fault, that exception will re-enable NMI
(because it returns with "iret"), and as a result we may take a
_single_ new nested NMI until we disable NMI's again.

In other words, the approach is not all that different from doing
"lazy irq disable" like powerpc does for regular interrupts. For
NMI's, we do it because it's impossible (on x86) to disable NMI's
without actually taking one.
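
The whole scheme can be condensed into a toy state machine (a sketch of
the idea only, nothing like the real entry-code assembly):

```python
class NMIModel:
    """Toy model of the nesting scheme summarized above: a nested NMI
    never runs the handler body; it only flags the first-level NMI to
    repeat itself and returns with NMIs still blocked (popf+ret)."""

    def __init__(self):
        self.in_nmi = False   # a first-level NMI handler is running
        self.repeat = False   # a nested NMI asked for a re-run

    def nmi(self):
        if self.in_nmi:
            # Nested case: "corrupting" the outer stack frame is
            # modeled by just setting the repeat flag; NMIs stay
            # blocked on the way out, so no NMI storm is possible.
            self.repeat = True
            return "nested"
        self.in_nmi = True
        runs = 0
        while True:
            runs += 1             # execute the real handler body once
            if not self.repeat:
                break
            self.repeat = False   # one re-run per nested NMI: progress
        self.in_nmi = False
        return runs
```

A fault inside the handler returns with iret and may let exactly one
nested NMI in; the model then re-runs the handler body once and stops.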

                         Linus

* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 21:39                         ` Linus Torvalds
  2010-07-16 22:07                           ` Andi Kleen
@ 2010-07-18  9:23                           ` Avi Kivity
  1 sibling, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2010-07-18  9:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Mathieu Desnoyers, LKML, Andrew Morton,
	Ingo Molnar, Peter Zijlstra, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/17/2010 12:39 AM, Linus Torvalds wrote:
> On Fri, Jul 16, 2010 at 12:26 PM, Avi Kivity<avi@redhat.com>  wrote:
>    
>> On 07/16/2010 09:37 PM, Linus Torvalds wrote:
>>      
>>> Non-NMI code should simply never have to even _think_ about NMI's. Why
>>> should it? It should just do whatever comes "natural" within its own
>>> context.
>>>        
>> But we're not talking about non-NMI code.
>>      
> Yes, we are. We're talking about breakpoints (look at the subject
> line), and you are very much talking about things like that _idiotic_
> vmalloc_sync_all() by module loading code etc etc.
>    

Well, I'd put it in the NMI handler registration code, but you're
right.  A user placing breakpoints can't even tell whether the
breakpoint will be hit by NMI code, especially data breakpoints.

> Every _single_ "solution" I have seen - apart from my suggestion - has
> been about making code "special" because some other code might run in
> an NMI. Module init sequences having to do idiotic things just because
> they have data structures that might get accessed by NMI.
>
> And the thing is, if we just do NMI's correctly, and allow nesting,
> ALL THOSE PROBLEMS GO AWAY. And there is no reason what-so-ever to do
> stupid things elsewhere.
>
> In other words, why the hell are you arguing? Help Mathieu write the
> low-level NMI handler right, and remove that idiotic
> "vmalloc_sync_all()" that is fundamentally broken and should not
> exist. Rather than talk about adding more of that kind of crap.
>    

Well, at least we'll get a good test case for kvm's NMI blocking
emulation (it's tricky since, if we fault on an iret, NMIs sometimes
get unblocked even though the instruction did not complete).

-- 
error compiling committee.c: too many arguments to function


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 19:30                   ` Andi Kleen
@ 2010-07-18  9:26                     ` Avi Kivity
  0 siblings, 0 replies; 44+ messages in thread
From: Avi Kivity @ 2010-07-18  9:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, H. Peter Anvin, Mathieu Desnoyers, LKML,
	Andrew Morton, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Steven Rostedt, Frederic Weisbecker, Thomas Gleixner,
	Christoph Hellwig, Li Zefan, Lai Jiangshan, Johannes Berg,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Tom Zanussi,
	KOSAKI Motohiro, Jeremy Fitzhardinge, Frank Ch. Eigler

On 07/16/2010 10:30 PM, Andi Kleen wrote:
> We already have infrastructure for kprobes to prevent breakpoints
> on critical code (the __kprobes section). In principle kgdb/kdb
> could be taught about honoring those too.
>
>    

It doesn't help with NMI code calling other functions, or with data 
breakpoints.

-- 
error compiling committee.c: too many arguments to function


* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-07-16 18:05             ` H. Peter Anvin
  2010-07-16 18:15               ` Avi Kivity
  2010-07-16 19:28               ` Andi Kleen
@ 2010-08-04  9:46               ` Peter Zijlstra
  2010-08-04 20:23                 ` H. Peter Anvin
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2010-08-04  9:46 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler,
	David Howells

On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
> 
> I really hope no one ever gets the idea of touching user space from an
> NMI handler, though, and expecting it to work... 

Perf actually already does that to unwind user-space stacks... ;-)

See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.

What we do is a manual page table walk (using __get_user_pages_fast) and
simply bail when the page is not available.
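
The bail-out behaviour amounts to something like this (a simplified
sketch, not the actual copy_from_user_nmi(); the page-table walk is
faked with a dict):

```python
PAGE_SIZE = 4096

def copy_from_user_nmi(pages, addr, nbytes):
    """Copy nbytes starting at addr, stopping at the first page the
    (faked) page walk cannot resolve -- never fault, just bail.
    `pages` maps page frame number -> PAGE_SIZE bytes of data.
    Returns (bytes_not_copied, data), echoing the kernel helper's
    convention of reporting how much was left uncopied."""
    out = bytearray()
    remaining = nbytes
    while remaining:
        pfn, offset = divmod(addr, PAGE_SIZE)
        page = pages.get(pfn)        # stands in for __get_user_pages_fast
        if page is None:
            break                    # page not present: bail out early
        chunk = min(remaining, PAGE_SIZE - offset)
        out += page[offset:offset + chunk]
        addr += chunk
        remaining -= chunk
    return remaining, bytes(out)
```

A copy that crosses into a missing page returns whatever was copied up
to the boundary instead of taking a page fault from NMI context.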

That said, I think that the thing that started the whole
per-cpu-per-context temp stack-frame storage story also means that that
function is now broken and can lead to kmap_atomic corruption.

I really should brush up that stack-based kmap_atomic thing; last time I
got stuck on FRV wanting things.

Linus, should I refresh that whole series, give FRV a slow but working
implementation, and then let David Howells sort things out if he cares
about that?



* Re: [patch 2/2] x86 NMI-safe INT3 and Page Fault
  2010-08-04  9:46               ` Peter Zijlstra
@ 2010-08-04 20:23                 ` H. Peter Anvin
  0 siblings, 0 replies; 44+ messages in thread
From: H. Peter Anvin @ 2010-08-04 20:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Avi Kivity, Mathieu Desnoyers, LKML, Linus Torvalds,
	Andrew Morton, Ingo Molnar, Steven Rostedt, Steven Rostedt,
	Frederic Weisbecker, Thomas Gleixner, Christoph Hellwig, Li Zefan,
	Lai Jiangshan, Johannes Berg, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Tom Zanussi, KOSAKI Motohiro,
	Andi Kleen, akpm, Jeremy Fitzhardinge, Frank Ch. Eigler,
	David Howells

On 08/04/2010 02:46 AM, Peter Zijlstra wrote:
> On Fri, 2010-07-16 at 11:05 -0700, H. Peter Anvin wrote:
>>
>> I really hope no one ever gets the idea of touching user space from an
>> NMI handler, though, and expecting it to work... 
> 
> Perf actually already does that to unwind user-space stacks... ;-)
> 
> See arch/x86/kernel/cpu/perf_event.c:copy_from_user_nmi() and its users.
> 
> What we do is a manual page table walk (using __get_user_pages_fast) and
> simply bail when the page is not available.
> 

That's not really "touching user space", though.

	-hpa

end of thread, other threads:[~2010-08-04 20:27 UTC | newest]

Thread overview: 44+ messages
-- links below jump to the message on this page --
2010-07-16 22:02 [patch 2/2] x86 NMI-safe INT3 and Page Fault Jeffrey Merkey
2010-07-16 22:22 ` Linus Torvalds
2010-07-16 22:48   ` Jeffrey Merkey
2010-07-16 22:53     ` Jeffrey Merkey
2010-07-16 22:50   ` Jeffrey Merkey
  -- strict thread matches above, loose matches on Subject: below --
2010-07-14 15:49 [patch 0/2] x86: NMI-safe trap handlers Mathieu Desnoyers
2010-07-14 15:49 ` [patch 2/2] x86 NMI-safe INT3 and Page Fault Mathieu Desnoyers
2010-07-14 16:42   ` Maciej W. Rozycki
2010-07-14 18:12     ` Mathieu Desnoyers
2010-07-14 19:21       ` Maciej W. Rozycki
2010-07-14 19:58         ` Mathieu Desnoyers
2010-07-14 20:36           ` Maciej W. Rozycki
2010-07-16 12:28   ` Avi Kivity
2010-07-16 14:49     ` Mathieu Desnoyers
2010-07-16 15:34       ` Andi Kleen
2010-07-16 15:40         ` Mathieu Desnoyers
2010-07-16 16:47       ` Avi Kivity
2010-07-16 16:58         ` Mathieu Desnoyers
2010-07-16 17:54           ` Avi Kivity
2010-07-16 18:05             ` H. Peter Anvin
2010-07-16 18:15               ` Avi Kivity
2010-07-16 18:17                 ` H. Peter Anvin
2010-07-16 18:28                   ` Avi Kivity
2010-07-16 18:37                     ` Linus Torvalds
2010-07-16 19:26                       ` Avi Kivity
2010-07-16 21:39                         ` Linus Torvalds
2010-07-16 22:07                           ` Andi Kleen
2010-07-16 22:26                             ` Linus Torvalds
2010-07-16 22:41                               ` Andi Kleen
2010-07-17  1:15                                 ` Linus Torvalds
2010-07-16 22:40                             ` Mathieu Desnoyers
2010-07-18  9:23                           ` Avi Kivity
2010-07-16 18:22                 ` Mathieu Desnoyers
2010-07-16 18:32                   ` Avi Kivity
2010-07-16 19:29                     ` H. Peter Anvin
2010-07-16 19:39                       ` Avi Kivity
2010-07-16 19:32                     ` Andi Kleen
2010-07-16 18:25                 ` Linus Torvalds
2010-07-16 19:30                   ` Andi Kleen
2010-07-18  9:26                     ` Avi Kivity
2010-07-16 19:28               ` Andi Kleen
2010-07-16 19:32                 ` Avi Kivity
2010-07-16 19:34                   ` Andi Kleen
2010-08-04  9:46               ` Peter Zijlstra
2010-08-04 20:23                 ` H. Peter Anvin
