Re: ESP corruption bug - what CPUs are affected? (patch att

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: ESP corruption bug - what CPUs are affected? (patch att
@ 2004-10-06 17:18 Petr Vandrovec
  2004-10-06 19:04 ` Stas Sergeev
  2004-10-11 18:32 ` ESP corruption bug - what CPUs are affected? Stas Sergeev
  0 siblings, 2 replies; 38+ messages in thread
From: Petr Vandrovec @ 2004-10-06 17:18 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel, Denis Vlasenko

On  6 Oct 04 at 20:18, Stas Sergeev wrote:
> Yes, if not for that anonymous guy, who kept posting
> to me until he finally convinced me that the Ring-0
> approach is not that difficult at all.
> So I tried... It was much more difficult to code
> up, but at the end it looks a little better
> and localized to entry.S completely. OTOH it
> touches the exception handlers, but not too much -
> it adds only 5 insns on the fast path. And the
> code is very fragile, but after I made all the
> magic numbers a #define consts, it actually looks
> not so bad.
> I don't know which patch is really better, so
> I am attaching both.

CPL0 solution is certainly more localized, but I have hard problems
to convice myself that it is actually safe.

I would appreciate if you could add comments what values are set
by ESPFIX_SWITCH_16 + 8 + 4 and simillar moves, and what they actually
do.  And convicing myself that ESPFIX_SWITCH_32 has just right value so

pushl %eax
pushl %es
lss ESPFIX_SWITCH_32,%esp
popl %es
popl %eax

actually works took almost an hour...
                                                    Petr
                                                    


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected? (patch att
  2004-10-06 17:18 ESP corruption bug - what CPUs are affected? (patch att Petr Vandrovec
@ 2004-10-06 19:04 ` Stas Sergeev
  2004-10-11 18:32 ` ESP corruption bug - what CPUs are affected? Stas Sergeev
  1 sibling, 0 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-10-06 19:04 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel, Denis Vlasenko

[-- Attachment #1: Type: text/plain, Size: 1597 bytes --]

Hi.

Petr Vandrovec wrote:
> CPL0 solution is certainly more localized, but I have hard problems
> to convice myself that it is actually safe.
I spent 2 days convincing myself the same way:)
The most problematic part was to make sure that
the stack is properly unwinded even if NMI comes
before the exception handler managed to switch
out of 16bit. But I think this is now handled.

> I would appreciate if you could add comments what values are set
OK, I did. But in fact that makes the patch
only even more obfuscated:( It doesn't look
possible to explain all the magic pattern and
its constraints in a comments. But I tried:)
The patch is attached.

> by ESPFIX_SWITCH_16 + 8 + 4 and simillar moves, and what they actually
> do.
Fortunately there are no such moves.
In an attempt to make the patch a little
self-explanatory, I #define'd all the meaningfull
values. So if there is a move to
ESPFIX_SWITCH16_OFFS+some_value, it is safe to
assume that the move is intended to load the
ESPFIX_SWITCH16, and "some_value" is just the
correction constant. So besides the 2 magic
pointers, there are really no moves above the
iret frame.

> And convicing myself that ESPFIX_SWITCH_32 has just right value so
> pushl %eax
> pushl %es
> lss ESPFIX_SWITCH_32,%esp
> popl %es
> popl %eax
> actually works took almost an hour...
I realize that and thats really the big problem
of that patch. It is very obfuscated and difficult
to understand.
I guess if you see the first version of that
patch, which was before I found the way to use
the fixed offsets for locating the switches,
you might just get sick:)


[-- Attachment #2: linux-2.6.8-stk0-2a.diff --]
[-- Type: text/x-patch, Size: 8285 bytes --]

diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
--- linux-2.6.8-pcsp/arch/i386/kernel/entry.S	2004-06-26 15:20:23.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/entry.S	2004-10-06 22:39:43.084860592 +0400
@@ -78,7 +78,7 @@
 #define preempt_stop		cli
 #else
 #define preempt_stop
-#define resume_kernel		restore_all
+#define resume_kernel		restore_context
 #endif
 
 #define SAVE_ALL \
@@ -122,25 +122,6 @@
 .previous
 
 
-#define RESTORE_ALL	\
-	RESTORE_REGS	\
-	addl $4, %esp;	\
-1:	iret;		\
-.section .fixup,"ax";   \
-2:	sti;		\
-	movl $(__USER_DS), %edx; \
-	movl %edx, %ds; \
-	movl %edx, %es; \
-	pushl $11;	\
-	call do_exit;	\
-.previous;		\
-.section __ex_table,"a";\
-	.align 4;	\
-	.long 1b,2b;	\
-.previous
-
-
-
 ENTRY(lcall7)
 	pushfl			# We get a different stack layout with call
 				# gates, which has to be cleaned up later..
@@ -197,7 +178,7 @@
 	movl EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb CS(%esp), %al
 	testl $(VM_MASK | 3), %eax
-	jz resume_kernel		# returning to kernel or vm86-space
+	jz resume_kernel
 ENTRY(resume_userspace)
  	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
@@ -211,7 +192,7 @@
 #ifdef CONFIG_PREEMPT
 ENTRY(resume_kernel)
 	cmpl $0,TI_preempt_count(%ebp)	# non-zero preempt_count ?
-	jnz restore_all
+	jnz restore_context
 need_resched:
 	movl TI_flags(%ebp), %ecx	# need_resched set ?
 	testb $_TIF_NEED_RESCHED, %cl
@@ -293,8 +274,102 @@
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx	# current->work
 	jne syscall_exit_work
+
 restore_all:
-	RESTORE_ALL
+	/* If returning to Ring-3, not to V86, and with
+	 * the small stack, try to fix the higher word of
+	 * ESP, as the CPU won't restore it from stack.
+	 * This is an "official" bug of all the x86-compatible
+	 * CPUs, which we can try to work around to make
+	 * dosemu happy. */
+	movl EFLAGS(%esp), %eax
+	movb CS(%esp), %al
+	andl $(VM_MASK | 3), %eax
+	/* returning to user-space and not to v86? */
+	cmpl $3, %eax
+	jne restore_context
+	larl OLDSS(%esp), %eax
+	/* returning to "small" stack? */
+	testl $0x00400000, %eax
+	jz 8f		# go fixing ESP instead of just to return :(
+restore_context:
+	RESTORE_REGS
+	addl $4, %esp
+1:	iret
+.section .fixup,"ax"
+iret_exc:
+	sti
+	movl $(__USER_DS), %edx
+	movl %edx, %ds
+	movl %edx, %es
+	pushl $11
+	call do_exit
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,iret_exc
+.previous
+	/* preparing the ESPfix here */
+/* FRAME_SIZE_ESTIM is the size of an exception stack frame + 8 bytes
+ * for 2 pushes - need to save 2 registers on stack */
+#define FRAME_SIZE_ESTIM 24
+/* ESPFIX_STACK_RESERVE is the value that goes into SP after switch to 16bit */
+#define ESPFIX_STACK_RESERVE (FRAME_SIZE_ESTIM * 3)
+#define SP_ESTIMATE (ESPFIX_STACK_RESERVE - FRAME_SIZE_ESTIM)
+/* iret frame takes 20 bytes.
+ * We hide 2 magic pointers above it, 8 bytes each (seg:off). */
+#define ESPFIX_SWITCH16_OFFS 20
+#define ESPFIX_SWITCH32_OFFS 28
+#define ESPFIX_SWITCH16 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH16_OFFS)
+#define ESPFIX_SWITCH32 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH32_OFFS)
+8:	RESTORE_REGS
+	addl $4, %esp
+	/* reserve the room for 2 magic pointers (16 bytes) and 4 bytes wasted. */
+	.rept 5
+	pushl 16(%esp)
+	.endr
+	/* Preserve some registers. That makes all the +8 offsets below. */
+	pushl %eax
+	pushl %ebp
+	/* Set up the SWITCH16 magic pointer. */
+	movl $__ESPFIX_SS, ESPFIX_SWITCH16_OFFS+8+4(%esp)
+	/* Get the "old" ESP */
+	movl 8+12(%esp), %eax
+	/* replace the lower word of "old" ESP with our magic value */
+	movw $ESPFIX_STACK_RESERVE, %ax
+	movl %eax, ESPFIX_SWITCH16_OFFS+8(%esp)
+	/* SWITCH16 ready */
+	/* Now set up the SWITCH32 magic pointer. */
+	movl $__KERNEL_DS, ESPFIX_SWITCH32_OFFS+8+4(%esp)
+	/* ESP must match the SP_ESTIMATE,
+	 * which is FRAME_SIZE_ESTIM below the iret frame */
+	leal -FRAME_SIZE_ESTIM+8(%esp), %eax
+	movl %eax, ESPFIX_SWITCH32_OFFS+8(%esp)
+	/* SWITCH32 ready */
+	/* %eax will make the base of __ESPFIX_SS.
+	 * It have to be ESPFIX_STACK_RESERVE bytes below the iret frame. */
+	subl $(ESPFIX_STACK_RESERVE-FRAME_SIZE_ESTIM), %eax
+	cli
+	GET_THREAD_INFO(%ebp)
+	movl TI_cpu(%ebp), %ebp
+	shll $GDT_SIZE_SHIFT, %ebp
+	/* find GDT of the proper CPU */
+	addl $cpu_gdt_table, %ebp
+	/* patch the base of the 16bit stack */
+	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp)
+	shrl $16, %eax
+	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp)
+	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp)
+	popl %ebp
+	popl %eax
+espfix_before_lss:
+	lss ESPFIX_SWITCH16_OFFS(%esp), %esp
+espfix_after_lss:
+	iret
+.section __ex_table,"a"
+	.align 4
+	.long espfix_after_lss,iret_exc
+.previous
 
 	# perform work that needs to be done immediately before resumption
 	ALIGN
@@ -493,6 +568,36 @@
 	pushl $do_debug
 	jmp error_code
 
+/* Check if we are on 16bit stack. Can happen either on iret of ESPFIX,
+ * or in an exception handler after that iret... */
+#define CHECK_ESPFIX_STACK(label) \
+	pushl %eax; \
+	movl %ss, %eax; \
+	cmpl $__ESPFIX_SS, %eax; \
+	jne label;
+/* using %es here because NMI can change %ss */
+#define UNWIND_ESPFIX_STACK \
+	CHECK_ESPFIX_STACK(28f) \
+	pushl %es; \
+	movl %eax, %es; \
+	lss %es:ESPFIX_SWITCH32, %esp; \
+	popl %es; \
+28:	popl %eax;
+/* have to compensate the difference between real SP and estimated. */
+#define UNWIND_ESPFIX_STACK_NMI \
+	CHECK_ESPFIX_STACK(28f) \
+	movw $SP_ESTIMATE, %ax; \
+	subw %sp, %ax; \
+	movswl %ax, %eax; \
+	lss %ss:ESPFIX_SWITCH32, %esp; \
+	subl %eax, %esp; \
+	popl %eax; \
+	cmpl $espfix_after_lss, (%esp); \
+	jne nmi_stack_correct; \
+	movl $espfix_before_lss, (%esp); \
+	jmp nmi_stack_correct; \
+28:	popl %eax;
+
 /*
  * NMI is doubly nasty. It can happen _while_ we're handling
  * a debug fault, and the debug fault hasn't yet been able to
@@ -502,6 +607,7 @@
  * fault happened on the sysenter path.
  */
 ENTRY(nmi)
+	UNWIND_ESPFIX_STACK_NMI
 	cmpl $sysenter_entry,(%esp)
 	je nmi_stack_fixup
 	pushl %eax
@@ -523,7 +629,7 @@
 	pushl %edx
 	call do_nmi
 	addl $8, %esp
-	RESTORE_ALL
+	jmp restore_all
 
 nmi_stack_fixup:
 	FIX_STACK(12,nmi_stack_correct, 1)
@@ -569,14 +675,17 @@
 	jmp error_code
 
 ENTRY(segment_not_present)
+	UNWIND_ESPFIX_STACK
 	pushl $do_segment_not_present
 	jmp error_code
 
 ENTRY(stack_segment)
+	UNWIND_ESPFIX_STACK
 	pushl $do_stack_segment
 	jmp error_code
 
 ENTRY(general_protection)
+	UNWIND_ESPFIX_STACK
 	pushl $do_general_protection
 	jmp error_code
 
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S
--- linux-2.6.8-pcsp/arch/i386/kernel/head.S	2004-07-17 23:31:24.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/head.S	2004-10-06 21:58:04.873646424 +0400
@@ -514,7 +514,7 @@
 	.quad 0x00009a0000000000	/* 0xc0 APM CS 16 code (16 bit) */
 	.quad 0x0040920000000000	/* 0xc8 APM DS    data */
 
-	.quad 0x0000000000000000	/* 0xd0 - unused */
+	.quad 0x000092000000ffff	/* 0xd0 - ESPfix 16-bit SS */
 	.quad 0x0000000000000000	/* 0xd8 - unused */
 	.quad 0x0000000000000000	/* 0xe0 - unused */
 	.quad 0x0000000000000000	/* 0xe8 - unused */
diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h
--- linux-2.6.8-pcsp/include/asm-i386/segment.h	2004-01-09 09:59:19.000000000 +0300
+++ linux-2.6.8-stacks/include/asm-i386/segment.h	2004-10-06 21:58:04.000000000 +0400
@@ -38,7 +38,7 @@
  *  24 - APM BIOS support
  *  25 - APM BIOS support 
  *
- *  26 - unused
+ *  26 - ESPfix small SS
  *  27 - unused
  *  28 - unused
  *  29 - unused
@@ -71,14 +71,19 @@
 #define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 6)
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
+#define GDT_ENTRY_ESPFIX_SS		(GDT_ENTRY_KERNEL_BASE + 14)
+#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8)
+
 #define GDT_ENTRY_DOUBLEFAULT_TSS	31
 
 /*
  * The GDT has 32 entries
  */
-#define GDT_ENTRIES 32
-
-#define GDT_SIZE (GDT_ENTRIES * 8)
+#define GDT_ENTRIES_SHIFT 5
+#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT)
+#define GDT_ENTRY_SHIFT 3
+#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT)
+#define GDT_SIZE (1 << GDT_SIZE_SHIFT)
 
 /* Simple and small GDT entries for booting only */
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-10-06 17:18 ESP corruption bug - what CPUs are affected? (patch att Petr Vandrovec
  2004-10-06 19:04 ` Stas Sergeev
@ 2004-10-11 18:32 ` Stas Sergeev
  1 sibling, 0 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-10-11 18:32 UTC (permalink / raw)
  To: linux-kernel; +Cc: Petr Vandrovec

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]

Hi.

I've been pointed out that my prev ring-0
patch was not completely safe. I was
doing "popl %es". I thought it is save
because RESTORE_REGS also does that
with the fixup in place, but the anonymous
guy thinks that if %es refers to LDT and
the thread on another CPU changes that LDT
entry in a mean time, my "popl %es" can GPF.
So I have to avoid popping any segregs.
I moved my recovery code to error_code,
right after %es is used last time and before
the %esp is used directly (I am lucky such
a place exist there!).
New patch is attached. Does it look safe
this time?



[-- Attachment #2: linux-2.6.8-stk0-3.diff --]
[-- Type: text/x-patch, Size: 8035 bytes --]

diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
--- linux-2.6.8-pcsp/arch/i386/kernel/entry.S	2004-06-26 15:20:23.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/entry.S	2004-10-09 20:55:36.772613096 +0400
@@ -78,7 +78,7 @@
 #define preempt_stop		cli
 #else
 #define preempt_stop
-#define resume_kernel		restore_all
+#define resume_kernel		restore_context
 #endif
 
 #define SAVE_ALL \
@@ -122,25 +122,6 @@
 .previous
 
 
-#define RESTORE_ALL	\
-	RESTORE_REGS	\
-	addl $4, %esp;	\
-1:	iret;		\
-.section .fixup,"ax";   \
-2:	sti;		\
-	movl $(__USER_DS), %edx; \
-	movl %edx, %ds; \
-	movl %edx, %es; \
-	pushl $11;	\
-	call do_exit;	\
-.previous;		\
-.section __ex_table,"a";\
-	.align 4;	\
-	.long 1b,2b;	\
-.previous
-
-
-
 ENTRY(lcall7)
 	pushfl			# We get a different stack layout with call
 				# gates, which has to be cleaned up later..
@@ -197,7 +178,7 @@
 	movl EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb CS(%esp), %al
 	testl $(VM_MASK | 3), %eax
-	jz resume_kernel		# returning to kernel or vm86-space
+	jz resume_kernel
 ENTRY(resume_userspace)
  	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
@@ -211,7 +192,7 @@
 #ifdef CONFIG_PREEMPT
 ENTRY(resume_kernel)
 	cmpl $0,TI_preempt_count(%ebp)	# non-zero preempt_count ?
-	jnz restore_all
+	jnz restore_context
 need_resched:
 	movl TI_flags(%ebp), %ecx	# need_resched set ?
 	testb $_TIF_NEED_RESCHED, %cl
@@ -293,8 +274,100 @@
 	movl TI_flags(%ebp), %ecx
 	testw $_TIF_ALLWORK_MASK, %cx	# current->work
 	jne syscall_exit_work
+
 restore_all:
-	RESTORE_ALL
+	/* If returning to Ring-3, not to V86, and with
+	 * the small stack, try to fix the higher word of
+	 * ESP, as the CPU won't restore it from stack.
+	 * This is an "official" bug of all the x86-compatible
+	 * CPUs, which we can try to work around to make
+	 * dosemu happy. */
+	movl EFLAGS(%esp), %eax
+	movb CS(%esp), %al
+	andl $(VM_MASK | 3), %eax
+	/* returning to user-space and not to v86? */
+	cmpl $3, %eax
+	jne restore_context
+	larl OLDSS(%esp), %eax
+	/* returning to "small" stack? */
+	testl $0x00400000, %eax
+	jz 8f		# go fixing ESP instead of just to return :(
+restore_context:
+	RESTORE_REGS
+	addl $4, %esp
+1:	iret
+.section .fixup,"ax"
+iret_exc:
+	sti
+	movl $(__USER_DS), %edx
+	movl %edx, %ds
+	movl %edx, %es
+	pushl $11
+	call do_exit
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,iret_exc
+.previous
+	/* preparing the ESPfix here */
+/* FRAME_SIZE_ESTIM is the size of an exception stack frame + 40 bytes
+ * for 10 pushes */
+#define FRAME_SIZE_ESTIM (16 + 10 * 4)
+/* ESPFIX_STACK_RESERVE is the value that goes into SP after switch to 16bit */
+#define ESPFIX_STACK_RESERVE (FRAME_SIZE_ESTIM * 3)
+#define SP_ESTIMATE (ESPFIX_STACK_RESERVE - FRAME_SIZE_ESTIM)
+/* iret frame takes 20 bytes.
+ * We hide 2 magic pointers above it, 8 bytes each (seg:off). */
+#define ESPFIX_SWITCH16_OFFS 20
+#define ESPFIX_SWITCH32_OFFS 28
+#define ESPFIX_SWITCH16 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH16_OFFS)
+#define ESPFIX_SWITCH32 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH32_OFFS)
+8:	RESTORE_REGS
+	addl $4, %esp
+	/* reserve the room for 2 magic pointers (16 bytes) and 4 bytes wasted. */
+	.rept 5
+	pushl 16(%esp)
+	.endr
+	/* Preserve some registers. That makes all the +8 offsets below. */
+	pushl %eax
+	pushl %ebp
+	/* Set up the SWITCH16 magic pointer. */
+	movl $__ESPFIX_SS, ESPFIX_SWITCH16_OFFS+8+4(%esp)
+	/* Get the "old" ESP */
+	movl 8+12(%esp), %eax
+	/* replace the lower word of "old" ESP with our magic value */
+	movw $ESPFIX_STACK_RESERVE, %ax
+	movl %eax, ESPFIX_SWITCH16_OFFS+8(%esp)
+	/* Now set up the SWITCH32 magic pointer. */
+	movl $__KERNEL_DS, ESPFIX_SWITCH32_OFFS+8+4(%esp)
+	/* ESP must match the SP_ESTIMATE,
+	 * which is FRAME_SIZE_ESTIM bytes below the iret frame */
+	leal -FRAME_SIZE_ESTIM+8(%esp), %eax
+	movl %eax, ESPFIX_SWITCH32_OFFS+8(%esp)
+	/* %eax will make the base of __ESPFIX_SS.
+	 * It have to be ESPFIX_STACK_RESERVE bytes below the iret frame. */
+	subl $(ESPFIX_STACK_RESERVE-FRAME_SIZE_ESTIM), %eax
+	cli
+	GET_THREAD_INFO(%ebp)
+	movl TI_cpu(%ebp), %ebp
+	shll $GDT_SIZE_SHIFT, %ebp
+	/* find GDT of the proper CPU */
+	addl $cpu_gdt_table, %ebp
+	/* patch the base of the 16bit stack */
+	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp)
+	shrl $16, %eax
+	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp)
+	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp)
+	popl %ebp
+	popl %eax
+espfix_before_lss:
+	lss ESPFIX_SWITCH16_OFFS(%esp), %esp
+espfix_after_lss:
+	iret
+.section __ex_table,"a"
+	.align 4
+	.long espfix_after_lss,iret_exc
+.previous
 
 	# perform work that needs to be done immediately before resumption
 	ALIGN
@@ -396,6 +469,34 @@
 	call do_IRQ
 	jmp ret_from_intr
 
+/* Check if we are on 16bit stack. Can happen either on iret of ESPFIX,
+ * or in an exception handler after that iret... */
+#define CHECK_ESPFIX_STACK(label) \
+	pushl %eax; \
+	movl %ss, %eax; \
+	cmpw $__ESPFIX_SS, %ax; \
+	jne label;
+/* using %es here because NMI can change %ss */
+#define UNWIND_ESPFIX_STACK \
+	CHECK_ESPFIX_STACK(28f) \
+	movl %eax, %es; \
+	lss %es:ESPFIX_SWITCH32, %esp; \
+28:	popl %eax;
+/* have to compensate the difference between real SP and estimated. */
+#define UNWIND_ESPFIX_STACK_NMI \
+	CHECK_ESPFIX_STACK(28f) \
+	movw $SP_ESTIMATE, %ax; \
+	subw %sp, %ax; \
+	movswl %ax, %eax; \
+	lss %ss:ESPFIX_SWITCH32, %esp; \
+	subl %eax, %esp; \
+	popl %eax; \
+	cmpl $espfix_after_lss, (%esp); \
+	jne nmi_stack_correct; \
+	movl $espfix_before_lss, (%esp); \
+	jmp nmi_stack_correct; \
+28:	popl %eax;
+
 #define BUILD_INTERRUPT(name, nr)	\
 ENTRY(name)				\
 	pushl $nr-256;			\
@@ -423,6 +524,7 @@
 	pushl %ebx
 	cld
 	movl %es, %ecx
+	UNWIND_ESPFIX_STACK		# this corrupts %es
 	movl ORIG_EAX(%esp), %esi	# get the error code
 	movl ES(%esp), %edi		# get the function address
 	movl %eax, ORIG_EAX(%esp)
@@ -502,6 +604,7 @@
  * fault happened on the sysenter path.
  */
 ENTRY(nmi)
+	UNWIND_ESPFIX_STACK_NMI
 	cmpl $sysenter_entry,(%esp)
 	je nmi_stack_fixup
 	pushl %eax
@@ -523,7 +626,7 @@
 	pushl %edx
 	call do_nmi
 	addl $8, %esp
-	RESTORE_ALL
+	jmp restore_all
 
 nmi_stack_fixup:
 	FIX_STACK(12,nmi_stack_correct, 1)
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S
--- linux-2.6.8-pcsp/arch/i386/kernel/head.S	2004-07-17 23:31:24.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/head.S	2004-10-06 21:58:04.000000000 +0400
@@ -514,7 +514,7 @@
 	.quad 0x00009a0000000000	/* 0xc0 APM CS 16 code (16 bit) */
 	.quad 0x0040920000000000	/* 0xc8 APM DS    data */
 
-	.quad 0x0000000000000000	/* 0xd0 - unused */
+	.quad 0x000092000000ffff	/* 0xd0 - ESPfix 16-bit SS */
 	.quad 0x0000000000000000	/* 0xd8 - unused */
 	.quad 0x0000000000000000	/* 0xe0 - unused */
 	.quad 0x0000000000000000	/* 0xe8 - unused */
diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h
--- linux-2.6.8-pcsp/include/asm-i386/segment.h	2004-01-09 09:59:19.000000000 +0300
+++ linux-2.6.8-stacks/include/asm-i386/segment.h	2004-10-06 21:58:04.000000000 +0400
@@ -38,7 +38,7 @@
  *  24 - APM BIOS support
  *  25 - APM BIOS support 
  *
- *  26 - unused
+ *  26 - ESPfix small SS
  *  27 - unused
  *  28 - unused
  *  29 - unused
@@ -71,14 +71,19 @@
 #define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 6)
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
+#define GDT_ENTRY_ESPFIX_SS		(GDT_ENTRY_KERNEL_BASE + 14)
+#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8)
+
 #define GDT_ENTRY_DOUBLEFAULT_TSS	31
 
 /*
  * The GDT has 32 entries
  */
-#define GDT_ENTRIES 32
-
-#define GDT_SIZE (GDT_ENTRIES * 8)
+#define GDT_ENTRIES_SHIFT 5
+#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT)
+#define GDT_ENTRY_SHIFT 3
+#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT)
+#define GDT_SIZE (1 << GDT_SIZE_SHIFT)
 
 /* Simple and small GDT entries for booting only */
 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
@ 2004-09-16 18:39 Petr Vandrovec
  2004-09-17 18:12 ` Stas Sergeev
  2004-09-18 16:45 ` Stas Sergeev
  0 siblings, 2 replies; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-16 18:39 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On 16 Sep 04 at 21:49, Stas Sergeev wrote:
> 
> There is a "semi-official" bug in Intel CPUs,
> which is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
> chapter "Specification Clarifications"
> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".

Not a bug, but a feature.

> What I want to find out, is what CPUs are
> affected. I wrote a small test-case. It is
> attached. I tried it on Intel Pentium and
> on AMD Athlon - bug is present on both. But
> I'd like to know if it is also present on a
> Transmeta Crusoe, Cyrix and other CPUs.

AFAIK all.  There are products which depend on this behavior.

> I'd also like to know any thoughts on whether
> it is possible to work around the bug, probably
> in a kernel? Well, I am not hoping on such a
> possibility, but who knows...

IMHO you have to switch to 16bit stack, load upper bits of ESP 
with target value, and then execute IRET, while 16bit SS:SP points 
to same place where flat ESP pointed.  

This way IRET, as stack is 16bit, uses only SP instead of ESP, 
and so you can preload upper bits of ESP before executing IRET. 

And I do not think that linux NMI handler survives 16bit stack, so 
natural solution seems to be to create complete 16bit CPL1 
environment, return to it, load ESP as you want, and then do IRET 
to return to CPL2 or CPL3.  Fortunately V8086 mode is not affected, 
so there should be no problem with using CPL1 for this middle step.  
But of course it is not something you want to do on each return 
from interrupt handler...  Well, or maybe you want...

> Anyway, I'd be glad to get any info on that bug.
> Why it was not fixed for so many years, looks
> also like an interesting question, as for me.

It is part of architecture now...
                                                    Petr Vandrovec

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-16 18:39 Petr Vandrovec
@ 2004-09-17 18:12 ` Stas Sergeev
  2004-09-18 16:45 ` Stas Sergeev
  1 sibling, 0 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-09-17 18:12 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

Hello.

Petr Vandrovec wrote:
>> There is a "semi-official" bug in Intel CPUs,
>> which is described here:
>> http://www.intel.com/design/intarch/specupdt/27287402.PDF
>> chapter "Specification Clarifications"
>> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
> Not a bug, but a feature.
Yep, one of those features, that you can
neither disable, nor (speaking about dosemu)
work around.

>> What I want to find out, is what CPUs are
>> affected. I wrote a small test-case. It is
>> attached. I tried it on Intel Pentium and
>> on AMD Athlon - bug is present on both. But
>> I'd like to know if it is also present on a
>> Transmeta Crusoe, Cyrix and other CPUs.
> AFAIK all.
I expected that. My last hopes were on Cyrix, which,
at least as I've heard, doesn't follow the Intel's
specs very strictly. So I thought maybe they have
not duplicated that "feature".

> There are products which depend on this behavior.
Out of curiosity, what are those products? I
think it is pretty much brain-damaged to depend
on a ESP corruption.

>> I'd also like to know any thoughts on whether
>> it is possible to work around the bug, probably
>> in a kernel?
> IMHO you have to switch to 16bit stack, load upper bits of ESP 
> with target value, and then execute IRET, while 16bit SS:SP points 
> to same place where flat ESP pointed.  
Hmm, that sounds feasible.

> And I do not think that linux NMI handler survives 16bit stack, so 
> natural solution seems to be to create complete 16bit CPL1 
> environment, return to it, load ESP as you want, and then do IRET 
> to return to CPL2 or CPL3.  Fortunately V8086 mode is not affected, 
> so there should be no problem with using CPL1 for this middle step.  
> But of course it is not something you want to do on each return 
> from interrupt handler...  Well, or maybe you want...
Umm, no. But well, the spec claims that this
happens only with iret. It doesn't say that it
is a general problem with switching between a
different protection rings. And it seems, for
example, that WinNT/XP do not have that problem.
I.e. the DOS progs that get a Stack Fault under
dosemu because ESP points to the kernel space, can
actually work under WinXP. Maybe they are using
some other technique to return to Ring3, the one
that doesn't trigger the bug at all?

>> Anyway, I'd be glad to get any info on that bug.
>> Why it was not fixed for so many years, looks
>> also like an interesting question, as for me.
> It is part of architecture now...
How could that happen? Was it so difficult to just
fix? Oh well, perhaps it was...


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-16 18:39 Petr Vandrovec
  2004-09-17 18:12 ` Stas Sergeev
@ 2004-09-18 16:45 ` Stas Sergeev
  2004-09-18 16:59   ` Petr Vandrovec
  1 sibling, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-18 16:45 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

Hi Petr.

Petr Vandrovec wrote:
> natural solution seems to be to create complete 16bit CPL1 
> environment, return to it, load ESP as you want, and then do IRET 
> to return to CPL2 or CPL3.  Fortunately V8086 mode is not affected, 
> so there should be no problem with using CPL1 for this middle step.  
> But of course it is not something you want to do on each return 
> from interrupt handler...  Well, or maybe you want...
Actually, this may indeed be what I want!
I think this can be implemented with the checks
that Denis Vlasenko suggests. Something like this
can be added to entry.S, right before the "iret":
---
if (!(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one))
  push_a_stack_frame_to_return_to_the_ring1_trampoline();
---
This way the overhead for the normal case would be
something about 4 asm insns (the check), and for the
dosemu case - who cares? (and probably also the wine
people will value that)

Does this look reasonable? If it does, I think I
should just start implementing that.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 16:45 ` Stas Sergeev
@ 2004-09-18 16:59   ` Petr Vandrovec
  2004-09-18 19:14     ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-18 16:59 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Sat, Sep 18, 2004 at 08:45:33PM +0400, Stas Sergeev wrote:
> Hi Petr.
> 
> Petr Vandrovec wrote:
> >natural solution seems to be to create complete 16bit CPL1 
> >environment, return to it, load ESP as you want, and then do IRET 
> >to return to CPL2 or CPL3.  Fortunately V8086 mode is not affected, 
> >so there should be no problem with using CPL1 for this middle step.  
> >But of course it is not something you want to do on each return 
> >from interrupt handler...  Well, or maybe you want...
> Actually, this may indeed be what I want!
> I think this can be implemented with the checks
> that Denis Vlasenko suggests. Something like this
> can be added to entry.S, right before the "iret":
> ---
> if (!(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one))
>  push_a_stack_frame_to_return_to_the_ring1_trampoline();
> ---
> This way the overhead for the normal case would be
> something about 4 asm insns (the check), and for the
> dosemu case - who cares? (and probably also the wine
> people will value that)
> 
> Does this look reasonable? If it does, I think I
> should just start implementing that.

Do not forget that you have to implement also return to CPL1, as
NMI may arrive while you are running on CPL1.  So it may not be
as trivial as it seemed.  Maybe all these programs survive that
their CPL3 stack changes, and then you could just push cs,eip and
esp on CPL3 stack, and return to user code (vsyscall page seems 
natural place for that code) which would do pop %esp; retf?

Only problem is how to find that old SS points to 16bit segment.
You need LAR and/or you have to peek GDT/LDT to find stack size, 
and AFAIK LAR is microcoded on P4.
						Petr Vandrovec


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 16:59   ` Petr Vandrovec
@ 2004-09-18 19:14     ` Stas Sergeev
  2004-09-18 20:35       ` Petr Vandrovec
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-18 19:14 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

Hi,

Petr Vandrovec wrote:
>> Does this look reasonable? If it does, I think I
>> should just start implementing that.
> Do not forget that you have to implement also return to CPL1, as
> NMI may arrive while you are running on CPL1.  So it may not be
> as trivial as it seemed. 
I am not sure what special actions have to be
taken here compared to returning to ring-3 from NMI.
Is there anywhere in the sources an example to take
a look at? (sorry for the newbie questions)

> Maybe all these programs survive that
> their CPL3 stack changes,
Most likely they will, I am just not sure. What
if they disabled interrupts and are switching the
stack by loading the SS and ESP separately? If we
interrupt it there, there may be the problems, which
would be almost impossible to track down later.
It just looks a bit unsafe to me. Or maybe exploit
a sigaltstack for that? Hmm, is implementing the
CPL1 trampoline really that difficult after all?
I think it is somewhat cleaner and maybe safer.

> Only problem is how to find that old SS points to 16bit segment.
> You need LAR and/or you have to peek GDT/LDT to find stack size, 
Yes, I was thinking about using LAR - looks like the
most easy and fast way to just get that single bit
out of LDT.

> and AFAIK LAR is microcoded on P4.
Where does this lead us to? Some other problems I
am not aware about?


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 19:14     ` Stas Sergeev
@ 2004-09-18 20:35       ` Petr Vandrovec
  2004-09-22 18:49         ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-18 20:35 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Sat, Sep 18, 2004 at 11:14:44PM +0400, Stas Sergeev wrote:
> Hi,
> 
> Petr Vandrovec wrote:
> >>Does this look reasonable? If it does, I think I
> >>should just start implementing that.
> >Do not forget that you have to implement also return to CPL1, as
> >NMI may arrive while you are running on CPL1.  So it may not be
> >as trivial as it seemed. 
> I am not sure what special actions have to be
> taken here compared to returning to ring-3 from NMI.
> Is there anywhere in the sources an example to take
> a look at? (sorry for the newbie questions)

It means that you cannot blindly create CPL1 trampoline stack
in some static per-cpu area.  But if we can assume that there
is no other CPL1 code in the system, something like code below
could work:

/* + 20 [word 5]  SS
   + 16 [word 4]  ESP
   + 12 [word 3]  EFLAGS
   +  8 [word 2]  CS
   +  4 [word 1]  EIP
   +  0 [word 0]  ESP for popl %esp
*/
u_int32_t cpl1stacks[NUM_CPUS][6];

curCPU = smp_processor_id();
minSP = curCPU * sizeof(cpl1stacks[0]);
maxSP = minSP + sizeof(cpl1stacks[0]);
cpl1stack = cpl1stacks[curCPU];

if (cpl0stack[retCS] & 3 == 1) {
	/* Going back to our trampoline */
	/* There is no other place in kernel running on CPL1
           except our trampoline; so interrupt could occur either
           on popl %esp or on iret.  If it occured on popl %esp,
           just return, code will do proper things.  If interrupt
	   occured on iret, we have to perform popl %esp again,
	   so that upper bits of %esp are correctly restored
	   for CPL3 code */
	ASSERT(cpl0stack[retCS] == FLAT_4G_CPL1_CS);
	ASSERT(cpl0stack[retSS] == SMALL_CPL1_SS);
	if (cpl0stack[retEIP] == fixup_proc) {
		ASSERT(cpl0stack[retESP] == minSP]);
	} else if (cpl0stack[retEIP] == fixup_proc_iret) {
		ASSERT(cpl0stack[retESP] & 0xFFFF == minSP + 4);
		/* Undo popl %esp - copy value from ESP we were
                   using on CPL1 back to stack */
		cpl1stack[0] = cpl0stack[retESP];
		cpl0stack[retEIP] = fixup_proc;
		cpl0stack[retESP] = minSP;
	} else {
		/* unexpected code running on CPL1 */
		/* Probably do simple IRET and hope for the best? */
		ASSERT(0);
	}
	iret;
} else {
	cpl1Stack[5] = cpl0stack[retSS];
	cpl1Stack[4] = cpl0stack[retESP];
	cpl1Stack[3] = cpl0stack[retEFLAGS];
	cpl1Stack[2] = cpl0stack[retCS];
	cpl1Stack[1] = cpl0stack[retEIP];
	cpl1Stack[0] = (cpl0stack[retESP] & 0xFFFF0000) | minSP + 4;
	cpl0stack[retSS] = SMALL_CPL1_SS;
	cpl0stack[retESP] = minSP;
	/* 
	   Do NOT clear IF... IF flag is affected only if IOPL >= CPL, so
	   with IOPL=0 IRET on CPL1 won't reenable interrupts.  This is reason
	   why we cannot use RETF to return from CPL0 to CPL1 (retf
	   is much faster than iret on P4) (and we cannot use retf for
	   CPL1->CPL3 due to TF/RF).

	   Clear TF so we do not start tracing on CPL1 if we trace
	   userspace, and clear RF, so if somebody intentionaly pointed
	   hardware breakpoint into CPL1 handler, it will be triggered
	   (we must use this path for returns from INT1 too, so it
	   is possible that RF is set in EFLAGS on stack).
	*/
	cpl0stack[retEFLAGS] &= ~(EFLAGS_TF | EFLAGS_RF);
	cpl0stack[retCS] = FLAT_4G_CPL1_CS;
	cpl0stack[retEIP] = fixup_proc;
	iret;
}

fixup_proc:
	popl %esp
fixup_proc_iret;
	iret


It assumes that there is one new 32bit CPL1 flat CS descriptor in GDT, and
one 16bit (small) CPL1 SS descriptor (grows up, with limit 24*<num_cpus>
and base of cpl1stacks), plus <num_cpus> 24byte CPL1 stacks (max. 64KB, 
so for kernels with more than 2730 CPUs you need more than one stack 
descriptor).


> >Maybe all these programs survive that
> >their CPL3 stack changes,
> Most likely they will, I am just not sure. What
> if they disabled interrupts and are switching the
> stack by loading the SS and ESP separately? If we
> interrupt it there, there may be the problems, which
> would be almost impossible to track down later.
> It just looks a bit unsafe to me. Or maybe exploit
> a sigaltstack for that? Hmm, is implementing the
> CPL1 trampoline really that difficult after all?
> I think it is somewhat cleaner and maybe safer.

As in pseudocode above I was able to handle even NMIs with just
24 bytes of stack on CPL1, it is definitely preferred solution.

> >and AFAIK LAR is microcoded on P4.
> Where does this lead us to? Some other problems I
> am not aware about?

It is slow.  No other problems, except that doing
(((ss & 4) ? gdt[ss >> 3] : ldt[ss >> 3]) & 0x00????00) == 0x00????00;
may be faster than doing (lar(ss) & 0x00????00) == 0x00????00.
						Petr


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 20:35       ` Petr Vandrovec
@ 2004-09-22 18:49         ` Stas Sergeev
  2004-09-22 19:19           ` Richard B. Johnson
  2004-09-22 20:02           ` Petr Vandrovec
  0 siblings, 2 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-09-22 18:49 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 868 bytes --]

Hi Petr et al.

I coded up something that remotely looks like the
discussed patch.
I have deviated from your proposals at some important
points:
1. I am not allocating the ring1 stack separately,
I am allocating it on a ring0 stack. Much simpler
code, although non-reentrant/preempt-unsafe.
2. I am disabling the interrupts after all. That's
because of the preempt-unsafeness. I pass up the
IOPL=1 when necessary, to avoid problems.
But I guess also with your technique the interrupts
had to be disabled, unless the ring1 stack is
per-thread.
3. I am using LAR. Do you really think it can be
slower than the whole thing of locating LDT?

Let me know if I did something stupid.
The patch is attached.
I tested (pretty much) everything in it, except
probably the "popl %esp" restartability. But that
one looks fairly simple and should work.

Does this patch look good?


[-- Attachment #2: linux-2.6.8-stacks2.diff --]
[-- Type: text/x-patch, Size: 9860 bytes --]

diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
--- linux-2.6.8-pcsp/arch/i386/kernel/entry.S	2004-06-10 13:28:35.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/entry.S	2004-09-22 15:39:50.000000000 +0400
@@ -70,9 +70,16 @@
 CF_MASK		= 0x00000001
 TF_MASK		= 0x00000100
 IF_MASK		= 0x00000200
-DF_MASK		= 0x00000400 
+DF_MASK		= 0x00000400
+IOPL1_MASK 	= 0x00001000
+IOPL2_MASK 	= 0x00002000
+IOPL_MASK 	= 0x00003000
 NT_MASK		= 0x00004000
+RF_MASK		= 0x00010000
 VM_MASK		= 0x00020000
+AC_MASK 	= 0x00040000
+VIF_MASK 	= 0x00080000
+VIP_MASK 	= 0x00100000
 
 #ifdef CONFIG_PREEMPT
 #define preempt_stop		cli
@@ -81,6 +88,49 @@
 #define resume_kernel		restore_all
 #endif
 
+/*
+ * First, reserve the 0x400 bytes on stack for the NMI
+ * handler - it uses the same stack.
+ * Second, push the additional iret frame to switch to
+ * the Ring-1 trampoline. The correct value of ESP
+ * higher word is pushed here and popped on Ring-1.
+ * espfix_trampoline is working on a "small" stack.
+ * We patch its base in GDT here to allow the Ring-1 code
+ * to access the right place on stack. This is preempt-unsafe
+ * and therefore must be done with the interrupts disabled.
+ * To allow Ring-1 code to re-enable interrupts, we make
+ * sure it will be executing with at least IOPL1.
+ */
+#define NMI_STACK_RESERVE 0x400
+#define MAKE_ESPFIX \
+	cli; \
+	subl $NMI_STACK_RESERVE, %esp; \
+	.rept 5; \
+	pushl NMI_STACK_RESERVE+0x10(%esp); \
+	.endr; \
+	pushl %eax; \
+	movl %esp, %eax; \
+	pushl %ebp; \
+	GET_THREAD_INFO(%ebp); \
+	movl TI_cpu(%ebp), %ebp; \
+	shll $GDT_SIZE_SHIFT, %ebp; \
+	addl $cpu_gdt_table, %ebp; \
+	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \
+	shrl $16, %eax; \
+	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \
+	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \
+	popl %ebp; \
+	popl %eax; \
+	pushw 14(%esp); \
+	pushw $4; \
+	pushl $__ESPFIX_SS; \
+	pushl $0; \
+	pushl 20(%esp); \
+	andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
+	orl $IOPL1_MASK, (%esp); \
+	pushl $__ESPFIX_CS; \
+	pushl $espfix_trampoline;
+
 #define SAVE_ALL \
 	cld; \
 	pushl %es; \
@@ -122,9 +172,24 @@
 .previous
 
 
+/* If returning to Ring-3, not to V86, and with
+ * the small stack, try to fix the higher word of
+ * ESP, as the CPU won't restore it from stack.
+ * This is an "official" bug of all the x86-compatible
+ * CPUs, which we can try to work around to make
+ * dosemu happy. */
 #define RESTORE_ALL	\
-	RESTORE_REGS	\
-	addl $4, %esp;	\
+	movl EFLAGS(%esp), %eax; \
+	movb CS(%esp), %al; \
+	andl $(VM_MASK | 2), %eax; \
+	cmpl $2, %eax; \
+	jne 3f; \
+	larl OLDSS(%esp), %eax; \
+	testl $0x00400000, %eax; \
+3:	RESTORE_REGS	\
+	leal 4(%esp), %esp; \
+	jnz 1f; \
+	MAKE_ESPFIX \
 1:	iret;		\
 .section .fixup,"ax";   \
 2:	sti;		\
@@ -174,6 +239,23 @@
 	jmp do_lcall
 
 
+ENTRY(espfix_trampoline)
+	popl %esp
+espfix_past_esp:
+1:	iret
+.section .fixup,"ax"
+2:	sti
+	movl $(__USER_DS), %edx
+	movl %edx, %ds
+	movl %edx, %es
+	pushl $11
+	call do_exit
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,2b
+.previous
+
 ENTRY(ret_from_fork)
 	pushl %eax
 	call schedule_tail
@@ -196,7 +278,7 @@
 	GET_THREAD_INFO(%ebp)
 	movl EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb CS(%esp), %al
-	testl $(VM_MASK | 3), %eax
+	testl $(VM_MASK | 2), %eax
 	jz resume_kernel		# returning to kernel or vm86-space
 ENTRY(resume_userspace)
  	cli				# make sure we don't miss an interrupt
@@ -504,7 +586,12 @@
 ENTRY(nmi)
 	cmpl $sysenter_entry,(%esp)
 	je nmi_stack_fixup
-	pushl %eax
+	cmpl $espfix_past_esp,(%esp)
+	jne 1f
+	/* restart the "popl %esp" */
+	decl (%esp)
+	subw $4, 12(%esp)
+1:	pushl %eax
 	movl %esp,%eax
 	/* Do not access memory above the end of our stack page,
 	 * it might not exist.
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S
--- linux-2.6.8-pcsp/arch/i386/kernel/head.S	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/head.S	2004-09-21 14:58:16.000000000 +0400
@@ -514,8 +514,8 @@
 	.quad 0x00009a0000000000	/* 0xc0 APM CS 16 code (16 bit) */
 	.quad 0x0040920000000000	/* 0xc8 APM DS    data */
 
-	.quad 0x0000000000000000	/* 0xd0 - unused */
-	.quad 0x0000000000000000	/* 0xd8 - unused */
+	.quad 0x00cfba000000ffff	/* 0xd0 - ESPfix CS */
+	.quad 0x0000b2000000ffff	/* 0xd8 - ESPfix SS */
 	.quad 0x0000000000000000	/* 0xe0 - unused */
 	.quad 0x0000000000000000	/* 0xe8 - unused */
 	.quad 0x0000000000000000	/* 0xf0 - unused */
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/process.c linux-2.6.8-stacks/arch/i386/kernel/process.c
--- linux-2.6.8-pcsp/arch/i386/kernel/process.c	2004-06-21 09:19:07.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/process.c	2004-09-21 10:45:55.000000000 +0400
@@ -225,7 +225,7 @@
 	printk("EIP: %04x:[<%08lx>] CPU: %d\n",0xffff & regs->xcs,regs->eip, smp_processor_id());
 	print_symbol("EIP is at %s\n", regs->eip);
 
-	if (regs->xcs & 3)
+	if (regs->xcs & 2)
 		printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);
 	printk(" EFLAGS: %08lx    %s  (%s)\n",regs->eflags, print_tainted(),UTS_RELEASE);
 	printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/signal.c linux-2.6.8-stacks/arch/i386/kernel/signal.c
--- linux-2.6.8-pcsp/arch/i386/kernel/signal.c	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/signal.c	2004-09-21 10:48:38.000000000 +0400
@@ -558,7 +558,7 @@
 	 * kernel mode. Just return without doing anything
 	 * if so.
 	 */
-	if ((regs->xcs & 3) != 3)
+	if (!(regs->xcs & 2))
 		return 1;
 
 	if (current->flags & PF_FREEZE) {
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/traps.c linux-2.6.8-stacks/arch/i386/kernel/traps.c
--- linux-2.6.8-pcsp/arch/i386/kernel/traps.c	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/traps.c	2004-09-21 10:47:37.000000000 +0400
@@ -210,7 +210,7 @@
 
 	esp = (unsigned long) (&regs->esp);
 	ss = __KERNEL_DS;
-	if (regs->xcs & 3) {
+	if (regs->xcs & 2) {
 		in_kernel = 0;
 		esp = regs->esp;
 		ss = regs->xss & 0xffff;
@@ -264,7 +264,7 @@
 	char c;
 	unsigned long eip;
 
-	if (regs->xcs & 3)
+	if (regs->xcs & 2)
 		goto no_bug;		/* Not in kernel */
 
 	eip = regs->eip;
@@ -335,7 +335,7 @@
 
 static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err)
 {
-	if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs))
+	if (!(regs->eflags & VM_MASK) && !(2 & regs->xcs))
 		die(str, regs, err);
 }
 
@@ -357,7 +357,7 @@
 		goto trap_signal;
 	}
 
-	if (!(regs->xcs & 3))
+	if (!(regs->xcs & 2))
 		goto kernel_trap;
 
 	trap_signal: {
@@ -437,7 +437,7 @@
 	if (regs->eflags & VM_MASK)
 		goto gp_in_vm86;
 
-	if (!(regs->xcs & 3))
+	if (!(regs->xcs & 2))
 		goto gp_in_kernel;
 
 	current->thread.error_code = error_code;
@@ -614,7 +614,7 @@
 		 * allowing programs to debug themselves without the ptrace()
 		 * interface.
 		 */
-		if ((regs->xcs & 3) == 0)
+		if ((regs->xcs & 2) == 0)
 			goto clear_TF_reenable;
 		if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE)
 			goto clear_TF;
@@ -630,7 +630,7 @@
 	/* If this is a kernel mode trap, save the user PC on entry to 
 	 * the kernel, that's what the debugger can make sense of.
 	 */
-	info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : 
+	info.si_addr = ((regs->xcs & 2) == 0) ? (void *)tsk->thread.eip : 
 	                                        (void *)regs->eip;
 	force_sig_info(SIGTRAP, &info, tsk);
 
diff -urN linux-2.6.8-pcsp/arch/i386/mm/extable.c linux-2.6.8-stacks/arch/i386/mm/extable.c
--- linux-2.6.8-pcsp/arch/i386/mm/extable.c	2004-06-09 15:44:19.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/mm/extable.c	2004-09-21 17:17:05.000000000 +0400
@@ -26,6 +26,11 @@
 	}
 #endif
 
+	if (unlikely(regs->xcs == __ESPFIX_CS))
+	{
+		regs->xcs = __KERNEL_CS;
+	}
+
 	fixup = search_exception_tables(regs->eip);
 	if (fixup) {
 		regs->eip = fixup->fixup;
diff -urN linux-2.6.8-pcsp/arch/i386/mm/fault.c linux-2.6.8-stacks/arch/i386/mm/fault.c
--- linux-2.6.8-pcsp/arch/i386/mm/fault.c	2004-08-10 11:02:37.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/mm/fault.c	2004-09-21 16:03:28.000000000 +0400
@@ -77,7 +77,7 @@
 	u32 seg_ar, seg_limit, base, *desc;
 
 	/* The standard kernel/user address space limit. */
-	*eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg;
+	*eip_limit = (seg & 2) ? USER_DS.seg : KERNEL_DS.seg;
 
 	/* Unlikely, but must come before segment checks. */
 	if (unlikely((regs->eflags & VM_MASK) != 0))
diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h
--- linux-2.6.8-pcsp/include/asm-i386/segment.h	2004-01-09 09:59:19.000000000 +0300
+++ linux-2.6.8-stacks/include/asm-i386/segment.h	2004-09-21 14:57:44.000000000 +0400
@@ -38,8 +38,8 @@
  *  24 - APM BIOS support
  *  25 - APM BIOS support 
  *
- *  26 - unused
- *  27 - unused
+ *  26 - ESPfix Ring-1 trampoline CS
+ *  27 - ESPfix Ring-1 small SS
  *  28 - unused
  *  29 - unused
  *  30 - unused
@@ -71,14 +71,21 @@
 #define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 6)
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
+#define GDT_ENTRY_ESPFIX_CS		(GDT_ENTRY_KERNEL_BASE + 14)
+#define GDT_ENTRY_ESPFIX_SS		(GDT_ENTRY_KERNEL_BASE + 15)
+#define __ESPFIX_CS (GDT_ENTRY_ESPFIX_CS * 8 + 1)
+#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8 + 1)
+
 #define GDT_ENTRY_DOUBLEFAULT_TSS	31
 
 /*
  * The GDT has 32 entries
  */
-#define GDT_ENTRIES 32
-
-#define GDT_SIZE (GDT_ENTRIES * 8)
+#define GDT_ENTRIES_SHIFT 5
+#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT)
+#define GDT_ENTRY_SHIFT 3
+#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT)
+#define GDT_SIZE (1 << GDT_SIZE_SHIFT)
 
 /* Simple and small GDT entries for booting only */
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 18:49         ` Stas Sergeev
@ 2004-09-22 19:19           ` Richard B. Johnson
  2004-09-22 20:03             ` Stas Sergeev
  2004-09-22 20:02           ` Petr Vandrovec
  1 sibling, 1 reply; 38+ messages in thread
From: Richard B. Johnson @ 2004-09-22 19:19 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Petr Vandrovec, Linux kernel

What problem is this supposed to fix?  ESP is __not__ corrupted
when returning to protected-mode or a different privilege level.
You don't 'return' to protected mode from a virtual-8086 mode,
even though you (can) use an `iret`instruction. The fundamental
change is though the EFLAGS register stored in the TSS.

The so-called bug is that when in real mode or in virtual-8086
mode, the high word of ESP is not changed. It is not a bug
because the high word doesn't even exist when in VM-86 mode!!
It is possible to use the 32-bit prefix, when in 16-bit mode,
to muck with the high word of the stack, but that's not
a documented procedure, but a side-effect of such an undocumented
instruction.

There is no bug to fix. When the VM-86 mode transitions to
32-bit protected mode, the stack is restored to the condition
it was just prior to the transition to VM-86 mode, therefore you
don't "use up" any stack. The so-called bug is only cosmetic
when somebody is prowling around in undocumented shadows.

Please, somebody from Intel tell these guys to leave the thing
alone. I, for one, don't want a bunch of "fixes" that do nothing
except consume valuable RAM, making it near impossible to
use later versions of Linux in embedded systems.

On Wed, 22 Sep 2004, Stas Sergeev wrote:

> Hi Petr et al.
>
> I coded up something that remotely looks like the
> discussed patch.
> I have deviated from your proposals at some important
> points:
> 1. I am not allocating the ring1 stack separately,
> I am allocating it on a ring0 stack. Much simpler
> code, although non-reentrant/preempt-unsafe.
> 2. I am disabling the interrupts after all. That's
> because of the preempt-unsafeness. I pass up the
> IOPL=1 when necessary, to avoid problems.
> But I guess also with your technique the interrupts
> had to be disabled, unless the ring1 stack is
> per-thread.
> 3. I am using LAR. Do you really think it can be
> slower than the whole thing of locating LDT?
>
> Let me know if I did something stupid.
> The patch is attached.
> I tested (pretty much) everything in it, except
> probably the "popl %esp" restartability. But that
> one looks fairly simple and should work.
>
> Does this patch look good?
>
>

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
            Note 96.31% of all statistics are fiction.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 19:19           ` Richard B. Johnson
@ 2004-09-22 20:03             ` Stas Sergeev
  2004-09-22 20:13               ` Richard B. Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-22 20:03 UTC (permalink / raw)
  To: root; +Cc: Petr Vandrovec, Linux kernel

Hi,

Richard B. Johnson wrote:
> What problem is this supposed to fix?
Richard, it will really help if you read the
whole thread. I was answering this to Denis
Vlasenko already - he agreed. Do I have to
repeat the explanations?

> ESP is __not__ corrupted
Right, but is not properly restored either,
while it have to be.

> when returning to protected-mode or a different privilege level.
It gets "corrupted" (not properly restored)
exactly when returning to *protected mode*
from another priv level. Please refer to the
Intel docs I pointed to in that thread earlier.

> You don't 'return' to protected mode from a virtual-8086 mode,
Noone was speaking about v86. I don't see why
you pick that up.

> The so-called bug is that when in real mode or in virtual-8086
> mode, the high word of ESP is not changed.
In short: Wrong.
The complete explanations are easily locateable
in that thread. And it have nothing to do with
the real mode either.

> It is not a bug
> because the high word doesn't even exist when in VM-86 mode!!
Noone was speaking about v86, sorry. I am bypassing
that part.

> It is possible to use the 32-bit prefix, when in 16-bit mode,
That's not about the prefixes either, sorry.
We were talking about the 32bit PM code.

> Please, somebody from Intel tell these guys to leave the thing
> alone.
Thanks many, they already left that alone once:)
Maybe enough of leaving the bugs alone?
I have lots of the DOS progs here that do not
work under dosemu without that patch, and work
perfectly with it. It should be enough. If
you need an examples - well, OpenCubicPlayer
for one. It will start, but crash as soon as
the music is attempted to play. The patch fixes
it. Other progs you'll have problems downloading
anyway, but let me know if you need these.

> I, for one, don't want a bunch of "fixes" that do nothing
> except consume valuable RAM, making it near impossible to
> use later versions of Linux in embedded systems.
Well, my patch is purely in asm. How many
valueable bytes does it take from you?
As for performance - 8 asm insns on a generic
path. Not too much either, as for me.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 20:03             ` Stas Sergeev
@ 2004-09-22 20:13               ` Richard B. Johnson
  2004-09-28 15:43                 ` Denis Vlasenko
  0 siblings, 1 reply; 38+ messages in thread
From: Richard B. Johnson @ 2004-09-22 20:13 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Petr Vandrovec, Linux kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: TEXT/PLAIN; charset=X-UNKNOWN, Size: 21740 bytes --]

On Thu, 23 Sep 2004, Stas Sergeev wrote:

> Hi,
>
> Richard B. Johnson wrote:
> > What problem is this supposed to fix?
> Richard, it will really help if you read the
> whole thread. I was answering this to Denis
> Vlasenko already - he agreed. Do I have to
> repeat the explanations?
>
> > ESP is __not__ corrupted
> Right, but is not properly restored either,
> while it have to be.
>
> > when returning to protected-mode or a different privilege level.
> It gets "corrupted" (not properly restored)
> exactly when returning to *protected mode*
> from another priv level. Please refer to the
> Intel docs I pointed to in that thread earlier.
>
> > You don't 'return' to protected mode from a virtual-8086 mode,
> Noone was speaking about v86. I don't see why
> you pick that up.
>
> > The so-called bug is that when in real mode or in virtual-8086
> > mode, the high word of ESP is not changed.
> In short: Wrong.
> The complete explanations are easily locateable
> in that thread. And it have nothing to do with
> the real mode either.
>
> > It is not a bug
> > because the high word doesn't even exist when in VM-86 mode!!
> Noone was speaking about v86, sorry. I am bypassing
> that part.
>
> > It is possible to use the 32-bit prefix, when in 16-bit mode,
> That's not about the prefixes either, sorry.
> We were talking about the 32bit PM code.
>
> > Please, somebody from Intel tell these guys to leave the thing
> > alone.
> Thanks many, they already left that alone once:)
> Maybe enough of leaving the bugs alone?
> I have lots of the DOS progs here that do not
> work under dosemu without that patch, and work
> perfectly with it. It should be enough. If
> you need an examples - well, OpenCubicPlayer
> for one. It will start, but crash as soon as
> the music is attempted to play. The patch fixes
> it. Other progs you'll have problems downloading
> anyway, but let me know if you need these.
>
> > I, for one, don't want a bunch of "fixes" that do nothing
> > except consume valuable RAM, making it near impossible to
> > use later versions of Linux in embedded systems.
> Well, my patch is purely in asm. How many
> valueable bytes does it take from you?
> As for performance - 8 asm insns on a generic
> path. Not too much either, as for me.
>

Well DOSEMU uses VM-86 mode. That's how it works. It
creates a virtual 8086 machine, complete with the
required "DOS compatible" virtual BIOS environment.

I use it all the time because I write, amongst other things,
the complete BIOS and startup code for many Intel based
machines.

I run compilers, assemblers, linkers, and editors in that
environment and it works.

Sombody mentions a completely unrelated so-called Intel
bug and next thing you know, there are patches to work-
around the bug???

The bug doesn't exist period. Here is a session in
VM-86 mode., using DOSEMU. It does one hell of a lot
more work than your games.



Script started on Wed Sep 22 16:00:02 2004
# godos
CPU speed set to 2793/1 MHz
Running on CPU=586, FPU=1, rdtsc=1
Linux DOS emulator 0.98.5.0 $Date: 99/01/15 $
Last configured at Wed Mar 24 12:44:16 EST 1999 on linux
This is work in progress.
Please test against a recent version before reporting bugs and problems.
Bugs, Patches & New Code to linux-msdos@vger.rutgers.edu

Starting MS-DOS...

^[[1;1H^[[37m^[[44m
                ^[[2;1H
                      ^[[3;1H
                            ^[[4;1H
                                  ^[[5;1H     1. Start the TCP/IP Network  (NE* B
radley's code)                          ^[[6;1H     2. Start the TCP/IP Network  (3COM Board PCTCP)                            ^[[7;1H     3. Start the TCP/IP Network  (3COM Board ANALOGIC)                         ^[[8;1H     4. Load NDIS driver only     (3COM Board)                                  ^[[9;1H     ^[[1m^[[37m^[[47m5. Do not load any network^[[m\x0f^[[37m^[[44m                                                 ^[[10;1H     6. Do not execute any AUTOEXEC.BAT commands                                ^[[11;1H     7. Start RCCS Host Software                                                ^[[12;1H MS-DOS 6.22 Startup Menu                                                       ^[[13;1H  Enter a choice: 5ÍÍÍÍÍÍ                                                       ^[[14;1H                                                                                ^[[15;1H                                                                                ^[[16;1H                                                                                ^[[17;1H                                                                                ^[[18;1H                                                                                ^[[19;1H                                                                                ^[[20;1H                                                                                ^[[21;1H                                                                                ^[[22;1H                                                                                ^[[23;1H                                                                                ^[[24;1H                                                                                ^[[25;1HF5=Bypass startup files F8=Confirm each line of CONFIG.SYS and AUTOEXEC.BAT [N]^[[13;19H^[[13;27HTime remaining: 04^[[13;19H^[[13;20H                         ^[[25;1H                                                                               ^[[15;1H
MS-DOS Version 6.22

C:\>cd pbios

C:\PBIOS>make clean

Microsoft (R) Program Maintenance Utility  Version 4.07
Copyright (C) Microsoft Corp 1984-1988.  All rights reserved.

^[[37m^[[40m^[[3;24r^[[3;1H^[[3;24r^[[24;1H

^[[3;24r^[[5;1H^[[1;25r^[[23;1H^[[37m^[[44m                                                                                ^[[24;1H  if exist *.obj del *.obj                                                      ^[[25;1H^[[37m^[[40m^[[1;24r^[[1;1H^[[1;24r^[[24;1H


^[[1;24r^[[1;1H^[[1;25r^[[22;1H^[[37m^[[44m  if exist *.bin del *.bin                                                      ^[[23;1H  if exist *.rom del *.rom                                                      ^[[24;1H  if exist *.hex del *.hex                                                      ^[[25;1H^[[37m^[[40m^[[1;24r^[[1;1H^[[1;24r^[[24;1H






^[[1;24r^[[1;1H^[[1;25r^[[18;1H^[[37m^[[44m  if exist *.lis del *.lis                                                      ^[[19;1H  if exist *.com del *.com                                                      ^[[20;1H  if exist *.exe del *.exe                                                      ^[[21;1H  if exist *.oe0 del *.oe0                                                      ^[[22;1H  if exist *.oe1 del *.oe1                                                      ^[[23;1H  cd tools                                                                      ^[[24;1H  if exist *.obj del *.obj                                                      ^[[25;1H^[[37m^[[40m^[[3;24r^[[3;1H^[[3;24r^[[24;1H






^[[3;24r^[[5;1H^[[1;25r^[[1;1H^[[37m^[[44mC:\PBIOS> make clean^[[18;1H  if exist *.bin del *.bin                                                      ^[[19;1H  if exist *.rom del *.rom                                                      ^[[20;1H  if exist *.hex del *.hex                                                      ^[[21;1H  if exist *.com del *.com                                                      ^[[22;1H  if exist *.exe del *.exe                                                      ^[[23;1H  cd ..                                                                         ^[[24;1H                                                                                ^[[25;1HC:\PBIOS>^[[25;11Hmake^[[25;16Hpbios^[[37m^[[40m^[[2;20r^[[2;1H^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[[1;25r^[[1;1H^[[37m^[[44m  if exist *.obj del *.obj
  if exist *.bin del *.bin                                                      ^[[3;1H  if exist *.rom del *.rom                                                      ^[[4;1H  if exist *.hex del *.hex                                                      ^[[5;1H  if exist *.com del *.com                                                      ^[[6;1H  if exist *.exe del *.exe                                                      ^[[7;1H  cd ..                                                                         ^[[8;1H                                                                                ^[[9;1HC:\PBIOS> make pbios                                                            ^[[10;1H                                                                                ^[[11;1HMicrosoft (R) Program Maintenance Utility  Version 4.07                         ^[[12;1HCopyright (C) Microsoft Corp 1984-1988.  All rights reserved.                   ^[[13;1H                                                                                ^[[14;1H                                                                                ^[[15;1HMAKE : warning U4000: 'TOOLS\IHEX.EXE' : target does not exist                  ^[[16;1H  CD TOOLS                                                                      ^[[17;1H  MAKE TOOLS                                                                    ^[[21;3H                        ^[[22;3H
MAKE : warning U4000: 'odeven.exe' : target does not exist^[[24;3Hcl -W3 odeven.c
                    \r^[[37m^[[40m^[[10;16r^[[10;1H^[M^[M^[M^[M^[[1;25r^[[1;14H^[[37m^[[44mcom^[[7Ccom^[[2;14Hexe^[[7Cexe^[[3;3Hcd ..                   ^[[4;3H
C:\PBIOS> make pbios      ^[[6;3H
Microsoft (R) Program Maintenance Utility  Version 4.07
Copyright (C) Microsoft Corp 1984-1988.  All rights reserved.

                                                                                ^[[11;1HMAKE : warning U4000: 'TOOLS\IHEX.EXE' : target does not exist                  ^[[12;1H  CD TOOLS                                                                      ^[[13;1H  MAKE TOOLS                                                                    ^[[17;3H          ^[[19;2HAKE : warning U4000: 'odeven.exe' : target does not exist
  cl -W3 odeven.c
Microsoft (R) C Optimizing Compiler Version 6.00A
Copyright (c) Microsoft Corp 1984-1990. All rights reserved.

odeven.c
^[[37m^[[40m^[[2;24r^[[2;1H^[[2;24r^[[24;1H








^[[2;24r^[[3;1H^[[1;25r^[[1;3H^[[37m^[[44m                        ^[[16;1H                                                                                ^[[17;1HMicrosoft (R) Segmented-Executable Linker  Version 5.13                         ^[[18;1HCopyright (C) Microsoft Corp 1984-1991.  All rights reserved.                   ^[[19;1H                                                                                ^[[20;1HObject Modules [.OBJ]: odeven.obj /farcall                                      ^[[21;1HRun File [odeven.exe]: "odeven.exe" /noi                                        ^[[22;1HList File [NUL.MAP]: NUL                                                        ^[[23;1HLibraries [.LIB]:                                                               ^[[24;1HDefinitions File [NUL.DEF]: ;                                                   ^[[25;1H^[[37m^[[40m^[[3;24r^[[3;1H^[[3;24r^[[24;1H






^[[3;24r^[[5;1H^[[1;25r^[[2;1H^[[37m^[[44m                                                              ^[[18;1H                                                                                ^[[19;1HMAKE : warning U4000: 'fixup.exe' : target does not exist                       ^[[20;1H  cl -W3 fixup.c                                                                ^[[21;1HMicrosoft (R) C Optimizing Compiler Version 6.00A                               ^[[22;1HCopyright (c) Microsoft Corp 1984-1990. All rights reserved.                    ^[[23;1H                                                                                ^[[24;1Hfixup.c                                                                         ^[[25;1H^[[1;1HDefinitions File [NUL.DEF]: ;^[[3;24Hfixup.exe' : target does not exist ^[[4;10Hfixup.c



fixup.c ^[[13;24Hfixup.obj /farcall ^[[14;11Hfixup.exe]: "fixup.exe" /noi  ^[[19;24Hgetall.exe' : target does not exist^[[20;10Hgetall.c



getall.c
^[[3;24Hgetall.exe' : target does not exist^[[4;10Hgetall.c



getall.c^[[13;24Hgetall.obj /farcall^[[14;11Hgetall.exe]: "getall.exe" /noi^[[19;24Hpci.obj' : target does not exist   ^[[20;7HAL -c -W3 -o pci.obj pci.c



pci.c
^[[37m^[[40m^[[2;22r^[[2;1H^[[2;22r^[[22;1H





^[[2;22r^[[3;1H^[[1;25r^[[1;1H^[[37m^[[44m                             ^[[17;1H                                                                                ^[[18;1Hpci.c                                                                           ^[[19;1H                                                                                ^[[20;1HMAKE : warning U4000: 'pcireg.obj' : target does not exist                      ^[[21;1H  tasm pcireg;                                                                  ^[[22;1HTurbo Assembler  Version 2.0  Copyright (c) 1988, 1990 Borland International    ^[[24;1HAssembling file:   pcireg.ASM
^[[37m^[[40m^[[3;23r^[[3;1H^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[M^[[1;25r^[[1;1H^[[37m^[[44mMAKE : warning U4000: 'pci.obj' : target does not exist
  cl -AL -c -W3 -o pci.obj pci.c
Microsoft (R) C Optimizing Compiler Version 6.00A                               ^[[4;1HCopyright (c) Microsoft Corp 1984-1990. All rights reserved.                    ^[[5;1H                                                                                ^[[6;1Hpci.c                                                                           ^[[7;1H                                                                                ^[[8;1HMAKE : warning U4000: 'pcireg.obj' : target does not exist                      ^[[9;1H  tasm pcireg;                                                                  ^[[10;1HTurbo Assembler  Version 2.0  Copyright (c) 1988, 1990 Borland International    ^[[11;1H                                                                                ^[[12;1HAssembling file:   pcireg.ASM                                                   ^[[13;1HError messages:    None                                                         ^[[14;1HWarning messages:  None                                                         ^[[15;1HPasses:            1                                                            ^[[16;1HRemaining memory:  324k                                                         ^[[17;1H                                                                                ^[[18;1H                                                                                ^[[19;1HMAKE : warning U4000: 'pci.exe' : target does not exist                         ^[[20;1H  link pci pcireg;                                                              ^[[24;1H
^[[37m^[[40m^[[1;24r^[[1;1H^[[1;24r^[[24;1H






^[[1;24r^[[1;1H^[[1;25r^[[18;1H^[[37m^[[44m                                                                                ^[[19;1HMAKE : warning U4000: 'ihex.exe' : target does not exist                        ^[[20;1H  cl -W3 ihex.c                                                                 ^[[21;1HMicrosoft (R) C Optimizing Compiler Version 6.00A                               ^[[22;1HCopyright (c) Microsoft Corp 1984-1990. All rights reserved.                    ^[[23;1H                                                                                ^[[24;1Hihex.c                                                                          ^[[25;1H^[[37m^[[40m^[[14;18r^[[14;1H^[M^[M^[[1;25r^[[1;1H^[[37m^[[44m                                                          ^[[2;3H
MAKE : warning U4000: 'pci.exe' : target does not exist                     ^[[4;3Hlink pci pcireg;

Microsoft (R) Segmented-Executable Linker  Version 5.13
Copyright (C) Microsoft Corp 1984-1991.  All rights reserved.
       ^[[12C

MAKE : warning U4000: 'ihex.exe' : target does not exist^[[11;3Hcl -W3 ihex.c^[[12;2Hicrosoft (R) C Optimizing Compiler Version 6.00A
Copyright (c) Microsoft Corp 1984-1990. All rights reserved.
                                                                                ^[[15;1Hihex.c                                                                          ^[[19;1H
Object Modules [.OBJ]: ihex.obj /farcall
Run File [ihex.exe]: "ihex.exe" /noi
List File [NUL.MAP]: NUL
Libraries [.LIB]:
Definitions File [NUL.DEF]: ;
^[[37m^[[40m^[[5;11r^[[5;1H^[M^[M^[M^[M^[[1;25r^[[3;24H^[[37m^[[44mihex.exe' : target does not exist^[[4;3Hcl -W3 ihex.c
Microsoft (R) C Optimizing Compiler Version 6.00A                               ^[[6;1HCopyright (c) Microsoft Corp 1984-1990. All rights reserved.                    ^[[7;1H                                                                                ^[[8;1Hihex.c                                                                          ^[[12;1H
Object Modules [.OBJ]: ihex.obj /farcall
Run File [ihex.exe]: "ihex.exe" /noi
List File [NUL.MAP]: NUL
Libraries [.LIB]:
Definitions File [NUL.DEF]: ;
  CD ..

MAKE : warning U4000: 'ABIOS.OBJ' : target does not exist
  TASM /ml /p /w2 /m2 PBIOS, ABIOS.OBJ, PBIOS.LIS;
Turbo Assembler  Version 2.0  Copyright (c) 1988, 1990 Borland International

Assembling file:   PBIOS.ASM  to  ABIOS.OBJ
^[[37m^[[40m^[[2;21r^[[2;1H^[[2;21r^[[21;1H



^[[2;21r^[[3;1H^[[1;25r^[[11;1H^[[37m^[[44mTurbo Assembler  Version 2.0  Copyright (c) 1988, 1990 Borland International    ^[[12;1H                                                                                ^[[13;1HAssembling file:   PBIOS.ASM  to  ABIOS.OBJ                                     ^[[14;1HError messages:    None                                                         ^[[15;1HWarning messages:  None                                                         ^[[16;1HPasses:            2                                                            ^[[17;1HRemaining memory:  258k                                                         ^[[18;1H                                                                                ^[[19;1H                                                                                ^[[20;1HMAKE : warning U4000: 'BBIOS.OBJ' : target does not exist                       ^[[21;1H  TASM /ml /p /w2 /m2 /DCOM=1 PBIOS, BBIOS.OBJ;                                 ^[[24;35HB
^[[37m^[[40m^[[1;24r^[[1;1H^[[1;24r^[[24;1H







^[[1;24r^[[1;1H^[[1;25r^[[17;1H^[[37m^[[44mError messages:    None                                                         ^[[18;1HWarning messages:  None                                                         ^[[19;1HPasses:            2                                                            ^[[20;1HRemaining memory:  262k                                                         ^[[21;1H                                                                                ^[[22;1H                                                                                ^[[23;1HMAKE : warning U4000: 'ABIOS.BIN' : target does not exist                       ^[[24;1H  LINK/T ABIOS.OBJ, ABIOS.BIN;                                                  ^[[25;1H^[[37m^[[40m^[[6;15r^[[6;1H^[M^[M^[M^[M^[M^[M^[M^[[1;25r^[[1;1H^[[37m^[[44mAssembling file:   PBIOS.ASM  to  ABIOS.OBJ
Error messages:    None
Warning messages:  None
Passes:^[[12C2
Remaining memory:  258k
                                                                                ^[[7;1H                                                                                ^[[8;1HMAKE : warning U4000: 'BBIOS.OBJ' : target does not exist                       ^[[9;1H  TASM /ml /p /w2 /m2 /DCOM=1 PBIOS, BBIOS.OBJ;                                 ^[[10;1HTurbo Assembler  Version 2.0  Copyright (c) 1988, 1990 Borland International    ^[[11;1H                                                                                ^[[12;1HAssembling file:   PBIOS.ASM  to  BBIOS.OBJ                                     ^[[16;1HRemaining memory:  262k


MAKE : warning U4000: 'ABIOS.BIN' : target does not exist
  LINK/T ABIOS.OBJ, ABIOS.BIN;

Microsoft (R) Segmented-Executable Linker  Version 5.13
Copyright (C) Microsoft Corp 1984-1991.  All rights reserved.^[[24;3H
^[[37m^[[40m^[[2;5r^[[2;1H^[M^[[1;25r^[[1;1H^[[37m^[[44m
Assembling file:   PBIOS.ASM  to  BBIOS.OBJ                                     ^[[6;1HRemaining memory:  262k


MAKE : warning U4000: 'ABIOS.BIN' : target does not exist
  LINK/T ABIOS.OBJ, ABIOS.BIN;

Microsoft (R) Segmented-Executable Linker  Version 5.13
Copyright (C) Microsoft Corp 1984-1991.  All rights reserved.

       ^[[12C
MAKE : warning U4000: 'PBIOS.COM' : target does not exist^[[17;3HLINK/T BBIOS.OBJ, PBIOS.COM;^[[19;2Hicrosoft (R) Segmented-Executable Linker  Version 5.13
Copyright (C) Microsoft Corp 1984-1991.  All rights reserved.


MAKE : warning U4000: 'PBIOS.ROM' : target does not exist    ^[[24;3HTOOLS\FIXUP ABIOS.BIN PBIOS.ROM
^[[37m^[[40m^[[2;24r^[[2;1H^[[2;24r^[[24;1H






^[[2;24r^[[3;1H^[[1;25r^[[18;1H^[[37m^[[44m                                                                                ^[[19;1HMAKE : warning U4000: 'PBIOS.HEX' : target does not exist                       ^[[20;1H  TOOLS\IHEX PBIOS.ROM PBIOS.HEX 0 20 Y                                         ^[[21;1HIHEX V2.00                              Analogic Software Tools                 ^[[22;1H                                                                                ^[[23;1HReading from file PBIOS.ROM,  writing to file PBIOS.HEX ...                     ^[[24;1HStarting address (hex) 0000, block length (hex) 20                              ^[[25;1H^[[37m^[[40m^[[1;25r^[[1;1H^[[1;25r^[[25;1H
^[[1;25r^[[1;1H^[[1;25r^[[25;1H^[[37m^[[44mC:\PBIOS>                                                                      ^[[25;11Hback\r^[[37m^[[40m^[[m\x0f^[[0m^[[m\r^[[K
^[(B^[(B\r         \r#
#
#
#
# exit
Script done on Wed Sep 22 16:00:39 2004

Even Linux 32-bit `script` works when I am in VM-86 mode.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
            Note 96.31% of all statistics are fiction.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 20:13               ` Richard B. Johnson
@ 2004-09-28 15:43                 ` Denis Vlasenko
  0 siblings, 0 replies; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-28 15:43 UTC (permalink / raw)
  To: root, Stas Sergeev; +Cc: Petr Vandrovec, Linux kernel

On Wednesday 22 September 2004 23:13, Richard B. Johnson wrote:
> Well DOSEMU uses VM-86 mode. That's how it works. It
> creates a virtual 8086 machine, complete with the
> required "DOS compatible" virtual BIOS environment.

You forgot DOS extenders which do not use VM86
but 32bit protected mode.
 
> I use it all the time because I write, amongst other things,
> the complete BIOS and startup code for many Intel based
> machines.
> 
> I run compilers, assemblers, linkers, and editors in that
> environment and it works.

The fact that those programs work fine does not automatically mean
that _other_ DOS programs will work.
 
> Sombody mentions a completely unrelated so-called Intel
> bug and next thing you know, there are patches to work-
> around the bug???
>
> The bug doesn't exist period.

Sorry, but it does exist. It is really obscure bug.
Please read on: 

> There is a "semi-official" bug in Intel CPUs,
> which is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
> chapter "Specification Clarifications"
> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
> 
> A short quote:
> "ISSUE: When a 32-bit IRET is used to return to another privilege level,
> and the old level uses a 4G stack (D/B bit in the segment register = 1),
> while the new level uses a 64k stack (D/B bit = 0), then only the 
> lower word of ESP is updated."
> 
> Which means that the higher word of ESP gets
> trashed. This bug beats dosemu rather badly,
> but there seem to be not much info about that
> bug available on the net.

IRET-to-lower-CPL-level stack frame:

+16 old_SS
+12 old_ESP
 +8 old_EFLAGS
 +4 old_CS
 +0 old_EIP <--- SS:ESP

> > If old_SS references 'small' data segment (16-bit one),
> > processor does not restore ESP from old_ESP.
> > It restores lower 16 bits only. Upper 16 bits are filled
> > with garbage (or just not modified at all?).
>
> Not modified at all, yes. That's why it is always
> greater than TASK_SIZE (0xc0000000) - it is still
> in a kernel space.
>
> > This works fine because processor uses SP, not ESP,
> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
>
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...
--
vda


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 18:49         ` Stas Sergeev
  2004-09-22 19:19           ` Richard B. Johnson
@ 2004-09-22 20:02           ` Petr Vandrovec
  2004-09-23  4:09             ` Stas Sergeev
  2004-09-23 17:08             ` Stas Sergeev
  1 sibling, 2 replies; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-22 20:02 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Wed, Sep 22, 2004 at 10:49:45PM +0400, Stas Sergeev wrote:
> Hi Petr et al.
> 
> I coded up something that remotely looks like the
> discussed patch.
> I have deviated from your proposals at some important
> points:
> 1. I am not allocating the ring1 stack separately,
> I am allocating it on a ring0 stack. Much simpler
> code, although non-reentrant/preempt-unsafe.

I think that it is problem.  And do not forget that NMI can occur at any
place, so I'm not quite sure how your code works.

> 2. I am disabling the interrupts after all. That's
> because of the preempt-unsafeness. I pass up the
> IOPL=1 when necessary, to avoid problems.
> But I guess also with your technique the interrupts
> had to be disabled, unless the ring1 stack is
> per-thread.

What if exception occurs in CPL1 code?  Processor reloads SS/ESP from
TSS, pointing at top of CPL0 stack - where it overwrites CPL1 stack -
- and it pushes on top of CPL0 stack address of CPL1 - and real return
address for CPL3 was lost :-(  Boom.

You change base address for SS segment - in that case you need per-CPU
SS segment, it wont' work with global SS segment.

> 3. I am using LAR. Do you really think it can be
> slower than the whole thing of locating LDT?

Try measuring that.  I think that LAR will be faster than lookup in
memory, but...
> 
> Let me know if I did something stupid.
> The patch is attached.
> I tested (pretty much) everything in it, except
> probably the "popl %esp" restartability. But that
> one looks fairly simple and should work.

Did you tried setting SS/ESP on stack to invalid values, so IRET on CPL1
triggers fault?  You should be able to do that by modifying SS/ESP from
signal handler.

> +#define NMI_STACK_RESERVE 0x400
> +#define MAKE_ESPFIX \
> +	cli; \
> +	subl $NMI_STACK_RESERVE, %esp; \

This is ugly.  What if NMI handler uses more than 1KB?

> +	.rept 5; \
> +	pushl NMI_STACK_RESERVE+0x10(%esp); \
> +	.endr; \
> +	pushl %eax; \
> +	movl %esp, %eax; \
> +	pushl %ebp; \
> +	GET_THREAD_INFO(%ebp); \
> +	movl TI_cpu(%ebp), %ebp; \
> +	shll $GDT_SIZE_SHIFT, %ebp; \
> +	addl $cpu_gdt_table, %ebp; \
> +	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \
> +	shrl $16, %eax; \
> +	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \
> +	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \
> +	popl %ebp; \
> +	popl %eax; \
> +	pushw 14(%esp); \

What about referencing 14+NMI_STACK_RESERVE() ?  And ESP is not
multiple of 4 here - it may slow down processor a bit.

I think that you should prepare stack by using moves to
-NMI_STACK_RESERVE-xx(%esp) instead of adjusting stack and using pushes.
Or you can (if you want per-thread and not per-CPU stack) prepare CPL1
stack above CPL0 stack - this way you steal 24 bytes from CPL0 stack,
but CPL0 can use full 4KB (8KB), without NMI being limited by 1KB.

> +	pushw $4; \
> +	pushl $__ESPFIX_SS; \
> +	pushl $0; \
> +	pushl 20(%esp); \
> +	andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
> +	orl $IOPL1_MASK, (%esp); \

Does this work if userspace runs with iopl 3,2 or 1?  Does iret on CPL1
set iopl level properly (i.e. is iopl set if IOPL <= CPL, or only if CPL=0?).

> +	pushl $__ESPFIX_CS; \
> +	pushl $espfix_trampoline;
> +
>  #define SAVE_ALL \
>  	cld; \
>  	pushl %es; \
> @@ -122,9 +172,24 @@
>  .previous
>  
>  
> +/* If returning to Ring-3, not to V86, and with
> + * the small stack, try to fix the higher word of
> + * ESP, as the CPU won't restore it from stack.
> + * This is an "official" bug of all the x86-compatible
> + * CPUs, which we can try to work around to make
> + * dosemu happy. */
>  #define RESTORE_ALL	\
> -	RESTORE_REGS	\
> -	addl $4, %esp;	\
> +	movl EFLAGS(%esp), %eax; \
> +	movb CS(%esp), %al; \
> +	andl $(VM_MASK | 2), %eax; \
> +	cmpl $2, %eax; \
> +	jne 3f; \
> +	larl OLDSS(%esp), %eax; \
> +	testl $0x00400000, %eax; \
> +3:	RESTORE_REGS	\
> +	leal 4(%esp), %esp; \
> +	jnz 1f; \
> +	MAKE_ESPFIX \
>  1:	iret;		\
>  .section .fixup,"ax";   \

I'm not sure about lea speed.  And fast path should NOT jump,
forward jumps are initialy predicted as not taken... Maybe:

   3: RESTORE_REGS
      jz 8f
      add $4,%esp
   1: iret
   8: add $4,%esp
      MAKE_ESPFIX
      iret

And then you can move MAKE_ESPFIX out of macro.

And it seems to me that it adds too many operations to fast path.
Maybe you want to introduce some per-thread flag, or leave syscall path
to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).

> diff -urN linux-2.6.8-pcsp/arch/i386/kernel/process.c linux-2.6.8-stacks/arch/i386/kernel/process.c
> --- linux-2.6.8-pcsp/arch/i386/kernel/process.c	2004-06-21 09:19:07.000000000 +0400
> +++ linux-2.6.8-stacks/arch/i386/kernel/process.c	2004-09-21 10:45:55.000000000 +0400
> @@ -225,7 +225,7 @@
>  	printk("EIP: %04x:[<%08lx>] CPU: %d\n",0xffff & regs->xcs,regs->eip, smp_processor_id());
>  	print_symbol("EIP is at %s\n", regs->eip);
>  
> -	if (regs->xcs & 3)
> +	if (regs->xcs & 2)
>  		printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);

I think you should leave this as is, regs->xss/regs->esp should be valid
for CPL1 exceptions.

>  	printk(" EFLAGS: %08lx    %s  (%s)\n",regs->eflags, print_tainted(),UTS_RELEASE);
>  	printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n",
> diff -urN linux-2.6.8-pcsp/arch/i386/kernel/signal.c linux-2.6.8-stacks/arch/i386/kernel/signal.c
> --- linux-2.6.8-pcsp/arch/i386/kernel/signal.c	2004-08-10 11:02:36.000000000 +0400
> +++ linux-2.6.8-stacks/arch/i386/kernel/signal.c	2004-09-21 10:48:38.000000000 +0400
> @@ -558,7 +558,7 @@
>  	 * kernel mode. Just return without doing anything
>  	 * if so.
>  	 */
> -	if ((regs->xcs & 3) != 3)
> +	if (!(regs->xcs & 2))
>  		return 1;

Really?  I think that it should stay.

> @@ -630,7 +630,7 @@
>  	/* If this is a kernel mode trap, save the user PC on entry to 
>  	 * the kernel, that's what the debugger can make sense of.
>  	 */
> -	info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : 
> +	info.si_addr = ((regs->xcs & 2) == 0) ? (void *)tsk->thread.eip : 
>  	                                        (void *)regs->eip;

Sure?  Maybe this should be (regs->xcs & 3 != 3) instead?

Rest looks fine, but I'm not completely sure that SMP and NMIs are
handled in the best possible way.
							Petr Vandrovec

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 20:02           ` Petr Vandrovec
@ 2004-09-23  4:09             ` Stas Sergeev
  2004-09-23 17:08             ` Stas Sergeev
  1 sibling, 0 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-09-23  4:09 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

Hi,

Petr Vandrovec wrote:
>> 1. I am not allocating the ring1 stack separately,
>> I am allocating it on a ring0 stack. Much simpler
>> code, although non-reentrant/preempt-unsafe.
> I think that it is problem.  And do not forget that NMI can occur at any
> place, so I'm not quite sure how your code works.
Why it is not preempt-unsafe, I beleive, is because
if it is preempted after modifying GDT, and then
another process will walk that path modifying GDT,
the first process will not return anywhere. But as
for NMI this should not be the case, as it doesn't
seem to call schedule().

> What if exception occurs in CPL1 code?  Processor reloads SS/ESP from
> TSS, pointing at top of CPL0 stack - where it overwrites CPL1 stack -
> - and it pushes on top of CPL0 stack address of CPL1 - and real return
> address for CPL3 was lost :-(  Boom.
But for that I reserve the room on stack, so that
should not hurt.

> You change base address for SS segment - in that case you need per-CPU
> SS segment, it wont' work with global SS segment.
But I beleive it *is* per-CPU, no? I am taking the
TI_cpu value from the thread struct to look up the
proper GDT, should that work? (no SMP-box here to
test)

> Did you tried setting SS/ESP on stack to invalid values, so IRET on CPL1
> triggers fault?  You should be able to do that by modifying SS/ESP from
> signal handler.
Yes, I did that - that seem to work (i.e. no Oops
or something bad).

>> +#define NMI_STACK_RESERVE 0x400
>> +#define MAKE_ESPFIX \
>> +	cli; \
>> +	subl $NMI_STACK_RESERVE, %esp; \
> This is ugly.  What if NMI handler uses more than 1KB?
Adjusting NMI_STACK_RESERVE? I need only a few
bytes of stack to work with, so I can reserve
any amount for NMI. What should it be? 0x400 was
taken "randomly", but really, is it worth allocating
another per-CPU object to only save a few bytes of
the ring-0 stack?

> What about referencing 14+NMI_STACK_RESERVE() ?
Will work either, but for what? Just looks a bit
cleaner to me to do it that way, or am I missing
something?

> And ESP is not
> multiple of 4 here - it may slow down processor a bit.
But I have to store the value of ((ESP & 0xffff) | 4) on
stack, so that's why I do that. How can it be
done in a more effective way? pushl, then movw?

> I think that you should prepare stack by using moves to
> -NMI_STACK_RESERVE-xx(%esp) instead of adjusting stack and using pushes.
I can do that of course, do you think it is faster?

> Or you can (if you want per-thread and not per-CPU stack) prepare CPL1
> stack above CPL0 stack - this way you steal 24 bytes from CPL0 stack,
> but CPL0 can use full 4KB (8KB), without NMI being limited by 1KB.
Ughh, I really need the motivation to fiddle with
the CPL1 stack separately. I don't want to limit the
NMI handler, and 0x400 was just a random value. Maybe
just enlarging it will resolve the problem?

>> +	pushl 20(%esp); \
>> +	andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
>> +	orl $IOPL1_MASK, (%esp); \
> Does this work if userspace runs with iopl 3,2 or 1?  Does iret on CPL1
Yes, that should work I think.

> set iopl level properly (i.e. is iopl set if IOPL <= CPL, or only if CPL=0?).
Ring-1 code can't alter the IOPL, but then I set
IOPL3 initially. I just use ring3_IOPL|IOPL1, which
should be able to handle all cases (i.e. I do
"pushl 20(%esp)" exactly only to get the Ring-3 IOPL,
I don't care about the other flags there).

> I'm not sure about lea speed.  And fast path should NOT jump,
> forward jumps are initialy predicted as not taken... Maybe:
No problems.

> And then you can move MAKE_ESPFIX out of macro.
Good idea! :)

> And it seems to me that it adds too many operations to fast path.
Ughh, 8 insns only! Is this really too much? :(

> Maybe you want to introduce some per-thread flag,
I don't know. Probably not. I'll think about that.

> or leave syscall path
> to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).
Yes. I'll see how that can be done.

>> -	if (regs->xcs & 3)
>> +	if (regs->xcs & 2)
>>  		printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);
> I think you should leave this as is, regs->xss/regs->esp should be valid
> for CPL1 exceptions.
Very probably, I was not sure about all that
s/3/2/g changes I made. So I'll remove those
you mentioned, thanks. It was just that the
kernel/user code should now be considered by
regs->xcs&2 rather than regs->xcs&3, so I just
replaced it everywhere. I'll have a closer look.

Also I was suggested privately to try doing
everything at Ring-0 after all (unfortunately
the guy have not CCd the message to LKML).
I know that we use Ring-1 so that the NMI
handler to work on the small stack, but the
guy writes:

... but i think linux
already has a task gate for NMI, or if not, it should.

I should have a look also in that direction I think.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 20:02           ` Petr Vandrovec
  2004-09-23  4:09             ` Stas Sergeev
@ 2004-09-23 17:08             ` Stas Sergeev
  2004-09-23 18:06               ` Petr Vandrovec
  1 sibling, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-23 17:08 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4021 bytes --]

Hi.

I'll reply to that mail again because my previous
reply was done in a big hurry.
The new patch is attached. I applied most of the
optimizations you suggested and reverted the things
you said were unnecessary/wrong - thanks.
Does this one look better?

Petr Vandrovec wrote:
>> 1. I am not allocating the ring1 stack separately,
>> I am allocating it on a ring0 stack. Much simpler
>> code, although non-reentrant/preempt-unsafe.
> I think that it is problem.  And do not forget that NMI can occur at any
> place, so I'm not quite sure how your code works.
I assume this was a misunderstanding. Now I use
"preempt_stop" instead of "cli" to make it more
obvious what am I doing. So I assume that part is
preempt-unsafe, but not NMI-unsafe.

>> +#define NMI_STACK_RESERVE 0x400
>> +#define MAKE_ESPFIX \
>> +	cli; \
>> +	subl $NMI_STACK_RESERVE, %esp; \
> This is ugly.  What if NMI handler uses more than 1KB?
In that new patch I set the const to 0xe00, which
is 3,5K. Is it still the limitation? I can probably
raise it ever higher, but I am not sure if there is
some sensible data below the stack that I may overwrite
ocasionally?

> And ESP is not
> multiple of 4 here - it may slow down processor a bit.
Fixed with "pushl/movw".

> I'm not sure about lea speed.  And fast path should NOT jump,
> forward jumps are initialy predicted as not taken... Maybe:
Done. Looks much better now!

> And then you can move MAKE_ESPFIX out of macro.
He-he, but RESTORE_ALL is a macro itself, so... :)

> Maybe you want to introduce some per-thread flag,
What kind of flag do you have in mind? Flag the
modify_ldt() calls that set some segment to 16bit?
But is it really worth the 8 insns I use? Checking
the per-thread flag will cost you ~5 insns anyway,
and a headache...

> or leave syscall path
> to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?).
I wanted to do that, but then recalled the comment
of Pavel Machek, who thinks exporting the %esp
value to user-space is an "informational leak".
I personally don't see the security threat here,
but it looks safe to assume that Pavel is right.
In which case I think we have to root out the
bug from every path, including the syscall. So I
left that place as is for now.

>> -	if (regs->xcs & 3)
>> +	if (regs->xcs & 2)
>>  		printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp);
> I think you should leave this as is, regs->xss/regs->esp should be valid
> for CPL1 exceptions.
OK, everything you mentioned, is reverted.

> Rest looks fine, but I'm not completely sure that SMP and NMIs are
> handled in the best possible way.
Well, even though my patch can be improved, I don't
feel quite comfortable with its complexity.
Now I really want to re-assure myself that this all
is not possible to do on ring0, in which case this
will simply be the trivial thing.
Anonymous LKML reader (whom I have to thank for the
hints) proposed 2 things:
1. Try lss followed by iret. In that case the interrupt
will not kill us because after lss the interrupts are
not accepted until the next insn is executed, AFAIK.
But he was not sure, and I don't know either, if it is
true even for NMI. So if it is - we could just do
lss/iret, and the problem is solved. So is this true
for NMI? Is this documented anywhere?
2. Set task gate for NMI. AFAICS, the task-gate is
now used only for the doublefault exception, but not
for NMI. If I understand the idea correctly, this
will guarantee that the NMI will be executing on a
separate stack, which may be a good idea in any case,
and will allow us to use the small stack at ring-0,
if only we disable the interrupts. Are there any
chances to use the task-gate for NMI? I may even try to
implement that myself, but I guess there are *reasons*
why it is not yet? (of course this applies only if 1.
fails)

Or maybe somehow to modify the NMI handler itself,
so that it will check if it is on a "small" stack,
and switch to the "big" stack before anything else?
Well, doing the whole trick at ring-0 sounds really
plausible to me...


[-- Attachment #2: linux-2.6.8-stacks3.diff --]
[-- Type: text/x-patch, Size: 8504 bytes --]

diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
--- linux-2.6.8-pcsp/arch/i386/kernel/entry.S	2004-06-10 13:28:35.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/entry.S	2004-09-23 16:14:55.209520160 +0400
@@ -70,9 +70,16 @@
 CF_MASK		= 0x00000001
 TF_MASK		= 0x00000100
 IF_MASK		= 0x00000200
-DF_MASK		= 0x00000400 
+DF_MASK		= 0x00000400
+IOPL1_MASK 	= 0x00001000
+IOPL2_MASK 	= 0x00002000
+IOPL_MASK 	= 0x00003000
 NT_MASK		= 0x00004000
+RF_MASK		= 0x00010000
 VM_MASK		= 0x00020000
+AC_MASK 	= 0x00040000
+VIF_MASK 	= 0x00080000
+VIP_MASK 	= 0x00100000
 
 #ifdef CONFIG_PREEMPT
 #define preempt_stop		cli
@@ -122,8 +129,23 @@
 .previous
 
 
+#define NMI_STACK_RESERVE 0xe00
+/* If returning to Ring-3, not to V86, and with
+ * the small stack, try to fix the higher word of
+ * ESP, as the CPU won't restore it from stack.
+ * This is an "official" bug of all the x86-compatible
+ * CPUs, which we can try to work around to make
+ * dosemu happy. */
 #define RESTORE_ALL	\
-	RESTORE_REGS	\
+	movl EFLAGS(%esp), %eax; \
+	movb CS(%esp), %al; \
+	andl $(VM_MASK | 2), %eax; \
+	cmpl $2, %eax;  \
+	jne 3f;         \
+	larl OLDSS(%esp), %eax; \
+	testl $0x00400000, %eax; \
+3:	RESTORE_REGS	\
+	jz 8f;          \
 	addl $4, %esp;	\
 1:	iret;		\
 .section .fixup,"ax";   \
@@ -137,8 +159,43 @@
 .section __ex_table,"a";\
 	.align 4;	\
 	.long 1b,2b;	\
-.previous
-
+.previous; \
+	/* preparing the ESPfix here */ \
+	/* reserve some space on stack for the NMI handler */ \
+8:	subl $(NMI_STACK_RESERVE-4), %esp; \
+	.rept 5; \
+	pushl NMI_STACK_RESERVE+16(%esp); \
+	.endr; \
+	pushl %eax; \
+	movl %esp, %eax; \
+	pushl %ebp; \
+	preempt_stop; \
+	GET_THREAD_INFO(%ebp); \
+	movl TI_cpu(%ebp), %ebp; \
+	shll $GDT_SIZE_SHIFT, %ebp; \
+	/* find GDT of the proper CPU */ \
+	addl $cpu_gdt_table, %ebp; \
+	/* patch the base of the ring-1 16bit stack */ \
+	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \
+	shrl $16, %eax; \
+	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \
+	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \
+	popl %ebp; \
+	popl %eax; \
+	/* push the ESP value to preload on ring-1 */ \
+	pushl 12(%esp); \
+	movw $4, (%esp); \
+	/* push the iret frame for our ring-1 trampoline */ \
+	pushl $__ESPFIX_SS; \
+	pushl $0; \
+	/* push ring-3 flags only to get the IOPL right */ \
+	pushl 20(%esp); \
+	andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
+	/* we need at least IOPL1 to re-enable interrupts */ \
+	orl $IOPL1_MASK, (%esp); \
+	pushl $__ESPFIX_CS; \
+	pushl $espfix_trampoline; \
+	iret;
 
 
 ENTRY(lcall7)
@@ -174,6 +231,23 @@
 	jmp do_lcall
 
 
+ENTRY(espfix_trampoline)
+	popl %esp
+espfix_past_esp:
+1:	iret
+.section .fixup,"ax"
+2:	sti
+	movl $(__USER_DS), %edx
+	movl %edx, %ds
+	movl %edx, %es
+	pushl $11
+	call do_exit
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,2b
+.previous
+
 ENTRY(ret_from_fork)
 	pushl %eax
 	call schedule_tail
@@ -196,7 +270,7 @@
 	GET_THREAD_INFO(%ebp)
 	movl EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb CS(%esp), %al
-	testl $(VM_MASK | 3), %eax
+	testl $(VM_MASK | 2), %eax
 	jz resume_kernel		# returning to kernel or vm86-space
 ENTRY(resume_userspace)
  	cli				# make sure we don't miss an interrupt
@@ -504,7 +578,12 @@
 ENTRY(nmi)
 	cmpl $sysenter_entry,(%esp)
 	je nmi_stack_fixup
-	pushl %eax
+	cmpl $espfix_past_esp,(%esp)
+	jne 1f
+	/* restart the "popl %esp" */
+	decl (%esp)
+	subw $4, 12(%esp)
+1:	pushl %eax
 	movl %esp,%eax
 	/* Do not access memory above the end of our stack page,
 	 * it might not exist.
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S
--- linux-2.6.8-pcsp/arch/i386/kernel/head.S	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/head.S	2004-09-21 14:58:16.000000000 +0400
@@ -514,8 +514,8 @@
 	.quad 0x00009a0000000000	/* 0xc0 APM CS 16 code (16 bit) */
 	.quad 0x0040920000000000	/* 0xc8 APM DS    data */
 
-	.quad 0x0000000000000000	/* 0xd0 - unused */
-	.quad 0x0000000000000000	/* 0xd8 - unused */
+	.quad 0x00cfba000000ffff	/* 0xd0 - ESPfix CS */
+	.quad 0x0000b2000000ffff	/* 0xd8 - ESPfix SS */
 	.quad 0x0000000000000000	/* 0xe0 - unused */
 	.quad 0x0000000000000000	/* 0xe8 - unused */
 	.quad 0x0000000000000000	/* 0xf0 - unused */
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/traps.c linux-2.6.8-stacks/arch/i386/kernel/traps.c
--- linux-2.6.8-pcsp/arch/i386/kernel/traps.c	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/traps.c	2004-09-23 10:39:53.000000000 +0400
@@ -210,7 +210,7 @@
 
 	esp = (unsigned long) (&regs->esp);
 	ss = __KERNEL_DS;
-	if (regs->xcs & 3) {
+	if (regs->xcs & 2) {
 		in_kernel = 0;
 		esp = regs->esp;
 		ss = regs->xss & 0xffff;
@@ -264,7 +264,7 @@
 	char c;
 	unsigned long eip;
 
-	if (regs->xcs & 3)
+	if (regs->xcs & 2)
 		goto no_bug;		/* Not in kernel */
 
 	eip = regs->eip;
@@ -335,7 +335,7 @@
 
 static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err)
 {
-	if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs))
+	if (!(regs->eflags & VM_MASK) && !(2 & regs->xcs))
 		die(str, regs, err);
 }
 
@@ -357,7 +357,7 @@
 		goto trap_signal;
 	}
 
-	if (!(regs->xcs & 3))
+	if (!(regs->xcs & 2))
 		goto kernel_trap;
 
 	trap_signal: {
@@ -437,7 +437,7 @@
 	if (regs->eflags & VM_MASK)
 		goto gp_in_vm86;
 
-	if (!(regs->xcs & 3))
+	if (!(regs->xcs & 2))
 		goto gp_in_kernel;
 
 	current->thread.error_code = error_code;
@@ -614,7 +614,7 @@
 		 * allowing programs to debug themselves without the ptrace()
 		 * interface.
 		 */
-		if ((regs->xcs & 3) == 0)
+		if ((regs->xcs & 2) == 0)
 			goto clear_TF_reenable;
 		if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE)
 			goto clear_TF;
@@ -630,7 +630,7 @@
 	/* If this is a kernel mode trap, save the user PC on entry to 
 	 * the kernel, that's what the debugger can make sense of.
 	 */
-	info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : 
+	info.si_addr = ((regs->xcs & 3) != 3) ? (void *)tsk->thread.eip : 
 	                                        (void *)regs->eip;
 	force_sig_info(SIGTRAP, &info, tsk);
 
diff -urN linux-2.6.8-pcsp/arch/i386/mm/extable.c linux-2.6.8-stacks/arch/i386/mm/extable.c
--- linux-2.6.8-pcsp/arch/i386/mm/extable.c	2004-06-09 15:44:19.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/mm/extable.c	2004-09-21 17:17:05.000000000 +0400
@@ -26,6 +26,11 @@
 	}
 #endif
 
+	if (unlikely(regs->xcs == __ESPFIX_CS))
+	{
+		regs->xcs = __KERNEL_CS;
+	}
+
 	fixup = search_exception_tables(regs->eip);
 	if (fixup) {
 		regs->eip = fixup->fixup;
diff -urN linux-2.6.8-pcsp/arch/i386/mm/fault.c linux-2.6.8-stacks/arch/i386/mm/fault.c
--- linux-2.6.8-pcsp/arch/i386/mm/fault.c	2004-08-10 11:02:37.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/mm/fault.c	2004-09-21 16:03:28.000000000 +0400
@@ -77,7 +77,7 @@
 	u32 seg_ar, seg_limit, base, *desc;
 
 	/* The standard kernel/user address space limit. */
-	*eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg;
+	*eip_limit = (seg & 2) ? USER_DS.seg : KERNEL_DS.seg;
 
 	/* Unlikely, but must come before segment checks. */
 	if (unlikely((regs->eflags & VM_MASK) != 0))
diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h
--- linux-2.6.8-pcsp/include/asm-i386/segment.h	2004-01-09 09:59:19.000000000 +0300
+++ linux-2.6.8-stacks/include/asm-i386/segment.h	2004-09-21 14:57:44.000000000 +0400
@@ -38,8 +38,8 @@
  *  24 - APM BIOS support
  *  25 - APM BIOS support 
  *
- *  26 - unused
- *  27 - unused
+ *  26 - ESPfix Ring-1 trampoline CS
+ *  27 - ESPfix Ring-1 small SS
  *  28 - unused
  *  29 - unused
  *  30 - unused
@@ -71,14 +71,21 @@
 #define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 6)
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
+#define GDT_ENTRY_ESPFIX_CS		(GDT_ENTRY_KERNEL_BASE + 14)
+#define GDT_ENTRY_ESPFIX_SS		(GDT_ENTRY_KERNEL_BASE + 15)
+#define __ESPFIX_CS (GDT_ENTRY_ESPFIX_CS * 8 + 1)
+#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8 + 1)
+
 #define GDT_ENTRY_DOUBLEFAULT_TSS	31
 
 /*
  * The GDT has 32 entries
  */
-#define GDT_ENTRIES 32
-
-#define GDT_SIZE (GDT_ENTRIES * 8)
+#define GDT_ENTRIES_SHIFT 5
+#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT)
+#define GDT_ENTRY_SHIFT 3
+#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT)
+#define GDT_SIZE (1 << GDT_SIZE_SHIFT)
 
 /* Simple and small GDT entries for booting only */
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-23 17:08             ` Stas Sergeev
@ 2004-09-23 18:06               ` Petr Vandrovec
  2004-09-24 20:36                 ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-23 18:06 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Thu, Sep 23, 2004 at 09:08:54PM +0400, Stas Sergeev wrote:
> Petr Vandrovec wrote:
> >>1. I am not allocating the ring1 stack separately,
> >>I am allocating it on a ring0 stack. Much simpler
> >>code, although non-reentrant/preempt-unsafe.
> >I think that it is problem.  And do not forget that NMI can occur at any
> >place, so I'm not quite sure how your code works.
> I assume this was a misunderstanding. Now I use
> "preempt_stop" instead of "cli" to make it more
> obvious what am I doing. So I assume that part is
> preempt-unsafe, but not NMI-unsafe.
> 
> >>+#define NMI_STACK_RESERVE 0x400
> >>+#define MAKE_ESPFIX \
> >>+	cli; \
> >>+	subl $NMI_STACK_RESERVE, %esp; \
> >This is ugly.  What if NMI handler uses more than 1KB?
> In that new patch I set the const to 0xe00, which
> is 3,5K. Is it still the limitation? I can probably
> raise it ever higher, but I am not sure if there is
> some sensible data below the stack that I may overwrite
> ocasionally?

For 4KB stacks 2KB looks better (if NMIs do not use
its own stacks).

> >Rest looks fine, but I'm not completely sure that SMP and NMIs are
> >handled in the best possible way.
> Well, even though my patch can be improved, I don't
> feel quite comfortable with its complexity.
> Now I really want to re-assure myself that this all
> is not possible to do on ring0, in which case this
> will simply be the trivial thing.
> Anonymous LKML reader (whom I have to thank for the
> hints) proposed 2 things:
> 1. Try lss followed by iret. In that case the interrupt
> will not kill us because after lss the interrupts are
> not accepted until the next insn is executed, AFAIK.
> But he was not sure, and I don't know either, if it is
> true even for NMI. So if it is - we could just do
> lss/iret, and the problem is solved. So is this true
> for NMI? Is this documented anywhere?

LSS is exempted from rule that interrupts cannot occur after
operation modifying SS, as with LSS you are supposed to
load SS together with ESP in one operation, and so you
do not need interrupt holdoff.  And when exception
occurs on iret, it will happen with 16bit stack, and
you are dead...

> 2. Set task gate for NMI. AFAICS, the task-gate is
> now used only for the doublefault exception, but not
> for NMI. If I understand the idea correctly, this
> will guarantee that the NMI will be executing on a
> separate stack, which may be a good idea in any case,
> and will allow us to use the small stack at ring-0,
> if only we disable the interrupts. Are there any
> chances to use the task-gate for NMI? I may even try to
> implement that myself, but I guess there are *reasons*
> why it is not yet? (of course this applies only if 1.
> fails)

Yes. But you still have to handle exceptions on iret
to userspace :-(

> Or maybe somehow to modify the NMI handler itself,
> so that it will check if it is on a "small" stack,
> and switch to the "big" stack before anything else?
> Well, doing the whole trick at ring-0 sounds really
> plausible to me...

If you can do that, great.  But you have to modify
at least NMI, GP and SS fault handlers to reload
SS/ESP with correct values.

> diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
> --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S	2004-06-10 13:28:35.000000000 +0400
> +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S	2004-09-23 16:14:55.209520160 +0400
> @@ -122,8 +129,23 @@
>  .previous
>  
>  
> +#define NMI_STACK_RESERVE 0xe00
> +/* If returning to Ring-3, not to V86, and with
> + * the small stack, try to fix the higher word of
> + * ESP, as the CPU won't restore it from stack.
> + * This is an "official" bug of all the x86-compatible
> + * CPUs, which we can try to work around to make
> + * dosemu happy. */
>  #define RESTORE_ALL	\
> -	RESTORE_REGS	\
> +	movl EFLAGS(%esp), %eax; \
> +	movb CS(%esp), %al; \
> +	andl $(VM_MASK | 2), %eax; \
> +	cmpl $2, %eax;  \
> +	jne 3f;         \
> +	larl OLDSS(%esp), %eax; \
> +	testl $0x00400000, %eax; \
> +3:	RESTORE_REGS	\
> +	jz 8f;          \
>  	addl $4, %esp;	\
>  1:	iret;		\
>  .section .fixup,"ax";   \
> @@ -137,8 +159,43 @@
>  .section __ex_table,"a";\
>  	.align 4;	\
>  	.long 1b,2b;	\
> -.previous
> -
> +.previous; \
> +	/* preparing the ESPfix here */ \
> +	/* reserve some space on stack for the NMI handler */ \
> +8:	subl $(NMI_STACK_RESERVE-4), %esp; \
> +	.rept 5; \

Any reason why you left this part of RESTORE_ALL macro?  It can
be moved out, IMHO.  RESTORE_ALL is used twice in entry.S, so you
could save one copy.  Though I'm not sure why NMI handler simple
does not jump to RESTORE_ALL we already have.

> +	pushl NMI_STACK_RESERVE+16(%esp); \
> +	.endr; \
> +	pushl %eax; \
> +	movl %esp, %eax; \
> +	pushl %ebp; \
> +	preempt_stop; \
> +	GET_THREAD_INFO(%ebp); \
> +	movl TI_cpu(%ebp), %ebp; \
> +	shll $GDT_SIZE_SHIFT, %ebp; \
> +	/* find GDT of the proper CPU */ \
> +	addl $cpu_gdt_table, %ebp; \

I did not know we have per-CPU GDT tables...  This explains it.

> +	/* patch the base of the ring-1 16bit stack */ \
> +	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \
> +	shrl $16, %eax; \
> +	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \
> +	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \
> +	popl %ebp; \
> +	popl %eax; \
> +	/* push the ESP value to preload on ring-1 */ \
> +	pushl 12(%esp); \
> +	movw $4, (%esp); \
> +	/* push the iret frame for our ring-1 trampoline */ \
> +	pushl $__ESPFIX_SS; \
> +	pushl $0; \
> +	/* push ring-3 flags only to get the IOPL right */ \
> +	pushl 20(%esp); \
> +	andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
> +	/* we need at least IOPL1 to re-enable interrupts */ \
> +	orl $IOPL1_MASK, (%esp); \
> +	pushl $__ESPFIX_CS; \
> +	pushl $espfix_trampoline; \
> +	iret;

FYI, on my system (P4/1.6GHz) 100M loops of these pushes takes 1.20sec
while written with subl $24,%esp and then doing movs to xx(%esp) takes
0.94sec.  Plus you could then reorganize code a bit (first do 5 pushes
to copy stack, then subtract 24 from esp, and push eax/ebp after that.
This way you can use %eax for computations you currently do in memory
(which are probably most important for 27% slowdown between
pushes and movs - I wrote mov code with using eax for EFLAGS1 & ESP1).

> -	.quad 0x0000000000000000	/* 0xd0 - unused */
> -	.quad 0x0000000000000000	/* 0xd8 - unused */
> +	.quad 0x00cfba000000ffff	/* 0xd0 - ESPfix CS */
> +	.quad 0x0000b2000000ffff	/* 0xd8 - ESPfix SS */

Set SS limit to 23 bytes, so misuse can be quickly catched?  As SP=0 and
we use 16bit stack, it goes from 0000 => FFFE, and you could corrupt
memory 64KB beyond top of CPL1 stack.
					Best regards,
						Petr Vandrovec



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-23 18:06               ` Petr Vandrovec
@ 2004-09-24 20:36                 ` Stas Sergeev
  2004-09-24 21:43                   ` Petr Vandrovec
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-24 20:36 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2907 bytes --]

Hi,

Petr Vandrovec wrote:
> In that new patch I set the const to 0xe00, which
> is 3,5K. Is it still the limitation? I can probably
> For 4KB stacks 2KB looks better
OK, done that. Wondering though, for what?
I don't need 2K myself, I need 24 bytes only.
So what prevents me to raise the gap to 3.5K
or somesuch? Why 2K looks better?

>> 2. Set task gate for NMI. AFAICS, the task-gate is
> Yes. But you still have to handle exceptions on iret
> to userspace :-(
Yes. Thanks, I see the problem now. And I
suppose setting the task-gates also for all
that exceptions is not an option...

>> Or maybe somehow to modify the NMI handler itself,
> If you can do that, great.  But you have to modify
> at least NMI, GP and SS fault handlers to reload
> SS/ESP with correct values.
Yes, that's the problem either. Doesn't look too
difficult (I'll have to look up TSS for the stack
pointer I guess, then do lss to it), but probably
modifying all that handlers for only that small
purpose is an overkill...

>> +.previous; \
>> +	/* preparing the ESPfix here */ \
>> +	/* reserve some space on stack for the NMI handler */ \
>> +8:	subl $(NMI_STACK_RESERVE-4), %esp; \
>> +	.rept 5; \
> Any reason why you left this part of RESTORE_ALL macro?
The reason is that the previous part of the macro
can jump to that part. So how can I divide those?

> be moved out, IMHO.  RESTORE_ALL is used twice in entry.S, so you
> could save one copy.
Do you mean the NMI return path doesn't need
the ESP fix at all? Why?

> Though I'm not sure why NMI handler simple
> does not jump to RESTORE_ALL we already have.
I can only change that and then un-macro the
RESTORE_ALL completely. So I did that.
Additionally I introduced the "short path"
for the case where we know for sure that we
are returning to the kernel. And I am not
setting the exception handler there because
returning to the kernel if fails, should die()
anyway. Is this correct?

>> +	pushl 20(%esp); \
>> +	andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \
>> +	/* we need at least IOPL1 to re-enable interrupts */ \
>> +	orl $IOPL1_MASK, (%esp); \
>> +	pushl $__ESPFIX_CS; \
>> +	pushl $espfix_trampoline; \
> FYI, on my system (P4/1.6GHz) 100M loops of these pushes takes 1.20sec
> while written with subl $24,%esp and then doing movs to xx(%esp) takes
> 0.94sec.
OK, I wasn't sure what pushes do you mean.
I supposed you wanted me to replace those
5 pushes that copy the stack frame.

> Plus you could then reorganize code a bit (first do 5 pushes
> to copy stack, then subtract 24 from esp, and push eax/ebp after that.
> This way you can use %eax for computations you currently do in memory
Done, thanks! Does this help your test-case?

>> +	.quad 0x00cfba000000ffff	/* 0xd0 - ESPfix CS */
>> +	.quad 0x0000b2000000ffff	/* 0xd8 - ESPfix SS */
> Set SS limit to 23 bytes, so misuse can be quickly catched?
Yes!

So the new patch is attached:)


[-- Attachment #2: linux-2.6.8-stacks4.diff --]
[-- Type: text/x-patch, Size: 9930 bytes --]

diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S
--- linux-2.6.8-pcsp/arch/i386/kernel/entry.S	2004-06-10 13:28:35.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/entry.S	2004-09-24 15:35:46.000000000 +0400
@@ -70,15 +70,22 @@
 CF_MASK		= 0x00000001
 TF_MASK		= 0x00000100
 IF_MASK		= 0x00000200
-DF_MASK		= 0x00000400 
+DF_MASK		= 0x00000400
+IOPL1_MASK 	= 0x00001000
+IOPL2_MASK 	= 0x00002000
+IOPL_MASK 	= 0x00003000
 NT_MASK		= 0x00004000
+RF_MASK		= 0x00010000
 VM_MASK		= 0x00020000
+AC_MASK 	= 0x00040000
+VIF_MASK 	= 0x00080000
+VIP_MASK 	= 0x00100000
 
 #ifdef CONFIG_PREEMPT
 #define preempt_stop		cli
 #else
 #define preempt_stop
-#define resume_kernel		restore_all
+#define resume_kernel		resume_kernel_nopreempt
 #endif
 
 #define SAVE_ALL \
@@ -122,25 +129,6 @@
 .previous
 
 
-#define RESTORE_ALL	\
-	RESTORE_REGS	\
-	addl $4, %esp;	\
-1:	iret;		\
-.section .fixup,"ax";   \
-2:	sti;		\
-	movl $(__USER_DS), %edx; \
-	movl %edx, %ds; \
-	movl %edx, %es; \
-	pushl $11;	\
-	call do_exit;	\
-.previous;		\
-.section __ex_table,"a";\
-	.align 4;	\
-	.long 1b,2b;	\
-.previous
-
-
-
 ENTRY(lcall7)
 	pushfl			# We get a different stack layout with call
 				# gates, which has to be cleaned up later..
@@ -174,6 +162,23 @@
 	jmp do_lcall
 
 
+ENTRY(espfix_trampoline)
+	popl %esp
+espfix_past_esp:
+1:	iret
+.section .fixup,"ax"
+2:	sti
+	movl $(__USER_DS), %edx
+	movl %edx, %ds
+	movl %edx, %es
+	pushl $11
+	call do_exit
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,2b
+.previous
+
 ENTRY(ret_from_fork)
 	pushl %eax
 	call schedule_tail
@@ -196,8 +201,8 @@
 	GET_THREAD_INFO(%ebp)
 	movl EFLAGS(%esp), %eax		# mix EFLAGS and CS
 	movb CS(%esp), %al
-	testl $(VM_MASK | 3), %eax
-	jz resume_kernel		# returning to kernel or vm86-space
+	testl $(VM_MASK | 2), %eax
+	jz resume_kernel
 ENTRY(resume_userspace)
  	cli				# make sure we don't miss an interrupt
 					# setting need_resched or sigpending
@@ -208,10 +213,15 @@
 	jne work_pending
 	jmp restore_all
 
+resume_kernel_nopreempt:
+	RESTORE_REGS
+	addl $4, %esp
+	iret
+
 #ifdef CONFIG_PREEMPT
 ENTRY(resume_kernel)
 	cmpl $0,TI_preempt_count(%ebp)	# non-zero preempt_count ?
-	jnz restore_all
+	jnz resume_kernel_nopreempt
 need_resched:
 	movl TI_flags(%ebp), %ecx	# need_resched set ?
 	testb $_TIF_NEED_RESCHED, %cl
@@ -294,7 +304,77 @@
 	testw $_TIF_ALLWORK_MASK, %cx	# current->work
 	jne syscall_exit_work
 restore_all:
-	RESTORE_ALL
+	/* If returning to Ring-3, not to V86, and with
+	 * the small stack, try to fix the higher word of
+	 * ESP, as the CPU won't restore it from stack.
+	 * This is an "official" bug of all the x86-compatible
+	 * CPUs, which we can try to work around to make
+	 * dosemu happy. */
+	movl EFLAGS(%esp), %eax
+	movb CS(%esp), %al
+	andl $(VM_MASK | 2), %eax
+	/* returning to user-space and not to v86? */
+	cmpl $2, %eax
+	jne 3f
+	larl OLDSS(%esp), %eax
+	/* returning to "small" stack? */
+	testl $0x00400000, %eax
+3:	RESTORE_REGS
+	jz 8f		# go fixing ESP instead of just to return :(
+	addl $4, %esp
+1:	iret
+.section .fixup,"ax"
+2:	sti
+	movl $(__USER_DS), %edx
+	movl %edx, %ds
+	movl %edx, %es
+	pushl $11
+	call do_exit
+.previous
+.section __ex_table,"a"
+	.align 4
+	.long 1b,2b
+.previous
+	/* preparing the ESPfix here */
+#define NMI_STACK_RESERVE 0x800
+	/* reserve some space on stack for the NMI handler */
+8:	subl $(NMI_STACK_RESERVE-4), %esp
+	.rept 5
+	pushl NMI_STACK_RESERVE+16(%esp)
+	.endr
+	subl $24, %esp
+	pushl %eax
+	leal 24(%esp), %eax
+	pushl %ebp
+	preempt_stop
+	GET_THREAD_INFO(%ebp)
+	movl TI_cpu(%ebp), %ebp
+	shll $GDT_SIZE_SHIFT, %ebp
+	/* find GDT of the proper CPU */
+	addl $cpu_gdt_table, %ebp
+	/* patch the base of the ring-1 16bit stack */
+	movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp)
+	shrl $16, %eax
+	movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp)
+	movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp)
+	popl %ebp
+	/* push the ESP value to preload on ring-1 */
+	movl 28+12(%esp), %eax
+	movw $4, %ax
+	movl %eax, 4+20(%esp)
+	/* push the iret frame for our ring-1 trampoline */
+	movl $__ESPFIX_SS, 4+16(%esp)
+	movl $0, 4+12(%esp)
+	/* push ring-3 flags only to get the IOPL right */
+	movl 28+8(%esp), %eax
+	andl $IOPL_MASK, %eax
+	/* we need at least IOPL1 to re-enable interrupts */
+	orl $IOPL1_MASK, %eax
+	movl %eax, 4+8(%esp)
+	popl %eax
+	movl $__ESPFIX_CS, 4(%esp)
+	movl $espfix_trampoline, (%esp)
+	iret
 
 	# perform work that needs to be done immediately before resumption
 	ALIGN
@@ -504,7 +584,12 @@
 ENTRY(nmi)
 	cmpl $sysenter_entry,(%esp)
 	je nmi_stack_fixup
-	pushl %eax
+	cmpl $espfix_past_esp,(%esp)
+	jne 1f
+	/* restart the "popl %esp" */
+	decl (%esp)
+	subw $4, 12(%esp)
+1:	pushl %eax
 	movl %esp,%eax
 	/* Do not access memory above the end of our stack page,
 	 * it might not exist.
@@ -523,7 +608,7 @@
 	pushl %edx
 	call do_nmi
 	addl $8, %esp
-	RESTORE_ALL
+	jmp restore_all
 
 nmi_stack_fixup:
 	FIX_STACK(12,nmi_stack_correct, 1)
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S
--- linux-2.6.8-pcsp/arch/i386/kernel/head.S	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/head.S	2004-09-24 13:57:24.000000000 +0400
@@ -514,8 +514,8 @@
 	.quad 0x00009a0000000000	/* 0xc0 APM CS 16 code (16 bit) */
 	.quad 0x0040920000000000	/* 0xc8 APM DS    data */
 
-	.quad 0x0000000000000000	/* 0xd0 - unused */
-	.quad 0x0000000000000000	/* 0xd8 - unused */
+	.quad 0x00cfba000000ffff	/* 0xd0 - ESPfix 32-bit CS */
+	.quad 0x0000b20000000017	/* 0xd8 - ESPfix 16-bit SS */
 	.quad 0x0000000000000000	/* 0xe0 - unused */
 	.quad 0x0000000000000000	/* 0xe8 - unused */
 	.quad 0x0000000000000000	/* 0xf0 - unused */
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/traps.c linux-2.6.8-stacks/arch/i386/kernel/traps.c
--- linux-2.6.8-pcsp/arch/i386/kernel/traps.c	2004-08-10 11:02:36.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/kernel/traps.c	2004-09-23 17:54:52.000000000 +0400
@@ -210,7 +210,7 @@
 
 	esp = (unsigned long) (&regs->esp);
 	ss = __KERNEL_DS;
-	if (regs->xcs & 3) {
+	if (regs->xcs & 2) {
 		in_kernel = 0;
 		esp = regs->esp;
 		ss = regs->xss & 0xffff;
@@ -264,7 +264,7 @@
 	char c;
 	unsigned long eip;
 
-	if (regs->xcs & 3)
+	if (regs->xcs & 2)
 		goto no_bug;		/* Not in kernel */
 
 	eip = regs->eip;
@@ -335,7 +335,7 @@
 
 static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err)
 {
-	if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs))
+	if (!(regs->eflags & VM_MASK) && !(2 & regs->xcs))
 		die(str, regs, err);
 }
 
@@ -357,7 +357,7 @@
 		goto trap_signal;
 	}
 
-	if (!(regs->xcs & 3))
+	if (!(regs->xcs & 2))
 		goto kernel_trap;
 
 	trap_signal: {
@@ -437,7 +437,7 @@
 	if (regs->eflags & VM_MASK)
 		goto gp_in_vm86;
 
-	if (!(regs->xcs & 3))
+	if (!(regs->xcs & 2))
 		goto gp_in_kernel;
 
 	current->thread.error_code = error_code;
@@ -614,7 +614,7 @@
 		 * allowing programs to debug themselves without the ptrace()
 		 * interface.
 		 */
-		if ((regs->xcs & 3) == 0)
+		if ((regs->xcs & 2) == 0)
 			goto clear_TF_reenable;
 		if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE)
 			goto clear_TF;
@@ -630,7 +630,7 @@
 	/* If this is a kernel mode trap, save the user PC on entry to 
 	 * the kernel, that's what the debugger can make sense of.
 	 */
-	info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : 
+	info.si_addr = ((regs->xcs & 3) != 3) ? (void *)tsk->thread.eip : 
 	                                        (void *)regs->eip;
 	force_sig_info(SIGTRAP, &info, tsk);
 
diff -urN linux-2.6.8-pcsp/arch/i386/mm/extable.c linux-2.6.8-stacks/arch/i386/mm/extable.c
--- linux-2.6.8-pcsp/arch/i386/mm/extable.c	2004-06-09 15:44:19.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/mm/extable.c	2004-09-23 17:54:52.000000000 +0400
@@ -26,6 +26,11 @@
 	}
 #endif
 
+	if (unlikely(regs->xcs == __ESPFIX_CS))
+	{
+		regs->xcs = __KERNEL_CS;
+	}
+
 	fixup = search_exception_tables(regs->eip);
 	if (fixup) {
 		regs->eip = fixup->fixup;
diff -urN linux-2.6.8-pcsp/arch/i386/mm/fault.c linux-2.6.8-stacks/arch/i386/mm/fault.c
--- linux-2.6.8-pcsp/arch/i386/mm/fault.c	2004-08-10 11:02:37.000000000 +0400
+++ linux-2.6.8-stacks/arch/i386/mm/fault.c	2004-09-23 17:54:52.000000000 +0400
@@ -77,7 +77,7 @@
 	u32 seg_ar, seg_limit, base, *desc;
 
 	/* The standard kernel/user address space limit. */
-	*eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg;
+	*eip_limit = (seg & 2) ? USER_DS.seg : KERNEL_DS.seg;
 
 	/* Unlikely, but must come before segment checks. */
 	if (unlikely((regs->eflags & VM_MASK) != 0))
diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h
--- linux-2.6.8-pcsp/include/asm-i386/segment.h	2004-01-09 09:59:19.000000000 +0300
+++ linux-2.6.8-stacks/include/asm-i386/segment.h	2004-09-23 17:54:52.000000000 +0400
@@ -38,8 +38,8 @@
  *  24 - APM BIOS support
  *  25 - APM BIOS support 
  *
- *  26 - unused
- *  27 - unused
+ *  26 - ESPfix Ring-1 trampoline CS
+ *  27 - ESPfix Ring-1 small SS
  *  28 - unused
  *  29 - unused
  *  30 - unused
@@ -71,14 +71,21 @@
 #define GDT_ENTRY_PNPBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 6)
 #define GDT_ENTRY_APMBIOS_BASE		(GDT_ENTRY_KERNEL_BASE + 11)
 
+#define GDT_ENTRY_ESPFIX_CS		(GDT_ENTRY_KERNEL_BASE + 14)
+#define GDT_ENTRY_ESPFIX_SS		(GDT_ENTRY_KERNEL_BASE + 15)
+#define __ESPFIX_CS (GDT_ENTRY_ESPFIX_CS * 8 + 1)
+#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8 + 1)
+
 #define GDT_ENTRY_DOUBLEFAULT_TSS	31
 
 /*
  * The GDT has 32 entries
  */
-#define GDT_ENTRIES 32
-
-#define GDT_SIZE (GDT_ENTRIES * 8)
+#define GDT_ENTRIES_SHIFT 5
+#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT)
+#define GDT_ENTRY_SHIFT 3
+#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT)
+#define GDT_SIZE (1 << GDT_SIZE_SHIFT)
 
 /* Simple and small GDT entries for booting only */
 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-24 20:36                 ` Stas Sergeev
@ 2004-09-24 21:43                   ` Petr Vandrovec
  2004-09-25  8:04                     ` Gabriel Paubert
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-24 21:43 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Sat, Sep 25, 2004 at 12:36:15AM +0400, Stas Sergeev wrote:
> Hi,
> 
> Petr Vandrovec wrote:
> >In that new patch I set the const to 0xe00, which
> >is 3,5K. Is it still the limitation? I can probably
> >For 4KB stacks 2KB looks better
> OK, done that. Wondering though, for what?
> I don't need 2K myself, I need 24 bytes only.
> So what prevents me to raise the gap to 3.5K
> or somesuch? Why 2K looks better?

You run with ESP decreased by 2KB for some time during
CPL1 stack setup.  As you run in this part at CPL0
with same setup as on CPL1, I think that you should
offer same stack for setup code, and for CPL1 code,
and so each should get 2KB.

> >>+8:	subl $(NMI_STACK_RESERVE-4), %esp; \
> >>+	.rept 5; \
> >Any reason why you left this part of RESTORE_ALL macro?
> The reason is that the previous part of the macro
> can jump to that part. So how can I divide those?

Use real labels instead of numeric local labels.  That way
macro does jne cpl1exit, and you create cpl1exit: label outside
of macro, somewhere in entry.S...

> >be moved out, IMHO.  RESTORE_ALL is used twice in entry.S, so you
> >could save one copy.
> Do you mean the NMI return path doesn't need
> the ESP fix at all? Why?

No.  I was attempting to say that RESTORE_ALL is on two
places to save jumps (I cannot think about any other reason),
and so cpl1exit could be shared:

#define RESTORE_ALL \
	movl EFLAGS(%esp),%eax	\
	....			\
	jne 3f			\
	lar OLDSS(%esp),%eax	\
	testl $0x00400000,%eax	\
	jz cpl1exit		\
3:	RESTORE_REGS		\
	add $4,%esp		\
1:	iret			\
.section __ex_table,"a";\
	.align 4;	\
	long 1b,badss;	\
.previous

badss:	sti;		
	movl $(__USER_DS), %edx; 
	movl %edx, %ds; 
	movl %edx, %es;
	pushl $11;
	call do_exit;

cpl1exit:
	RESTORE_REGS
	subl $...,%esp
	...

But your solution with killing RESTORE_ALL completely is also
fine - you just could reorder jumps a bit, so return to kernel
or VM86 is one (not taken) jump shorter than it is with current
version of patch.

> >Though I'm not sure why NMI handler simple
> >does not jump to RESTORE_ALL we already have.
> I can only change that and then un-macro the
> RESTORE_ALL completely. So I did that.
> Additionally I introduced the "short path"
> for the case where we know for sure that we
> are returning to the kernel. And I am not
> setting the exception handler there because
> returning to the kernel if fails, should die()
> anyway. Is this correct?

If you'll first test result of lar, and then
you'll do two independent RESTORE_REGS (second
one actually does not have to restore %ebp
and %eax and you can restore them directly by
move from stack after you are done with trampoline
setup, instead of doing push + pop), you can reuse
RESOTRE_REGS+add $4,%esp+iret return path, including
its exception handling.

> +ENTRY(espfix_trampoline)
> +	popl %esp
> +espfix_past_esp:
> +1:	iret
> +.section .fixup,"ax"
> +2:	sti
> +	movl $(__USER_DS), %edx
> +	movl %edx, %ds
> +	movl %edx, %es
> +	pushl $11
> +	call do_exit
> +.previous

You can reuse fixup code from other iret instance,
just give it real label and reference it from __ex_table
instead of '2b'.

Patch looks fine, though you could speedup it a bit as outlined above.
Now you have to persuade others that this patch should include patch
into the kernel.
								Petr

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-24 21:43                   ` Petr Vandrovec
@ 2004-09-25  8:04                     ` Gabriel Paubert
  2004-09-25 12:25                       ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Gabriel Paubert @ 2004-09-25  8:04 UTC (permalink / raw)
  To: Petr Vandrovec; +Cc: Stas Sergeev, linux-kernel

On Fri, Sep 24, 2004 at 11:43:30PM +0200, Petr Vandrovec wrote:
> On Sat, Sep 25, 2004 at 12:36:15AM +0400, Stas Sergeev wrote:
> > Hi,
> > 
> > Petr Vandrovec wrote:
> > >In that new patch I set the const to 0xe00, which
> > >is 3,5K. Is it still the limitation? I can probably
> > >For 4KB stacks 2KB looks better
> > OK, done that. Wondering though, for what?
> > I don't need 2K myself, I need 24 bytes only.
> > So what prevents me to raise the gap to 3.5K
> > or somesuch? Why 2K looks better?
> 
> You run with ESP decreased by 2KB for some time during
> CPL1 stack setup.  As you run in this part at CPL0
> with same setup as on CPL1, I think that you should
> offer same stack for setup code, and for CPL1 code,
> and so each should get 2KB.

Maybe I miss something, but it seems that lret (or retl)
is not affected by this bug. What prevents you from reordering
the stack (doing the inverse operation of what the lcall7 entry
point does) and then doing:

	popfl
	lret

Yes, I see issues with debugging (trap flag mostly) but I believe
that they are solvable.

	Regards,
	Gabriel


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-25  8:04                     ` Gabriel Paubert
@ 2004-09-25 12:25                       ` Stas Sergeev
  2004-09-25 19:18                         ` Gabriel Paubert
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-25 12:25 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: Petr Vandrovec, linux-kernel

Hello.

Gabriel Paubert wrote:
> Maybe I miss something, but it seems that lret (or retl)
> is not affected by this bug.
Petr Vandrovec says (he forgot to CC that
to LKML I think):
---
Looking at VMware's code it seems that RETF suffers from
this bug too...
---

I tested that - he is right, and Intel docs
make no sense as to not mentioning this.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-25 12:25                       ` Stas Sergeev
@ 2004-09-25 19:18                         ` Gabriel Paubert
  2004-09-25 20:40                           ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Gabriel Paubert @ 2004-09-25 19:18 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Petr Vandrovec, linux-kernel

On Sat, Sep 25, 2004 at 04:25:43PM +0400, Stas Sergeev wrote:
> Hello.
> 
> Gabriel Paubert wrote:
> >Maybe I miss something, but it seems that lret (or retl)
> >is not affected by this bug.
> Petr Vandrovec says (he forgot to CC that
> to LKML I think):

At least I did not see it.

> ---
> Looking at VMware's code it seems that RETF suffers from
> this bug too...
> ---
> 
> I tested that - he is right, and Intel docs
> make no sense as to not mentioning this.

I suspected that they behaved differently because the
pseudocode in iret's description is quite different 
(for iret, it even does not mention restoring ESP!).
But if you expect Intel's doc, or that of any manufacturer
for the matter, to tell the whole truth, you're naïve.

Is ESP really properly restored for V86 bmode or is it that 
it does not hit the case of a default 32 bit code segment with 
a 16 bit stack?

I'm absolutely amazed by the fact that this bug has been there
since the beginning and only seems to hit users right now.

I don't like adding an intermediate privilege level, but
it looks hard to do through the vdso, and you always have to 
push something on the final stack. A hardware task switch 
might work, but it has problems with races on the segment
descriptors. 

Anyway, I've just read again Intel's doc about mixing 16 and 32 bit 
code and I have found the understament of the day:

"For most efficient and trouble-free operation of the processor, 32-bit
                        ^^^^^^^^^^^^
programs or tasks should have the D flag in the code-segment descriptor
and the B flag in the stack-segment descriptor set, and 16-bit programs
or tasks should have these flags clear. Program control transfers from
16-bit segments to 32-bit segments (and vice versa) are handled most
efficiently through call, interrupt, or trap gates."

	Regards,
	Gabriel

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-25 19:18                         ` Gabriel Paubert
@ 2004-09-25 20:40                           ` Stas Sergeev
  2004-09-25 23:42                             ` Gabriel Paubert
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-25 20:40 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linux-kernel

Hi.

Gabriel Paubert wrote:
> Is ESP really properly restored for V86 bmode or is it that 
> it does not hit the case of a default 32 bit code segment with 
> a 16 bit stack?
It does not restore it in any case for the 16bit
stack (no matter whether the code is 16 or 32 in PM).
Petr says V86 is not affected, but I have not tested
that case because why to care? The problem is only for
the 32bit code. For 16bit code (PM or V86) it just
doesn't really matter I think (I don't think using
prefixes for ESP is sane).

> I'm absolutely amazed by the fact that this bug has been there
> since the beginning and only seems to hit users right now.
No, right now it just hits me:)
It used to hit the dosemu users always, but people just
blamed dosemu itself and that was it. Noone wanted
to spend weeks traceing the DOS programs under dosemu,
then traceing dosemu itself, then traceing kernel,
then looking through the docs to track the problem down
to something known, then writing to Intel's techsup for
clarifications, and then writing to LKML:) And if not
for the great help I got here, this will end up nowhere
again. So that's how it used to stay "unnoticed" for
years.
As for the other instances of that problem, here are some:

http://www.tenberry.com/dos4g/watcom/rn4gw.html
---
B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
     which mangles the high word of ESP.
---

http://clio.rice.edu/djgpp/r5bug04.txt
---
On VCPI servers which mode switch using ESP upper bits, CWSDPMI does not
clear those bits when swapping to it's own stack.
---

Searching google reveals quite a few implicit
instances of that problem.

> Anyway, I've just read again Intel's doc about mixing 16 and 32 bit 
> code and I have found the understament of the day:
> "For most efficient and trouble-free operation of the processor,
>                         ^^^^^^^^^^^^
What is to expect? They knew there are troubles,
yet decided to make it a part of the specs...


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-25 20:40                           ` Stas Sergeev
@ 2004-09-25 23:42                             ` Gabriel Paubert
  2004-09-26 18:04                               ` Stas Sergeev
  2004-09-30 15:11                               ` Bill Davidsen
  0 siblings, 2 replies; 38+ messages in thread
From: Gabriel Paubert @ 2004-09-25 23:42 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

	Hi,

> It does not restore it in any case for the 16bit
> stack (no matter whether the code is 16 or 32 in PM).
> Petr says V86 is not affected, but I have not tested
> that case because why to care? The problem is only for
> the 32bit code. For 16bit code (PM or V86) it just
> doesn't really matter I think (I don't think using
> prefixes for ESP is sane).

Well, thrashing a register at any time under the user
just because an interrupt happened is even less sane ;-)

> 
> >I'm absolutely amazed by the fact that this bug has been there
> >since the beginning and only seems to hit users right now.
> No, right now it just hits me:)
> It used to hit the dosemu users always, but people just
> blamed dosemu itself and that was it. Noone wanted
> to spend weeks traceing the DOS programs under dosemu,
> then traceing dosemu itself, then traceing kernel,
> then looking through the docs to track the problem down
> to something known, then writing to Intel's techsup for
> clarifications, and then writing to LKML:) And if not
> for the great help I got here, this will end up nowhere
> again. So that's how it used to stay "unnoticed" for
> years.

I see. And finally we know that the problem with the
processor and that Intel changed the spec to conform
to what the hardware does but is rather coy about it.

> As for the other instances of that problem, here are some:
> 
> http://www.tenberry.com/dos4g/watcom/rn4gw.html
> ---
> B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
>     which mangles the high word of ESP.

But if the bug is also affects Windows DOS box, it means that
V86 is affected too, no? 

I'd like to know what OS/2 did. The DOS boxes and 16 bit mode
DPMI applications ran very well and it was very stable, despite
the fact that the kernel spent its time switching between 16
and 32 bit modes (for example, almost all device drivers were 
16 bit code, but the drivers for DOS emulation were 32 bit).
But I removed it from all my machines several years ago.


> http://clio.rice.edu/djgpp/r5bug04.txt
> ---
> On VCPI servers which mode switch using ESP upper bits, CWSDPMI does not
> clear those bits when swapping to it's own stack.
> ---
> 
> Searching google reveals quite a few implicit
> instances of that problem.
> 
> >Anyway, I've just read again Intel's doc about mixing 16 and 32 bit 
> >code and I have found the understament of the day:
> >"For most efficient and trouble-free operation of the processor,
> >                        ^^^^^^^^^^^^
> What is to expect? They knew there are troubles,
> yet decided to make it a part of the specs...

Of which specs? It is never clearly stated in the whole
4 volume doc of the architecture. Only in an obscure
specification change...

	Gabriel


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-25 23:42                             ` Gabriel Paubert
@ 2004-09-26 18:04                               ` Stas Sergeev
  2004-09-27  9:07                                 ` Gabriel Paubert
  2004-09-30 15:11                               ` Bill Davidsen
  1 sibling, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-26 18:04 UTC (permalink / raw)
  To: Gabriel Paubert; +Cc: linux-kernel

Hi.

Gabriel Paubert wrote:
>> the 32bit code. For 16bit code (PM or V86) it just
>> doesn't really matter I think (I don't think using
>> prefixes for ESP is sane).
> Well, thrashing a register at any time under the user
> just because an interrupt happened is even less sane ;-)
OK, I agree, so I tested that: wrote some value
to the higher word of ESP in the struct that
dosemu passes to vm86() syscall, let the prog
to run for a while, and the value was still there.
It is not 100% reliable test because I can't
guarantee the vm86() was interrupted during the
period I was waiting (because vm86() is executed
for the short durations, then exits on fault or
signal), but it looks OK and why not to beleive
Petr that v86 is not affected.

>> http://www.tenberry.com/dos4g/watcom/rn4gw.html
>> ---
>> B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
>>     which mangles the high word of ESP.
> But if the bug is also affects Windows DOS box, it means that
> V86 is affected too, no? 
No. This URL is about dos4gw, so that's about a
prot. mode either.

> I'd like to know what OS/2 did.
AFAIK also WinNT is not affected. Not sure though.

> The DOS boxes and 16 bit mode
> DPMI applications ran very well and it was very stable, despite
Hey. It is not about 16bit DPMI mode, it is
about the 32bit DPMI mode. That's the whole
problem. Be it only about the 16bit DPMI mode,
the problem may not ever harm anything at all.
But for 32bit mode that's quite a problem.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-26 18:04                               ` Stas Sergeev
@ 2004-09-27  9:07                                 ` Gabriel Paubert
  0 siblings, 0 replies; 38+ messages in thread
From: Gabriel Paubert @ 2004-09-27  9:07 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Sun, Sep 26, 2004 at 10:04:55PM +0400, Stas Sergeev wrote:

	Hi,

On Sun, Sep 26, 2004 at 10:04:55PM +0400, Stas Sergeev wrote:
> OK, I agree, so I tested that: wrote some value
> to the higher word of ESP in the struct that
> dosemu passes to vm86() syscall, let the prog
> to run for a while, and the value was still there.
> It is not 100% reliable test because I can't
> guarantee the vm86() was interrupted during the
> period I was waiting (because vm86() is executed
> for the short durations, then exits on fault or
> signal), but it looks OK and why not to beleive
> Petr that v86 is not affected.

Should be enough, you'll likely get at least a timer 
interrupt in the middle of v86 mode sooner or later.

> 
> >>http://www.tenberry.com/dos4g/watcom/rn4gw.html
> >>---
> >>B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
> >>    which mangles the high word of ESP.
> >But if the bug is also affects Windows DOS box, it means that
> >V86 is affected too, no? 
> No. This URL is about dos4gw, so that's about a
> prot. mode either.

I see, I missed the 32 in mouse32.

> 
> >I'd like to know what OS/2 did.
> AFAIK also WinNT is not affected. Not sure though.
> 
> >The DOS boxes and 16 bit mode
> >DPMI applications ran very well and it was very stable, despite
> Hey. It is not about 16bit DPMI mode, it is
> about the 32bit DPMI mode. That's the whole
> problem. Be it only about the 16bit DPMI mode,
> the problem may not ever harm anything at all.
> But for 32bit mode that's quite a problem.

Well, I did not express myself correctly. OS/2 was probably the 
most frightening mix of 16 and 32 bit code ever produced, so 
they certainly had to work around the problem. There
were literally tons of "thunks" to interface both modes
and the LDT was set up in a way which made the translation
of addresses relatively easy:

  off32 = (seg16>>3) + off16

Therefore the base address of LDT segment n was n*64k,
limiting the addressable memory to 512MB in all versions
I used (later versions lifted this limit), but making
conversions of 16 to 32 bit addresses and back quite
efficient. However, I can't remember exactly how the 
thunks worked.

	Gabriel


	Gabriel


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-25 23:42                             ` Gabriel Paubert
  2004-09-26 18:04                               ` Stas Sergeev
@ 2004-09-30 15:11                               ` Bill Davidsen
  1 sibling, 0 replies; 38+ messages in thread
From: Bill Davidsen @ 2004-09-30 15:11 UTC (permalink / raw)
  To: linux-kernel

Gabriel Paubert wrote:
> 	Hi,
> 
> 
>>It does not restore it in any case for the 16bit
>>stack (no matter whether the code is 16 or 32 in PM).
>>Petr says V86 is not affected, but I have not tested
>>that case because why to care? The problem is only for
>>the 32bit code. For 16bit code (PM or V86) it just
>>doesn't really matter I think (I don't think using
>>prefixes for ESP is sane).
> 
> 
> Well, thrashing a register at any time under the user
> just because an interrupt happened is even less sane ;-)
> 
> 
>>>I'm absolutely amazed by the fact that this bug has been there
>>>since the beginning and only seems to hit users right now.
>>
>>No, right now it just hits me:)
>>It used to hit the dosemu users always, but people just
>>blamed dosemu itself and that was it. Noone wanted
>>to spend weeks traceing the DOS programs under dosemu,
>>then traceing dosemu itself, then traceing kernel,
>>then looking through the docs to track the problem down
>>to something known, then writing to Intel's techsup for
>>clarifications, and then writing to LKML:) And if not
>>for the great help I got here, this will end up nowhere
>>again. So that's how it used to stay "unnoticed" for
>>years.
> 
> 
> I see. And finally we know that the problem with the
> processor and that Intel changed the spec to conform
> to what the hardware does but is rather coy about it.
> 
> 
>>As for the other instances of that problem, here are some:
>>
>>http://www.tenberry.com/dos4g/watcom/rn4gw.html
>>---
>>B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug
>>    which mangles the high word of ESP.
> 
> 
> But if the bug is also affects Windows DOS box, it means that
> V86 is affected too, no? 
> 
> I'd like to know what OS/2 did. The DOS boxes and 16 bit mode
> DPMI applications ran very well and it was very stable, despite
> the fact that the kernel spent its time switching between 16
> and 32 bit modes (for example, almost all device drivers were 
> 16 bit code, but the drivers for DOS emulation were 32 bit).
> But I removed it from all my machines several years ago.

May I suggest that IBM is a friend of Linux, and that *if* there is a 
simple way to get around the problem they will probably be very willing 
to share it with us.

-- 
    -bill davidsen (davidsen@tmr.com)
"The secret to procrastination is to put things off until the
  last possible moment - but no longer"  -me

^ permalink raw reply	[flat|nested] 38+ messages in thread

* ESP corruption bug - what CPUs are affected?
@ 2004-09-16 17:49 Stas Sergeev
  2004-09-16 19:03 ` Denis Vlasenko
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-16 17:49 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1638 bytes --]

Hello.

My question is not directly linux-related,
but I think people here can give me some clue
on an issue.

There is a "semi-official" bug in Intel CPUs,
which is described here:
http://www.intel.com/design/intarch/specupdt/27287402.PDF
chapter "Specification Clarifications"
section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".

A short quote:
"ISSUE: When a 32-bit IRET is used to return to another privilege level,
and the old level uses a 4G stack (D/B bit in the segment register = 1),
while the new level uses a 64k stack (D/B bit = 0), then only the 
lower word of ESP is updated."

Which means that the higher word of ESP gets
trashed. This bug beats dosemu rather badly,
but there seem to be not much info about that
bug available on the net.
What I want to find out, is what CPUs are
affected. I wrote a small test-case. It is
attached. I tried it on Intel Pentium and
on AMD Athlon - bug is present on both. But
I'd like to know if it is also present on a
Transmeta Crusoe, Cyrix and other CPUs.
Can someone please run the test-case on some
of that CPUs and see how it goes?
Note: specifying "32" as an arg to the
test-case, will make it to use the 32bit
segment for the stack, and it will not detect
the bug. It is there to prove that it detects
the Right Thing. Run it without that argument.

I'd also like to know any thoughts on whether
it is possible to work around the bug, probably
in a kernel? Well, I am not hoping on such a
possibility, but who knows...
Anyway, I'd be glad to get any info on that bug.
Why it was not fixed for so many years, looks
also like an interesting question, as for me.


[-- Attachment #2: stk.c --]
[-- Type: text/x-csrc, Size: 2636 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <linux/unistd.h>
#include <asm/ldt.h>
#include <asm/ucontext.h>

unsigned char stack[65536];

void my_segv(int signum, siginfo_t *info, void *c)
{
  struct sigcontext *ctx = &((struct ucontext *)c)->uc_mcontext;
  printf("In sighandler: esp=%lx\n", ctx->esp);
  /* Skip HLT */
  ctx->eip++;
}

_syscall3(int, modify_ldt, int, func, void *, ptr, unsigned long, bytecount);

static int set_ldt_entry(int entry, unsigned long base, unsigned int limit,
	      int seg_32bit_flag, int contents, int read_only_flag,
	      int limit_in_pages_flag, int seg_not_present, int useable)
{
  struct modify_ldt_ldt_s ldt_info;
  ldt_info.entry_number = entry;
  ldt_info.base_addr = base;
  ldt_info.limit = limit;
  ldt_info.seg_32bit = seg_32bit_flag;
  ldt_info.contents = contents;
  ldt_info.read_exec_only = read_only_flag;
  ldt_info.limit_in_pages = limit_in_pages_flag;
  ldt_info.seg_not_present = seg_not_present;
  ldt_info.useable = useable;

  return modify_ldt(1, &ldt_info, sizeof(ldt_info));
}

int main(int argc, char *argv[])
{
  unsigned short _ss, new_ss;
  unsigned long _esp, new_esp;
  int is32 = 0;
  stack_t sig_stack;
  struct sigaction sa;

  if (argc > 1 && !strncmp(argv[1], "32", 3))
    is32 = 1;

  sig_stack.ss_sp = stack;
  sig_stack.ss_flags = 0;
  sig_stack.ss_size = sizeof(stack);
  if (sigaltstack(&sig_stack, NULL)) {
    perror("sigaltstack()");
    return 1;
  }

  sa.sa_sigaction = my_segv;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  /* Get SS, ESP */
  asm volatile(
    "movw %%ss, %0\n"
    "movl %%esp, %1\n"
    :"=m"(_ss), "=m"(_esp)
  );
  printf("sp=%#lx, ss=%#hx\n", _esp, _ss);

  /* Force to LDT */
  new_ss = _ss | 4;
  /* Create the LDT entry */
  set_ldt_entry(new_ss >> 3, 0, 0xbffff, is32, MODIFY_LDT_CONTENTS_DATA, 0, 1, 0, 0);

  /* Do the trick... Switch stack and then switch context. */
  asm volatile(
    "movw %3, %%ss\n"		/* Load our LDT selector to SS */
    "hlt\n"			/* Force the context switch */
    "movw %1, %%ss\n"		/* Restore SS ASAP */
    "movl %%esp, %0\n"		/* Get new ESP */
    "movl %2, %%esp\n"		/* Restore ESP */
    :"=m"(new_esp)
    : "m"(_ss), "m"(_esp), "m"(new_ss)
  );
  printf("Now sp=%#lx, ss=%#hx\n", new_esp, new_ss);

  /* See what we've got... */
  if (new_esp > 0xc0000000) {
    printf("BUG!\n");
  } else if (new_esp != _esp) {
    printf("Esp changed, strange...\n");
  } else {
    printf("No bug here! What CPU is this?\n");
    if (!is32)
      system("cat /proc/cpuinfo");
  }
  return 0;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-16 17:49 Stas Sergeev
@ 2004-09-16 19:03 ` Denis Vlasenko
  2004-09-17 18:13   ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-16 19:03 UTC (permalink / raw)
  To: Stas Sergeev, linux-kernel

On Thursday 16 September 2004 20:49, Stas Sergeev wrote:
> Hello.
>
> My question is not directly linux-related,
> but I think people here can give me some clue
> on an issue.
>
> There is a "semi-official" bug in Intel CPUs,
> which is described here:
> http://www.intel.com/design/intarch/specupdt/27287402.PDF
> chapter "Specification Clarifications"
> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
>
> A short quote:
> "ISSUE: When a 32-bit IRET is used to return to another privilege level,
> and the old level uses a 4G stack (D/B bit in the segment register = 1),
> while the new level uses a 64k stack (D/B bit = 0), then only the
> lower word of ESP is updated."
>
> Which means that the higher word of ESP gets
> trashed. This bug beats dosemu rather badly,
> but there seem to be not much info about that
> bug available on the net.

Well. Strictly speaking it's not a bug.
Code executes as it should. IRET returns to the correct
location. Upper 16 bits of ESP do not affect execution
in this case.

You want CPU to preserve upper bits,
but this will waste a handful of transistors
practically for no gain.

I think dosemu should just work around this.
Is there some subtle problem with that in dosemu?

> What I want to find out, is what CPUs are
> affected. I wrote a small test-case. It is
> attached. I tried it on Intel Pentium and
> on AMD Athlon - bug is present on both. But
> I'd like to know if it is also present on a
> Transmeta Crusoe, Cyrix and other CPUs.
> Can someone please run the test-case on some
> of that CPUs and see how it goes?
> Note: specifying "32" as an arg to the
> test-case, will make it to use the 32bit
> segment for the stack, and it will not detect
> the bug. It is there to prove that it detects
> the Right Thing. Run it without that argument.
>
> I'd also like to know any thoughts on whether
> it is possible to work around the bug, probably
> in a kernel? Well, I am not hoping on such a
> possibility, but who knows...
> Anyway, I'd be glad to get any info on that bug.
> Why it was not fixed for so many years, looks
> also like an interesting question, as for me.

Because it does not matter for 99.9999% of users...

# gcc -O2 stk.c
# ./a.out
sp=0xbffffb30, ss=0x7b
In sighandler: esp=bffffb10
Now sp=0xccf1fb10, ss=0x7f
Esp changed, strange...
# cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : Unknown CPU Typ
stepping        : 1
cpu MHz         : 1743.632
cache size      : 64 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow
--
vda


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-16 19:03 ` Denis Vlasenko
@ 2004-09-17 18:13   ` Stas Sergeev
  2004-09-17 22:04     ` Denis Vlasenko
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-17 18:13 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

Hello.

Denis Vlasenko wrote:
>> Which means that the higher word of ESP gets
>> trashed. This bug beats dosemu rather badly,
>> but there seem to be not much info about that
>> bug available on the net.
> Well. Strictly speaking it's not a bug.
> Code executes as it should. IRET returns to the correct
> location. Upper 16 bits of ESP do not affect execution
> in this case.
Unfortunately this is not true. Mainly because
the code is still 32bit, so when is operates with
the stack pointer directly, it gets ESP, not SP.
Here are a few examples from the real DOS progs:
---
frstor  [esp]  <--crash
---
push    ebp
mov     ebp,esp
mov     edx,[ebp+0x8]  <--crash
---
and there are much more. "mov ebp,esp" is just a
standard thing, and then the ebp is being used to
access the locals and function arguments. Why do
they use the 16bit stack in that case, is another
question. Unfortunately some of them do.

> You want CPU to preserve upper bits,
> but this will waste a handful of transistors
> practically for no gain.
I don't think so. After all, if only the B bit is
set, this is being done. So I guess all the transistors
are in place (well, of course this is just a wild
guess).

> I think dosemu should just work around this.
I see no way to work around this in user-space.
Esp. since dosemu doesn't control the execution
completely, it acts as a monitor, only getting a
control when exceptions occur. For dosemu code
there are no problems, but the DOS program that
is running, crashes.

> Is there some subtle problem with that in dosemu?
Nothing subtle I think. The problem is real and
quite hurts. Fortunately it happened that using
the dos32a extender instead of dos4gw "fixes" it
for the dos4gw-based apps. But users wants
everything out of the box, and never read docs or
even the error messages in a log, so this doesn't
really help:) And there are some not dos4gw-based
progs that are affected either.

>> Why it was not fixed for so many years, looks
>> also like an interesting question, as for me.
> Because it does not matter for 99.9999% of users...
Aparently you are underestimating the amount of
the dosemu users out there:) AFAIK it is also a
problem for Win95 DOS box sometimes.

> # gcc -O2 stk.c
> # ./a.out
> sp=0xbffffb30, ss=0x7b
> In sighandler: esp=bffffb10
> Now sp=0xccf1fb10, ss=0x7f
> Esp changed, strange...
That's strange indeed! Apparently your ESP value,
0xccf1fb10, is greater than 0xc0000000, in which
case my program should just write "BUG!", but for
you it doesn't. Haven't you altered the code
somehow? Why I compare it with 0xc0000000, is
because ESP has the higher word of a kernel's stack
address, so it should be above the TASK_SIZE when
corrupted.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-17 18:13   ` Stas Sergeev
@ 2004-09-17 22:04     ` Denis Vlasenko
  2004-09-18 10:58       ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-17 22:04 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 3458 bytes --]

On Friday 17 September 2004 21:13, Stas Sergeev wrote:
> Hello.
>
> Denis Vlasenko wrote:
> >> Which means that the higher word of ESP gets
> >> trashed. This bug beats dosemu rather badly,
> >> but there seem to be not much info about that
> >> bug available on the net.
> >
> > Well. Strictly speaking it's not a bug.
> > Code executes as it should. IRET returns to the correct
> > location. Upper 16 bits of ESP do not affect execution
> > in this case.
>
> Unfortunately this is not true. Mainly because
> the code is still 32bit, so when is operates with
> the stack pointer directly, it gets ESP, not SP.
> Here are a few examples from the real DOS progs:
> ---
> frstor  [esp]  <--crash
> ---
> push    ebp
> mov     ebp,esp
> mov     edx,[ebp+0x8]  <--crash
> ---

Hmmm. Let me dust off my i386 knwoledge.

This is what happens: CPU is doing IRET in 32bit mode.
Stack layout:

+20 (empty) <--- SS:ESP
+16 old_CS
+12 old_EIP
 +8 old_EFLAGS
 +4 old_SS
 +0 old_ESP
 
If old_SS references 'small' data segment (16-bit one),
processor does not restore ESP from old_ESP.
It restores lower 16 bits only. Upper 16 bits are filled
with garbage (or just not modified at all?).
This works fine because processor uses SP, not ESP,
for subsequent push/pop/call/ret operations.
But if code uses full ESP, thinking that upper 16 bits are zero,
it will crash badly. Correct me if I'm wrong.

> and there are much more. "mov ebp,esp" is just a
> standard thing, and then the ebp is being used to
> access the locals and function arguments. Why do
> they use the 16bit stack in that case, is another
> question. Unfortunately some of them do.

Hmm. I think you need to *emulate* this IRET.
Is this IRET happens in kernel or in dosemu code?

> > # gcc -O2 stk.c
> > # ./a.out
> > sp=0xbffffb30, ss=0x7b
> > In sighandler: esp=bffffb10
> > Now sp=0xccf1fb10, ss=0x7f
> > Esp changed, strange...
>
> That's strange indeed! Apparently your ESP value,
> 0xccf1fb10, is greater than 0xc0000000, in which
> case my program should just write "BUG!", but for
> you it doesn't. Haven't you altered the code
> somehow?

No, I didn't alter anything. Strange indeed.

> Why I compare it with 0xc0000000, is
> because ESP has the higher word of a kernel's stack
> address, so it should be above the TASK_SIZE when
> corrupted.

Even more strange! Now I did modify the code (attached),
and this is what I've got:

# gcc -o stk2 -O2 stk2.c
# ./stk2
sp=0xbffffb30, ss=0x7b
In sighandler: esp=bffffb10
Now sp=cd79fb10   <=========
Now sp=bffffb50   <========= !?
Now sp=bffffb50, ss=007f
Now sp=bffffb50
Now sp=bffffb50
Now sp=bffffb50
Esp changed, strange...

I am thoroughly confused now. Two back-to-back
'printf("Now sp=%08lx\n", new_esp);'
gave different result.  How this can happen??

Corresponding gcc -S code. new_esp is at -180(%ebp):

#APP
        movw -182(%ebp), %ss
hlt
movw -170(%ebp), %ss
movl %esp, -180(%ebp)
movl -176(%ebp), %esp

#NO_APP
        addl    $40, %esp
        pushl   -180(%ebp)
        pushl   $.LC4
        call    printf
        popl    %eax
        popl    %edx
        pushl   -180(%ebp)
        pushl   $.LC4
        call    printf
        addl    $12, %esp
        movzwl  -182(%ebp), %eax
        pushl   %eax
        pushl   -180(%ebp)
        pushl   $.LC5
        call    printf
        popl    %ecx
        popl    %ebx
        pushl   -180(%ebp)
        pushl   $.LC4
        call    printf
        popl    %eax
        popl    %edx

--
vda

[-- Attachment #2: stk2.c --]
[-- Type: text/x-csrc, Size: 2903 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <linux/unistd.h>
#include <asm/ldt.h>
#include <asm/ucontext.h>

unsigned char stack[65536];

void my_segv(int signum, siginfo_t *info, void *c)
{
  struct sigcontext *ctx = &((struct ucontext *)c)->uc_mcontext;
  printf("In sighandler: esp=%lx\n", ctx->esp);
  /* Skip HLT */
  ctx->eip++;
}

_syscall3(int, modify_ldt, int, func, void *, ptr, unsigned long, bytecount);

static int set_ldt_entry(int entry, unsigned long base, unsigned int limit,
	      int seg_32bit_flag, int contents, int read_only_flag,
	      int limit_in_pages_flag, int seg_not_present, int useable)
{
  struct modify_ldt_ldt_s ldt_info;
  ldt_info.entry_number = entry;
  ldt_info.base_addr = base;
  ldt_info.limit = limit;
  ldt_info.seg_32bit = seg_32bit_flag;
  ldt_info.contents = contents;
  ldt_info.read_exec_only = read_only_flag;
  ldt_info.limit_in_pages = limit_in_pages_flag;
  ldt_info.seg_not_present = seg_not_present;
  ldt_info.useable = useable;

  return modify_ldt(1, &ldt_info, sizeof(ldt_info));
}

int main(int argc, char *argv[])
{
  unsigned short _ss, new_ss;
  unsigned long _esp, new_esp;
  int is32 = 0;
  stack_t sig_stack;
  struct sigaction sa;

  if (argc > 1 && !strncmp(argv[1], "32", 3))
    is32 = 1;

  sig_stack.ss_sp = stack;
  sig_stack.ss_flags = 0;
  sig_stack.ss_size = sizeof(stack);
  if (sigaltstack(&sig_stack, NULL)) {
    perror("sigaltstack()");
    return 1;
  }

  sa.sa_sigaction = my_segv;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  /* Get SS, ESP */
  asm volatile(
    "movw %%ss, %0\n"
    "movl %%esp, %1\n"
    :"=m"(_ss), "=m"(_esp)
  );
  printf("sp=%#lx, ss=%#hx\n", _esp, _ss);

  /* Force to LDT */
  new_ss = _ss | 4;
  /* Create the LDT entry */
  set_ldt_entry(new_ss >> 3, 0, 0xbffff, is32, MODIFY_LDT_CONTENTS_DATA, 0, 1, 0, 0);

  /* Do the trick... Switch stack and then switch context. */
  asm volatile(
    "movw %3, %%ss\n"		/* Load our LDT selector to SS */
    "hlt\n"			/* Force the context switch */
    "movw %1, %%ss\n"		/* Restore SS ASAP */
    "movl %%esp, %0\n"		/* Get new ESP */
    "movl %2, %%esp\n"		/* Restore ESP */
    :"=m"(new_esp)
    : "m"(_ss), "m"(_esp), "m"(new_ss)
  );
  printf("Now sp=%08lx\n", new_esp);
  printf("Now sp=%08lx\n", new_esp);
  printf("Now sp=%08lx, ss=%04hx\n", new_esp, new_ss);
  printf("Now sp=%08lx\n", new_esp);
  printf("Now sp=%08lx\n", new_esp);

  /* See what we've got... */
  if (new_esp > 0xc0000000) {
    printf("Now sp=%08lx\n", new_esp);
    printf("BUG!\n");
  } else if (new_esp != _esp) {
    printf("Now sp=%08lx\n", new_esp);
    printf("Esp changed, strange...\n");
  } else {
    printf("Now sp=%08lx\n", new_esp);
    printf("No bug here! What CPU is this?\n");
    if (!is32)
      system("cat /proc/cpuinfo");
  }
  return 0;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-17 22:04     ` Denis Vlasenko
@ 2004-09-18 10:58       ` Stas Sergeev
  2004-09-18 13:08         ` Denis Vlasenko
  2004-09-21 11:19         ` Pavel Machek
  0 siblings, 2 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-09-18 10:58 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2952 bytes --]

Hello.

Denis Vlasenko wrote:
>> This is what happens: CPU is doing IRET in 32bit mode.
>> Stack layout:
> +20 (empty) <--- SS:ESP
> +16 old_CS
> +12 old_EIP
>  +8 old_EFLAGS
>  +4 old_SS
>  +0 old_ESP
JFYI, looks like you swapped CS<-->EIP and SS<-->ESP
on your stack frame layout.
 
> If old_SS references 'small' data segment (16-bit one),
> processor does not restore ESP from old_ESP.
> It restores lower 16 bits only. Upper 16 bits are filled
> with garbage (or just not modified at all?).
Not modified at all, yes. That's why it is always
greater than TASK_SIZE (0xc0000000) - it is still
in a kernel space.

> This works fine because processor uses SP, not ESP,
> for subsequent push/pop/call/ret operations.
> But if code uses full ESP, thinking that upper 16 bits are zero,
> it will crash badly. Correct me if I'm wrong.
That's correct. But you should note that the
program not just "thinks" that the upper 16 bits
are zero. It writes zero there itself, and a few
instructions later - oops, it is no longer zero...
My test-case does "hlt" to force the context switch,
but in fact it is not necessary. I tried an
empty loop instead of HLT. If the loop is long
enough and some IRQ being handled by the kernel
in a mean time, the ESP is corrupted. So the
user-space never knows when exactly it gets corrupted.
And, as I mentioned already, since the code is 32bit,
it just does the things like "mov ebp,esp" and
uses ebp to access the function params and locals,
which crashes, or uses ESP directly to access
something, etc.

> Hmm. I think you need to *emulate* this IRET.
> Is this IRET happens in kernel or in dosemu code?
The problem happens only when iret is used to
switch to another priv level. So it is the kernel's
iret of course, something to the effect of entry.S:128
I think. The control is transferred from kernel code
to the DOS code directly. dosemu code is completely
bypassed. dosemu have no chances to intercept that.
The DOS program is running on its own (as long as
it doesn't trigger an exception, or SIGALRM comes),
and the CPU is corrupting its ESP in *any* place,
since it happens when an external IRQ arrives.

>> That's strange indeed! Apparently your ESP value,
>> 0xccf1fb10, is greater than 0xc0000000, in which
>> case my program should just write "BUG!", but for
>> you it doesn't. Haven't you altered the code
>> somehow?
> No, I didn't alter anything. Strange indeed.
So that proves that writing the bug-free code, esp.
when the asm is involved, is not always as easy as
I assume...

> I am thoroughly confused now. Two back-to-back
> 'printf("Now sp=%08lx\n", new_esp);'
> gave different result.  How this can happen??
My code did the assumption that the ESP should
not change within the execution of main(). This
is true only if you don't enable the optimization
(as I did), or use -fno-defer-pop option of gcc.
Fixed code is attached (just for the overall
completeness:) Should survive the optimization now.


[-- Attachment #2: stk1.c --]
[-- Type: text/x-csrc, Size: 2646 bytes --]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <linux/unistd.h>
#include <asm/ldt.h>
#include <asm/ucontext.h>

unsigned char stack[65536];

void my_segv(int signum, siginfo_t *info, void *c)
{
  struct sigcontext *ctx = &((struct ucontext *)c)->uc_mcontext;
  printf("In sighandler: esp=%lx\n", ctx->esp);
  /* Skip HLT */
  ctx->eip++;
}

_syscall3(int, modify_ldt, int, func, void *, ptr, unsigned long, bytecount)

static int set_ldt_entry(int entry, unsigned long base, unsigned int limit,
	      int seg_32bit_flag, int contents, int read_only_flag,
	      int limit_in_pages_flag, int seg_not_present, int useable)
{
  struct modify_ldt_ldt_s ldt_info;
  ldt_info.entry_number = entry;
  ldt_info.base_addr = base;
  ldt_info.limit = limit;
  ldt_info.seg_32bit = seg_32bit_flag;
  ldt_info.contents = contents;
  ldt_info.read_exec_only = read_only_flag;
  ldt_info.limit_in_pages = limit_in_pages_flag;
  ldt_info.seg_not_present = seg_not_present;
  ldt_info.useable = useable;

  return modify_ldt(1, &ldt_info, sizeof(ldt_info));
}

int main(int argc, char *argv[])
{
  unsigned short _ss, new_ss;
  unsigned long _esp, new_esp;
  int is32 = 0;
  stack_t sig_stack;
  struct sigaction sa;

  if (argc > 1 && !strncmp(argv[1], "32", 3))
    is32 = 1;

  sig_stack.ss_sp = stack;
  sig_stack.ss_flags = 0;
  sig_stack.ss_size = sizeof(stack);
  if (sigaltstack(&sig_stack, NULL)) {
    perror("sigaltstack()");
    return 1;
  }

  sa.sa_sigaction = my_segv;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
  sigaction(SIGSEGV, &sa, NULL);

  /* Get SS */
  asm volatile(
    "movw %%ss, %0\n"
    :"=m"(_ss)
  );

  /* Force to LDT */
  new_ss = _ss | 4;
  /* Create the LDT entry */
  set_ldt_entry(new_ss >> 3, 0, 0xbffff, is32, MODIFY_LDT_CONTENTS_DATA, 0, 1, 0, 0);
  printf("old_ss=%#hx new_ss=%#hx\n", _ss, new_ss);

  /* Do the trick... Switch stack and then switch context. */
  asm volatile(
    "movl %%esp, %1\n"		/* Save ESP */
    "movw %2, %%ss\n"		/* Load our LDT selector to SS */
    "hlt\n"			/* Force the context switch */
    "movw %3, %%ss\n"		/* Restore SS ASAP */
    "movl %%esp, %0\n"		/* Get new ESP */
    "movl %1, %%esp\n"		/* Restore ESP */
    :"=m"(new_esp), "=a"(_esp)
    :"m"(new_ss), "m"(_ss)
  );
  printf("old_esp=%#lx new_esp=%#lx\n", _esp, new_esp);

  /* See what we've got... */
  if (new_esp > 0xc0000000) {
    printf("BUG!\n");
  } else if (new_esp != _esp) {
    printf("Esp changed, strange...\n");
  } else {
    printf("No bug here! What CPU is this?\n");
    if (!is32)
      system("cat /proc/cpuinfo");
  }
  return 0;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 10:58       ` Stas Sergeev
@ 2004-09-18 13:08         ` Denis Vlasenko
  2004-09-18 17:05           ` Stas Sergeev
  2004-09-21 11:19         ` Pavel Machek
  1 sibling, 1 reply; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-18 13:08 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: linux-kernel

On Saturday 18 September 2004 13:58, Stas Sergeev wrote:
> Hello.
>
> Denis Vlasenko wrote:
> >> This is what happens: CPU is doing IRET in 32bit mode.
> >> Stack layout:
> >
> > +20 (empty) <--- SS:ESP
> > +16 old_CS
> > +12 old_EIP
> >  +8 old_EFLAGS
> >  +4 old_SS
> >  +0 old_ESP
>
> JFYI, looks like you swapped CS<-->EIP and SS<-->ESP
> on your stack frame layout.

Hm. More than that. My figure is totally incorrect.
Corrected one:

+16 old_SS
+12 old_ESP
 +8 old_EFLAGS
 +4 old_CS
 +0 old_EIP <--- SS:ESP

Addresses grow from the bottom in this fig.
Note that offsets are always stored in lower address
relative to where selectors are stored.

> > If old_SS references 'small' data segment (16-bit one),
> > processor does not restore ESP from old_ESP.
> > It restores lower 16 bits only. Upper 16 bits are filled
> > with garbage (or just not modified at all?).
>
> Not modified at all, yes. That's why it is always
> greater than TASK_SIZE (0xc0000000) - it is still
> in a kernel space.
>
> > This works fine because processor uses SP, not ESP,
> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
>
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...
> My test-case does "hlt" to force the context switch,
> but in fact it is not necessary. I tried an
> empty loop instead of HLT. If the loop is long
> enough and some IRQ being handled by the kernel
> in a mean time, the ESP is corrupted. So the
> user-space never knows when exactly it gets corrupted.
> And, as I mentioned already, since the code is 32bit,
> it just does the things like "mov ebp,esp" and
> uses ebp to access the function params and locals,
> which crashes, or uses ESP directly to access
> something, etc.
>
> > Hmm. I think you need to *emulate* this IRET.
> > Is this IRET happens in kernel or in dosemu code?
>
> The problem happens only when iret is used to
> switch to another priv level. So it is the kernel's
> iret of course, something to the effect of entry.S:128
> I think. The control is transferred from kernel code
> to the DOS code directly. dosemu code is completely
> bypassed. dosemu have no chances to intercept that.
> The DOS program is running on its own (as long as
> it doesn't trigger an exception, or SIGALRM comes),
> and the CPU is corrupting its ESP in *any* place,
> since it happens when an external IRQ arrives.

Aha. The only way to sanely handle it is to
hack on entry.S I'm afraid. Something like rewriting
CS:EIP so that it returns to a small ring-3 trampoline
which clears upper 16 bits of ESP and jumps to original CS:EIP.

You may use ring-3 user stack as a place to save original CS:EIP,
and then just do RETF. This way, trampoline boils down to

	movzxl	%sp,%esp
	retf

which does not touch EFLAGS and you don't need to bother
with thinking about preserving it. IRET will restore it,
and trampoline wouldn't touch it anymore.

Now, how to detect when to use this? Hmm.... the simplest thing
is to check that

(old_ESP <= 0xffff) && !(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)

This will cost us only one comparison in the normal
path, because typically ESP of Linus executables
is greater than 0xffff.

P.S.

# gcc -o stk1 -O2 stk1.c
# ./stk1
old_ss=0x7b new_ss=0x7f
In sighandler: esp=bffffb30
old_esp=0xbffffb30 new_esp=0xc618fb30
BUG!

--
vda


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 13:08         ` Denis Vlasenko
@ 2004-09-18 17:05           ` Stas Sergeev
       [not found]             ` <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua>
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-18 17:05 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

Hi Denis.

Denis Vlasenko wrote:
> Aha. The only way to sanely handle it is to
> hack on entry.S I'm afraid. Something like rewriting
> CS:EIP so that it returns to a small ring-3 trampoline
> which clears upper 16 bits of ESP and jumps to original CS:EIP.
Well, I don't really want to clear the higher word
of ESP (even if I want to do that in most cases),
but I'd really like to just restore it properly.
Of course (I think) ring-3 trampoline will not
work for many reasons. The most obvious one is
that it itself can be interrupted in any place.
Another problem with it is that the return frame
will have to be pushed to the stack of a DOS prog.
This is not the good thing to do. dosemu avoids
ever touching the stack of a DOS prog by setting
up the sigaltstack.
What *will* work, however, is a ring-1 trampoline,
as Petr Vandrovec suggested. It can be executed
with interrupts disabled (I think) so will not
be interrupted, and it can (isn't it?) use the
kernel stack, if I understand that correctly.

> Now, how to detect when to use this? Hmm.... the simplest thing
> is to check that
> (old_ESP <= 0xffff) && !(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)
Yes, that's an excellent idea (except for the
ESP<=0xffff check - I don't think this one is
necessary).

> This will cost us only one comparison in the normal
> path,
Yes, so I think I should just try to implement
that. Not as a ring-3 trampoline, but as a ring-1
one.

> because typically ESP of Linus executables
> is greater than 0xffff.
I'd rather say, because the stack segment is 32bit
for them always.
And this all should work, as it seems to me.
Thanks for the hint about the checking.


^ permalink raw reply	[flat|nested] 38+ messages in thread

[parent not found: <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua>]

* Re: ESP corruption bug - what CPUs are affected?
       [not found]             ` <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua>
@ 2004-09-22 19:05               ` Stas Sergeev
  0 siblings, 0 replies; 38+ messages in thread
From: Stas Sergeev @ 2004-09-22 19:05 UTC (permalink / raw)
  To: Denis Vlasenko; +Cc: linux-kernel

Hi,

Denis Vlasenko wrote:
> Maybe. This would be a complicated thing.
I bet it was! :)

> Well. Not okay. Maybe this?
> 1. We build IRET frame on ring1 stack for step 4 (see below),
>    modify IRET frame on ring0 stack so that intrs are disabled
>    and CS:EIP and SS:ESP point to values of ring1 code/stack.
Much simpler: IRET frame to return to user is
already there. Just push another one to return
to ring1 first.

> 2. IRET returns to ring1 code with dedicated *16-bit* ring1 stack
>    Upper word of ESP is wrong now, but we can safely fiddle with it.
> 3. trampoline code fixes upper word of ESP (how?)
popl %esp (as per Petr Vandrovec's suggestion)
The value on stack is carefully prepared on ring0.

> 4. trampoline IRETs to user code.
> May work.
Works!

> ring1 stacks must be per-CPU.
I allocate it on a ring0 stack. Noone seem to
suggest that. Is this flawed for some reasons?
 
>> ESP<=0xffff check - I don't think this one is
>> necessary).
> Any program which runs with 16bit stack and yet with
> ESP > 0xffff is doing something *terminally* weird.
> I think it is acceptable to leave this case unfixed.
For what? We can have that fixed so why not?

> You cannot check 16bitness of SS descriptor in two
> insns.
I do actually:
larl OLDSS(%esp), %eax
testl $0x00400000, %eax
which is exactly two insns.

Since I've forgot to CC the patch to you, I uploaded
it here:
http://www.dosemu.org/stas/linux-2.6.8-stacks2.diff
so that you (or anyone interested) can make a review.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 10:58       ` Stas Sergeev
  2004-09-18 13:08         ` Denis Vlasenko
@ 2004-09-21 11:19         ` Pavel Machek
  2004-09-21 11:43           ` Denis Vlasenko
  1 sibling, 1 reply; 38+ messages in thread
From: Pavel Machek @ 2004-09-21 11:19 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Denis Vlasenko, linux-kernel

Hi!

> >for subsequent push/pop/call/ret operations.
> >But if code uses full ESP, thinking that upper 16 bits are zero,
> >it will crash badly. Correct me if I'm wrong.
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...

Hmm, perhaps this can also be viewed as a "information leak"? Program
running under dosemu is not expected to know high bits of kernel
%esp...
								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ESP corruption bug - what CPUs are affected?
  2004-09-21 11:19         ` Pavel Machek
@ 2004-09-21 11:43           ` Denis Vlasenko
  0 siblings, 0 replies; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-21 11:43 UTC (permalink / raw)
  To: Pavel Machek, Stas Sergeev; +Cc: linux-kernel

On Tuesday 21 September 2004 14:19, Pavel Machek wrote:
> Hi!
> 
> > >for subsequent push/pop/call/ret operations.
> > >But if code uses full ESP, thinking that upper 16 bits are zero,
> > >it will crash badly. Correct me if I'm wrong.
> >
> > That's correct. But you should note that the
> > program not just "thinks" that the upper 16 bits
> > are zero. It writes zero there itself, and a few
> > instructions later - oops, it is no longer zero...
> 
> Hmm, perhaps this can also be viewed as a "information leak"? Program
> running under dosemu is not expected to know high bits of kernel
> %esp...

Yes.

I must admit that Stas was right and I was wrong.
This is a 386 CPU bug.

Since then it seems to be codified in i86 architecture
and duplicated in all successors.
Incredible... :(
--
vda


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2004-10-11 23:16 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-10-06 17:18 ESP corruption bug - what CPUs are affected? (patch att Petr Vandrovec
2004-10-06 19:04 ` Stas Sergeev
2004-10-11 18:32 ` ESP corruption bug - what CPUs are affected? Stas Sergeev
  -- strict thread matches above, loose matches on Subject: below --
2004-09-16 18:39 Petr Vandrovec
2004-09-17 18:12 ` Stas Sergeev
2004-09-18 16:45 ` Stas Sergeev
2004-09-18 16:59   ` Petr Vandrovec
2004-09-18 19:14     ` Stas Sergeev
2004-09-18 20:35       ` Petr Vandrovec
2004-09-22 18:49         ` Stas Sergeev
2004-09-22 19:19           ` Richard B. Johnson
2004-09-22 20:03             ` Stas Sergeev
2004-09-22 20:13               ` Richard B. Johnson
2004-09-28 15:43                 ` Denis Vlasenko
2004-09-22 20:02           ` Petr Vandrovec
2004-09-23  4:09             ` Stas Sergeev
2004-09-23 17:08             ` Stas Sergeev
2004-09-23 18:06               ` Petr Vandrovec
2004-09-24 20:36                 ` Stas Sergeev
2004-09-24 21:43                   ` Petr Vandrovec
2004-09-25  8:04                     ` Gabriel Paubert
2004-09-25 12:25                       ` Stas Sergeev
2004-09-25 19:18                         ` Gabriel Paubert
2004-09-25 20:40                           ` Stas Sergeev
2004-09-25 23:42                             ` Gabriel Paubert
2004-09-26 18:04                               ` Stas Sergeev
2004-09-27  9:07                                 ` Gabriel Paubert
2004-09-30 15:11                               ` Bill Davidsen
2004-09-16 17:49 Stas Sergeev
2004-09-16 19:03 ` Denis Vlasenko
2004-09-17 18:13   ` Stas Sergeev
2004-09-17 22:04     ` Denis Vlasenko
2004-09-18 10:58       ` Stas Sergeev
2004-09-18 13:08         ` Denis Vlasenko
2004-09-18 17:05           ` Stas Sergeev
     [not found]             ` <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-22 19:05               ` Stas Sergeev
2004-09-21 11:19         ` Pavel Machek
2004-09-21 11:43           ` Denis Vlasenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).