* Re: ESP corruption bug - what CPUs are affected? (patch att
@ 2004-10-06 17:18 Petr Vandrovec
2004-10-06 19:04 ` Stas Sergeev
2004-10-11 18:32 ` ESP corruption bug - what CPUs are affected? Stas Sergeev
0 siblings, 2 replies; 38+ messages in thread
From: Petr Vandrovec @ 2004-10-06 17:18 UTC (permalink / raw)
To: Stas Sergeev; +Cc: linux-kernel, Denis Vlasenko
On 6 Oct 04 at 20:18, Stas Sergeev wrote:
> Yes, if not for that anonymous guy, who kept posting
> to me until he finally convinced me that the Ring-0
> approach is not that difficult at all.
> So I tried... It was much more difficult to code
> up, but in the end it looks a little better
> and is localized entirely to entry.S. OTOH it
> touches the exception handlers, but not too much -
> it adds only 5 insns on the fast path. And the
> code is very fragile, but after I made all the
> magic numbers #define'd constants, it actually
> looks not so bad.
> I don't know which patch is really better, so
> I am attaching both.
The CPL0 solution is certainly more localized, but I am having a hard
time convincing myself that it is actually safe.
I would appreciate it if you could add comments on what values are set
by ESPFIX_SWITCH_16 + 8 + 4 and similar moves, and what they actually
do. And convincing myself that ESPFIX_SWITCH_32 has just the right value so that
pushl %eax
pushl %es
lss ESPFIX_SWITCH_32,%esp
popl %es
popl %eax
actually works took almost an hour...
Petr
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: ESP corruption bug - what CPUs are affected? (patch att 2004-10-06 17:18 ESP corruption bug - what CPUs are affected? (patch att Petr Vandrovec @ 2004-10-06 19:04 ` Stas Sergeev 2004-10-11 18:32 ` ESP corruption bug - what CPUs are affected? Stas Sergeev 1 sibling, 0 replies; 38+ messages in thread From: Stas Sergeev @ 2004-10-06 19:04 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel, Denis Vlasenko [-- Attachment #1: Type: text/plain, Size: 1597 bytes --] Hi. Petr Vandrovec wrote: > CPL0 solution is certainly more localized, but I have hard problems > to convice myself that it is actually safe. I spent 2 days convincing myself the same way:) The most problematic part was to make sure that the stack is properly unwinded even if NMI comes before the exception handler managed to switch out of 16bit. But I think this is now handled. > I would appreciate if you could add comments what values are set OK, I did. But in fact that makes the patch only even more obfuscated:( It doesn't look possible to explain all the magic pattern and its constraints in a comments. But I tried:) The patch is attached. > by ESPFIX_SWITCH_16 + 8 + 4 and simillar moves, and what they actually > do. Fortunately there are no such moves. In an attempt to make the patch a little self-explanatory, I #define'd all the meaningfull values. So if there is a move to ESPFIX_SWITCH16_OFFS+some_value, it is safe to assume that the move is intended to load the ESPFIX_SWITCH16, and "some_value" is just the correction constant. So besides the 2 magic pointers, there are really no moves above the iret frame. > And convicing myself that ESPFIX_SWITCH_32 has just right value so > pushl %eax > pushl %es > lss ESPFIX_SWITCH_32,%esp > popl %es > popl %eax > actually works took almost an hour... I realize that and thats really the big problem of that patch. It is very obfuscated and difficult to understand. 
I guess if you see the first version of that patch, which was before I found the way to use the fixed offsets for locating the switches, you might just get sick:) [-- Attachment #2: linux-2.6.8-stk0-2a.diff --] [-- Type: text/x-patch, Size: 8285 bytes --] diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-26 15:20:23.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-10-06 22:39:43.084860592 +0400 @@ -78,7 +78,7 @@ #define preempt_stop cli #else #define preempt_stop -#define resume_kernel restore_all +#define resume_kernel restore_context #endif #define SAVE_ALL \ @@ -122,25 +122,6 @@ .previous -#define RESTORE_ALL \ - RESTORE_REGS \ - addl $4, %esp; \ -1: iret; \ -.section .fixup,"ax"; \ -2: sti; \ - movl $(__USER_DS), %edx; \ - movl %edx, %ds; \ - movl %edx, %es; \ - pushl $11; \ - call do_exit; \ -.previous; \ -.section __ex_table,"a";\ - .align 4; \ - .long 1b,2b; \ -.previous - - - ENTRY(lcall7) pushfl # We get a different stack layout with call # gates, which has to be cleaned up later.. @@ -197,7 +178,7 @@ movl EFLAGS(%esp), %eax # mix EFLAGS and CS movb CS(%esp), %al testl $(VM_MASK | 3), %eax - jz resume_kernel # returning to kernel or vm86-space + jz resume_kernel ENTRY(resume_userspace) cli # make sure we don't miss an interrupt # setting need_resched or sigpending @@ -211,7 +192,7 @@ #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? - jnz restore_all + jnz restore_context need_resched: movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl @@ -293,8 +274,102 @@ movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx # current->work jne syscall_exit_work + restore_all: - RESTORE_ALL + /* If returning to Ring-3, not to V86, and with + * the small stack, try to fix the higher word of + * ESP, as the CPU won't restore it from stack. 
+ * This is an "official" bug of all the x86-compatible + * CPUs, which we can try to work around to make + * dosemu happy. */ + movl EFLAGS(%esp), %eax + movb CS(%esp), %al + andl $(VM_MASK | 3), %eax + /* returning to user-space and not to v86? */ + cmpl $3, %eax + jne restore_context + larl OLDSS(%esp), %eax + /* returning to "small" stack? */ + testl $0x00400000, %eax + jz 8f # go fixing ESP instead of just to return :( +restore_context: + RESTORE_REGS + addl $4, %esp +1: iret +.section .fixup,"ax" +iret_exc: + sti + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + pushl $11 + call do_exit +.previous +.section __ex_table,"a" + .align 4 + .long 1b,iret_exc +.previous + /* preparing the ESPfix here */ +/* FRAME_SIZE_ESTIM is the size of an exception stack frame + 8 bytes + * for 2 pushes - need to save 2 registers on stack */ +#define FRAME_SIZE_ESTIM 24 +/* ESPFIX_STACK_RESERVE is the value that goes into SP after switch to 16bit */ +#define ESPFIX_STACK_RESERVE (FRAME_SIZE_ESTIM * 3) +#define SP_ESTIMATE (ESPFIX_STACK_RESERVE - FRAME_SIZE_ESTIM) +/* iret frame takes 20 bytes. + * We hide 2 magic pointers above it, 8 bytes each (seg:off). */ +#define ESPFIX_SWITCH16_OFFS 20 +#define ESPFIX_SWITCH32_OFFS 28 +#define ESPFIX_SWITCH16 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH16_OFFS) +#define ESPFIX_SWITCH32 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH32_OFFS) +8: RESTORE_REGS + addl $4, %esp + /* reserve the room for 2 magic pointers (16 bytes) and 4 bytes wasted. */ + .rept 5 + pushl 16(%esp) + .endr + /* Preserve some registers. That makes all the +8 offsets below. */ + pushl %eax + pushl %ebp + /* Set up the SWITCH16 magic pointer. */ + movl $__ESPFIX_SS, ESPFIX_SWITCH16_OFFS+8+4(%esp) + /* Get the "old" ESP */ + movl 8+12(%esp), %eax + /* replace the lower word of "old" ESP with our magic value */ + movw $ESPFIX_STACK_RESERVE, %ax + movl %eax, ESPFIX_SWITCH16_OFFS+8(%esp) + /* SWITCH16 ready */ + /* Now set up the SWITCH32 magic pointer. 
*/ + movl $__KERNEL_DS, ESPFIX_SWITCH32_OFFS+8+4(%esp) + /* ESP must match the SP_ESTIMATE, + * which is FRAME_SIZE_ESTIM below the iret frame */ + leal -FRAME_SIZE_ESTIM+8(%esp), %eax + movl %eax, ESPFIX_SWITCH32_OFFS+8(%esp) + /* SWITCH32 ready */ + /* %eax will make the base of __ESPFIX_SS. + * It have to be ESPFIX_STACK_RESERVE bytes below the iret frame. */ + subl $(ESPFIX_STACK_RESERVE-FRAME_SIZE_ESTIM), %eax + cli + GET_THREAD_INFO(%ebp) + movl TI_cpu(%ebp), %ebp + shll $GDT_SIZE_SHIFT, %ebp + /* find GDT of the proper CPU */ + addl $cpu_gdt_table, %ebp + /* patch the base of the 16bit stack */ + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp) + shrl $16, %eax + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp) + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp) + popl %ebp + popl %eax +espfix_before_lss: + lss ESPFIX_SWITCH16_OFFS(%esp), %esp +espfix_after_lss: + iret +.section __ex_table,"a" + .align 4 + .long espfix_after_lss,iret_exc +.previous # perform work that needs to be done immediately before resumption ALIGN @@ -493,6 +568,36 @@ pushl $do_debug jmp error_code +/* Check if we are on 16bit stack. Can happen either on iret of ESPFIX, + * or in an exception handler after that iret... */ +#define CHECK_ESPFIX_STACK(label) \ + pushl %eax; \ + movl %ss, %eax; \ + cmpl $__ESPFIX_SS, %eax; \ + jne label; +/* using %es here because NMI can change %ss */ +#define UNWIND_ESPFIX_STACK \ + CHECK_ESPFIX_STACK(28f) \ + pushl %es; \ + movl %eax, %es; \ + lss %es:ESPFIX_SWITCH32, %esp; \ + popl %es; \ +28: popl %eax; +/* have to compensate the difference between real SP and estimated. 
*/ +#define UNWIND_ESPFIX_STACK_NMI \ + CHECK_ESPFIX_STACK(28f) \ + movw $SP_ESTIMATE, %ax; \ + subw %sp, %ax; \ + movswl %ax, %eax; \ + lss %ss:ESPFIX_SWITCH32, %esp; \ + subl %eax, %esp; \ + popl %eax; \ + cmpl $espfix_after_lss, (%esp); \ + jne nmi_stack_correct; \ + movl $espfix_before_lss, (%esp); \ + jmp nmi_stack_correct; \ +28: popl %eax; + /* * NMI is doubly nasty. It can happen _while_ we're handling * a debug fault, and the debug fault hasn't yet been able to @@ -502,6 +607,7 @@ * fault happened on the sysenter path. */ ENTRY(nmi) + UNWIND_ESPFIX_STACK_NMI cmpl $sysenter_entry,(%esp) je nmi_stack_fixup pushl %eax @@ -523,7 +629,7 @@ pushl %edx call do_nmi addl $8, %esp - RESTORE_ALL + jmp restore_all nmi_stack_fixup: FIX_STACK(12,nmi_stack_correct, 1) @@ -569,14 +675,17 @@ jmp error_code ENTRY(segment_not_present) + UNWIND_ESPFIX_STACK pushl $do_segment_not_present jmp error_code ENTRY(stack_segment) + UNWIND_ESPFIX_STACK pushl $do_stack_segment jmp error_code ENTRY(general_protection) + UNWIND_ESPFIX_STACK pushl $do_general_protection jmp error_code diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S --- linux-2.6.8-pcsp/arch/i386/kernel/head.S 2004-07-17 23:31:24.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/head.S 2004-10-06 21:58:04.873646424 +0400 @@ -514,7 +514,7 @@ .quad 0x00009a0000000000 /* 0xc0 APM CS 16 code (16 bit) */ .quad 0x0040920000000000 /* 0xc8 APM DS data */ - .quad 0x0000000000000000 /* 0xd0 - unused */ + .quad 0x000092000000ffff /* 0xd0 - ESPfix 16-bit SS */ .quad 0x0000000000000000 /* 0xd8 - unused */ .quad 0x0000000000000000 /* 0xe0 - unused */ .quad 0x0000000000000000 /* 0xe8 - unused */ diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h --- linux-2.6.8-pcsp/include/asm-i386/segment.h 2004-01-09 09:59:19.000000000 +0300 +++ linux-2.6.8-stacks/include/asm-i386/segment.h 2004-10-06 21:58:04.000000000 +0400 @@ -38,7 +38,7 @@ * 24 - 
APM BIOS support * 25 - APM BIOS support * - * 26 - unused + * 26 - ESPfix small SS * 27 - unused * 28 - unused * 29 - unused @@ -71,14 +71,19 @@ #define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 6) #define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11) +#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 14) +#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8) + #define GDT_ENTRY_DOUBLEFAULT_TSS 31 /* * The GDT has 32 entries */ -#define GDT_ENTRIES 32 - -#define GDT_SIZE (GDT_ENTRIES * 8) +#define GDT_ENTRIES_SHIFT 5 +#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT) +#define GDT_ENTRY_SHIFT 3 +#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT) +#define GDT_SIZE (1 << GDT_SIZE_SHIFT) /* Simple and small GDT entries for booting only */ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-10-06 17:18 ESP corruption bug - what CPUs are affected? (patch att Petr Vandrovec 2004-10-06 19:04 ` Stas Sergeev @ 2004-10-11 18:32 ` Stas Sergeev 1 sibling, 0 replies; 38+ messages in thread From: Stas Sergeev @ 2004-10-11 18:32 UTC (permalink / raw) To: linux-kernel; +Cc: Petr Vandrovec [-- Attachment #1: Type: text/plain, Size: 576 bytes --] Hi. I've been pointed out that my prev ring-0 patch was not completely safe. I was doing "popl %es". I thought it is save because RESTORE_REGS also does that with the fixup in place, but the anonymous guy thinks that if %es refers to LDT and the thread on another CPU changes that LDT entry in a mean time, my "popl %es" can GPF. So I have to avoid popping any segregs. I moved my recovery code to error_code, right after %es is used last time and before the %esp is used directly (I am lucky such a place exist there!). New patch is attached. Does it look safe this time? [-- Attachment #2: linux-2.6.8-stk0-3.diff --] [-- Type: text/x-patch, Size: 8035 bytes --] diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-26 15:20:23.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-10-09 20:55:36.772613096 +0400 @@ -78,7 +78,7 @@ #define preempt_stop cli #else #define preempt_stop -#define resume_kernel restore_all +#define resume_kernel restore_context #endif #define SAVE_ALL \ @@ -122,25 +122,6 @@ .previous -#define RESTORE_ALL \ - RESTORE_REGS \ - addl $4, %esp; \ -1: iret; \ -.section .fixup,"ax"; \ -2: sti; \ - movl $(__USER_DS), %edx; \ - movl %edx, %ds; \ - movl %edx, %es; \ - pushl $11; \ - call do_exit; \ -.previous; \ -.section __ex_table,"a";\ - .align 4; \ - .long 1b,2b; \ -.previous - - - ENTRY(lcall7) pushfl # We get a different stack layout with call # gates, which has to be cleaned up later.. 
@@ -197,7 +178,7 @@ movl EFLAGS(%esp), %eax # mix EFLAGS and CS movb CS(%esp), %al testl $(VM_MASK | 3), %eax - jz resume_kernel # returning to kernel or vm86-space + jz resume_kernel ENTRY(resume_userspace) cli # make sure we don't miss an interrupt # setting need_resched or sigpending @@ -211,7 +192,7 @@ #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? - jnz restore_all + jnz restore_context need_resched: movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl @@ -293,8 +274,100 @@ movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx # current->work jne syscall_exit_work + restore_all: - RESTORE_ALL + /* If returning to Ring-3, not to V86, and with + * the small stack, try to fix the higher word of + * ESP, as the CPU won't restore it from stack. + * This is an "official" bug of all the x86-compatible + * CPUs, which we can try to work around to make + * dosemu happy. */ + movl EFLAGS(%esp), %eax + movb CS(%esp), %al + andl $(VM_MASK | 3), %eax + /* returning to user-space and not to v86? */ + cmpl $3, %eax + jne restore_context + larl OLDSS(%esp), %eax + /* returning to "small" stack? */ + testl $0x00400000, %eax + jz 8f # go fixing ESP instead of just to return :( +restore_context: + RESTORE_REGS + addl $4, %esp +1: iret +.section .fixup,"ax" +iret_exc: + sti + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + pushl $11 + call do_exit +.previous +.section __ex_table,"a" + .align 4 + .long 1b,iret_exc +.previous + /* preparing the ESPfix here */ +/* FRAME_SIZE_ESTIM is the size of an exception stack frame + 40 bytes + * for 10 pushes */ +#define FRAME_SIZE_ESTIM (16 + 10 * 4) +/* ESPFIX_STACK_RESERVE is the value that goes into SP after switch to 16bit */ +#define ESPFIX_STACK_RESERVE (FRAME_SIZE_ESTIM * 3) +#define SP_ESTIMATE (ESPFIX_STACK_RESERVE - FRAME_SIZE_ESTIM) +/* iret frame takes 20 bytes. + * We hide 2 magic pointers above it, 8 bytes each (seg:off). 
*/ +#define ESPFIX_SWITCH16_OFFS 20 +#define ESPFIX_SWITCH32_OFFS 28 +#define ESPFIX_SWITCH16 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH16_OFFS) +#define ESPFIX_SWITCH32 (ESPFIX_STACK_RESERVE + ESPFIX_SWITCH32_OFFS) +8: RESTORE_REGS + addl $4, %esp + /* reserve the room for 2 magic pointers (16 bytes) and 4 bytes wasted. */ + .rept 5 + pushl 16(%esp) + .endr + /* Preserve some registers. That makes all the +8 offsets below. */ + pushl %eax + pushl %ebp + /* Set up the SWITCH16 magic pointer. */ + movl $__ESPFIX_SS, ESPFIX_SWITCH16_OFFS+8+4(%esp) + /* Get the "old" ESP */ + movl 8+12(%esp), %eax + /* replace the lower word of "old" ESP with our magic value */ + movw $ESPFIX_STACK_RESERVE, %ax + movl %eax, ESPFIX_SWITCH16_OFFS+8(%esp) + /* Now set up the SWITCH32 magic pointer. */ + movl $__KERNEL_DS, ESPFIX_SWITCH32_OFFS+8+4(%esp) + /* ESP must match the SP_ESTIMATE, + * which is FRAME_SIZE_ESTIM bytes below the iret frame */ + leal -FRAME_SIZE_ESTIM+8(%esp), %eax + movl %eax, ESPFIX_SWITCH32_OFFS+8(%esp) + /* %eax will make the base of __ESPFIX_SS. + * It have to be ESPFIX_STACK_RESERVE bytes below the iret frame. */ + subl $(ESPFIX_STACK_RESERVE-FRAME_SIZE_ESTIM), %eax + cli + GET_THREAD_INFO(%ebp) + movl TI_cpu(%ebp), %ebp + shll $GDT_SIZE_SHIFT, %ebp + /* find GDT of the proper CPU */ + addl $cpu_gdt_table, %ebp + /* patch the base of the 16bit stack */ + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp) + shrl $16, %eax + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp) + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp) + popl %ebp + popl %eax +espfix_before_lss: + lss ESPFIX_SWITCH16_OFFS(%esp), %esp +espfix_after_lss: + iret +.section __ex_table,"a" + .align 4 + .long espfix_after_lss,iret_exc +.previous # perform work that needs to be done immediately before resumption ALIGN @@ -396,6 +469,34 @@ call do_IRQ jmp ret_from_intr +/* Check if we are on 16bit stack. 
Can happen either on iret of ESPFIX, + * or in an exception handler after that iret... */ +#define CHECK_ESPFIX_STACK(label) \ + pushl %eax; \ + movl %ss, %eax; \ + cmpw $__ESPFIX_SS, %ax; \ + jne label; +/* using %es here because NMI can change %ss */ +#define UNWIND_ESPFIX_STACK \ + CHECK_ESPFIX_STACK(28f) \ + movl %eax, %es; \ + lss %es:ESPFIX_SWITCH32, %esp; \ +28: popl %eax; +/* have to compensate the difference between real SP and estimated. */ +#define UNWIND_ESPFIX_STACK_NMI \ + CHECK_ESPFIX_STACK(28f) \ + movw $SP_ESTIMATE, %ax; \ + subw %sp, %ax; \ + movswl %ax, %eax; \ + lss %ss:ESPFIX_SWITCH32, %esp; \ + subl %eax, %esp; \ + popl %eax; \ + cmpl $espfix_after_lss, (%esp); \ + jne nmi_stack_correct; \ + movl $espfix_before_lss, (%esp); \ + jmp nmi_stack_correct; \ +28: popl %eax; + #define BUILD_INTERRUPT(name, nr) \ ENTRY(name) \ pushl $nr-256; \ @@ -423,6 +524,7 @@ pushl %ebx cld movl %es, %ecx + UNWIND_ESPFIX_STACK # this corrupts %es movl ORIG_EAX(%esp), %esi # get the error code movl ES(%esp), %edi # get the function address movl %eax, ORIG_EAX(%esp) @@ -502,6 +604,7 @@ * fault happened on the sysenter path. 
*/ ENTRY(nmi) + UNWIND_ESPFIX_STACK_NMI cmpl $sysenter_entry,(%esp) je nmi_stack_fixup pushl %eax @@ -523,7 +626,7 @@ pushl %edx call do_nmi addl $8, %esp - RESTORE_ALL + jmp restore_all nmi_stack_fixup: FIX_STACK(12,nmi_stack_correct, 1) diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S --- linux-2.6.8-pcsp/arch/i386/kernel/head.S 2004-07-17 23:31:24.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/head.S 2004-10-06 21:58:04.000000000 +0400 @@ -514,7 +514,7 @@ .quad 0x00009a0000000000 /* 0xc0 APM CS 16 code (16 bit) */ .quad 0x0040920000000000 /* 0xc8 APM DS data */ - .quad 0x0000000000000000 /* 0xd0 - unused */ + .quad 0x000092000000ffff /* 0xd0 - ESPfix 16-bit SS */ .quad 0x0000000000000000 /* 0xd8 - unused */ .quad 0x0000000000000000 /* 0xe0 - unused */ .quad 0x0000000000000000 /* 0xe8 - unused */ diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h --- linux-2.6.8-pcsp/include/asm-i386/segment.h 2004-01-09 09:59:19.000000000 +0300 +++ linux-2.6.8-stacks/include/asm-i386/segment.h 2004-10-06 21:58:04.000000000 +0400 @@ -38,7 +38,7 @@ * 24 - APM BIOS support * 25 - APM BIOS support * - * 26 - unused + * 26 - ESPfix small SS * 27 - unused * 28 - unused * 29 - unused @@ -71,14 +71,19 @@ #define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 6) #define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11) +#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 14) +#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8) + #define GDT_ENTRY_DOUBLEFAULT_TSS 31 /* * The GDT has 32 entries */ -#define GDT_ENTRIES 32 - -#define GDT_SIZE (GDT_ENTRIES * 8) +#define GDT_ENTRIES_SHIFT 5 +#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT) +#define GDT_ENTRY_SHIFT 3 +#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT) +#define GDT_SIZE (1 << GDT_SIZE_SHIFT) /* Simple and small GDT entries for booting only */ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? @ 2004-09-16 18:39 Petr Vandrovec 2004-09-17 18:12 ` Stas Sergeev 2004-09-18 16:45 ` Stas Sergeev 0 siblings, 2 replies; 38+ messages in thread From: Petr Vandrovec @ 2004-09-16 18:39 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel On 16 Sep 04 at 21:49, Stas Sergeev wrote: > > There is a "semi-official" bug in Intel CPUs, > which is described here: > http://www.intel.com/design/intarch/specupdt/27287402.PDF > chapter "Specification Clarifications" > section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers". Not a bug, but a feature. > What I want to find out, is what CPUs are > affected. I wrote a small test-case. It is > attached. I tried it on Intel Pentium and > on AMD Athlon - bug is present on both. But > I'd like to know if it is also present on a > Transmeta Crusoe, Cyrix and other CPUs. AFAIK all. There are products which depend on this behavior. > I'd also like to know any thoughts on whether > it is possible to work around the bug, probably > in a kernel? Well, I am not hoping on such a > possibility, but who knows... IMHO you have to switch to 16bit stack, load upper bits of ESP with target value, and then execute IRET, while 16bit SS:SP points to same place where flat ESP pointed. This way IRET, as stack is 16bit, uses only SP instead of ESP, and so you can preload upper bits of ESP before executing IRET. And I do not think that linux NMI handler survives 16bit stack, so natural solution seems to be to create complete 16bit CPL1 environment, return to it, load ESP as you want, and then do IRET to return to CPL2 or CPL3. Fortunately V8086 mode is not affected, so there should be no problem with using CPL1 for this middle step. But of course it is not something you want to do on each return from interrupt handler... Well, or maybe you want... > Anyway, I'd be glad to get any info on that bug. 
> Why it was not fixed for so many years, looks > also like an interesting question, as for me. It is part of architecture now... Petr Vandrovec ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-16 18:39 Petr Vandrovec @ 2004-09-17 18:12 ` Stas Sergeev 2004-09-18 16:45 ` Stas Sergeev 1 sibling, 0 replies; 38+ messages in thread From: Stas Sergeev @ 2004-09-17 18:12 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel Hello. Petr Vandrovec wrote: >> There is a "semi-official" bug in Intel CPUs, >> which is described here: >> http://www.intel.com/design/intarch/specupdt/27287402.PDF >> chapter "Specification Clarifications" >> section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers". > Not a bug, but a feature. Yep, one of those features, that you can neither disable, nor (speaking about dosemu) work around. >> What I want to find out, is what CPUs are >> affected. I wrote a small test-case. It is >> attached. I tried it on Intel Pentium and >> on AMD Athlon - bug is present on both. But >> I'd like to know if it is also present on a >> Transmeta Crusoe, Cyrix and other CPUs. > AFAIK all. I expected that. My last hopes were on Cyrix, which, at least as I've heard, doesn't follow the Intel's specs very strictly. So I thought maybe they have not duplicated that "feature". > There are products which depend on this behavior. Out of curiosity, what are those products? I think it is pretty much brain-damaged to depend on a ESP corruption. >> I'd also like to know any thoughts on whether >> it is possible to work around the bug, probably >> in a kernel? > IMHO you have to switch to 16bit stack, load upper bits of ESP > with target value, and then execute IRET, while 16bit SS:SP points > to same place where flat ESP pointed. Hmm, that sounds feasible. > And I do not think that linux NMI handler survives 16bit stack, so > natural solution seems to be to create complete 16bit CPL1 > environment, return to it, load ESP as you want, and then do IRET > to return to CPL2 or CPL3. Fortunately V8086 mode is not affected, > so there should be no problem with using CPL1 for this middle step. 
> But of course it is not something you want to do on each return > from interrupt handler... Well, or maybe you want... Umm, no. But well, the spec claims that this happens only with iret. It doesn't say that it is a general problem with switching between a different protection rings. And it seems, for example, that WinNT/XP do not have that problem. I.e. the DOS progs that get a Stack Fault under dosemu because ESP points to the kernel space, can actually work under WinXP. Maybe they are using some other technique to return to Ring3, the one that doesn't trigger the bug at all? >> Anyway, I'd be glad to get any info on that bug. >> Why it was not fixed for so many years, looks >> also like an interesting question, as for me. > It is part of architecture now... How could that happen? Was it so difficult to just fix? Oh well, perhaps it was... ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-16 18:39 Petr Vandrovec 2004-09-17 18:12 ` Stas Sergeev @ 2004-09-18 16:45 ` Stas Sergeev 2004-09-18 16:59 ` Petr Vandrovec 1 sibling, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-18 16:45 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel Hi Petr. Petr Vandrovec wrote: > natural solution seems to be to create complete 16bit CPL1 > environment, return to it, load ESP as you want, and then do IRET > to return to CPL2 or CPL3. Fortunately V8086 mode is not affected, > so there should be no problem with using CPL1 for this middle step. > But of course it is not something you want to do on each return > from interrupt handler... Well, or maybe you want... Actually, this may indeed be what I want! I think this can be implemented with the checks that Denis Vlasenko suggests. Something like this can be added to entry.S, right before the "iret": --- if (!(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)) push_a_stack_frame_to_return_to_the_ring1_trampoline(); --- This way the overhead for the normal case would be something about 4 asm insns (the check), and for the dosemu case - who cares? (and probably also the wine people will value that) Does this look reasonable? If it does, I think I should just start implementing that. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-18 16:45 ` Stas Sergeev @ 2004-09-18 16:59 ` Petr Vandrovec 2004-09-18 19:14 ` Stas Sergeev 0 siblings, 1 reply; 38+ messages in thread From: Petr Vandrovec @ 2004-09-18 16:59 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel On Sat, Sep 18, 2004 at 08:45:33PM +0400, Stas Sergeev wrote: > Hi Petr. > > Petr Vandrovec wrote: > >natural solution seems to be to create complete 16bit CPL1 > >environment, return to it, load ESP as you want, and then do IRET > >to return to CPL2 or CPL3. Fortunately V8086 mode is not affected, > >so there should be no problem with using CPL1 for this middle step. > >But of course it is not something you want to do on each return > >from interrupt handler... Well, or maybe you want... > Actually, this may indeed be what I want! > I think this can be implemented with the checks > that Denis Vlasenko suggests. Something like this > can be added to entry.S, right before the "iret": > --- > if (!(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)) > push_a_stack_frame_to_return_to_the_ring1_trampoline(); > --- > This way the overhead for the normal case would be > something about 4 asm insns (the check), and for the > dosemu case - who cares? (and probably also the wine > people will value that) > > Does this look reasonable? If it does, I think I > should just start implementing that. Do not forget that you have to implement also return to CPL1, as NMI may arrive while you are running on CPL1. So it may not be as trivial as it seemed. Maybe all these programs survive that their CPL3 stack changes, and then you could just push cs,eip and esp on CPL3 stack, and return to user code (vsyscall page seems natural place for that code) which would do pop %esp; retf? Only problem is how to find that old SS points to 16bit segment. You need LAR and/or you have to peek GDT/LDT to find stack size, and AFAIK LAR is microcoded on P4. 
Petr Vandrovec ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-18 16:59 ` Petr Vandrovec @ 2004-09-18 19:14 ` Stas Sergeev 2004-09-18 20:35 ` Petr Vandrovec 0 siblings, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-18 19:14 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel Hi, Petr Vandrovec wrote: >> Does this look reasonable? If it does, I think I >> should just start implementing that. > Do not forget that you have to implement also return to CPL1, as > NMI may arrive while you are running on CPL1. So it may not be > as trivial as it seemed. I am not sure what special actions have to be taken here compared to returning to ring-3 from NMI. Is there anywhere in the sources an example to take a look at? (sorry for the newbie questions) > Maybe all these programs survive that > their CPL3 stack changes, Most likely they will, I am just not sure. What if they disabled interrupts and are switching the stack by loading the SS and ESP separately? If we interrupt it there, there may be the problems, which would be almost impossible to track down later. It just looks a bit unsafe to me. Or maybe exploit a sigaltstack for that? Hmm, is implementing the CPL1 trampoline really that difficult after all? I think it is somewhat cleaner and maybe safer. > Only problem is how to find that old SS points to 16bit segment. > You need LAR and/or you have to peek GDT/LDT to find stack size, Yes, I was thinking about using LAR - looks like the most easy and fast way to just get that single bit out of LDT. > and AFAIK LAR is microcoded on P4. Where does this lead us to? Some other problems I am not aware about? ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 19:14 ` Stas Sergeev
@ 2004-09-18 20:35 ` Petr Vandrovec
  2004-09-22 18:49 ` Stas Sergeev
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Vandrovec @ 2004-09-18 20:35 UTC (permalink / raw)
To: Stas Sergeev; +Cc: linux-kernel

On Sat, Sep 18, 2004 at 11:14:44PM +0400, Stas Sergeev wrote:
> Hi,
>
> Petr Vandrovec wrote:
> >>Does this look reasonable? If it does, I think I
> >>should just start implementing that.
> >Do not forget that you have to implement also return to CPL1, as
> >NMI may arrive while you are running on CPL1. So it may not be
> >as trivial as it seemed.
> I am not sure what special actions have to be
> taken here compared to returning to ring-3 from NMI.
> Is there anywhere in the sources an example to take
> a look at? (sorry for the newbie questions)

It means that you cannot blindly create a CPL1 trampoline stack in
some static per-cpu area. But if we can assume that there is no other
CPL1 code in the system, something like the code below could work:

/* + 20 [word 5] SS
   + 16 [word 4] ESP
   + 12 [word 3] EFLAGS
   +  8 [word 2] CS
   +  4 [word 1] EIP
   +  0 [word 0] ESP for popl %esp */
u_int32_t cpl1stacks[NUM_CPUS][6];

curCPU = smp_processor_id();
minSP = curCPU * sizeof(cpl1stacks[0]);
maxSP = minSP + sizeof(cpl1stacks[0]);
cpl1stack = cpl1stacks[curCPU];
if ((cpl0stack[retCS] & 3) == 1) {
    /* Going back to our trampoline */
    /* There is no other place in the kernel running on CPL1 except
       our trampoline, so the interrupt could occur either on
       popl %esp or on iret. If it occurred on popl %esp, just
       return, the code will do the proper things.
       If the interrupt occurred on iret, we have to perform
       popl %esp again, so that the upper bits of %esp are correctly
       restored for the CPL3 code */
    ASSERT(cpl0stack[retCS] == FLAT_4G_CPL1_CS);
    ASSERT(cpl0stack[retSS] == SMALL_CPL1_SS);
    if (cpl0stack[retEIP] == fixup_proc) {
        ASSERT(cpl0stack[retESP] == minSP);
    } else if (cpl0stack[retEIP] == fixup_proc_iret) {
        ASSERT((cpl0stack[retESP] & 0xFFFF) == minSP + 4);
        /* Undo popl %esp - copy the value from the ESP we were
           using on CPL1 back to the stack */
        cpl1stack[0] = cpl0stack[retESP];
        cpl0stack[retEIP] = fixup_proc;
        cpl0stack[retESP] = minSP;
    } else {
        /* unexpected code running on CPL1 */
        /* Probably do a simple IRET and hope for the best? */
        ASSERT(0);
    }
    iret;
} else {
    cpl1stack[5] = cpl0stack[retSS];
    cpl1stack[4] = cpl0stack[retESP];
    cpl1stack[3] = cpl0stack[retEFLAGS];
    cpl1stack[2] = cpl0stack[retCS];
    cpl1stack[1] = cpl0stack[retEIP];
    cpl1stack[0] = (cpl0stack[retESP] & 0xFFFF0000) | (minSP + 4);
    cpl0stack[retSS] = SMALL_CPL1_SS;
    cpl0stack[retESP] = minSP;
    /* Do NOT clear IF... IF is affected only if IOPL >= CPL, so with
       IOPL=0 an IRET on CPL1 won't reenable interrupts. This is the
       reason why we cannot use RETF to return from CPL0 to CPL1
       (retf is much faster than iret on P4) (and we cannot use retf
       for CPL1->CPL3 due to TF/RF). Clear TF so we do not start
       tracing on CPL1 if we trace userspace, and clear RF, so that
       if somebody intentionally pointed a hardware breakpoint into
       the CPL1 handler, it will be triggered (we must use this path
       for returns from INT1 too, so it is possible that RF is set in
       the EFLAGS on the stack). */
    cpl0stack[retEFLAGS] &= ~(EFLAGS_TF | EFLAGS_RF);
    cpl0stack[retCS] = FLAT_4G_CPL1_CS;
    cpl0stack[retEIP] = fixup_proc;
    iret;
}

fixup_proc:
    popl %esp
fixup_proc_iret:
    iret

It assumes that there is one new 32bit CPL1 flat CS descriptor in the
GDT, and one 16bit (small) CPL1 SS descriptor (grows up, with limit
24*<num_cpus> and base of cpl1stacks), plus <num_cpus> 24-byte CPL1
stacks (max. 64KB, so for kernels with more than 2730 CPUs you need
more than one stack descriptor).

> >Maybe all these programs survive that
> >their CPL3 stack changes,
> Most likely they will, I am just not sure. What
> if they disabled interrupts and are switching the
> stack by loading the SS and ESP separately? If we
> interrupt it there, there may be the problems, which
> would be almost impossible to track down later.
> It just looks a bit unsafe to me. Or maybe exploit
> a sigaltstack for that? Hmm, is implementing the
> CPL1 trampoline really that difficult after all?
> I think it is somewhat cleaner and maybe safer.
As the pseudocode above shows, I was able to handle even NMIs with
just 24 bytes of stack on CPL1, so it is definitely the preferred
solution.
> >and AFAIK LAR is microcoded on P4.
> Where does this lead us to? Some other problems I
> am not aware about?
It is slow. No other problems, except that doing
  (((ss & 4) ? gdt[ss >> 3] : ldt[ss >> 3]) & 0x00????00) == 0x00????00;
may be faster than doing
  (lar(ss) & 0x00????00) == 0x00????00.
    Petr
^ permalink raw reply	[flat|nested] 38+ messages in thread
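[Editorial note: the `(ss & 4) ? gdt[ss >> 3] : ldt[ss >> 3]` expression above leans on the layout of an x86 segment selector. As a reminder, the fields can be pulled apart like this; the helper is hypothetical, not kernel code.]

```c
#include <stdint.h>

/* An x86 segment selector packs three fields:
 *   bits 15..3 - index into the descriptor table
 *   bit  2     - TI: 0 selects the GDT, 1 selects the LDT
 *   bits 1..0  - RPL, the requested privilege level */
struct selector_fields {
    unsigned index;
    unsigned ti;
    unsigned rpl;
};

static inline struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f = {
        .index = sel >> 3,
        .ti    = (sel >> 2) & 1,
        .rpl   = sel & 3,
    };
    return f;
}
```

So `ss & 4` is the GDT/LDT choice and `ss >> 3` the table index, as in the one-liner above; a ring-1 selector such as `27*8 + 1` decodes to index 27, GDT, RPL 1.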
* Re: ESP corruption bug - what CPUs are affected? 2004-09-18 20:35 ` Petr Vandrovec @ 2004-09-22 18:49 ` Stas Sergeev 2004-09-22 19:19 ` Richard B. Johnson 2004-09-22 20:02 ` Petr Vandrovec 0 siblings, 2 replies; 38+ messages in thread From: Stas Sergeev @ 2004-09-22 18:49 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 868 bytes --] Hi Petr et al. I coded up something that remotely looks like the discussed patch. I have deviated from your proposals at some important points: 1. I am not allocating the ring1 stack separately, I am allocating it on a ring0 stack. Much simpler code, although non-reentrant/preempt-unsafe. 2. I am disabling the interrupts after all. That's because of the preempt-unsafeness. I pass up the IOPL=1 when necessary, to avoid problems. But I guess also with your technique the interrupts had to be disabled, unless the ring1 stack is per-thread. 3. I am using LAR. Do you really think it can be slower than the whole thing of locating LDT? Let me know if I did something stupid. The patch is attached. I tested (pretty much) everything in it, except probably the "popl %esp" restartability. But that one looks fairly simple and should work. Does this patch look good? 
[-- Attachment #2: linux-2.6.8-stacks2.diff --] [-- Type: text/x-patch, Size: 9860 bytes --] diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-10 13:28:35.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-09-22 15:39:50.000000000 +0400 @@ -70,9 +70,16 @@ CF_MASK = 0x00000001 TF_MASK = 0x00000100 IF_MASK = 0x00000200 -DF_MASK = 0x00000400 +DF_MASK = 0x00000400 +IOPL1_MASK = 0x00001000 +IOPL2_MASK = 0x00002000 +IOPL_MASK = 0x00003000 NT_MASK = 0x00004000 +RF_MASK = 0x00010000 VM_MASK = 0x00020000 +AC_MASK = 0x00040000 +VIF_MASK = 0x00080000 +VIP_MASK = 0x00100000 #ifdef CONFIG_PREEMPT #define preempt_stop cli @@ -81,6 +88,49 @@ #define resume_kernel restore_all #endif +/* + * First, reserve the 0x400 bytes on stack for the NMI + * handler - it uses the same stack. + * Second, push the additional iret frame to switch to + * the Ring-1 trampoline. The correct value of ESP + * higher word is pushed here and popped on Ring-1. + * espfix_trampoline is working on a "small" stack. + * We patch its base in GDT here to allow the Ring-1 code + * to access the right place on stack. This is preempt-unsafe + * and therefore must be done with the interrupts disabled. + * To allow Ring-1 code to re-enable interrupts, we make + * sure it will be executing with at least IOPL1. 
+ */ +#define NMI_STACK_RESERVE 0x400 +#define MAKE_ESPFIX \ + cli; \ + subl $NMI_STACK_RESERVE, %esp; \ + .rept 5; \ + pushl NMI_STACK_RESERVE+0x10(%esp); \ + .endr; \ + pushl %eax; \ + movl %esp, %eax; \ + pushl %ebp; \ + GET_THREAD_INFO(%ebp); \ + movl TI_cpu(%ebp), %ebp; \ + shll $GDT_SIZE_SHIFT, %ebp; \ + addl $cpu_gdt_table, %ebp; \ + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \ + shrl $16, %eax; \ + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \ + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \ + popl %ebp; \ + popl %eax; \ + pushw 14(%esp); \ + pushw $4; \ + pushl $__ESPFIX_SS; \ + pushl $0; \ + pushl 20(%esp); \ + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \ + orl $IOPL1_MASK, (%esp); \ + pushl $__ESPFIX_CS; \ + pushl $espfix_trampoline; + #define SAVE_ALL \ cld; \ pushl %es; \ @@ -122,9 +172,24 @@ .previous +/* If returning to Ring-3, not to V86, and with + * the small stack, try to fix the higher word of + * ESP, as the CPU won't restore it from stack. + * This is an "official" bug of all the x86-compatible + * CPUs, which we can try to work around to make + * dosemu happy. 
*/ #define RESTORE_ALL \ - RESTORE_REGS \ - addl $4, %esp; \ + movl EFLAGS(%esp), %eax; \ + movb CS(%esp), %al; \ + andl $(VM_MASK | 2), %eax; \ + cmpl $2, %eax; \ + jne 3f; \ + larl OLDSS(%esp), %eax; \ + testl $0x00400000, %eax; \ +3: RESTORE_REGS \ + leal 4(%esp), %esp; \ + jnz 1f; \ + MAKE_ESPFIX \ 1: iret; \ .section .fixup,"ax"; \ 2: sti; \ @@ -174,6 +239,23 @@ jmp do_lcall +ENTRY(espfix_trampoline) + popl %esp +espfix_past_esp: +1: iret +.section .fixup,"ax" +2: sti + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + pushl $11 + call do_exit +.previous +.section __ex_table,"a" + .align 4 + .long 1b,2b +.previous + ENTRY(ret_from_fork) pushl %eax call schedule_tail @@ -196,7 +278,7 @@ GET_THREAD_INFO(%ebp) movl EFLAGS(%esp), %eax # mix EFLAGS and CS movb CS(%esp), %al - testl $(VM_MASK | 3), %eax + testl $(VM_MASK | 2), %eax jz resume_kernel # returning to kernel or vm86-space ENTRY(resume_userspace) cli # make sure we don't miss an interrupt @@ -504,7 +586,12 @@ ENTRY(nmi) cmpl $sysenter_entry,(%esp) je nmi_stack_fixup - pushl %eax + cmpl $espfix_past_esp,(%esp) + jne 1f + /* restart the "popl %esp" */ + decl (%esp) + subw $4, 12(%esp) +1: pushl %eax movl %esp,%eax /* Do not access memory above the end of our stack page, * it might not exist. 
diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S --- linux-2.6.8-pcsp/arch/i386/kernel/head.S 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/head.S 2004-09-21 14:58:16.000000000 +0400 @@ -514,8 +514,8 @@ .quad 0x00009a0000000000 /* 0xc0 APM CS 16 code (16 bit) */ .quad 0x0040920000000000 /* 0xc8 APM DS data */ - .quad 0x0000000000000000 /* 0xd0 - unused */ - .quad 0x0000000000000000 /* 0xd8 - unused */ + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix CS */ + .quad 0x0000b2000000ffff /* 0xd8 - ESPfix SS */ .quad 0x0000000000000000 /* 0xe0 - unused */ .quad 0x0000000000000000 /* 0xe8 - unused */ .quad 0x0000000000000000 /* 0xf0 - unused */ diff -urN linux-2.6.8-pcsp/arch/i386/kernel/process.c linux-2.6.8-stacks/arch/i386/kernel/process.c --- linux-2.6.8-pcsp/arch/i386/kernel/process.c 2004-06-21 09:19:07.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/process.c 2004-09-21 10:45:55.000000000 +0400 @@ -225,7 +225,7 @@ printk("EIP: %04x:[<%08lx>] CPU: %d\n",0xffff & regs->xcs,regs->eip, smp_processor_id()); print_symbol("EIP is at %s\n", regs->eip); - if (regs->xcs & 3) + if (regs->xcs & 2) printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp); printk(" EFLAGS: %08lx %s (%s)\n",regs->eflags, print_tainted(),UTS_RELEASE); printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n", diff -urN linux-2.6.8-pcsp/arch/i386/kernel/signal.c linux-2.6.8-stacks/arch/i386/kernel/signal.c --- linux-2.6.8-pcsp/arch/i386/kernel/signal.c 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/signal.c 2004-09-21 10:48:38.000000000 +0400 @@ -558,7 +558,7 @@ * kernel mode. Just return without doing anything * if so. 
*/ - if ((regs->xcs & 3) != 3) + if (!(regs->xcs & 2)) return 1; if (current->flags & PF_FREEZE) { diff -urN linux-2.6.8-pcsp/arch/i386/kernel/traps.c linux-2.6.8-stacks/arch/i386/kernel/traps.c --- linux-2.6.8-pcsp/arch/i386/kernel/traps.c 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/traps.c 2004-09-21 10:47:37.000000000 +0400 @@ -210,7 +210,7 @@ esp = (unsigned long) (®s->esp); ss = __KERNEL_DS; - if (regs->xcs & 3) { + if (regs->xcs & 2) { in_kernel = 0; esp = regs->esp; ss = regs->xss & 0xffff; @@ -264,7 +264,7 @@ char c; unsigned long eip; - if (regs->xcs & 3) + if (regs->xcs & 2) goto no_bug; /* Not in kernel */ eip = regs->eip; @@ -335,7 +335,7 @@ static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err) { - if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs)) + if (!(regs->eflags & VM_MASK) && !(2 & regs->xcs)) die(str, regs, err); } @@ -357,7 +357,7 @@ goto trap_signal; } - if (!(regs->xcs & 3)) + if (!(regs->xcs & 2)) goto kernel_trap; trap_signal: { @@ -437,7 +437,7 @@ if (regs->eflags & VM_MASK) goto gp_in_vm86; - if (!(regs->xcs & 3)) + if (!(regs->xcs & 2)) goto gp_in_kernel; current->thread.error_code = error_code; @@ -614,7 +614,7 @@ * allowing programs to debug themselves without the ptrace() * interface. */ - if ((regs->xcs & 3) == 0) + if ((regs->xcs & 2) == 0) goto clear_TF_reenable; if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE) goto clear_TF; @@ -630,7 +630,7 @@ /* If this is a kernel mode trap, save the user PC on entry to * the kernel, that's what the debugger can make sense of. */ - info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : + info.si_addr = ((regs->xcs & 2) == 0) ? 
(void *)tsk->thread.eip : (void *)regs->eip; force_sig_info(SIGTRAP, &info, tsk); diff -urN linux-2.6.8-pcsp/arch/i386/mm/extable.c linux-2.6.8-stacks/arch/i386/mm/extable.c --- linux-2.6.8-pcsp/arch/i386/mm/extable.c 2004-06-09 15:44:19.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/mm/extable.c 2004-09-21 17:17:05.000000000 +0400 @@ -26,6 +26,11 @@ } #endif + if (unlikely(regs->xcs == __ESPFIX_CS)) + { + regs->xcs = __KERNEL_CS; + } + fixup = search_exception_tables(regs->eip); if (fixup) { regs->eip = fixup->fixup; diff -urN linux-2.6.8-pcsp/arch/i386/mm/fault.c linux-2.6.8-stacks/arch/i386/mm/fault.c --- linux-2.6.8-pcsp/arch/i386/mm/fault.c 2004-08-10 11:02:37.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/mm/fault.c 2004-09-21 16:03:28.000000000 +0400 @@ -77,7 +77,7 @@ u32 seg_ar, seg_limit, base, *desc; /* The standard kernel/user address space limit. */ - *eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg; + *eip_limit = (seg & 2) ? USER_DS.seg : KERNEL_DS.seg; /* Unlikely, but must come before segment checks. 
*/ if (unlikely((regs->eflags & VM_MASK) != 0)) diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h --- linux-2.6.8-pcsp/include/asm-i386/segment.h 2004-01-09 09:59:19.000000000 +0300 +++ linux-2.6.8-stacks/include/asm-i386/segment.h 2004-09-21 14:57:44.000000000 +0400 @@ -38,8 +38,8 @@ * 24 - APM BIOS support * 25 - APM BIOS support * - * 26 - unused - * 27 - unused + * 26 - ESPfix Ring-1 trampoline CS + * 27 - ESPfix Ring-1 small SS * 28 - unused * 29 - unused * 30 - unused @@ -71,14 +71,21 @@ #define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 6) #define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11) +#define GDT_ENTRY_ESPFIX_CS (GDT_ENTRY_KERNEL_BASE + 14) +#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 15) +#define __ESPFIX_CS (GDT_ENTRY_ESPFIX_CS * 8 + 1) +#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8 + 1) + #define GDT_ENTRY_DOUBLEFAULT_TSS 31 /* * The GDT has 32 entries */ -#define GDT_ENTRIES 32 - -#define GDT_SIZE (GDT_ENTRIES * 8) +#define GDT_ENTRIES_SHIFT 5 +#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT) +#define GDT_ENTRY_SHIFT 3 +#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT) +#define GDT_SIZE (1 << GDT_SIZE_SHIFT) /* Simple and small GDT entries for booting only */ ^ permalink raw reply [flat|nested] 38+ messages in thread
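[Editorial note on the recurring `& 3` to `& 2` change in the hunks above: the trampoline runs at ring 1 (`__ESPFIX_CS` is `selector*8 + 1`, i.e. RPL 1), so the kernel's "is this user mode?" tests must treat ring 1 as kernel. Testing bit 1 of the RPL does that, since only RPL 2 and 3 set it and RPL 2 is unused. A minimal sketch of the predicate; the selector values in the tests are illustrative, not taken from the patch.]

```c
/* Under the patch's convention only RPL 3 is user mode; RPL 0
 * (kernel proper) and RPL 1 (the ESPfix trampoline) both count as
 * kernel.  Bit 1 of the selector distinguishes the two groups. */
static inline int cs_is_user(unsigned long cs)
{
    return (cs & 2) != 0;
}
```

This is what lets an exception taken on the ring-1 trampoline go down the kernel resume path instead of the signal-delivery path.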
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 18:49 ` Stas Sergeev
@ 2004-09-22 19:19 ` Richard B. Johnson
  2004-09-22 20:03 ` Stas Sergeev
  2004-09-22 20:02 ` Petr Vandrovec
  1 sibling, 1 reply; 38+ messages in thread
From: Richard B. Johnson @ 2004-09-22 19:19 UTC (permalink / raw)
To: Stas Sergeev; +Cc: Petr Vandrovec, Linux kernel

What problem is this supposed to fix? ESP is __not__ corrupted when
returning to protected-mode or a different privilege level. You don't
'return' to protected mode from virtual-8086 mode, even though you
(can) use an `iret` instruction. The fundamental change is through
the EFLAGS register stored in the TSS.

The so-called bug is that when in real mode or in virtual-8086 mode,
the high word of ESP is not changed. It is not a bug because the high
word doesn't even exist when in VM-86 mode!! It is possible to use
the 32-bit prefix, when in 16-bit mode, to muck with the high word of
the stack, but that's not a documented procedure - it is a
side-effect of such an undocumented instruction.

There is no bug to fix. When the VM-86 mode transitions to 32-bit
protected mode, the stack is restored to the condition it was in just
prior to the transition to VM-86 mode, therefore you don't "use up"
any stack. The so-called bug is only cosmetic when somebody is
prowling around in undocumented shadows.

Please, somebody from Intel tell these guys to leave the thing alone.
I, for one, don't want a bunch of "fixes" that do nothing except
consume valuable RAM, making it near impossible to use later versions
of Linux in embedded systems.

On Wed, 22 Sep 2004, Stas Sergeev wrote:
> Hi Petr et al.
>
> I coded up something that remotely looks like the
> discussed patch.
> I have deviated from your proposals at some important
> points:
> 1. I am not allocating the ring1 stack separately,
> I am allocating it on a ring0 stack. Much simpler
> code, although non-reentrant/preempt-unsafe.
> 2. I am disabling the interrupts after all.
That's > because of the preempt-unsafeness. I pass up the > IOPL=1 when necessary, to avoid problems. > But I guess also with your technique the interrupts > had to be disabled, unless the ring1 stack is > per-thread. > 3. I am using LAR. Do you really think it can be > slower than the whole thing of locating LDT? > > Let me know if I did something stupid. > The patch is attached. > I tested (pretty much) everything in it, except > probably the "popl %esp" restartability. But that > one looks fairly simple and should work. > > Does this patch look good? > > Cheers, Dick Johnson Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips). Note 96.31% of all statistics are fiction. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-22 19:19 ` Richard B. Johnson
@ 2004-09-22 20:03 ` Stas Sergeev
  2004-09-22 20:13 ` Richard B. Johnson
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-22 20:03 UTC (permalink / raw)
To: root; +Cc: Petr Vandrovec, Linux kernel

Hi,

Richard B. Johnson wrote:
> What problem is this supposed to fix?
Richard, it will really help if you read the whole thread. I was
answering this to Denis Vlasenko already - he agreed. Do I have to
repeat the explanations?
> ESP is __not__ corrupted
Right, but it is not properly restored either, while it has to be.
> when returning to protected-mode or a different privilege level.
It gets "corrupted" (not properly restored) exactly when returning to
*protected mode* from another priv level. Please refer to the Intel
docs I pointed to in that thread earlier.
> You don't 'return' to protected mode from a virtual-8086 mode,
No one was speaking about v86. I don't see why you bring that up.
> The so-called bug is that when in real mode or in virtual-8086
> mode, the high word of ESP is not changed.
In short: wrong. The complete explanations are easily locatable in
that thread. And it has nothing to do with real mode either.
> It is not a bug
> because the high word doesn't even exist when in VM-86 mode!!
No one was speaking about v86, sorry. I am bypassing that part.
> It is possible to use the 32-bit prefix, when in 16-bit mode,
That's not about the prefixes either, sorry. We were talking about
32bit PM code.
> Please, somebody from Intel tell these guys to leave the thing
> alone.
Many thanks, they already left that alone once:) Maybe enough of
leaving the bugs alone? I have lots of DOS progs here that do not
work under dosemu without that patch, and work perfectly with it.
That should be enough. If you need an example - well, OpenCubicPlayer
for one. It will start, but crash as soon as the music is attempted
to play. The patch fixes it. Other progs you'll have problems
downloading anyway, but let me know if you need these.
> I, for one, don't want a bunch of "fixes" that do nothing
> except consume valuable RAM, making it near impossible to
> use later versions of Linux in embedded systems.
Well, my patch is purely in asm. How many valuable bytes does it take
from you? As for performance - 8 asm insns on the generic path. Not
too much either, if you ask me.
^ permalink raw reply	[flat|nested] 38+ messages in thread
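[Editorial note: for readers joining here, the misbehaviour under discussion is easy to model. The sketch below is only a simulation of what the Intel specification clarification describes for a 32-bit IRET to a 16-bit stack segment; it is not code from the patch. The low word of ESP comes from the saved frame while the high word silently keeps its current, kernel-side value.]

```c
#include <stdint.h>

/* Model the IRET behaviour: with a 32-bit target SS, ESP is fully
 * restored; with a 16-bit target SS, only the low word is loaded and
 * the high word keeps whatever the (kernel) stack pointer held. */
static uint32_t iret_esp(uint32_t cur_esp, uint32_t saved_esp,
                         int ss_is_16bit)
{
    if (!ss_is_16bit)
        return saved_esp;
    return (cur_esp & 0xFFFF0000u) | (saved_esp & 0x0000FFFFu);
}
```

A program that zeroed the high word of ESP itself can thus watch it turn into kernel-address bits (above TASK_SIZE) a few instructions later, which is exactly the crash mode described here.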
* Re: ESP corruption bug - what CPUs are affected? 2004-09-22 20:03 ` Stas Sergeev @ 2004-09-22 20:13 ` Richard B. Johnson 2004-09-28 15:43 ` Denis Vlasenko 0 siblings, 1 reply; 38+ messages in thread From: Richard B. Johnson @ 2004-09-22 20:13 UTC (permalink / raw) To: Stas Sergeev; +Cc: Petr Vandrovec, Linux kernel [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: TEXT/PLAIN; charset=X-UNKNOWN, Size: 21740 bytes --] On Thu, 23 Sep 2004, Stas Sergeev wrote: > Hi, > > Richard B. Johnson wrote: > > What problem is this supposed to fix? > Richard, it will really help if you read the > whole thread. I was answering this to Denis > Vlasenko already - he agreed. Do I have to > repeat the explanations? > > > ESP is __not__ corrupted > Right, but is not properly restored either, > while it have to be. > > > when returning to protected-mode or a different privilege level. > It gets "corrupted" (not properly restored) > exactly when returning to *protected mode* > from another priv level. Please refer to the > Intel docs I pointed to in that thread earlier. > > > You don't 'return' to protected mode from a virtual-8086 mode, > Noone was speaking about v86. I don't see why > you pick that up. > > > The so-called bug is that when in real mode or in virtual-8086 > > mode, the high word of ESP is not changed. > In short: Wrong. > The complete explanations are easily locateable > in that thread. And it have nothing to do with > the real mode either. > > > It is not a bug > > because the high word doesn't even exist when in VM-86 mode!! > Noone was speaking about v86, sorry. I am bypassing > that part. > > > It is possible to use the 32-bit prefix, when in 16-bit mode, > That's not about the prefixes either, sorry. > We were talking about the 32bit PM code. > > > Please, somebody from Intel tell these guys to leave the thing > > alone. > Thanks many, they already left that alone once:) > Maybe enough of leaving the bugs alone? 
> I have lots of the DOS progs here that do not
> work under dosemu without that patch, and work
> perfectly with it. It should be enough. If
> you need an examples - well, OpenCubicPlayer
> for one. It will start, but crash as soon as
> the music is attempted to play. The patch fixes
> it. Other progs you'll have problems downloading
> anyway, but let me know if you need these.
>
> > I, for one, don't want a bunch of "fixes" that do nothing
> > except consume valuable RAM, making it near impossible to
> > use later versions of Linux in embedded systems.
> Well, my patch is purely in asm. How many
> valueable bytes does it take from you?
> As for performance - 8 asm insns on a generic
> path. Not too much either, as for me.
>

Well, DOSEMU uses VM-86 mode. That's how it works. It creates a
virtual 8086 machine, complete with the required "DOS compatible"
virtual BIOS environment. I use it all the time because I write,
amongst other things, the complete BIOS and startup code for many
Intel based machines. I run compilers, assemblers, linkers, and
editors in that environment and it works.

Somebody mentions a completely unrelated so-called Intel bug and the
next thing you know, there are patches to work around the bug??? The
bug doesn't exist, period.

Here is a session in VM-86 mode, using DOSEMU. It does one hell of a
lot more work than your games.

[session transcript follows; terminal escape sequences stripped and
repeated screen-redraw output condensed for readability]

Script started on Wed Sep 22 16:00:02 2004
# godos
CPU speed set to 2793/1 MHz
Running on CPU=586, FPU=1, rdtsc=1
Linux DOS emulator 0.98.5.0 $Date: 99/01/15 $
Last configured at Wed Mar 24 12:44:16 EST 1999 on linux
Starting MS-DOS...
MS-DOS 6.22 Startup Menu
 5. Do not load any network
MS-DOS Version 6.22
C:\>cd pbios
C:\PBIOS>make clean
Microsoft (R) Program Maintenance Utility Version 4.07
Copyright (C) Microsoft Corp 1984-1988. All rights reserved.
C:\PBIOS>make pbios
Microsoft (R) C Optimizing Compiler Version 6.00A
Microsoft (R) Segmented-Executable Linker Version 5.13
Turbo Assembler Version 2.0 Copyright (c) 1988, 1990 Borland International
Assembling file: PBIOS.ASM to ABIOS.OBJ
Assembling file: PBIOS.ASM to BBIOS.OBJ
LINK/T ABIOS.OBJ, ABIOS.BIN;
TOOLS\FIXUP ABIOS.BIN PBIOS.ROM
TOOLS\IHEX PBIOS.ROM PBIOS.HEX 0 20 Y
IHEX V2.00 Analogic Software Tools
Reading from file PBIOS.ROM, writing to file PBIOS.HEX ...
Starting address (hex) 0000, block length (hex) 20
C:\PBIOS>
# exit
Script done on Wed Sep 22 16:00:39 2004

Even Linux 32-bit `script` works when I am in VM-86 mode.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.26 on an i686 machine (5570.56 BogoMips).
Note 96.31% of all statistics are fiction.
^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-22 20:13 ` Richard B. Johnson @ 2004-09-28 15:43 ` Denis Vlasenko 0 siblings, 0 replies; 38+ messages in thread From: Denis Vlasenko @ 2004-09-28 15:43 UTC (permalink / raw) To: root, Stas Sergeev; +Cc: Petr Vandrovec, Linux kernel On Wednesday 22 September 2004 23:13, Richard B. Johnson wrote: > Well DOSEMU uses VM-86 mode. That's how it works. It > creates a virtual 8086 machine, complete with the > required "DOS compatible" virtual BIOS environment. You forgot DOS extenders, which do not use VM86 but 32-bit protected mode. > I use it all the time because I write, amongst other things, > the complete BIOS and startup code for many Intel based > machines. > > I run compilers, assemblers, linkers, and editors in that > environment and it works. The fact that those programs work fine does not automatically mean that _other_ DOS programs will work. > Somebody mentions a completely unrelated so-called Intel > bug and next thing you know, there are patches to work > around the bug??? > > The bug doesn't exist period. Sorry, but it does exist. It is a really obscure bug. Please read on: > There is a "semi-official" bug in Intel CPUs, > which is described here: > http://www.intel.com/design/intarch/specupdt/27287402.PDF > chapter "Specification Clarifications" > section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers". > > A short quote: > "ISSUE: When a 32-bit IRET is used to return to another privilege level, > and the old level uses a 4G stack (D/B bit in the segment register = 1), > while the new level uses a 64k stack (D/B bit = 0), then only the > lower word of ESP is updated." > > Which means that the higher word of ESP gets > trashed. This bug beats dosemu rather badly, but there seems to be not much info about that bug available on the net. 
IRET-to-lower-CPL-level stack frame:

	+16	old_SS
	+12	old_ESP
	+8	old_EFLAGS
	+4	old_CS
	+0	old_EIP		<--- SS:ESP

> > If old_SS references 'small' data segment (16-bit one),
> > processor does not restore ESP from old_ESP.
> > It restores lower 16 bits only. Upper 16 bits are filled
> > with garbage (or just not modified at all?).
>
> Not modified at all, yes. That's why it is always
> greater than TASK_SIZE (0xc0000000) - it is still
> in a kernel space.
>
> > This works fine because processor uses SP, not ESP,
> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
>
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...
--
vda
^ permalink raw reply	[flat|nested] 38+ messages in thread
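The erratum described above is easy to model in a few lines. The following is a minimal C sketch (not kernel code; the function name is invented for illustration) of what the CPU effectively does to ESP on an IRET to a 16-bit stack segment: only the low word is loaded from the iret frame, so the high word keeps whatever the kernel stack pointer held, which is why the leaked bits always sit above TASK_SIZE.

```c
#include <stdint.h>

/* Hypothetical simulation of the IRET ESP update described in the
 * erratum.  kernel_esp is the CPL0 stack pointer at the moment of the
 * IRET, frame_esp is old_ESP from the iret frame. */
static uint32_t iret_esp(uint32_t kernel_esp, uint32_t frame_esp,
                         int ss_is_32bit)
{
    if (ss_is_32bit)
        return frame_esp;                   /* D/B = 1: full 32-bit reload */
    /* D/B = 0: only the low word is updated, the high word leaks
     * kernel-stack bits. */
    return (kernel_esp & 0xffff0000u) | (frame_esp & 0xffffu);
}
```

With a typical kernel stack around 0xc17f0000, a userspace frame_esp of 0x1000 comes back as 0xc17f1000, matching the observation that the corrupted value is always above 0xc0000000.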
* Re: ESP corruption bug - what CPUs are affected? 2004-09-22 18:49 ` Stas Sergeev 2004-09-22 19:19 ` Richard B. Johnson @ 2004-09-22 20:02 ` Petr Vandrovec 2004-09-23 4:09 ` Stas Sergeev 2004-09-23 17:08 ` Stas Sergeev 1 sibling, 2 replies; 38+ messages in thread From: Petr Vandrovec @ 2004-09-22 20:02 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel On Wed, Sep 22, 2004 at 10:49:45PM +0400, Stas Sergeev wrote: > Hi Petr et al. > > I coded up something that remotely looks like the > discussed patch. > I have deviated from your proposals at some important > points: > 1. I am not allocating the ring1 stack separately, > I am allocating it on a ring0 stack. Much simpler > code, although non-reentrant/preempt-unsafe. I think that it is problem. And do not forget that NMI can occur at any place, so I'm not quite sure how your code works. > 2. I am disabling the interrupts after all. That's > because of the preempt-unsafeness. I pass up the > IOPL=1 when necessary, to avoid problems. > But I guess also with your technique the interrupts > had to be disabled, unless the ring1 stack is > per-thread. What if exception occurs in CPL1 code? Processor reloads SS/ESP from TSS, pointing at top of CPL0 stack - where it overwrites CPL1 stack - - and it pushes on top of CPL0 stack address of CPL1 - and real return address for CPL3 was lost :-( Boom. You change base address for SS segment - in that case you need per-CPU SS segment, it wont' work with global SS segment. > 3. I am using LAR. Do you really think it can be > slower than the whole thing of locating LDT? Try measuring that. I think that LAR will be faster than lookup in memory, but... > > Let me know if I did something stupid. > The patch is attached. > I tested (pretty much) everything in it, except > probably the "popl %esp" restartability. But that > one looks fairly simple and should work. Did you tried setting SS/ESP on stack to invalid values, so IRET on CPL1 triggers fault? 
You should be able to do that by modifying SS/ESP from signal handler. > +#define NMI_STACK_RESERVE 0x400 > +#define MAKE_ESPFIX \ > + cli; \ > + subl $NMI_STACK_RESERVE, %esp; \ This is ugly. What if NMI handler uses more than 1KB? > + .rept 5; \ > + pushl NMI_STACK_RESERVE+0x10(%esp); \ > + .endr; \ > + pushl %eax; \ > + movl %esp, %eax; \ > + pushl %ebp; \ > + GET_THREAD_INFO(%ebp); \ > + movl TI_cpu(%ebp), %ebp; \ > + shll $GDT_SIZE_SHIFT, %ebp; \ > + addl $cpu_gdt_table, %ebp; \ > + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \ > + shrl $16, %eax; \ > + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \ > + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \ > + popl %ebp; \ > + popl %eax; \ > + pushw 14(%esp); \ What about referencing 14+NMI_STACK_RESERVE() ? And ESP is not multiple of 4 here - it may slow down processor a bit. I think that you should prepare stack by using moves to -NMI_STACK_RESERVE-xx(%esp) instead of adjusting stack and using pushes. Or you can (if you want per-thread and not per-CPU stack) prepare CPL1 stack above CPL0 stack - this way you steal 24 bytes from CPL0 stack, but CPL0 can use full 4KB (8KB), without NMI being limited by 1KB. > + pushw $4; \ > + pushl $__ESPFIX_SS; \ > + pushl $0; \ > + pushl 20(%esp); \ > + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \ > + orl $IOPL1_MASK, (%esp); \ Does this work if userspace runs with iopl 3,2 or 1? Does iret on CPL1 set iopl level properly (i.e. is iopl set if IOPL <= CPL, or only if CPL=0?). > + pushl $__ESPFIX_CS; \ > + pushl $espfix_trampoline; > + > #define SAVE_ALL \ > cld; \ > pushl %es; \ > @@ -122,9 +172,24 @@ > .previous > > > +/* If returning to Ring-3, not to V86, and with > + * the small stack, try to fix the higher word of > + * ESP, as the CPU won't restore it from stack. > + * This is an "official" bug of all the x86-compatible > + * CPUs, which we can try to work around to make > + * dosemu happy. 
*/ > #define RESTORE_ALL \ > - RESTORE_REGS \ > - addl $4, %esp; \ > + movl EFLAGS(%esp), %eax; \ > + movb CS(%esp), %al; \ > + andl $(VM_MASK | 2), %eax; \ > + cmpl $2, %eax; \ > + jne 3f; \ > + larl OLDSS(%esp), %eax; \ > + testl $0x00400000, %eax; \ > +3: RESTORE_REGS \ > + leal 4(%esp), %esp; \ > + jnz 1f; \ > + MAKE_ESPFIX \ > 1: iret; \ > .section .fixup,"ax"; \ I'm not sure about lea speed. And fast path should NOT jump, forward jumps are initialy predicted as not taken... Maybe: 3: RESTORE_REGS jz 8f add $4,%esp 1: iret 8: add $4,%esp MAKE_ESPFIX iret And then you can move MAKE_ESPFIX out of macro. And it seems to me that it adds too many operations to fast path. Maybe you want to introduce some per-thread flag, or leave syscall path to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?). > diff -urN linux-2.6.8-pcsp/arch/i386/kernel/process.c linux-2.6.8-stacks/arch/i386/kernel/process.c > --- linux-2.6.8-pcsp/arch/i386/kernel/process.c 2004-06-21 09:19:07.000000000 +0400 > +++ linux-2.6.8-stacks/arch/i386/kernel/process.c 2004-09-21 10:45:55.000000000 +0400 > @@ -225,7 +225,7 @@ > printk("EIP: %04x:[<%08lx>] CPU: %d\n",0xffff & regs->xcs,regs->eip, smp_processor_id()); > print_symbol("EIP is at %s\n", regs->eip); > > - if (regs->xcs & 3) > + if (regs->xcs & 2) > printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp); I think you should leave this as is, regs->xss/regs->esp should be valid for CPL1 exceptions. > printk(" EFLAGS: %08lx %s (%s)\n",regs->eflags, print_tainted(),UTS_RELEASE); > printk("EAX: %08lx EBX: %08lx ECX: %08lx EDX: %08lx\n", > diff -urN linux-2.6.8-pcsp/arch/i386/kernel/signal.c linux-2.6.8-stacks/arch/i386/kernel/signal.c > --- linux-2.6.8-pcsp/arch/i386/kernel/signal.c 2004-08-10 11:02:36.000000000 +0400 > +++ linux-2.6.8-stacks/arch/i386/kernel/signal.c 2004-09-21 10:48:38.000000000 +0400 > @@ -558,7 +558,7 @@ > * kernel mode. Just return without doing anything > * if so. 
> */ > - if ((regs->xcs & 3) != 3) > + if (!(regs->xcs & 2)) > return 1; Really? I think that it should stay. > @@ -630,7 +630,7 @@ > /* If this is a kernel mode trap, save the user PC on entry to > * the kernel, that's what the debugger can make sense of. > */ > - info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : > + info.si_addr = ((regs->xcs & 2) == 0) ? (void *)tsk->thread.eip : > (void *)regs->eip; Sure? Maybe this should be (regs->xcs & 3 != 3) instead? Rest looks fine, but I'm not completely sure that SMP and NMIs are handled in the best possible way. Petr Vandrovec ^ permalink raw reply [flat|nested] 38+ messages in thread
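For readers following the assembly in the review above, the RESTORE_ALL test can be restated in C. This is a hedged sketch, not the patch itself: the helper name and sample selector values are invented, and the constants mirror the patch (VM_MASK, the RPL bit-1 test, and bit 22 of the LAR result, which is the D/B bit in the descriptor's access rights).

```c
#include <stdint.h>
#include <stdbool.h>

#define VM_MASK    0x00020000u   /* EFLAGS.VM */
#define SEG_DB_BIT 0x00400000u   /* D/B bit as reported by LAR */

/* Sketch of the check done by:
 *   movl EFLAGS(%esp),%eax; movb CS(%esp),%al;
 *   andl $(VM_MASK|2),%eax; cmpl $2,%eax; jne out;
 *   larl OLDSS(%esp),%eax;  testl $0x00400000,%eax
 * The fix is wanted only when returning to non-vm86 code whose CS RPL
 * has bit 1 set (i.e. not ring 0) onto a 16-bit stack segment. */
static bool need_espfix(uint32_t eflags, uint16_t cs,
                        uint32_t ss_access_rights)
{
    uint32_t mix = (eflags & ~0xffu) | (cs & 0xffu);  /* movb CS(%esp),%al */
    if ((mix & (VM_MASK | 2u)) != 2u)
        return false;            /* vm86 mode, or a ring-0 return */
    return (ss_access_rights & SEG_DB_BIT) == 0;      /* 16-bit stack */
}
```

The `& 2` rather than `& 3` is exactly the s/3/2/ change debated in this thread: with the ring-1 trampoline in play, "came from the kernel" can no longer be tested as RPL == 0 vs RPL == 3 alone.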
* Re: ESP corruption bug - what CPUs are affected? 2004-09-22 20:02 ` Petr Vandrovec @ 2004-09-23 4:09 ` Stas Sergeev 2004-09-23 17:08 ` Stas Sergeev 1 sibling, 0 replies; 38+ messages in thread From: Stas Sergeev @ 2004-09-23 4:09 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel Hi, Petr Vandrovec wrote: >> 1. I am not allocating the ring1 stack separately, >> I am allocating it on a ring0 stack. Much simpler >> code, although non-reentrant/preempt-unsafe. > I think that it is problem. And do not forget that NMI can occur at any > place, so I'm not quite sure how your code works. Why it is not preempt-safe, I believe, is because if it is preempted after modifying GDT, and then another process walks that path modifying GDT, the first process will not return anywhere. But as for NMI this should not be the case, as it doesn't seem to call schedule(). > What if exception occurs in CPL1 code? Processor reloads SS/ESP from > TSS, pointing at top of CPL0 stack - where it overwrites CPL1 stack - > - and it pushes on top of CPL0 stack address of CPL1 - and real return > address for CPL3 was lost :-( Boom. But for that I reserve the room on stack, so that should not hurt. > You change base address for SS segment - in that case you need per-CPU > SS segment, it won't work with global SS segment. But I believe it *is* per-CPU, no? I am taking the TI_cpu value from the thread struct to look up the proper GDT, should that work? (no SMP-box here to test) > Did you try setting SS/ESP on stack to invalid values, so IRET on CPL1 > triggers fault? You should be able to do that by modifying SS/ESP from > signal handler. Yes, I did that - that seems to work (i.e. no Oops or something bad). >> +#define NMI_STACK_RESERVE 0x400 >> +#define MAKE_ESPFIX \ >> + cli; \ >> + subl $NMI_STACK_RESERVE, %esp; \ > This is ugly. What if NMI handler uses more than 1KB? Adjusting NMI_STACK_RESERVE? I need only a few bytes of stack to work with, so I can reserve any amount for NMI. 
What should it be? 0x400 was taken "randomly", but really, is it worth allocating another per-CPU object to only save a few bytes of the ring-0 stack? > What about referencing 14+NMI_STACK_RESERVE() ? Will work either, but for what? Just looks a bit cleaner to me to do it that way, or am I missing something? > And ESP is not > multiple of 4 here - it may slow down processor a bit. But I have to store the value of ((ESP & 0xffff) | 4) on stack, so that's why I do that. How can it be done in a more effective way? pushl, then movw? > I think that you should prepare stack by using moves to > -NMI_STACK_RESERVE-xx(%esp) instead of adjusting stack and using pushes. I can do that of course, do you think it is faster? > Or you can (if you want per-thread and not per-CPU stack) prepare CPL1 > stack above CPL0 stack - this way you steal 24 bytes from CPL0 stack, > but CPL0 can use full 4KB (8KB), without NMI being limited by 1KB. Ughh, I really need the motivation to fiddle with the CPL1 stack separately. I don't want to limit the NMI handler, and 0x400 was just a random value. Maybe just enlarging it will resolve the problem? >> + pushl 20(%esp); \ >> + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \ >> + orl $IOPL1_MASK, (%esp); \ > Does this work if userspace runs with iopl 3,2 or 1? Does iret on CPL1 Yes, that should work I think. > set iopl level properly (i.e. is iopl set if IOPL <= CPL, or only if CPL=0?). Ring-1 code can't alter the IOPL, but then I set IOPL3 initially. I just use ring3_IOPL|IOPL1, which should be able to handle all cases (i.e. I do "pushl 20(%esp)" exactly only to get the Ring-3 IOPL, I don't care about the other flags there). > I'm not sure about lea speed. And fast path should NOT jump, > forward jumps are initialy predicted as not taken... Maybe: No problems. > And then you can move MAKE_ESPFIX out of macro. Good idea! :) > And it seems to me that it adds too many operations to fast path. Ughh, 8 insns only! 
Is this really too much? :( > Maybe you want to introduce some per-thread flag, I don't know. Probably not. I'll think about that. > or leave syscall path > to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?). Yes. I'll see how that can be done. >> - if (regs->xcs & 3) >> + if (regs->xcs & 2) >> printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp); > I think you should leave this as is, regs->xss/regs->esp should be valid > for CPL1 exceptions. Very probably, I was not sure about all those s/3/2/g changes I made. So I'll remove those you mentioned, thanks. It was just that the kernel/user code should now be distinguished by regs->xcs&2 rather than regs->xcs&3, so I just replaced it everywhere. I'll have a closer look. Also it was suggested to me privately to try doing everything at Ring-0 after all (unfortunately the guy did not CC the message to LKML). I know that we use Ring-1 so that the NMI handler can work on the small stack, but the guy writes: ... but i think linux already has a task gate for NMI, or if not, it should. I should have a look also in that direction I think. ^ permalink raw reply [flat|nested] 38+ messages in thread
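The IOPL discussion in this exchange boils down to a small flags computation. Below is a sketch of the EFLAGS value the patch builds for the ring-1 trampoline's iret frame, using the mask values defined in entry.S (the function name is invented): start from the ring-3 flags so whatever IOPL the task already holds is preserved, clear the bits that must not be live in the trampoline, and force IOPL to at least 1 so the ring-1 code may re-enable interrupts.

```c
#include <stdint.h>

#define TF_MASK    0x00000100u
#define IF_MASK    0x00000200u
#define IOPL1_MASK 0x00001000u
#define NT_MASK    0x00004000u
#define RF_MASK    0x00010000u
#define AC_MASK    0x00040000u

/* Sketch of:  pushl 20(%esp);
 *             andl $~(IF_MASK|TF_MASK|RF_MASK|NT_MASK|AC_MASK), (%esp);
 *             orl $IOPL1_MASK, (%esp)                                  */
static uint32_t trampoline_eflags(uint32_t user_eflags)
{
    uint32_t f = user_eflags;
    f &= ~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK);
    f |= IOPL1_MASK;   /* IOPL >= 1, so CPL1 code can use sti */
    return f;
}
```

Note that a task already running with IOPL 3 keeps it (0x3202 becomes 0x3002), which is why the patch can "pass up" the user's IOPL instead of tracking it separately.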
* Re: ESP corruption bug - what CPUs are affected? 2004-09-22 20:02 ` Petr Vandrovec 2004-09-23 4:09 ` Stas Sergeev @ 2004-09-23 17:08 ` Stas Sergeev 2004-09-23 18:06 ` Petr Vandrovec 1 sibling, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-23 17:08 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 4021 bytes --] Hi. I'll reply to that mail again because my previous reply was done in a big hurry. The new patch is attached. I applied most of the optimizations you suggested and reverted the things you said were unnecessary/wrong - thanks. Does this one look better? Petr Vandrovec wrote: >> 1. I am not allocating the ring1 stack separately, >> I am allocating it on a ring0 stack. Much simpler >> code, although non-reentrant/preempt-unsafe. > I think that it is problem. And do not forget that NMI can occur at any > place, so I'm not quite sure how your code works. I assume this was a misunderstanding. Now I use "preempt_stop" instead of "cli" to make it more obvious what am I doing. So I assume that part is preempt-unsafe, but not NMI-unsafe. >> +#define NMI_STACK_RESERVE 0x400 >> +#define MAKE_ESPFIX \ >> + cli; \ >> + subl $NMI_STACK_RESERVE, %esp; \ > This is ugly. What if NMI handler uses more than 1KB? In that new patch I set the const to 0xe00, which is 3,5K. Is it still the limitation? I can probably raise it ever higher, but I am not sure if there is some sensible data below the stack that I may overwrite ocasionally? > And ESP is not > multiple of 4 here - it may slow down processor a bit. Fixed with "pushl/movw". > I'm not sure about lea speed. And fast path should NOT jump, > forward jumps are initialy predicted as not taken... Maybe: Done. Looks much better now! > And then you can move MAKE_ESPFIX out of macro. He-he, but RESTORE_ALL is a macro itself, so... :) > Maybe you want to introduce some per-thread flag, What kind of flag do you have in mind? 
Flag the modify_ldt() calls that set some segment to 16bit? But is it really worth the 8 insns I use? Checking the per-thread flag will cost you ~5 insns anyway, and a headache... > or leave syscall path > to corrupt ESP (dosemu apps should not call Linux INT syscall, yes?). I wanted to do that, but then recalled the comment of Pavel Machek, who thinks exporting the %esp value to user-space is an "informational leak". I personally don't see the security threat here, but it looks safe to assume that Pavel is right. In which case I think we have to root out the bug from every path, including the syscall. So I left that place as is for now. >> - if (regs->xcs & 3) >> + if (regs->xcs & 2) >> printk(" ESP: %04x:%08lx",0xffff & regs->xss,regs->esp); > I think you should leave this as is, regs->xss/regs->esp should be valid > for CPL1 exceptions. OK, everything you mentioned, is reverted. > Rest looks fine, but I'm not completely sure that SMP and NMIs are > handled in the best possible way. Well, even though my patch can be improved, I don't feel quite comfortable with its complexity. Now I really want to re-assure myself that this all is not possible to do on ring0, in which case this will simply be the trivial thing. Anonymous LKML reader (whom I have to thank for the hints) proposed 2 things: 1. Try lss followed by iret. In that case the interrupt will not kill us because after lss the interrupts are not accepted until the next insn is executed, AFAIK. But he was not sure, and I don't know either, if it is true even for NMI. So if it is - we could just do lss/iret, and the problem is solved. So is this true for NMI? Is this documented anywhere? 2. Set task gate for NMI. AFAICS, the task-gate is now used only for the doublefault exception, but not for NMI. 
If I understand the idea correctly, this will guarantee that the NMI will be executing on a separate stack, which may be a good idea in any case, and will allow us to use the small stack at ring-0, if only we disable the interrupts. Are there any chances to use the task-gate for NMI? I may even try to implement that myself, but I guess there are *reasons* why it is not yet? (of course this applies only if 1. fails) Or maybe somehow to modify the NMI handler itself, so that it will check if it is on a "small" stack, and switch to the "big" stack before anything else? Well, doing the whole trick at ring-0 sounds really plausible to me... [-- Attachment #2: linux-2.6.8-stacks3.diff --] [-- Type: text/x-patch, Size: 8504 bytes --] diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-10 13:28:35.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-09-23 16:14:55.209520160 +0400 @@ -70,9 +70,16 @@ CF_MASK = 0x00000001 TF_MASK = 0x00000100 IF_MASK = 0x00000200 -DF_MASK = 0x00000400 +DF_MASK = 0x00000400 +IOPL1_MASK = 0x00001000 +IOPL2_MASK = 0x00002000 +IOPL_MASK = 0x00003000 NT_MASK = 0x00004000 +RF_MASK = 0x00010000 VM_MASK = 0x00020000 +AC_MASK = 0x00040000 +VIF_MASK = 0x00080000 +VIP_MASK = 0x00100000 #ifdef CONFIG_PREEMPT #define preempt_stop cli @@ -122,8 +129,23 @@ .previous +#define NMI_STACK_RESERVE 0xe00 +/* If returning to Ring-3, not to V86, and with + * the small stack, try to fix the higher word of + * ESP, as the CPU won't restore it from stack. + * This is an "official" bug of all the x86-compatible + * CPUs, which we can try to work around to make + * dosemu happy. 
*/ #define RESTORE_ALL \ - RESTORE_REGS \ + movl EFLAGS(%esp), %eax; \ + movb CS(%esp), %al; \ + andl $(VM_MASK | 2), %eax; \ + cmpl $2, %eax; \ + jne 3f; \ + larl OLDSS(%esp), %eax; \ + testl $0x00400000, %eax; \ +3: RESTORE_REGS \ + jz 8f; \ addl $4, %esp; \ 1: iret; \ .section .fixup,"ax"; \ @@ -137,8 +159,43 @@ .section __ex_table,"a";\ .align 4; \ .long 1b,2b; \ -.previous - +.previous; \ + /* preparing the ESPfix here */ \ + /* reserve some space on stack for the NMI handler */ \ +8: subl $(NMI_STACK_RESERVE-4), %esp; \ + .rept 5; \ + pushl NMI_STACK_RESERVE+16(%esp); \ + .endr; \ + pushl %eax; \ + movl %esp, %eax; \ + pushl %ebp; \ + preempt_stop; \ + GET_THREAD_INFO(%ebp); \ + movl TI_cpu(%ebp), %ebp; \ + shll $GDT_SIZE_SHIFT, %ebp; \ + /* find GDT of the proper CPU */ \ + addl $cpu_gdt_table, %ebp; \ + /* patch the base of the ring-1 16bit stack */ \ + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \ + shrl $16, %eax; \ + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \ + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \ + popl %ebp; \ + popl %eax; \ + /* push the ESP value to preload on ring-1 */ \ + pushl 12(%esp); \ + movw $4, (%esp); \ + /* push the iret frame for our ring-1 trampoline */ \ + pushl $__ESPFIX_SS; \ + pushl $0; \ + /* push ring-3 flags only to get the IOPL right */ \ + pushl 20(%esp); \ + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \ + /* we need at least IOPL1 to re-enable interrupts */ \ + orl $IOPL1_MASK, (%esp); \ + pushl $__ESPFIX_CS; \ + pushl $espfix_trampoline; \ + iret; ENTRY(lcall7) @@ -174,6 +231,23 @@ jmp do_lcall +ENTRY(espfix_trampoline) + popl %esp +espfix_past_esp: +1: iret +.section .fixup,"ax" +2: sti + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + pushl $11 + call do_exit +.previous +.section __ex_table,"a" + .align 4 + .long 1b,2b +.previous + ENTRY(ret_from_fork) pushl %eax call schedule_tail @@ -196,7 +270,7 @@ 
GET_THREAD_INFO(%ebp) movl EFLAGS(%esp), %eax # mix EFLAGS and CS movb CS(%esp), %al - testl $(VM_MASK | 3), %eax + testl $(VM_MASK | 2), %eax jz resume_kernel # returning to kernel or vm86-space ENTRY(resume_userspace) cli # make sure we don't miss an interrupt @@ -504,7 +578,12 @@ ENTRY(nmi) cmpl $sysenter_entry,(%esp) je nmi_stack_fixup - pushl %eax + cmpl $espfix_past_esp,(%esp) + jne 1f + /* restart the "popl %esp" */ + decl (%esp) + subw $4, 12(%esp) +1: pushl %eax movl %esp,%eax /* Do not access memory above the end of our stack page, * it might not exist. diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S --- linux-2.6.8-pcsp/arch/i386/kernel/head.S 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/head.S 2004-09-21 14:58:16.000000000 +0400 @@ -514,8 +514,8 @@ .quad 0x00009a0000000000 /* 0xc0 APM CS 16 code (16 bit) */ .quad 0x0040920000000000 /* 0xc8 APM DS data */ - .quad 0x0000000000000000 /* 0xd0 - unused */ - .quad 0x0000000000000000 /* 0xd8 - unused */ + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix CS */ + .quad 0x0000b2000000ffff /* 0xd8 - ESPfix SS */ .quad 0x0000000000000000 /* 0xe0 - unused */ .quad 0x0000000000000000 /* 0xe8 - unused */ .quad 0x0000000000000000 /* 0xf0 - unused */ diff -urN linux-2.6.8-pcsp/arch/i386/kernel/traps.c linux-2.6.8-stacks/arch/i386/kernel/traps.c --- linux-2.6.8-pcsp/arch/i386/kernel/traps.c 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/traps.c 2004-09-23 10:39:53.000000000 +0400 @@ -210,7 +210,7 @@ esp = (unsigned long) (®s->esp); ss = __KERNEL_DS; - if (regs->xcs & 3) { + if (regs->xcs & 2) { in_kernel = 0; esp = regs->esp; ss = regs->xss & 0xffff; @@ -264,7 +264,7 @@ char c; unsigned long eip; - if (regs->xcs & 3) + if (regs->xcs & 2) goto no_bug; /* Not in kernel */ eip = regs->eip; @@ -335,7 +335,7 @@ static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err) { - if (!(regs->eflags & VM_MASK) 
&& !(3 & regs->xcs)) + if (!(regs->eflags & VM_MASK) && !(2 & regs->xcs)) die(str, regs, err); } @@ -357,7 +357,7 @@ goto trap_signal; } - if (!(regs->xcs & 3)) + if (!(regs->xcs & 2)) goto kernel_trap; trap_signal: { @@ -437,7 +437,7 @@ if (regs->eflags & VM_MASK) goto gp_in_vm86; - if (!(regs->xcs & 3)) + if (!(regs->xcs & 2)) goto gp_in_kernel; current->thread.error_code = error_code; @@ -614,7 +614,7 @@ * allowing programs to debug themselves without the ptrace() * interface. */ - if ((regs->xcs & 3) == 0) + if ((regs->xcs & 2) == 0) goto clear_TF_reenable; if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE) goto clear_TF; @@ -630,7 +630,7 @@ /* If this is a kernel mode trap, save the user PC on entry to * the kernel, that's what the debugger can make sense of. */ - info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : + info.si_addr = ((regs->xcs & 3) != 3) ? (void *)tsk->thread.eip : (void *)regs->eip; force_sig_info(SIGTRAP, &info, tsk); diff -urN linux-2.6.8-pcsp/arch/i386/mm/extable.c linux-2.6.8-stacks/arch/i386/mm/extable.c --- linux-2.6.8-pcsp/arch/i386/mm/extable.c 2004-06-09 15:44:19.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/mm/extable.c 2004-09-21 17:17:05.000000000 +0400 @@ -26,6 +26,11 @@ } #endif + if (unlikely(regs->xcs == __ESPFIX_CS)) + { + regs->xcs = __KERNEL_CS; + } + fixup = search_exception_tables(regs->eip); if (fixup) { regs->eip = fixup->fixup; diff -urN linux-2.6.8-pcsp/arch/i386/mm/fault.c linux-2.6.8-stacks/arch/i386/mm/fault.c --- linux-2.6.8-pcsp/arch/i386/mm/fault.c 2004-08-10 11:02:37.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/mm/fault.c 2004-09-21 16:03:28.000000000 +0400 @@ -77,7 +77,7 @@ u32 seg_ar, seg_limit, base, *desc; /* The standard kernel/user address space limit. */ - *eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg; + *eip_limit = (seg & 2) ? USER_DS.seg : KERNEL_DS.seg; /* Unlikely, but must come before segment checks. 
*/ if (unlikely((regs->eflags & VM_MASK) != 0)) diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h --- linux-2.6.8-pcsp/include/asm-i386/segment.h 2004-01-09 09:59:19.000000000 +0300 +++ linux-2.6.8-stacks/include/asm-i386/segment.h 2004-09-21 14:57:44.000000000 +0400 @@ -38,8 +38,8 @@ * 24 - APM BIOS support * 25 - APM BIOS support * - * 26 - unused - * 27 - unused + * 26 - ESPfix Ring-1 trampoline CS + * 27 - ESPfix Ring-1 small SS * 28 - unused * 29 - unused * 30 - unused @@ -71,14 +71,21 @@ #define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 6) #define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11) +#define GDT_ENTRY_ESPFIX_CS (GDT_ENTRY_KERNEL_BASE + 14) +#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 15) +#define __ESPFIX_CS (GDT_ENTRY_ESPFIX_CS * 8 + 1) +#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8 + 1) + #define GDT_ENTRY_DOUBLEFAULT_TSS 31 /* * The GDT has 32 entries */ -#define GDT_ENTRIES 32 - -#define GDT_SIZE (GDT_ENTRIES * 8) +#define GDT_ENTRIES_SHIFT 5 +#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT) +#define GDT_ENTRY_SHIFT 3 +#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT) +#define GDT_SIZE (1 << GDT_SIZE_SHIFT) /* Simple and small GDT entries for booting only */ ^ permalink raw reply [flat|nested] 38+ messages in thread
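The GDT patching in the attached patch (the movw/movb sequence into the ESPFIX_SS entry) relies on the standard i386 descriptor layout: the 32-bit segment base is scattered over bytes 2-3, byte 4 and byte 7 of the 8-byte entry. A C sketch of that encoding follows (helper names invented; this is for illustration, not a replacement for the kernel's descriptor helpers):

```c
#include <stdint.h>

/* Mirror of the entry.S sequence that patches the ring-1 stack base:
 *   movw %ax, (...+2)   - base bits 0..15  into bytes 2-3
 *   movb %al, (...+4)   - base bits 16..23 into byte 4
 *   movb %ah, (...+7)   - base bits 24..31 into byte 7            */
static void set_desc_base(uint8_t desc[8], uint32_t base)
{
    desc[2] = base & 0xff;
    desc[3] = (base >> 8) & 0xff;
    desc[4] = (base >> 16) & 0xff;
    desc[7] = (base >> 24) & 0xff;
}

static uint32_t get_desc_base(const uint8_t desc[8])
{
    return desc[2] | ((uint32_t)desc[3] << 8) |
           ((uint32_t)desc[4] << 16) | ((uint32_t)desc[7] << 24);
}
```

Because the base is split this way, the patch needs one 16-bit and two 8-bit stores per fixup, which is also why the GDT entry being per-CPU matters: two CPUs patching the same entry would race.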
* Re: ESP corruption bug - what CPUs are affected? 2004-09-23 17:08 ` Stas Sergeev @ 2004-09-23 18:06 ` Petr Vandrovec 2004-09-24 20:36 ` Stas Sergeev 0 siblings, 1 reply; 38+ messages in thread From: Petr Vandrovec @ 2004-09-23 18:06 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel On Thu, Sep 23, 2004 at 09:08:54PM +0400, Stas Sergeev wrote: > Petr Vandrovec wrote: > >>1. I am not allocating the ring1 stack separately, > >>I am allocating it on a ring0 stack. Much simpler > >>code, although non-reentrant/preempt-unsafe. > >I think that it is problem. And do not forget that NMI can occur at any > >place, so I'm not quite sure how your code works. > I assume this was a misunderstanding. Now I use > "preempt_stop" instead of "cli" to make it more > obvious what am I doing. So I assume that part is > preempt-unsafe, but not NMI-unsafe. > > >>+#define NMI_STACK_RESERVE 0x400 > >>+#define MAKE_ESPFIX \ > >>+ cli; \ > >>+ subl $NMI_STACK_RESERVE, %esp; \ > >This is ugly. What if NMI handler uses more than 1KB? > In that new patch I set the const to 0xe00, which > is 3,5K. Is it still the limitation? I can probably > raise it ever higher, but I am not sure if there is > some sensible data below the stack that I may overwrite > ocasionally? For 4KB stacks 2KB looks better (if NMIs do not use its own stacks). > >Rest looks fine, but I'm not completely sure that SMP and NMIs are > >handled in the best possible way. > Well, even though my patch can be improved, I don't > feel quite comfortable with its complexity. > Now I really want to re-assure myself that this all > is not possible to do on ring0, in which case this > will simply be the trivial thing. > Anonymous LKML reader (whom I have to thank for the > hints) proposed 2 things: > 1. Try lss followed by iret. In that case the interrupt > will not kill us because after lss the interrupts are > not accepted until the next insn is executed, AFAIK. 
> But he was not sure, and I don't know either, if it is > true even for NMI. So if it is - we could just do > lss/iret, and the problem is solved. So is this true > for NMI? Is this documented anywhere? LSS is exempted from rule that interrupts cannot occur after operation modifying SS, as with LSS you are supposed to load SS together with ESP in one operation, and so you do not need interrupt holdoff. And when exception occurs on iret, it will happen with 16bit stack, and you are dead... > 2. Set task gate for NMI. AFAICS, the task-gate is > now used only for the doublefault exception, but not > for NMI. If I understand the idea correctly, this > will guarantee that the NMI will be executing on a > separate stack, which may be a good idea in any case, > and will allow us to use the small stack at ring-0, > if only we disable the interrupts. Are there any > chances to use the task-gate for NMI? I may even try to > implement that myself, but I guess there are *reasons* > why it is not yet? (of course this applies only if 1. > fails) Yes. But you still have to handle exceptions on iret to userspace :-( > Or maybe somehow to modify the NMI handler itself, > so that it will check if it is on a "small" stack, > and switch to the "big" stack before anything else? > Well, doing the whole trick at ring-0 sounds really > plausible to me... If you can do that, great. But you have to modify at least NMI, GP and SS fault handlers to reload SS/ESP with correct values. > diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S > --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-10 13:28:35.000000000 +0400 > +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-09-23 16:14:55.209520160 +0400 > @@ -122,8 +129,23 @@ > .previous > > > +#define NMI_STACK_RESERVE 0xe00 > +/* If returning to Ring-3, not to V86, and with > + * the small stack, try to fix the higher word of > + * ESP, as the CPU won't restore it from stack. 
> + * This is an "official" bug of all the x86-compatible > + * CPUs, which we can try to work around to make > + * dosemu happy. */ > #define RESTORE_ALL \ > - RESTORE_REGS \ > + movl EFLAGS(%esp), %eax; \ > + movb CS(%esp), %al; \ > + andl $(VM_MASK | 2), %eax; \ > + cmpl $2, %eax; \ > + jne 3f; \ > + larl OLDSS(%esp), %eax; \ > + testl $0x00400000, %eax; \ > +3: RESTORE_REGS \ > + jz 8f; \ > addl $4, %esp; \ > 1: iret; \ > .section .fixup,"ax"; \ > @@ -137,8 +159,43 @@ > .section __ex_table,"a";\ > .align 4; \ > .long 1b,2b; \ > -.previous > - > +.previous; \ > + /* preparing the ESPfix here */ \ > + /* reserve some space on stack for the NMI handler */ \ > +8: subl $(NMI_STACK_RESERVE-4), %esp; \ > + .rept 5; \ Any reason why you left this part of RESTORE_ALL macro? It can be moved out, IMHO. RESTORE_ALL is used twice in entry.S, so you could save one copy. Though I'm not sure why NMI handler simple does not jump to RESTORE_ALL we already have. > + pushl NMI_STACK_RESERVE+16(%esp); \ > + .endr; \ > + pushl %eax; \ > + movl %esp, %eax; \ > + pushl %ebp; \ > + preempt_stop; \ > + GET_THREAD_INFO(%ebp); \ > + movl TI_cpu(%ebp), %ebp; \ > + shll $GDT_SIZE_SHIFT, %ebp; \ > + /* find GDT of the proper CPU */ \ > + addl $cpu_gdt_table, %ebp; \ I did not know we have per-CPU GDT tables... This explains it. 
> + /* patch the base of the ring-1 16bit stack */ \ > + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp); \ > + shrl $16, %eax; \ > + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp); \ > + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp); \ > + popl %ebp; \ > + popl %eax; \ > + /* push the ESP value to preload on ring-1 */ \ > + pushl 12(%esp); \ > + movw $4, (%esp); \ > + /* push the iret frame for our ring-1 trampoline */ \ > + pushl $__ESPFIX_SS; \ > + pushl $0; \ > + /* push ring-3 flags only to get the IOPL right */ \ > + pushl 20(%esp); \ > + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \ > + /* we need at least IOPL1 to re-enable interrupts */ \ > + orl $IOPL1_MASK, (%esp); \ > + pushl $__ESPFIX_CS; \ > + pushl $espfix_trampoline; \ > + iret; FYI, on my system (P4/1.6GHz) 100M loops of these pushes takes 1.20sec while written with subl $24,%esp and then doing movs to xx(%esp) takes 0.94sec. Plus you could then reorganize code a bit (first do 5 pushes to copy stack, then subtract 24 from esp, and push eax/ebp after that. This way you can use %eax for computations you currently do in memory (which are probably most important for 27% slowdown between pushes and movs - I wrote mov code with using eax for EFLAGS1 & ESP1). > - .quad 0x0000000000000000 /* 0xd0 - unused */ > - .quad 0x0000000000000000 /* 0xd8 - unused */ > + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix CS */ > + .quad 0x0000b2000000ffff /* 0xd8 - ESPfix SS */ Set SS limit to 23 bytes, so misuse can be quickly catched? As SP=0 and we use 16bit stack, it goes from 0000 => FFFE, and you could corrupt memory 64KB beyond top of CPL1 stack. Best regards, Petr Vandrovec ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-23 18:06 ` Petr Vandrovec @ 2004-09-24 20:36 ` Stas Sergeev 2004-09-24 21:43 ` Petr Vandrovec 0 siblings, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-24 20:36 UTC (permalink / raw) To: Petr Vandrovec; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 2907 bytes --] Hi, Petr Vandrovec wrote: > In that new patch I set the const to 0xe00, which > is 3,5K. Is it still the limitation? I can probably > For 4KB stacks 2KB looks better OK, done that. Wondering though, for what? I don't need 2K myself, I need 24 bytes only. So what prevents me from raising the gap to 3.5K or somesuch? Why does 2K look better? >> 2. Set task gate for NMI. AFAICS, the task-gate is > Yes. But you still have to handle exceptions on iret > to userspace :-( Yes. Thanks, I see the problem now. And I suppose setting the task-gates also for all those exceptions is not an option... >> Or maybe somehow to modify the NMI handler itself, > If you can do that, great. But you have to modify > at least NMI, GP and SS fault handlers to reload > SS/ESP with correct values. Yes, that's a problem too. It doesn't look too difficult (I'll have to look up the TSS for the stack pointer, I guess, then do lss to it), but modifying all those handlers for only that small purpose is probably overkill... >> +.previous; \ >> + /* preparing the ESPfix here */ \ >> + /* reserve some space on stack for the NMI handler */ \ >> +8: subl $(NMI_STACK_RESERVE-4), %esp; \ >> + .rept 5; \ > Any reason why you left this part of RESTORE_ALL macro? The reason is that the previous part of the macro can jump to that part. So how can I divide those? > be moved out, IMHO. RESTORE_ALL is used twice in entry.S, so you > could save one copy. Do you mean the NMI return path doesn't need the ESP fix at all? Why? > Though I'm not sure why NMI handler simple > does not jump to RESTORE_ALL we already have. 
I can only change that and then un-macro the RESTORE_ALL completely. So I did that. Additionally I introduced the "short path" for the case where we know for sure that we are returning to the kernel. And I am not setting the exception handler there, because a failed return to the kernel should die() anyway. Is this correct? >> + pushl 20(%esp); \ >> + andl $~(IF_MASK | TF_MASK | RF_MASK | NT_MASK | AC_MASK), (%esp); \ >> + /* we need at least IOPL1 to re-enable interrupts */ \ >> + orl $IOPL1_MASK, (%esp); \ >> + pushl $__ESPFIX_CS; \ >> + pushl $espfix_trampoline; \ > FYI, on my system (P4/1.6GHz) 100M loops of these pushes takes 1.20sec > while written with subl $24,%esp and then doing movs to xx(%esp) takes > 0.94sec. OK, I wasn't sure which pushes you meant. I supposed you wanted me to replace those 5 pushes that copy the stack frame. > Plus you could then reorganize code a bit (first do 5 pushes > to copy stack, then subtract 24 from esp, and push eax/ebp after that. > This way you can use %eax for computations you currently do in memory Done, thanks! Does this help your test-case? >> + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix CS */ >> + .quad 0x0000b2000000ffff /* 0xd8 - ESPfix SS */ > Set SS limit to 23 bytes, so misuse can be quickly catched? Yes! 
So the new patch is attached:) [-- Attachment #2: linux-2.6.8-stacks4.diff --] [-- Type: text/x-patch, Size: 9930 bytes --] diff -urN linux-2.6.8-pcsp/arch/i386/kernel/entry.S linux-2.6.8-stacks/arch/i386/kernel/entry.S --- linux-2.6.8-pcsp/arch/i386/kernel/entry.S 2004-06-10 13:28:35.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/entry.S 2004-09-24 15:35:46.000000000 +0400 @@ -70,15 +70,22 @@ CF_MASK = 0x00000001 TF_MASK = 0x00000100 IF_MASK = 0x00000200 -DF_MASK = 0x00000400 +DF_MASK = 0x00000400 +IOPL1_MASK = 0x00001000 +IOPL2_MASK = 0x00002000 +IOPL_MASK = 0x00003000 NT_MASK = 0x00004000 +RF_MASK = 0x00010000 VM_MASK = 0x00020000 +AC_MASK = 0x00040000 +VIF_MASK = 0x00080000 +VIP_MASK = 0x00100000 #ifdef CONFIG_PREEMPT #define preempt_stop cli #else #define preempt_stop -#define resume_kernel restore_all +#define resume_kernel resume_kernel_nopreempt #endif #define SAVE_ALL \ @@ -122,25 +129,6 @@ .previous -#define RESTORE_ALL \ - RESTORE_REGS \ - addl $4, %esp; \ -1: iret; \ -.section .fixup,"ax"; \ -2: sti; \ - movl $(__USER_DS), %edx; \ - movl %edx, %ds; \ - movl %edx, %es; \ - pushl $11; \ - call do_exit; \ -.previous; \ -.section __ex_table,"a";\ - .align 4; \ - .long 1b,2b; \ -.previous - - - ENTRY(lcall7) pushfl # We get a different stack layout with call # gates, which has to be cleaned up later.. 
@@ -174,6 +162,23 @@ jmp do_lcall +ENTRY(espfix_trampoline) + popl %esp +espfix_past_esp: +1: iret +.section .fixup,"ax" +2: sti + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + pushl $11 + call do_exit +.previous +.section __ex_table,"a" + .align 4 + .long 1b,2b +.previous + ENTRY(ret_from_fork) pushl %eax call schedule_tail @@ -196,8 +201,8 @@ GET_THREAD_INFO(%ebp) movl EFLAGS(%esp), %eax # mix EFLAGS and CS movb CS(%esp), %al - testl $(VM_MASK | 3), %eax - jz resume_kernel # returning to kernel or vm86-space + testl $(VM_MASK | 2), %eax + jz resume_kernel ENTRY(resume_userspace) cli # make sure we don't miss an interrupt # setting need_resched or sigpending @@ -208,10 +213,15 @@ jne work_pending jmp restore_all +resume_kernel_nopreempt: + RESTORE_REGS + addl $4, %esp + iret + #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? - jnz restore_all + jnz resume_kernel_nopreempt need_resched: movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl @@ -294,7 +304,77 @@ testw $_TIF_ALLWORK_MASK, %cx # current->work jne syscall_exit_work restore_all: - RESTORE_ALL + /* If returning to Ring-3, not to V86, and with + * the small stack, try to fix the higher word of + * ESP, as the CPU won't restore it from stack. + * This is an "official" bug of all the x86-compatible + * CPUs, which we can try to work around to make + * dosemu happy. */ + movl EFLAGS(%esp), %eax + movb CS(%esp), %al + andl $(VM_MASK | 2), %eax + /* returning to user-space and not to v86? */ + cmpl $2, %eax + jne 3f + larl OLDSS(%esp), %eax + /* returning to "small" stack? 
*/ + testl $0x00400000, %eax +3: RESTORE_REGS + jz 8f # go fixing ESP instead of just to return :( + addl $4, %esp +1: iret +.section .fixup,"ax" +2: sti + movl $(__USER_DS), %edx + movl %edx, %ds + movl %edx, %es + pushl $11 + call do_exit +.previous +.section __ex_table,"a" + .align 4 + .long 1b,2b +.previous + /* preparing the ESPfix here */ +#define NMI_STACK_RESERVE 0x800 + /* reserve some space on stack for the NMI handler */ +8: subl $(NMI_STACK_RESERVE-4), %esp + .rept 5 + pushl NMI_STACK_RESERVE+16(%esp) + .endr + subl $24, %esp + pushl %eax + leal 24(%esp), %eax + pushl %ebp + preempt_stop + GET_THREAD_INFO(%ebp) + movl TI_cpu(%ebp), %ebp + shll $GDT_SIZE_SHIFT, %ebp + /* find GDT of the proper CPU */ + addl $cpu_gdt_table, %ebp + /* patch the base of the ring-1 16bit stack */ + movw %ax, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 2)(%ebp) + shrl $16, %eax + movb %al, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 4)(%ebp) + movb %ah, ((GDT_ENTRY_ESPFIX_SS << GDT_ENTRY_SHIFT) + 7)(%ebp) + popl %ebp + /* push the ESP value to preload on ring-1 */ + movl 28+12(%esp), %eax + movw $4, %ax + movl %eax, 4+20(%esp) + /* push the iret frame for our ring-1 trampoline */ + movl $__ESPFIX_SS, 4+16(%esp) + movl $0, 4+12(%esp) + /* push ring-3 flags only to get the IOPL right */ + movl 28+8(%esp), %eax + andl $IOPL_MASK, %eax + /* we need at least IOPL1 to re-enable interrupts */ + orl $IOPL1_MASK, %eax + movl %eax, 4+8(%esp) + popl %eax + movl $__ESPFIX_CS, 4(%esp) + movl $espfix_trampoline, (%esp) + iret # perform work that needs to be done immediately before resumption ALIGN @@ -504,7 +584,12 @@ ENTRY(nmi) cmpl $sysenter_entry,(%esp) je nmi_stack_fixup - pushl %eax + cmpl $espfix_past_esp,(%esp) + jne 1f + /* restart the "popl %esp" */ + decl (%esp) + subw $4, 12(%esp) +1: pushl %eax movl %esp,%eax /* Do not access memory above the end of our stack page, * it might not exist. 
@@ -523,7 +608,7 @@ pushl %edx call do_nmi addl $8, %esp - RESTORE_ALL + jmp restore_all nmi_stack_fixup: FIX_STACK(12,nmi_stack_correct, 1) diff -urN linux-2.6.8-pcsp/arch/i386/kernel/head.S linux-2.6.8-stacks/arch/i386/kernel/head.S --- linux-2.6.8-pcsp/arch/i386/kernel/head.S 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/head.S 2004-09-24 13:57:24.000000000 +0400 @@ -514,8 +514,8 @@ .quad 0x00009a0000000000 /* 0xc0 APM CS 16 code (16 bit) */ .quad 0x0040920000000000 /* 0xc8 APM DS data */ - .quad 0x0000000000000000 /* 0xd0 - unused */ - .quad 0x0000000000000000 /* 0xd8 - unused */ + .quad 0x00cfba000000ffff /* 0xd0 - ESPfix 32-bit CS */ + .quad 0x0000b20000000017 /* 0xd8 - ESPfix 16-bit SS */ .quad 0x0000000000000000 /* 0xe0 - unused */ .quad 0x0000000000000000 /* 0xe8 - unused */ .quad 0x0000000000000000 /* 0xf0 - unused */ diff -urN linux-2.6.8-pcsp/arch/i386/kernel/traps.c linux-2.6.8-stacks/arch/i386/kernel/traps.c --- linux-2.6.8-pcsp/arch/i386/kernel/traps.c 2004-08-10 11:02:36.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/kernel/traps.c 2004-09-23 17:54:52.000000000 +0400 @@ -210,7 +210,7 @@ esp = (unsigned long) (®s->esp); ss = __KERNEL_DS; - if (regs->xcs & 3) { + if (regs->xcs & 2) { in_kernel = 0; esp = regs->esp; ss = regs->xss & 0xffff; @@ -264,7 +264,7 @@ char c; unsigned long eip; - if (regs->xcs & 3) + if (regs->xcs & 2) goto no_bug; /* Not in kernel */ eip = regs->eip; @@ -335,7 +335,7 @@ static inline void die_if_kernel(const char * str, struct pt_regs * regs, long err) { - if (!(regs->eflags & VM_MASK) && !(3 & regs->xcs)) + if (!(regs->eflags & VM_MASK) && !(2 & regs->xcs)) die(str, regs, err); } @@ -357,7 +357,7 @@ goto trap_signal; } - if (!(regs->xcs & 3)) + if (!(regs->xcs & 2)) goto kernel_trap; trap_signal: { @@ -437,7 +437,7 @@ if (regs->eflags & VM_MASK) goto gp_in_vm86; - if (!(regs->xcs & 3)) + if (!(regs->xcs & 2)) goto gp_in_kernel; current->thread.error_code = error_code; @@ -614,7 +614,7 @@ * 
allowing programs to debug themselves without the ptrace() * interface. */ - if ((regs->xcs & 3) == 0) + if ((regs->xcs & 2) == 0) goto clear_TF_reenable; if ((tsk->ptrace & (PT_DTRACE|PT_PTRACED)) == PT_DTRACE) goto clear_TF; @@ -630,7 +630,7 @@ /* If this is a kernel mode trap, save the user PC on entry to * the kernel, that's what the debugger can make sense of. */ - info.si_addr = ((regs->xcs & 3) == 0) ? (void *)tsk->thread.eip : + info.si_addr = ((regs->xcs & 3) != 3) ? (void *)tsk->thread.eip : (void *)regs->eip; force_sig_info(SIGTRAP, &info, tsk); diff -urN linux-2.6.8-pcsp/arch/i386/mm/extable.c linux-2.6.8-stacks/arch/i386/mm/extable.c --- linux-2.6.8-pcsp/arch/i386/mm/extable.c 2004-06-09 15:44:19.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/mm/extable.c 2004-09-23 17:54:52.000000000 +0400 @@ -26,6 +26,11 @@ } #endif + if (unlikely(regs->xcs == __ESPFIX_CS)) + { + regs->xcs = __KERNEL_CS; + } + fixup = search_exception_tables(regs->eip); if (fixup) { regs->eip = fixup->fixup; diff -urN linux-2.6.8-pcsp/arch/i386/mm/fault.c linux-2.6.8-stacks/arch/i386/mm/fault.c --- linux-2.6.8-pcsp/arch/i386/mm/fault.c 2004-08-10 11:02:37.000000000 +0400 +++ linux-2.6.8-stacks/arch/i386/mm/fault.c 2004-09-23 17:54:52.000000000 +0400 @@ -77,7 +77,7 @@ u32 seg_ar, seg_limit, base, *desc; /* The standard kernel/user address space limit. */ - *eip_limit = (seg & 3) ? USER_DS.seg : KERNEL_DS.seg; + *eip_limit = (seg & 2) ? USER_DS.seg : KERNEL_DS.seg; /* Unlikely, but must come before segment checks. 
*/ if (unlikely((regs->eflags & VM_MASK) != 0)) diff -urN linux-2.6.8-pcsp/include/asm-i386/segment.h linux-2.6.8-stacks/include/asm-i386/segment.h --- linux-2.6.8-pcsp/include/asm-i386/segment.h 2004-01-09 09:59:19.000000000 +0300 +++ linux-2.6.8-stacks/include/asm-i386/segment.h 2004-09-23 17:54:52.000000000 +0400 @@ -38,8 +38,8 @@ * 24 - APM BIOS support * 25 - APM BIOS support * - * 26 - unused - * 27 - unused + * 26 - ESPfix Ring-1 trampoline CS + * 27 - ESPfix Ring-1 small SS * 28 - unused * 29 - unused * 30 - unused @@ -71,14 +71,21 @@ #define GDT_ENTRY_PNPBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 6) #define GDT_ENTRY_APMBIOS_BASE (GDT_ENTRY_KERNEL_BASE + 11) +#define GDT_ENTRY_ESPFIX_CS (GDT_ENTRY_KERNEL_BASE + 14) +#define GDT_ENTRY_ESPFIX_SS (GDT_ENTRY_KERNEL_BASE + 15) +#define __ESPFIX_CS (GDT_ENTRY_ESPFIX_CS * 8 + 1) +#define __ESPFIX_SS (GDT_ENTRY_ESPFIX_SS * 8 + 1) + #define GDT_ENTRY_DOUBLEFAULT_TSS 31 /* * The GDT has 32 entries */ -#define GDT_ENTRIES 32 - -#define GDT_SIZE (GDT_ENTRIES * 8) +#define GDT_ENTRIES_SHIFT 5 +#define GDT_ENTRIES (1 << GDT_ENTRIES_SHIFT) +#define GDT_ENTRY_SHIFT 3 +#define GDT_SIZE_SHIFT (GDT_ENTRIES_SHIFT + GDT_ENTRY_SHIFT) +#define GDT_SIZE (1 << GDT_SIZE_SHIFT) /* Simple and small GDT entries for booting only */ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-24 20:36 ` Stas Sergeev @ 2004-09-24 21:43 ` Petr Vandrovec 2004-09-25 8:04 ` Gabriel Paubert 0 siblings, 1 reply; 38+ messages in thread From: Petr Vandrovec @ 2004-09-24 21:43 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel On Sat, Sep 25, 2004 at 12:36:15AM +0400, Stas Sergeev wrote: > Hi, > > Petr Vandrovec wrote: > >In that new patch I set the const to 0xe00, which > >is 3,5K. Is it still the limitation? I can probably > >For 4KB stacks 2KB looks better > OK, done that. Wondering though, for what? > I don't need 2K myself, I need 24 bytes only. > So what prevents me to raise the gap to 3.5K > or somesuch? Why 2K looks better? You run with ESP decreased by 2KB for some time during CPL1 stack setup. As you run in this part at CPL0 with the same setup as on CPL1, I think that you should offer the same stack for the setup code and for the CPL1 code, and so each should get 2KB. > >>+8: subl $(NMI_STACK_RESERVE-4), %esp; \ > >>+ .rept 5; \ > >Any reason why you left this part of RESTORE_ALL macro? > The reason is that the previous part of the macro > can jump to that part. So how can I divide those? Use real labels instead of numeric local labels. That way the macro does jne cpl1exit, and you create the cpl1exit: label outside of the macro, somewhere in entry.S... > >be moved out, IMHO. RESTORE_ALL is used twice in entry.S, so you > >could save one copy. > Do you mean the NMI return path doesn't need > the ESP fix at all? Why? No. I was attempting to say that RESTORE_ALL is in two places to save jumps (I cannot think of any other reason), and so cpl1exit could be shared: #define RESTORE_ALL \ movl EFLAGS(%esp),%eax \ .... 
\ jne 3f \ lar OLDSS(%esp),%eax \ testl $0x00400000,%eax \ jz cpl1exit \ 3: RESTORE_REGS \ add $4,%esp \ 1: iret \ .section __ex_table,"a";\ .align 4; \ .long 1b,badss; \ .previous badss: sti; movl $(__USER_DS), %edx; movl %edx, %ds; movl %edx, %es; pushl $11; call do_exit; cpl1exit: RESTORE_REGS subl $...,%esp ... But your solution with killing RESTORE_ALL completely is also fine - you could just reorder jumps a bit, so return to kernel or VM86 is one (not taken) jump shorter than it is with the current version of the patch. > >Though I'm not sure why NMI handler simple > >does not jump to RESTORE_ALL we already have. > I can only change that and then un-macro the > RESTORE_ALL completely. So I did that. > Additionally I introduced the "short path" > for the case where we know for sure that we > are returning to the kernel. And I am not > setting the exception handler there because > returning to the kernel if fails, should die() > anyway. Is this correct? If you first test the result of lar, and then do two independent RESTORE_REGS (the second one actually does not have to restore %ebp and %eax, and you can restore them directly by a move from the stack after you are done with the trampoline setup, instead of doing push + pop), you can reuse the RESTORE_REGS+add $4,%esp+iret return path, including its exception handling. > +ENTRY(espfix_trampoline) > + popl %esp > +espfix_past_esp: > +1: iret > +.section .fixup,"ax" > +2: sti > + movl $(__USER_DS), %edx > + movl %edx, %ds > + movl %edx, %es > + pushl $11 > + call do_exit > +.previous You can reuse the fixup code from the other iret instance, just give it a real label and reference it from __ex_table instead of '2b'. The patch looks fine, though you could speed it up a bit as outlined above. Now you have to persuade others that the kernel should include this patch. Petr ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-24 21:43 ` Petr Vandrovec @ 2004-09-25 8:04 ` Gabriel Paubert 2004-09-25 12:25 ` Stas Sergeev 0 siblings, 1 reply; 38+ messages in thread From: Gabriel Paubert @ 2004-09-25 8:04 UTC (permalink / raw) To: Petr Vandrovec; +Cc: Stas Sergeev, linux-kernel On Fri, Sep 24, 2004 at 11:43:30PM +0200, Petr Vandrovec wrote: > On Sat, Sep 25, 2004 at 12:36:15AM +0400, Stas Sergeev wrote: > > Hi, > > > > Petr Vandrovec wrote: > > >In that new patch I set the const to 0xe00, which > > >is 3,5K. Is it still the limitation? I can probably > > >For 4KB stacks 2KB looks better > > OK, done that. Wondering though, for what? > > I don't need 2K myself, I need 24 bytes only. > > So what prevents me to raise the gap to 3.5K > > or somesuch? Why 2K looks better? > > You run with ESP decreased by 2KB for some time during > CPL1 stack setup. As you run in this part at CPL0 > with same setup as on CPL1, I think that you should > offer same stack for setup code, and for CPL1 code, > and so each should get 2KB. Maybe I miss something, but it seems that lret (or retl) is not affected by this bug. What prevents you from reordering the stack (doing the inverse operation of what the lcall7 entry point does) and then doing: popfl lret Yes, I see issues with debugging (trap flag mostly) but I believe that they are solvable. Regards, Gabriel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-25 8:04 ` Gabriel Paubert @ 2004-09-25 12:25 ` Stas Sergeev 2004-09-25 19:18 ` Gabriel Paubert 0 siblings, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-25 12:25 UTC (permalink / raw) To: Gabriel Paubert; +Cc: Petr Vandrovec, linux-kernel Hello. Gabriel Paubert wrote: > Maybe I miss something, but it seems that lret (or retl) > is not affected by this bug. Petr Vandrovec says (he forgot to CC that to LKML I think): --- Looking at VMware's code it seems that RETF suffers from this bug too... --- I tested that - he is right, and it makes no sense that the Intel docs don't mention this. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-25 12:25 ` Stas Sergeev @ 2004-09-25 19:18 ` Gabriel Paubert 2004-09-25 20:40 ` Stas Sergeev 0 siblings, 1 reply; 38+ messages in thread From: Gabriel Paubert @ 2004-09-25 19:18 UTC (permalink / raw) To: Stas Sergeev; +Cc: Petr Vandrovec, linux-kernel On Sat, Sep 25, 2004 at 04:25:43PM +0400, Stas Sergeev wrote: > Hello. > > Gabriel Paubert wrote: > >Maybe I miss something, but it seems that lret (or retl) > >is not affected by this bug. > Petr Vandrovec says (he forgot to CC that > to LKML I think): At least I did not see it. > --- > Looking at VMware's code it seems that RETF suffers from > this bug too... > --- > > I tested that - he is right, and Intel docs > make no sense as to not mentioning this. I suspected that they behaved differently because the pseudocode in iret's description is quite different (for iret, it does not even mention restoring ESP!). But if you expect Intel's doc, or that of any manufacturer for that matter, to tell the whole truth, you're naïve. Is ESP really properly restored for V86 mode, or is it that it does not hit the case of a default 32 bit code segment with a 16 bit stack? I'm absolutely amazed by the fact that this bug has been there since the beginning and only seems to hit users right now. I don't like adding an intermediate privilege level, but it looks hard to do through the vdso, and you always have to push something on the final stack. A hardware task switch might work, but it has problems with races on the segment descriptors. Anyway, I've just read again Intel's doc about mixing 16 and 32 bit code and I have found the understatement of the day: "For most efficient and trouble-free operation of the processor, 32-bit ^^^^^^^^^^^^ programs or tasks should have the D flag in the code-segment descriptor and the B flag in the stack-segment descriptor set, and 16-bit programs or tasks should have these flags clear. 
Program control transfers from 16-bit segments to 32-bit segments (and vice versa) are handled most efficiently through call, interrupt, or trap gates." Regards, Gabriel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-25 19:18 ` Gabriel Paubert @ 2004-09-25 20:40 ` Stas Sergeev 2004-09-25 23:42 ` Gabriel Paubert 0 siblings, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-25 20:40 UTC (permalink / raw) To: Gabriel Paubert; +Cc: linux-kernel Hi. Gabriel Paubert wrote: > Is ESP really properly restored for V86 mode or is it that > it does not hit the case of a default 32 bit code segment with > a 16 bit stack? It does not restore it in any case for the 16bit stack (no matter whether the code is 16 or 32 in PM). Petr says V86 is not affected, but I have not tested that case because why care? The problem is only for the 32bit code. For 16bit code (PM or V86) it just doesn't really matter I think (I don't think using prefixes for ESP is sane). > I'm absolutely amazed by the fact that this bug has been there > since the beginning and only seems to hit users right now. No, right now it just hits me:) It used to hit the dosemu users always, but people just blamed dosemu itself and that was it. No one wanted to spend weeks tracing the DOS programs under dosemu, then tracing dosemu itself, then tracing the kernel, then looking through the docs to track the problem down to something known, then writing to Intel's techsup for clarifications, and then writing to LKML:) And if not for the great help I got here, this would have ended up nowhere again. So that's how it used to stay "unnoticed" for years. As for the other instances of that problem, here are some: http://www.tenberry.com/dos4g/watcom/rn4gw.html --- B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug which mangles the high word of ESP. --- http://clio.rice.edu/djgpp/r5bug04.txt --- On VCPI servers which mode switch using ESP upper bits, CWSDPMI does not clear those bits when swapping to it's own stack. --- Searching Google reveals quite a few implicit instances of that problem. 
> Anyway, I've just read again Intel's doc about mixing 16 and 32 bit > code and I have found the understatement of the day: > "For most efficient and trouble-free operation of the processor, > ^^^^^^^^^^^^ What is there to expect? They knew there were troubles, yet decided to make it a part of the specs... ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-25 20:40 ` Stas Sergeev @ 2004-09-25 23:42 ` Gabriel Paubert 2004-09-26 18:04 ` Stas Sergeev 2004-09-30 15:11 ` Bill Davidsen 1 sibling, 2 replies; 38+ messages in thread From: Gabriel Paubert @ 2004-09-25 23:42 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel Hi, > It does not restore it in any case for the 16bit > stack (no matter whether the code is 16 or 32 in PM). > Petr says V86 is not affected, but I have not tested > that case because why to care? The problem is only for > the 32bit code. For 16bit code (PM or V86) it just > doesn't really matter I think (I don't think using > prefixes for ESP is sane). Well, trashing a register at any time under the user just because an interrupt happened is even less sane ;-) > > >I'm absolutely amazed by the fact that this bug has been there > >since the beginning and only seems to hit users right now. > No, right now it just hits me:) > It used to hit the dosemu users always, but people just > blamed dosemu itself and that was it. Noone wanted > to spend weeks traceing the DOS programs under dosemu, > then traceing dosemu itself, then traceing kernel, > then looking through the docs to track the problem down > to something known, then writing to Intel's techsup for > clarifications, and then writing to LKML:) And if not > for the great help I got here, this will end up nowhere > again. So that's how it used to stay "unnoticed" for > years. I see. And finally we know that the problem is in the processor, and that Intel changed the spec to conform to what the hardware does but is rather coy about it. > As for the other instances of that problem, here are some: > > http://www.tenberry.com/dos4g/watcom/rn4gw.html > --- > B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug > which mangles the high word of ESP. But if the bug also affects the Windows DOS box, it means that V86 is affected too, no? 
The DOS boxes and 16 bit mode DPMI applications ran very well and it was very stable, despite the fact that the kernel spent its time switching between 16 and 32 bit modes (for example, almost all device drivers were 16 bit code, but the drivers for DOS emulation were 32 bit). But I removed it from all my machines several years ago. > http://clio.rice.edu/djgpp/r5bug04.txt > --- > On VCPI servers which mode switch using ESP upper bits, CWSDPMI does not > clear those bits when swapping to it's own stack. > --- > > Searching google reveals quite a few implicit > instances of that problem. > > >Anyway, I've just read again Intel's doc about mixing 16 and 32 bit > >code and I have found the understament of the day: > >"For most efficient and trouble-free operation of the processor, > > ^^^^^^^^^^^^ > What is to expect? They knew there are troubles, > yet decided to make it a part of the specs... Of which specs? It is never clearly stated in the whole 4 volume doc of the architecture. Only in an obscure specification change... Gabriel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-25 23:42 ` Gabriel Paubert @ 2004-09-26 18:04 ` Stas Sergeev 2004-09-27 9:07 ` Gabriel Paubert 0 siblings, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-26 18:04 UTC (permalink / raw) To: Gabriel Paubert; +Cc: linux-kernel Hi. Gabriel Paubert wrote: >> the 32bit code. For 16bit code (PM or V86) it just >> doesn't really matter I think (I don't think using >> prefixes for ESP is sane). > Well, thrashing a register at any time under the user > just because an interrupt happened is even less sane ;-) OK, I agree, so I tested that: I wrote some value to the higher word of ESP in the struct that dosemu passes to the vm86() syscall, let the prog run for a while, and the value was still there. It is not a 100% reliable test because I can't guarantee the vm86() was interrupted during the period I was waiting (because vm86() is executed for short durations, then exits on fault or signal), but it looks OK, and why not believe Petr that v86 is not affected. >> http://www.tenberry.com/dos4g/watcom/rn4gw.html >> --- >> B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug >> which mangles the high word of ESP. > But if the bug is also affects Windows DOS box, it means that > V86 is affected too, no? No. This URL is about dos4gw, so that's about prot. mode as well. > I'd like to know what OS/2 did. AFAIK WinNT is also not affected. Not sure though. > The DOS boxes and 16 bit mode > DPMI applications ran very well and it was very stable, despite Hey. It is not about the 16bit DPMI mode, it is about the 32bit DPMI mode. That's the whole problem. If it were only about the 16bit DPMI mode, the problem might never harm anything at all. But for 32bit mode it's quite a problem. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-26 18:04 ` Stas Sergeev @ 2004-09-27 9:07 ` Gabriel Paubert 0 siblings, 0 replies; 38+ messages in thread From: Gabriel Paubert @ 2004-09-27 9:07 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel Hi, On Sun, Sep 26, 2004 at 10:04:55PM +0400, Stas Sergeev wrote: > OK, I agree, so I tested that: wrote some value > to the higher word of ESP in the struct that > dosemu passes to vm86() syscall, let the prog > to run for a while, and the value was still there. > It is not 100% reliable test because I can't > guarantee the vm86() was interrupted during the > period I was waiting (because vm86() is executed > for the short durations, then exits on fault or > signal), but it looks OK and why not to beleive > Petr that v86 is not affected. Should be enough, you'll likely get at least a timer interrupt in the middle of v86 mode sooner or later. > > >>http://www.tenberry.com/dos4g/watcom/rn4gw.html > >>--- > >>B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug > >> which mangles the high word of ESP. > >But if the bug is also affects Windows DOS box, it means that > >V86 is affected too, no? > No. This URL is about dos4gw, so that's about a > prot. mode either. I see, I missed the 32 in mouse32. > > >I'd like to know what OS/2 did. > AFAIK also WinNT is not affected. Not sure though. > > >The DOS boxes and 16 bit mode > >DPMI applications ran very well and it was very stable, despite > Hey. It is not about 16bit DPMI mode, it is > about the 32bit DPMI mode. That's the whole > problem. Be it only about the 16bit DPMI mode, > the problem may not ever harm anything at all. > But for 32bit mode that's quite a problem. Well, I did not express myself correctly. OS/2 was probably the most frightening mix of 16 and 32 bit code ever produced, so they certainly had to work around the problem. 
There were literally tons of "thunks" to interface both modes and the LDT was set up in a way which made the translation of addresses relatively easy: off32 = ((seg16>>3) << 16) + off16 Therefore the base address of LDT segment n was n*64k, limiting the addressable memory to 512MB in all versions I used (later versions lifted this limit), but making conversions of 16 to 32 bit addresses and back quite efficient. However, I can't remember exactly how the thunks worked. Gabriel ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-25 23:42 ` Gabriel Paubert 2004-09-26 18:04 ` Stas Sergeev @ 2004-09-30 15:11 ` Bill Davidsen 1 sibling, 0 replies; 38+ messages in thread From: Bill Davidsen @ 2004-09-30 15:11 UTC (permalink / raw) To: linux-kernel Gabriel Paubert wrote: > Hi, > > >>It does not restore it in any case for the 16bit >>stack (no matter whether the code is 16 or 32 in PM). >>Petr says V86 is not affected, but I have not tested >>that case because why to care? The problem is only for >>the 32bit code. For 16bit code (PM or V86) it just >>doesn't really matter I think (I don't think using >>prefixes for ESP is sane). > > > Well, thrashing a register at any time under the user > just because an interrupt happened is even less sane ;-) > > >>>I'm absolutely amazed by the fact that this bug has been there >>>since the beginning and only seems to hit users right now. >> >>No, right now it just hits me:) >>It used to hit the dosemu users always, but people just >>blamed dosemu itself and that was it. Noone wanted >>to spend weeks traceing the DOS programs under dosemu, >>then traceing dosemu itself, then traceing kernel, >>then looking through the docs to track the problem down >>to something known, then writing to Intel's techsup for >>clarifications, and then writing to LKML:) And if not >>for the great help I got here, this will end up nowhere >>again. So that's how it used to stay "unnoticed" for >>years. > > > I see. And finally we know that the problem with the > processor and that Intel changed the spec to conform > to what the hardware does but is rather coy about it. > > >>As for the other instances of that problem, here are some: >> >>http://www.tenberry.com/dos4g/watcom/rn4gw.html >>--- >>B ** Fixed the mouse32 handler to ignore a Microsoft Windows DOS box bug >> which mangles the high word of ESP. > > > But if the bug is also affects Windows DOS box, it means that > V86 is affected too, no? 
> > I'd like to know what OS/2 did. The DOS boxes and 16 bit mode > DPMI applications ran very well and it was very stable, despite > the fact that the kernel spent its time switching between 16 > and 32 bit modes (for example, almost all device drivers were > 16 bit code, but the drivers for DOS emulation were 32 bit). > But I removed it from all my machines several years ago. May I suggest that IBM is a friend of Linux, and that *if* there is a simple way to get around the problem they will probably be very willing to share it with us. -- -bill davidsen (davidsen@tmr.com) "The secret to procrastination is to put things off until the last possible moment - but no longer" -me ^ permalink raw reply [flat|nested] 38+ messages in thread
* ESP corruption bug - what CPUs are affected?
@ 2004-09-16 17:49 Stas Sergeev
2004-09-16 19:03 ` Denis Vlasenko
0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-16 17:49 UTC (permalink / raw)
To: linux-kernel
[-- Attachment #1: Type: text/plain, Size: 1638 bytes --]
Hello.
My question is not directly Linux-related,
but I think people here can give me a clue
on this issue.
There is a "semi-official" bug in Intel CPUs,
which is described here:
http://www.intel.com/design/intarch/specupdt/27287402.PDF
chapter "Specification Clarifications"
section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers".
A short quote:
"ISSUE: When a 32-bit IRET is used to return to another privilege level,
and the old level uses a 4G stack (D/B bit in the segment register = 1),
while the new level uses a 64k stack (D/B bit = 0), then only the
lower word of ESP is updated."
Which means that the higher word of ESP gets
trashed. This bug hits dosemu rather badly,
but there seems to be not much info about it
available on the net.
What I want to find out is which CPUs are
affected. I wrote a small test-case; it is
attached. I tried it on an Intel Pentium and
an AMD Athlon - the bug is present on both. But
I'd like to know if it is also present on a
Transmeta Crusoe, a Cyrix and other CPUs.
Can someone please run the test-case on some
of those CPUs and see how it goes?
Note: specifying "32" as an argument to the
test-case will make it use a 32bit segment
for the stack, and it will not detect
the bug. It is there to prove that it detects
the Right Thing. Run it without that argument.
I'd also like to hear any thoughts on whether
it is possible to work around the bug, perhaps
in the kernel? Well, I am not counting on such
a possibility, but who knows...
Anyway, I'd be glad to get any info on this bug.
Why it was not fixed for so many years also
looks like an interesting question to me.
[-- Attachment #2: stk.c --]
[-- Type: text/x-csrc, Size: 2636 bytes --]
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <linux/unistd.h>
#include <asm/ldt.h>
#include <asm/ucontext.h>
unsigned char stack[65536];
void my_segv(int signum, siginfo_t *info, void *c)
{
struct sigcontext *ctx = &((struct ucontext *)c)->uc_mcontext;
printf("In sighandler: esp=%lx\n", ctx->esp);
/* Skip HLT */
ctx->eip++;
}
_syscall3(int, modify_ldt, int, func, void *, ptr, unsigned long, bytecount);
static int set_ldt_entry(int entry, unsigned long base, unsigned int limit,
int seg_32bit_flag, int contents, int read_only_flag,
int limit_in_pages_flag, int seg_not_present, int useable)
{
struct modify_ldt_ldt_s ldt_info;
ldt_info.entry_number = entry;
ldt_info.base_addr = base;
ldt_info.limit = limit;
ldt_info.seg_32bit = seg_32bit_flag;
ldt_info.contents = contents;
ldt_info.read_exec_only = read_only_flag;
ldt_info.limit_in_pages = limit_in_pages_flag;
ldt_info.seg_not_present = seg_not_present;
ldt_info.useable = useable;
return modify_ldt(1, &ldt_info, sizeof(ldt_info));
}
int main(int argc, char *argv[])
{
unsigned short _ss, new_ss;
unsigned long _esp, new_esp;
int is32 = 0;
stack_t sig_stack;
struct sigaction sa;
if (argc > 1 && !strncmp(argv[1], "32", 3))
is32 = 1;
sig_stack.ss_sp = stack;
sig_stack.ss_flags = 0;
sig_stack.ss_size = sizeof(stack);
if (sigaltstack(&sig_stack, NULL)) {
perror("sigaltstack()");
return 1;
}
sa.sa_sigaction = my_segv;
sigemptyset(&sa.sa_mask);
sa.sa_flags = SA_ONSTACK | SA_SIGINFO;
sigaction(SIGSEGV, &sa, NULL);
/* Get SS, ESP */
asm volatile(
"movw %%ss, %0\n"
"movl %%esp, %1\n"
:"=m"(_ss), "=m"(_esp)
);
printf("sp=%#lx, ss=%#hx\n", _esp, _ss);
/* Force to LDT */
new_ss = _ss | 4;
/* Create the LDT entry */
set_ldt_entry(new_ss >> 3, 0, 0xbffff, is32, MODIFY_LDT_CONTENTS_DATA, 0, 1, 0, 0);
/* Do the trick... Switch stack and then switch context. */
asm volatile(
"movw %3, %%ss\n" /* Load our LDT selector to SS */
"hlt\n" /* Force the context switch */
"movw %1, %%ss\n" /* Restore SS ASAP */
"movl %%esp, %0\n" /* Get new ESP */
"movl %2, %%esp\n" /* Restore ESP */
:"=m"(new_esp)
: "m"(_ss), "m"(_esp), "m"(new_ss)
);
printf("Now sp=%#lx, ss=%#hx\n", new_esp, new_ss);
/* See what we've got... */
if (new_esp > 0xc0000000) {
printf("BUG!\n");
} else if (new_esp != _esp) {
printf("Esp changed, strange...\n");
} else {
printf("No bug here! What CPU is this?\n");
if (!is32)
system("cat /proc/cpuinfo");
}
return 0;
}
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: ESP corruption bug - what CPUs are affected? 2004-09-16 17:49 Stas Sergeev @ 2004-09-16 19:03 ` Denis Vlasenko 2004-09-17 18:13 ` Stas Sergeev 0 siblings, 1 reply; 38+ messages in thread From: Denis Vlasenko @ 2004-09-16 19:03 UTC (permalink / raw) To: Stas Sergeev, linux-kernel On Thursday 16 September 2004 20:49, Stas Sergeev wrote: > Hello. > > My question is not directly linux-related, > but I think people here can give me some clue > on an issue. > > There is a "semi-official" bug in Intel CPUs, > which is described here: > http://www.intel.com/design/intarch/specupdt/27287402.PDF > chapter "Specification Clarifications" > section 4: "Use Of ESP In 16-Bit Code With 32-Bit Interrupt Handlers". > > A short quote: > "ISSUE: When a 32-bit IRET is used to return to another privilege level, > and the old level uses a 4G stack (D/B bit in the segment register = 1), > while the new level uses a 64k stack (D/B bit = 0), then only the > lower word of ESP is updated." > > Which means that the higher word of ESP gets > trashed. This bug beats dosemu rather badly, > but there seem to be not much info about that > bug available on the net. Well. Strictly speaking it's not a bug. Code executes as it should. IRET returns to the correct location. Upper 16 bits of ESP do not affect execution in this case. You want CPU to preserve upper bits, but this will waste a handful of transistors practically for no gain. I think dosemu should just work around this. Is there some subtle problem with that in dosemu? > What I want to find out, is what CPUs are > affected. I wrote a small test-case. It is > attached. I tried it on Intel Pentium and > on AMD Athlon - bug is present on both. But > I'd like to know if it is also present on a > Transmeta Crusoe, Cyrix and other CPUs. > Can someone please run the test-case on some > of that CPUs and see how it goes? 
> Note: specifying "32" as an arg to the > test-case, will make it to use the 32bit > segment for the stack, and it will not detect > the bug. It is there to prove that it detects > the Right Thing. Run it without that argument. > > I'd also like to know any thoughts on whether > it is possible to work around the bug, probably > in a kernel? Well, I am not hoping on such a > possibility, but who knows... > Anyway, I'd be glad to get any info on that bug. > Why it was not fixed for so many years, looks > also like an interesting question, as for me. Because it does not matter for 99.9999% of users... # gcc -O2 stk.c # ./a.out sp=0xbffffb30, ss=0x7b In sighandler: esp=bffffb10 Now sp=0xccf1fb10, ss=0x7f Esp changed, strange... # cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 6 model : 8 model name : Unknown CPU Typ stepping : 1 cpu MHz : 1743.632 cache size : 64 KB fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow -- vda ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-16 19:03 ` Denis Vlasenko
@ 2004-09-17 18:13   ` Stas Sergeev
  2004-09-17 22:04     ` Denis Vlasenko
  0 siblings, 1 reply; 38+ messages in thread
From: Stas Sergeev @ 2004-09-17 18:13 UTC (permalink / raw)
To: Denis Vlasenko; +Cc: linux-kernel
Hello.
Denis Vlasenko wrote:
>> Which means that the higher word of ESP gets
>> trashed. This bug beats dosemu rather badly,
>> but there seem to be not much info about that
>> bug available on the net.
> Well. Strictly speaking it's not a bug.
> Code executes as it should. IRET returns to the correct
> location. Upper 16 bits of ESP do not affect execution
> in this case.
Unfortunately this is not true. Mainly because
the code is still 32bit, so when it operates with
the stack pointer directly, it gets ESP, not SP.
Here are a few examples from real DOS progs:
---
frstor [esp]       <--crash
---
push ebp
mov ebp,esp
mov edx,[ebp+0x8]  <--crash
---
and there are many more. "mov ebp,esp" is just a
standard thing, and then the ebp is used to
access the locals and function arguments. Why they
use the 16bit stack in that case is another
question. Unfortunately some of them do.
> You want CPU to preserve upper bits,
> but this will waste a handful of transistors
> practically for no gain.
I don't think so. After all, if only the B bit is
set, this is being done. So I guess all the
transistors are in place (well, of course this is
just a wild guess).
> I think dosemu should just work around this.
I see no way to work around this in user-space.
Esp. since dosemu doesn't control the execution
completely: it acts as a monitor, only getting
control when exceptions occur. For dosemu code
there are no problems, but the DOS program that
is running crashes.
> Is there some subtle problem with that in dosemu?
Nothing subtle I think. The problem is real and
quite hurts. Fortunately it happened that using
the dos32a extender instead of dos4gw "fixes" it
for the dos4gw-based apps.
But users want everything out of the box, and
never read docs or even the error messages in a
log, so this doesn't really help:) And there are
some non-dos4gw-based progs that are affected
as well.
>> Why it was not fixed for so many years, looks
>> also like an interesting question, as for me.
> Because it does not matter for 99.9999% of users...
Apparently you are underestimating the number of
dosemu users out there:) AFAIK it is also a
problem for the Win95 DOS box sometimes.
> # gcc -O2 stk.c
> # ./a.out
> sp=0xbffffb30, ss=0x7b
> In sighandler: esp=bffffb10
> Now sp=0xccf1fb10, ss=0x7f
> Esp changed, strange...
That's strange indeed! Apparently your ESP value,
0xccf1fb10, is greater than 0xc0000000, in which
case my program should just write "BUG!", but for
you it doesn't. Haven't you altered the code
somehow?
Why I compare it with 0xc0000000 is because ESP
has the higher word of a kernel's stack address,
so it should be above TASK_SIZE when corrupted.
^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-17 18:13 ` Stas Sergeev @ 2004-09-17 22:04 ` Denis Vlasenko 2004-09-18 10:58 ` Stas Sergeev 0 siblings, 1 reply; 38+ messages in thread From: Denis Vlasenko @ 2004-09-17 22:04 UTC (permalink / raw) To: Stas Sergeev; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 3458 bytes --] On Friday 17 September 2004 21:13, Stas Sergeev wrote: > Hello. > > Denis Vlasenko wrote: > >> Which means that the higher word of ESP gets > >> trashed. This bug beats dosemu rather badly, > >> but there seem to be not much info about that > >> bug available on the net. > > > > Well. Strictly speaking it's not a bug. > > Code executes as it should. IRET returns to the correct > > location. Upper 16 bits of ESP do not affect execution > > in this case. > > Unfortunately this is not true. Mainly because > the code is still 32bit, so when is operates with > the stack pointer directly, it gets ESP, not SP. > Here are a few examples from the real DOS progs: > --- > frstor [esp] <--crash > --- > push ebp > mov ebp,esp > mov edx,[ebp+0x8] <--crash > --- Hmmm. Let me dust off my i386 knwoledge. This is what happens: CPU is doing IRET in 32bit mode. Stack layout: +20 (empty) <--- SS:ESP +16 old_CS +12 old_EIP +8 old_EFLAGS +4 old_SS +0 old_ESP If old_SS references 'small' data segment (16-bit one), processor does not restore ESP from old_ESP. It restores lower 16 bits only. Upper 16 bits are filled with garbage (or just not modified at all?). This works fine because processor uses SP, not ESP, for subsequent push/pop/call/ret operations. But if code uses full ESP, thinking that upper 16 bits are zero, it will crash badly. Correct me if I'm wrong. > and there are much more. "mov ebp,esp" is just a > standard thing, and then the ebp is being used to > access the locals and function arguments. Why do > they use the 16bit stack in that case, is another > question. Unfortunately some of them do. Hmm. 
I think you need to *emulate* this IRET. Is this IRET happens in kernel or in dosemu code? > > # gcc -O2 stk.c > > # ./a.out > > sp=0xbffffb30, ss=0x7b > > In sighandler: esp=bffffb10 > > Now sp=0xccf1fb10, ss=0x7f > > Esp changed, strange... > > That's strange indeed! Apparently your ESP value, > 0xccf1fb10, is greater than 0xc0000000, in which > case my program should just write "BUG!", but for > you it doesn't. Haven't you altered the code > somehow? No, I didn't alter anything. Strange indeed. > Why I compare it with 0xc0000000, is > because ESP has the higher word of a kernel's stack > address, so it should be above the TASK_SIZE when > corrupted. Even more strange! Now I did modify the code (attached), and this is what I've got: # gcc -o stk2 -O2 stk2.c # ./stk2 sp=0xbffffb30, ss=0x7b In sighandler: esp=bffffb10 Now sp=cd79fb10 <========= Now sp=bffffb50 <========= !? Now sp=bffffb50, ss=007f Now sp=bffffb50 Now sp=bffffb50 Now sp=bffffb50 Esp changed, strange... I am thoroughly confused now. Two back-to-back 'printf("Now sp=%08lx\n", new_esp);' gave different result. How this can happen?? Corresponding gcc -S code. 
new_esp is at -180(%ebp): #APP movw -182(%ebp), %ss hlt movw -170(%ebp), %ss movl %esp, -180(%ebp) movl -176(%ebp), %esp #NO_APP addl $40, %esp pushl -180(%ebp) pushl $.LC4 call printf popl %eax popl %edx pushl -180(%ebp) pushl $.LC4 call printf addl $12, %esp movzwl -182(%ebp), %eax pushl %eax pushl -180(%ebp) pushl $.LC5 call printf popl %ecx popl %ebx pushl -180(%ebp) pushl $.LC4 call printf popl %eax popl %edx -- vda [-- Attachment #2: stk2.c --] [-- Type: text/x-csrc, Size: 2903 bytes --] #include <stdio.h> #include <string.h> #include <stdlib.h> #include <signal.h> #include <linux/unistd.h> #include <asm/ldt.h> #include <asm/ucontext.h> unsigned char stack[65536]; void my_segv(int signum, siginfo_t *info, void *c) { struct sigcontext *ctx = &((struct ucontext *)c)->uc_mcontext; printf("In sighandler: esp=%lx\n", ctx->esp); /* Skip HLT */ ctx->eip++; } _syscall3(int, modify_ldt, int, func, void *, ptr, unsigned long, bytecount); static int set_ldt_entry(int entry, unsigned long base, unsigned int limit, int seg_32bit_flag, int contents, int read_only_flag, int limit_in_pages_flag, int seg_not_present, int useable) { struct modify_ldt_ldt_s ldt_info; ldt_info.entry_number = entry; ldt_info.base_addr = base; ldt_info.limit = limit; ldt_info.seg_32bit = seg_32bit_flag; ldt_info.contents = contents; ldt_info.read_exec_only = read_only_flag; ldt_info.limit_in_pages = limit_in_pages_flag; ldt_info.seg_not_present = seg_not_present; ldt_info.useable = useable; return modify_ldt(1, &ldt_info, sizeof(ldt_info)); } int main(int argc, char *argv[]) { unsigned short _ss, new_ss; unsigned long _esp, new_esp; int is32 = 0; stack_t sig_stack; struct sigaction sa; if (argc > 1 && !strncmp(argv[1], "32", 3)) is32 = 1; sig_stack.ss_sp = stack; sig_stack.ss_flags = 0; sig_stack.ss_size = sizeof(stack); if (sigaltstack(&sig_stack, NULL)) { perror("sigaltstack()"); return 1; } sa.sa_sigaction = my_segv; sigemptyset(&sa.sa_mask); sa.sa_flags = SA_ONSTACK | SA_SIGINFO; 
sigaction(SIGSEGV, &sa, NULL); /* Get SS, ESP */ asm volatile( "movw %%ss, %0\n" "movl %%esp, %1\n" :"=m"(_ss), "=m"(_esp) ); printf("sp=%#lx, ss=%#hx\n", _esp, _ss); /* Force to LDT */ new_ss = _ss | 4; /* Create the LDT entry */ set_ldt_entry(new_ss >> 3, 0, 0xbffff, is32, MODIFY_LDT_CONTENTS_DATA, 0, 1, 0, 0); /* Do the trick... Switch stack and then switch context. */ asm volatile( "movw %3, %%ss\n" /* Load our LDT selector to SS */ "hlt\n" /* Force the context switch */ "movw %1, %%ss\n" /* Restore SS ASAP */ "movl %%esp, %0\n" /* Get new ESP */ "movl %2, %%esp\n" /* Restore ESP */ :"=m"(new_esp) : "m"(_ss), "m"(_esp), "m"(new_ss) ); printf("Now sp=%08lx\n", new_esp); printf("Now sp=%08lx\n", new_esp); printf("Now sp=%08lx, ss=%04hx\n", new_esp, new_ss); printf("Now sp=%08lx\n", new_esp); printf("Now sp=%08lx\n", new_esp); /* See what we've got... */ if (new_esp > 0xc0000000) { printf("Now sp=%08lx\n", new_esp); printf("BUG!\n"); } else if (new_esp != _esp) { printf("Now sp=%08lx\n", new_esp); printf("Esp changed, strange...\n"); } else { printf("Now sp=%08lx\n", new_esp); printf("No bug here! What CPU is this?\n"); if (!is32) system("cat /proc/cpuinfo"); } return 0; } ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-17 22:04 ` Denis Vlasenko @ 2004-09-18 10:58 ` Stas Sergeev 2004-09-18 13:08 ` Denis Vlasenko 2004-09-21 11:19 ` Pavel Machek 0 siblings, 2 replies; 38+ messages in thread From: Stas Sergeev @ 2004-09-18 10:58 UTC (permalink / raw) To: Denis Vlasenko; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 2952 bytes --] Hello. Denis Vlasenko wrote: >> This is what happens: CPU is doing IRET in 32bit mode. >> Stack layout: > +20 (empty) <--- SS:ESP > +16 old_CS > +12 old_EIP > +8 old_EFLAGS > +4 old_SS > +0 old_ESP JFYI, looks like you swapped CS<-->EIP and SS<-->ESP on your stack frame layout. > If old_SS references 'small' data segment (16-bit one), > processor does not restore ESP from old_ESP. > It restores lower 16 bits only. Upper 16 bits are filled > with garbage (or just not modified at all?). Not modified at all, yes. That's why it is always greater than TASK_SIZE (0xc0000000) - it is still in a kernel space. > This works fine because processor uses SP, not ESP, > for subsequent push/pop/call/ret operations. > But if code uses full ESP, thinking that upper 16 bits are zero, > it will crash badly. Correct me if I'm wrong. That's correct. But you should note that the program not just "thinks" that the upper 16 bits are zero. It writes zero there itself, and a few instructions later - oops, it is no longer zero... My test-case does "hlt" to force the context switch, but in fact it is not necessary. I tried an empty loop instead of HLT. If the loop is long enough and some IRQ being handled by the kernel in a mean time, the ESP is corrupted. So the user-space never knows when exactly it gets corrupted. And, as I mentioned already, since the code is 32bit, it just does the things like "mov ebp,esp" and uses ebp to access the function params and locals, which crashes, or uses ESP directly to access something, etc. > Hmm. I think you need to *emulate* this IRET. 
> Is this IRET happens in kernel or in dosemu code?
The problem happens only when iret is used to
switch to another priv level. So it is the kernel's
iret of course, something to the effect of entry.S:128
I think. The control is transferred from kernel code
to the DOS code directly. dosemu code is completely
bypassed. dosemu has no chance to intercept that.
The DOS program is running on its own (as long as
it doesn't trigger an exception, or SIGALRM comes),
and the CPU is corrupting its ESP in *any* place,
since it happens when an external IRQ arrives.
>> That's strange indeed! Apparently your ESP value,
>> 0xccf1fb10, is greater than 0xc0000000, in which
>> case my program should just write "BUG!", but for
>> you it doesn't. Haven't you altered the code
>> somehow?
> No, I didn't alter anything. Strange indeed.
So that proves that writing bug-free code, esp.
when asm is involved, is not always as easy as I
assume...
> I am thoroughly confused now. Two back-to-back
> 'printf("Now sp=%08lx\n", new_esp);'
> gave different result. How this can happen??
My code made the assumption that the ESP should
not change within the execution of main(). This is
true only if you don't enable the optimization (as
I did), or use the -fno-defer-pop option of gcc.
Fixed code is attached (just for the overall
completeness:) Should survive the optimization now.
[-- Attachment #2: stk1.c --] [-- Type: text/x-csrc, Size: 2646 bytes --] #include <stdio.h> #include <string.h> #include <stdlib.h> #include <signal.h> #include <linux/unistd.h> #include <asm/ldt.h> #include <asm/ucontext.h> unsigned char stack[65536]; void my_segv(int signum, siginfo_t *info, void *c) { struct sigcontext *ctx = &((struct ucontext *)c)->uc_mcontext; printf("In sighandler: esp=%lx\n", ctx->esp); /* Skip HLT */ ctx->eip++; } _syscall3(int, modify_ldt, int, func, void *, ptr, unsigned long, bytecount) static int set_ldt_entry(int entry, unsigned long base, unsigned int limit, int seg_32bit_flag, int contents, int read_only_flag, int limit_in_pages_flag, int seg_not_present, int useable) { struct modify_ldt_ldt_s ldt_info; ldt_info.entry_number = entry; ldt_info.base_addr = base; ldt_info.limit = limit; ldt_info.seg_32bit = seg_32bit_flag; ldt_info.contents = contents; ldt_info.read_exec_only = read_only_flag; ldt_info.limit_in_pages = limit_in_pages_flag; ldt_info.seg_not_present = seg_not_present; ldt_info.useable = useable; return modify_ldt(1, &ldt_info, sizeof(ldt_info)); } int main(int argc, char *argv[]) { unsigned short _ss, new_ss; unsigned long _esp, new_esp; int is32 = 0; stack_t sig_stack; struct sigaction sa; if (argc > 1 && !strncmp(argv[1], "32", 3)) is32 = 1; sig_stack.ss_sp = stack; sig_stack.ss_flags = 0; sig_stack.ss_size = sizeof(stack); if (sigaltstack(&sig_stack, NULL)) { perror("sigaltstack()"); return 1; } sa.sa_sigaction = my_segv; sigemptyset(&sa.sa_mask); sa.sa_flags = SA_ONSTACK | SA_SIGINFO; sigaction(SIGSEGV, &sa, NULL); /* Get SS */ asm volatile( "movw %%ss, %0\n" :"=m"(_ss) ); /* Force to LDT */ new_ss = _ss | 4; /* Create the LDT entry */ set_ldt_entry(new_ss >> 3, 0, 0xbffff, is32, MODIFY_LDT_CONTENTS_DATA, 0, 1, 0, 0); printf("old_ss=%#hx new_ss=%#hx\n", _ss, new_ss); /* Do the trick... Switch stack and then switch context. 
*/ asm volatile( "movl %%esp, %1\n" /* Save ESP */ "movw %2, %%ss\n" /* Load our LDT selector to SS */ "hlt\n" /* Force the context switch */ "movw %3, %%ss\n" /* Restore SS ASAP */ "movl %%esp, %0\n" /* Get new ESP */ "movl %1, %%esp\n" /* Restore ESP */ :"=m"(new_esp), "=a"(_esp) :"m"(new_ss), "m"(_ss) ); printf("old_esp=%#lx new_esp=%#lx\n", _esp, new_esp); /* See what we've got... */ if (new_esp > 0xc0000000) { printf("BUG!\n"); } else if (new_esp != _esp) { printf("Esp changed, strange...\n"); } else { printf("No bug here! What CPU is this?\n"); if (!is32) system("cat /proc/cpuinfo"); } return 0; } ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 10:58 ` Stas Sergeev
@ 2004-09-18 13:08   ` Denis Vlasenko
  2004-09-18 17:05     ` Stas Sergeev
  2004-09-21 11:19   ` Pavel Machek
  1 sibling, 1 reply; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-18 13:08 UTC (permalink / raw)
To: Stas Sergeev; +Cc: linux-kernel
On Saturday 18 September 2004 13:58, Stas Sergeev wrote:
> Hello.
>
> Denis Vlasenko wrote:
> >> This is what happens: CPU is doing IRET in 32bit mode.
> >> Stack layout:
> >
> > +20 (empty) <--- SS:ESP
> > +16 old_CS
> > +12 old_EIP
> > +8  old_EFLAGS
> > +4  old_SS
> > +0  old_ESP
>
> JFYI, looks like you swapped CS<-->EIP and SS<-->ESP
> on your stack frame layout.
Hm. More than that. My figure is totally incorrect.
Corrected one:
+16 old_SS
+12 old_ESP
+8  old_EFLAGS
+4  old_CS
+0  old_EIP <--- SS:ESP
Addresses grow from the bottom in this fig.
Note that offsets are always stored at a lower address
than the selectors they go with.
> > If old_SS references 'small' data segment (16-bit one),
> > processor does not restore ESP from old_ESP.
> > It restores lower 16 bits only. Upper 16 bits are filled
> > with garbage (or just not modified at all?).
>
> Not modified at all, yes. That's why it is always
> greater than TASK_SIZE (0xc0000000) - it is still
> in a kernel space.
>
> > This works fine because processor uses SP, not ESP,
> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
>
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...
> My test-case does "hlt" to force the context switch,
> but in fact it is not necessary. I tried an
> empty loop instead of HLT. If the loop is long
> enough and some IRQ being handled by the kernel
> in a mean time, the ESP is corrupted.
> So the user-space never knows when exactly it gets corrupted.
> And, as I mentioned already, since the code is 32bit,
> it just does the things like "mov ebp,esp" and
> uses ebp to access the function params and locals,
> which crashes, or uses ESP directly to access
> something, etc.
>
> > Hmm. I think you need to *emulate* this IRET.
> > Is this IRET happens in kernel or in dosemu code?
>
> The problem happens only when iret is used to
> switch to another priv level. So it is the kernel's
> iret of course, something to the effect of entry.S:128
> I think. The control is transferred from kernel code
> to the DOS code directly. dosemu code is completely
> bypassed. dosemu have no chances to intercept that.
> The DOS program is running on its own (as long as
> it doesn't trigger an exception, or SIGALRM comes),
> and the CPU is corrupting its ESP in *any* place,
> since it happens when an external IRQ arrives.
Aha. The only way to sanely handle it is to
hack on entry.S I'm afraid. Something like rewriting
CS:EIP so that it returns to a small ring-3 trampoline
which clears the upper 16 bits of ESP and jumps to the
original CS:EIP. You may use the ring-3 user stack as a
place to save the original CS:EIP, and then just do RETF.
This way, the trampoline boils down to
	movzwl %sp,%esp
	retf
which does not touch EFLAGS, so you don't need to bother
with preserving it. IRET will restore it, and the
trampoline wouldn't touch it anymore.
Now, how to detect when to use this? Hmm.... the simplest thing
is to check that
(old_ESP <= 0xffff) && !(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one)
This will cost us only one comparison in the normal
path, because typically the ESP of Linux executables
is greater than 0xffff.
P.S.
# gcc -o stk1 -O2 stk1.c
# ./stk1
old_ss=0x7b new_ss=0x7f
In sighandler: esp=bffffb30
old_esp=0xbffffb30 new_esp=0xc618fb30
BUG!
--
vda
^ permalink raw reply	[flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected? 2004-09-18 13:08 ` Denis Vlasenko @ 2004-09-18 17:05 ` Stas Sergeev [not found] ` <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua> 0 siblings, 1 reply; 38+ messages in thread From: Stas Sergeev @ 2004-09-18 17:05 UTC (permalink / raw) To: Denis Vlasenko; +Cc: linux-kernel Hi Denis. Denis Vlasenko wrote: > Aha. The only way to sanely handle it is to > hack on entry.S I'm afraid. Something like rewriting > CS:EIP so that it returns to a small ring-3 trampoline > which clears upper 16 bits of ESP and jumps to original CS:EIP. Well, I don't really want to clear the higher word of ESP (even if I want to do that in most cases), but I'd really like to just restore it properly. Of course (I think) ring-3 trampoline will not work for many reasons. The most obvious one is that it itself can be interrupted in any place. Another problem with it is that the return frame will have to be pushed to the stack of a DOS prog. This is not the good thing to do. dosemu avoids ever touching the stack of a DOS prog by setting up the sigaltstack. What *will* work, however, is a ring-1 trampoline, as Petr Vandrovec suggested. It can be executed with interrupts disabled (I think) so will not be interrupted, and it can (isn't it?) use the kernel stack, if I understand that correctly. > Now, how to detect when to use this? Hmm.... the simplest thing > is to check that > (old_ESP <= 0xffff) && !(old_EFLAGS & VM_MASK) && (descr_old_SS is 16bit one) Yes, that's an excellent idea (except for the ESP<=0xffff check - I don't think this one is necessary). > This will cost us only one comparison in the normal > path, Yes, so I think I should just try to implement that. Not as a ring-3 trampoline, but as a ring-1 one. > because typically ESP of Linus executables > is greater than 0xffff. I'd rather say, because the stack segment is 32bit for them always. And this all should work, as it seems to me. Thanks for the hint about the checking. 
^ permalink raw reply [flat|nested] 38+ messages in thread
[parent not found: <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua>]
* Re: ESP corruption bug - what CPUs are affected? [not found] ` <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua> @ 2004-09-22 19:05 ` Stas Sergeev 0 siblings, 0 replies; 38+ messages in thread From: Stas Sergeev @ 2004-09-22 19:05 UTC (permalink / raw) To: Denis Vlasenko; +Cc: linux-kernel Hi, Denis Vlasenko wrote: > Maybe. This would be a complicated thing. I bet it was! :) > Well. Not okay. Maybe this? > 1. We build IRET frame on ring1 stack for step 4 (see below), > modify IRET frame on ring0 stack so that intrs are disabled > and CS:EIP and SS:ESP point to values of ring1 code/stack. Much simpler: IRET frame to return to user is already there. Just push another one to return to ring1 first. > 2. IRET returns to ring1 code with dedicated *16-bit* ring1 stack > Upper word of ESP is wrong now, but we can safely fiddle with it. > 3. trampoline code fixes upper word of ESP (how?) popl %esp (as per Petr Vandrovec's suggestion) The value on stack is carefully prepared on ring0. > 4. trampoline IRETs to user code. > May work. Works! > ring1 stacks must be per-CPU. I allocate it on a ring0 stack. Noone seem to suggest that. Is this flawed for some reasons? >> ESP<=0xffff check - I don't think this one is >> necessary). > Any program which runs with 16bit stack and yet with > ESP > 0xffff is doing something *terminally* weird. > I think it is acceptable to leave this case unfixed. For what? We can have that fixed so why not? > You cannot check 16bitness of SS descriptor in two > insns. I do actually: larl OLDSS(%esp), %eax testl $0x00400000, %eax which is exactly two insns. Since I've forgot to CC the patch to you, I uploaded it here: http://www.dosemu.org/stas/linux-2.6.8-stacks2.diff so that you (or anyone interested) can make a review. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-18 10:58       ` Stas Sergeev
  2004-09-18 13:08         ` Denis Vlasenko
@ 2004-09-21 11:19           ` Pavel Machek
  2004-09-21 11:43             ` Denis Vlasenko
  1 sibling, 1 reply; 38+ messages in thread
From: Pavel Machek @ 2004-09-21 11:19 UTC (permalink / raw)
  To: Stas Sergeev; +Cc: Denis Vlasenko, linux-kernel

Hi!

> > for subsequent push/pop/call/ret operations.
> > But if code uses full ESP, thinking that upper 16 bits are zero,
> > it will crash badly. Correct me if I'm wrong.
> That's correct. But you should note that the
> program not just "thinks" that the upper 16 bits
> are zero. It writes zero there itself, and a few
> instructions later - oops, it is no longer zero...

Hmm, perhaps this can also be viewed as an "information leak"? A
program running under dosemu is not expected to know the high bits
of the kernel %esp...
								Pavel
-- 
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
* Re: ESP corruption bug - what CPUs are affected?
  2004-09-21 11:19           ` Pavel Machek
@ 2004-09-21 11:43             ` Denis Vlasenko
  0 siblings, 0 replies; 38+ messages in thread
From: Denis Vlasenko @ 2004-09-21 11:43 UTC (permalink / raw)
  To: Pavel Machek, Stas Sergeev; +Cc: linux-kernel

On Tuesday 21 September 2004 14:19, Pavel Machek wrote:
> Hi!
>
> > > for subsequent push/pop/call/ret operations.
> > > But if code uses full ESP, thinking that upper 16 bits are zero,
> > > it will crash badly. Correct me if I'm wrong.
> >
> > That's correct. But you should note that the
> > program not just "thinks" that the upper 16 bits
> > are zero. It writes zero there itself, and a few
> > instructions later - oops, it is no longer zero...
>
> Hmm, perhaps this can also be viewed as a "information leak"? Program
> running under dosemu is not expected to know high bits of kernel
> %esp...

Yes. I must admit that Stas was right and I was wrong. This is a 386
CPU bug. Since then it seems to have been codified in the x86
architecture and duplicated in all successors. Incredible... :(
-- 
vda
end of thread, other threads:[~2004-10-11 23:16 UTC | newest]
Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-06 17:18 ESP corruption bug - what CPUs are affected? (patch att Petr Vandrovec
2004-10-06 19:04 ` Stas Sergeev
2004-10-11 18:32 ` ESP corruption bug - what CPUs are affected? Stas Sergeev
-- strict thread matches above, loose matches on Subject: below --
2004-09-16 18:39 Petr Vandrovec
2004-09-17 18:12 ` Stas Sergeev
2004-09-18 16:45 ` Stas Sergeev
2004-09-18 16:59 ` Petr Vandrovec
2004-09-18 19:14 ` Stas Sergeev
2004-09-18 20:35 ` Petr Vandrovec
2004-09-22 18:49 ` Stas Sergeev
2004-09-22 19:19 ` Richard B. Johnson
2004-09-22 20:03 ` Stas Sergeev
2004-09-22 20:13 ` Richard B. Johnson
2004-09-28 15:43 ` Denis Vlasenko
2004-09-22 20:02 ` Petr Vandrovec
2004-09-23 4:09 ` Stas Sergeev
2004-09-23 17:08 ` Stas Sergeev
2004-09-23 18:06 ` Petr Vandrovec
2004-09-24 20:36 ` Stas Sergeev
2004-09-24 21:43 ` Petr Vandrovec
2004-09-25 8:04 ` Gabriel Paubert
2004-09-25 12:25 ` Stas Sergeev
2004-09-25 19:18 ` Gabriel Paubert
2004-09-25 20:40 ` Stas Sergeev
2004-09-25 23:42 ` Gabriel Paubert
2004-09-26 18:04 ` Stas Sergeev
2004-09-27 9:07 ` Gabriel Paubert
2004-09-30 15:11 ` Bill Davidsen
2004-09-16 17:49 Stas Sergeev
2004-09-16 19:03 ` Denis Vlasenko
2004-09-17 18:13 ` Stas Sergeev
2004-09-17 22:04 ` Denis Vlasenko
2004-09-18 10:58 ` Stas Sergeev
2004-09-18 13:08 ` Denis Vlasenko
2004-09-18 17:05 ` Stas Sergeev
[not found] ` <200409190108.45641.vda@port.imtp.ilyichevsk.odessa.ua>
2004-09-22 19:05 ` Stas Sergeev
2004-09-21 11:19 ` Pavel Machek
2004-09-21 11:43 ` Denis Vlasenko