* timer + fpu stuff locks my console race
@ 2004-06-09 21:02 stian
  2004-06-10 21:00 ` Matias Hermanrud Fjeld
  2004-06-12  2:53 ` Rik van Riel
  0 siblings, 2 replies; 18+ messages in thread
From: stian @ 2004-06-09 21:02 UTC
To: linux-kernel

Please keep me in CC as I'm not on the mailing list.

I'm currently on vacation, so I can't hook my Linux box up to the
Internet, but I came across a race condition in the "old" 2.4.26-rc1
vanilla kernel.

I was doing some code tests when I came across problems with my program
locking my console (even X if I'm using an xterm). I think gcc is what
triggers the problem in the first place, so the full report is here:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=15905

If more details about versions or other information are needed, please
let me know. It currently triggers on every attempt on my box (and my
machine has no Internet connection for the time being).

Stian Skjelstad
* Re: timer + fpu stuff locks my console race 2004-06-09 21:02 timer + fpu stuff locks my console race stian @ 2004-06-10 21:00 ` Matias Hermanrud Fjeld 2004-06-11 6:08 ` Lars Age Kamfjord 2004-06-12 2:53 ` Rik van Riel 1 sibling, 1 reply; 18+ messages in thread From: Matias Hermanrud Fjeld @ 2004-06-10 21:00 UTC (permalink / raw) To: linux-kernel [-- Attachment #1: Type: text/plain, Size: 158 bytes --] ACK mhf@bilbo:~$ uname -a Linux bilbo 2.6.6-1-k7 #1 Wed May 12 18:19:40 EST 2004 i686 GNU/Linux -- Matias Hermanrud Fjeld http://www.hex.no/mhf [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks my console race 2004-06-10 21:00 ` Matias Hermanrud Fjeld @ 2004-06-11 6:08 ` Lars Age Kamfjord 0 siblings, 0 replies; 18+ messages in thread From: Lars Age Kamfjord @ 2004-06-11 6:08 UTC (permalink / raw) To: linux-kernel; +Cc: stian This bug seems VERY serious.... Every machine I've tested with so far has crashed totally; and it happens with every version of 2.4 and 2.6-kernels. I've tested with a 2.2.19-kernel, and that didn't crash, so it seems to be a bug in 2.4 and later..... Somebody really should look at this...... Lars Age Kamfjord ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks my console race 2004-06-09 21:02 timer + fpu stuff locks my console race stian 2004-06-10 21:00 ` Matias Hermanrud Fjeld @ 2004-06-12 2:53 ` Rik van Riel 2004-06-12 3:50 ` Rik van Riel 2004-06-12 4:35 ` timer + fpu stuff locks my console race Matt Mackall 1 sibling, 2 replies; 18+ messages in thread From: Rik van Riel @ 2004-06-12 2:53 UTC (permalink / raw) To: stian; +Cc: linux-kernel On Wed, 9 Jun 2004 stian@nixia.no wrote: > I'm doing some code tests when I came across problems with my program > locking my console (even X if I'm using a xterm). Reproduced here, on my test system running a 2.6 kernel. I did get a kernel backtrace over serial console, though ;) Pid: 19752, comm: kernel-hang-bz1 EIP: 0060:[<ffff345c>] CPU: 0 EIP is at 0xffff345c EFLAGS: 00000202 Not tainted (2.6.5-1.332) EAX: 00000001 EBX: 12005870 ECX: fef32ea8 EDX: 1958f000 ESI: 1958f000 EDI: fef32ea8 EBP: fef32e48 DS: 007b ES: 007b CR0: 80050033 CR2: 00c4b720 CR3: 003ab000 CR4: 000006d0 Call Trace: [<0210dcda>] restore_i387_fxsave+0x18/0x60 [<0210dd38>] restore_i387+0x16/0x65 [<021059e5>] restore_sigcontext+0xf2/0x10c [<0215b737>] get_user_size+0x30/0x57 [<02105c13>] sys_sigreturn+0x214/0x23a -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks my console race 2004-06-12 2:53 ` Rik van Riel @ 2004-06-12 3:50 ` Rik van Riel 2004-06-12 13:44 ` Sergey Vlasov 2004-06-12 4:35 ` timer + fpu stuff locks my console race Matt Mackall 1 sibling, 1 reply; 18+ messages in thread From: Rik van Riel @ 2004-06-12 3:50 UTC (permalink / raw) To: stian; +Cc: linux-kernel On Fri, 11 Jun 2004, Rik van Riel wrote: > Reproduced here, on my test system running a 2.6 kernel. > I did get a kernel backtrace over serial console, though ;) With a 2.4 kernel I get a similar stack trace (also on alt-sysrq-p) output: Pid/TGid: 3815/3815, comm: kernel-hang-bz1 EIP: 0060:[<c03ec1cc>] CPU: 0 EIP is at coprocessor_error [kernel] 0x0 (2.4.21-15.5.ELsmp) ESP: 0060:c0113d14 EFLAGS: 00000206 Not tainted EAX: 00100000 EBX: bfffc888 ECX: bfffc888 EDX: d9818000 ESI: bfffc888 EDI: d9819fb0 EBP: bfffc830 DS: 0068 ES: 0068 FS: 0000 GS: 0033 CR0: 80050033 CR2: b7566720 CR3: 02553380 CR4: 000006f0 Call Trace: [<c0113d14>] restore_i387_fxsave [kernel] 0x24 (0xd9819ee4) [<c0113de8>] restore_i387 [kernel] 0x78 (0xd9819f04) [<c010b40e>] restore_sigcontext [kernel] 0x10e (0xd9819f18) [<c010b51d>] sys_sigreturn [kernel] 0xed (0xd9819f94) Now I'm not sure if the process is actually stuck in kernel space or if it's looping tightly through both kernel and user space... -- "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks my console race
  2004-06-12  3:50 ` Rik van Riel
@ 2004-06-12 13:44 ` Sergey Vlasov
  2004-06-12 13:57 ` stian
  2004-06-12 14:25 ` timer + fpu stuff locks up computer Alexander Nyberg
  0 siblings, 2 replies; 18+ messages in thread
From: Sergey Vlasov @ 2004-06-12 13:44 UTC
To: Rik van Riel; +Cc: linux-kernel, stian

On Fri, 11 Jun 2004 23:50:25 -0400, Rik van Riel wrote:

> On Fri, 11 Jun 2004, Rik van Riel wrote:
>
>> Reproduced here, on my test system running a 2.6 kernel.
>> I did get a kernel backtrace over serial console, though ;)
>
> Now I'm not sure if the process is actually stuck in kernel
> space or if it's looping tightly through both kernel and
> user space...

Here is the culprit (include/asm-i386/i387.h):

#define __clear_fpu( tsk ) \
do { \
	if ((tsk)->thread_info->status & TS_USEDFPU) { \
		asm volatile("fwait"); \
		(tsk)->thread_info->status &= ~TS_USEDFPU; \
		stts(); \
	} \
} while (0)

This is called in flush_thread() (which is used in flush_old_exec()
and therefore in the sys_execve() path) and in restore_i387_fsave() and
restore_i387_fxsave() (which are reached from sys_sigreturn() and
sys_rt_sigreturn()).

The buggy code in Stian's program corrupts the FPU state - in
particular, it results in some exception bits being set in the FPU
status word. In this state the next FP command (except non-waiting
commands, like fnsave and fninit) will raise the FP error exception
(trap 16). The "fwait" above happens to be that next command.

The FP error handler do_coprocessor_error() calls math_error() for the
real work (both in arch/i386/kernel/traps.c). math_error() calls
save_init_fpu(), which saves the FPU state in current->thread.i387 and
sets the TS flag; then math_error() queues a SIGFPE to the task and
returns. If the fault comes from userspace, this is enough - on the
return path the pending signal will be noticed and delivered.

However, in this case the fault happens in kernel code, therefore
execution just resumes at the same point - trying to execute that
fwait again. At this time, however, the TS flag is set, so we get
another trap - trap 7, device_not_available. The trap handler calls
math_state_restore(), which clears the TS flag and reloads the FP
state from current->thread.i387. Then it returns, and the faulting
instruction is restarted again. But it gets the same FP error
exception as the first time... So the CPU is stuck handling endless
faults in kernel mode.

How to fix this? A quick and dirty fix is to remove the problematic
fwait from __clear_fpu(); 2.2.x kernels did not have it - probably it
was added in some 2.3.x.

--- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
+++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
@@ -51,7 +51,6 @@
 #define __clear_fpu( tsk ) \
 do { \
 	if ((tsk)->thread_info->status & TS_USEDFPU) { \
-		asm volatile("fwait"); \
 		(tsk)->thread_info->status &= ~TS_USEDFPU; \
 		stts(); \
 	} \

In this case we will ignore a pending FP exception at execve() or
sigreturn() instead of raising SIGFPE (which was probably intended by
whoever put an fwait there).

If we want to be pedantic and care about such pending exceptions, we
should add a check for kernel addresses to do_coprocessor_error() and
add fixup_exception there, like we do for protection faults, so that
the handler will not attempt to restart the failing instruction again.
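[Editorial sketch] The trap 16 / trap 7 ping-pong described above can be illustrated with a small toy model. This is not kernel code: struct fpu_model, model_fwait() and the max_traps cut-off are hypothetical names invented for illustration, and the model only tracks the two pieces of state the message talks about (the TS flag and a pending exception bit in the live versus saved FPU state).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names exist in the kernel. */
struct fpu_model {
    bool ts_flag;         /* stands in for CR0.TS */
    bool live_pending;    /* exception pending in the live FPU state */
    bool saved_pending;   /* same bit in the saved copy (current->thread.i387) */
    int  sigfpe_requests; /* how often the modelled math_error() asked for SIGFPE */
};

/*
 * Model the execution of one "fwait".
 * Returns the number of traps taken before execution moves on, or -1 if
 * it still has not moved on after max_traps traps (the lockup case).
 */
int model_fwait(struct fpu_model *m, bool in_kernel, int max_traps)
{
    int traps = 0;

    assert(m != NULL && max_traps > 0);
    while (traps < max_traps) {
        if (m->ts_flag) {
            /* trap 7: clear TS and reload the saved state, pending bit included */
            m->ts_flag = false;
            m->live_pending = m->saved_pending;
            traps++;
        } else if (m->live_pending) {
            /* trap 16: save and reinitialise the FPU, set TS, ask for SIGFPE */
            m->saved_pending = true;
            m->live_pending = false;
            m->ts_flag = true;
            m->sigfpe_requests++;
            traps++;
            if (!in_kernel)
                return traps; /* signal is delivered on the way back to userspace */
            /* fault in kernel code: the same fwait is simply restarted */
        } else {
            return traps;     /* nothing pending, fwait completes */
        }
    }
    return -1;
}
```

Calling model_fwait() with in_kernel set and a pending exception never completes and keeps alternating between the two traps, while the same state with in_kernel clear ends after a single trap, which is the asymmetry the message points out.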
* Re: timer + fpu stuff locks my console race
  2004-06-12 13:44 ` Sergey Vlasov
@ 2004-06-12 13:57 ` stian
  2004-06-12 14:28 ` Sergey Vlasov
  2004-06-12 14:25 ` timer + fpu stuff locks up computer Alexander Nyberg
  1 sibling, 1 reply; 18+ messages in thread
From: stian @ 2004-06-12 13:57 UTC
To: Sergey Vlasov; +Cc: Rik van Riel, linux-kernel, stian

> --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
> +++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 17:25:56 +0400
> @@ -51,7 +51,6 @@
>  #define __clear_fpu( tsk ) \
>  do { \
>  	if ((tsk)->thread_info->status & TS_USEDFPU) { \
> -		asm volatile("fwait"); \
>  		(tsk)->thread_info->status &= ~TS_USEDFPU; \
>  		stts(); \
>  	} \

But what about task switching and FPU exceptions that come in late? I
know that the kernel does not use the FPU in general, and in the places
where it does, fsave, fwait and frstor keep it all contained in kernel
space.

Stian Skjelstad
* Re: timer + fpu stuff locks my console race 2004-06-12 13:57 ` stian @ 2004-06-12 14:28 ` Sergey Vlasov 0 siblings, 0 replies; 18+ messages in thread From: Sergey Vlasov @ 2004-06-12 14:28 UTC (permalink / raw) To: stian; +Cc: Rik van Riel, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1300 bytes --] On Sat, Jun 12, 2004 at 03:57:42PM +0200, stian@nixia.no wrote: > > --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup 2004-05-10 06:33:06 > > +0400 > > +++ linux-2.6.6/include/asm-i386/i387.h 2004-06-12 17:25:56 +0400 > > @@ -51,7 +51,6 @@ > > #define __clear_fpu( tsk ) \ > > do { \ > > if ((tsk)->thread_info->status & TS_USEDFPU) { \ > > - asm volatile("fwait"); \ > > (tsk)->thread_info->status &= ~TS_USEDFPU; \ > > stts(); \ > > } \ > > But what about task-switching and fpu-exceptions that comes in late? I > know that the kernel does not use FPU in general, and the places it does, > fsave, fwait and frstor embeddes it all in kernel-space. Kernel code which uses FPU should call kernel_fpu_begin() before it and kernel_fpu_end() after. kernel_fpu_begin() is safe - it uses fnsave or fxsave, both of which don't raise pending FPU exceptions. Also fnsave performs implicit fninit, and fxsave is followed by fnclex, which clears pending exceptions. However, raid6_before_mmx() [drivers/md/raid6x86.h] seems to be buggy: static inline void raid6_before_mmx(raid6_mmx_save_t *s) { s->cr0 = raid6_get_fpu(); asm volatile("fsave %0 ; fwait" : "=m" (s->fsave[0])); } fsave will raise pending exceptions (unlike fnsave). [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
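[Editorial sketch] The waiting versus non-waiting distinction used in the messages above can be written down as a small lookup. The table only covers instructions the thread itself names, and the function name raises_pending_fp_exception() is hypothetical; anything not listed is reported as unknown rather than guessed.

```c
#include <string.h>

/*
 * Returns 1 if the instruction raises a pending FP exception before
 * doing its work, 0 if it does not, and -1 if the mnemonic is not one
 * of those discussed in this thread.
 */
int raises_pending_fp_exception(const char *mnemonic)
{
    static const struct {
        const char *name;
        int raises;
    } table[] = {
        { "fwait",  1 }, /* the instruction that locks up __clear_fpu() */
        { "fsave",  1 }, /* waiting form, as used in raid6_before_mmx() */
        { "fnsave", 0 }, /* non-waiting; also reinitialises the FPU */
        { "fxsave", 0 }, /* does not raise pending exceptions */
        { "fninit", 0 }, /* non-waiting */
        { "fnclex", 0 }, /* non-waiting; clears the pending exceptions */
    };
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(mnemonic, table[i].name) == 0)
            return table[i].raises;
    }
    return -1;
}
```

Read this way, the concern about raid6_before_mmx() is simply that "fsave" sits in the same row group as "fwait", whereas kernel_fpu_begin() only uses instructions from the other group.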
* Re: timer + fpu stuff locks up computer 2004-06-12 13:44 ` Sergey Vlasov 2004-06-12 13:57 ` stian @ 2004-06-12 14:25 ` Alexander Nyberg 2004-06-12 14:42 ` stian 2004-06-12 15:14 ` Sergey Vlasov 1 sibling, 2 replies; 18+ messages in thread From: Alexander Nyberg @ 2004-06-12 14:25 UTC (permalink / raw) To: Sergey Vlasov; +Cc: Rik van Riel, linux-kernel, stian On Sat, 2004-06-12 at 15:44, Sergey Vlasov wrote: > On Fri, 11 Jun 2004 23:50:25 -0400, Rik van Riel wrote: > > > On Fri, 11 Jun 2004, Rik van Riel wrote: > > > >> Reproduced here, on my test system running a 2.6 kernel. > >> I did get a kernel backtrace over serial console, though ;) > > > > Now I'm not sure if the process is actually stuck in kernel > > space or if it's looping tightly through both kernel and > > user space... > > --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup 2004-05-10 06:33:06 +0400 > +++ linux-2.6.6/include/asm-i386/i387.h 2004-06-12 17:25:56 +0400 > @@ -51,7 +51,6 @@ > #define __clear_fpu( tsk ) \ > do { \ > if ((tsk)->thread_info->status & TS_USEDFPU) { \ > - asm volatile("fwait"); \ > (tsk)->thread_info->status &= ~TS_USEDFPU; \ > stts(); \ > } \ Sorry for this extremely informative mail but, doesn't work. 
Looks like the problem is only being delayed: Pid: 431, comm: sshd EIP: 0060:[<c0119f98>] CPU: 0 EIP is at force_sig_info+0x48/0x80 EFLAGS: 00000286 Not tainted (2.6.7-rc3-mm1) EAX: 00000000 EBX: de96d7d0 ECX: 00000007 EDX: 00000008 ESI: 00000008 EDI: 00000286 EBP: de9e3dd4 DS: 007b ES: 007b CR0: 8005003b CR2: 080b2664 CR3: 1f48f000 CR4: 000002d0 [<c0105560>] do_coprocessor_error+0x0/0x20 [<c01054f2>] math_error+0xb2/0x120 [<c01d2bb8>] fast_clear_page+0x8/0x50 [<c0105de3>] do_IRQ+0x113/0x150 [<c0105de3>] do_IRQ+0x113/0x150 [<c0105de3>] do_IRQ+0x113/0x150 [<c0104398>] common_interrupt+0x18/0x20 [<c0109ed5>] restore_fpu+0x15/0x20 [<c0104435>] error_code+0x2d/0x38 [<c01d2bb8>] fast_clear_page+0x8/0x50 [<c013286e>] do_anonymous_page+0x8e/0x140 [<c0132979>] do_no_page+0x59/0x290 [<c0132d5e>] handle_mm_fault+0xbe/0x120 [<c010e5b4>] do_page_fault+0x134/0x506 [<c010fd90>] default_wake_function+0x0/0x10 [<c01f4f6a>] tty_read+0xaa/0xf0 [<c014dd3d>] sys_select+0x22d/0x490 [<c013e583>] vfs_read+0xc3/0x100 [<c011b0ac>] sigprocmask+0x4c/0xb0 [<c010e480>] do_page_fault+0x0/0x506 [<c0104435>] error_code+0x2d/0x38 ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks up computer 2004-06-12 14:25 ` timer + fpu stuff locks up computer Alexander Nyberg @ 2004-06-12 14:42 ` stian 2004-06-12 15:20 ` martin capitanio 2004-06-12 15:14 ` Sergey Vlasov 1 sibling, 1 reply; 18+ messages in thread From: stian @ 2004-06-12 14:42 UTC (permalink / raw) To: Alexander Nyberg; +Cc: linux-kernel, Sergey Vlasov, Rik van Riel > Sorry for this extremely informative mail but, doesn't work. > > Looks like the problem is only being delayed: Makes sense, since fwait is done in kernel-mode and it takes some time for the exception to rise, since this is a slow instruction. So the problem gets delayed. What do you think Sergey? Does the other dirty nasty patch work for you? Stian Skjelstad ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks up computer 2004-06-12 14:42 ` stian @ 2004-06-12 15:20 ` martin capitanio 2004-06-12 16:15 ` stian 0 siblings, 1 reply; 18+ messages in thread From: martin capitanio @ 2004-06-12 15:20 UTC (permalink / raw) To: stian; +Cc: linux-kernel On Saturday 12 June 2004 16:42, stian@nixia.no wrote: > > Does the other dirty nasty patch work for you? ACK for 2.6.7-rc4-mm1 (gcc-Version 3.3.3) user$ ./evil completely freeze --- linux-2.6.6-rc3-mm1/kernel/signal.c 2004-06-09 18:36:12.000000000 +0200 +++ linux-2.6.6-rc3-mm1-fpuhotfix/kernel/signal.c 2004-06-12 18:10:31.573001808 +0200 @@ -799,7 +799,15 @@ can get more detailed information about the cause of the signal. */ if (LEGACY_QUEUE(&t->pending, sig)) + { + if (sig==8) + { + printk("Attempt to exploit known bug, process=%s pid=%p uid=%d\n", + t->comm, t->pid, t->uid); + do_exit(0); + } goto out; + } ret = send_signal(sig, info, t, &t->pending); if (!ret && !sigismember(&t->blocked, sig)) 2.6.7-rc4-mm1-fpuhotfix: user$ ./evil ........................*............................................... ......................* Attempt to exploit known bug, process=evil pid=00000aa6 uid=1000 note: evil[2726] exited with preempt_count 2 bad: scheduling while atomic! 
[<c032a045>] schedule+0x4b5/0x4c0 [<c01435cb>] zap_pmd_range+0x4b/0x70 [<c014362d>] unmap_page_range+0x3d/0x70 [<c014380b>] unmap_vmas+0x1ab/0x1c0 [<c0147639>] exit_mmap+0x79/0x150 [<c01184ee>] mmput+0x5e/0xa0 [<c011c523>] do_exit+0x153/0x3e0 [<c0122e6f>] specific_send_sig_info+0xff/0x100 [<c0122eb2>] force_sig_info+0x42/0x90 [<c0105be0>] do_coprocessor_error+0x0/0x20 [<c0105b5e>] math_error+0xde/0x160 [<c010b0f6>] restore_i387_fxsave+0x26/0xa0 [<c0222c8c>] write_chan+0x18c/0x250 [<c01170e0>] default_wake_function+0x0/0x10 [<c01170e0>] default_wake_function+0x0/0x10 [<c0104a05>] error_code+0x2d/0x38 [<c010b0f6>] restore_i387_fxsave+0x26/0xa0 [<c010b1fc>] restore_i387+0x8c/0x90 [<c0103434>] restore_sigcontext+0x114/0x130 [<c0103503>] sys_sigreturn+0xb3/0xd0 [<c0103f6b>] syscall_call+0x7/0xb but it keeps the kernel alive :-) martin ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks up computer 2004-06-12 15:20 ` martin capitanio @ 2004-06-12 16:15 ` stian 0 siblings, 0 replies; 18+ messages in thread From: stian @ 2004-06-12 16:15 UTC (permalink / raw) To: martin capitanio; +Cc: stian, linux-kernel >> Does the other dirty nasty patch work for you? > --- linux-2.6.6-rc3-mm1/kernel/signal.c 2004-06-09 18:36:12.000000000 > +0200 > +++ linux-2.6.6-rc3-mm1-fpuhotfix/kernel/signal.c 2004-06-12 > 18:10:31.573001808 +0200 > @@ -799,7 +799,15 @@ > can get more detailed information about the cause of > the signal. */ > if (LEGACY_QUEUE(&t->pending, sig)) > + { > + if (sig==8) > + { > + printk("Attempt to exploit known bug, process=%s pid=%p > uid=%d\n", > + t->comm, t->pid, t->uid); > + do_exit(0); > + } > goto out; > + } > > ret = send_signal(sig, info, t, &t->pending); > if (!ret && !sigismember(&t->blocked, sig)) > > 2.6.7-rc4-mm1-fpuhotfix: > user$ ./evil > ........................*............................................... > ......................* > Attempt to exploit known bug, process=evil pid=00000aa6 uid=1000 > note: evil[2726] exited with preempt_count 2 > bad: scheduling while atomic! 
> [<c032a045>] schedule+0x4b5/0x4c0
> [<c01435cb>] zap_pmd_range+0x4b/0x70
> [<c014362d>] unmap_page_range+0x3d/0x70
> [<c014380b>] unmap_vmas+0x1ab/0x1c0
> [<c0147639>] exit_mmap+0x79/0x150
> [<c01184ee>] mmput+0x5e/0xa0
> [<c011c523>] do_exit+0x153/0x3e0
> [<c0122e6f>] specific_send_sig_info+0xff/0x100
> [<c0122eb2>] force_sig_info+0x42/0x90
> [<c0105be0>] do_coprocessor_error+0x0/0x20
> [<c0105b5e>] math_error+0xde/0x160
> [<c010b0f6>] restore_i387_fxsave+0x26/0xa0
> [<c0222c8c>] write_chan+0x18c/0x250
> [<c01170e0>] default_wake_function+0x0/0x10
> [<c01170e0>] default_wake_function+0x0/0x10
> [<c0104a05>] error_code+0x2d/0x38
> [<c010b0f6>] restore_i387_fxsave+0x26/0xa0
> [<c010b1fc>] restore_i387+0x8c/0x90
> [<c0103434>] restore_sigcontext+0x114/0x130
> [<c0103503>] sys_sigreturn+0xb3/0xd0
> [<c0103f6b>] syscall_call+0x7/0xb
>
> but it keeps the kernel alive :-)

The hotfix should probably be moved to arch/i386/kernel/traps.c, before
we start taking atomic locks, since it is beyond dirty to kill the
process here when we have locked down resources. But the best would be
to fix the source of the problem, since this is just a workaround.

Stian Skjelstad
* Re: timer + fpu stuff locks up computer 2004-06-12 14:25 ` timer + fpu stuff locks up computer Alexander Nyberg 2004-06-12 14:42 ` stian @ 2004-06-12 15:14 ` Sergey Vlasov 2004-06-12 18:45 ` Sergey Vlasov 1 sibling, 1 reply; 18+ messages in thread From: Sergey Vlasov @ 2004-06-12 15:14 UTC (permalink / raw) To: Alexander Nyberg; +Cc: Rik van Riel, linux-kernel, stian [-- Attachment #1: Type: text/plain, Size: 1645 bytes --] On Sat, Jun 12, 2004 at 04:25:51PM +0200, Alexander Nyberg wrote: > > --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup 2004-05-10 06:33:06 +0400 > > +++ linux-2.6.6/include/asm-i386/i387.h 2004-06-12 17:25:56 +0400 > > @@ -51,7 +51,6 @@ > > #define __clear_fpu( tsk ) \ > > do { \ > > if ((tsk)->thread_info->status & TS_USEDFPU) { \ > > - asm volatile("fwait"); \ > > (tsk)->thread_info->status &= ~TS_USEDFPU; \ > > stts(); \ > > } \ > > Sorry for this extremely informative mail but, doesn't work. > > Looks like the problem is only being delayed: > > Pid: 431, comm: sshd > EIP: 0060:[<c0119f98>] CPU: 0 > EIP is at force_sig_info+0x48/0x80 > EFLAGS: 00000286 Not tainted (2.6.7-rc3-mm1) > EAX: 00000000 EBX: de96d7d0 ECX: 00000007 EDX: 00000008 > ESI: 00000008 EDI: 00000286 EBP: de9e3dd4 DS: 007b ES: 007b > CR0: 8005003b CR2: 080b2664 CR3: 1f48f000 CR4: 000002d0 > [<c0105560>] do_coprocessor_error+0x0/0x20 > [<c01054f2>] math_error+0xb2/0x120 > [<c01d2bb8>] fast_clear_page+0x8/0x50 ... Grrr. I was testing on a fairly generic kernel configuration which did not include fast_clear_page()... If the FPU state belong to the userspace process, kernel_fpu_begin() is safe even if some exceptions are pending. However, after __clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does nothing with it. Replacing fwait with fnclex instead of removing it completely should avoid the fault later. However, looks like we really need the proper fix - teach do_coprocessor_error() to recognize kernel mode faults and fixup them. 
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks up computer
  2004-06-12 15:14 ` Sergey Vlasov
@ 2004-06-12 18:45 ` Sergey Vlasov
  2004-06-12 20:27 ` Alexander Nyberg
  0 siblings, 1 reply; 18+ messages in thread
From: Sergey Vlasov @ 2004-06-12 18:45 UTC
To: Alexander Nyberg; +Cc: Rik van Riel, linux-kernel, stian

On Sat, Jun 12, 2004 at 07:14:22PM +0400, Sergey Vlasov wrote:
> If the FPU state belong to the userspace process, kernel_fpu_begin()
> is safe even if some exceptions are pending. However, after
> __clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does
> nothing with it.
>
> Replacing fwait with fnclex instead of removing it completely should
> avoid the fault later.

Yes, it seems to be enough. Another case where it looks like FPU
might be "orphaned" is exit(); however, it is handled as a normal task
switch, __switch_to() calls __unlazy_fpu(), which clears pending
exceptions.

I'm still not sure what to do about possibly lost FP exceptions. This
can happen in two cases:

1) Program calls execve() while an FP exception is pending.

   In this case clear_fpu() is called when the original executable is
   already destroyed. Even if we generate a SIGFPE in this case, it
   would be delivered to the new executable.

2) Program returns from a signal handler while an FP exception is
   pending.

   In this case at clear_fpu() time restore_sigcontext() has already
   wiped out all state of the signal handler, so the SIGFPE would
   appear to be raised from the program code at the point where it was
   interrupted by the handled signal.

Signed-Off-By: Sergey Vlasov <vsu@altlinux.ru>

--- linux-2.6.6/include/asm-i386/i387.h.fp-lockup	2004-05-10 06:33:06 +0400
+++ linux-2.6.6/include/asm-i386/i387.h	2004-06-12 22:02:58 +0400
@@ -48,10 +48,17 @@
 	save_init_fpu( tsk ); \
 } while (0)
 
+/*
+ * There might be some pending exceptions in the FP state at this point.
+ * However, it is too late to report them: this code is called during execve()
+ * (when the original executable is already gone) and during sigreturn() (when
+ * the signal handler context is already lost). So just clear them to prevent
+ * problems later.
+ */
 #define __clear_fpu( tsk ) \
 do { \
 	if ((tsk)->thread_info->status & TS_USEDFPU) { \
-		asm volatile("fwait"); \
+		asm volatile("fnclex"); \
 		(tsk)->thread_info->status &= ~TS_USEDFPU; \
 		stts(); \
 	} \
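[Editorial sketch] The three variants of __clear_fpu() tried in this thread (original fwait, fwait removed, fwait replaced by fnclex) can be compared in a toy model. Everything here is hypothetical illustration rather than kernel code: struct task_fpu, model_clear_fpu() and model_kernel_fpu_use_faults() are invented names, and the second function assumes, as described in the thread, that kernel_fpu_begin() only saves and clears FPU state that still belongs to a userspace task.

```c
#include <stdbool.h>

enum clear_variant { CLEAR_FWAIT, CLEAR_NOTHING, CLEAR_FNCLEX };

/* Toy model only; field names are illustrative, not kernel fields. */
struct task_fpu {
    bool used_fpu;     /* stands in for TS_USEDFPU */
    bool live_pending; /* exception pending in the live FPU state */
};

/* Models __clear_fpu(). Returns false if the modelled CPU would lock up. */
bool model_clear_fpu(struct task_fpu *t, enum clear_variant v)
{
    if (!t->used_fpu)
        return true;              /* macro body is skipped entirely */
    if (v == CLEAR_FWAIT && t->live_pending)
        return false;             /* endless trap 16 / trap 7 loop */
    if (v == CLEAR_FNCLEX)
        t->live_pending = false;  /* pending exceptions are discarded */
    t->used_fpu = false;          /* FPU state is now "orphaned" */
    return true;
}

/*
 * Models later FPU/MMX use inside the kernel (for example the
 * fast_clear_page() seen in the oops). Returns true if that use would
 * hit a leftover pending exception.
 */
bool model_kernel_fpu_use_faults(struct task_fpu *t)
{
    if (t->used_fpu)
        t->live_pending = false;  /* state is saved and cleared first */
    return t->live_pending;       /* orphaned state is left as it is */
}
```

In this model only the fnclex variant both survives the clear step and leaves nothing behind for later in-kernel FPU use, which matches the test reports in the thread.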
* Re: timer + fpu stuff locks up computer 2004-06-12 18:45 ` Sergey Vlasov @ 2004-06-12 20:27 ` Alexander Nyberg 0 siblings, 0 replies; 18+ messages in thread From: Alexander Nyberg @ 2004-06-12 20:27 UTC (permalink / raw) To: Sergey Vlasov; +Cc: Rik van Riel, linux-kernel, stian On Sat, 2004-06-12 at 20:45, Sergey Vlasov wrote: > On Sat, Jun 12, 2004 at 07:14:22PM +0400, Sergey Vlasov wrote: > > If the FPU state belong to the userspace process, kernel_fpu_begin() > > is safe even if some exceptions are pending. However, after > > __clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does > > nothing with it. > > > > Replacing fwait with fnclex instead of removing it completely should > > avoid the fault later. > > Yes, it seems to be enough. Another case where it looks like FPU > might be "orphaned" is exit(); however, it is handled as a normal task > switch, __switch_to() calls __unlazy_fpu(), which clears pending > exceptions. > > I'm still not sure what to do about possibly lost FP exceptions. This > can happen in two cases: > > 1) Program calls execve() while an FP exception is pending. > > In this case clear_fpu() is called when the original executable is > already destroyed. Even if we generate a SIGFPE in this case, it > would be delivered to the new executable. > > 2) Program returns from a signal handler while an FP exception is > pending. > > In this case at clear_fpu() time restore_sigcontext() has already > wiped out all state of the signal handler, so the SIGFPE would > appear to be raised from the program code at the point where it was > interrupted by the handled signal. > > Signed-Off-By: Sergey Vlasov <vsu@altlinux.ru> > > --- linux-2.6.6/include/asm-i386/i387.h.fp-lockup 2004-05-10 06:33:06 +0400 > +++ linux-2.6.6/include/asm-i386/i387.h 2004-06-12 22:02:58 +0400 > @@ -48,10 +48,17 @@ > save_init_fpu( tsk ); \ > } while (0) > > +/* > + * There might be some pending exceptions in the FP state at this point. 
> + * However, it is too late to report them: this code is called during execve() > + * (when the original executable is already gone) and during sigreturn() (when > + * the signal handler context is already lost). So just clear them to prevent > + * problems later. > + */ > #define __clear_fpu( tsk ) \ > do { \ > if ((tsk)->thread_info->status & TS_USEDFPU) { \ > - asm volatile("fwait"); \ > + asm volatile("fnclex"); \ > (tsk)->thread_info->status &= ~TS_USEDFPU; \ > stts(); \ > } \ > This works, tested also on a box with md and things looked fine. Alex ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: timer + fpu stuff locks my console race 2004-06-12 2:53 ` Rik van Riel 2004-06-12 3:50 ` Rik van Riel @ 2004-06-12 4:35 ` Matt Mackall 1 sibling, 0 replies; 18+ messages in thread From: Matt Mackall @ 2004-06-12 4:35 UTC (permalink / raw) To: Rik van Riel; +Cc: stian, linux-kernel On Fri, Jun 11, 2004 at 10:53:48PM -0400, Rik van Riel wrote: > On Wed, 9 Jun 2004 stian@nixia.no wrote: > > > I'm doing some code tests when I came across problems with my program > > locking my console (even X if I'm using a xterm). > > Reproduced here, on my test system running a 2.6 kernel. > I did get a kernel backtrace over serial console, though ;) I stuck some strategic printks in the kernel. The example code's bogus asm is generating an FPU fault in frstor in its signal handler, that's bumping us into math_error -> force_sig_info -> specific_send_sig_info. Then we hit: if (LEGACY_QUEUE(&t->pending, sig)) which decides we don't need to send the signal after all and we bail all the way back out and recurse. -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 18+ messages in thread
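[Editorial sketch] The effect of the LEGACY_QUEUE check described above can be modelled in a few lines: a non-real-time signal that is already pending is not queued a second time. The names below are hypothetical, and the value 32 for the first real-time signal is an assumption about the kernel-side numbering on this architecture; 8 for SIGFPE is taken from the "sig==8" test in the hotfix posted in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_SIGFPE   8  /* matches the sig==8 test in the posted hotfix */
#define MODEL_SIGRTMIN 32 /* assumption: first real-time signal number */

/* Toy stand-in for a task's pending-signal state. */
struct pending_model {
    uint64_t mask; /* one bit per signal, bit (sig - 1) */
    int queued;    /* how many signals were actually queued */
};

/*
 * Returns true if the signal was queued, false if it was dropped
 * because a legacy (non-real-time) signal with the same number is
 * already pending.
 */
bool model_send_signal(struct pending_model *p, int sig)
{
    uint64_t bit;

    assert(sig >= 1 && sig <= 64);
    bit = (uint64_t)1 << (sig - 1);
    if (sig < MODEL_SIGRTMIN && (p->mask & bit))
        return false; /* already pending: nothing changes */
    p->mask |= bit;
    p->queued++;
    return true;
}
```

For the lockup this means that every fault after the first one finds SIGFPE already pending, changes nothing, and returns to the same faulting instruction.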
* Re: timer + fpu stuff locks up computer
       [not found] ` <26lJU-4lC-23@gated-at.bofh.it>
@ 2004-06-12 22:08 ` Andi Kleen
  2004-06-13 13:06 ` Sergey Vlasov
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2004-06-12 22:08 UTC
To: Sergey Vlasov; +Cc: linux-kernel

Sergey Vlasov <vsu@altlinux.ru> writes:

> On Sat, Jun 12, 2004 at 07:14:22PM +0400, Sergey Vlasov wrote:
>> If the FPU state belong to the userspace process, kernel_fpu_begin()
>> is safe even if some exceptions are pending. However, after
>> __clear_fpu() the FPU is "orphaned", and kernel_fpu_begin() does
>> nothing with it.
>>
>> Replacing fwait with fnclex instead of removing it completely should
>> avoid the fault later.
>
> Yes, it seems to be enough. Another case where it looks like FPU
> might be "orphaned" is exit(); however, it is handled as a normal task
> switch, __switch_to() calls __unlazy_fpu(), which clears pending
> exceptions.

One problem on 486s/P5s would be the race that is described in D.2.1.3
of Volume 1 of the Intel architecture manual when the FPU is in MSDOS
compatibility mode. When that happens we can still get the exception
later (e.g. on a following fwait which the kernel can still execute).
The only way to handle that would be to check in the exception handler,
like my patch did. However, my patch was also not complete, since it
didn't handle it for all fwaits in the kernel.

Also, BTW, x86-64 must be fixed too.

-Andi
* Re: timer + fpu stuff locks up computer 2004-06-12 22:08 ` timer + fpu stuff locks up computer Andi Kleen @ 2004-06-13 13:06 ` Sergey Vlasov 0 siblings, 0 replies; 18+ messages in thread From: Sergey Vlasov @ 2004-06-13 13:06 UTC (permalink / raw) To: Andi Kleen; +Cc: linux-kernel [-- Attachment #1: Type: text/plain, Size: 979 bytes --] On Sun, Jun 13, 2004 at 12:08:10AM +0200, Andi Kleen wrote: > One problem on 486s/P5s would be the race that is described in D.2.1.3 > of Volume 1 of the Intel architecture manual when the FPU is in MSDOS > compatibility. When that happens we can still get the exception later > (e.g. on a following fwait which the kernel can still execute). The > only way to handle that would be to check in the exception handler, > like my patch did. But in head.S we set the NE flag in CR0 for all 486 or better processors, so the MSDOS compatibility mode is not used, and we don't need to care about this race. > However my patch was also not complete, since it > didn't handle it for all fwaits in the kernel. Looked at your patch... I was also thinking about something similar. You treat exception 16 and IRQ13 the same - is this really correct? Asynchronous IRQ13 might break things. But this would be visible only on a real 80386+80387 - does someone still have such hardware? ;) [-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 18+ messages in thread
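[Editorial sketch] The CR0 values printed in the oops dumps earlier in the thread (80050033 and 8005003b) can be decoded to check the point about the NE flag. The bit positions below are the architectural ones for the low CR0 flags; the macro and function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low CR0 flag bits relevant to x87 handling. */
#define CR0_MP (1u << 1) /* monitor coprocessor */
#define CR0_EM (1u << 2) /* emulation */
#define CR0_TS (1u << 3) /* task switched */
#define CR0_NE (1u << 5) /* numeric error: native exception 16 reporting */

/* True if FP errors are reported natively (exception 16), not via IRQ13. */
bool cr0_native_fp_errors(uint32_t cr0)
{
    return (cr0 & CR0_NE) != 0;
}

/* True if the TS flag is set, so the next FPU use traps first (trap 7). */
bool cr0_task_switched(uint32_t cr0)
{
    return (cr0 & CR0_TS) != 0;
}

/* True if fwait itself traps with device_not_available (needs MP and TS). */
bool cr0_fwait_traps(uint32_t cr0)
{
    return (cr0 & CR0_MP) != 0 && (cr0 & CR0_TS) != 0;
}
```

Both values from the dumps have NE set, consistent with the statement that MSDOS compatibility mode is not in use; they differ only in TS, which is clear in 80050033 and set in 8005003b.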