public inbox for linux-kernel@vger.kernel.org
* FPU-intensive programs crashing with floating point exception on Cyrix MII
@ 2005-08-17 16:13 Ondrej Zary
  2005-08-17 17:08 ` linux-os (Dick Johnson)
  0 siblings, 1 reply; 8+ messages in thread
From: Ondrej Zary @ 2005-08-17 16:13 UTC (permalink / raw)
  To: Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1081 bytes --]

My machine (Cyrix MII PR300 CPU, PCPartner TXB820DS board with i430TX 
chipset) exhibits a really weird problem:
When I run a program that uses the FPU, it sometimes crashes with "floating 
point exception" - for example, when playing MP3 files using any player. 
Or with Prime95 - http://www.mersenne.org/freesoft.htm - the "torture 
test" does not crash but shows "fatal error" in less than 10 minutes.
It might be something like this:
http://lists.suse.com/archive/suse-linux-e/2000-Sep/1080.html
or this
http://lists.slug.org.au/archives/slug/2000/11/msg00343.html

The problem appears on 2.4.x kernels and 2.6.x kernels. It works fine in 
Windows 98 - it can play MP3s and run Prime95 for hours without any 
problems.
I've tracked it down to math_error() in arch/i386/kernel/traps.c and 
"fixed" it (I really don't know anything about FPU programming). The 
patch is attached. It fixes my system - with the patch, I can play MP3s 
fine and Prime95 runs without any problems too.

Does anyone know why these exceptions happen and/or what's the correct 
solution?

-- 
Ondrej Zary


[-- Attachment #2: cyrix-math.patch --]
[-- Type: text/plain, Size: 530 bytes --]

--- linux-2.6.10/arch/i386/kernel/traps.c~	2004-12-25 12:02:03.000000000 +0100
+++ linux-2.6.10/arch/i386/kernel/traps.c	2004-12-25 12:02:03.000000000 +0100
@@ -790,8 +790,11 @@
 	 */
 	cwd = get_fpu_cwd(task);
 	swd = get_fpu_swd(task);
+	printk("MATH ERROR %d\n",((~cwd) & swd & 0x3f) | (swd & 0x240));
 	switch (((~cwd) & swd & 0x3f) | (swd & 0x240)) {
-		case 0x000:
+		case 0x000: /* Hack for Cyrix problems */
+		case 0x200:
+			return;                                
 		default:
 			break;
 		case 0x001: /* Invalid Op */


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
  2005-08-17 16:13 FPU-intensive programs crashing with floating point exception on Cyrix MII Ondrej Zary
@ 2005-08-17 17:08 ` linux-os (Dick Johnson)
  0 siblings, 0 replies; 8+ messages in thread
From: linux-os (Dick Johnson) @ 2005-08-17 17:08 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Linux Kernel Mailing List


On Wed, 17 Aug 2005, Ondrej Zary wrote:

> My machine (Cyrix MII PR300 CPU, PCPartner TXB820DS board with i430TX
> chipset) exhibits a really weird problem:
> When I run a program that uses the FPU, it sometimes crashes with "floating
> point exception" - for example, when playing MP3 files using any player.
> Or with Prime95 - http://www.mersenne.org/freesoft.htm - the "torture
> test" does not crash but shows "fatal error" in less than 10 minutes.
> It might be something like this:
> http://lists.suse.com/archive/suse-linux-e/2000-Sep/1080.html
> or this
> http://lists.slug.org.au/archives/slug/2000/11/msg00343.html
>
> The problem appears on 2.4.x kernels and 2.6.x kernels. It works fine in
> Windows 98 - it can play MP3s and run Prime95 for hours without any
> problems.
> I've tracked it down to math_error() in arch/i386/kernel/traps.c and
> "fixed" it (I really don't know anything about FPU programming). The
> patch is attached. It fixes my system - with the patch, I can play MP3s
> fine and Prime95 runs without any problems too.
>
> Does anyone know why these exceptions happen and/or what's the correct
> solution?
>
> --
> Ondrej Zary
>

0x200 is way up into the "condition" codes. There should have never
been an interrupt at all! Your "fix" is as good as you can get.

>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.12 on an i686 machine (5537.79 BogoMips).
Warning : 98.36% of all statistics are fiction.


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
@ 2005-08-17 18:49 Chuck Ebbert
  2005-08-18 10:37 ` Ondrej Zary
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Ebbert @ 2005-08-17 18:49 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: linux-kernel

On Wed, 17 Aug 2005 at 18:13:55 +0200, Ondrej Zary wrote:

> When I run a program that uses the FPU, it sometimes crashes with "floating 
> point exception"


> +     printk("MATH ERROR %d\n",((~cwd) & swd & 0x3f) | (swd & 0x240));

  Could you modify this to print the full values of cwd and swd like this?

        printk("MATH ERROR: cwd = 0x%hx, swd = 0x%hx\n", cwd, swd);

Then post the result.


__
Chuck


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
  2005-08-17 18:49 Chuck Ebbert
@ 2005-08-18 10:37 ` Ondrej Zary
  0 siblings, 0 replies; 8+ messages in thread
From: Ondrej Zary @ 2005-08-18 10:37 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: linux-kernel

Chuck Ebbert wrote:
> On Wed, 17 Aug 2005 at 18:13:55 +0200, Ondrej Zary wrote:
> 
> 
>>When I run a program that uses the FPU, it sometimes crashes with "floating 
>>point exception"
> 
> 
> 
>>+     printk("MATH ERROR %d\n",((~cwd) & swd & 0x3f) | (swd & 0x240));
> 
> 
>   Could you modify this to print the full values of cwd and swd like this?
> 
>         printk("MATH ERROR: cwd = 0x%hx, swd = 0x%hx\n", cwd, swd);
> 
> Then post the result.
MATH ERROR: cwd = 0x37f, swd = 0x5020
MATH ERROR: cwd = 0x37f, swd = 0x20
MATH ERROR: cwd = 0x37f, swd = 0x20
MATH ERROR: cwd = 0x37f, swd = 0x2020
MATH ERROR: cwd = 0x37f, swd = 0x20
MATH ERROR: cwd = 0x37f, swd = 0x1820
MATH ERROR: cwd = 0x37f, swd = 0x1820
MATH ERROR: cwd = 0x37f, swd = 0x2020
MATH ERROR: cwd = 0x37f, swd = 0x20
MATH ERROR: cwd = 0x37f, swd = 0x2800
MATH ERROR: cwd = 0x37f, swd = 0x1820
MATH ERROR: cwd = 0x37f, swd = 0x820
MATH ERROR: cwd = 0x37f, swd = 0x2820
MATH ERROR: cwd = 0x37f, swd = 0x2820
MATH ERROR: cwd = 0x37f, swd = 0x1820
MATH ERROR: cwd = 0x37f, swd = 0x820
MATH ERROR: cwd = 0x37f, swd = 0x1a20

Running prime95 for almost 2 hours:
Torture Test ran 1 hours, 54 minutes - 0 errors, 0 warnings.
and playing some mpeg clips.

-- 
Ondrej Zary


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
@ 2005-08-21  9:47 Chuck Ebbert
  2005-08-21 17:52 ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Chuck Ebbert @ 2005-08-21  9:47 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: linux-kernel, Linus Torvalds, Ingo Molnar

On Thu, 18 Aug 2005 12:37:30 +0200, Ondrej Zary wrote:

> >   Could you modify this to print the full values of cwd and swd like this?
> > 
> >         printk("MATH ERROR: cwd = 0x%hx, swd = 0x%hx\n", cwd, swd);
> > 
> > Then post the result.
> MATH ERROR: cwd = 0x37f, swd = 0x5020
> MATH ERROR: cwd = 0x37f, swd = 0x20
> MATH ERROR: cwd = 0x37f, swd = 0x20
> MATH ERROR: cwd = 0x37f, swd = 0x2020
> MATH ERROR: cwd = 0x37f, swd = 0x20
> MATH ERROR: cwd = 0x37f, swd = 0x1820
> MATH ERROR: cwd = 0x37f, swd = 0x1820
> MATH ERROR: cwd = 0x37f, swd = 0x2020
> MATH ERROR: cwd = 0x37f, swd = 0x20
> MATH ERROR: cwd = 0x37f, swd = 0x2800     <===========
> MATH ERROR: cwd = 0x37f, swd = 0x1820
> MATH ERROR: cwd = 0x37f, swd = 0x820
> MATH ERROR: cwd = 0x37f, swd = 0x2820
> MATH ERROR: cwd = 0x37f, swd = 0x2820
> MATH ERROR: cwd = 0x37f, swd = 0x1820
> MATH ERROR: cwd = 0x37f, swd = 0x820
> MATH ERROR: cwd = 0x37f, swd = 0x1a20

 The error I marked has no exception flags set.  The rest are all (masked)
denormal exceptions.  Why your Cyrix MII would cause an FPU exception in these
cases is beyond me.  Could you try the statically-linked mprime program?

 I had hoped someone who knew more about FPU error handling would jump in.
The below code from arch/i386/kernel/traps.c sends a signal back to
userspace even when the status word shows a masked (or no) exception has
occurred.  The 'case 0x000' strongly suggests this is deliberate but I
don't know why.


        /*
         * (~cwd & swd) will mask out exceptions that are not set to unmasked
         * status.  0x3f is the exception bits in these regs, 0x200 is the
         * C1 reg you need in case of a stack fault, 0x040 is the stack
         * fault bit.  We should only be taking one exception at a time,
         * so if this combination doesn't produce any single exception,
         * then we have a bad program that isn't synchronizing its FPU usage
         * and it will suffer the consequences since we won't be able to
         * fully reproduce the context of the exception
         */
        cwd = get_fpu_cwd(task);
        swd = get_fpu_swd(task);
        switch (((~cwd) & swd & 0x3f) | (swd & 0x240)) {
                case 0x000:
                default:
                        break;
                case 0x001: /* Invalid Op */
                case 0x041: /* Stack Fault */
                case 0x241: /* Stack Fault | Direction */
                        info.si_code = FPE_FLTINV;
                        /* Should we clear the SF or let user space do it ???? */
                        break;


(And it looks like there is a small bug in there.  The switch should be:

        switch (((~cwd) & swd & 0x3f) | (swd & 1 ? swd & 0x240 : 0)) {

because the SF and CC1 bits are only relevant when IE is set.)

__
Chuck


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
  2005-08-21  9:47 Chuck Ebbert
@ 2005-08-21 17:52 ` Linus Torvalds
  2005-08-21 21:28   ` Ondrej Zary
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2005-08-21 17:52 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Ondrej Zary, linux-kernel, Ingo Molnar



On Sun, 21 Aug 2005, Chuck Ebbert wrote:
>
> > MATH ERROR: cwd = 0x37f, swd = 0x2800     <===========
> 
>  The error I marked has no exception flags set.  The rest are all (masked)
> denormal exceptions.  Why your Cyrix MII would cause an FPU exception in these
> cases is beyond me.  Could you try the statically-linked mprime program?

Also, please try this one, to see where it happens.

			Linus

---
diff --git a/arch/i386/kernel/i8259.c b/arch/i386/kernel/i8259.c
--- a/arch/i386/kernel/i8259.c
+++ b/arch/i386/kernel/i8259.c
@@ -357,11 +357,11 @@ void init_8259A(int auto_eoi)
 
 static irqreturn_t math_error_irq(int cpl, void *dev_id, struct pt_regs *regs)
 {
-	extern void math_error(void __user *);
+	extern void math_error(struct pt_regs *);
 	outb(0,0xF0);
 	if (ignore_fpu_irq || !boot_cpu_data.hard_math)
 		return IRQ_NONE;
-	math_error((void __user *)regs->eip);
+	math_error(regs);
 	return IRQ_HANDLED;
 }
 
diff --git a/arch/i386/kernel/traps.c b/arch/i386/kernel/traps.c
--- a/arch/i386/kernel/traps.c
+++ b/arch/i386/kernel/traps.c
@@ -774,8 +774,9 @@ clear_TF_reenable:
  * the correct behaviour even in the presence of the asynchronous
  * IRQ13 behaviour
  */
-void math_error(void __user *eip)
+void math_error(struct pt_regs *regs)
 {
+	void __user *eip = (void __user *)regs->eip;
 	struct task_struct * task;
 	siginfo_t info;
 	unsigned short cwd, swd;
@@ -805,6 +806,7 @@ void math_error(void __user *eip)
 	swd = get_fpu_swd(task);
 	switch (((~cwd) & swd & 0x3f) | (swd & 0x240)) {
 		case 0x000:
+			show_regs(regs);
 		default:
 			break;
 		case 0x001: /* Invalid Op */
@@ -833,7 +835,7 @@ void math_error(void __user *eip)
 fastcall void do_coprocessor_error(struct pt_regs * regs, long error_code)
 {
 	ignore_fpu_irq = 1;
-	math_error((void __user *)regs->eip);
+	math_error(regs);
 }
 
 static void simd_math_error(void __user *eip)


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
  2005-08-21 17:52 ` Linus Torvalds
@ 2005-08-21 21:28   ` Ondrej Zary
  2005-08-21 23:10     ` Linus Torvalds
  0 siblings, 1 reply; 8+ messages in thread
From: Ondrej Zary @ 2005-08-21 21:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chuck Ebbert, linux-kernel, Ingo Molnar

Linus Torvalds wrote:
> 
> On Sun, 21 Aug 2005, Chuck Ebbert wrote:
> 
>>>MATH ERROR: cwd = 0x37f, swd = 0x2800     <===========
>>
>> The error I marked has no exception flags set.  The rest are all (masked)
>>denormal exceptions.  Why your Cyrix MII would cause an FPU exception in these
>>cases is beyond me.  Could you try the statically-linked mprime program?

I use only the statically linked mprime.

> Also, please try this one, to see where it happens.
I did some modification to the code so it calls show_regs() in both 
cases where I get problems and also added the return so it does not 
crash. The code looks like this:
---
         printk("MATH ERROR: cwd = 0x%hx, swd = 0x%hx\n", cwd, swd);
         switch (((~cwd) & swd & 0x3f) | (swd & 0x240)) {
                 case 0x000:
                 case 0x200:
                         show_regs(regs);
                         return;
---
Here are the results.

MATH ERROR: cwd = 0x37f, swd = 0x1820

Pid: 1699, comm:               mprime
EIP: 0073:[<08181c73>] CPU: 0
EIP is at 0x8181c73
  ESP: 007b:bf927ab4 EFLAGS: 00010202    Not tainted  (2.6.12-pentium)
EAX: 00000001 EBX: 00000000 ECX: 0000808d EDX: b7f09480
ESI: b7455340 EDI: 080e01f0 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b7ed6058 CR3: 006f0000 CR4: 00000080
MATH ERROR: cwd = 0x37f, swd = 0x1020

Pid: 1699, comm:               mprime
EIP: 0073:[<0818ca5f>] CPU: 0
EIP is at 0x818ca5f
  ESP: 007b:bf927ab0 EFLAGS: 00010207    Not tainted  (2.6.12-pentium)
EAX: 00000005 EBX: 00000000 ECX: 00008407 EDX: b7f08140
ESI: b789aea0 EDI: b7f08200 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b75c6000 CR3: 006f0000 CR4: 00000080
MATH ERROR: cwd = 0x37f, swd = 0x2820

Pid: 1699, comm:               mprime
EIP: 0073:[<0818c4b1>] CPU: 0
EIP is at 0x818c4b1
  ESP: 007b:bf927ab0 EFLAGS: 00010247    Not tainted  (2.6.12-pentium)
EAX: 00000000 EBX: 00000000 ECX: 0000880f EDX: b7f09480
ESI: b741fc20 EDI: 080e0160 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b75c6000 CR3: 006f0000 CR4: 00000080
MATH ERROR: cwd = 0x37f, swd = 0x20

Pid: 1699, comm:               mprime
EIP: 0073:[<08181ca1>] CPU: 0
EIP is at 0x8181ca1
  ESP: 007b:bf927ab4 EFLAGS: 00010202    Not tainted  (2.6.12-pentium)
EAX: 00000002 EBX: 00000000 ECX: 00000084 EDX: b7f09480
ESI: b74f86c0 EDI: 080e01f0 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b75c6000 CR3: 006f0000 CR4: 00000080
MATH ERROR: cwd = 0x37f, swd = 0x1a20

Pid: 1699, comm:               mprime
EIP: 0073:[<08193c68>] CPU: 0
EIP is at 0x8193c68
  ESP: 007b:bf927ab8 EFLAGS: 00010206    Not tainted  (2.6.12-pentium)
EAX: 00000042 EBX: 00000000 ECX: 00154306 EDX: b7e3ba40
ESI: b7a1e680 EDI: b7e3be40 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b7499000 CR3: 006f0000 CR4: 00000080

MATH ERROR: cwd = 0x37f, swd = 0x20

Pid: 1699, comm:               mprime
EIP: 0073:[<0818de05>] CPU: 0
EIP is at 0x818de05
  ESP: 007b:bf927ab4 EFLAGS: 00010247    Not tainted  (2.6.12-pentium)
EAX: 00000004 EBX: 00000000 ECX: 0000880f EDX: b7f06b40
ESI: b7426400 EDI: 080e1960 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b7499000 CR3: 006f0000 CR4: 00000080
MATH ERROR: cwd = 0x37f, swd = 0x20

Pid: 1699, comm:               mprime
EIP: 0073:[<0818dfe4>] CPU: 0
EIP is at 0x818dfe4
  ESP: 007b:bf927ab4 EFLAGS: 00010247    Not tainted  (2.6.12-pentium)
EAX: 00000200 EBX: 00000000 ECX: 0000880f EDX: b7f06b40
ESI: b742a680 EDI: 080e1c60 EBP: bf927bf8 DS: 007b ES: 007b
CR0: 8005003b CR2: b7499000 CR3: 006f0000 CR4: 00000080

-- 
Ondrej Zary


* Re: FPU-intensive programs crashing with floating point exception on Cyrix MII
  2005-08-21 21:28   ` Ondrej Zary
@ 2005-08-21 23:10     ` Linus Torvalds
  0 siblings, 0 replies; 8+ messages in thread
From: Linus Torvalds @ 2005-08-21 23:10 UTC (permalink / raw)
  To: Ondrej Zary; +Cc: Chuck Ebbert, linux-kernel, Ingo Molnar



On Sun, 21 Aug 2005, Ondrej Zary wrote:
> 
> MATH ERROR: cwd = 0x37f, swd = 0x1820
> 
> Pid: 1699, comm:               mprime
> EIP: 0073:[<08181c73>] CPU: 0
> EIP is at 0x8181c73
>   ESP: 007b:bf927ab4 EFLAGS: 00010202    Not tainted  (2.6.12-pentium)
> EAX: 00000001 EBX: 00000000 ECX: 0000808d EDX: b7f09480
> ESI: b7455340 EDI: 080e01f0 EBP: bf927bf8 DS: 007b ES: 007b
> CR0: 8005003b CR2: b7ed6058 CR3: 006f0000 CR4: 00000080

Ahh, so it's actually all in user space. I was thinking that the Cyrix
chip might use the old external interrupt-based (as opposed to exception
16) FP error reporting, and that it could be some kind of asynchronous
error that raced with the kernel task switching (ie the interrupt had
triggered, and then the FPU control register had been modified before the
irq handler actually got to run).

But that doesn't seem to be the case.

I don't see _why_ that exception would happen, other than a CPU bug.

Can you dump more of the FP state (the kernel doesn't have helpers for
doing that, so you'd have to write the code to print out the state by
hand)? Maybe there's some clue there - denormals or something..

		Linus

