Unexpected behaviour when catching SIGFPE on FPU-less system

* Unexpected behaviour when catching SIGFPE on FPU-less system
@ 2010-05-03  2:17 Shane McDonald
  2010-05-03 20:39 ` Kevin D. Kissell
  2010-05-03 20:47 ` Kevin D. Kissell
  0 siblings, 2 replies; 19+ messages in thread
From: Shane McDonald @ 2010-05-03  2:17 UTC (permalink / raw)
  To: linux-mips

I have run into some strange behaviour involving using the FPU
emulation software in the MIPS kernel when trying to handle
a divide-by-zero-caused floating point exception.

I have come up with a simple test case to demonstrate this problem.
--
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <fenv.h>
#include <setjmp.h>

void fpe_handler(int);

jmp_buf env;

main()
{
	double x;

	feenableexcept( FE_DIVBYZERO );
	signal( SIGFPE, fpe_handler );

	if ( setjmp( env ) == 0 )
	{
		printf( "About to try calculation\n" );

		x = 5.0 / 0.0;
		printf( "Value is %f\n", x );
	}
	else
	{
		printf( "Calculation causes divide by zero\n" );
	}
}

void fpe_handler(int x)
{
	feclearexcept( FE_DIVBYZERO );
	longjmp( env, 1 );
}
--

The program sets up to generate a SIGFPE when a divide-by-zero occurs,
rather than setting the result to infinity.  Then, I've created a
handler to catch the exception, and the end result is to print out
the "Calculation causes divide by zero" message.

I have two MIPS-based systems, both running Debian Etch.  One of the
systems is a PMC-Sierra RM7035C-based system, which includes an FPU.  My
other system is a PMC-Sierra MSP7120-based system, which does not
include an FPU.  The RM7035C system is running the 2.6.34-rc6 kernel,
but the MSP7120 system is running 2.6.28.

When I run this program on the system with the FPU, I see the results
that I expect to see.  The program outputs:

    About to try calculation
    Calculation causes divide by zero

I see the same results when I run the program on an x86 Debian Etch system.

When I run the program on the system without the FPU, I see:

    About to try calculation
    Floating point exception

So, it appears that the floating point exception is not caught.
However, when I run strace, the last few lines of output are:

    old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ace9000
    write(1, "About to try calculation\n"..., 25About to try calculation
    ) = 25
    --- SIGFPE (Floating point exception) @ 0 (0) ---
    --- SIGFPE (Floating point exception) @ 0 (0) ---
    +++ killed by SIGFPE +++

Running it on the system with the FPU, I see:

    old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ace5000
    write(1, "About to try calculation\n"..., 25About to try calculation
    ) = 25
    --- SIGFPE (Floating point exception) @ 0 (0) ---
    write(1, "Calculation causes divide by zero"..., 34Calculation causes divide by zero
    ) = 34
    exit_group(34)                          = ?

After poking around for a while, and trying to account for differences
between the systems (endianness, FPUness, kernel version), I believe the
problem is related to the lack of FPU.  If I run the RM7035C with a
disabled FPU (kernel parameter nofpu), I see the same results as on
the FPU-less MSP7120.  So, I suspect this difference in behaviour
is caused by the FPU emulation software.

Now, I don't know if this is a problem, but it does seem strange.
My level of understanding of the FPU emulation software is very low,
so I'm not quite sure where to look.

This isn't actually something that I typically do.  I noticed this
problem when trying to understand why the Debian package "yorick"
failed to build (see
http://lists.debian.org/debian-mips/2010/04/msg00019.html).

I'd appreciate any insight that anyone can provide.  Thanks!

Shane McDonald

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-03  2:17 Unexpected behaviour when catching SIGFPE on FPU-less system Shane McDonald
@ 2010-05-03 20:39 ` Kevin D. Kissell
  2010-05-03 20:47 ` Kevin D. Kissell
  1 sibling, 0 replies; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-03 20:39 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips



* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-03  2:17 Unexpected behaviour when catching SIGFPE on FPU-less system Shane McDonald
  2010-05-03 20:39 ` Kevin D. Kissell
@ 2010-05-03 20:47 ` Kevin D. Kissell
       [not found]   ` <k2hb2b2f2321005031843l87f39f36h960153cae3ec5020@mail.gmail.com>
  1 sibling, 1 reply; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-03 20:47 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips

Sorry about my previous message having escaped with no value added.

I think you need to look at just what it is that your feclearexcept() 
does.  From the strace information, it looks as if it may be that the 
FPU emulator is erroneously throwing an exception in response to some 
manipulation of the emulated FPU registers by feclearexcept(), so that 
it's taking a second FP exception within the signal handler.  That's the 
simplest explanation for the macroscopic behavior, anyway.

          Regards,

          Kevin K.



* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
       [not found]   ` <k2hb2b2f2321005031843l87f39f36h960153cae3ec5020@mail.gmail.com>
@ 2010-05-04  2:04     ` Kevin D. Kissell
       [not found]       ` <n2pb2b2f2321005032049h56cd72ceh3ac7120c547b59c5@mail.gmail.com>
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04  2:04 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips

Shane McDonald wrote:
> Hi Kevin:
>
> On Mon, May 3, 2010 at 2:47 PM, Kevin D. Kissell <kevink@paralogos.com> wrote:
>   
>> Sorry about my previous message having escaped with no value added.
>>
>> I think you need to look at just what it is that your feclearexcept() does.
>>  From the strace information, it looks as if it may be that the FPU emulator
>> is erroneously throwing an exception in response to some manipulation of the
>> emulated FPU registers by feclearexcept(), so that it's taking a second FP
>> exception within the signal handler.  That's the simplest explanation for
>> the macroscopic behavior, anyway.
>>
>>         Regards,
>>
>>         Kevin K.
>>     
>
> Commenting out the feclearexcept() line gives the same result:
>
>      old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2ace9000
>      write(1, "About to try calculation\n"..., 25About to try calculation
>      ) = 25
>      --- SIGFPE (Floating point exception) @ 0 (0) ---
>      --- SIGFPE (Floating point exception) @ 0 (0) ---
>      +++ killed by SIGFPE +++
>
> So, it must not be the feclearexcept() causing the problem.
>   
Well, that nested floating point exception must be coming from 
*somewhere*.  If it's not library code being betrayed by the emulator, 
perhaps some kernel-mode code is being invoked which is carelessly 
assuming the existence of a hardware FPU and throwing us back into the 
exception handler. If it was me, at this point, I'd turn on some kind of 
logging of FP exception PCs to see where that second one is coming from.

There was a time when I had the necessary equipment on my desk to hunt 
this down and kill it, out of a lingering sense of responsibility for 
having bolted that FPE into the kernel way back when.  I no longer have 
that setup, so I'm free to speculate. ;o)

          Regards,

          Kevin K.


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
       [not found]       ` <n2pb2b2f2321005032049h56cd72ceh3ac7120c547b59c5@mail.gmail.com>
@ 2010-05-04  4:35         ` Shane McDonald
  2010-05-04  6:56           ` Shane McDonald
  0 siblings, 1 reply; 19+ messages in thread
From: Shane McDonald @ 2010-05-04  4:35 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips

On Mon, May 3, 2010 at 9:49 PM, Shane McDonald <mcdonald.shane@gmail.com> wrote:
> Looking at env[0], I see that the __fpc_csr field has a value of 1024,
> indicating a divide-by-zero.  As soon as that ctc1 instruction
> is executed, the exception is raised.  I guess that makes
> sense, but I don't understand why __fpc_csr has a value of 1024.
> When I step through the call to setjmp(), it gets set to a value of 0.
> In longjmp(), every other field in env[0] has the value that it was
> set to in the call to setjmp().

Wait, I take that back -- I was looking at the wrong env[0] variable!
I can see that __fpc_csr actually does have a value of 1024 when
I call setjmp(), and that's why longjmp() is setting the FCSR
register to indicate divide-by-zero.  If I comment out my call to
feenableexcept( FE_DIVBYZERO ), it is set to 0; if I include that call,
it is set to 1024.

Looking further, I also see that I confused the Cause bits and the
Enable bits of the FCSR -- the Enable divide-by-zero bit is set,
not the Cause bit.  Clearly, the call to feenableexcept() must
be setting that bit.  But, it no longer makes sense that an exception
is raised when the FCSR register is restored to the value 1024.
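
To make the Cause/Enable distinction concrete, here is a minimal C sketch
that reports where the divide-by-zero (Z) bit is set in an FCSR value.  The
bit positions assume the usual MIPS FCSR layout (Flags in bits 2-6, Enables
in bits 7-11, Cause in bits 12-17); the macro names and the helper
fcsr_decode_z() are made up for illustration and are not taken from glibc
or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed positions of the divide-by-zero (Z) bit in each FCSR field. */
#define FCSR_FLAG_Z   (1u << 5)   /* sticky Flags field  */
#define FCSR_ENABLE_Z (1u << 10)  /* Enables field       */
#define FCSR_CAUSE_Z  (1u << 15)  /* Cause field         */

/* Which of the three fields has its Z bit set (hypothetical helper type). */
struct fcsr_z_bits {
	int cause;
	int enable;
	int flag;
};

/* Decode the Z bit of the Cause, Enables and Flags fields of an FCSR value. */
static struct fcsr_z_bits fcsr_decode_z(uint32_t fcsr)
{
	struct fcsr_z_bits z;

	z.cause  = (fcsr & FCSR_CAUSE_Z)  != 0;
	z.enable = (fcsr & FCSR_ENABLE_Z) != 0;
	z.flag   = (fcsr & FCSR_FLAG_Z)   != 0;
	return z;
}

/* Self-check against the value discussed above: 1024 is Enable Z only. */
static void fcsr_check_1024(void)
{
	struct fcsr_z_bits z = fcsr_decode_z(1024);

	assert(z.enable == 1 && z.cause == 0 && z.flag == 0);
}
```

Under that layout, 1024 (0x400) decodes to the Enable bit alone, which is
what feenableexcept( FE_DIVBYZERO ) would be expected to set.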

Shane


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04  4:35         ` Shane McDonald
@ 2010-05-04  6:56           ` Shane McDonald
  2010-05-04  7:13             ` Shane McDonald
  2010-05-04 11:16             ` Kevin D. Kissell
  0 siblings, 2 replies; 19+ messages in thread
From: Shane McDonald @ 2010-05-04  6:56 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips

OK, I think I've found the line that's causing me the problem.

On Mon, May 3, 2010 at 10:35 PM, Shane McDonald
<mcdonald.shane@gmail.com> wrote:
> On Mon, May 3, 2010 at 9:49 PM, Shane McDonald <mcdonald.shane@gmail.com> wrote:
>> Looking at env[0], I see that the __fpc_csr field has a value of 1024,
>> indicating a divide-by-zero.  As soon as that ctc1 instruction
>> is executed, the exception is raised.  I guess that makes
>> sense, but I don't understand why __fpc_csr has a value of 1024.
>> When I step through the call to setjmp(), it gets set to a value of 0.
>> In longjmp(), every other field in env[0] has the value that it was
>> set to in the call to setjmp().
>
> Wait, I take that back -- I was looking at the wrong env[0] variable!
> I can see that __fpc_csr actually does have a value of 1024 when
> I call setjmp(), and that's why longjmp() is setting the FCSR
> register to indicate divide-by-zero.  If I comment out my call to
> feenableexcept( FE_DIVBYZERO ), it is set to 0; if I include that call,
> it is set to 1024.
>
> Looking further, I also see that I confused the Cause bits and the
> Enable bits of the FCSR -- the Enable divide-by-zero bit is set,
> not the Cause bit.  Clearly, the call to feenableexcept() must
> be setting that bit.  But, it no longer makes sense that an exception
> is raised when the FCSR register is restored to the value 1024.

When I'm inside my handler, I see the FCSR register has the value 0x8420,
indicating that the Z bit is set in each of the Cause, Enables, and Flags
fields.  When longjmp() is called, it tries to write the old FCSR value
of 0x400 (just the Z bit of the Enables field).  In the emulation code,
at lines 392 - 394 of file cp1emu.c, is the code:

    if ((ctx->fcr31 >> 5) & ctx->fcr31 & FPU_CSR_ALL_E) {
            return SIGFPE;
    }

Given the original FCSR value of 0x8420 and the new value to set
of 0x400, the Z bit of the Cause field is still set, and as a result, the
above code causes the SIGFPE exception to be thrown.

Now that I've figured that out, I have to admit that I don't know
if the emulator has the proper behaviour, or if not, what the fix is.
Kevin, what do you think?

Shane


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04  6:56           ` Shane McDonald
@ 2010-05-04  7:13             ` Shane McDonald
  2010-05-04 11:16             ` Kevin D. Kissell
  1 sibling, 0 replies; 19+ messages in thread
From: Shane McDonald @ 2010-05-04  7:13 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips

Sorry to the list for all the noise...

One final data point:

On Tue, May 4, 2010 at 12:56 AM, Shane McDonald
<mcdonald.shane@gmail.com> wrote:
> When I'm inside my handler, I see the FCSR register has the value 0x8420,

On a machine with an FPU, when I'm inside the handler, the FCSR register
seems to have the value 0x400 (no Causes or Flags bits set),
rather than the 0x8420 that the FP emulator has.

Shane


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04  6:56           ` Shane McDonald
  2010-05-04  7:13             ` Shane McDonald
@ 2010-05-04 11:16             ` Kevin D. Kissell
  2010-05-04 12:56               ` Shane McDonald
  1 sibling, 1 reply; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04 11:16 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips

Shane McDonald wrote:
> OK, I think I've found the line that's causing me the problem.
>
> On Mon, May 3, 2010 at 10:35 PM, Shane McDonald
> <mcdonald.shane@gmail.com> wrote:
>   
>> On Mon, May 3, 2010 at 9:49 PM, Shane McDonald <mcdonald.shane@gmail.com> wrote:
>>     
>>> Looking at env[0], I see that the __fpc_csr field has a value of 1024,
>>> indicating a divide-by-zero.  As soon as that ctc1 instruction
>>> is executed, the exception is raised.  I guess that makes
>>> sense, but I don't understand why __fpc_csr has a value of 1024.
>>> When I step through the call to setjmp(), it gets set to a value of 0.
>>> In longjmp(), every other field in env[0] has the value that it was
>>> set to in the call to setjmp().
>>>       
>> Wait, I take that back -- I was looking at the wrong env[0] variable!
>> I can see that __fpc_csr actually does have a value of 1024 when
>> I call setjmp(), and that's why longjmp() is setting the FCSR
>> register to indicate divide-by-zero.  If I comment out my call to
>> feenableexcept( FE_DIVBYZERO ), it is set to 0; if I include that call,
>> it is set to 1024.
>>
>> Looking further, I also see that I confused the Cause bits and the
>> Enable bits of the FCSR -- the Enable divide-by-zero bit is set,
>> not the Cause bit.  Clearly, the call to feenableexcept() must
>> be setting that bit.  But, it no longer makes sense that an exception
>> is raised when the FCSR register is restored to the value 1024.
>>     
>
> When I'm inside my handler, I see the FCSR register has the value 0x8420,
> indicating that the Z bit is set in each of the Cause, Enables, and Flags
> fields.  When longjmp() is called, it tries to write the old FCSR value
> of 0x400 (just the Z bit of the Enables field).  In the emulation code,
> at lines 392 - 394 of file cp1emu.c, is the code:
>
>     if ((ctx->fcr31 >> 5) & ctx->fcr31 & FPU_CSR_ALL_E) {
>             return SIGFPE;
>     }
>
> Given the original FCSR value of 0x8420 and the new value to set
> of 0x400, the Z bit of the Cause field is still set, and as a result, the
> above code causes the SIGFPE exception to be thrown.
>   
That's not how I read the code.  If ctx->fcr31 is 0x400, then the result
of the AND should be zero.
> Now that I've figured that out, I have to admit that I don't know
> if the emulator has the proper behaviour, or if not, what the fix is.
> Kevin, what do you think?
>   
I don't know where the bug is, but it doesn't look to be here.  I wonder
if someone hasn't added some code somewhere that does an extra
save/restore of the FCSR from the kernel stack, so that the explicit
write to clear the exception is undone by the restore from the stack
being emulated.  I note that there's a __build_clear_fpe macro that now
appears to clear the status bits of a real FPU on entry to the FPU
exception handler, but that there's nothing analogous which clears the
bits of the emulated register file in the emulated exception case -
because, after all, there's no new exception, just an invocation of
signal logic within the coprocessor unavailable handling of the
emulator.  That's presumably the cause of the different values you see
in the signal handlers, and very possibly a reason why we only see the
failure with the emulator.

I don't remember the name of the thing, but when I was with MIPS, there
was an old gradware test program that we used to test IEEE compliance of
the FPU and emulator.  Does this still pass with the emulator on the
kernels showing this bug?  If it does, the problem is subtle and the
test needs to be enhanced.  If it doesn't, whoever wrenches on anything
to do with the FPU in the kernel should really be running it before
committing.

          Regards,

          Kevin K.


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 11:16             ` Kevin D. Kissell
@ 2010-05-04 12:56               ` Shane McDonald
  2010-05-04 16:13                 ` Kevin D. Kissell
  0 siblings, 1 reply; 19+ messages in thread
From: Shane McDonald @ 2010-05-04 12:56 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: linux-mips

On Tue, May 4, 2010 at 5:16 AM, Kevin D. Kissell <kevink@paralogos.com> wrote:
> Shane McDonald wrote:
>> When I'm inside my handler, I see the FCSR register has the value 0x8420,
>> indicating that the Z bit is set in each of the Cause, Enables, and Flags
>> fields.  When longjmp() is called, it tries to write the old FCSR value
>> of 0x400 (just the Z bit of the Enables field).  In the emulation code,
>> at lines 392 - 394 of file cp1emu.c, is the code:
>>
>>     if ((ctx->fcr31 >> 5) & ctx->fcr31 & FPU_CSR_ALL_E) {
>>             return SIGFPE;
>>     }
>>
>> Given the original FCSR value of 0x8420 and the new value to set
>> of 0x400, the Z bit of the Cause field is still set, and as a result, the
>> above code causes the SIGFPE exception to be thrown.
>>
> That's not how I read the code.  If ctx->fcr31 is 0x400, then the result
> of the AND should be zero.

Sorry, I should have been more clear.

In the following chunk of code from cp1emu.c:

                case ctc_op:{
                        /* copregister rd <- rt */
                        u32 value;

                        if (MIPSInst_RT(ir) == 0)
                                value = 0;
                        else
                                value = xcp->regs[MIPSInst_RT(ir)];

                        /* we only have one writable control reg
                         */
                        if (MIPSInst_RD(ir) == FPCREG_CSR) {
#ifdef CSRTRACE
                                printk("%p gpr[%d]->csr=%08x\n",
                                        (void *) (xcp->cp0_epc),
                                        MIPSInst_RT(ir), value);
#endif
                                value &= (FPU_CSR_FLUSH | FPU_CSR_ALL_E |
                                          FPU_CSR_ALL_S | 0x03);
                                ctx->fcr31 &= ~(FPU_CSR_FLUSH | FPU_CSR_ALL_E |
                                                FPU_CSR_ALL_S | 0x03);
                                /* convert to ieee library modes */
                                ctx->fcr31 |= (value & ~0x3) |
                                              ieee_rm[value & 0x3];
                        }
                        if ((ctx->fcr31 >> 5) & ctx->fcr31 & FPU_CSR_ALL_E) {
                                return SIGFPE;
                        }
                        break;

value gets set to an initial value of 0x400, and ctx->fcr31
comes in with an initial value of 0x8420.
By the time we hit the if statement around the return SIGFPE, ctx->fcr31
has been set to 0x8400, not the 0x400 I implied.
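
The arithmetic can be replayed in user space.  The following is only a
sketch of the quoted ctc_op logic, not the kernel code itself: the mask
values are my assumption of what the FPU_CSR_* constants expand to,
FPU_CSR_RM stands in for the literal 0x03, the helper names are invented,
and the ieee_rm[] rounding-mode translation is left out (it does not matter
here because the rounding-mode bits are 0 in both values).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values of the kernel's FCSR masks (illustrative only). */
#define FPU_CSR_FLUSH 0x01000000u  /* flush-to-zero bit             */
#define FPU_CSR_ALL_E 0x00000f80u  /* Enables field, bits 7..11     */
#define FPU_CSR_ALL_S 0x0000007cu  /* sticky Flags field, bits 2..6 */
#define FPU_CSR_RM    0x00000003u  /* rounding mode, bits 0..1      */

/*
 * Mimic the emulated ctc1 update: only Flush, Enables, Flags and the
 * rounding mode are taken from the new value; every other bit of the
 * old fcr31, including the Cause field, is kept.
 */
static uint32_t ctc1_merge(uint32_t old_fcr31, uint32_t value)
{
	const uint32_t writable = FPU_CSR_FLUSH | FPU_CSR_ALL_E |
				  FPU_CSR_ALL_S | FPU_CSR_RM;

	return (old_fcr31 & ~writable) | (value & writable);
}

/* The check that follows: any Cause bit whose Enable bit is also set? */
static int ctc1_raises_sigfpe(uint32_t fcr31)
{
	/* Cause sits 5 bits above Enables, so the shift lines them up. */
	return ((fcr31 >> 5) & fcr31 & FPU_CSR_ALL_E) != 0;
}

/* Replay the values from this thread. */
static void ctc1_replay(void)
{
	uint32_t merged = ctc1_merge(0x8420, 0x400);

	assert(merged == 0x8400);            /* Cause Z survives the write  */
	assert(ctc1_raises_sigfpe(merged));  /* so the check fires          */
	assert(!ctc1_raises_sigfpe(0x400));  /* a plain 0x400 would not     */
}
```

This matches both readings: with fcr31 equal to 0x400 the AND is zero, but
after the merge the register holds 0x8400 and the AND is non-zero.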

Nevertheless, that's not the problem.  You've given me some good pointers
for where to begin searching for the problem.

If anyone out there has a verification suite they can run on the emulator,
that would be much appreciated!

Shane


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 12:56               ` Shane McDonald
@ 2010-05-04 16:13                 ` Kevin D. Kissell
  2010-05-04 18:44                   ` Ralf Baechle
                                     ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04 16:13 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips

Shane McDonald wrote:
>
> In the following chunk of code from cp1emu.c:
>   
[snip]
> value gets set to an initial value of 0x400, and ctx->fcr31
> comes in with an initial value of 0x8420.
> By the time we hit the if statement around the return SIGFPE, ctx->fcr31
> has been set to 0x8400, not the 0x400 I implied.
>   
Ah, well that would rather change things, and you *would* get an
exception there.  As written, the code doesn't seem to allow the pending
exception (.._X) bits to be cleared by the CTC.
> Nevertheless, that's not the problem.  
Maybe it is.  I don't have my MIPS specs handy anymore, but just what is
supposed to clear a pending exception bit in a real FPU?
> You've given me some good pointers
> for where to begin searching for the problem.
>
> If anyone out there has a verification suite they can run on the emulator,
> that would be much appreciated!
>   
What we used to use was what I *thought* was an old public domain
program whose name was an English word that had something to do with
being exacting.  Googling with obvious keywords didn't turn it up.

          Regards,

          Kevin K.


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 16:13                 ` Kevin D. Kissell
@ 2010-05-04 18:44                   ` Ralf Baechle
  2010-05-04 18:58                     ` Kevin D. Kissell
  2010-05-04 19:28                     ` Geert Uytterhoeven
  2010-05-04 18:55                   ` Kevin D. Kissell
  2010-05-04 21:52                   ` Kevin D. Kissell
  2 siblings, 2 replies; 19+ messages in thread
From: Ralf Baechle @ 2010-05-04 18:44 UTC (permalink / raw)
  To: Kevin D. Kissell; +Cc: Shane McDonald, linux-mips

On Tue, May 04, 2010 at 09:13:18AM -0700, Kevin D. Kissell wrote:

> What we used to use was what I *thought* was an old public domain
> program whose name was an English word that had something to do with
> being exacting.  Googling with obvious keywords didn't turn it up.

Is it paranoia by any chance?  Paranoia is available as single files at:

  http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
  http://www.math.utah.edu/~beebe/software/ieee/paranoia.h

It's been ages since somebody last ran it, but the last known status is
that there were no paranoia faults.

  Ralf


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 16:13                 ` Kevin D. Kissell
  2010-05-04 18:44                   ` Ralf Baechle
@ 2010-05-04 18:55                   ` Kevin D. Kissell
  2010-05-04 21:52                   ` Kevin D. Kissell
  2 siblings, 0 replies; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04 18:55 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips

Kevin D. Kissell wrote:
> Shane McDonald wrote:
>   
>> In the following chunk of code from cp1emu.c:
>>   
>>     
> [snip]
>   
>> value gets set to an initial value of 0x400, and ctx->fcr31
>> comes in with an initial value of 0x8420.
>> By the time we hit the if statement around the return SIGFPE, ctx->fcr31
>> has been set to 0x8400, not the 0x400 I implied.
>>   
>>     
> Ah, well that would rather change things, and you *would* get an
> exception there.  As written, the code doesn't seem to allow the pending
> exception (.._X) bits to be cleared by the CTC.
>   
>> Nevertheless, that's not the problem.  
>>     
> Maybe it is.  I don't have my MIPS specs handy anymore, but just what is
> supposed to clear a pending exception bit in a real FPU?
>   
From old-ish MIPS32 specs out there on the web, it looks like the 
emulator was doing the right thing in raising the exception - it's 
specifically called out in the CTC1 definition that writing a value with 
both a Cause and an Enable (_X and _E) bit set will throw an exception.  
The question is:  Why wasn't the Cause bit cleared?  As I mentioned last 
night, in current kernels running on a real FPU, it gets cleared as part 
of the assembly-language preamble to servicing a FPU exception, a path 
which is definitely not taken in the emulator case, which is driven by 
coprocessor unusable exceptions.   So now I'm actually confused by two 
things:  One is where the emulator *should* have its _X flags cleared, 
and the other is how the current kernel/signal code communicates the 
nature of a floating point exception to the user.  I had thought that 
either we had a model where a SIGFPE signal carried the FPCR bits as 
part of its payload (something I've done for other architectures and 
could have sworn I'd done for MIPS at one point or another), or that the 
signal handler can inspect the FPCR to know what kind of exception it 
was.  As near as I can tell, when there's a real FPU, we wipe out the 
evidence before we save the context.
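
The CTC1 rule described above can be sketched in C.  This is only an
illustration under stated assumptions: the macro and function names are
made up for this example (they are not the kernel's), and the bit
positions are taken from the MIPS32 FCSR layout (Enables in bits 11:7,
Cause in bits 17:12, with Cause E having no matching Enable bit).

```c
#include <stdint.h>

/* Hypothetical names for illustration; positions follow the MIPS32 FCSR. */
#define FCSR_ENABLE_MASK 0x00000f80u /* Enables V/Z/O/U/I, bits 11:7 */
#define FCSR_CAUSE_MASK  0x0001f000u /* Cause V/Z/O/U/I, bits 16:12 */
#define FCSR_CAUSE_E     0x00020000u /* Cause E (unimplemented), bit 17 */

/*
 * Return 1 if writing this value to the FCSR should raise an FP
 * exception, i.e. some Cause bit is set together with its Enable bit,
 * or Cause E is set (it always traps).
 */
int fcsr_trap_pending(uint32_t fcsr)
{
	/* Each Cause bit sits 5 positions above its Enable bit. */
	uint32_t cause_as_enable = (fcsr & FCSR_CAUSE_MASK) >> 5;

	if (fcsr & FCSR_CAUSE_E)
		return 1;

	return (cause_as_enable & fcsr & FCSR_ENABLE_MASK) != 0;
}
```

With the values reported earlier in the thread, fcsr_trap_pending(0x8400)
is nonzero (Cause Z and Enable Z are both set), while
fcsr_trap_pending(0x400) is zero (Enable Z only, nothing pending).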

          Regards,

          Kevin K.


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 18:44                   ` Ralf Baechle
@ 2010-05-04 18:58                     ` Kevin D. Kissell
  2010-05-04 19:28                     ` Geert Uytterhoeven
  1 sibling, 0 replies; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04 18:58 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Shane McDonald, linux-mips

Ralf Baechle wrote:
> On Tue, May 04, 2010 at 09:13:18AM -0700, Kevin D. Kissell wrote:
>
>   
>> What we used to use was what I *thought* was an old public domain
>> program whose name was an English word that had something to do with
>> being exacting.  Googling with obvious keywords didn't turn it up.
>>     
>
> Is it paranoia by any chance?  Paranoia is available as single files at:
>
>   http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
>   http://www.math.utah.edu/~beebe/software/ieee/paranoia.h
>
> It's been ages since somebody last ran it, but the last known status is
> that there were no paranoia faults.
>   
Yes, that's it!  I used to run it all the time when I was working on 
SMTC FPU support. Shane, does it work on your system with the emulator?

/K.


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 18:44                   ` Ralf Baechle
  2010-05-04 18:58                     ` Kevin D. Kissell
@ 2010-05-04 19:28                     ` Geert Uytterhoeven
  2010-05-04 19:30                       ` Manuel Lauss
  1 sibling, 1 reply; 19+ messages in thread
From: Geert Uytterhoeven @ 2010-05-04 19:28 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: Kevin D. Kissell, Shane McDonald, linux-mips

On Tue, May 4, 2010 at 20:44, Ralf Baechle <ralf@linux-mips.org> wrote:
> On Tue, May 04, 2010 at 09:13:18AM -0700, Kevin D. Kissell wrote:
>
>> What we used to use was what I *thought* was an old public domain
>> program whose name was an English word that had something to do with
>> being exacting.  Googling with obvious keywords didn't turn it up.
>
> Is it paranoia by any chance?  Paranoia is available as single files at:
>
>  http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
>  http://www.math.utah.edu/~beebe/software/ieee/paranoia.h

You also need

http://www.math.utah.edu/~beebe/software/ieee/args.h

Ran fine on:
  - Toshiba RBTX4927 (with FPU :-)
  - Mikrotik RouterBOARD 150 (without FPU), using an older 2.6.x OpenWRT kernel

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 19:28                     ` Geert Uytterhoeven
@ 2010-05-04 19:30                       ` Manuel Lauss
  2010-05-04 19:44                         ` Geert Uytterhoeven
  2010-05-04 20:01                         ` David Daney
  0 siblings, 2 replies; 19+ messages in thread
From: Manuel Lauss @ 2010-05-04 19:30 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Ralf Baechle, Kevin D. Kissell, Shane McDonald, linux-mips

On Tue, May 4, 2010 at 9:28 PM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
> On Tue, May 4, 2010 at 20:44, Ralf Baechle <ralf@linux-mips.org> wrote:
>> On Tue, May 04, 2010 at 09:13:18AM -0700, Kevin D. Kissell wrote:
>>
>>> What we used to use was what I *thought* was an old public domain
>>> program whose name was an English word that had something to do with
>>> being exacting.  Googling with obvious keywords didn't turn it up.
>>
>> Is it paranoia by any chance?  Paranoia is available as single files at:
>>
>>  http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
>>  http://www.math.utah.edu/~beebe/software/ieee/paranoia.h
>
> You also need
>
> http://www.math.utah.edu/~beebe/software/ieee/args.h
>
> Ran fine on:
>  - Toshiba RBTX4927 (with FPU :-)
>  - Mikrotik RouterBOARD 150 (without FPU), using an older 2.6.x OpenWRT kernel

and runs into an endless loop around line 806 when built with
a softfloat toolchain (gcc-4.4.3).

Manuel


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 19:30                       ` Manuel Lauss
@ 2010-05-04 19:44                         ` Geert Uytterhoeven
  2010-05-04 20:01                         ` David Daney
  1 sibling, 0 replies; 19+ messages in thread
From: Geert Uytterhoeven @ 2010-05-04 19:44 UTC (permalink / raw)
  To: Manuel Lauss; +Cc: Ralf Baechle, Kevin D. Kissell, Shane McDonald, linux-mips

On Tue, May 4, 2010 at 21:30, Manuel Lauss <manuel.lauss@googlemail.com> wrote:
> On Tue, May 4, 2010 at 9:28 PM, Geert Uytterhoeven <geert@linux-m68k.org> wrote:
>> On Tue, May 4, 2010 at 20:44, Ralf Baechle <ralf@linux-mips.org> wrote:
>>> On Tue, May 04, 2010 at 09:13:18AM -0700, Kevin D. Kissell wrote:
>>>
>>>> What we used to use was what I *thought* was an old public domain
>>>> program whose name was an English word that had something to do with
>>>> being exacting.  Googling with obvious keywords didn't turn it up.
>>>
>>> Is it paranoia by any chance?  Paranoia is available as single files at:
>>>
>>>  http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
>>>  http://www.math.utah.edu/~beebe/software/ieee/paranoia.h
>>
>> You also need
>>
>> http://www.math.utah.edu/~beebe/software/ieee/args.h
>>
>> Ran fine on:
>>  - Toshiba RBTX4927 (with FPU :-)
>>  - Mikrotik RouterBOARD 150 (without FPU), using an older 2.6.x OpenWRT kernel
>
> and runs into an endless loop around line 806 when built with
> a softfloat toolchain (gcc-4.4.3).

I used my kernel cross-toolchain (gcc version 4.1.2 20061115
(prerelease) (Ubuntu 4.1.1-21)), with Debian libs, and -static to make
it run on the RB150.

I retried with the OpenWRT toolchain (also 4.1.2, presumably softfloat)
I still had lying around, and it worked, too. I got small differences in
the last 2 digits, though, so I guess it actually is softfloat.

Gr{oetje,eeting}s,

						Geert



* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 19:30                       ` Manuel Lauss
  2010-05-04 19:44                         ` Geert Uytterhoeven
@ 2010-05-04 20:01                         ` David Daney
  2010-05-04 21:23                           ` Kevin D. Kissell
  1 sibling, 1 reply; 19+ messages in thread
From: David Daney @ 2010-05-04 20:01 UTC (permalink / raw)
  To: Manuel Lauss
  Cc: Geert Uytterhoeven, Ralf Baechle, Kevin D. Kissell,
	Shane McDonald, linux-mips

On 05/04/2010 12:30 PM, Manuel Lauss wrote:
> On Tue, May 4, 2010 at 9:28 PM, Geert Uytterhoeven<geert@linux-m68k.org>  wrote:
>> On Tue, May 4, 2010 at 20:44, Ralf Baechle<ralf@linux-mips.org>  wrote:
>>> On Tue, May 04, 2010 at 09:13:18AM -0700, Kevin D. Kissell wrote:
>>>
>>>> What we used to use was what I *thought* was an old public domain
>>>> program whose name was an English word that had something to do with
>>>> being exacting.  Googling with obvious keywords didn't turn it up.
>>>
>>> Is it paranoia by any chance?  Paranoia is available as single files at:
>>>
>>>   http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
>>>   http://www.math.utah.edu/~beebe/software/ieee/paranoia.h
>>
>> You also need
>>
>> http://www.math.utah.edu/~beebe/software/ieee/args.h
>>
>> Ran fine on:
>>   - Toshiba RBTX4927 (with FPU :-)
>>   - Mikrotik RouterBOARD 150 (without FPU), using an older 2.6.x OpenWRT kernel
>
> and runs into an endless loop around line 806 when built with
> a softfloat toolchain (gcc-4.4.3).
>

From the point of view of this specific problem, using a softfloat 
toolchain isn't what you want to do.

The question is whether the kernel's FP emulator is operating correctly.  
If you never execute any FP instructions (because you built with a 
softfloat toolchain), you are not testing the emulator.

David Daney


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 20:01                         ` David Daney
@ 2010-05-04 21:23                           ` Kevin D. Kissell
  0 siblings, 0 replies; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04 21:23 UTC (permalink / raw)
  To: David Daney
  Cc: Manuel Lauss, Geert Uytterhoeven, Ralf Baechle, Shane McDonald,
	linux-mips

David Daney wrote:
> On 05/04/2010 12:30 PM, Manuel Lauss wrote:
>> On Tue, May 4, 2010 at 9:28 PM, Geert Uytterhoeven<geert@linux-m68k.org>  wrote:
>>> On Tue, May 4, 2010 at 20:44, Ralf Baechle<ralf@linux-mips.org>  wrote:
>>>> Is it paranoia by any chance?  Paranoia is available as single files at:
>>>>
>>>>   http://www.math.utah.edu/~beebe/software/ieee/paranoia.c
>>>>   http://www.math.utah.edu/~beebe/software/ieee/paranoia.h
>>>
>>> You also need
>>>
>>> http://www.math.utah.edu/~beebe/software/ieee/args.h
>>>
>>> Ran fine on:
>>>   - Toshiba RBTX4927 (with FPU :-)
>>>   - Mikrotik RouterBOARD 150 (without FPU), using an older 2.6.x OpenWRT kernel
>>
>> and runs into an endless loop around line 806 when built with
>> a softfloat toolchain (gcc-4.4.3).
>>
>
> From the point of view of this specific problem, using a softfloat 
> toolchain isn't what you want to do.
That's absolutely true.  I would mention, however, that in ancient 
times, I built and ran paranoia with a couple of different softfloat 
libraries, and was able to make it work with them.  If people care about 
softfloat (and I think they should), this should probably be 
investigated.  Easy for me to say, though... ;o)

          Regards,

          Kevin K.


* Re: Unexpected behaviour when catching SIGFPE on FPU-less system
  2010-05-04 16:13                 ` Kevin D. Kissell
  2010-05-04 18:44                   ` Ralf Baechle
  2010-05-04 18:55                   ` Kevin D. Kissell
@ 2010-05-04 21:52                   ` Kevin D. Kissell
  2 siblings, 0 replies; 19+ messages in thread
From: Kevin D. Kissell @ 2010-05-04 21:52 UTC (permalink / raw)
  To: Shane McDonald; +Cc: linux-mips

Kevin D. Kissell wrote:
> Shane McDonald wrote:
>   
>> In the following chunk of code from cp1emu.c:
>>   
>>     
> [snip]
>   
>> value gets set to an initial value of 0x400, and ctx->fcr31
>> comes in with an initial value of 0x8420.
>> By the time we hit the if statement around the return SIGFPE, ctx->fcr31
>> has been set to 0x8400, not the 0x400 I implied.
>>   
>>     
> Ah, well that would rather change things, and you *would* get an
> exception there.  As written, the code doesn't seem to allow the pending
> exception (.._X) bits to be cleared by the CTC.
>   
>> Nevertheless, that's not the problem.  
>>     
> Maybe it is. 
OK, sorry to have been looking at this in fits and starts, but I 
submit that the bug is indeed in the ctc_op: case of the emulator.  
The Cause bits (17:12) are supposed to be writable by that instruction, 
but the CTC1 emulation won't let the instruction update them.  I 
don't have the means to generate, test, and submit a proper patch, but I 
think that if you just completely removed lines 387-388:


value &= (FPU_CSR_FLUSH | FPU_CSR_ALL_E | FPU_CSR_ALL_S | 0x03);
ctx->fcr31 &= ~(FPU_CSR_FLUSH | FPU_CSR_ALL_E | FPU_CSR_ALL_S | 0x03);

things would work a good deal better.  At least, it would be a more 
accurate emulation of the architecturally defined FPU.  If I wanted to 
be really, really pedantic (which I sometimes do), I'd also protect the 
reserved bits that aren't necessarily writable, so we'd nuke those two 
lines, then have

/* Don't write reserved bits, and convert to ieee library modes */
ctx->fcr31 = (value & ~0x1c0003) | ieee_rm[value & 0x3];

Note that I've changed the existing |= to a direct assignment here.
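
The difference between the existing merge and the suggested direct
assignment can be sketched as a small user-space model.  It is a sketch
under stated assumptions, not kernel code: the function names and the
FCSR_* macros are hypothetical, the mask values are taken from the
MIPS32 FCSR layout (FS in bit 24, Enables in bits 11:7, Flags in bits
6:2), and the kernel's ieee_rm[] rounding-mode table is replaced by an
identity placeholder.

```c
#include <stdint.h>

/* Hypothetical stand-ins for FPU_CSR_FLUSH, FPU_CSR_ALL_E, FPU_CSR_ALL_S. */
#define FCSR_FLUSH 0x01000000u /* FS, bit 24 */
#define FCSR_ALL_E 0x00000f80u /* Enables, bits 11:7 */
#define FCSR_ALL_S 0x0000007cu /* Flags, bits 6:2 */
#define FCSR_WRITABLE_OLD (FCSR_FLUSH | FCSR_ALL_E | FCSR_ALL_S | 0x03u)

/* Assumption: identity mapping in place of the kernel's ieee_rm[] table. */
static const uint32_t rm_placeholder[4] = { 0, 1, 2, 3 };

/* Model of the existing behaviour: Cause bits in fcr31 survive the write. */
uint32_t ctc1_existing(uint32_t fcr31, uint32_t value)
{
	value &= FCSR_WRITABLE_OLD;
	fcr31 &= ~FCSR_WRITABLE_OLD;
	fcr31 |= (value & ~0x3u) | rm_placeholder[value & 0x3u];
	return fcr31;
}

/* Model of the suggested change: everything but reserved bits is written. */
uint32_t ctc1_suggested(uint32_t fcr31, uint32_t value)
{
	(void)fcr31; /* the old contents no longer leak through */
	return (value & ~0x1c0003u) | rm_placeholder[value & 0x3u];
}
```

Feeding in the numbers from earlier in the thread, ctc1_existing(0x8420,
0x400) gives 0x8400, so the stale Cause bit stays set next to its Enable
bit, whereas ctc1_suggested(0x8420, 0x400) gives 0x400.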

Hope this helps.

/K.


end of thread, other threads:[~2010-05-04 21:52 UTC | newest]

Thread overview: 19+ messages
2010-05-03  2:17 Unexpected behaviour when catching SIGFPE on FPU-less system Shane McDonald
2010-05-03 20:39 ` Kevin D. Kissell
2010-05-03 20:47 ` Kevin D. Kissell
     [not found]   ` <k2hb2b2f2321005031843l87f39f36h960153cae3ec5020@mail.gmail.com>
2010-05-04  2:04     ` Kevin D. Kissell
     [not found]       ` <n2pb2b2f2321005032049h56cd72ceh3ac7120c547b59c5@mail.gmail.com>
2010-05-04  4:35         ` Shane McDonald
2010-05-04  6:56           ` Shane McDonald
2010-05-04  7:13             ` Shane McDonald
2010-05-04 11:16             ` Kevin D. Kissell
2010-05-04 12:56               ` Shane McDonald
2010-05-04 16:13                 ` Kevin D. Kissell
2010-05-04 18:44                   ` Ralf Baechle
2010-05-04 18:58                     ` Kevin D. Kissell
2010-05-04 19:28                     ` Geert Uytterhoeven
2010-05-04 19:30                       ` Manuel Lauss
2010-05-04 19:44                         ` Geert Uytterhoeven
2010-05-04 20:01                         ` David Daney
2010-05-04 21:23                           ` Kevin D. Kissell
2010-05-04 18:55                   ` Kevin D. Kissell
2010-05-04 21:52                   ` Kevin D. Kissell
