Linux MIPS Architecture development
 help / color / mirror / Atom feed
* Bug in Linux?  fcr31 not being saved-restored
@ 2002-06-15 23:56 Louis Hamilton
  2002-06-17  6:20 ` Carsten Langgaard
  0 siblings, 1 reply; 7+ messages in thread
From: Louis Hamilton @ 2002-06-15 23:56 UTC (permalink / raw)
  To: linux-mips; +Cc: sandcraft-elinux-project, hamilton

We have a customer here testing a 2.4.16 mips kernel on an embedded
Linux RM7000/SR71000 based system who has written a test that they
believe has uncovered a bug in Linux.  The FPU control register appears
to not get saved and restored.  I've reproduced the problem described
below and find the results consistent with their description.  The
problem occurs on both the RM7000 and SR71000 cpus.

It looks like save_fp_context and restore_fp_context are not being
called since the kernel save-restore logic thinks the process is not 
using floating point math.  If you do some fp math before calling the
test routine below, it seems to works fine.

Is this a known caveat?  A true bug?  Or a contorted corner case
unlikely to be seen under typical end-user usage (see customer's
last paragraph :-) ?   If true bug, recommended remedy?

TIA,
Louis

Louis Hamilton
hamilton@redhat.com


------ customer reports the following: ---------
We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
running program, it doesn't matter), causes the FCR (floating-point
control register) to be corrupted in another, unrelated process.  This
is repeatable behavior.

This can be reproduced with the following short assembly language
program that loops forever, waiting for the FCR to change.

	.align 2
	.globl mips_float_debug_loop
mips_float_debug_loop:
	li	$9, 0xF000F02F
	ctc1    $9, $31		# set FCR to some non-zero value
	nop
1:	cfc1	$8, $31		# get FCR
	beq	$8, $9, 1b	# spin, waiting for FCR to change
	nop
	or	$2, $0, $8
	jr    $31
	nop

You can call this function from a short C program and the return value
is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.

Run the above loop in one window (connected to the board using telnet)
and then in another window (connected to the same board) type ^C.

I'm surprised this bug hasn't been encountered by other MIPS vendors.

<end>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Bug in Linux?  fcr31 not being saved-restored
  2002-06-15 23:56 Bug in Linux? fcr31 not being saved-restored Louis Hamilton
@ 2002-06-17  6:20 ` Carsten Langgaard
  2002-06-17 19:30   ` Jun Sun
  0 siblings, 1 reply; 7+ messages in thread
From: Carsten Langgaard @ 2002-06-17  6:20 UTC (permalink / raw)
  To: Louis Hamilton; +Cc: linux-mips, sandcraft-elinux-project

This is one of the bugs, among others, we have fixed.
I'm not sure, if Ralf have integrated the patches we send him yet.

/Carsten

Louis Hamilton wrote:

> We have a customer here testing a 2.4.16 mips kernel on an embedded
> Linux RM7000/SR71000 based system who has written a test that they
> believe has uncovered a bug in Linux.  The FPU control register appears
> to not get saved and restored.  I've reproduced the problem described
> below and find the results consistent with their description.  The
> problem occurs on both the RM7000 and SR71000 cpus.
>
> It looks like save_fp_context and restore_fp_context are not being
> called since the kernel save-restore logic thinks the process is not
> using floating point math.  If you do some fp math before calling the
> test routine below, it seems to works fine.
>
> Is this a known caveat?  A true bug?  Or a contorted corner case
> unlikely to be seen under typical end-user usage (see customer's
> last paragraph :-) ?   If true bug, recommended remedy?
>
> TIA,
> Louis
>
> Louis Hamilton
> hamilton@redhat.com
>
> ------ customer reports the following: ---------
> We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
> running program, it doesn't matter), causes the FCR (floating-point
> control register) to be corrupted in another, unrelated process.  This
> is repeatable behavior.
>
> This can be reproduced with the following short assembly language
> program that loops forever, waiting for the FCR to change.
>
>         .align 2
>         .globl mips_float_debug_loop
> mips_float_debug_loop:
>         li      $9, 0xF000F02F
>         ctc1    $9, $31         # set FCR to some non-zero value
>         nop
> 1:      cfc1    $8, $31         # get FCR
>         beq     $8, $9, 1b      # spin, waiting for FCR to change
>         nop
>         or      $2, $0, $8
>         jr    $31
>         nop
>
> You can call this function from a short C program and the return value
> is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.
>
> Run the above loop in one window (connected to the board using telnet)
> and then in another window (connected to the same board) type ^C.
>
> I'm surprised this bug hasn't been encountered by other MIPS vendors.
>
> <end>

--
_    _ ____  ___   Carsten Langgaard   Mailto:carstenl@mips.com
|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
  TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
                   Denmark             http://www.mips.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Bug in Linux?  fcr31 not being saved-restored
  2002-06-17  6:20 ` Carsten Langgaard
@ 2002-06-17 19:30   ` Jun Sun
  2002-06-18  8:07     ` Carsten Langgaard
  0 siblings, 1 reply; 7+ messages in thread
From: Jun Sun @ 2002-06-17 19:30 UTC (permalink / raw)
  To: Carsten Langgaard; +Cc: Louis Hamilton, linux-mips, sandcraft-elinux-project

Carsten Langgaard wrote:

> This is one of the bugs, among others, we have fixed.
> I'm not sure, if Ralf have integrated the patches we send him yet.
> 


Carsten,

Do you remember the cause and the fix?  It appears to me the first ctc1 
instruction should trap into kernel and mark current process as fpu owner, and 
should not cause fcr31 corruption.

Or somehow the ctc1 does not trap into kernel?

Jun


> /Carsten
> 
> Louis Hamilton wrote:
> 
> 
>>We have a customer here testing a 2.4.16 mips kernel on an embedded
>>Linux RM7000/SR71000 based system who has written a test that they
>>believe has uncovered a bug in Linux.  The FPU control register appears
>>to not get saved and restored.  I've reproduced the problem described
>>below and find the results consistent with their description.  The
>>problem occurs on both the RM7000 and SR71000 cpus.
>>
>>It looks like save_fp_context and restore_fp_context are not being
>>called since the kernel save-restore logic thinks the process is not
>>using floating point math.  If you do some fp math before calling the
>>test routine below, it seems to works fine.
>>
>>Is this a known caveat?  A true bug?  Or a contorted corner case
>>unlikely to be seen under typical end-user usage (see customer's
>>last paragraph :-) ?   If true bug, recommended remedy?
>>
>>TIA,
>>Louis
>>
>>Louis Hamilton
>>hamilton@redhat.com
>>
>>------ customer reports the following: ---------
>>We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
>>running program, it doesn't matter), causes the FCR (floating-point
>>control register) to be corrupted in another, unrelated process.  This
>>is repeatable behavior.
>>
>>This can be reproduced with the following short assembly language
>>program that loops forever, waiting for the FCR to change.
>>
>>        .align 2
>>        .globl mips_float_debug_loop
>>mips_float_debug_loop:
>>        li      $9, 0xF000F02F
>>        ctc1    $9, $31         # set FCR to some non-zero value
>>        nop
>>1:      cfc1    $8, $31         # get FCR
>>        beq     $8, $9, 1b      # spin, waiting for FCR to change
>>        nop
>>        or      $2, $0, $8
>>        jr    $31
>>        nop
>>
>>You can call this function from a short C program and the return value
>>is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.
>>
>>Run the above loop in one window (connected to the board using telnet)
>>and then in another window (connected to the same board) type ^C.
>>
>>I'm surprised this bug hasn't been encountered by other MIPS vendors.
>>
>><end>
>>
> 
> --
> _    _ ____  ___   Carsten Langgaard   Mailto:carstenl@mips.com
> |\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
> | \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
>   TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
>                    Denmark             http://www.mips.com
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Bug in Linux?  fcr31 not being saved-restored
  2002-06-17 19:30   ` Jun Sun
@ 2002-06-18  8:07     ` Carsten Langgaard
  2002-06-18 17:46       ` Jun Sun
  0 siblings, 1 reply; 7+ messages in thread
From: Carsten Langgaard @ 2002-06-18  8:07 UTC (permalink / raw)
  To: Jun Sun; +Cc: Louis Hamilton, linux-mips, sandcraft-elinux-project

I believe this is the patch that solve the problem.
The lazy fpu stuff has cause so many problems, that we have decided (together
with Ralf) to get rid of it, and do the FPU context save and restore a little bit
differently.
We now have a solution here locally and we are testing it at the moment.

/Carsten


Index: arch/mips/kernel/traps.c
===================================================================
RCS file: /cvs/linux/arch/mips/kernel/traps.c,v
retrieving revision 1.99.2.14
diff -u -r1.99.2.14 traps.c
--- arch/mips/kernel/traps.c    2002/05/28 06:33:13     1.99.2.14
+++ arch/mips/kernel/traps.c    2002/06/18 07:53:47
+               {
+                       __enable_fpu();
                        save_fp(last_task_used_math);
+                       /* last_task_used_math loose fpu */
+                       ((struct pt_regs *)(__KSTK_TOS(last_task_used_math) -
+                                           sizeof(struct pt_regs)))->
+                               cp0_status &= ~ST0_CU1;
+               }

Index: include/asm-mips/processor.h
===================================================================
RCS file: /cvs/linux/include/asm-mips/processor.h,v
retrieving revision 1.43.2.2
diff -u -r1.43.2.2 processor.h
--- include/asm-mips/processor.h        2002/05/28 06:11:56     1.43.2.2
+++ include/asm-mips/processor.h        2002/06/18 07:56:58
@@ -215,6 +215,7 @@
        regs->cp0_epc = new_pc;                                         \
        regs->regs[29] = new_sp;                                        \
        current->thread.current_ds = USER_DS;                           \
+       current->used_math = 0;                                         \
 } while (0)

 unsigned long get_wchan(struct task_struct *p);



Jun Sun wrote:

> Carsten Langgaard wrote:
>
> > This is one of the bugs, among others, we have fixed.
> > I'm not sure, if Ralf have integrated the patches we send him yet.
> >
>
> Carsten,
>
> Do you remember the cause and the fix?  It appears to me the first ctc1
> instruction should trap into kernel and mark current process as fpu owner, and
> should not cause fcr31 corruption.
>
> Or somehow the ctc1 does not trap into kernel?
>
> Jun
>
> > /Carsten
> >
> > Louis Hamilton wrote:
> >
> >
> >>We have a customer here testing a 2.4.16 mips kernel on an embedded
> >>Linux RM7000/SR71000 based system who has written a test that they
> >>believe has uncovered a bug in Linux.  The FPU control register appears
> >>to not get saved and restored.  I've reproduced the problem described
> >>below and find the results consistent with their description.  The
> >>problem occurs on both the RM7000 and SR71000 cpus.
> >>
> >>It looks like save_fp_context and restore_fp_context are not being
> >>called since the kernel save-restore logic thinks the process is not
> >>using floating point math.  If you do some fp math before calling the
> >>test routine below, it seems to works fine.
> >>
> >>Is this a known caveat?  A true bug?  Or a contorted corner case
> >>unlikely to be seen under typical end-user usage (see customer's
> >>last paragraph :-) ?   If true bug, recommended remedy?
> >>
> >>TIA,
> >>Louis
> >>
> >>Louis Hamilton
> >>hamilton@redhat.com
> >>
> >>------ customer reports the following: ---------
> >>We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
> >>running program, it doesn't matter), causes the FCR (floating-point
> >>control register) to be corrupted in another, unrelated process.  This
> >>is repeatable behavior.
> >>
> >>This can be reproduced with the following short assembly language
> >>program that loops forever, waiting for the FCR to change.
> >>
> >>        .align 2
> >>        .globl mips_float_debug_loop
> >>mips_float_debug_loop:
> >>        li      $9, 0xF000F02F
> >>        ctc1    $9, $31         # set FCR to some non-zero value
> >>        nop
> >>1:      cfc1    $8, $31         # get FCR
> >>        beq     $8, $9, 1b      # spin, waiting for FCR to change
> >>        nop
> >>        or      $2, $0, $8
> >>        jr    $31
> >>        nop
> >>
> >>You can call this function from a short C program and the return value
> >>is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.
> >>
> >>Run the above loop in one window (connected to the board using telnet)
> >>and then in another window (connected to the same board) type ^C.
> >>
> >>I'm surprised this bug hasn't been encountered by other MIPS vendors.
> >>
> >><end>
> >>
> >
> > --
> > _    _ ____  ___   Carsten Langgaard  Mailto:carstenl@mips.com
> > |\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
> > | \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
> >   TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
> >                    Denmark            http://www.mips.com
> >
> >
> >
> >

--
_    _ ____  ___   Carsten Langgaard   Mailto:carstenl@mips.com
|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
  TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
                   Denmark             http://www.mips.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Bug in Linux?  fcr31 not being saved-restored
  2002-06-18  8:07     ` Carsten Langgaard
@ 2002-06-18 17:46       ` Jun Sun
  2002-06-19  6:55         ` Carsten Langgaard
  0 siblings, 1 reply; 7+ messages in thread
From: Jun Sun @ 2002-06-18 17:46 UTC (permalink / raw)
  To: Carsten Langgaard; +Cc: Louis Hamilton, linux-mips, sandcraft-elinux-project


Your first chunk is not clear as to where it applies.  Maybe you are using a 
different code base.

The second chunck is not necessary.  Although the FPU values in thread struct 
for the new thread are stale, the new program cannot assume the fpu register 
values and cannot use them without initialization anyway.

I don't see such a code in x86's code.  Current call to start_thread() is ok 
with or without this change.  I am afraid future use of start_thread() may 
give undesired effect after we make this change.

As for the new fpu context switch code, I wrote a experiemental patch after 
much discussion with Ralf and Kevin.  It is at

http://linux.junsun.net/patches/oss.sgi.com/experimental/020304-half-fpu-context-switch

I also did some performance study of this patch.  Did not get much feedbacks 
when I asked last time. :-(

Jun

Carsten Langgaard wrote:

> I believe this is the patch that solve the problem.
> The lazy fpu stuff has cause so many problems, that we have decided (together
> with Ralf) to get rid of it, and do the FPU context save and restore a little bit
> differently.
> We now have a solution here locally and we are testing it at the moment.
> 
> /Carsten
> 
> 
> Index: arch/mips/kernel/traps.c
> ===================================================================
> RCS file: /cvs/linux/arch/mips/kernel/traps.c,v
> retrieving revision 1.99.2.14
> diff -u -r1.99.2.14 traps.c
> --- arch/mips/kernel/traps.c    2002/05/28 06:33:13     1.99.2.14
> +++ arch/mips/kernel/traps.c    2002/06/18 07:53:47
> +               {
> +                       __enable_fpu();
>                         save_fp(last_task_used_math);
> +                       /* last_task_used_math loose fpu */
> +                       ((struct pt_regs *)(__KSTK_TOS(last_task_used_math) -
> +                                           sizeof(struct pt_regs)))->
> +                               cp0_status &= ~ST0_CU1;
> +               }
> 
> Index: include/asm-mips/processor.h
> ===================================================================
> RCS file: /cvs/linux/include/asm-mips/processor.h,v
> retrieving revision 1.43.2.2
> diff -u -r1.43.2.2 processor.h
> --- include/asm-mips/processor.h        2002/05/28 06:11:56     1.43.2.2
> +++ include/asm-mips/processor.h        2002/06/18 07:56:58
> @@ -215,6 +215,7 @@
>         regs->cp0_epc = new_pc;                                         \
>         regs->regs[29] = new_sp;                                        \
>         current->thread.current_ds = USER_DS;                           \
> +       current->used_math = 0;                                         \
>  } while (0)
> 
>  unsigned long get_wchan(struct task_struct *p);
> 
> 
> 
> Jun Sun wrote:
> 
> 
>>Carsten Langgaard wrote:
>>
>>
>>>This is one of the bugs, among others, we have fixed.
>>>I'm not sure, if Ralf have integrated the patches we send him yet.
>>>
>>>
>>Carsten,
>>
>>Do you remember the cause and the fix?  It appears to me the first ctc1
>>instruction should trap into kernel and mark current process as fpu owner, and
>>should not cause fcr31 corruption.
>>
>>Or somehow the ctc1 does not trap into kernel?
>>
>>Jun
>>
>>
>>>/Carsten
>>>
>>>Louis Hamilton wrote:
>>>
>>>
>>>
>>>>We have a customer here testing a 2.4.16 mips kernel on an embedded
>>>>Linux RM7000/SR71000 based system who has written a test that they
>>>>believe has uncovered a bug in Linux.  The FPU control register appears
>>>>to not get saved and restored.  I've reproduced the problem described
>>>>below and find the results consistent with their description.  The
>>>>problem occurs on both the RM7000 and SR71000 cpus.
>>>>
>>>>It looks like save_fp_context and restore_fp_context are not being
>>>>called since the kernel save-restore logic thinks the process is not
>>>>using floating point math.  If you do some fp math before calling the
>>>>test routine below, it seems to works fine.
>>>>
>>>>Is this a known caveat?  A true bug?  Or a contorted corner case
>>>>unlikely to be seen under typical end-user usage (see customer's
>>>>last paragraph :-) ?   If true bug, recommended remedy?
>>>>
>>>>TIA,
>>>>Louis
>>>>
>>>>Louis Hamilton
>>>>hamilton@redhat.com
>>>>
>>>>------ customer reports the following: ---------
>>>>We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
>>>>running program, it doesn't matter), causes the FCR (floating-point
>>>>control register) to be corrupted in another, unrelated process.  This
>>>>is repeatable behavior.
>>>>
>>>>This can be reproduced with the following short assembly language
>>>>program that loops forever, waiting for the FCR to change.
>>>>
>>>>       .align 2
>>>>       .globl mips_float_debug_loop
>>>>mips_float_debug_loop:
>>>>       li      $9, 0xF000F02F
>>>>       ctc1    $9, $31         # set FCR to some non-zero value
>>>>       nop
>>>>1:      cfc1    $8, $31         # get FCR
>>>>       beq     $8, $9, 1b      # spin, waiting for FCR to change
>>>>       nop
>>>>       or      $2, $0, $8
>>>>       jr    $31
>>>>       nop
>>>>
>>>>You can call this function from a short C program and the return value
>>>>is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.
>>>>
>>>>Run the above loop in one window (connected to the board using telnet)
>>>>and then in another window (connected to the same board) type ^C.
>>>>
>>>>I'm surprised this bug hasn't been encountered by other MIPS vendors.
>>>>
>>>><end>
>>>>
>>>--
>>>_    _ ____  ___   Carsten Langgaard  Mailto:carstenl@mips.com
>>>|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
>>>| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
>>>  TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
>>>                   Denmark            http://www.mips.com
>>>
>>>
>>>
>>>
>>>
> 
> --
> _    _ ____  ___   Carsten Langgaard   Mailto:carstenl@mips.com
> |\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
> | \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
>   TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
>                    Denmark             http://www.mips.com
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Bug in Linux?  fcr31 not being saved-restored
  2002-06-18 17:46       ` Jun Sun
@ 2002-06-19  6:55         ` Carsten Langgaard
  2002-06-20 18:03           ` Jun Sun
  0 siblings, 1 reply; 7+ messages in thread
From: Carsten Langgaard @ 2002-06-19  6:55 UTC (permalink / raw)
  To: Jun Sun; +Cc: Louis Hamilton, linux-mips, sandcraft-elinux-project

Jun Sun wrote:

> Your first chunk is not clear as to where it applies.  Maybe you are using a
> different code base.

Here is my do_cpu handler:

asmlinkage void do_cpu(struct pt_regs *regs)
{
        unsigned int cpid;
        extern void lazy_fpu_switch(void *);
        extern void save_fp(struct task_struct *);
        extern void init_fpu(void);
        void fpu_emulator_init_fpu(void);
        int sig;

        cpid = (regs->cp0_cause >> CAUSEB_CE) & 3;
        if (cpid != 1)
                goto bad_cid;

        if (!(mips_cpu.options & MIPS_CPU_FPU))
                goto fp_emul;

        regs->cp0_status |= ST0_CU1;
        if (last_task_used_math == current)
                return;

        if (current->used_math) {               /* Using the FPU again.  */
                lazy_fpu_switch(last_task_used_math);
        } else {                                /* First time FPU user.  */
                if (last_task_used_math != NULL)
                {
                        __enable_fpu();
                        save_fp(last_task_used_math);
                        /* last_task_used_math loose fpu */
                        ((struct pt_regs *)(__KSTK_TOS(last_task_used_math) -
                                            sizeof(struct pt_regs)))->
                                cp0_status &= ~ST0_CU1;
                }
                init_fpu();
                current->used_math = 1;
        }
        last_task_used_math = current;

        return;

fp_emul:
        if (last_task_used_math != current) {
                if (!current->used_math) {
                        fpu_emulator_init_fpu();
                        current->used_math = 1;
                }
        }
        sig = fpu_emulator_cop1Handler(0, regs, &current->thread.fpu.soft);
        last_task_used_math = current;
        if (sig)
        {
                /*
                 * Return EPC is not calculated in the FPU emulator, if
                 * a signal is being send. So we calculate it here.
                 */
                compute_return_epc(regs);
                force_sig(sig, current);
        }
        return;

bad_cid:
#ifndef CONFIG_CPU_HAS_LLSC
        switch (mips_cpu.cputype) {
        case CPU_TX3927:
        case CPU_TX39XX:
                do_ri(regs);
                return;
        }
#endif
        compute_return_epc(regs);
        force_sig(SIGILL, current);
}


>
> The second chunck is not necessary.  Although the FPU values in thread struct
> for the new thread are stale, the new program cannot assume the fpu register
> values and cannot use them without initialization anyway.

At least with the do_cpu handler above, you need this.
I'm not sure which tests failed without this, but I think it was some of the gcc selftests
and I think the problem was, that a thread inherited the previous thread's fsr (floating
point status register).

>
> I don't see such a code in x86's code.  Current call to start_thread() is ok
> with or without this change.  I am afraid future use of start_thread() may
> give undesired effect after we make this change.

Why ?

>
> As for the new fpu context switch code, I wrote a experiemental patch after
> much discussion with Ralf and Kevin.  It is at
>
> http://linux.junsun.net/patches/oss.sgi.com/experimental/020304-half-fpu-context-switch
>
> I also did some performance study of this patch.  Did not get much feedbacks
> when I asked last time. :-(
>
> Jun
>
> Carsten Langgaard wrote:
>
> > I believe this is the patch that solve the problem.
> > The lazy fpu stuff has cause so many problems, that we have decided (together
> > with Ralf) to get rid of it, and do the FPU context save and restore a little bit
> > differently.
> > We now have a solution here locally and we are testing it at the moment.
> >
> > /Carsten
> >
> >
> > Index: arch/mips/kernel/traps.c
> > ===================================================================
> > RCS file: /cvs/linux/arch/mips/kernel/traps.c,v
> > retrieving revision 1.99.2.14
> > diff -u -r1.99.2.14 traps.c
> > --- arch/mips/kernel/traps.c    2002/05/28 06:33:13     1.99.2.14
> > +++ arch/mips/kernel/traps.c    2002/06/18 07:53:47
> > +               {
> > +                       __enable_fpu();
> >                         save_fp(last_task_used_math);
> > +                       /* last_task_used_math loose fpu */
> > +                       ((struct pt_regs *)(__KSTK_TOS(last_task_used_math) -
> > +                                           sizeof(struct pt_regs)))->
> > +                               cp0_status &= ~ST0_CU1;
> > +               }
> >
> > Index: include/asm-mips/processor.h
> > ===================================================================
> > RCS file: /cvs/linux/include/asm-mips/processor.h,v
> > retrieving revision 1.43.2.2
> > diff -u -r1.43.2.2 processor.h
> > --- include/asm-mips/processor.h        2002/05/28 06:11:56     1.43.2.2
> > +++ include/asm-mips/processor.h        2002/06/18 07:56:58
> > @@ -215,6 +215,7 @@
> >         regs->cp0_epc = new_pc;                                         \
> >         regs->regs[29] = new_sp;                                        \
> >         current->thread.current_ds = USER_DS;                           \
> > +       current->used_math = 0;                                         \
> >  } while (0)
> >
> >  unsigned long get_wchan(struct task_struct *p);
> >
> >
> >
> > Jun Sun wrote:
> >
> >
> >>Carsten Langgaard wrote:
> >>
> >>
> >>>This is one of the bugs, among others, we have fixed.
> >>>I'm not sure, if Ralf have integrated the patches we send him yet.
> >>>
> >>>
> >>Carsten,
> >>
> >>Do you remember the cause and the fix?  It appears to me the first ctc1
> >>instruction should trap into kernel and mark current process as fpu owner, and
> >>should not cause fcr31 corruption.
> >>
> >>Or somehow the ctc1 does not trap into kernel?
> >>
> >>Jun
> >>
> >>
> >>>/Carsten
> >>>
> >>>Louis Hamilton wrote:
> >>>
> >>>
> >>>
> >>>>We have a customer here testing a 2.4.16 mips kernel on an embedded
> >>>>Linux RM7000/SR71000 based system who has written a test that they
> >>>>believe has uncovered a bug in Linux.  The FPU control register appears
> >>>>to not get saved and restored.  I've reproduced the problem described
> >>>>below and find the results consistent with their description.  The
> >>>>problem occurs on both the RM7000 and SR71000 cpus.
> >>>>
> >>>>It looks like save_fp_context and restore_fp_context are not being
> >>>>called since the kernel save-restore logic thinks the process is not
> >>>>using floating point math.  If you do some fp math before calling the
> >>>>test routine below, it seems to works fine.
> >>>>
> >>>>Is this a known caveat?  A true bug?  Or a contorted corner case
> >>>>unlikely to be seen under typical end-user usage (see customer's
> >>>>last paragraph :-) ?   If true bug, recommended remedy?
> >>>>
> >>>>TIA,
> >>>>Louis
> >>>>
> >>>>Louis Hamilton
> >>>>hamilton@redhat.com
> >>>>
> >>>>------ customer reports the following: ---------
> >>>>We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
> >>>>running program, it doesn't matter), causes the FCR (floating-point
> >>>>control register) to be corrupted in another, unrelated process.  This
> >>>>is repeatable behavior.
> >>>>
> >>>>This can be reproduced with the following short assembly language
> >>>>program that loops forever, waiting for the FCR to change.
> >>>>
> >>>>       .align 2
> >>>>       .globl mips_float_debug_loop
> >>>>mips_float_debug_loop:
> >>>>       li      $9, 0xF000F02F
> >>>>       ctc1    $9, $31         # set FCR to some non-zero value
> >>>>       nop
> >>>>1:      cfc1    $8, $31         # get FCR
> >>>>       beq     $8, $9, 1b      # spin, waiting for FCR to change
> >>>>       nop
> >>>>       or      $2, $0, $8
> >>>>       jr    $31
> >>>>       nop
> >>>>
> >>>>You can call this function from a short C program and the return value
> >>>>is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.
> >>>>
> >>>>Run the above loop in one window (connected to the board using telnet)
> >>>>and then in another window (connected to the same board) type ^C.
> >>>>
> >>>>I'm surprised this bug hasn't been encountered by other MIPS vendors.
> >>>>
> >>>><end>
> >>>>
> >>>--
> >>>_    _ ____  ___   Carsten Langgaard Mailto:carstenl@mips.com
> >>>|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
> >>>| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
> >>>  TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
> >>>                   Denmark           http://www.mips.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> > --
> > _    _ ____  ___   Carsten Langgaard  Mailto:carstenl@mips.com
> > |\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
> > | \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
> >   TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
> >                    Denmark            http://www.mips.com
> >
> >
> >
> >

--
_    _ ____  ___   Carsten Langgaard   Mailto:carstenl@mips.com
|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
  TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
                   Denmark             http://www.mips.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Bug in Linux?  fcr31 not being saved-restored
  2002-06-19  6:55         ` Carsten Langgaard
@ 2002-06-20 18:03           ` Jun Sun
  0 siblings, 0 replies; 7+ messages in thread
From: Jun Sun @ 2002-06-20 18:03 UTC (permalink / raw)
  To: Carsten Langgaard; +Cc: Louis Hamilton, linux-mips, sandcraft-elinux-project


I think this patch is OK.

Whatever your gcc test is, it seems like it should fail on all other archs 
too, because a new program would always inherit the old fpu registers.  It 
would be interesting to do a test on those platforms.

I can see that having old values for general FPU registers is not a problem, 
but having the old value in FPU status register might.  So an alternative is 
to clear fcr31 only in start_thread().

Note that without your patch, the secondary if (last_task_used_math != NULL) 
branch is never taken.  This is because init process will be the only process 
that does not have used_math flag set and in that case last_task_use_math is 
indeed NULL.  All other processes will inherit the used_math flag from its parent.

Jun

Carsten Langgaard wrote:

> Jun Sun wrote:
> 
> 
>>Your first chunk is not clear as to where it applies.  Maybe you are using a
>>different code base.
>>
> 
> Here is my do_cpu handler:
> 
> asmlinkage void do_cpu(struct pt_regs *regs)
> {
>         unsigned int cpid;
>         extern void lazy_fpu_switch(void *);
>         extern void save_fp(struct task_struct *);
>         extern void init_fpu(void);
>         void fpu_emulator_init_fpu(void);
>         int sig;
> 
>         cpid = (regs->cp0_cause >> CAUSEB_CE) & 3;
>         if (cpid != 1)
>                 goto bad_cid;
> 
>         if (!(mips_cpu.options & MIPS_CPU_FPU))
>                 goto fp_emul;
> 
>         regs->cp0_status |= ST0_CU1;
>         if (last_task_used_math == current)
>                 return;
> 
>         if (current->used_math) {               /* Using the FPU again.  */
>                 lazy_fpu_switch(last_task_used_math);
>         } else {                                /* First time FPU user.  */
>                 if (last_task_used_math != NULL)
>                 {
>                         __enable_fpu();
>                         save_fp(last_task_used_math);
>                         /* last_task_used_math loose fpu */
>                         ((struct pt_regs *)(__KSTK_TOS(last_task_used_math) -
>                                             sizeof(struct pt_regs)))->
>                                 cp0_status &= ~ST0_CU1;
>                 }
>                 init_fpu();
>                 current->used_math = 1;
>         }
>         last_task_used_math = current;
> 
>         return;
> 
> fp_emul:
>         if (last_task_used_math != current) {
>                 if (!current->used_math) {
>                         fpu_emulator_init_fpu();
>                         current->used_math = 1;
>                 }
>         }
>         sig = fpu_emulator_cop1Handler(0, regs, &current->thread.fpu.soft);
>         last_task_used_math = current;
>         if (sig)
>         {
>                 /*
>                  * Return EPC is not calculated in the FPU emulator, if
>                  * a signal is being send. So we calculate it here.
>                  */
>                 compute_return_epc(regs);
>                 force_sig(sig, current);
>         }
>         return;
> 
> bad_cid:
> #ifndef CONFIG_CPU_HAS_LLSC
>         switch (mips_cpu.cputype) {
>         case CPU_TX3927:
>         case CPU_TX39XX:
>                 do_ri(regs);
>                 return;
>         }
> #endif
>         compute_return_epc(regs);
>         force_sig(SIGILL, current);
> }
> 
> 
> 
>>The second chunck is not necessary.  Although the FPU values in thread struct
>>for the new thread are stale, the new program cannot assume the fpu register
>>values and cannot use them without initialization anyway.
>>
> 
> At least with the do_cpu handler above, you need this.
> I'm not sure which tests failed without this, but I think it was some of the gcc selftests
> and I think the problem was, that a thread inherited the previous thread's fsr (floating
> point status register).
> 
> 
>>I don't see such a code in x86's code.  Current call to start_thread() is ok
>>with or without this change.  I am afraid future use of start_thread() may
>>give undesired effect after we make this change.
>>
> 
> Why ?
> 
> 
>>As for the new fpu context switch code, I wrote a experiemental patch after
>>much discussion with Ralf and Kevin.  It is at
>>
>>http://linux.junsun.net/patches/oss.sgi.com/experimental/020304-half-fpu-context-switch
>>
>>I also did some performance study of this patch.  Did not get much feedbacks
>>when I asked last time. :-(
>>
>>Jun
>>
>>Carsten Langgaard wrote:
>>
>>
>>>I believe this is the patch that solve the problem.
>>>The lazy fpu stuff has cause so many problems, that we have decided (together
>>>with Ralf) to get rid of it, and do the FPU context save and restore a little bit
>>>differently.
>>>We now have a solution here locally and we are testing it at the moment.
>>>
>>>/Carsten
>>>
>>>
>>>Index: arch/mips/kernel/traps.c
>>>===================================================================
>>>RCS file: /cvs/linux/arch/mips/kernel/traps.c,v
>>>retrieving revision 1.99.2.14
>>>diff -u -r1.99.2.14 traps.c
>>>--- arch/mips/kernel/traps.c    2002/05/28 06:33:13     1.99.2.14
>>>+++ arch/mips/kernel/traps.c    2002/06/18 07:53:47
>>>+               {
>>>+                       __enable_fpu();
>>>                        save_fp(last_task_used_math);
>>>+                       /* last_task_used_math loose fpu */
>>>+                       ((struct pt_regs *)(__KSTK_TOS(last_task_used_math) -
>>>+                                           sizeof(struct pt_regs)))->
>>>+                               cp0_status &= ~ST0_CU1;
>>>+               }
>>>
>>>Index: include/asm-mips/processor.h
>>>===================================================================
>>>RCS file: /cvs/linux/include/asm-mips/processor.h,v
>>>retrieving revision 1.43.2.2
>>>diff -u -r1.43.2.2 processor.h
>>>--- include/asm-mips/processor.h        2002/05/28 06:11:56     1.43.2.2
>>>+++ include/asm-mips/processor.h        2002/06/18 07:56:58
>>>@@ -215,6 +215,7 @@
>>>        regs->cp0_epc = new_pc;                                         \
>>>        regs->regs[29] = new_sp;                                        \
>>>        current->thread.current_ds = USER_DS;                           \
>>>+       current->used_math = 0;                                         \
>>> } while (0)
>>>
>>> unsigned long get_wchan(struct task_struct *p);
>>>
>>>
>>>
>>>Jun Sun wrote:
>>>
>>>
>>>
>>>>Carsten Langgaard wrote:
>>>>
>>>>
>>>>
>>>>>This is one of the bugs, among others, we have fixed.
>>>>>I'm not sure, if Ralf have integrated the patches we send him yet.
>>>>>
>>>>>
>>>>>
>>>>Carsten,
>>>>
>>>>Do you remember the cause and the fix?  It appears to me the first ctc1
>>>>instruction should trap into kernel and mark current process as fpu owner, and
>>>>should not cause fcr31 corruption.
>>>>
>>>>Or somehow the ctc1 does not trap into kernel?
>>>>
>>>>Jun
>>>>
>>>>
>>>>
>>>>>/Carsten
>>>>>
>>>>>Louis Hamilton wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>We have a customer here testing a 2.4.16 mips kernel on an embedded
>>>>>>Linux RM7000/SR71000 based system who has written a test that they
>>>>>>believe has uncovered a bug in Linux.  The FPU control register appears
>>>>>>to not get saved and restored.  I've reproduced the problem described
>>>>>>below and find the results consistent with their description.  The
>>>>>>problem occurs on both the RM7000 and SR71000 cpus.
>>>>>>
>>>>>>It looks like save_fp_context and restore_fp_context are not being
>>>>>>called since the kernel save-restore logic thinks the process is not
>>>>>>using floating point math.  If you do some fp math before calling the
>>>>>>test routine below, it seems to works fine.
>>>>>>
>>>>>>Is this a known caveat?  A true bug?  Or a contorted corner case
>>>>>>unlikely to be seen under typical end-user usage (see customer's
>>>>>>last paragraph :-) ?   If true bug, recommended remedy?
>>>>>>
>>>>>>TIA,
>>>>>>Louis
>>>>>>
>>>>>>Louis Hamilton
>>>>>>hamilton@redhat.com
>>>>>>
>>>>>>------ customer reports the following: ---------
>>>>>>We found a bug in Linux.  A ^C (control-C) typed into a shell (or a
>>>>>>running program, it doesn't matter), causes the FCR (floating-point
>>>>>>control register) to be corrupted in another, unrelated process.  This
>>>>>>is repeatable behavior.
>>>>>>
>>>>>>This can be reproduced with the following short assembly language
>>>>>>program that loops forever, waiting for the FCR to change.
>>>>>>
>>>>>>      .align 2
>>>>>>      .globl mips_float_debug_loop
>>>>>>mips_float_debug_loop:
>>>>>>      li      $9, 0xF000F02F
>>>>>>      ctc1    $9, $31         # set FCR to some non-zero value
>>>>>>      nop
>>>>>>1:      cfc1    $8, $31         # get FCR
>>>>>>      beq     $8, $9, 1b      # spin, waiting for FCR to change
>>>>>>      nop
>>>>>>      or      $2, $0, $8
>>>>>>      jr    $31
>>>>>>      nop
>>>>>>
>>>>>>You can call this function from a short C program and the return value
>>>>>>is the (corrupted) FCR, which turns out to alwyas be: 0x00000002.
>>>>>>
>>>>>>Run the above loop in one window (connected to the board using telnet)
>>>>>>and then in another window (connected to the same board) type ^C.
>>>>>>
>>>>>>I'm surprised this bug hasn't been encountered by other MIPS vendors.
>>>>>>
>>>>>><end>
>>>>>>
>>>>>--
>>>>>_    _ ____  ___   Carsten Langgaard Mailto:carstenl@mips.com
>>>>>|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
>>>>>| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
>>>>> TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
>>>>>                  Denmark           http://www.mips.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>--
>>>_    _ ____  ___   Carsten Langgaard  Mailto:carstenl@mips.com
>>>|\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
>>>| \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
>>>  TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
>>>                   Denmark            http://www.mips.com
>>>
>>>
>>>
>>>
>>>
> 
> --
> _    _ ____  ___   Carsten Langgaard   Mailto:carstenl@mips.com
> |\  /|||___)(___   MIPS Denmark        Direct: +45 4486 5527
> | \/ |||    ____)  Lautrupvang 4B      Switch: +45 4486 5555
>   TECHNOLOGIES     2750 Ballerup       Fax...: +45 4486 5556
>                    Denmark             http://www.mips.com
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2002-06-20 18:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2002-06-15 23:56 Bug in Linux? fcr31 not being saved-restored Louis Hamilton
2002-06-17  6:20 ` Carsten Langgaard
2002-06-17 19:30   ` Jun Sun
2002-06-18  8:07     ` Carsten Langgaard
2002-06-18 17:46       ` Jun Sun
2002-06-19  6:55         ` Carsten Langgaard
2002-06-20 18:03           ` Jun Sun

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox