corruption of load instruction offset

* corruption of load instruction offset
From: Chuck Meade @ 2006-04-03  4:12 UTC
  To: linux-mips; +Cc: Chuck Meade (mindspring)

Hello,

I am seeing a very interesting/worrisome bug on an RM7965 cpu, which has
an E9000 core.  I am running 2.6.14-rc1.  Please take a look at the
behavior I describe and send me your thoughts.  Thanks.

The error message is immediately below.  Notice that the epc is 8021e28c,
and the BadVA is 87e39681, and register 4 (a0) is 87e38660.

Now scan down below the error message, to the disassembly of move_32bytes.
If you look at the instruction at 8021e28c, it appears harmless enough.  
Nothing to cause an unaligned access or invalid instruction.  But look
about 6 lines above that, and we are loading at offsets from a0.  The
offsets from a0 in those 4 load instructions are 16, 20, 24, and 28.  If
you look at the opcodes in the column to the left, those offsets appear in
the least significant 16 bits of the opcode.

Now look again at the value of a0 in the register dump:  87e38660.  And
at the BadVA value:  87e39681.  The BadVA is offset exactly 0x1021 from
a0.  This indicates that we somehow tried to access memory at offset 
0x1021 from a0.  However, we never should have done that according to
the disassembly.  *But* there are many instructions in the vicinity which
have a least significant 16 bits of 0x1021.  None of them are loads from a0,
but I believe that this is the root of the problem.  Something is happening
here, possibly an interrupt, or a cpu bug(?) that is causing the load from
a0 to use an offset of 0x1021 (the least significant 16-bits of many of
the nearby instructions) rather than the correct offset for the load
instruction, which is found in the least significant 16-bits of the actual
load instructions.
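
The arithmetic in the paragraph above can be checked with a short sketch. This is illustrative Python, not part of the original report; the helper names (low16, is_lw) are hypothetical, and the constants are copied from the register dump and disassembly in this message.

```python
# Values copied from the register dump in this message.
A0 = 0x87E38660      # $4 (a0)
BAD_VA = 0x87E39681  # BadVA

# Instruction words from the listing, 8021e268 through 8021e28c.
WORDS = {
    0x8021E268: 0x8C880010,  # lw   t0,16(a0)
    0x8021E26C: 0x8C890014,  # lw   t1,20(a0)
    0x8021E270: 0x8C8B0018,  # lw   t3,24(a0)
    0x8021E274: 0x8C8C001C,  # lw   t4,28(a0)
    0x8021E278: 0x00481021,  # addu v0,v0,t0
    0x8021E27C: 0x0048182B,  # sltu v1,v0,t0
    0x8021E280: 0x00431021,  # addu v0,v0,v1
    0x8021E284: 0x00491021,  # addu v0,v0,t1
    0x8021E288: 0x0049182B,  # sltu v1,v0,t1
    0x8021E28C: 0x00431021,  # addu v0,v0,v1  (epc)
}

LW_OPCODE = 0x23  # primary opcode of lw, held in bits 31..26


def low16(insn: int) -> int:
    """Return the least significant 16 bits of an instruction word."""
    return insn & 0xFFFF


def is_lw(insn: int) -> bool:
    """True if the primary opcode field says this word is a lw."""
    return (insn >> 26) == LW_OPCODE


delta = BAD_VA - A0
same_low16 = [pc for pc, word in WORDS.items() if low16(word) == delta]
lw_offsets = [low16(word) for word in WORDS.values() if is_lw(word)]

print(f"BadVA - a0 = {delta:#x}")
print("words whose low 16 bits equal that:", [f"{pc:08x}" for pc in same_low16])
print("offsets encoded in the lw words:", lw_offsets)
```

The sketch only restates the observation: the difference matches the low halfword of several addu words and none of the offsets encoded in the lw words.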

This is not "quickly" reproducible.  I run a TCP blaster/blastee test between
this machine and a Linux PC, and at some point during the run (sometimes much
later) this error appears.

Thanks for your ideas,
Chuck

Error message:

Unhandled kernel unaligned access or invalid instruction in arch/mips/kernel/unaligned.c::emulate_load_store_insn, line 487[#1]:
Cpu 0
$ 0   : 00000000 10004ce8 00000000 00000000
$ 4   : 87e38660 000005a8 00000000 00000000
$ 8   : 00000000 00000000 00000020 00000000
$12   : 00000000 80402000 00000001 00000000
$16   : 00000000 87e171a0 000005a8 87c1f060
$20   : 87e380e0 004009e0 10004740 00002ad8
$24   : 00000008 803171c0
$28   : 8120a000 8120bd48 00000000 802deb30
Hi    : 0000000c
Lo    : 000d4bf8
epc   : 8021e28c move_32bytes+0x64/0x88     Not tainted
ra    : 802deb30 tcp_sendmsg+0x460/0xd80
Status: 90018403    KERNEL EXL IE
Cause : 00000010
BadVA : 87e39681
PrId  : 00003422
Modules linked in:
Process blaster (pid: 162, threadinfo=8120a000, task=8050b3f8)
Stack : 8120bdd0 00000000 812fd4a0 8120bdf0 8120bd70 87e18520 00000001 00000000
        8120be40 7fffffff 00000000 8120bf18 8120be14 00000000 000005a8 000005a8
        000032e8 00000001 00000000 90018400 8120be40 00005dc0 10001458 8120bf18
        00000005 004009e0 10011044 10010000 10010fd4 8028e7a8 00000020 ffffffff
        00000001 00000000 00005dc0 10001458 87e18520 00005dc0 812fd4a0 004009e0
        ...
Call Trace:
 [<8028e7a8>] sock_aio_write+0x10c/0x12c
 [<8016bef8>] do_sync_write+0xd0/0x128
 [<801037d4>] do_IRQ+0x24/0x34
 [<804203cc>] init+0xd8/0xe4
 [<8013cf78>] autoremove_wake_function+0x0/0x44
 [<8016c020>] vfs_write+0xd0/0x144
 [<8016c020>] vfs_write+0xd0/0x144
 [<8016c074>] vfs_write+0x124/0x144
 [<8016c150>] sys_write+0x24/0x98
 [<8016c180>] sys_write+0x54/0x98
 [<8016c154>] sys_write+0x28/0x98
 [<801037d4>] do_IRQ+0x24/0x34
 [<8010b260>] stack_done+0x20/0x3c
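
The Cause and BadVA values in the dump above can be decoded with a small sketch. It is illustrative Python rather than anything from the original report, and it assumes the standard MIPS layout in which the ExcCode field occupies bits 6..2 of Cause. The table lists only a subset of the codes, and the names exc_code and EXC_CODES are hypothetical.

```python
CAUSE = 0x00000010   # "Cause" line of the dump
BAD_VA = 0x87E39681  # "BadVA" line of the dump

# Subset of the standard MIPS exception codes (ExcCode field of Cause).
EXC_CODES = {
    0: "Int",    # interrupt
    2: "TLBL",   # TLB miss on load or instruction fetch
    3: "TLBS",   # TLB miss on store
    4: "AdEL",   # address error on load or instruction fetch
    5: "AdES",   # address error on store
    10: "RI",    # reserved instruction
}


def exc_code(cause: int) -> int:
    """Extract the ExcCode field, bits 6..2 of the Cause register."""
    return (cause >> 2) & 0x1F


code = exc_code(CAUSE)
name = EXC_CODES.get(code, "other")
misalignment = BAD_VA % 4  # a word load needs a 4-byte-aligned address

print(f"ExcCode = {code} ({name}), BadVA is {misalignment} byte(s) past a word boundary")
```

Under that assumption the exception is an address error on a load, and BadVA is not word aligned, which fits a word load issued against a misaligned address.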

Disassembly of relevant portion of move_32bytes:

8021e228 <move_32bytes>:
8021e228:       8c880000        lw      t0,0(a0)
8021e22c:       8c890004        lw      t1,4(a0)
8021e230:       8c8b0008        lw      t3,8(a0)
8021e234:       8c8c000c        lw      t4,12(a0)
8021e238:       00481021        addu    v0,v0,t0
8021e23c:       0048182b        sltu    v1,v0,t0
8021e240:       00431021        addu    v0,v0,v1
8021e244:       00491021        addu    v0,v0,t1
8021e248:       0049182b        sltu    v1,v0,t1
8021e24c:       00431021        addu    v0,v0,v1
8021e250:       004b1021        addu    v0,v0,t3
8021e254:       004b182b        sltu    v1,v0,t3
8021e258:       00431021        addu    v0,v0,v1
8021e25c:       004c1021        addu    v0,v0,t4
8021e260:       004c182b        sltu    v1,v0,t4
8021e264:       00431021        addu    v0,v0,v1
8021e268:       8c880010        lw      t0,16(a0)
8021e26c:       8c890014        lw      t1,20(a0)
8021e270:       8c8b0018        lw      t3,24(a0)
8021e274:       8c8c001c        lw      t4,28(a0)
8021e278:       00481021        addu    v0,v0,t0
8021e27c:       0048182b        sltu    v1,v0,t0
8021e280:       00431021        addu    v0,v0,v1
8021e284:       00491021        addu    v0,v0,t1
8021e288:       0049182b        sltu    v1,v0,t1
8021e28c:       00431021        addu    v0,v0,v1
8021e290:       004b1021        addu    v0,v0,t3
8021e294:       004b182b        sltu    v1,v0,t3
8021e298:       00431021        addu    v0,v0,v1
8021e29c:       004c1021        addu    v0,v0,t4
8021e2a0:       004c182b        sltu    v1,v0,t4
8021e2a4:       00431021        addu    v0,v0,v1
8021e2a8:       30b8001c        andi    t8,a1,0x1c
8021e2ac:       24840020        addiu   a0,a0,32


* Re: corruption of load instruction offset
From: Kevin D. Kissell @ 2006-04-03  7:25 UTC
  To: Chuck Meade, linux-mips; +Cc: Chuck Meade (mindspring)

That's pretty twisted - one could almost believe that the fetch from
0x8021e28c got corrupted to pick up the most significant 16 bits
of the instruction at 0x8021e22c or 0x8021e26c - but given that
instructions are fetched and issued word-by-word, it's hard to see
where that could happen, in either CPU hardware or software. 
What is the I-cache line size? If it were me, I'd check my clocks,
voltages, and above all my RAM timing, and I'd re-seat my CPU 
and RAM in their sockets...
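
The corruption described above can be spelled out as a sketch: take the most significant 16 bits of the lw at 0x8021e26c (the lw at 0x8021e22c has the same upper half, 0x8c89) and the least significant 16 bits of the addu at 0x8021e28c, then decode the result as an I-type instruction. This is illustrative Python, not something posted in the thread; splice and decode_itype are hypothetical helper names, and the a0 value is taken from the register dump quoted below.

```python
A0 = 0x87E38660           # $4 (a0) from the register dump
LW_AT_E26C = 0x8C890014   # lw   t1,20(a0)
ADDU_AT_EPC = 0x00431021  # addu v0,v0,v1 at 8021e28c

# o32 register names for the numbers that appear in the listing.
REG_NAMES = {4: "a0", 8: "t0", 9: "t1", 11: "t3", 12: "t4"}


def splice(upper_src: int, lower_src: int) -> int:
    """Combine the upper halfword of one word with the lower halfword of another."""
    return (upper_src & 0xFFFF0000) | (lower_src & 0x0000FFFF)


def decode_itype(insn: int) -> dict:
    """Split an I-type MIPS word into opcode, rs, rt and a sign-extended immediate."""
    imm = insn & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return {
        "opcode": insn >> 26,
        "rs": (insn >> 21) & 0x1F,
        "rt": (insn >> 16) & 0x1F,
        "imm": imm,
    }


hybrid = splice(LW_AT_E26C, ADDU_AT_EPC)
fields = decode_itype(hybrid)
effective = (A0 + fields["imm"]) & 0xFFFFFFFF

print(f"hybrid word    = {hybrid:08x}")
print(f"decodes as     = lw {REG_NAMES[fields['rt']]},{fields['imm']:#x}({REG_NAMES[fields['rs']]})")
print(f"effective addr = {effective:08x}")
```

The spliced word is a lw from a0 with offset 0x1021, and its effective address equals the reported BadVA. That shows the hypothesis is consistent with the dump, not that it is what actually happened.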

            Regards,

            Kevin K.

----- Original Message ----- 
From: "Chuck Meade" <chuckmeade@mindspring.com>
To: <linux-mips@linux-mips.org>
Cc: "Chuck Meade (mindspring)" <chuckmeade@mindspring.com>
Sent: Monday, April 03, 2006 6:12 AM
Subject: corruption of load instruction offset


> Hello,
> 
> I am seeing a very interesting/worrisome bug on an RM7965 cpu, which has
> an E9000 core.  I am running 2.6.14-rc1.  Please take a look at the
> behavior I describe and send me your thoughts.  Thanks.
> 
> The error message is immediately below.  Notice that the epc is 8021e28c,
> and the BadVA is 87e39681, and register 4 (a0) is 87e38660.
> 
> Now scan down below the error message, to the disassembly of move_32bytes.
> If you look at the instruction at 8021e28c, it appears harmless enough.  
> Nothing to cause an unaligned access or invalid instruction.  But look
> about 6 lines above that, and we are loading at offsets from a0.  The
> offsets from a0 in those 4 load instructions are 16, 20, 24, and 28.  If
> you look at the opcodes in the column to the left, those offsets appear in
> the least significant 16 bits of the opcode.
> 
> Now look again at the value of a0 in the register dump:  87e38660.  And
> at the BadVA value:  87e39681.  The BadVA is offset exactly 0x1021 from
> a0.  This indicates that we somehow tried to access memory at offset 
> 0x1021 from a0.  However, we never should have done that according to
> the disassembly.  *But* there are many instructions in the vicinity which
> have a least significant 16 bits of 0x1021.  None of them are loads from a0,
> but I believe that this is the root of the problem.  Something is happening
> here, possibly an interrupt, or a cpu bug(?) that is causing the load from
> a0 to use an offset of 0x1021 (the least significant 16-bits of many of
> the nearby instructions) rather than the correct offset for the load
> instruction, which is found in the least significant 16-bits of the actual
> load instructions.
> 
> This is not "quickly" reproducible.  I run a TCP blaster/blastee test between
> this machine and Linux PC, and at some point during the run (sometimes much
> later) this error appears.
> 
> Thanks for your ideas,
> Chuck
> 
> Error message:
> 
> Unhandled kernel unaligned access or invalid instruction in arch/mips/kernel/unaligned.c::emulate_load_store_insn, line 487[#1]:
> Cpu 0
> $ 0   : 00000000 10004ce8 00000000 00000000
> $ 4   : 87e38660 000005a8 00000000 00000000
> $ 8   : 00000000 00000000 00000020 00000000
> $12   : 00000000 80402000 00000001 00000000
> $16   : 00000000 87e171a0 000005a8 87c1f060
> $20   : 87e380e0 004009e0 10004740 00002ad8
> $24   : 00000008 803171c0
> $28   : 8120a000 8120bd48 00000000 802deb30
> Hi    : 0000000c
> Lo    : 000d4bf8
> epc   : 8021e28c move_32bytes+0x64/0x88     Not tainted
> ra    : 802deb30 tcp_sendmsg+0x460/0xd80
> Status: 90018403    KERNEL EXL IE
> Cause : 00000010
> BadVA : 87e39681
> PrId  : 00003422
> Modules linked in:
> Process blaster (pid: 162, threadinfo=8120a000, task=8050b3f8)
> Stack : 8120bdd0 00000000 812fd4a0 8120bdf0 8120bd70 87e18520 00000001 00000000
>         8120be40 7fffffff 00000000 8120bf18 8120be14 00000000 000005a8 000005a8
>         000032e8 00000001 00000000 90018400 8120be40 00005dc0 10001458 8120bf18
>         00000005 004009e0 10011044 10010000 10010fd4 8028e7a8 00000020 ffffffff
>         00000001 00000000 00005dc0 10001458 87e18520 00005dc0 812fd4a0 004009e0
>         ...
> Call Trace:
>  [<8028e7a8>] sock_aio_write+0x10c/0x12c
>  [<8016bef8>] do_sync_write+0xd0/0x128
>  [<801037d4>] do_IRQ+0x24/0x34
>  [<804203cc>] init+0xd8/0xe4
>  [<8013cf78>] autoremove_wake_function+0x0/0x44
>  [<8016c020>] vfs_write+0xd0/0x144
>  [<8016c020>] vfs_write+0xd0/0x144
>  [<8016c074>] vfs_write+0x124/0x144
>  [<8016c150>] sys_write+0x24/0x98
>  [<8016c180>] sys_write+0x54/0x98
>  [<8016c154>] sys_write+0x28/0x98
>  [<801037d4>] do_IRQ+0x24/0x34
>  [<8010b260>] stack_done+0x20/0x3c
> 
> 
> 
> Disassembly of relevant portion of move_32bytes:
> 
> 8021e228 <move_32bytes>:
> 8021e228:       8c880000        lw      t0,0(a0)
> 8021e22c:       8c890004        lw      t1,4(a0)
> 8021e230:       8c8b0008        lw      t3,8(a0)
> 8021e234:       8c8c000c        lw      t4,12(a0)
> 8021e238:       00481021        addu    v0,v0,t0
> 8021e23c:       0048182b        sltu    v1,v0,t0
> 8021e240:       00431021        addu    v0,v0,v1
> 8021e244:       00491021        addu    v0,v0,t1
> 8021e248:       0049182b        sltu    v1,v0,t1
> 8021e24c:       00431021        addu    v0,v0,v1
> 8021e250:       004b1021        addu    v0,v0,t3
> 8021e254:       004b182b        sltu    v1,v0,t3
> 8021e258:       00431021        addu    v0,v0,v1
> 8021e25c:       004c1021        addu    v0,v0,t4
> 8021e260:       004c182b        sltu    v1,v0,t4
> 8021e264:       00431021        addu    v0,v0,v1
> 8021e268:       8c880010        lw      t0,16(a0)
> 8021e26c:       8c890014        lw      t1,20(a0)
> 8021e270:       8c8b0018        lw      t3,24(a0)
> 8021e274:       8c8c001c        lw      t4,28(a0)
> 8021e278:       00481021        addu    v0,v0,t0
> 8021e27c:       0048182b        sltu    v1,v0,t0
> 8021e280:       00431021        addu    v0,v0,v1
> 8021e284:       00491021        addu    v0,v0,t1
> 8021e288:       0049182b        sltu    v1,v0,t1
> 8021e28c:       00431021        addu    v0,v0,v1
> 8021e290:       004b1021        addu    v0,v0,t3
> 8021e294:       004b182b        sltu    v1,v0,t3
> 8021e298:       00431021        addu    v0,v0,v1
> 8021e29c:       004c1021        addu    v0,v0,t4
> 8021e2a0:       004c182b        sltu    v1,v0,t4
> 8021e2a4:       00431021        addu    v0,v0,v1
> 8021e2a8:       30b8001c        andi    t8,a1,0x1c
> 8021e2ac:       24840020        addiu   a0,a0,32
> 
> 
> 
> 
> 


* Re: corruption of load instruction offset
From: Ralf Baechle @ 2006-04-03 10:42 UTC
  To: Chuck Meade; +Cc: linux-mips

On Mon, Apr 03, 2006 at 12:12:46AM -0400, Chuck Meade wrote:

> I am seeing a very interesting/worrisome bug on an RM7965 cpu, which has
> an E9000 core.  I am running 2.6.14-rc1.  Please take a look at the
> behavior I describe and send me your thoughts.  Thanks.

Well, 2.6.14-rc1.  The -rc1 part says it all.  While rc may stand for
release candidate, -rc1 kernels are definitely far from ready for release:
they have all the new features for 2.6.14 but hardly any of the fixes.
I really suggest you either go back to 2.6.13 or upgrade to 2.6.14.

Talking about upgrading, 2.6.16.1 is a good vintage for MIPS so far.

Aside from this general warning about -rc kernels, and -rc1 kernels
especially, I think the remainder of your analysis is correct ...

  Ralf


* RE: corruption of load instruction offset
From: Chuck Meade @ 2006-04-03 14:37 UTC
  To: linux-mips; +Cc: Chuck Meade (mindspring)

Hi,

> That's pretty twisted - one could almost believe that the fetch from
> 0x8021e28c got corrupted to pick up the most significant 16 bits
> of the instruction at 0x8021e22c or 0x8021e26c - but given that
> instructions are fetched and issued word-by-word, it's hard to see
> where that could happen, in either CPU hardware or software. 
> What is the I-cache line size? If it  were me, I'd check my clocks, 
> voltages, and above all my RAM timing, and I'd re-seat my CPU 
> and RAM in their sockets...

I agree that it is twisted.  The I-cache line size is 32 bytes by the way.
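
With a 32-byte line size, the three instruction addresses named in the quoted hypothesis can be placed within their cache lines. The sketch below is illustrative Python. The observation it produces is arithmetic on the addresses, not a conclusion drawn in this thread, and line_offset and line_index are hypothetical helper names.

```python
LINE_SIZE = 32  # I-cache line size in bytes, as stated above

ADDRS = [0x8021E22C, 0x8021E26C, 0x8021E28C]  # the two lw t1 words and the epc


def line_offset(addr: int) -> int:
    """Byte offset of an address within its cache line."""
    return addr % LINE_SIZE


def line_index(addr: int) -> int:
    """Number of the cache-line-sized block that contains the address."""
    return addr // LINE_SIZE


offsets = [line_offset(a) for a in ADDRS]
lines_before_epc = [line_index(ADDRS[-1]) - line_index(a) for a in ADDRS]

for addr, off, dist in zip(ADDRS, offsets, lines_before_epc):
    print(f"{addr:08x}: offset {off:#04x} in its line, {dist} line(s) before the epc line")
```

All three addresses sit at the same byte offset within a line, and 0x8021e26c is exactly one line before the epc. Whether that means anything for the fault is left open here.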

I left it running overnight and got a different error.  Slightly harder to
pinpoint the exact instruction that caused the actual bad load, because the
failing instruction is loading indirect thru a register that is set to 0000fac4.
So the bad load was done previously, and resulted in this register (a1) being
set to 0000fac4.

The common theme here seems to be that I am getting 16 bad bits from RAM when
loading.  The first error, which I mentioned last night, was an instruction
fetch, and this new error looks more like a data load, since a1 was previously
loaded with the bogus value 0000fac4: another bad load in the most significant
16 bits.

So if my analysis is correct, the most significant 16 bits are being loaded
unreliably, both for instructions and for data loads.  This points to some of the lower
level issues you mention -- physical RAM interface, clocking, voltages, and
RAM timing setup.  If anyone can think of something else I should check, let
me know.
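
As a rough illustration of why 0000fac4 looks bogus, the sketch below classifies a 32-bit MIPS virtual address by segment and splits it into halfwords. It is illustrative Python; the thread does not say what a1 was supposed to hold, so treating it as a kernel pointer is an assumption made only for this example, and segment and halves are hypothetical helper names.

```python
def segment(addr: int) -> str:
    """Name the 32-bit MIPS virtual address segment that contains addr."""
    addr &= 0xFFFFFFFF
    if addr < 0x80000000:
        return "kuseg"
    if addr < 0xA0000000:
        return "kseg0"
    if addr < 0xC0000000:
        return "kseg1"
    return "kseg2/3"


def halves(value: int) -> tuple:
    """Return (most significant 16 bits, least significant 16 bits)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


BOGUS_A1 = 0x0000FAC4  # value reported for a1 in the second failure
A0_FIRST = 0x87E38660  # a0 from the first register dump, for comparison

for label, value in (("a1 (second failure)", BOGUS_A1), ("a0 (first dump)", A0_FIRST)):
    hi, lo = halves(value)
    print(f"{label}: {value:08x} upper={hi:04x} lower={lo:04x} segment={segment(value)}")
```

The upper halfword of the reported a1 is zero, which places it in kuseg, while the pointer in the first dump lies in kseg0. That is compatible with a bad upper 16 bits but does not prove it.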

Thanks again for the feedback.

Also Ralf, I got your message about the 2.6.14-rc1 version loud and clear.
Thanks to you too for the feedback.

Chuck

