Linux MIPS Architecture development
* Strange gp corruption problem
From: Mike Crowe @ 2007-07-12 17:06 UTC
  To: linux-mips

I'm seeing a rather strange problem on an NXP PNX8550 board we're
using (the PNX8550 is a SoC with a MIPS PR4450 core) and I'm wondering
if anyone here has ever seen anything similar or has any particular
advice as to how it can be investigated. I've searched the list
archives, googled, diffed with 2.6.22 etc. but not found anything
revealing. :(

We're using gcc 4.0.0 (from ELDK-4.0) on a 2.6.19 kernel with
glibc/LinuxThreads 2.3.5. I realise that this version of gcc is quite
old but it's the version used by the chip vendor for this platform.

The problem is easy to reproduce with a particular build of our
software but goes away very easily if code is changed, particularly
changes that affect the GOT or move code around. Even adding a single
"nop" instruction to the offending function "fixes the problem". This
makes it hard to debug.

We have a function that does some string manipulation (not
particularly dangerous manipulation and I've been through it
carefully) and then calls atol. As expected the prologue of this
function calculates the value of the gp register by applying an offset
to the t9 register which contains the address of the start of the
function like this:

 47995c:       3c1c0fba        lui     gp,0xfba
 479960:       279c1fe4        addiu   gp,gp,8164
 479964:       0399e021        addu    gp,gp,t9
 479968:       27bdff80        addiu   sp,sp,-128
 47996c:       afbf007c        sw      ra,124(sp)
 479970:       afbe0078        sw      s8,120(sp)
 479974:       03a0f021        move    s8,sp
 479978:       afbc0010        sw      gp,16(sp)
 47997c:       afc40080        sw      a0,128(s8)
 479980:       afc50084        sw      a1,132(s8)

The function then doesn't go near the t9 or gp registers until it
needs to read the address of the atol function from the GOT and it
does so like this:

 479a98:       8f99a7c4        lw      t9,-22588(gp)
 479a9c:       00402021        move    a0,v0
 479aa0:       0320f809        jalr    t9
 479aa4:       00000000        nop

At this point it segfaults because gp has an invalid value of
0x10497280.  t9 still has the correct value of 0x47995c. 16(sp) (see
479978 above) also has the incorrect value of gp of 0x10497280.

The correct value for gp in this binary is 0x1001b940.
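
To make the effect on the GOT lookup concrete, here is a minimal sketch
that decodes the 16-bit offset from the lw instruction word at 479a98
and computes the address it would read for each gp value. The helper
names are made up for illustration and the resulting addresses are
derived purely from the numbers quoted above; they were not observed
on the target.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed MIPS immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def lw_effective_address(word, base):
    """Address read by an lw instruction word given its base register value."""
    offset = sign_extend16(word & 0xFFFF)
    return (base + offset) & MASK32

LW_T9 = 0x8F99A7C4      # lw t9,-22588(gp), taken from the disassembly
GOOD_GP = 0x1001B940    # correct gp for this binary
BAD_GP = 0x10497280     # gp seen at the point of the segfault

good_slot = lw_effective_address(LW_T9, GOOD_GP)
bad_slot = lw_effective_address(LW_T9, BAD_GP)
print(hex(good_slot), hex(bad_slot))
```

The two addresses differ by exactly the same amount as the two gp
values, so the load with the bad gp lands well away from the intended
GOT entry.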

Interestingly the bad and good gp values are related in the following
way:

 0x1001b940 (correct value of gp)
+  0x47995c (address of this function == t9)
+    0x1fe4 (second part of gp fixup (8164 in decimal) from 479960 above)
=0x10497280 (bad value of gp)

This implies that upon execution of the instruction at 0x479960 gp
contained the "good" gp value of 0x1001b940 rather than the 0x0fba0000
it should have contained according to the previous instruction.
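
The arithmetic can be checked with a small sketch that models the three
gp fixup instructions of the prologue. The skip_lui flag is a
hypothetical switch, not something from the real system: it only models
"the lui result did not take effect" and says nothing about whether that
happened through a late entry, a lost register write or a bad context
restore. It also assumes gp held the caller's (correct) value on entry.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed MIPS immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def run_prologue(t9, gp_on_entry, skip_lui=False):
    """Model the gp fixup at 47995c..479964 and return the resulting gp."""
    gp = gp_on_entry
    if not skip_lui:
        gp = (0x0FBA << 16) & MASK32               # lui   gp,0xfba
    gp = (gp + sign_extend16(0x1FE4)) & MASK32     # addiu gp,gp,8164
    gp = (gp + t9) & MASK32                        # addu  gp,gp,t9
    return gp

T9 = 0x47995C           # address of the function, as held in t9
CALLER_GP = 0x1001B940  # assumed gp on entry (the correct value)

good = run_prologue(T9, CALLER_GP)
bad = run_prologue(T9, CALLER_GP, skip_lui=True)
print(hex(good), hex(bad))
```

With the lui applied the model gives the correct gp; with it dropped the
model gives exactly the bad value seen at the segfault.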

The only user-space reason I can come up with for this happening is if
the caller jumped into this function one instruction late. This seems
unlikely because t9 contains the correct value and the stack looks
fine. By instrumenting the kernel I determined that no signals are
being delivered around this time but instrumenting all context
switches looked rather hard.

TIA. Any advice gratefully received.

Mike.


* Re: Strange gp corruption problem
From: Thiemo Seufer @ 2007-07-12 17:21 UTC
  To: Mike Crowe; +Cc: linux-mips

Mike Crowe wrote:
[snip]
> We have a function that does some string manipulation (not
> particularly dangerous manipulation and I've been through it
> carefully) and then calls atol. As expected the prologue of this
> function calculates the value of the gp register by applying an offset
> to the t9 register which contains the address of the start of the
> function like this:
> 
>  47995c:       3c1c0fba        lui     gp,0xfba

Looks weird as an entry point. Normally entries are 8 byte aligned.

[snip]
> The only user-space reason I can come up with for this happening is if
> the caller jumped into this function one instruction late. This seems
> unlikely because t9 contains the correct value and the stack looks
> fine.

Check the value of $ra (e.g. with a gdb breakpoint) after entering the
function.


Thiemo


* Re: Strange gp corruption problem
From: Mike Crowe @ 2007-07-12 19:10 UTC
  To: Thiemo Seufer; +Cc: linux-mips

I wrote:
>> We have a function that does some string manipulation (not
>> particularly dangerous manipulation and I've been through it
>> carefully) and then calls atol. As expected the prologue of this
>> function calculates the value of the gp register by applying an offset
>> to the t9 register which contains the address of the start of the
>> function like this:
>> 
>>  47995c:       3c1c0fba        lui     gp,0xfba
 
On Thu, Jul 12, 2007 at 06:21:52PM +0100, Thiemo Seufer wrote:
> Looks weird as an entry point. Normally entries are 8 byte aligned.

Many of the entry points in the image are only four byte aligned. :(

>> The only user-space reason I can come up with for this happening is if
>> the caller jumped into this function one instruction late. This seems
>> unlikely because t9 contains the correct value and the stack looks
>> fine.
> 
> Check the value of $ra (e.g. with a gdb breakpoint) after entering the
> function.

I should have mentioned that this problem doesn't occur every time the
function is called. The reproduction case I have calls the function
over and over again from a higher level (in fact the loop is in a
scripting language). I believe that it usually fails on the second
invocation but if I make unrelated changes to the code it can
sometimes happen after ten or twenty invocations. As soon as I try to
put breakpoints into the function it doesn't happen (although if I
disable the breakpoint and continue it does tend to strike). It looks
like something asynchronous but it's difficult to work out why it
likes to strike only this function.

At the point of the segfault $ra = 0x479fe8 which is the value I would
expect.

  479fcc:       8fc20048        lw      v0,72(s8)
  479fd0:       8c420000        lw      v0,0(v0)
  479fd4:       8fc40040        lw      a0,64(s8)
  479fd8:       00402821        move    a1,v0
  479fdc:       8f999750        lw      t9,-26800(gp)
  479fe0:       0320f809        jalr    t9
  479fe4:       00000000        nop
  479fe8:       8fdc0010        lw      gp,16(s8)
  479fec:       afc20018        sw      v0,24(s8)

I also failed to mention that we're using binutils-2.16.1.

Thanks for your response.

Mike.
