gcc-4.8+ and R10000+

* gcc-4.8+ and R10000+
@ 2014-09-07  8:25 Joshua Kinard
  2014-09-08  8:11 ` Miod Vallat
  2014-09-28 19:34 ` Maciej W. Rozycki
  0 siblings, 2 replies; 7+ messages in thread
From: Joshua Kinard @ 2014-09-07  8:25 UTC
  To: Linux MIPS List

I've been banging my head on the desk over gcc PR61538 [1] the last few
months, and talking to the gcc people, I went looking through the R10000
manual again to try and see if some kind of errata sticks out.  I found this
bit:

"""
Load Linked and Store Conditional instructions (LL, LLD,
SC, and SCD) do not implicitly perform SYNC operations in
the R10000.  Any of the following events that occur between
a Load Linked and a Store Conditional will cause the Store
Conditional to fail: an exception; execution of an ERET,
a load, a store, a SYNC, a CacheOp, a prefetch, or an
external intervention/invalidation on the block containing
the linked address. Instruction cache misses do not cause
the Store Conditional to fail.
"""

The regression happens inside glibc's __lll_lock_wait_private routine:

void
__lll_lock_wait_private (int *futex)
{
  if (*futex == 2)
    lll_futex_wait (futex, 2, LLL_PRIVATE);

  while (atomic_exchange_acq (futex, 2) != 0)
    lll_futex_wait (futex, 2, LLL_PRIVATE);
}

It appears to hang forever on the "atomic_exchange_acq" function call.
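
For readers without the glibc macros at hand, the shape of that loop can be
sketched in portable C.  This is only a sketch under assumptions:
atomic_exchange_acq is modelled with GCC's __atomic_exchange_n builtin using
acquire ordering, and lll_futex_wait is replaced by a hypothetical stub
(fake_futex_wait) that pretends the lock holder released the futex, so the
loop terminates without a real futex syscall.

```c
/* Hypothetical stand-in for lll_futex_wait(): it counts how often it was
   called and, so that this sketch terminates, pretends the current lock
   holder released the futex while we were "sleeping". */
static int futex_wait_calls;

static void fake_futex_wait(int *futex, int expected)
{
    futex_wait_calls++;
    if (__atomic_load_n(futex, __ATOMIC_RELAXED) == expected)
        __atomic_store_n(futex, 0, __ATOMIC_RELEASE);
}

/* Same control flow as the __lll_lock_wait_private listing: mark the futex
   as "locked, with waiters" (2) and sleep until the previous value was 0. */
static void lock_wait_sketch(int *futex)
{
    if (*futex == 2)
        fake_futex_wait(futex, 2);

    /* The exchange is the step that becomes the ll/sc retry loop on MIPS. */
    while (__atomic_exchange_n(futex, 2, __ATOMIC_ACQUIRE) != 0)
        fake_futex_wait(futex, 2);
}
```

Starting from a futex value of 1 (locked, no waiters), lock_wait_sketch()
leaves the value at 2 after a single call to the stub.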

Disassembling a statically-built copy of the "sln" binary generated by
glibc's compile phase shows slight differences in how gcc-4.7 and gcc-4.8
compile the __lll_lock_wait_private function.  The key differences in the
output asm are these:

gcc-4.7:
   x+4   <START>
         ...
   x+24  bne     v1,v0,<x+56>
         ...
   x+32  0x7c03e83b /* rdhwr */
   x+36  li      a2,2
   x+40  lw      a1,-29832(v1)
   x+44  move    a3,zero
   x+48  li      v0,4238
   x+52  syscall
*  x+56  li      v0,2
*  x+60  ll      v1,0(s0)
*  x+64  move    a0,v0
*  x+68  sc      a0,0(s0)
   x+72  beqzl   a0,<x+56>
   x+76  nop
   x+80  sync
   x+84  bnez    v1,<x+32>

gcc-4.8:
   x+4   <START>
         ...
   x+24  bne     v1,v0,<x+56>
         ...
   x+32  0x7c03e83b /* rdhwr */
   x+36  li      a2,2
   x+40  lw      a1,-29832(v1)
   x+44  move    a3,zero
   x+48  li      v0,4238
   x+52  syscall
*  x+56  ll      v0,0(s0)
*  x+60  li      at,2
*  x+64  sc      at,0
   x+68  beqzl   at,<x+56>
   x+72  nop
   x+76  sync
   x+80  bnez    v0,<x+32>

Using gdb, if I step through 'sln', the gcc-4.7 copy never calls
__lll_lock_wait_private, so I have no idea how the insns are being executed.
But the 4.8 copy does get into this function, and stepping one instruction
at a time yields this execution path:

   x+4   <START>
         ...
   x+24  bne     v1,v0,<x+56>
   x+56  ll      v0,0(s0)
   x+68  beqzl   at,<x+56> /* beqzl check fails -> x+76 */
   x+76  sync
   x+80  bnez    v0,<x+32>
   x+32  0x7c03e83b /* rdhwr */
   x+36  li      a2,2
   x+40  lw      a1,-29832(v1)
   x+44  move    a3,zero
   x+48  li      v0,4238
   x+52  syscall
   x+56  ll      v0,0(s0)
   <HANG>

Executing the 'bnez' insn puts us at the rdhwr insn (x+32), then stepping
through, the 'syscall' (x+52) returns and leaves us at the 'll' (x+56) a
second time, where the program just hangs.

I am guessing at a few things here:

- Because ll/sc are atomic, gdb doesn't let you step through them, which is
why the instruction pointer jumps over the 'li' and 'sc' insns.

- The 'li' after 'll' triggers the 'sc' to fail on R10K.

Does this look correct for an R10000, given the above statement from the
manual?  I'm not sure how or why this would cause the program to hang, but
it seems to directly correlate.
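
One way to check the second guess against the manual text quoted above is to
write the listed events down as a toy model.  The sketch below is not a
hardware model: the enum and function names are made up for illustration, and
it assumes the manual's list of link-breaking events is complete.  Under that
reading, a register-only instruction such as 'li' (or the 'move' in the
gcc-4.7 sequence) is not among the events that make the Store Conditional
fail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Kinds of events that can occur between an LL and its SC.  The names are
   illustrative only; they mirror the list in the quoted manual text. */
enum ll_sc_event {
    EV_REGISTER_ONLY,       /* e.g. li, move: no memory access */
    EV_ICACHE_MISS,         /* explicitly stated not to fail the SC */
    EV_EXCEPTION,
    EV_ERET,
    EV_LOAD,
    EV_STORE,
    EV_SYNC,
    EV_CACHEOP,
    EV_PREFETCH,
    EV_EXTERNAL_INVALIDATE  /* intervention/invalidation on the linked block */
};

/* True if the quoted text says this event makes the following SC fail. */
static bool breaks_link(enum ll_sc_event ev)
{
    switch (ev) {
    case EV_REGISTER_ONLY:
    case EV_ICACHE_MISS:
        return false;
    default:
        return true;
    }
}

/* Given the events seen between LL and SC, would the SC be allowed to
   succeed under this reading of the manual? */
static bool sc_would_succeed(const enum ll_sc_event *between, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (breaks_link(between[i]))
            return false;
    return true;
}
```

Both compiler outputs place exactly one register-only instruction between the
'll' and the 'sc', so this model treats them the same way.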

Is anyone from Debian able to test building gcc-4.8 (or greater) and
glibc-2.19 on an R10K system and see if it hangs at the end of glibc's
compile phase, when the 'sln' binary is used to generate symlinks?  I've run
into this on R12000 and R14000 systems.  I am assuming it'll happen on an
R10000 system as well.

1: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61538

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

* Re: gcc-4.8+ and R10000+
  2014-09-07  8:25 gcc-4.8+ and R10000+ Joshua Kinard
@ 2014-09-08  8:11 ` Miod Vallat
  2014-09-08  9:44   ` Joshua Kinard
  2014-09-28 19:34 ` Maciej W. Rozycki
  1 sibling, 1 reply; 7+ messages in thread
From: Miod Vallat @ 2014-09-08  8:11 UTC
  To: linux-mips

> Disassembling a statically-built copy of the "sln" binary generated by
> glibc's compile phase, there are slight differences in how gcc-4.7 and
> gcc-4.8 are compiling the __lll_lock_wait_private function.  The key
> differences in the output asm are
> this:

[...]

> gcc-4.8:
>    x+4   <START>
>          ...
>    x+24  bne     v1,v0,<x+56>
>          ...
>    x+32  0x7c03e83b /* rdhwr */
>    x+36  li      a2,2
>    x+40  lw      a1,-29832(v1)
>    x+44  move    a3,zero
>    x+48  li      v0,4238
>    x+52  syscall
> *  x+56  ll      v0,0(s0)
> *  x+60  li      at,2
> *  x+64  sc      at,0

Note how the sc address is no longer 0(s0). Since the address does
not match the address used in the ll instruction, sc will always
fail on the R10k.

Miod

* Re: gcc-4.8+ and R10000+
  2014-09-08  8:11 ` Miod Vallat
@ 2014-09-08  9:44   ` Joshua Kinard
  2014-09-22  9:31       ` Joshua Kinard
  0 siblings, 1 reply; 7+ messages in thread
From: Joshua Kinard @ 2014-09-08  9:44 UTC
  To: linux-mips

On 09/08/2014 04:11, Miod Vallat wrote:
>> Disassembling a statically-built copy of the "sln" binary generated by
>> glibc's compile phase, there are slight differences in how gcc-4.7 and
>> gcc-4.8 are compiling the __lll_lock_wait_private function.  The key
>> differences in the output asm are
>> this:
> 
> [...]
> 
>> gcc-4.8:
>>    x+4   <START>
>>          ...
>>    x+24  bne     v1,v0,<x+56>
>>          ...
>>    x+32  0x7c03e83b /* rdhwr */
>>    x+36  li      a2,2
>>    x+40  lw      a1,-29832(v1)
>>    x+44  move    a3,zero
>>    x+48  li      v0,4238
>>    x+52  syscall
>> *  x+56  ll      v0,0(s0)
>> *  x+60  li      at,2
>> *  x+64  sc      at,0
> 
> Note how the sc address is no longer 0(s0). Since the address does
> not match the address used in the ll instruction, sc will always
> fail on the R10k.

That would be a typo on my part.  I typed that out by hand and just missed it.  It should read:

gcc-4.8:
   x+4   <START>
         ...
   x+24  bne     v1,v0,<x+56>
         ...
   x+32  0x7c03e83b /* rdhwr */
   x+36  li      a2,2
   x+40  lw      a1,-29832(v1)
   x+44  move    a3,zero
   x+48  li      v0,4238
   x+52  syscall
*  x+56  ll      v0,0(s0)
*  x+60  li      at,2
*  x+64  sc      at,0(s0)
   x+68  beqzl   at,<x+56>
   x+72  nop
   x+76  sync
   x+80  bnez    v0,<x+32>

Thanks!

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

* Re: gcc-4.8+ and R10000+
@ 2014-09-22  9:31       ` Joshua Kinard
  0 siblings, 0 replies; 7+ messages in thread
From: Joshua Kinard @ 2014-09-22  9:31 UTC
  To: linux-mips

On 09/08/2014 05:44, Joshua Kinard wrote:
> On 09/08/2014 04:11, Miod Vallat wrote:
>>> Disassembling a statically-built copy of the "sln" binary generated by
>>> glibc's compile phase, there are slight differences in how gcc-4.7 and
>>> gcc-4.8 are compiling the __lll_lock_wait_private function.  The key
>>> differences in the output asm are
>>> this:
>>
>> [...]
>>
>>> gcc-4.8:
>>>    x+4   <START>
>>>          ...
>>>    x+24  bne     v1,v0,<x+56>
>>>          ...
>>>    x+32  0x7c03e83b /* rdhwr */
>>>    x+36  li      a2,2
>>>    x+40  lw      a1,-29832(v1)
>>>    x+44  move    a3,zero
>>>    x+48  li      v0,4238
>>>    x+52  syscall
>>> *  x+56  ll      v0,0(s0)
>>> *  x+60  li      at,2
>>> *  x+64  sc      at,0
>>
>> Note how the sc address is no longer 0(s0). Since the address does
>> not match the address used in the ll instruction, sc will always
>> fail on the R10k.
> 
> That would be a typo on my part.  I typed that out by hand and just missed it.  It should read:
> 
> gcc-4.8:
>    x+4   <START>
>          ...
>    x+24  bne     v1,v0,<x+56>
>          ...
>    x+32  0x7c03e83b /* rdhwr */
>    x+36  li      a2,2
>    x+40  lw      a1,-29832(v1)
>    x+44  move    a3,zero
>    x+48  li      v0,4238
>    x+52  syscall
> *  x+56  ll      v0,0(s0)
> *  x+60  li      at,2
> *  x+64  sc      at,0(s0)
>    x+68  beqzl   at,<x+56>
>    x+72  nop
>    x+76  sync
>    x+80  bnez    v0,<x+32>

I did some more tracing.  It seems the issue with glibc itself stems from the
__atomic_* builtins, which were added generally in gcc-4.7 and gained
efficient MIPS support in gcc-4.8:

From ports/sysdeps/mips/bits/atomic.h (for 2.19) or sysdeps/mips/bits/atomic.h
(for 2.20):

/* The __atomic_* builtins are available in GCC 4.7 and later, but MIPS
   support for their efficient implementation was added only in GCC 4.8.
   We still want to use them even with GCC 4.7 for MIPS16 code where we
   have no assembly alternative available and want to avoid the __sync_*
   builtins if at all possible.  */

#if __GNUC_PREREQ (4, 8) || (defined __mips16 && __GNUC_PREREQ (4, 7))
[snip]

This is why the assembly is different between the two gcc versions.  This same
code is in the kernel's atomic.h copy under arch/mips/include/asm/ as well.
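
To make the effect of that version check concrete, here is a small
stand-alone sketch of the same kind of gate.  The SKETCH_* macro names and
exchange_acq_sketch are invented for this example and are not glibc
identifiers.  The pre-4.8 branch in glibc uses MIPS inline assembly, which
would not build elsewhere, so this sketch substitutes the older
__sync_lock_test_and_set builtin purely as a portable placeholder.

```c
/* Local stand-in for glibc's __GNUC_PREREQ, so the sketch builds without
   glibc's <features.h>. */
#if defined __GNUC__ && defined __GNUC_MINOR__
# define SKETCH_GNUC_PREREQ(maj, min) \
    ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min))
#else
# define SKETCH_GNUC_PREREQ(maj, min) 0
#endif

/* Same condition as the glibc header quoted above. */
#if SKETCH_GNUC_PREREQ (4, 8) || (defined __mips16 && SKETCH_GNUC_PREREQ (4, 7))
# define SKETCH_USE_ATOMIC_BUILTINS 1
#else
# define SKETCH_USE_ATOMIC_BUILTINS 0
#endif

/* Atomically store newval in *mem and return the previous value. */
static int exchange_acq_sketch(int *mem, int newval)
{
#if SKETCH_USE_ATOMIC_BUILTINS
    /* Path taken by gcc-4.8 and later: the compiler emits the ll/sc loop. */
    return __atomic_exchange_n(mem, newval, __ATOMIC_ACQUIRE);
#else
    /* Placeholder for the hand-written inline assembly path. */
    return __sync_lock_test_and_set(mem, newval);
#endif
}
```

Either branch has the same observable result; only the generated instruction
sequence differs, which matches the two disassemblies shown earlier in the
thread.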

I tested by removing the top part of the #if macro and basically forcing the
inline versions only, then rebuilt glibc-2.20 with gcc-4.9.2 (20140921
prerelease), and lo and behold, sln executes and returns its usage information.
When using the gcc internal builtins, a futex gets used, which is why I wasn't
seeing futexes in 4.7-built copies of sln, only in 4.8 or greater-built copies.
This means that the gcc internal __atomic_* builtins may be somewhat to blame
for this problem on R1x000 systems.

I traced the kernel side of the problem out and figured out that when the futex
is taken by sln, the process gets frozen by the scheduler via a call to
freezable_schedule() in function futex_wait_queue_me in kernel/futex.c.  I
added two printk statements, one before freezable_schedule() and one after, and
the first statement executes (verified by dumping /proc/kmsg directly because
dmesg itself generates futexes), but not the printk after.  The printk after
freezable_schedule() only executes when I ctrl+C the frozen process and it
exits out of the futex code.

I visually checked through include/linux/freezer.h and noticed that
freezable_schedule eventually calls freezing(), which executes an atomic_read()
on system_freezing_cnt.  In the mips code, that just comes out as a pointer
dereference of a volatile variable.  I'm not certain, though, if in gcc's case,
the use of volatile means it tries to use its builtin __atomic_ functions
again, and tries to take another futex /while it's trying to take a futex/.
Chicken and egg?
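
For reference, the "pointer dereference of a volatile variable" can be
sketched as below.  The names carry a _sketch suffix because they are
simplified stand-ins, not the kernel's definitions, and freezing_sketch()
only models the first check (is the counter non-zero?) rather than the full
freezing() logic.  As far as C and gcc are concerned, a volatile read like
this is an ordinary load; it does not expand to an __atomic_* builtin.

```c
/* Simplified stand-in for the kernel's atomic_t. */
typedef struct { int counter; } atomic_sketch_t;

/* Read the counter through a volatile-qualified pointer so the compiler
   must perform the load each time instead of caching the value. */
#define atomic_read_sketch(v) (*(volatile int *)&(v)->counter)

/* Stand-in for system_freezing_cnt. */
static atomic_sketch_t system_freezing_cnt_sketch = { 0 };

/* Models only the fast-path test: non-zero means a freeze is in progress
   and the slower per-task check would be consulted. */
static int freezing_sketch(void)
{
    return atomic_read_sketch(&system_freezing_cnt_sketch) != 0;
}
```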

So, it could still very well be a gcc issue, or maybe it's something really subtle
in the kernel code.  I am not sure which.  I at least know of a specific gcc
commit that enables/disables the problem, and that's pointing the finger at gcc
here.

Ideas?

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

* Re: gcc-4.8+ and R10000+
  2014-09-07  8:25 gcc-4.8+ and R10000+ Joshua Kinard
  2014-09-08  8:11 ` Miod Vallat
@ 2014-09-28 19:34 ` Maciej W. Rozycki
  2014-09-29  5:18   ` Joshua Kinard
  1 sibling, 1 reply; 7+ messages in thread
From: Maciej W. Rozycki @ 2014-09-28 19:34 UTC
  To: Joshua Kinard; +Cc: Linux MIPS List

Joshua,

 I can't help you with the problem, but I can confirm one of your guesses:

> I am guessing at a few things here:
> 
> - Because ll/sc are atomic, gdb doesn't let you step through them, which is
> why the instruction pointer jumps over the 'li' and 'sc' insns.

-- this is exactly the case.  GDB tries to be smart: when it sees an LL or
LLD instruction, it examines the code that follows to find a matching SC or
SCD instruction and any other exit points from the atomic section, and sets
internal breakpoints so that the code fragment runs at full speed even when
single-stepping.  Otherwise the exception taken at each single step would
cause the conditional store instruction to always fail -- which might not be
a big issue if you were knowingly stepping code, e.g. with `stepi', but would
cause big harm in implicit stepping through unknown or unrelated code, such
as when a software watchpoint is active.

 See `deal_with_atomic_sequence' in gdb/mips-tdep.c if curious about the 
details.
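
 The idea can be illustrated with a much-simplified sketch.  This is not
GDB's code: the function name and the max_scan parameter are hypothetical,
and unlike the description above it ignores branches inside the sequence and
looks only for the closing SC or SCD.  It relies on the classic (pre-R6) MIPS
major opcodes for LL, LLD, SC and SCD.

```c
#include <stddef.h>
#include <stdint.h>

/* Major opcodes (bits 31..26) of the classic MIPS encodings. */
enum { OP_LL = 0x30, OP_LLD = 0x34, OP_SC = 0x38, OP_SCD = 0x3c };

static unsigned major_opcode(uint32_t insn)
{
    return insn >> 26;
}

/* If code[pc] is an LL or LLD, return the index of the instruction right
   after the matching SC/SCD, i.e. where a stepping debugger could place its
   internal breakpoint.  Returns -1 if pc does not start an atomic sequence
   or no SC/SCD is found within max_scan instructions. */
static long resume_index_after_sc(const uint32_t *code, size_t len,
                                  size_t pc, size_t max_scan)
{
    if (pc >= len)
        return -1;

    unsigned op = major_opcode(code[pc]);
    if (op != OP_LL && op != OP_LLD)
        return -1;

    for (size_t i = pc + 1; i < len && i - pc <= max_scan; i++) {
        op = major_opcode(code[i]);
        if (op == OP_SC || op == OP_SCD)
            return (long)(i + 1);
    }
    return -1;
}
```

 Applied to the gcc-4.8 sequence (ll at x+56, li at x+60, sc at x+64), the
resume point is the instruction at x+68, which is where the single-step trace
in the original report picks up again.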

  Maciej

* Re: gcc-4.8+ and R10000+
  2014-09-28 19:34 ` Maciej W. Rozycki
@ 2014-09-29  5:18   ` Joshua Kinard
  0 siblings, 0 replies; 7+ messages in thread
From: Joshua Kinard @ 2014-09-29  5:18 UTC
  To: Maciej W. Rozycki; +Cc: Linux MIPS List

On 09/28/2014 15:34, Maciej W. Rozycki wrote:
> Joshua,
> 
>  I can't help you with the problem, but I can confirm one of your guesses:
> 
>> I am guessing at a few things here:
>>
>> - Because ll/sc are atomic, gdb doesn't let you step through them, which is
>> why the instruction pointer jumps over the 'li' and 'sc' insns.
> 
> -- this is exactly the case, GDB tries to be smart enough and when it sees 
> an LL or LLD instruction it examines code that follows to find a matching 
> SC or SCD instruction and any other exit points from the atomic section 
> and sets internal breakpoints correctly to let the code fragment run at 
> the full speed even if single stepping.  Otherwise the exception taken at 
> each single step would cause the conditional store instruction to always 
> fail -- which might not be a big issue if you were knowingly stepping code 
> e.g. with `stepi', but would cause big harm in implicit stepping through 
> unknown or unrelated code such as when a software watchpoint is active.
> 
>  See `deal_with_atomic_sequence' in gdb/mips-tdep.c if curious about the 
> details.

Ah ha, that does explain it!  Though I don't think it's an issue with ll/sc in
the R10000.  It's something with gcc's builtin __atomic_* functions I think,
though I still haven't ruled the kernel out yet, either.  I have no way to step
through the kernel syscall to make that determination, though, so I'm focusing
more on gcc as time permits.  Hopefully, the gcc maintainers will find time to
look into PR61538 some more soon.

Thanks!

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us.  And our
lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic
