public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* Possible race condition with deferred binding on IPF
@ 2004-03-05  0:05 Cary Coutant
  2004-03-07 21:53 ` Zack Weinberg
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Cary Coutant @ 2004-03-05  0:05 UTC (permalink / raw)
  To: linux-ia64

We (HP) have discovered a missing requirement in the psABI document 
with respect to import stubs and inlined import stubs. The deferred 
binding sequence relies on careful ordering of the two stores that 
update the function descriptor in the dynamic loader's 
bind-on-reference routine, and the two loads in the import stub. The 
import stub must load the entry point address, followed by the gp 
value, while the BOR routine must store the gp value first, followed by 
the entry point address. This ensures that one thread cannot interfere 
with another thread if both are making the same call at about the same 
time. Furthermore, we need to ensure strong ordering with the use of 
".acq" and ".rel" completers.

The race condition to be avoided is this:

     Thread 1               Thread 2
     --------               --------
                            load (garbage) gp value
     store new gp value
     store new entry point
                            load (new) entry point

If the loads become visible out of order, and the stores become visible 
between the two loads, we will branch to the actual entry point with a 
garbage gp value. Forcing strong ordering between the two stores and 
between the two loads eliminates the possibility of this ordering 
between the two loads, and therefore eliminates the race condition.

To our knowledge, we have not yet observed an application failure due 
to this problem. We discovered this while finding a compiler bug in 
which the two loads were physically moved out of order by the compiler, 
and came to realize that the psABI does not state any requirement for 
strong ordering.

This problem has potential impact on the following Gnu/Linux components:

- gcc is not affected, since it does not inline import stubs (I think).
- The intel compiler needs to be fixed to generate the necessary 
ld8.acq in inlined import stubs.
- gld needs to be fixed to generate the necessary ld8.acq in the import 
stubs.
- the dynamic loader is not affected, as it uses the required st8.rel 
already.
- handcoded assembly language source is at risk, but inlined import 
stub sequences in handcoded assembly code are, I hope, rare.

While it's clearly important to fix these components, I believe that 
any actual failures due to the missing .acq are so unlikely that there 
is no great urgency to encourage a recompile of the world.

I have proposed the following changes to the IA-64 psABI document:

1. In Section 5.3.6 ("Procedure Linkage Table"), the following 
additional text is needed:

"An import stub (whether in the procedure linkage table or inlined at 
the point of call) must ensure strong ordering between  the load of the 
first doubleword of the local function descriptor and the load of the 
second doubleword. This means that the two loads may not be reordered, 
and the first load must have acquire semantics."

...

"When the dynamic loader updates the local function descriptor entry, 
it must store the second doubleword of the function descriptor (the 
function's gp value) before storing the first doubleword (the function 
address), and the second store must have release semantics, ensuring 
strong ordering between the two stores."

2. The example code in Figure 5-4 needs the ".acq" completer on the 
first load instruction, as follows:

     ...
     .PLT1: (entry for symbol name1)
         addl     r15 = @pltoff(name1), gp ;;
         ld8.acq  r16 = [r15], 8
         mov      r14 = gp ;;
         ld8      gp = [r15]
         mov      b6 = r16
         br       b6
     ...


Cary Coutant
HP-UX Runtime Architect


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible race condition with deferred binding on IPF
  2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
@ 2004-03-07 21:53 ` Zack Weinberg
  2004-03-07 23:09 ` Zack Weinberg
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-07 21:53 UTC (permalink / raw)
  To: linux-ia64

Cary Coutant <cary@cup.hp.com> writes:

> We (HP) have discovered a missing requirement in the psABI document
> with respect to import stubs and inlined import stubs.
...

> 2. The example code in Figure 5-4 needs the ".acq" completer on the
> first load instruction, as follows:
>
>      ...
>      .PLT1: (entry for symbol name1)
>          addl     r15 = @pltoff(name1), gp ;;
>          ld8.acq  r16 = [r15], 8
>          mov      r14 = gp ;;
>          ld8      gp = [r15]
>          mov      b6 = r16
>          br       b6

I believe that this corresponds to the following code in bfd/elfxx-ia64.c:

static const bfd_byte plt_full_entry[PLT_FULL_ENTRY_SIZE] {
  0x0b, 0x78, 0x00, 0x02, 0x00, 0x24,  /*   [MMI]       addl r15=0,r1;;    */
  0x00, 0x41, 0x3c, 0x30, 0x28, 0xc0,  /*               ld8 r16=[r15],8    */
  0x01, 0x08, 0x00, 0x84,              /*               mov r14=r1;;       */
  0x11, 0x08, 0x00, 0x1e, 0x18, 0x10,  /*   [MIB]       ld8 r1=[r15]       */
  0x60, 0x80, 0x04, 0x80, 0x03, 0x00,  /*               mov b6=r16         */
  0x60, 0x00, 0x80, 0x00               /*               br.few b6;;        */
};

Converting the ld8 to a ld8.acq is a simple matter of changing the
second line of this array to

  0x00, 0x41, 0x3c, 0x70, 0x29, 0xc0,  /*               ld8.acq r16=[r15],8 */

However, I have two related concerns before I try to submit a patch:

1) If I assemble the sample code above, using GAS 2.14, the first byte
   of the first bundle is 0a, not 0b.  Hex-editing it to 0b doesn't
   seem to make any difference to the disassembly, but I would like to
   know if there is a difference anyway.

2) There is another code sequence synthesized by the linker that might
   need the same treatment:

static const bfd_byte plt_header[PLT_HEADER_SIZE] {
  0x0b, 0x10, 0x00, 0x1c, 0x00, 0x21,  /*   [MMI]       mov r2=r14;;       */
  0xe0, 0x00, 0x08, 0x00, 0x48, 0x00,  /*               addl r14=0,r2      */
  0x00, 0x00, 0x04, 0x00,              /*               nop.i 0x0;;        */
  0x0b, 0x80, 0x20, 0x1c, 0x18, 0x14,  /*   [MMI]       ld8 r16=[r14],8;;  */
  0x10, 0x41, 0x38, 0x30, 0x28, 0x00,  /*               ld8 r17=[r14],8    */
  0x00, 0x00, 0x04, 0x00,              /*               nop.i 0x0;;        */
  0x11, 0x08, 0x00, 0x1c, 0x18, 0x10,  /*   [MIB]       ld8 r1=[r14]       */
  0x60, 0x88, 0x04, 0x80, 0x03, 0x00,  /*               mov b6=r17         */
  0x60, 0x00, 0x80, 0x00               /*               br.few b6;;        */
};

I don't understand what this code is doing so I can't be sure which
ld8 needs an .acq.  (In fact, I don't understand the point of the
first bundle at all.)

zw

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible race condition with deferred binding on IPF
  2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
  2004-03-07 21:53 ` Zack Weinberg
@ 2004-03-07 23:09 ` Zack Weinberg
  2004-03-08 17:15 ` Cary Coutant
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-07 23:09 UTC (permalink / raw)
  To: linux-ia64

Zack Weinberg <zack@codesourcery.com> writes:

> I have two related concerns before I try to submit a patch:
>
> 1) If I assemble the sample code above, using GAS 2.14, the first byte
>    of the first bundle is 0a, not 0b.  Hex-editing it to 0b doesn't
>    seem to make any difference to the disassembly, but I would like to
>    know if there is a difference anyway.

... maybe I should read the disassembly dumps more carefully.  This
turns out to be because I dropped the ;; on the third instruction of
the first bundle.

> 2) There is another code sequence synthesized by the linker that might
>    need the same treatment:
>
> static const bfd_byte plt_header[PLT_HEADER_SIZE] > {
>   0x0b, 0x10, 0x00, 0x1c, 0x00, 0x21,  /*   [MMI]       mov r2=r14;;       */
>   0xe0, 0x00, 0x08, 0x00, 0x48, 0x00,  /*               addl r14=0,r2      */
>   0x00, 0x00, 0x04, 0x00,              /*               nop.i 0x0;;        */
>   0x0b, 0x80, 0x20, 0x1c, 0x18, 0x14,  /*   [MMI]       ld8 r16=[r14],8;;  */
>   0x10, 0x41, 0x38, 0x30, 0x28, 0x00,  /*               ld8 r17=[r14],8    */
>   0x00, 0x00, 0x04, 0x00,              /*               nop.i 0x0;;        */
>   0x11, 0x08, 0x00, 0x1c, 0x18, 0x10,  /*   [MIB]       ld8 r1=[r14]       */
>   0x60, 0x88, 0x04, 0x80, 0x03, 0x00,  /*               mov b6=r17         */
>   0x60, 0x00, 0x80, 0x00               /*               br.few b6;;        */
> };

I looked this up in the ABI document, and now I understand what it is
doing.  There is in fact a function descriptor fetch in here, from the
PLT_RESERVE area; it's the second and third ld8 instructions.  It
seems unlikely that we have to worry about this getting changed on the
fly at runtime, but a belt-and-suspenders approach would put an .acq
suffix on the second ld8.

I have a related question.  It seems to me that the canonical form of
the PLT entries has not been optimized quite as much as it could be.
In particular, the use of r14 as the pointer to the function
descriptor seems suboptimal.  As I read the document, this register is
dead after it's used to load the global pointer.  If r2 were used
instead, I think PLT0 could be tightened up a bit, at the cost of
pushing the PLT_RESERVE pointer load into the secondary PLT entries
(where there is a free bundle slot - the cost is in having to update
all of them at load time, but then, that has to happen anyway to set
up the PLT index).  Thus:

.PLT0:
        ld8     r16 = [r2], 8
        ld8     r17 = [r2], 8 ;;   # possibly ld8.acq
        ld8     r1 = [r2]
        mov     b6 = r17
        br      b6            ;;

.PLT1:
        addl    r15 = @pltoff(name1), r1 ;;
        ld8.acq r16 = [r15], 8
        mov     r2 = r1 ;;
        ld8     r1 = [r15]
        mov     b6 = r16
        br.few  b6      ;;

.PLT1a:
        addl    r2 = @gprel(plt_reserve), r2
        mov     r15 = @iplt(name1)
        br      .PLT0

The net effect is to shrink .PLT0 by one bundle and execute one fewer
non-NOP instruction.  Thoughts?

zw

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible race condition with deferred binding on IPF
  2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
  2004-03-07 21:53 ` Zack Weinberg
  2004-03-07 23:09 ` Zack Weinberg
@ 2004-03-08 17:15 ` Cary Coutant
  2004-03-08 18:08 ` Zack Weinberg
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Cary Coutant @ 2004-03-08 17:15 UTC (permalink / raw)
  To: linux-ia64

> Converting the ld8 to a ld8.acq is a simple matter of changing the
> second line of this array to
>
>   0x00, 0x41, 0x3c, 0x70, 0x29, 0xc0,  /*               ld8.acq 
> r16=[r15],8 */

Yes, this is the same bit pattern Steve Ellcey and I came up with.

> 1) If I assemble the sample code above, using GAS 2.14, the first byte
>    of the first bundle is 0a, not 0b.  Hex-editing it to 0b doesn't
>    seem to make any difference to the disassembly, but I would like to
>    know if there is a difference anyway.

As you discovered, that's just a missing stop bit.

> 2) There is another code sequence synthesized by the linker that might
>    need the same treatment:
>
> static const bfd_byte plt_header[PLT_HEADER_SIZE] > {
>   0x0b, 0x10, 0x00, 0x1c, 0x00, 0x21,  /*   [MMI]       mov r2=r14;;   
>     */
>   0xe0, 0x00, 0x08, 0x00, 0x48, 0x00,  /*               addl r14=0,r2  
>     */
>   0x00, 0x00, 0x04, 0x00,              /*               nop.i 0x0;;    
>     */
>   0x0b, 0x80, 0x20, 0x1c, 0x18, 0x14,  /*   [MMI]       ld8 
> r16=[r14],8;;  */
>   0x10, 0x41, 0x38, 0x30, 0x28, 0x00,  /*               ld8 
> r17=[r14],8    */
>   0x00, 0x00, 0x04, 0x00,              /*               nop.i 0x0;;    
>     */
>   0x11, 0x08, 0x00, 0x1c, 0x18, 0x10,  /*   [MIB]       ld8 r1=[r14]   
>     */
>   0x60, 0x88, 0x04, 0x80, 0x03, 0x00,  /*               mov b6=r17     
>     */
>   0x60, 0x00, 0x80, 0x00               /*               br.few b6;;    
>     */
> };

This code does not need to be patched. The two words loaded here point 
to the dynamic loader's BOR routine. The dynamic loader must provide 
the proper values in the linkage table before the program can run; 
these values will not change, so the ordering isn't important. Adding 
an ld.acq here would unnecessarily slow the code down.

> I have a related question.  It seems to me that the canonical form of
> the PLT entries has not been optimized quite as much as it could be.
> In particular, the use of r14 as the pointer to the function
> descriptor seems suboptimal.  As I read the document, this register is
> dead after it's used to load the global pointer.  If r2 were used
> instead, I think PLT0 could be tightened up a bit, at the cost of
> pushing the PLT_RESERVE pointer load into the secondary PLT entries
> (where there is a free bundle slot - the cost is in having to update
> all of them at load time, but then, that has to happen anyway to set
> up the PLT index).

I don't see anything wrong with you're reasoning, but changing this 
will have a binary compatibility impact, as the copy of gp to r14 is 
now part of the ABI, and will be present in inlined import stubs in 
existing .o files. I don't think gcc generates inlined import stubs at 
the moment, but I think Intel's compiler does.

Too bad. It leaves me wondering why we didn't design it this way in the 
first place.

-cary


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible race condition with deferred binding on IPF
  2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
                   ` (2 preceding siblings ...)
  2004-03-08 17:15 ` Cary Coutant
@ 2004-03-08 18:08 ` Zack Weinberg
  2004-03-09 20:49 ` Jim Wilson
  2004-03-09 20:51 ` Zack Weinberg
  5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-08 18:08 UTC (permalink / raw)
  To: linux-ia64

Cary Coutant <cary@cup.hp.com> writes:

>> Converting the ld8 to a ld8.acq is a simple matter of changing the
>> second line of this array to
>>
>>   0x00, 0x41, 0x3c, 0x70, 0x29, 0xc0,  /*               ld8.acq
>> r16=[r15],8 */
>
> Yes, this is the same bit pattern Steve Ellcey and I came up with.

Ok.  I'll see about testing this and submitting a proper patch.

> This code does not need to be patched. The two words loaded here point
> to the dynamic loader's BOR routine. The dynamic loader must provide
> the proper values in the linkage table before the program can run;
> these values will not change, so the ordering isn't important. Adding
> an ld.acq here would unnecessarily slow the code down.

Ok, thanks for the clarification.

> I don't see anything wrong with you're reasoning, but changing this
> will have a binary compatibility impact, as the copy of gp to r14 is
> now part of the ABI, and will be present in inlined import stubs in
> existing .o files. I don't think gcc generates inlined import stubs at
> the moment, but I think Intel's compiler does.
>
> Too bad. It leaves me wondering why we didn't design it this way in
> the first place.

Understood.  I can still squeeze PLT0 down to two bundles by moving
the r2=r14 move into PLT1a, but I suspect that it only fits because
I didn't put in all the necessary stop bits.  Also it relies on being
able to express a relocation to PLT_RESERVE+8, which may not be
possible.  And I'm not sure whether this actually executes any faster.

(The idea is, since the ordering doesn't matter, to fetch the branch
target address first, and then the move to b6 can fit into that bundle -
but only if I don't need a stop bit between the load and the move to b6.)

.PLT0:
        addl    r2 = @gprel(plt_reserve+8), r2 ;;
        ld8     r17 = [r2], 8
        mov     b6 = r17
        ld8     r1 = [r2], -16
        ld8     r16 = [r2]
        br      b6 ;;
.PLT1:
        addl    r15 = @pltoff(name1), r1 ;;
        ld8.acq r16 = [r15], 8
        mov     r14 = r1 ;;
        ld8     r1 = [r15]
        mov     b6 = r16
        br      b6 ;;
.PLT1a:
        mov     r2 = r14
        mov     r15 = @iplt(name1)
        br      .PLT0 ;;

zw

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible race condition with deferred binding on IPF
  2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
                   ` (3 preceding siblings ...)
  2004-03-08 18:08 ` Zack Weinberg
@ 2004-03-09 20:49 ` Jim Wilson
  2004-03-09 20:51 ` Zack Weinberg
  5 siblings, 0 replies; 7+ messages in thread
From: Jim Wilson @ 2004-03-09 20:49 UTC (permalink / raw)
  To: linux-ia64

On Mon, 2004-03-08 at 10:08, Zack Weinberg wrote:
>         addl    r2 = @gprel(plt_reserve+8), r2 ;;
>         ld8     r17 = [r2], 8
>         mov     b6 = r17
>         ld8     r1 = [r2], -16
>         ld8     r16 = [r2]
>         br      b6 ;;

You are missing a stop bit between the load of r17 and its use.

You have two post-increment addressing uses of r2.  A post-increment
address is both a read and a write of the register.  Thus you need stop
bits between the first and second loads to avoid a read-after write
dependency violation.  Likewise, between the second and third loads.

So your example needs 4 stop bits total (5 instruction groups).  The ABI
recommended sequence only requires 2 (3 instruction groups).
-- 
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible race condition with deferred binding on IPF
  2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
                   ` (4 preceding siblings ...)
  2004-03-09 20:49 ` Jim Wilson
@ 2004-03-09 20:51 ` Zack Weinberg
  5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-09 20:51 UTC (permalink / raw)
  To: linux-ia64

Jim Wilson <wilson@specifixinc.com> writes:

> On Mon, 2004-03-08 at 10:08, Zack Weinberg wrote:
>>         addl    r2 = @gprel(plt_reserve+8), r2 ;;
>>         ld8     r17 = [r2], 8
>>         mov     b6 = r17
>>         ld8     r1 = [r2], -16
>>         ld8     r16 = [r2]
>>         br      b6 ;;
>
> You are missing a stop bit between the load of r17 and its use.
>
> You have two post-increment addressing uses of r2.  A post-increment
> address is both a read and a write of the register.  Thus you need stop
> bits between the first and second loads to avoid a read-after write
> dependency violation.  Likewise, between the second and third loads.
>
> So your example needs 4 stop bits total (5 instruction groups).  The ABI
> recommended sequence only requires 2 (3 instruction groups).

Yeah, I was afraid of that.  I haven't had much luck making sense of
the stop-bit rules.

zw

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2004-03-09 20:51 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2004-03-05  0:05 Possible race condition with deferred binding on IPF Cary Coutant
2004-03-07 21:53 ` Zack Weinberg
2004-03-07 23:09 ` Zack Weinberg
2004-03-08 17:15 ` Cary Coutant
2004-03-08 18:08 ` Zack Weinberg
2004-03-09 20:49 ` Jim Wilson
2004-03-09 20:51 ` Zack Weinberg

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox