* Re: Possible race condition with deferred binding on IPF
2004-03-05 0:05 Possible race condition with deferred binding on IPF Cary Coutant
@ 2004-03-07 21:53 ` Zack Weinberg
2004-03-07 23:09 ` Zack Weinberg
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-07 21:53 UTC (permalink / raw)
To: linux-ia64
Cary Coutant <cary@cup.hp.com> writes:
> We (HP) have discovered a missing requirement in the psABI document
> with respect to import stubs and inlined import stubs.
...
> 2. The example code in Figure 5-4 needs the ".acq" completer on the
> first load instruction, as follows:
>
> ...
> .PLT1: (entry for symbol name1)
> addl r15 = @pltoff(name1), gp ;;
> ld8.acq r16 = [r15], 8
> mov r14 = gp ;;
> ld8 gp = [r15]
> mov b6 = r16
> br b6
I believe that this corresponds to the following code in bfd/elfxx-ia64.c:
static const bfd_byte plt_full_entry[PLT_FULL_ENTRY_SIZE] {
0x0b, 0x78, 0x00, 0x02, 0x00, 0x24, /* [MMI] addl r15=0,r1;; */
0x00, 0x41, 0x3c, 0x30, 0x28, 0xc0, /* ld8 r16=[r15],8 */
0x01, 0x08, 0x00, 0x84, /* mov r14=r1;; */
0x11, 0x08, 0x00, 0x1e, 0x18, 0x10, /* [MIB] ld8 r1=[r15] */
0x60, 0x80, 0x04, 0x80, 0x03, 0x00, /* mov b6=r16 */
0x60, 0x00, 0x80, 0x00 /* br.few b6;; */
};
Converting the ld8 to a ld8.acq is a simple matter of changing the
second line of this array to
0x00, 0x41, 0x3c, 0x70, 0x29, 0xc0, /* ld8.acq r16=[r15],8 */
However, I have two related concerns before I try to submit a patch:
1) If I assemble the sample code above, using GAS 2.14, the first byte
of the first bundle is 0a, not 0b. Hex-editing it to 0b doesn't
seem to make any difference to the disassembly, but I would like to
know if there is a difference anyway.
2) There is another code sequence synthesized by the linker that might
need the same treatment:
static const bfd_byte plt_header[PLT_HEADER_SIZE] {
0x0b, 0x10, 0x00, 0x1c, 0x00, 0x21, /* [MMI] mov r2=r14;; */
0xe0, 0x00, 0x08, 0x00, 0x48, 0x00, /* addl r14=0,r2 */
0x00, 0x00, 0x04, 0x00, /* nop.i 0x0;; */
0x0b, 0x80, 0x20, 0x1c, 0x18, 0x14, /* [MMI] ld8 r16=[r14],8;; */
0x10, 0x41, 0x38, 0x30, 0x28, 0x00, /* ld8 r17=[r14],8 */
0x00, 0x00, 0x04, 0x00, /* nop.i 0x0;; */
0x11, 0x08, 0x00, 0x1c, 0x18, 0x10, /* [MIB] ld8 r1=[r14] */
0x60, 0x88, 0x04, 0x80, 0x03, 0x00, /* mov b6=r17 */
0x60, 0x00, 0x80, 0x00 /* br.few b6;; */
};
I don't understand what this code is doing so I can't be sure which
ld8 needs an .acq. (In fact, I don't understand the point of the
first bundle at all.)
zw
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Possible race condition with deferred binding on IPF
2004-03-05 0:05 Possible race condition with deferred binding on IPF Cary Coutant
2004-03-07 21:53 ` Zack Weinberg
@ 2004-03-07 23:09 ` Zack Weinberg
2004-03-08 17:15 ` Cary Coutant
` (3 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-07 23:09 UTC (permalink / raw)
To: linux-ia64
Zack Weinberg <zack@codesourcery.com> writes:
> I have two related concerns before I try to submit a patch:
>
> 1) If I assemble the sample code above, using GAS 2.14, the first byte
> of the first bundle is 0a, not 0b. Hex-editing it to 0b doesn't
> seem to make any difference to the disassembly, but I would like to
> know if there is a difference anyway.
... maybe I should read the disassembly dumps more carefully. This
turns out to be because I dropped the ;; on the third instruction of
the first bundle.
> 2) There is another code sequence synthesized by the linker that might
> need the same treatment:
>
> static const bfd_byte plt_header[PLT_HEADER_SIZE] > {
> 0x0b, 0x10, 0x00, 0x1c, 0x00, 0x21, /* [MMI] mov r2=r14;; */
> 0xe0, 0x00, 0x08, 0x00, 0x48, 0x00, /* addl r14=0,r2 */
> 0x00, 0x00, 0x04, 0x00, /* nop.i 0x0;; */
> 0x0b, 0x80, 0x20, 0x1c, 0x18, 0x14, /* [MMI] ld8 r16=[r14],8;; */
> 0x10, 0x41, 0x38, 0x30, 0x28, 0x00, /* ld8 r17=[r14],8 */
> 0x00, 0x00, 0x04, 0x00, /* nop.i 0x0;; */
> 0x11, 0x08, 0x00, 0x1c, 0x18, 0x10, /* [MIB] ld8 r1=[r14] */
> 0x60, 0x88, 0x04, 0x80, 0x03, 0x00, /* mov b6=r17 */
> 0x60, 0x00, 0x80, 0x00 /* br.few b6;; */
> };
I looked this up in the ABI document, and now I understand what it is
doing. There is in fact a function descriptor fetch in here, from the
PLT_RESERVE area; it's the second and third ld8 instructions. It
seems unlikely that we have to worry about this getting changed on the
fly at runtime, but a belt-and-suspenders approach would put an .acq
suffix on the second ld8.
I have a related question. It seems to me that the canonical form of
the PLT entries has not been optimized quite as much as it could be.
In particular, the use of r14 as the pointer to the function
descriptor seems suboptimal. As I read the document, this register is
dead after it's used to load the global pointer. If r2 were used
instead, I think PLT0 could be tightened up a bit, at the cost of
pushing the PLT_RESERVE pointer load into the secondary PLT entries
(where there is a free bundle slot - the cost is in having to update
all of them at load time, but then, that has to happen anyway to set
up the PLT index). Thus:
.PLT0:
ld8 r16 = [r2], 8
ld8 r17 = [r2], 8 ;; # possibly ld8.acq
ld8 r1 = [r2]
mov b6 = r17
br b6 ;;
.PLT1:
addl r15 = @pltoff(name1), r1 ;;
ld8.acq r16 = [r15], 8
mov r2 = r1 ;;
ld8 r1 = [r15]
mov b6 = r16
br.few b6 ;;
.PLT1a:
addl r2 = @gprel(plt_reserve), r2
mov r15 = @iplt(name1)
br .PLT0
The net effect is to shrink .PLT0 by one bundle and execute one fewer
non-NOP instruction. Thoughts?
zw
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Possible race condition with deferred binding on IPF
2004-03-05 0:05 Possible race condition with deferred binding on IPF Cary Coutant
2004-03-07 21:53 ` Zack Weinberg
2004-03-07 23:09 ` Zack Weinberg
@ 2004-03-08 17:15 ` Cary Coutant
2004-03-08 18:08 ` Zack Weinberg
` (2 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: Cary Coutant @ 2004-03-08 17:15 UTC (permalink / raw)
To: linux-ia64
> Converting the ld8 to a ld8.acq is a simple matter of changing the
> second line of this array to
>
> 0x00, 0x41, 0x3c, 0x70, 0x29, 0xc0, /* ld8.acq
> r16=[r15],8 */
Yes, this is the same bit pattern Steve Ellcey and I came up with.
> 1) If I assemble the sample code above, using GAS 2.14, the first byte
> of the first bundle is 0a, not 0b. Hex-editing it to 0b doesn't
> seem to make any difference to the disassembly, but I would like to
> know if there is a difference anyway.
As you discovered, that's just a missing stop bit.
> 2) There is another code sequence synthesized by the linker that might
> need the same treatment:
>
> static const bfd_byte plt_header[PLT_HEADER_SIZE] > {
> 0x0b, 0x10, 0x00, 0x1c, 0x00, 0x21, /* [MMI] mov r2=r14;;
> */
> 0xe0, 0x00, 0x08, 0x00, 0x48, 0x00, /* addl r14=0,r2
> */
> 0x00, 0x00, 0x04, 0x00, /* nop.i 0x0;;
> */
> 0x0b, 0x80, 0x20, 0x1c, 0x18, 0x14, /* [MMI] ld8
> r16=[r14],8;; */
> 0x10, 0x41, 0x38, 0x30, 0x28, 0x00, /* ld8
> r17=[r14],8 */
> 0x00, 0x00, 0x04, 0x00, /* nop.i 0x0;;
> */
> 0x11, 0x08, 0x00, 0x1c, 0x18, 0x10, /* [MIB] ld8 r1=[r14]
> */
> 0x60, 0x88, 0x04, 0x80, 0x03, 0x00, /* mov b6=r17
> */
> 0x60, 0x00, 0x80, 0x00 /* br.few b6;;
> */
> };
This code does not need to be patched. The two words loaded here point
to the dynamic loader's BOR routine. The dynamic loader must provide
the proper values in the linkage table before the program can run;
these values will not change, so the ordering isn't important. Adding
an ld.acq here would unnecessarily slow the code down.
> I have a related question. It seems to me that the canonical form of
> the PLT entries has not been optimized quite as much as it could be.
> In particular, the use of r14 as the pointer to the function
> descriptor seems suboptimal. As I read the document, this register is
> dead after it's used to load the global pointer. If r2 were used
> instead, I think PLT0 could be tightened up a bit, at the cost of
> pushing the PLT_RESERVE pointer load into the secondary PLT entries
> (where there is a free bundle slot - the cost is in having to update
> all of them at load time, but then, that has to happen anyway to set
> up the PLT index).
I don't see anything wrong with you're reasoning, but changing this
will have a binary compatibility impact, as the copy of gp to r14 is
now part of the ABI, and will be present in inlined import stubs in
existing .o files. I don't think gcc generates inlined import stubs at
the moment, but I think Intel's compiler does.
Too bad. It leaves me wondering why we didn't design it this way in the
first place.
-cary
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Possible race condition with deferred binding on IPF
2004-03-05 0:05 Possible race condition with deferred binding on IPF Cary Coutant
` (2 preceding siblings ...)
2004-03-08 17:15 ` Cary Coutant
@ 2004-03-08 18:08 ` Zack Weinberg
2004-03-09 20:49 ` Jim Wilson
2004-03-09 20:51 ` Zack Weinberg
5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-08 18:08 UTC (permalink / raw)
To: linux-ia64
Cary Coutant <cary@cup.hp.com> writes:
>> Converting the ld8 to a ld8.acq is a simple matter of changing the
>> second line of this array to
>>
>> 0x00, 0x41, 0x3c, 0x70, 0x29, 0xc0, /* ld8.acq
>> r16=[r15],8 */
>
> Yes, this is the same bit pattern Steve Ellcey and I came up with.
Ok. I'll see about testing this and submitting a proper patch.
> This code does not need to be patched. The two words loaded here point
> to the dynamic loader's BOR routine. The dynamic loader must provide
> the proper values in the linkage table before the program can run;
> these values will not change, so the ordering isn't important. Adding
> an ld.acq here would unnecessarily slow the code down.
Ok, thanks for the clarification.
> I don't see anything wrong with you're reasoning, but changing this
> will have a binary compatibility impact, as the copy of gp to r14 is
> now part of the ABI, and will be present in inlined import stubs in
> existing .o files. I don't think gcc generates inlined import stubs at
> the moment, but I think Intel's compiler does.
>
> Too bad. It leaves me wondering why we didn't design it this way in
> the first place.
Understood. I can still squeeze PLT0 down to two bundles by moving
the r2=r14 move into PLT1a, but I suspect that it only fits because
I didn't put in all the necessary stop bits. Also it relies on being
able to express a relocation to PLT_RESERVE+8, which may not be
possible. And I'm not sure whether this actually executes any faster.
(The idea is, since the ordering doesn't matter, to fetch the branch
target address first, and then the move to b6 can fit into that bundle -
but only if I don't need a stop bit between the load and the move to b6.)
.PLT0:
addl r2 = @gprel(plt_reserve+8), r2 ;;
ld8 r17 = [r2], 8
mov b6 = r17
ld8 r1 = [r2], -16
ld8 r16 = [r2]
br b6 ;;
.PLT1:
addl r15 = @pltoff(name1), r1 ;;
ld8.acq r16 = [r15], 8
mov r14 = r1 ;;
ld8 r1 = [r15]
mov b6 = r16
br b6 ;;
.PLT1a:
mov r2 = r14
mov r15 = @iplt(name1)
br .PLT0 ;;
zw
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Possible race condition with deferred binding on IPF
2004-03-05 0:05 Possible race condition with deferred binding on IPF Cary Coutant
` (3 preceding siblings ...)
2004-03-08 18:08 ` Zack Weinberg
@ 2004-03-09 20:49 ` Jim Wilson
2004-03-09 20:51 ` Zack Weinberg
5 siblings, 0 replies; 7+ messages in thread
From: Jim Wilson @ 2004-03-09 20:49 UTC (permalink / raw)
To: linux-ia64
On Mon, 2004-03-08 at 10:08, Zack Weinberg wrote:
> addl r2 = @gprel(plt_reserve+8), r2 ;;
> ld8 r17 = [r2], 8
> mov b6 = r17
> ld8 r1 = [r2], -16
> ld8 r16 = [r2]
> br b6 ;;
You are missing a stop bit between the load of r17 and its use.
You have two post-increment addressing uses of r2. A post-increment
address is both a read and a write of the register. Thus you need stop
bits between the first and second loads to avoid a read-after write
dependency violation. Likewise, between the second and third loads.
So your example needs 4 stop bits total (5 instruction groups). The ABI
recommended sequence only requires 2 (3 instruction groups).
--
Jim Wilson, GNU Tools Support, http://www.SpecifixInc.com
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: Possible race condition with deferred binding on IPF
2004-03-05 0:05 Possible race condition with deferred binding on IPF Cary Coutant
` (4 preceding siblings ...)
2004-03-09 20:49 ` Jim Wilson
@ 2004-03-09 20:51 ` Zack Weinberg
5 siblings, 0 replies; 7+ messages in thread
From: Zack Weinberg @ 2004-03-09 20:51 UTC (permalink / raw)
To: linux-ia64
Jim Wilson <wilson@specifixinc.com> writes:
> On Mon, 2004-03-08 at 10:08, Zack Weinberg wrote:
>> addl r2 = @gprel(plt_reserve+8), r2 ;;
>> ld8 r17 = [r2], 8
>> mov b6 = r17
>> ld8 r1 = [r2], -16
>> ld8 r16 = [r2]
>> br b6 ;;
>
> You are missing a stop bit between the load of r17 and its use.
>
> You have two post-increment addressing uses of r2. A post-increment
> address is both a read and a write of the register. Thus you need stop
> bits between the first and second loads to avoid a read-after write
> dependency violation. Likewise, between the second and third loads.
>
> So your example needs 4 stop bits total (5 instruction groups). The ABI
> recommended sequence only requires 2 (3 instruction groups).
Yeah, I was afraid of that. I haven't had much luck making sense of
the stop-bit rules.
zw
^ permalink raw reply [flat|nested] 7+ messages in thread