* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
@ 2005-01-22 1:02 ` Keith Owens
2005-01-22 1:02 ` Luck, Tony
` (13 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Keith Owens @ 2005-01-22 1:02 UTC (permalink / raw)
To: linux-ia64
On Fri, 21 Jan 2005 15:22:18 -0800,
"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>__gp is positioned so far out that it is almost at the end of all data
>sections. On 2.6.11-rc1, 80% of kernel data symbols are outside the
>22-bit immediate offset from __gp. This means accessing these symbols
>is unnecessarily expensive: they have to go through the global offset
>table (a memory load to fetch the symbol address). Among these
>out-of-reach symbols, some are very frequently used, like jiffies.
>
>Can we position the __gp somewhat more optimally, to cover more of these
>symbols? Something like the following patch would make all of them fall
>into the 22-bit immediate offset relative to gp.
The best place for __gp is in the exact middle of the range
.data.init_task through the end of .sbss. Unfortunately a large .data
section could result in .got and .sbss being out of range of a median
__gp. Is it possible in the linker script to first try (.sbss.end -
.data.init_task) / 2, then test the result for reachability to .sbss
and adjust __gp if necessary?
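The arithmetic of that placement rule can be sketched in plain C. This is only an illustration under stated assumptions: the symbol names are stand-ins, a real linker script would compute this with link-time expressions, and only the .sbss-side reachability check mentioned above is modeled.

```c
#include <stdint.h>

/* Illustrative sketch only: names are stand-ins for the linker-script
 * symbols discussed above, not real kernel symbols. */
#define GP_REACH 0x200000ULL   /* 2MB: half of the 22-bit (+-2MB) window */

uint64_t place_gp(uint64_t data_init_task, uint64_t sbss_end)
{
    /* First try the median of .data.init_task .. end of .sbss ... */
    uint64_t gp = data_init_task + (sbss_end - data_init_task) / 2;

    /* ... then test reachability to .sbss and adjust if necessary:
     * if a large .data left the end of .sbss more than 2MB above the
     * median, pull __gp up just enough to keep .sbss in range. */
    if (sbss_end - gp > GP_REACH)
        gp = sbss_end - GP_REACH;
    return gp;
}
```

With a small image the median wins; with a bloated .data the clamp moves __gp up toward .sbss.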
* RE: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
2005-01-22 1:02 ` Keith Owens
@ 2005-01-22 1:02 ` Luck, Tony
2005-01-22 2:20 ` Chen, Kenneth W
` (12 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2005-01-22 1:02 UTC (permalink / raw)
To: linux-ia64
>- __gp = ADDR(.got) + 0x200000;
>+ __gp = _end - 0x200000;
Did we used to link the ".got" section earlier? It's after "data" now,
but the expression used there might have made sense if ".got" was before
the "data".
_end - 0x200000 may work for you now, but won't this be very configuration
dependent? If I configure lots of drivers with "=y" option, and they
declare lots of "bss" objects, then __gp may still be too high to reach the
interesting data objects.
Would an expression anchoring on the ".sdata" section be better?
-Tony
* RE: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
2005-01-22 1:02 ` Keith Owens
2005-01-22 1:02 ` Luck, Tony
@ 2005-01-22 2:20 ` Chen, Kenneth W
2005-01-22 3:09 ` Keith Owens
` (11 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Chen, Kenneth W @ 2005-01-22 2:20 UTC (permalink / raw)
To: linux-ia64
Luck, Tony wrote on Friday, January 21, 2005 5:03 PM
> >- __gp = ADDR(.got) + 0x200000;
> >+ __gp = _end - 0x200000;
>
> Did we used to link the ".got" section earlier? It's after "data" now,
> but the expression used there might have made sense if ".got" was before
> the "data".
>
> _end - 0x200000 may work for you now, but won't this be very configuration
> dependent? If I configure lots of drivers with "=y" option, and they
> declare lots of "bss" objects, then __gp may still be too high to reach the
> interesting data objects.
>
> Would an expression anchoring on the ".sdata" section be better?
I wish I could do that. But I'm frustrated that __gp is jailed in between
the GOT section and the linker symbol _end. There are references to _end
from a couple of functions, like reserve_memory() and mem_init(), where
the compiler insists on using a gp-relative sequence to calculate the
value of _end.
- Ken
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (2 preceding siblings ...)
2005-01-22 2:20 ` Chen, Kenneth W
@ 2005-01-22 3:09 ` Keith Owens
2005-01-24 7:51 ` Christian Hildner
` (10 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Keith Owens @ 2005-01-22 3:09 UTC (permalink / raw)
To: linux-ia64
On Fri, 21 Jan 2005 18:20:51 -0800,
"Chen, Kenneth W" <kenneth.w.chen@intel.com> wrote:
>Luck, Tony wrote on Friday, January 21, 2005 5:03 PM
>> >- __gp = ADDR(.got) + 0x200000;
>> >+ __gp = _end - 0x200000;
>>
>> Did we used to link the ".got" section earlier? It's after "data" now,
>> but the expression used there might have made sense if ".got" was before
>> the "data".
>>
>> _end - 0x200000 may work for you now, but won't this be very configuration
>> dependent? If I configure lots of drivers with "=y" option, and they
>> declare lots of "bss" objects, then __gp may still be too high to reach the
>> interesting data objects.
>>
>> Would an expression anchoring on the ".sdata" section be better?
>
>I wish I could do that. But I'm frustrated that __gp is jailed in between
>the GOT section and the linker symbol _end. There are references to _end
>from a couple of functions, like reserve_memory() and mem_init(), where
>the compiler insists on using a gp-relative sequence to calculate the
>value of _end.
Compiled and linked but not booted. The references to _end and _stext
are now DIR64LSB in .sdata. That makes for a couple of extra
instructions to get the value of _end and _stext in mem_init(), but the
code is only executed once, so who cares?
Index: linux/arch/ia64/mm/init.c
===================================================================
--- linux.orig/arch/ia64/mm/init.c	2005-01-20 11:05:56.000000000 +1100
+++ linux/arch/ia64/mm/init.c 2005-01-22 14:02:52.000000000 +1100
@@ -535,6 +535,8 @@ nolwsys_setup (char *s)
__setup("nolwsys", nolwsys_setup);
+static char *p_end = _end, *p_stext = _stext;
+
void
mem_init (void)
{
@@ -563,7 +565,7 @@ mem_init (void)
kclist_add(&kcore_mem, __va(0), max_low_pfn * PAGE_SIZE);
kclist_add(&kcore_vmem, (void *)VMALLOC_START, VMALLOC_END-VMALLOC_START);
- kclist_add(&kcore_kernel, _stext, _end - _stext);
+ kclist_add(&kcore_kernel, _stext, p_end - p_stext);
for_each_pgdat(pgdat)
totalram_pages += free_all_bootmem_node(pgdat);
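The p_end/p_stext trick above can be restated as a self-contained sketch (everything below is a stand-in, not kernel code): routing the bounds through ordinary data-section pointers means the compiler only needs a gp-relative load of the pointer, never the far-away symbol itself.

```c
/* Stand-in sketch of the indirection in the patch above: "image" plays
 * the role of the kernel image, and p_stext/p_end are ordinary
 * data-section pointers.  Code that uses them never asks the compiler
 * to materialize an out-of-gp-range address directly. */
static char image[64];
static char *p_stext = image, *p_end = image + sizeof(image);

long kernel_image_size(void)
{
    return p_end - p_stext;   /* analogue of _end - _stext in mem_init() */
}
```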
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (3 preceding siblings ...)
2005-01-22 3:09 ` Keith Owens
@ 2005-01-24 7:51 ` Christian Hildner
2005-01-24 13:22 ` Keith Owens
` (9 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Christian Hildner @ 2005-01-24 7:51 UTC (permalink / raw)
To: linux-ia64
Chen, Kenneth W schrieb:
>__gp is positioned so far out that it is almost at the end of all data
>sections. On 2.6.11-rc1, 80% of kernel data symbols are outside the
>22-bit immediate offset from __gp. This means accessing these symbols
>is unnecessarily expensive: they have to go through the global offset
>table (a memory load to fetch the symbol address). Among these
>out-of-reach symbols, some are very frequently used, like jiffies.
>
Wouldn't a solution using movl for the offset, then adding it to gp, be
the cheaper solution in terms of cycles? I am wondering about the
additional and expensive load that is needed, with the item possibly
(likely or not) being absent from the cache. But there are the software
conventions, and "nobody will ever need more than" 4MB of short data.
>Can we position the __gp somewhat more optimally, to cover more of these
>symbols? Something like the following patch would make all of them fall
>into the 22-bit immediate offset relative to gp.
>
Do you have benchmarks? Or at least a comparison of the resulting code
size? The code size should shrink when more items can be addressed
directly. Furthermore, the code size should be a good indicator of the
performance gain you could achieve.
Christian
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (4 preceding siblings ...)
2005-01-24 7:51 ` Christian Hildner
@ 2005-01-24 13:22 ` Keith Owens
2005-01-24 13:29 ` Matthew Wilcox
2005-01-24 13:44 ` Christian Hildner
` (8 subsequent siblings)
14 siblings, 1 reply; 17+ messages in thread
From: Keith Owens @ 2005-01-24 13:22 UTC (permalink / raw)
To: linux-ia64
On Mon, 24 Jan 2005 08:51:17 +0100,
Christian Hildner <christian.hildner@hob.de> wrote:
>Chen, Kenneth W schrieb:
>>Can we position the __gp somewhat more optimally, to cover more of these
>>symbols? Something like the following patch would make all of them fall
>>into the 22-bit immediate offset relative to gp.
>>
>Do you have benchmarks? Or at least a comparison of the resulting code
>size? The code size should shrink when more items can be addressed
>directly. Furthermore, the code size should be a good indicator of the
>performance gain you could achieve.
The IA64 ABI supports link-time rewriting of instructions if the linker
can determine that the field being loaded can be accessed via __gp
instead of via the linkage offset table. One of the restrictions of
link time rewriting is that the code offsets cannot change, which means
that the code size cannot change either. This code snippet will result
in two different run time sequences, depending on whether jiffies can
be referenced via __gp or not.
addl r20=0,r1;; // LTOFF22X jiffies
ld8 r16=[r20];; // LDXMOV jiffies
ld8.acq r23=[r16] // value of jiffies
When jiffies is within 22 bit range of __gp, the linker writes the
sequence as
addl r20=offset_of(jiffies,__gp),r1;;
mov r16=r20;;
ld8.acq r23=[r16] // value of jiffies
When jiffies is outside 22 bit range of __gp, the linker writes the
sequence as
addl r20=offset_of(pointer_to_jiffies,__gp),r1;;
ld8 r16=[r20];; // load pointer_to_jiffies from linkage offset table
ld8.acq r23=[r16] // value of jiffies
Exactly the same code size, but the second form requires an extra
memory reference which is always going to be slower.
gcc emits LTOFF22X/LDXMOV if it might be able to use __gp addressing
and save the memory access, but gcc does not know at compile time if
jiffies will be in range of __gp or not. So gcc has to use the worst
case three instruction code sequence and let the linker remove the
slow memory reference at link time.
If jiffies were defined in section .sdata then gcc would know at compile
time that jiffies was in range of __gp, so gcc would use this shorter
code sequence. Enough changes like that would shrink the code size.
addl r16=offset_of(jiffies,__gp),r1;;
ld8.acq r23=[r16] // value of jiffies
Unfortunately marking jiffies and similar small but high usage
variables as section .sbss or .sdata requires changes to common code.
It might be worth doing, but the change would have to be structured so
it worked on all architectures.
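For concreteness, the kind of annotation being discussed might look like this (hypothetical variable; gcc's section attribute is real, but whether common code should carry such markings is exactly the open question):

```c
/* Hypothetical example: a hot variable forced into .sdata with gcc's
 * section attribute.  On ia64 the compiler can then assume the symbol
 * is within 22 bits of __gp and emit the short two-instruction
 * sequence; on other ELF targets it simply lands in a section named
 * ".sdata". */
static unsigned long hot_counter __attribute__((section(".sdata"))) = 0;

unsigned long bump_counter(unsigned long n)
{
    hot_counter += n;   /* ordinary C access; only the codegen changes */
    return hot_counter;
}
```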
* Re: optimize __gp location
2005-01-24 13:22 ` Keith Owens
@ 2005-01-24 13:29 ` Matthew Wilcox
0 siblings, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2005-01-24 13:29 UTC (permalink / raw)
To: Keith Owens; +Cc: Christian Hildner, Chen, Kenneth W, linux-ia64, linux-kernel
On Tue, Jan 25, 2005 at 12:22:16AM +1100, Keith Owens wrote:
> Unfortunately marking jiffies and similar small but high usage
> variables as section .sbss or .sdata requires changes to common code.
> It might be worth doing, but the change would have to be structured so
> it worked on all architectures.
I think there are other architectures which would prefer
small-and-frequently-used global variables to be placed somewhere
special, so there may well be widespread enthusiasm for such an
annotation.
CC'ing linux-kernel to see if anyone bites.
--
"Next the statesmen will invent cheap lies, putting the blame upon
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince
himself that the war is just, and will thank God for the better sleep
he enjoys after this process of grotesque self-deception." -- Mark Twain
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (5 preceding siblings ...)
2005-01-24 13:22 ` Keith Owens
@ 2005-01-24 13:44 ` Christian Hildner
2005-01-24 15:32 ` Keith Owens
` (7 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Christian Hildner @ 2005-01-24 13:44 UTC (permalink / raw)
To: linux-ia64
Keith Owens schrieb:
>On Mon, 24 Jan 2005 08:51:17 +0100,
>Christian Hildner <christian.hildner@hob.de> wrote:
>
>
>>Chen, Kenneth W schrieb:
>>
>>
>>>Can we position the __gp somewhat more optimally, to cover more of these
>>>symbols? Something like the following patch would make all of them fall
>>>into the 22-bit immediate offset relative to gp.
>>>
>>>
>>>
>>Do you have benchmarks? Or at least a comparison of the resulting code
>>size? The code size should shrink when more items can be addressed
>>directly. Furthermore, the code size should be a good indicator of the
>>performance gain you could achieve.
>>
>>
>
>The IA64 ABI supports link time rewriting of instructions if the linker
>can determine that the field being loaded can be accessed via __gp
>instead of via the linkage offset table. One of the restrictions of
>link time rewriting is that the code offsets cannot change, which means
>that the code size cannot change either. This code snippet will result
>in two different run time sequences, depending on whether jiffies can
>be referenced via __gp or not.
>
> addl r20=0,r1;; // LTOFF22X jiffies
> ld8 r16=[r20];; // LDXMOV jiffies
> ld8.acq r23=[r16] // value of jiffies
>
>When jiffies is within 22 bit range of __gp, the linker writes the
>sequence as
>
> addl r20=offset_of(jiffies,__gp),r1;;
> mov r16=r20;;
> ld8.acq r23=[r16] // value of jiffies
>
Is there a restriction that prevents rewriting to
addl r16=offset_of(jiffies,__gp),r1;;
ld8.acq r23=[r16] // value of jiffies
nop.i 0
because that would save at least one cycle and would make bundling
easier (depending on the surrounding instructions, of course)?
Christian
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (6 preceding siblings ...)
2005-01-24 13:44 ` Christian Hildner
@ 2005-01-24 15:32 ` Keith Owens
2005-01-24 17:51 ` David Mosberger
` (6 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Keith Owens @ 2005-01-24 15:32 UTC (permalink / raw)
To: linux-ia64
On Mon, 24 Jan 2005 14:44:22 +0100,
Christian Hildner <christian.hildner@hob.de> wrote:
>Keith Owens schrieb:
>>When jiffies is within 22 bit range of __gp, the linker writes the
>>sequence as
>>
>> addl r20=offset_of(jiffies,__gp),r1;;
>> mov r16=r20;;
>> ld8.acq r23=[r16] // value of jiffies
>>
>Is there a restriction to not rewrite to
>
> addl r16=offset_of(jiffies,__gp),r1;;
> ld8.acq r23=[r16] // value of jiffies
> nop.i 0
>
>because that would save at least one cycle and would make bundling easier (dependend of additional instructions, of course).
The code snippet was a simplification of what gcc actually does. If
you look at some object code, you will find that the 3 instructions are
already spread over multiple bundles. Moving the final ld8 upwards
cannot save any cycles, you still have to execute the same number of
bundles. A real example from kernel/sched.o
4830: 09 50 20 42 00 21 [MMI] adds r10=8,r33
4832: LTOFF22X jiffies
4836: 20 81 84 00 42 c0 adds r18=16,r33
483c: 01 08 00 90 addl r14=0,r1;;
4840: 08 00 08 1e d8 19 [MMI] stf.spill [r15]=f2
4841: LDXMOV jiffies
4842: LTOFF22X __per_cpu_offset
4846: b0 00 38 30 20 40 ld8 r11=[r14]
484c: 03 08 00 90 addl r26=0,r1
4850: 08 a0 00 02 00 24 [MMI] addl r20=0,r1
4850: LTOFF22X .data.percpu+0x440
4856: 90 00 01 20 40 e0 shladd r9=r32,1,r0
485c: 02 00 59 00 sxt4 r23=r32
4860: 08 40 00 14 18 10 [MMI] ld8 r8=[r10]
4866: 10 01 48 30 20 e0 ld8 r17=[r18]
486c: 04 00 c4 00 mov r39=b0
4870: 05 00 00 00 01 40 [MLX] nop.m 0x0
4876: 10 00 00 00 00 60 movl r27=0x10624dd3;;
487c: 33 55 6c 62
4880: 10 00 00 00 01 00 [MIB] nop.m 0x0
4886: f0 40 e0 f0 29 00 shl r15=r8,7
488c: 00 00 00 20 nop.b 0x0
4890: 09 c0 00 34 18 10 [MMI] ld8 r24=[r26]
4890: LDXMOV __per_cpu_offset
4896: 30 00 2c 70 21 40 ld8.acq r3=[r11]
The LDXMOV relocation is designed to make it simple to convert the
instruction from ld8 r11=[r14] to mov r11=r14, it is easy to do in
place. Moving an entire slot around is a lot messier, for no
performance gain.
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (7 preceding siblings ...)
2005-01-24 15:32 ` Keith Owens
@ 2005-01-24 17:51 ` David Mosberger
2005-01-24 17:53 ` David Mosberger
` (5 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: David Mosberger @ 2005-01-24 17:51 UTC (permalink / raw)
To: linux-ia64
>>>>> On Fri, 21 Jan 2005 15:22:18 -0800, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:
Ken> Can we position the __gp somewhat more optimally, to cover more of these
Ken> symbols? Something like the following patch would make all of them fall
Ken> into the 22-bit immediate offset relative to gp.
The position was chosen to maximize the amount of data that can be
addressed via the GOT. As you observe, that doesn't maximize the
amount of data that can be addressed in a GP-relative fashion.
It might be best to just remove the __gp definition. In that case,
the linker will automatically choose a value and it ought to be able
to choose an optimal value.
--david
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (8 preceding siblings ...)
2005-01-24 17:51 ` David Mosberger
@ 2005-01-24 17:53 ` David Mosberger
2005-01-25 7:30 ` Christian Hildner
` (4 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: David Mosberger @ 2005-01-24 17:53 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 24 Jan 2005 08:51:17 +0100, Christian Hildner <christian.hildner@hob.de> said:
Christian> Do you have benchmarks? Or at least a comparison of the
Christian> resulting code size? The code size should shrink when
Christian> more items can be addressed directly. Furthermore, the
Christian> code size should be a good indicator of the performance
Christian> gain you could achieve.
I'm curious about this too. Ken, do you have a real-world benchmark
where improved GP-relative addressability actually made a significant
difference?
--david
* Re: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (9 preceding siblings ...)
2005-01-24 17:53 ` David Mosberger
@ 2005-01-25 7:30 ` Christian Hildner
2005-01-25 19:44 ` Chen, Kenneth W
` (3 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Christian Hildner @ 2005-01-25 7:30 UTC (permalink / raw)
To: linux-ia64
Keith Owens schrieb:
>On Mon, 24 Jan 2005 14:44:22 +0100,
>Christian Hildner <christian.hildner@hob.de> wrote:
>
>
>>Keith Owens schrieb:
>>
>>
>>>When jiffies is within 22 bit range of __gp, the linker writes the
>>>sequence as
>>>
>>> addl r20=offset_of(jiffies,__gp),r1;;
>>> mov r16=r20;;
>>> ld8.acq r23=[r16] // value of jiffies
>>>
>>>
>>>
>>Is there a restriction that prevents rewriting to
>>
>> addl r16=offset_of(jiffies,__gp),r1;;
>> ld8.acq r23=[r16] // value of jiffies
>> nop.i 0
>>
>>because that would save at least one cycle and would make bundling
>>easier (depending on the surrounding instructions, of course)?
>>
>>
>
>The code snippet was a simplification of what gcc actually does. If
>you look at some object code, you will find that the 3 instructions are
>already spread over multiple bundles. Moving the final ld8 upwards
>cannot save any cycles, you still have to execute the same number of
>bundles.
>
But it is one instruction group fewer, and that amounts to at least one
cycle (here, exactly one).
>A real example from kernel/sched.o
>
> 4830: 09 50 20 42 00 21 [MMI] adds r10=8,r33
> 4832: LTOFF22X jiffies
> 4836: 20 81 84 00 42 c0 adds r18=16,r33
> 483c: 01 08 00 90 addl r14=0,r1;;
> 4840: 08 00 08 1e d8 19 [MMI] stf.spill [r15]=f2
> 4841: LDXMOV jiffies
> 4842: LTOFF22X __per_cpu_offset
> 4846: b0 00 38 30 20 40 ld8 r11=[r14]
> 484c: 03 08 00 90 addl r26=0,r1
> 4850: 08 a0 00 02 00 24 [MMI] addl r20=0,r1
> 4850: LTOFF22X .data.percpu+0x440
> 4856: 90 00 01 20 40 e0 shladd r9=r32,1,r0
> 485c: 02 00 59 00 sxt4 r23=r32
> 4860: 08 40 00 14 18 10 [MMI] ld8 r8=[r10]
> 4866: 10 01 48 30 20 e0 ld8 r17=[r18]
> 486c: 04 00 c4 00 mov r39=b0
> 4870: 05 00 00 00 01 40 [MLX] nop.m 0x0
> 4876: 10 00 00 00 00 60 movl r27=0x10624dd3;;
> 487c: 33 55 6c 62
> 4880: 10 00 00 00 01 00 [MIB] nop.m 0x0
> 4886: f0 40 e0 f0 29 00 shl r15=r8,7
> 488c: 00 00 00 20 nop.b 0x0
> 4890: 09 c0 00 34 18 10 [MMI] ld8 r24=[r26]
> 4890: LDXMOV __per_cpu_offset
> 4896: 30 00 2c 70 21 40 ld8.acq r3=[r11]
>
>The LDXMOV relocation is designed to make it simple to convert the
>instruction from ld8 r11=[r14] to mov r11=r14, it is easy to do in
>place.
>
Ok, simplicity is an argument.
> Moving an entire slot around is a lot messier, for no
>performance gain.
>
You still have one memory unit wasted on the mov, which is logically a
nop. So, depending on the CPU implementation, there is a possible loss
of one cycle, especially for memory-intensive code fragments or
instruction groups. In the example, the instruction group with the
LDXMOV has seven memory units utilized. And what if the CPU has only
six of them implemented? But I see the complexity of changing that: it
would require another optimizer step. A linker optimizer?
Christian
* RE: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (10 preceding siblings ...)
2005-01-25 7:30 ` Christian Hildner
@ 2005-01-25 19:44 ` Chen, Kenneth W
2005-01-25 19:51 ` David Mosberger
` (2 subsequent siblings)
14 siblings, 0 replies; 17+ messages in thread
From: Chen, Kenneth W @ 2005-01-25 19:44 UTC (permalink / raw)
To: linux-ia64
>>>>> On Mon, 24 Jan 2005 08:51:17 +0100, Christian Hildner said:
Christian> Do you have benchmarks? Or at least a comparison of the
Christian> resulting code size? The code size should shrink when
Christian> more items can be addressed directly. Furthermore, the
Christian> code size should be a good indicator of the performance
Christian> gain you could achieve.
David Mosberger wrote on Monday, January 24, 2005 9:53 AM
> I'm curious about this too. Ken, do you have a real-world benchmark
> where improved GP-relative addressability actually made a significant
> difference?
I don't have any benchmark results off hand. But digging through recent
data, for a well-known db benchmark, GOT table accesses account for about
4% of total memory latency in the kernel.
- Ken
* RE: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (11 preceding siblings ...)
2005-01-25 19:44 ` Chen, Kenneth W
@ 2005-01-25 19:51 ` David Mosberger
2005-01-25 19:57 ` Chen, Kenneth W
2005-01-25 20:01 ` David Mosberger
14 siblings, 0 replies; 17+ messages in thread
From: David Mosberger @ 2005-01-25 19:51 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 25 Jan 2005 11:44:34 -0800, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:
Ken> I don't have any benchmark results off hand. But digging through
Ken> recent data, for a well-known db benchmark, GOT table accesses
Ken> account for about 4% of total memory latency in the kernel.
4% actual CPU stalls or 4% of potentially overlapped stalls?
I'm not trying to be difficult. It's just that GOT optimizations so
far haven't really paid off in any significant way in the benchmarks I
tried. You'd think for some benchmarks it does make a significant
difference (even just based on a cache-pollution argument), I just
haven't seen it yet...
--david
* RE: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (12 preceding siblings ...)
2005-01-25 19:51 ` David Mosberger
@ 2005-01-25 19:57 ` Chen, Kenneth W
2005-01-25 20:01 ` David Mosberger
14 siblings, 0 replies; 17+ messages in thread
From: Chen, Kenneth W @ 2005-01-25 19:57 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 25 Jan 2005 11:44:34 -0800, "Chen, Kenneth W" said:
Ken> I don't have any benchmark results off hand. But digging through
Ken> recent data, for a well-known db benchmark, GOT table accesses
Ken> account for about 4% of total memory latency in the kernel.
David Mosberger wrote on Tuesday, January 25, 2005 11:52 AM
>
> 4% actual CPU stalls or 4% of potentially overlapped stalls?
>
4% of total cpu stall cycles. 6% of all memory references (in kernel
space). Though they are heavily concentrated in a few cache lines.
> I'm not trying to be difficult. It's just that GOT optimizations so
> far haven't really paid off in any significant way in the benchmarks I
> tried. You'd think for some benchmarks it does make a significant
> difference (even just based on a cache-pollution argument), I just
> haven't seen it yet...
No biggy, I want to find out myself too whether this has any real impact
or not :-)
- Ken
* RE: optimize __gp location
2005-01-21 23:22 optimize __gp location Chen, Kenneth W
` (13 preceding siblings ...)
2005-01-25 19:57 ` Chen, Kenneth W
@ 2005-01-25 20:01 ` David Mosberger
14 siblings, 0 replies; 17+ messages in thread
From: David Mosberger @ 2005-01-25 20:01 UTC (permalink / raw)
To: linux-ia64
>>>>> On Tue, 25 Jan 2005 11:57:53 -0800, "Chen, Kenneth W" <kenneth.w.chen@intel.com> said:
Ken> 4% of total cpu stall cycles. 6% of all memory references (in
Ken> kernel space). Though they are heavily concentrated in a few
Ken> cache lines.
Sounds pretty promising.
--david