From: Christian Hildner
Date: Tue, 25 Jan 2005 07:30:26 +0000
Subject: Re: optimize __gp location
Message-Id: <41F5F592.3090806@hob.de>
To: linux-ia64@vger.kernel.org

Keith Owens wrote:

>On Mon, 24 Jan 2005 14:44:22 +0100,
>Christian Hildner wrote:
>
>>Keith Owens wrote:
>>
>>>When jiffies is within 22 bit range of __gp, the linker writes the
>>>sequence as
>>>
>>>  addl r20=offset_of(jiffies,__gp),r1;;
>>>  mov r16=r20;;
>>>  ld8.acq r23=[r16]   // value of jiffies
>>>
>>Is there a restriction that prevents rewriting this to
>>
>>  addl r16=offset_of(jiffies,__gp),r1;;
>>  ld8.acq r23=[r16]   // value of jiffies
>>  nop.i 0
>>
>>because that would save at least one cycle and would make bundling easier
>>(depending on additional instructions, of course).
>
>The code snippet was a simplification of what gcc actually does.  If
>you look at some object code, you will find that the 3 instructions are
>already spread over multiple bundles.  Moving the final ld8 upwards
>cannot save any cycles, you still have to execute the same number of
>bundles.

But it is one instruction group less, and that corresponds to at least
(here exactly) one cycle.
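The instruction-group argument can be made concrete with a small sketch. It assumes a deliberately simplified model in which `;;` marks a stop and each instruction group costs at least one cycle. The helper name `count_groups` is hypothetical, and real issue behaviour depends on the CPU implementation.

```python
def count_groups(asm: str) -> int:
    """Count instruction groups in an IA-64 snippet, where ';;' marks a stop."""
    groups = 0
    open_group = False  # True while the current group holds at least one instruction
    for line in asm.strip().splitlines():
        insn = line.split("//")[0].strip()  # drop trailing comments
        if not insn:
            continue
        open_group = True
        if insn.endswith(";;"):  # a stop closes the current group
            groups += 1
            open_group = False
    # A trailing group without a stop still counts as one group.
    return groups + (1 if open_group else 0)


# The sequence the linker writes today, as quoted in the thread.
ORIGINAL = """
addl r20=offset_of(jiffies,__gp),r1;;
mov r16=r20;;
ld8.acq r23=[r16]   // value of jiffies
"""

# The rewritten sequence proposed in the thread.
PROPOSED = """
addl r16=offset_of(jiffies,__gp),r1;;
ld8.acq r23=[r16]   // value of jiffies
nop.i 0
"""

print(count_groups(ORIGINAL), count_groups(PROPOSED))  # prints "3 2"
```

Under this model the simplified sequence needs three groups and the proposed one needs two, which is the one-group difference claimed above. It says nothing about how gcc actually schedules the surrounding instructions.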
>A real example from kernel/sched.o
>
>    4830:  09 50 20 42 00 21  [MMI]  adds r10=8,r33
>                       4832: LTOFF22X  jiffies
>    4836:  20 81 84 00 42 c0         adds r18=16,r33
>    483c:  01 08 00 90               addl r14=0,r1;;
>    4840:  08 00 08 1e d8 19  [MMI]  stf.spill [r15]=f2
>                       4841: LDXMOV  jiffies
>                       4842: LTOFF22X  __per_cpu_offset
>    4846:  b0 00 38 30 20 40         ld8 r11=[r14]
>    484c:  03 08 00 90               addl r26=0,r1
>    4850:  08 a0 00 02 00 24  [MMI]  addl r20=0,r1
>                       4850: LTOFF22X  .data.percpu+0x440
>    4856:  90 00 01 20 40 e0         shladd r9=r32,1,r0
>    485c:  02 00 59 00               sxt4 r23=r32
>    4860:  08 40 00 14 18 10  [MMI]  ld8 r8=[r10]
>    4866:  10 01 48 30 20 e0         ld8 r17=[r18]
>    486c:  04 00 c4 00               mov r39=b0
>    4870:  05 00 00 00 01 40  [MLX]  nop.m 0x0
>    4876:  10 00 00 00 00 60         movl r27=0x10624dd3;;
>    487c:  33 55 6c 62
>    4880:  10 00 00 00 01 00  [MIB]  nop.m 0x0
>    4886:  f0 40 e0 f0 29 00         shl r15=r8,7
>    488c:  00 00 00 20               nop.b 0x0
>    4890:  09 c0 00 34 18 10  [MMI]  ld8 r24=[r26]
>                       4890: LDXMOV  __per_cpu_offset
>    4896:  30 00 2c 70 21 40         ld8.acq r3=[r11]
>
>The LDXMOV relocation is designed to make it simple to convert the
>instruction from ld8 r11=[r14] to mov r11=r14, it is easy to do in
>place.

OK, simplicity is an argument.

>Moving an entire slot around is a lot messier, for no
>performance gain.

You still have one memory unit wasted on the mov, which is logically a
nop. So, depending on the CPU implementation, there is a possible loss of
one cycle, especially for memory-intensive code fragments or instruction
groups. In the example, the instruction group containing the LDXMOV uses
seven memory units. What if the CPU implements only six of them? But I see
the complexity of changing that. It would require another optimizer step.
A linker optimizer?

Christian
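The count of seven memory slots can be checked with a rough sketch. It assumes a simplified model that looks only at the bundle templates of the instruction group containing the LDXMOV on jiffies, and it counts every M slot as occupied, including the nop.m. The six-unit figure is the hypothetical one from the question above, not a statement about any particular CPU. Real dispersal rules are more involved and are ignored here.

```python
import math

# Bundle templates of the instruction group that contains the LDXMOV on
# jiffies: the bundles at 0x4840 through 0x4870 in the listing, ending at
# the ';;' stop after movl.
GROUP_TEMPLATES = ["MMI", "MMI", "MMI", "MLX"]


def memory_slots(templates):
    """Number of M slots across the given bundle templates."""
    return sum(t.count("M") for t in templates)


def min_cycles(templates, m_units):
    """Lower bound on cycles if at most m_units M-slot instructions issue per cycle."""
    return math.ceil(memory_slots(templates) / m_units)


print(memory_slots(GROUP_TEMPLATES))   # 7
print(min_cycles(GROUP_TEMPLATES, 6))  # 2 (hypothetical CPU with six memory units)
print(min_cycles(GROUP_TEMPLATES, 7))  # 1
```

In this model, removing the relaxed mov from its M slot would bring the group down to six memory slots, which is where the possible one-cycle difference would come from.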