* Re: [PATCH] alpha: simplify and optimize sched_find_first_bit
[not found] ` <4BBE2B5C.6020802@twiddle.net>
@ 2010-04-29 2:08 ` Matt Turner
From: Matt Turner @ 2010-04-29 2:08 UTC
To: Richard Henderson; +Cc: linux-alpha
[-- Attachment #1: Type: text/plain, Size: 2621 bytes --]
On Thu, Apr 8, 2010 at 3:15 PM, Richard Henderson <rth@twiddle.net> wrote:
> On 04/08/2010 11:34 AM, mattst88@gmail.com wrote:
>> + asm(
>> + "cmoveq %0,64,%1 # ofs = (b[0] ? ofs : 64);\n"
>> + "cmoveq %0,%2,%0 # temp = (b[0] ? b[0] : b[1]);\n"
>> + "cttz %0,%0 # output = cttz(temp);\n "
>> + : "=r" (output), "=r" (ofs)
>> + : "r" (b[1]), "0" (b[0]), "1" (0)
>
> I must say I'd also prefer a comment like
>
> /* This is equivalent to
> ofs = (b[0] ? 0 : 64);
> tmp = (b[0] ? b[0] : b[1]);
> but is a bit faster than what GCC would produce on its own. */
> asm("cmoveq %0,64,%1\n\tcmoveq %0,%2,%0"
> : "=r"(output), "=r"(ofs)
> : "r"(b[1]), "0"(b[0]), "1"(0));
>
> ... except that I can't see that it is, at least for mainline gcc.
>
> [anchor:~] cat z.c
> long foo(const unsigned long *b)
> {
> unsigned long b0, b1, ofs, tmp;
>
> b0 = b[0];
> b1 = b[1];
> ofs = (b0 ? 0 : 64);
> tmp = (b0 ? b0 : b1);
>
> /* tmp = __ffs(tmp); -- elided for clarity wrt ev5 vs ev67 */
> return tmp + ofs;
> }
>
> -mcpu=ev5 -Os (to avoid nop padding):
> ldq $2,0($16)
> ldq $1,8($16)
> lda $0,64($31)
> cmovne $2,0,$0
> cmovne $2,$2,$1
> addq $0,$1,$0
>
> -mcpu=ev6 -Os:
> ldq $0,0($16)
> ldq $1,8($16)
> cmovne $0,$0,$1
> cmpeq $0,0,$0
> sll $0,6,$0
> addq $0,$1,$0
>
> I seem to recall that cmov is slightly more expensive on ev6,
> so gcc doesn't prefer it and came up with an equivalent using
> cmpeq+sll.
>
> If some previous version of gcc isn't so smart, I'm ok with
> continuing to use the asm.
>
>
> r~
>
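So with your test program (built with -mcpu=ev67, full disassembly
attached), the code generation results are below. "good" means foo is
the short cmovne/cmpeq/sll/cttz sequence with __ffs inlined; "bad" means
foo sets up a stack frame and calls __ffs via bsr: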
4.3.4 -Os: good
4.3.4 -O1: bad
4.3.4 -O2: good
4.3.4 -O3: good
4.4.3 -Os: bad
4.4.3 -O1: bad
4.4.3 -O2: bad
4.4.3 -O3: good
4.5.0 -Os: good
4.5.0 -O1: bad
4.5.0 -O2: good
4.5.0 -O3: good
o -O3 produces good code in all versions.
o -O1 is bad in all versions.
o -Os and -O2 regressed from 4.3.4 to 4.4.3,
but are back to 4.3.4 quality as of 4.5.0.
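For reference, in the good cases the offset is computed with a compare
and a shift (cmpeq + sll) rather than a cmov. A minimal sketch of why
that form matches the C in the test program; the function names below
are made up for illustration and appear in neither the kernel nor the
attachment:

```c
#include <assert.h>

/* Offset as written in the test program: 0 if word 0 is non-zero, else 64. */
static unsigned long ofs_select(unsigned long b0)
{
    return b0 ? 0 : 64;
}

/* Same value as a compare plus a shift: (b0 == 0) is 0 or 1, and 1 << 6 is 64.
 * This mirrors the cmpeq/sll pair in the "good" disassembly. */
static unsigned long ofs_shift(unsigned long b0)
{
    return (unsigned long)(b0 == 0) << 6;
}

/* Hypothetical helper: checks that both forms agree on a few sample words. */
static void check_ofs_forms(void)
{
    const unsigned long samples[] = { 0UL, 1UL, 2UL, 64UL, ~0UL };
    unsigned int i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(ofs_select(samples[i]) == ofs_shift(samples[i]));
}
```

Calling check_ofs_forms() from a small main() is enough to exercise it.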
All produced cmov instructions just as you said.
My patch doesn't help any of the bad cases and even causes some that
were good to produce worse code, so it's not useful. Does any of this
look like it should warrant a gcc bug report, Richard?
I'll send a patch that just updates sched_find_first_bit to search only
the first 100 bits.
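Roughly what I have in mind, as a sketch only and not the final patch.
It assumes unsigned long is 64 bits (as on Alpha), so 100 bits fit in
two words, and that the caller guarantees at least one bit is set. The
_sketch suffix and the check function are made up for this mail, and
__builtin_ctzl stands in for __ffs as in the attached test:

```c
#include <assert.h>

/* Sketch: find the first set bit in a 100-bit bitmap stored in two words.
 * Assumes 64-bit unsigned long and at least one set bit, because
 * __builtin_ctzl(0) is undefined. */
static inline unsigned long sched_find_first_bit_sketch(const unsigned long b[2])
{
    unsigned long b0 = b[0];
    unsigned long b1 = b[1];
    unsigned long ofs = b0 ? 0 : 64;   /* skip word 0 if it is empty */
    unsigned long tmp = b0 ? b0 : b1;  /* word that holds the first set bit */

    return (unsigned long)__builtin_ctzl(tmp) + ofs;
}

/* Hypothetical check covering the low word, the high word, and both. */
static void check_sched_find_first_bit_sketch(void)
{
    const unsigned long low[2]  = { 8UL, 0UL };        /* bit 3 set */
    const unsigned long high[2] = { 0UL, 1UL << 35 };  /* bit 99 set */
    const unsigned long both[2] = { 1UL << 10, 1UL };  /* bits 10 and 64 set */

    assert(sched_find_first_bit_sketch(low) == 3);
    assert(sched_find_first_bit_sketch(high) == 99);
    assert(sched_find_first_bit_sketch(both) == 10);
}
```

Only two words are ever read, so nothing past bit 127 is touched.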
Thanks!
Matt
[-- Attachment #2: test --]
[-- Type: application/octet-stream, Size: 9196 bytes --]
Test Program
# define __kernel_cttz(x) __builtin_ctzl(x)
unsigned long __ffs(unsigned long word)
{
/* Whee. EV67 can calculate it directly. */
return __kernel_cttz(word);
}
long foo(const unsigned long *b)
{
unsigned long b0, b1, ofs, tmp;
b0 = b[0];
b1 = b[1];
ofs = (b0 ? 0 : 64);
tmp = (b0 ? b0 : b1);
tmp = __ffs(tmp);
return tmp + ofs;
}
# gcc-4.3.4 -Os -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
0000000000000008 <foo>:
8: 00 00 10 a4 ldq v0,0(a0)
c: 08 00 30 a4 ldq t0,8(a0)
10: c1 04 00 44 cmovne v0,v0,t0
14: a0 15 00 40 cmpeq v0,0,v0
18: 20 d7 00 48 sll v0,0x6,v0
1c: 61 06 e1 73 cttz t0,t0
20: 00 04 01 40 addq v0,t0,v0
24: 01 80 fa 6b ret
# gcc-4.3.4 -O1 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
0000000000000008 <foo>:
8: 00 00 bb 27 ldah gp,0(t12)
c: 00 00 bd 23 lda gp,0(gp)
10: f0 ff de 23 lda sp,-16(sp)
14: 00 00 5e b7 stq ra,0(sp)
18: 08 00 3e b5 stq s0,8(sp)
1c: 01 04 f0 47 mov a0,t0
20: 00 00 10 a6 ldq a0,0(a0)
24: 08 00 21 a4 ldq t0,8(t0)
28: a9 15 00 42 cmpeq a0,0,s0
2c: 29 d7 20 49 sll s0,0x6,s0
30: 90 04 01 46 cmoveq a0,t0,a0
34: 00 00 40 d3 bsr ra,38 <foo+0x30>
38: 00 04 20 41 addq s0,v0,v0
3c: 00 00 5e a7 ldq ra,0(sp)
40: 08 00 3e a5 ldq s0,8(sp)
44: 10 00 de 23 lda sp,16(sp)
48: 01 80 fa 6b ret
# gcc-4.3.4 -O2 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
8: 1f 04 ff 47 nop
c: 00 00 fe 2f unop
0000000000000010 <foo>:
10: 00 00 10 a4 ldq v0,0(a0)
14: 08 00 30 a4 ldq t0,8(a0)
18: c1 04 00 44 cmovne v0,v0,t0
1c: a0 15 00 40 cmpeq v0,0,v0
20: 20 d7 00 48 sll v0,0x6,v0
24: 61 06 e1 73 cttz t0,t0
28: 00 04 01 40 addq v0,t0,v0
2c: 01 80 fa 6b ret
# gcc-4.3.4 -O3 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
8: 1f 04 ff 47 nop
c: 00 00 fe 2f unop
0000000000000010 <foo>:
10: 00 00 10 a4 ldq v0,0(a0)
14: 08 00 30 a4 ldq t0,8(a0)
18: c1 04 00 44 cmovne v0,v0,t0
1c: a0 15 00 40 cmpeq v0,0,v0
20: 20 d7 00 48 sll v0,0x6,v0
24: 61 06 e1 73 cttz t0,t0
28: 00 04 01 40 addq v0,t0,v0
2c: 01 80 fa 6b ret
# gcc-4.4.3 -Os -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
0000000000000008 <foo>:
8: 00 00 bb 27 ldah gp,0(t12)
c: 00 00 bd 23 lda gp,0(gp)
10: f0 ff de 23 lda sp,-16(sp)
14: 08 00 30 a4 ldq t0,8(a0)
18: 08 00 3e b5 stq s0,8(sp)
1c: 00 00 30 a5 ldq s0,0(a0)
20: 00 00 5e b7 stq ra,0(sp)
24: 10 04 e1 47 mov t0,a0
28: d0 04 29 45 cmovne s0,s0,a0
2c: a9 15 20 41 cmpeq s0,0,s0
30: 29 d7 20 49 sll s0,0x6,s0
34: 00 00 40 d3 bsr ra,38 <foo+0x30>
38: 00 04 20 41 addq s0,v0,v0
3c: 00 00 5e a7 ldq ra,0(sp)
40: 08 00 3e a5 ldq s0,8(sp)
44: 10 00 de 23 lda sp,16(sp)
48: 01 80 fa 6b ret
# gcc-4.4.3 -O1 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
0000000000000008 <foo>:
8: 00 00 bb 27 ldah gp,0(t12)
c: 00 00 bd 23 lda gp,0(gp)
10: f0 ff de 23 lda sp,-16(sp)
14: 00 00 5e b7 stq ra,0(sp)
18: 08 00 3e b5 stq s0,8(sp)
1c: 00 00 30 a4 ldq t0,0(a0)
20: 08 00 10 a6 ldq a0,8(a0)
24: a9 15 20 40 cmpeq t0,0,s0
28: 29 d7 20 49 sll s0,0x6,s0
2c: d0 04 21 44 cmovne t0,t0,a0
30: 00 00 40 d3 bsr ra,34 <foo+0x2c>
34: 00 04 20 41 addq s0,v0,v0
38: 00 00 5e a7 ldq ra,0(sp)
3c: 08 00 3e a5 ldq s0,8(sp)
40: 10 00 de 23 lda sp,16(sp)
44: 01 80 fa 6b ret
# gcc-4.4.3 -O2 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
8: 1f 04 ff 47 nop
c: 00 00 fe 2f unop
0000000000000010 <foo>:
10: 00 00 bb 27 ldah gp,0(t12)
14: 00 00 bd 23 lda gp,0(gp)
18: f0 ff de 23 lda sp,-16(sp)
1c: 08 00 30 a4 ldq t0,8(a0)
20: 08 00 3e b5 stq s0,8(sp)
24: 00 00 30 a5 ldq s0,0(a0)
28: 00 00 5e b7 stq ra,0(sp)
2c: 10 04 e1 47 mov t0,a0
30: d0 04 29 45 cmovne s0,s0,a0
34: a9 15 20 41 cmpeq s0,0,s0
38: 29 d7 20 49 sll s0,0x6,s0
3c: 00 00 40 d3 bsr ra,40 <foo+0x30>
40: 00 04 20 41 addq s0,v0,v0
44: 00 00 5e a7 ldq ra,0(sp)
48: 08 00 3e a5 ldq s0,8(sp)
4c: 10 00 de 23 lda sp,16(sp)
50: 01 80 fa 6b ret
54: 00 00 fe 2f unop
58: 1f 04 ff 47 nop
5c: 00 00 fe 2f unop
# gcc-4.4.3 -O3 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
8: 1f 04 ff 47 nop
c: 00 00 fe 2f unop
0000000000000010 <foo>:
10: 00 00 30 a4 ldq t0,0(a0)
14: 08 00 10 a4 ldq v0,8(a0)
18: c0 04 21 44 cmovne t0,t0,v0
1c: a1 15 20 40 cmpeq t0,0,t0
20: 21 d7 20 48 sll t0,0x6,t0
24: 60 06 e0 73 cttz v0,v0
28: 00 04 01 40 addq v0,t0,v0
2c: 01 80 fa 6b ret
# gcc-4.5.0 -Os -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
0000000000000008 <foo>:
8: 00 00 30 a4 ldq t0,0(a0)
c: 08 00 10 a4 ldq v0,8(a0)
10: c0 04 21 44 cmovne t0,t0,v0
14: a1 15 20 40 cmpeq t0,0,t0
18: 21 d7 20 48 sll t0,0x6,t0
1c: 60 06 e0 73 cttz v0,v0
20: 00 04 01 40 addq v0,t0,v0
24: 01 80 fa 6b ret
# gcc-4.5.0 -O1 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
0000000000000008 <foo>:
8: 00 00 bb 27 ldah gp,0(t12)
c: 00 00 bd 23 lda gp,0(gp)
10: f0 ff de 23 lda sp,-16(sp)
14: 00 00 5e b7 stq ra,0(sp)
18: 08 00 3e b5 stq s0,8(sp)
1c: 00 00 30 a4 ldq t0,0(a0)
20: 08 00 10 a6 ldq a0,8(a0)
24: a9 15 20 40 cmpeq t0,0,s0
28: 29 d7 20 49 sll s0,0x6,s0
2c: d0 04 21 44 cmovne t0,t0,a0
30: 00 00 40 d3 bsr ra,34 <foo+0x2c>
34: 00 04 20 41 addq s0,v0,v0
38: 00 00 5e a7 ldq ra,0(sp)
3c: 08 00 3e a5 ldq s0,8(sp)
40: 10 00 de 23 lda sp,16(sp)
44: 01 80 fa 6b ret
# gcc-4.5.0 -O2 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
8: 1f 04 ff 47 nop
c: 00 00 fe 2f unop
0000000000000010 <foo>:
10: 00 00 30 a4 ldq t0,0(a0)
14: 08 00 10 a4 ldq v0,8(a0)
18: c0 04 21 44 cmovne t0,t0,v0
1c: a1 15 20 40 cmpeq t0,0,t0
20: 21 d7 20 48 sll t0,0x6,t0
24: 60 06 e0 73 cttz v0,v0
28: 00 04 01 40 addq v0,t0,v0
2c: 01 80 fa 6b ret
# gcc-4.5.0 -O3 -mcpu=ev67 -c z.c && objdump -d z.o
z.o: file format elf64-alpha
Disassembly of section .text:
0000000000000000 <__ffs>:
0: 60 06 f0 73 cttz a0,v0
4: 01 80 fa 6b ret
8: 1f 04 ff 47 nop
c: 00 00 fe 2f unop
0000000000000010 <foo>:
10: 00 00 30 a4 ldq t0,0(a0)
14: 08 00 10 a4 ldq v0,8(a0)
18: c0 04 21 44 cmovne t0,t0,v0
1c: a1 15 20 40 cmpeq t0,0,t0
20: 21 d7 20 48 sll t0,0x6,t0
24: 60 06 e0 73 cttz v0,v0
28: 00 04 01 40 addq v0,t0,v0
2c: 01 80 fa 6b ret