From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yury Norov <yury.norov@gmail.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
Will Deacon <will@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
Date: Fri, 28 Oct 2022 18:45:50 +0100 [thread overview]
Message-ID: <Y1wVTkIZjoMVfxOK@shell.armlinux.org.uk> (raw)
In-Reply-To: <CAHk-=wi63Sw3vNJ86gzg1Tdr=_xGwGyj+mH-eT0UgaAfGAHX+A@mail.gmail.com>
On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> <rmk+kernel@armlinux.org.uk> wrote:
> >
> > Document the ARMv5 bit offset calculation code.
>
> Hmm. Don't the generic bitop functions end up using this? We do have a
> comment in the code that says
>
> * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> * and produce optimal inlined code in all cases. On ARMv7 it is even
> * better by also using the rbit instruction.
It's true that the generic code also makes use of the rbit and clz
instructions - but in terms of the speed of the functions these only
get used once we've found a word that is interesting to locate the
bit we want in.
> but that 'may' makes me wonder...
>
> IOW, what is it in the hand-written code that doesn't get done by the
> generic code these days?
For the _find_first_bit, there isn't much difference in the number
of instructions or really what is going on, only the organisation
and flow of the code is more inline - but that shouldn't make much
of a difference. Yet, there is a definite repeatable measurable
difference between the two:
random-filled:
arm : find_first_bit: 17778911 ns, 16448 iterations
generic: find_first_bit: 18596622 ns, 16401 iterations
sparse:
arm : find_first_bit: 7301363 ns, 656 iterations
generic: find_first_bit: 7589120 ns, 655 iterations
The bigger difference is in the find_next_bit operations, and this
likely comes from the arm32 code not having the hassles of the "_and"
and other conditionals that the generic code has:
random-filled:
arm : find_next_bit: 2242618 ns, 163949 iterations
generic: find_next_bit: 2632859 ns, 163743 iterations
sparse:
arm : find_next_bit: 40078 ns, 656 iterations
generic: find_next_bit: 69943 ns, 655 iterations
find_next_zero_bit show a greater difference:
random-filled:
arm : find_next_zero_bit: 2049129 ns, 163732 iterations
generic: find_next_zero_bit: 2844221 ns, 163938 iterations
sparse:
arm : find_next_zero_bit: 3939309 ns, 327025 iterations
generic: find_next_zero_bit: 5529553 ns, 327026 iterations
Here's the disassemblies for comparison. Note that the arm versions
share code paths between the functions which makes the code even more
compact - so the loop in the find_first gets re-used for find_next
after we check the current word.
generic:
00000000 <_find_first_bit>:
0: e1a02000 mov r2, r0
4: e2510000 subs r0, r1, #0
8: 012fff1e bxeq lr
c: e2422004 sub r2, r2, #4
10: e3a03000 mov r3, #0
14: ea000002 b 24 <_find_first_bit+0x24>
18: e2833020 add r3, r3, #32
1c: e1500003 cmp r0, r3
20: 912fff1e bxls lr
24: e5b2c004 ldr ip, [r2, #4]!
28: e35c0000 cmp ip, #0
2c: 0afffff9 beq 18 <_find_first_bit+0x18>
30: e6ffcf3c rbit ip, ip
34: e16fcf1c clz ip, ip
38: e08c3003 add r3, ip, r3
3c: e1500003 cmp r0, r3
40: 21a00003 movcs r0, r3
44: e12fff1e bx lr
000000e8 <_find_next_bit>:
e8: e92d4070 push {r4, r5, r6, lr}
ec: e1530002 cmp r3, r2
f0: e59d4010 ldr r4, [sp, #16]
f4: e59d5014 ldr r5, [sp, #20]
f8: 2a000024 bcs 190 <_find_next_bit+0xa8>
fc: e1a0e2a3 lsr lr, r3, #5
100: e3510000 cmp r1, #0
104: e203601f and r6, r3, #31
108: e3c3301f bic r3, r3, #31
10c: e790c10e ldr ip, [r0, lr, lsl #2]
110: 1791e10e ldrne lr, [r1, lr, lsl #2]
114: 100cc00e andne ip, ip, lr
118: e3e0e000 mvn lr, #0
11c: e02cc004 eor ip, ip, r4
120: e3550000 cmp r5, #0
124: e1a0e61e lsl lr, lr, r6
128: 1a00001a bne 198 <_find_next_bit+0xb0>
12c: e01cc00e ands ip, ip, lr
130: 1a000011 bne 17c <_find_next_bit+0x94>
134: e2833020 add r3, r3, #32
138: e1530002 cmp r3, r2
13c: 3a000003 bcc 150 <_find_next_bit+0x68>
140: ea000012 b 190 <_find_next_bit+0xa8>
144: e2833020 add r3, r3, #32
148: e1520003 cmp r2, r3
14c: 9a00000f bls 190 <_find_next_bit+0xa8>
150: e1a0e2a3 lsr lr, r3, #5
154: e3510000 cmp r1, #0
158: e790c10e ldr ip, [r0, lr, lsl #2]
15c: 1791e10e ldrne lr, [r1, lr, lsl #2]
160: 100cc00e andne ip, ip, lr
164: e15c0004 cmp ip, r4
168: 0afffff5 beq 144 <_find_next_bit+0x5c>
16c: e02cc004 eor ip, ip, r4
170: e3550000 cmp r5, #0
174: 0a000000 beq 17c <_find_next_bit+0x94>
178: e6bfcf3c rev ip, ip
17c: e6ffcf3c rbit ip, ip
180: e16fcf1c clz ip, ip
184: e08c3003 add r3, ip, r3
188: e1520003 cmp r2, r3
18c: 21a02003 movcs r2, r3
190: e1a00002 mov r0, r2
194: e8bd8070 pop {r4, r5, r6, pc}
198: e6bfef3e rev lr, lr
19c: e01cc00e ands ip, ip, lr
1a0: 0affffe3 beq 134 <_find_next_bit+0x4c>
1a4: eafffff3 b 178 <_find_next_bit+0x90>
==========
arm optimised:
00000060 <_find_first_bit_le>:
60: e3310000 teq r1, #0
64: 0a000006 beq 84 <_find_first_bit_le+0x24>
68: e3a02000 mov r2, #0
6c: e4903004 ldr r3, [r0], #4
70: e1b03003 movs r3, r3
74: 1a000011 bne c0 <_find_next_bit_le+0x34>
78: e2822020 add r2, r2, #32
7c: e1520001 cmp r2, r1
80: 3afffff9 bcc 6c <_find_first_bit_le+0xc>
84: e1a00001 mov r0, r1
88: e12fff1e bx lr
0000008c <_find_next_bit_le>:
8c: e1520001 cmp r2, r1
90: 2afffffb bcs 84 <_find_first_bit_le+0x24>
94: e1a0c2a2 lsr ip, r2, #5
98: e080010c add r0, r0, ip, lsl #2
9c: e212c01f ands ip, r2, #31
a0: 0afffff1 beq 6c <_find_first_bit_le+0xc>
a4: e4903004 ldr r3, [r0], #4
a8: e1b03c33 lsrs r3, r3, ip
ac: 1a000003 bne c0 <_find_next_bit_le+0x34>
b0: e382201f orr r2, r2, #31
b4: e2822001 add r2, r2, #1
b8: eaffffef b 7c <_find_first_bit_le+0x1c>
bc: e6bf3f33 rev r3, r3
c0: e6ff3f33 rbit r3, r3
c4: e16f3f13 clz r3, r3
c8: e0820003 add r0, r2, r3
cc: e1510000 cmp r1, r0
d0: 31a00001 movcc r0, r1
d4: e12fff1e bx lr
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: "Russell King (Oracle)" <linux@armlinux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yury Norov <yury.norov@gmail.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Mark Rutland <mark.rutland@arm.com>,
Will Deacon <will@kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
Date: Fri, 28 Oct 2022 18:45:50 +0100 [thread overview]
Message-ID: <Y1wVTkIZjoMVfxOK@shell.armlinux.org.uk> (raw)
In-Reply-To: <CAHk-=wi63Sw3vNJ86gzg1Tdr=_xGwGyj+mH-eT0UgaAfGAHX+A@mail.gmail.com>
On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> <rmk+kernel@armlinux.org.uk> wrote:
> >
> > Document the ARMv5 bit offset calculation code.
>
> Hmm. Don't the generic bitop functions end up using this? We do have a
> comment in the code that says
>
> * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> * and produce optimal inlined code in all cases. On ARMv7 it is even
> * better by also using the rbit instruction.
It's true that the generic code also makes use of the rbit and clz
instructions - but in terms of the speed of the functions these only
get used once we've found a word that is interesting to locate the
bit we want in.
> but that 'may' makes me wonder...
>
> IOW, what is it in the hand-written code that doesn't get done by the
> generic code these days?
For the _find_first_bit, there isn't much difference in the number
of instructions or really what is going on, only the organisation
and flow of the code is more inline - but that shouldn't make much
of a difference. Yet, there is a definite repeatable measurable
difference between the two:
random-filled:
arm : find_first_bit: 17778911 ns, 16448 iterations
generic: find_first_bit: 18596622 ns, 16401 iterations
sparse:
arm : find_first_bit: 7301363 ns, 656 iterations
generic: find_first_bit: 7589120 ns, 655 iterations
The bigger difference is in the find_next_bit operations, and this
likely comes from the arm32 code not having the hassles of the "_and"
and other conditionals that the generic code has:
random-filled:
arm : find_next_bit: 2242618 ns, 163949 iterations
generic: find_next_bit: 2632859 ns, 163743 iterations
sparse:
arm : find_next_bit: 40078 ns, 656 iterations
generic: find_next_bit: 69943 ns, 655 iterations
find_next_zero_bit show a greater difference:
random-filled:
arm : find_next_zero_bit: 2049129 ns, 163732 iterations
generic: find_next_zero_bit: 2844221 ns, 163938 iterations
sparse:
arm : find_next_zero_bit: 3939309 ns, 327025 iterations
generic: find_next_zero_bit: 5529553 ns, 327026 iterations
Here's the disassemblies for comparison. Note that the arm versions
share code paths between the functions which makes the code even more
compact - so the loop in the find_first gets re-used for find_next
after we check the current word.
generic:
00000000 <_find_first_bit>:
0: e1a02000 mov r2, r0
4: e2510000 subs r0, r1, #0
8: 012fff1e bxeq lr
c: e2422004 sub r2, r2, #4
10: e3a03000 mov r3, #0
14: ea000002 b 24 <_find_first_bit+0x24>
18: e2833020 add r3, r3, #32
1c: e1500003 cmp r0, r3
20: 912fff1e bxls lr
24: e5b2c004 ldr ip, [r2, #4]!
28: e35c0000 cmp ip, #0
2c: 0afffff9 beq 18 <_find_first_bit+0x18>
30: e6ffcf3c rbit ip, ip
34: e16fcf1c clz ip, ip
38: e08c3003 add r3, ip, r3
3c: e1500003 cmp r0, r3
40: 21a00003 movcs r0, r3
44: e12fff1e bx lr
000000e8 <_find_next_bit>:
e8: e92d4070 push {r4, r5, r6, lr}
ec: e1530002 cmp r3, r2
f0: e59d4010 ldr r4, [sp, #16]
f4: e59d5014 ldr r5, [sp, #20]
f8: 2a000024 bcs 190 <_find_next_bit+0xa8>
fc: e1a0e2a3 lsr lr, r3, #5
100: e3510000 cmp r1, #0
104: e203601f and r6, r3, #31
108: e3c3301f bic r3, r3, #31
10c: e790c10e ldr ip, [r0, lr, lsl #2]
110: 1791e10e ldrne lr, [r1, lr, lsl #2]
114: 100cc00e andne ip, ip, lr
118: e3e0e000 mvn lr, #0
11c: e02cc004 eor ip, ip, r4
120: e3550000 cmp r5, #0
124: e1a0e61e lsl lr, lr, r6
128: 1a00001a bne 198 <_find_next_bit+0xb0>
12c: e01cc00e ands ip, ip, lr
130: 1a000011 bne 17c <_find_next_bit+0x94>
134: e2833020 add r3, r3, #32
138: e1530002 cmp r3, r2
13c: 3a000003 bcc 150 <_find_next_bit+0x68>
140: ea000012 b 190 <_find_next_bit+0xa8>
144: e2833020 add r3, r3, #32
148: e1520003 cmp r2, r3
14c: 9a00000f bls 190 <_find_next_bit+0xa8>
150: e1a0e2a3 lsr lr, r3, #5
154: e3510000 cmp r1, #0
158: e790c10e ldr ip, [r0, lr, lsl #2]
15c: 1791e10e ldrne lr, [r1, lr, lsl #2]
160: 100cc00e andne ip, ip, lr
164: e15c0004 cmp ip, r4
168: 0afffff5 beq 144 <_find_next_bit+0x5c>
16c: e02cc004 eor ip, ip, r4
170: e3550000 cmp r5, #0
174: 0a000000 beq 17c <_find_next_bit+0x94>
178: e6bfcf3c rev ip, ip
17c: e6ffcf3c rbit ip, ip
180: e16fcf1c clz ip, ip
184: e08c3003 add r3, ip, r3
188: e1520003 cmp r2, r3
18c: 21a02003 movcs r2, r3
190: e1a00002 mov r0, r2
194: e8bd8070 pop {r4, r5, r6, pc}
198: e6bfef3e rev lr, lr
19c: e01cc00e ands ip, ip, lr
1a0: 0affffe3 beq 134 <_find_next_bit+0x4c>
1a4: eafffff3 b 178 <_find_next_bit+0x90>
==========
arm optimised:
00000060 <_find_first_bit_le>:
60: e3310000 teq r1, #0
64: 0a000006 beq 84 <_find_first_bit_le+0x24>
68: e3a02000 mov r2, #0
6c: e4903004 ldr r3, [r0], #4
70: e1b03003 movs r3, r3
74: 1a000011 bne c0 <_find_next_bit_le+0x34>
78: e2822020 add r2, r2, #32
7c: e1520001 cmp r2, r1
80: 3afffff9 bcc 6c <_find_first_bit_le+0xc>
84: e1a00001 mov r0, r1
88: e12fff1e bx lr
0000008c <_find_next_bit_le>:
8c: e1520001 cmp r2, r1
90: 2afffffb bcs 84 <_find_first_bit_le+0x24>
94: e1a0c2a2 lsr ip, r2, #5
98: e080010c add r0, r0, ip, lsl #2
9c: e212c01f ands ip, r2, #31
a0: 0afffff1 beq 6c <_find_first_bit_le+0xc>
a4: e4903004 ldr r3, [r0], #4
a8: e1b03c33 lsrs r3, r3, ip
ac: 1a000003 bne c0 <_find_next_bit_le+0x34>
b0: e382201f orr r2, r2, #31
b4: e2822001 add r2, r2, #1
b8: eaffffef b 7c <_find_first_bit_le+0x1c>
bc: e6bf3f33 rev r3, r3
c0: e6ff3f33 rbit r3, r3
c4: e16f3f13 clz r3, r3
c8: e0820003 add r0, r2, r3
cc: e1510000 cmp r1, r0
d0: 31a00001 movcc r0, r1
d4: e12fff1e bx lr
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
next prev parent reply other threads:[~2022-10-28 17:47 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-28 16:47 [PATCH 0/5] ARM: findbit assembly updates Russell King (Oracle)
2022-10-28 16:47 ` Russell King (Oracle)
2022-10-28 16:47 ` [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation Russell King (Oracle)
2022-10-28 16:47 ` Russell King (Oracle)
2022-10-28 17:05 ` Linus Torvalds
2022-10-28 17:05 ` Linus Torvalds
2022-10-28 17:45 ` Russell King (Oracle) [this message]
2022-10-28 17:45 ` Russell King (Oracle)
2022-10-28 18:37 ` Yury Norov
2022-10-28 18:37 ` Yury Norov
2022-10-28 19:42 ` Russell King (Oracle)
2022-10-28 19:42 ` Russell King (Oracle)
2022-10-28 19:01 ` Linus Torvalds
2022-10-28 19:01 ` Linus Torvalds
2022-10-28 19:10 ` Linus Torvalds
2022-10-28 19:10 ` Linus Torvalds
2022-10-28 19:46 ` Russell King (Oracle)
2022-10-28 19:46 ` Russell King (Oracle)
2022-10-28 20:26 ` Linus Torvalds
2022-10-28 20:26 ` Linus Torvalds
2022-10-28 16:47 ` [PATCH 2/5] ARM: findbit: provide more efficient ARMv7 implementation Russell King (Oracle)
2022-10-28 16:47 ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 3/5] ARM: findbit: convert to macros Russell King (Oracle)
2022-10-28 16:48 ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 4/5] ARM: findbit: operate by words Russell King (Oracle)
2022-10-28 16:48 ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 5/5] ARM: findbit: add unwinder information Russell King (Oracle)
2022-10-28 16:48 ` Russell King (Oracle)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y1wVTkIZjoMVfxOK@shell.armlinux.org.uk \
--to=linux@armlinux.org.uk \
--cc=catalin.marinas@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=torvalds@linux-foundation.org \
--cc=will@kernel.org \
--cc=yury.norov@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.