All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yury Norov <yury.norov@gmail.com>
To: "Russell King (Oracle)" <linux@armlinux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Will Deacon <will@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-arm-kernel@lists.infradead.org,
	Alexey Klimov <klimov.linux@gmail.com>
Subject: Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
Date: Fri, 28 Oct 2022 11:37:44 -0700	[thread overview]
Message-ID: <Y1wheOT4yP7VCZ0p@yury-laptop> (raw)
In-Reply-To: <Y1wVTkIZjoMVfxOK@shell.armlinux.org.uk>

+ Alexey Klimov

On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > <rmk+kernel@armlinux.org.uk> wrote:
> > >
> > > Document the ARMv5 bit offset calculation code.
> > 
> > Hmm. Don't the generic bitop functions end up using this? We do have a
> > comment in the code that says
> > 
> >  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> >  * and produce optimal inlined code in all cases. On ARMv7 it is even
> >  * better by also using the rbit instruction.
> 
> It's true that the generic code also makes use of the rbit and clz
> instructions - but in terms of the speed of the functions these only
> get used once we've found a word that is interesting to locate the
> bit we want in.
> 
> > but that 'may' makes me wonder...
> > 
> > IOW, what is it in the hand-written code that doesn't get done by the
> > generic code these days?
> 
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:
> 
> random-filled:
> arm    : find_first_bit:               17778911 ns,  16448 iterations
> generic: find_first_bit:               18596622 ns,  16401 iterations
> 
> sparse:
> arm    : find_first_bit:                7301363 ns,    656 iterations
> generic: find_first_bit:                7589120 ns,    655 iterations
> 
> The bigger difference is in the find_next_bit operations, and this
> likely comes from the arm32 code not having the hassles of the "_and"
> and other conditionals that the generic code has:
> 
> random-filled:
> arm    : find_next_bit:                 2242618 ns, 163949 iterations
> generic: find_next_bit:                 2632859 ns, 163743 iterations
> 
> sparse:
> arm    : find_next_bit:                   40078 ns,    656 iterations
> generic: find_next_bit:                   69943 ns,    655 iterations
> 
> find_next_zero_bit show a greater difference:
> 
> random-filled:
> arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
> generic: find_next_zero_bit:            2844221 ns, 163938 iterations
> 
> sparse:
> arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
> generic: find_next_zero_bit:            5529553 ns, 327026 iterations

Those numbers disagree with what Alexey has measured on Odroid board
for A15 but somewhat in line with what he had for A7:

https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Can you please share details about your methodology: what CPU did you use;
did you lock cpu frequencies; in addition to mean, can you also show
standard deviation, raw logs?

To Alexey: if you have time, can you please repeat those measurements
on top of v6 + this series? This would help a lot in understanding how
new code performs, both generic and arch.

If generic vs arch code comparison looks different for different CPU
versions, what should we prefer? 
 
> Here's the disassemblies for comparison. Note that the arm versions
> share code paths between the functions which makes the code even more
> compact - so the loop in the find_first gets re-used for find_next
> after we check the current word.
> 
> generic:

[...]
 
> 000000e8 <_find_next_bit>:
>   e8:   e92d4070        push    {r4, r5, r6, lr}
>   ec:   e1530002        cmp     r3, r2
>   f0:   e59d4010        ldr     r4, [sp, #16]
>   f4:   e59d5014        ldr     r5, [sp, #20]
>   f8:   2a000024        bcs     190 <_find_next_bit+0xa8>
>   fc:   e1a0e2a3        lsr     lr, r3, #5
>  100:   e3510000        cmp     r1, #0
>  104:   e203601f        and     r6, r3, #31
>  108:   e3c3301f        bic     r3, r3, #31
>  10c:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  110:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  114:   100cc00e        andne   ip, ip, lr
>  118:   e3e0e000        mvn     lr, #0
>  11c:   e02cc004        eor     ip, ip, r4
>  120:   e3550000        cmp     r5, #0
>  124:   e1a0e61e        lsl     lr, lr, r6
>  128:   1a00001a        bne     198 <_find_next_bit+0xb0>
>  12c:   e01cc00e        ands    ip, ip, lr
>  130:   1a000011        bne     17c <_find_next_bit+0x94>
>  134:   e2833020        add     r3, r3, #32
>  138:   e1530002        cmp     r3, r2
>  13c:   3a000003        bcc     150 <_find_next_bit+0x68>
>  140:   ea000012        b       190 <_find_next_bit+0xa8>
>  144:   e2833020        add     r3, r3, #32
>  148:   e1520003        cmp     r2, r3
>  14c:   9a00000f        bls     190 <_find_next_bit+0xa8>
>  150:   e1a0e2a3        lsr     lr, r3, #5
>  154:   e3510000        cmp     r1, #0
>  158:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  15c:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  160:   100cc00e        andne   ip, ip, lr
>  164:   e15c0004        cmp     ip, r4
>  168:   0afffff5        beq     144 <_find_next_bit+0x5c>
>  16c:   e02cc004        eor     ip, ip, r4
>  170:   e3550000        cmp     r5, #0
>  174:   0a000000        beq     17c <_find_next_bit+0x94>
>  178:   e6bfcf3c        rev     ip, ip
>  17c:   e6ffcf3c        rbit    ip, ip
>  180:   e16fcf1c        clz     ip, ip
>  184:   e08c3003        add     r3, ip, r3
>  188:   e1520003        cmp     r2, r3
>  18c:   21a02003        movcs   r2, r3
>  190:   e1a00002        mov     r0, r2
>  194:   e8bd8070        pop     {r4, r5, r6, pc}
>  198:   e6bfef3e        rev     lr, lr
>  19c:   e01cc00e        ands    ip, ip, lr
>  1a0:   0affffe3        beq     134 <_find_next_bit+0x4c>
>  1a4:   eafffff3        b       178 <_find_next_bit+0x90>

On top of master, my generic _find_next_bit looks different:

000000e4 <_find_next_bit>:
  e4:   e1510002        cmp     r1, r2
  e8:   e1a0c000        mov     ip, r0
  ec:   e1a00001        mov     r0, r1
  f0:   912fff1e        bxls    lr
  f4:   e1a012a2        lsr     r1, r2, #5
  f8:   e92d4010        push    {r4, lr}
  fc:   e202201f        and     r2, r2, #31
 100:   e3e03000        mvn     r3, #0
 104:   e79ce101        ldr     lr, [ip, r1, lsl #2]
 108:   e01ee213        ands    lr, lr, r3, lsl r2
 10c:   1a000012        bne     15c <_find_next_bit+0x78>
 110:   e2813001        add     r3, r1, #1
 114:   e1a04283        lsl     r4, r3, #5
 118:   e1540000        cmp     r4, r0
 11c:   28bd8010        popcs   {r4, pc}
 120:   e08c2101        add     r2, ip, r1, lsl #2
 124:   ea000002        b       134 <_find_next_bit+0x50>
 128:   e1a04283        lsl     r4, r3, #5
 12c:   e1500004        cmp     r0, r4
 130:   98bd8010        popls   {r4, pc}
 134:   e5b2e004        ldr     lr, [r2, #4]!
 138:   e2833001        add     r3, r3, #1
 13c:   e35e0000        cmp     lr, #0
 140:   0afffff8        beq     128 <_find_next_bit+0x44>
 144:   e6ffef3e        rbit    lr, lr
 148:   e16fef1e        clz     lr, lr
 14c:   e084400e        add     r4, r4, lr
 150:   e1500004        cmp     r0, r4
 154:   21a00004        movcs   r0, r4
 158:   e8bd8010        pop     {r4, pc}
 15c:   e1a04281        lsl     r4, r1, #5
 160:   eafffff7        b       144 <_find_next_bit+0x60>

Are you sure you're running latest kernel?

Thanks,
Yury

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

WARNING: multiple messages have this Message-ID (diff)
From: Yury Norov <yury.norov@gmail.com>
To: "Russell King (Oracle)" <linux@armlinux.org.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Mark Rutland <mark.rutland@arm.com>,
	Will Deacon <will@kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-arm-kernel@lists.infradead.org,
	Alexey Klimov <klimov.linux@gmail.com>
Subject: Re: [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation
Date: Fri, 28 Oct 2022 11:37:44 -0700	[thread overview]
Message-ID: <Y1wheOT4yP7VCZ0p@yury-laptop> (raw)
In-Reply-To: <Y1wVTkIZjoMVfxOK@shell.armlinux.org.uk>

+ Alexey Klimov

On Fri, Oct 28, 2022 at 06:45:50PM +0100, Russell King (Oracle) wrote:
> On Fri, Oct 28, 2022 at 10:05:29AM -0700, Linus Torvalds wrote:
> > On Fri, Oct 28, 2022 at 9:47 AM Russell King (Oracle)
> > <rmk+kernel@armlinux.org.uk> wrote:
> > >
> > > Document the ARMv5 bit offset calculation code.
> > 
> > Hmm. Don't the generic bitop functions end up using this? We do have a
> > comment in the code that says
> > 
> >  * On ARMv5 and above, the gcc built-ins may rely on the clz instruction
> >  * and produce optimal inlined code in all cases. On ARMv7 it is even
> >  * better by also using the rbit instruction.
> 
> It's true that the generic code also makes use of the rbit and clz
> instructions - but in terms of the speed of the functions these only
> get used once we've found a word that is interesting to locate the
> bit we want in.
> 
> > but that 'may' makes me wonder...
> > 
> > IOW, what is it in the hand-written code that doesn't get done by the
> > generic code these days?
> 
> For the _find_first_bit, there isn't much difference in the number
> of instructions or really what is going on, only the organisation
> and flow of the code is more inline - but that shouldn't make much
> of a difference. Yet, there is a definite repeatable measurable
> difference between the two:
> 
> random-filled:
> arm    : find_first_bit:               17778911 ns,  16448 iterations
> generic: find_first_bit:               18596622 ns,  16401 iterations
> 
> sparse:
> arm    : find_first_bit:                7301363 ns,    656 iterations
> generic: find_first_bit:                7589120 ns,    655 iterations
> 
> The bigger difference is in the find_next_bit operations, and this
> likely comes from the arm32 code not having the hassles of the "_and"
> and other conditionals that the generic code has:
> 
> random-filled:
> arm    : find_next_bit:                 2242618 ns, 163949 iterations
> generic: find_next_bit:                 2632859 ns, 163743 iterations
> 
> sparse:
> arm    : find_next_bit:                   40078 ns,    656 iterations
> generic: find_next_bit:                   69943 ns,    655 iterations
> 
> find_next_zero_bit show a greater difference:
> 
> random-filled:
> arm    : find_next_zero_bit:            2049129 ns, 163732 iterations
> generic: find_next_zero_bit:            2844221 ns, 163938 iterations
> 
> sparse:
> arm    : find_next_zero_bit:            3939309 ns, 327025 iterations
> generic: find_next_zero_bit:            5529553 ns, 327026 iterations

Those numbers disagree with what Alexey has measured on Odroid board
for A15 but somewhat in line with what he had for A7:

https://lore.kernel.org/all/YuWk3titnOiQACzC@yury-laptop/

Can you please share details about your methodology: what CPU did you use;
did you lock cpu frequencies; in addition to mean, can you also show
standard deviation, raw logs?

To Alexey: if you have time, can you please repeat those measurements
on top of v6 + this series? This would help a lot in understanding how
new code performs, both generic and arch.

If generic vs arch code comparison looks different for different CPU
versions, what should we prefer? 
 
> Here's the disassemblies for comparison. Note that the arm versions
> share code paths between the functions which makes the code even more
> compact - so the loop in the find_first gets re-used for find_next
> after we check the current word.
> 
> generic:

[...]
 
> 000000e8 <_find_next_bit>:
>   e8:   e92d4070        push    {r4, r5, r6, lr}
>   ec:   e1530002        cmp     r3, r2
>   f0:   e59d4010        ldr     r4, [sp, #16]
>   f4:   e59d5014        ldr     r5, [sp, #20]
>   f8:   2a000024        bcs     190 <_find_next_bit+0xa8>
>   fc:   e1a0e2a3        lsr     lr, r3, #5
>  100:   e3510000        cmp     r1, #0
>  104:   e203601f        and     r6, r3, #31
>  108:   e3c3301f        bic     r3, r3, #31
>  10c:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  110:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  114:   100cc00e        andne   ip, ip, lr
>  118:   e3e0e000        mvn     lr, #0
>  11c:   e02cc004        eor     ip, ip, r4
>  120:   e3550000        cmp     r5, #0
>  124:   e1a0e61e        lsl     lr, lr, r6
>  128:   1a00001a        bne     198 <_find_next_bit+0xb0>
>  12c:   e01cc00e        ands    ip, ip, lr
>  130:   1a000011        bne     17c <_find_next_bit+0x94>
>  134:   e2833020        add     r3, r3, #32
>  138:   e1530002        cmp     r3, r2
>  13c:   3a000003        bcc     150 <_find_next_bit+0x68>
>  140:   ea000012        b       190 <_find_next_bit+0xa8>
>  144:   e2833020        add     r3, r3, #32
>  148:   e1520003        cmp     r2, r3
>  14c:   9a00000f        bls     190 <_find_next_bit+0xa8>
>  150:   e1a0e2a3        lsr     lr, r3, #5
>  154:   e3510000        cmp     r1, #0
>  158:   e790c10e        ldr     ip, [r0, lr, lsl #2]
>  15c:   1791e10e        ldrne   lr, [r1, lr, lsl #2]
>  160:   100cc00e        andne   ip, ip, lr
>  164:   e15c0004        cmp     ip, r4
>  168:   0afffff5        beq     144 <_find_next_bit+0x5c>
>  16c:   e02cc004        eor     ip, ip, r4
>  170:   e3550000        cmp     r5, #0
>  174:   0a000000        beq     17c <_find_next_bit+0x94>
>  178:   e6bfcf3c        rev     ip, ip
>  17c:   e6ffcf3c        rbit    ip, ip
>  180:   e16fcf1c        clz     ip, ip
>  184:   e08c3003        add     r3, ip, r3
>  188:   e1520003        cmp     r2, r3
>  18c:   21a02003        movcs   r2, r3
>  190:   e1a00002        mov     r0, r2
>  194:   e8bd8070        pop     {r4, r5, r6, pc}
>  198:   e6bfef3e        rev     lr, lr
>  19c:   e01cc00e        ands    ip, ip, lr
>  1a0:   0affffe3        beq     134 <_find_next_bit+0x4c>
>  1a4:   eafffff3        b       178 <_find_next_bit+0x90>

On top of master, my generic _find_next_bit looks different:

000000e4 <_find_next_bit>:
  e4:   e1510002        cmp     r1, r2
  e8:   e1a0c000        mov     ip, r0
  ec:   e1a00001        mov     r0, r1
  f0:   912fff1e        bxls    lr
  f4:   e1a012a2        lsr     r1, r2, #5
  f8:   e92d4010        push    {r4, lr}
  fc:   e202201f        and     r2, r2, #31
 100:   e3e03000        mvn     r3, #0
 104:   e79ce101        ldr     lr, [ip, r1, lsl #2]
 108:   e01ee213        ands    lr, lr, r3, lsl r2
 10c:   1a000012        bne     15c <_find_next_bit+0x78>
 110:   e2813001        add     r3, r1, #1
 114:   e1a04283        lsl     r4, r3, #5
 118:   e1540000        cmp     r4, r0
 11c:   28bd8010        popcs   {r4, pc}
 120:   e08c2101        add     r2, ip, r1, lsl #2
 124:   ea000002        b       134 <_find_next_bit+0x50>
 128:   e1a04283        lsl     r4, r3, #5
 12c:   e1500004        cmp     r0, r4
 130:   98bd8010        popls   {r4, pc}
 134:   e5b2e004        ldr     lr, [r2, #4]!
 138:   e2833001        add     r3, r3, #1
 13c:   e35e0000        cmp     lr, #0
 140:   0afffff8        beq     128 <_find_next_bit+0x44>
 144:   e6ffef3e        rbit    lr, lr
 148:   e16fef1e        clz     lr, lr
 14c:   e084400e        add     r4, r4, lr
 150:   e1500004        cmp     r0, r4
 154:   21a00004        movcs   r0, r4
 158:   e8bd8010        pop     {r4, pc}
 15c:   e1a04281        lsl     r4, r1, #5
 160:   eafffff7        b       144 <_find_next_bit+0x60>

Are you sure you're running latest kernel?

Thanks,
Yury

  reply	other threads:[~2022-10-28 18:41 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-28 16:47 [PATCH 0/5] ARM: findbit assembly updates Russell King (Oracle)
2022-10-28 16:47 ` Russell King (Oracle)
2022-10-28 16:47 ` [PATCH 1/5] ARM: findbit: document ARMv5 bit offset calculation Russell King (Oracle)
2022-10-28 16:47   ` Russell King (Oracle)
2022-10-28 17:05   ` Linus Torvalds
2022-10-28 17:05     ` Linus Torvalds
2022-10-28 17:45     ` Russell King (Oracle)
2022-10-28 17:45       ` Russell King (Oracle)
2022-10-28 18:37       ` Yury Norov [this message]
2022-10-28 18:37         ` Yury Norov
2022-10-28 19:42         ` Russell King (Oracle)
2022-10-28 19:42           ` Russell King (Oracle)
2022-10-28 19:01       ` Linus Torvalds
2022-10-28 19:01         ` Linus Torvalds
2022-10-28 19:10         ` Linus Torvalds
2022-10-28 19:10           ` Linus Torvalds
2022-10-28 19:46         ` Russell King (Oracle)
2022-10-28 19:46           ` Russell King (Oracle)
2022-10-28 20:26           ` Linus Torvalds
2022-10-28 20:26             ` Linus Torvalds
2022-10-28 16:47 ` [PATCH 2/5] ARM: findbit: provide more efficient ARMv7 implementation Russell King (Oracle)
2022-10-28 16:47   ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 3/5] ARM: findbit: convert to macros Russell King (Oracle)
2022-10-28 16:48   ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 4/5] ARM: findbit: operate by words Russell King (Oracle)
2022-10-28 16:48   ` Russell King (Oracle)
2022-10-28 16:48 ` [PATCH 5/5] ARM: findbit: add unwinder information Russell King (Oracle)
2022-10-28 16:48   ` Russell King (Oracle)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y1wheOT4yP7VCZ0p@yury-laptop \
    --to=yury.norov@gmail.com \
    --cc=catalin.marinas@arm.com \
    --cc=klimov.linux@gmail.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@armlinux.org.uk \
    --cc=mark.rutland@arm.com \
    --cc=torvalds@linux-foundation.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.