* [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit
@ 2013-12-22 11:32 Aurelien Jarno
2013-12-22 17:04 ` Richard Henderson
2014-06-19 8:36 ` Paolo Bonzini
0 siblings, 2 replies; 6+ messages in thread
From: Aurelien Jarno @ 2013-12-22 11:32 UTC (permalink / raw)
To: qemu-devel; +Cc: Corentin Chary, Aurelien Jarno, Richard Henderson
find_first_bit has started to be used heavily in TCG code. The current
implementation based on find_next_bit is not optimal and can't be
optimized be the compiler if the bit array has a fixed size, which is
the case most of the time.
This new implementation does not use find_next_bit and is yet small
enough to be inlined.
Cc: Richard Henderson <rth@twiddle.net>
Cc: Corentin Chary <corentin.chary@gmail.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
include/qemu/bitops.h | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index 304c90c..041142a 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -157,7 +157,17 @@ unsigned long find_next_zero_bit(const unsigned long *addr,
static inline unsigned long find_first_bit(const unsigned long *addr,
unsigned long size)
{
- return find_next_bit(addr, size, 0);
+ unsigned long result, tmp;
+
+ for (result = 0; result < size; result += BITS_PER_LONG) {
+ tmp = *addr++;
+ if (tmp) {
+ result += ctzl(tmp);
+ return result < size ? result : size;
+ }
+ }
+ /* Not found */
+ return size;
}
/**
--
1.7.10.4
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit
2013-12-22 11:32 [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit Aurelien Jarno
@ 2013-12-22 17:04 ` Richard Henderson
2014-06-19 8:36 ` Paolo Bonzini
1 sibling, 0 replies; 6+ messages in thread
From: Richard Henderson @ 2013-12-22 17:04 UTC (permalink / raw)
To: Aurelien Jarno, qemu-devel; +Cc: Corentin Chary
On 12/22/2013 03:32 AM, Aurelien Jarno wrote:
> find_first_bit has started to be used heavily in TCG code. The current
> implementation based on find_next_bit is not optimal and can't be
> optimized be the compiler if the bit array has a fixed size, which is
> the case most of the time.
>
> This new implementation does not use find_next_bit and is yet small
> enough to be inlined.
>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Corentin Chary <corentin.chary@gmail.com>
>
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
> ---
> include/qemu/bitops.h | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
Reviewed-by: Richard Henderson <rth@twiddle.net>
r~
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit
2013-12-22 11:32 [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit Aurelien Jarno
2013-12-22 17:04 ` Richard Henderson
@ 2014-06-19 8:36 ` Paolo Bonzini
2014-06-20 8:48 ` Aurelien Jarno
1 sibling, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2014-06-19 8:36 UTC (permalink / raw)
To: Aurelien Jarno, qemu-devel; +Cc: Corentin Chary, Richard Henderson
Il 22/12/2013 12:32, Aurelien Jarno ha scritto:
> find_first_bit has started to be used heavily in TCG code. The current
> implementation based on find_next_bit is not optimal and can't be
> optimized be the compiler if the bit array has a fixed size, which is
> the case most of the time.
If you mean by fully unrolling the loop, that's right.
However...
> - return find_next_bit(addr, size, 0);
> + unsigned long result, tmp;
> +
> + for (result = 0; result < size; result += BITS_PER_LONG) {
> + tmp = *addr++;
> + if (tmp) {
> + result += ctzl(tmp);
> + return result < size ? result : size;
> + }
> + }
> + /* Not found */
> + return size;
> }
>
> /**
>
... you probably want to limit this to bitmaps that are of constant
size, and small enough that the compiler will unroll them.
So it probably would be a good idea to add an
if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG)
return find_next_bit(addr, size, 0);
Not urgent since TCG is the only user of find_first_bit right now, but
worth considering.
Paolo
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit
2014-06-19 8:36 ` Paolo Bonzini
@ 2014-06-20 8:48 ` Aurelien Jarno
2014-06-20 8:58 ` Paolo Bonzini
0 siblings, 1 reply; 6+ messages in thread
From: Aurelien Jarno @ 2014-06-20 8:48 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, Corentin Chary, Richard Henderson
On Thu, Jun 19, 2014 at 10:36:18AM +0200, Paolo Bonzini wrote:
> Il 22/12/2013 12:32, Aurelien Jarno ha scritto:
> >find_first_bit has started to be used heavily in TCG code. The current
> >implementation based on find_next_bit is not optimal and can't be
> >optimized be the compiler if the bit array has a fixed size, which is
> >the case most of the time.
>
> If you mean by fully unrolling the loop, that's right.
>
> However...
>
> >- return find_next_bit(addr, size, 0);
> >+ unsigned long result, tmp;
> >+
> >+ for (result = 0; result < size; result += BITS_PER_LONG) {
> >+ tmp = *addr++;
> >+ if (tmp) {
> >+ result += ctzl(tmp);
> >+ return result < size ? result : size;
> >+ }
> >+ }
> >+ /* Not found */
> >+ return size;
> > }
> >
> > /**
> >
>
> ... you probably want to limit this to bitmaps that are of constant
> size, and small enough that the compiler will unroll them.
>
> So it probably would be a good idea to add an
>
> if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG)
> return find_next_bit(addr, size, 0);
>
> Not urgent since TCG is the only user of find_first_bit right now,
> but worth considering.
In practice on x86_64, this function takes 27 instructions in the
general case, and 18 instructions in the fixed case, even for big
sizes. I therefore think that checking if the size is constant is a good
idea, but we should not make any test on the size itself and trust the
compiler to correctly decide if the loop should be unrolled or not.
I will send a patch about that in the next days.
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit
2014-06-20 8:48 ` Aurelien Jarno
@ 2014-06-20 8:58 ` Paolo Bonzini
2014-06-20 9:43 ` Aurelien Jarno
0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2014-06-20 8:58 UTC (permalink / raw)
To: Aurelien Jarno; +Cc: qemu-devel, Corentin Chary, Richard Henderson
Il 20/06/2014 10:48, Aurelien Jarno ha scritto:
> In practice on x86_64, this function takes 27 instructions in the
> general case, and 18 instructions in the fixed case, even for big
> sizes. I therefore think that checking if the size is constant is a good
> idea, but we should not make any test on the size itself and trust the
> compiler to correctly decide if the loop should be unrolled or not.
But if the size is large enough that the compiler will (likely) not
unroll the function, then it should pay off to use the more optimized
code in find_next_bit.
This of course is unless you expect find_first_bit to return a small
value and not be used in a loop; and dually expect find_next_bit's usage
to be more like walking sparser bitmaps in a loop.
This actually makes sense, and then there's no need to change anything.
Paolo
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit
2014-06-20 8:58 ` Paolo Bonzini
@ 2014-06-20 9:43 ` Aurelien Jarno
0 siblings, 0 replies; 6+ messages in thread
From: Aurelien Jarno @ 2014-06-20 9:43 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, Corentin Chary, Richard Henderson
On Fri, Jun 20, 2014 at 10:58:31AM +0200, Paolo Bonzini wrote:
> Il 20/06/2014 10:48, Aurelien Jarno ha scritto:
> >In practice on x86_64, this function takes 27 instructions in the
> >general case, and 18 instructions in the fixed case, even for big
> >sizes. I therefore think that checking if the size is constant is a good
> >idea, but we should not make any test on the size itself and trust the
> >compiler to correctly decide if the loop should be unrolled or not.
>
> But if the size is large enough that the compiler will (likely) not
> unroll the function, then it should pay off to use the more
> optimized code in find_next_bit.
The point there is that given find_next_bit is a generalized version of
find_first_bit, it is actually slower. I originally noticed that by
running profiling tools and noticing this function appeared relatively
high for what it is supposed to do.
> This of course is unless you expect find_first_bit to return a small
> value and not be used in a loop; and dually expect find_next_bit's
> usage to be more like walking sparser bitmaps in a loop.
I think that's the point. In the TCG case, this is used to map the
temp allocation to answer the question "give me a free temp". That said
people might invent new usages.
> This actually makes sense, and then there's no need to change anything.
>
> Paolo
>
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2014-06-20 9:44 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-12-22 11:32 [Qemu-devel] [PATCH] bitops: provide an inline implementation of find_first_bit Aurelien Jarno
2013-12-22 17:04 ` Richard Henderson
2014-06-19 8:36 ` Paolo Bonzini
2014-06-20 8:48 ` Aurelien Jarno
2014-06-20 8:58 ` Paolo Bonzini
2014-06-20 9:43 ` Aurelien Jarno
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).