From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:44392)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1UF57l-0001wE-H6
	for qemu-devel@nongnu.org; Mon, 11 Mar 2013 11:58:58 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1UF57i-00033U-S2
	for qemu-devel@nongnu.org; Mon, 11 Mar 2013 11:58:53 -0400
Received: from mx1.redhat.com ([209.132.183.28]:22255)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1UF57i-00033P-L0
	for qemu-devel@nongnu.org; Mon, 11 Mar 2013 11:58:50 -0400
Message-ID: <513DFF33.5080400@redhat.com>
Date: Mon, 11 Mar 2013 16:58:43 +0100
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <513DDFA3.1020308@dlhnet.de> <513DE6D6.9000105@redhat.com>
	<FE8F3AB7-45F9-4B76-B2D4-46179AAD08FE@dlhnet.de>
	<513DEBCF.9050407@redhat.com>
	<E594D4E5-EDA7-4277-9910-D498E9E1E66A@dlhnet.de>
	<513DF854.80003@redhat.com>
	<32589117-17AC-4A82-820F-364CA5EEC23E@dlhnet.de>
In-Reply-To: <32589117-17AC-4A82-820F-364CA5EEC23E@dlhnet.de>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC] find_next_bit optimizations
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Lieven <pl@dlhnet.de>
Cc: Orit Wasserman <owasserm@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, Corentin Chary <corentin.chary@gmail.com>

Il 11/03/2013 16:37, Peter Lieven ha scritto:
> 
> Am 11.03.2013 um 16:29 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 11/03/2013 16:24, Peter Lieven ha scritto:
>>>
>>>> How would that be different in your patch?  But you can solve it by
>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>> BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>     const unsigned long *p = addr + BITOP_WORD(offset);
>>>     unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>     unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>     if (offset >= size) {
>>>         return size;
>>>     }
>>>     size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>     if (offset) {
>>>         tmp = *(p++);
>>>         tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>         size -= BITS_PER_LONG;
>>>         result += BITS_PER_LONG;
>>>     }
>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>> +    while (size >= 4*BITS_PER_LONG) {
>>> +        d0 = *p;
>>> +        d1 = *(p+1);
>>> +        d2 = *(p+2);
>>> +        d3 = *(p+3);
>>> +        if (d0 || d1 || d2 || d3) {
>>> +            break;
>>> +        }
>>> +        p+=4;
>>> +        result += 4*BITS_PER_LONG;
>>> +        size -= 4*BITS_PER_LONG;
>>> +    }
>>> +    while (size >= BITS_PER_LONG) {
>>>         if ((tmp = *(p++))) {
>>>             goto found_middle;
>>>         }
>>>
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> Perhaps:
>>
>>        tmp = *p;
>>        d1 = *(p+1);
>>        d2 = *(p+2);
>>        d3 = *(p+3);
>>        if (tmp) {
>>            goto found_middle;
>>        }
>>        if (d1 || d2 || d3) {
>>            break;
>>        }
> 
> i do not know what gcc interally makes of the d0 || d1 || d2 || d3 ?

It depends on the target and how expensive branches are.

> i would guess its sth like one addition w/ carry and 1 test?

It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.

That is, either:

     test %r8, %r8
     jz   second_loop
     test %r9, %r9
     jz   second_loop
     test %r10, %r10
     jz   second_loop
     test %r11, %r11
     jz   second_loop

or

     or %r9, %r8
     or %r11, %r10
     or %r8, %r10
     jz   second_loop

Don't let the length of the code fool you.  The processor knows how to
optimize all of these, and GCC knows too.

> your proposed change would introduce 2 tests (maybe)?

Yes, but I expect they to be fairly well predicted.

> what about this to be sure?
> 
>        tmp = *p;
>        d1 = *(p+1);
>        d2 = *(p+2);
>        d3 = *(p+3);
>        if (tmp || d1 || d2 || d3) {
>            if (tmp) {
>                goto found_middle;

I suspect that GCC would rewrite it my version (definitely if it
produces 4 compare-and-jumps; but possibly it does it even if it goes
for bitwise ORs, I haven't checked.

Regarding your other question ("one last thought. would it make sense to
update only `size`in the while loops and compute the `result` at the end
as `orgsize` - `size`?"), again the compiler knows better and might even
do this for you.  It will likely drop the p increases and use p[result],
so if you do that change you may even get the same code, only this time
p is increased and you get an extra subtraction at the end. :)

Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.

Paolo

>            }
>            break;
>        }
> 
> Peter
>