From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:44392) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UF57l-0001wE-H6 for qemu-devel@nongnu.org; Mon, 11 Mar 2013 11:58:58 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1UF57i-00033U-S2 for qemu-devel@nongnu.org; Mon, 11 Mar 2013 11:58:53 -0400 Received: from mx1.redhat.com ([209.132.183.28]:22255) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1UF57i-00033P-L0 for qemu-devel@nongnu.org; Mon, 11 Mar 2013 11:58:50 -0400 Message-ID: <513DFF33.5080400@redhat.com> Date: Mon, 11 Mar 2013 16:58:43 +0100 From: Paolo Bonzini MIME-Version: 1.0 References: <513DDFA3.1020308@dlhnet.de> <513DE6D6.9000105@redhat.com> <513DEBCF.9050407@redhat.com> <513DF854.80003@redhat.com> <32589117-17AC-4A82-820F-364CA5EEC23E@dlhnet.de> In-Reply-To: <32589117-17AC-4A82-820F-364CA5EEC23E@dlhnet.de> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [RFC] find_next_bit optimizations List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Peter Lieven Cc: Orit Wasserman , "qemu-devel@nongnu.org" , Corentin Chary Il 11/03/2013 16:37, Peter Lieven ha scritto: > > Am 11.03.2013 um 16:29 schrieb Paolo Bonzini : > >> Il 11/03/2013 16:24, Peter Lieven ha scritto: >>> >>>> How would that be different in your patch? But you can solve it by >>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking >>>> BITS_PER_LONG. >>> >>> This is what I have now: >>> >>> diff --git a/util/bitops.c b/util/bitops.c >>> index e72237a..b0dc93f 100644 >>> --- a/util/bitops.c >>> +++ b/util/bitops.c >>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size, >>> const unsigned long *p = addr + BITOP_WORD(offset); >>> unsigned long result = offset & ~(BITS_PER_LONG-1); >>> unsigned long tmp; >>> + unsigned long d0,d1,d2,d3; >>> >>> if (offset >= size) { >>> return size; >>> } >>> size -= result; >>> - offset %= BITS_PER_LONG; >>> + offset &= (BITS_PER_LONG-1); >>> if (offset) { >>> tmp = *(p++); >>> tmp &= (~0UL << offset); >>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size, >>> size -= BITS_PER_LONG; >>> result += BITS_PER_LONG; >>> } >>> - while (size & ~(BITS_PER_LONG-1)) { >>> + while (size >= 4*BITS_PER_LONG) { >>> + d0 = *p; >>> + d1 = *(p+1); >>> + d2 = *(p+2); >>> + d3 = *(p+3); >>> + if (d0 || d1 || d2 || d3) { >>> + break; >>> + } >>> + p+=4; >>> + result += 4*BITS_PER_LONG; >>> + size -= 4*BITS_PER_LONG; >>> + } >>> + while (size >= BITS_PER_LONG) { >>> if ((tmp = *(p++))) { >>> goto found_middle; >>> } >>> >> >> Minus the %= vs. &=, >> >> Reviewed-by: Paolo Bonzini >> >> Perhaps: >> >> tmp = *p; >> d1 = *(p+1); >> d2 = *(p+2); >> d3 = *(p+3); >> if (tmp) { >> goto found_middle; >> } >> if (d1 || d2 || d3) { >> break; >> } > > i do not know what gcc interally makes of the d0 || d1 || d2 || d3 ? It depends on the target and how expensive branches are. > i would guess its sth like one addition w/ carry and 1 test? It could be either 4 compare-and-jump sequences, or 3 bitwise ORs followed by a compare-and-jump. That is, either: test %r8, %r8 jz second_loop test %r9, %r9 jz second_loop test %r10, %r10 jz second_loop test %r11, %r11 jz second_loop or or %r9, %r8 or %r11, %r10 or %r8, %r10 jz second_loop Don't let the length of the code fool you. The processor knows how to optimize all of these, and GCC knows too. > your proposed change would introduce 2 tests (maybe)? Yes, but I expect they to be fairly well predicted. > what about this to be sure? > > tmp = *p; > d1 = *(p+1); > d2 = *(p+2); > d3 = *(p+3); > if (tmp || d1 || d2 || d3) { > if (tmp) { > goto found_middle; I suspect that GCC would rewrite it my version (definitely if it produces 4 compare-and-jumps; but possibly it does it even if it goes for bitwise ORs, I haven't checked. Regarding your other question ("one last thought. would it make sense to update only `size`in the while loops and compute the `result` at the end as `orgsize` - `size`?"), again the compiler knows better and might even do this for you. It will likely drop the p increases and use p[result], so if you do that change you may even get the same code, only this time p is increased and you get an extra subtraction at the end. :) Bottom line: don't try to outsmart an optimizing C compiler on micro-optimization, unless you have benchmarked it and it shows there is a problem. Paolo > } > break; > } > > Peter >