From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1Illca-0004d2-Jh
	for qemu-devel@nongnu.org; Sat, 27 Oct 2007 09:22:36 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1IllcZ-0004cg-4K
	for qemu-devel@nongnu.org; Sat, 27 Oct 2007 09:22:36 -0400
Received: from [199.232.76.173] (helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1IllcY-0004cb-Sl
	for qemu-devel@nongnu.org; Sat, 27 Oct 2007 09:22:34 -0400
Received: from bangui.magic.fr ([195.154.194.245])
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60)
	id 1IllcY-0001up-FS
	for qemu-devel@nongnu.org; Sat, 27 Oct 2007 09:22:34 -0400
Subject: Re: [Qemu-devel] Mips 64 emulation not compiling
From: "J. Mayer"
References: <1193222474.16781.236.camel@rapid>
	<20071027111939.GH29176@networkno.de>
	<1193487891.16781.280.camel@rapid>
Content-Type: text/plain
Date: Sat, 27 Oct 2007 15:22:04 +0200
Message-Id: <1193491325.16781.298.camel@rapid>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Reply-To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org
To: Blue Swirl
Cc: qemu-devel@nongnu.org

On Sat, 2007-10-27 at 16:01 +0300, Blue Swirl wrote:
> On 10/27/07, J. Mayer wrote:
> > I also got optimized versions of bit population count which could also
> > be shared:
> > static always_inline int ctpop32 (uint32_t val)
> > {
> >     int i;
> >
> >     for (i = 0; val != 0; i++)
> >         val = val & (val - 1);
> >
> >     return i;
> > }
> >
> > If you prefer, I can add those shared functions (ctz32, ctz64, cto32,
> > cto64, ctpop32, ctpop64) later, as they do not seem as widely used as
> > clxxx functions.
>
> This would be interesting for Sparc64. Could you compare your version
> to do_popc() in target-sparc/op_helper.c?

My feeling is: my implementation does n loops, n being the number of
bits set in the word, so it should be faster than yours when only a
few bits are set.
Your implementation could be better because:
- it has a fixed cost
- it does not do any tests / jumps / loops
The drawback of your implementation is that it generates a lot of code,
and thus could never be used directly in micro-ops: on my amd64 host,
my implementation compiles to 36 bytes of code, and the 64 bits version
does not generate more code than the 32 bits one. Your (64 bits only)
implementation compiles to 217 bytes of code. On x86, my 32 bits
version is 49 bytes long, the 64 bits one is 79 bytes long and yours is
323 bytes long. But this would never be a problem when called from a
helper.
So I'm not really sure which is the best choice here... We may have to
run tests to see which of the two implementations is more efficient.

-- 
J. Mayer
Never organized
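
For reference, here is a minimal sketch of the two approaches being
compared. The names ctpop32_loop and ctpop32_fixed are hypothetical and
only serve the illustration. The fixed-cost variant is a generic
mask-and-add bit count, written to show the "no tests / jumps / loops"
property; it is not a copy of do_popc() in target-sparc/op_helper.c,
and the byte sizes quoted above do not apply to it.

```c
#include <assert.h>
#include <stdint.h>

/* Loop version: each iteration clears the lowest set bit, so the
 * loop runs once per bit set in val. */
static inline int ctpop32_loop(uint32_t val)
{
    int i;

    for (i = 0; val != 0; i++)
        val &= val - 1;

    return i;
}

/* Fixed-cost version (illustrative assumption, not do_popc()):
 * sums adjacent bit fields of width 1, 2, 4, 8 and 16 in turn.
 * Always five steps, with no tests, jumps or loops. */
static inline int ctpop32_fixed(uint32_t val)
{
    val = (val & 0x55555555u) + ((val >> 1) & 0x55555555u);
    val = (val & 0x33333333u) + ((val >> 2) & 0x33333333u);
    val = (val & 0x0f0f0f0fu) + ((val >> 4) & 0x0f0f0f0fu);
    val = (val & 0x00ff00ffu) + ((val >> 8) & 0x00ff00ffu);
    val = (val & 0x0000ffffu) + ((val >> 16) & 0x0000ffffu);

    return (int)val;
}

/* Sanity check: both versions must agree on every sample. */
static inline void ctpop32_selfcheck(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x80000000u, 0x00010001u, 0xf0f0f0f0u, 0xffffffffu
    };
    unsigned int n;

    for (n = 0; n < sizeof(samples) / sizeof(samples[0]); n++)
        assert(ctpop32_loop(samples[n]) == ctpop32_fixed(samples[n]));
}
```

A real comparison would still need timing runs on representative
inputs, since the loop version's cost depends on how many bits are set
while the fixed version's does not.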