From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751751AbdGYLuf (ORCPT ); Tue, 25 Jul 2017 07:50:35 -0400 Received: from usa-sjc-mx-foss1.foss.arm.com ([217.140.101.70]:45772 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750919AbdGYLue (ORCPT ); Tue, 25 Jul 2017 07:50:34 -0400 Date: Tue, 25 Jul 2017 12:50:40 +0100 From: Will Deacon To: Peter Zijlstra Cc: torvalds@linux-foundation.org, akpm@linux-foundation.org, dave@stgolabs.net, aksgarg1989@gmail.com, tglx@linutronix.de, mingo@kernel.org, joe@perches.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/3] lib/int_sqrt: Optimize initial value compute Message-ID: <20170725115040.GB27768@arm.com> References: <20170724151630.447009076@infradead.org> <20170724153455.710696863@infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170724153455.710696863@infradead.org> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Peter, On Mon, Jul 24, 2017 at 05:16:32PM +0200, Peter Zijlstra wrote: > The initial value (@m) compute is: > > m = 1UL << (BITS_PER_LONG - 2); > while (m > x) > m >>= 2; > > Which is a linear search for the highest even bit smaller or equal to @x > We can implement this using a binary search using __fls() (or better > when its hardware implemented). > > m = 1UL << (__fls(x) & ~1UL); > > Especially for small values of @x; which are the more common > arguments; the linear search is near to worst case, while the binary > search of __fls() is a constant 6 branches. > > cycles: branches: branch-misses: > > PRE: > > hot: 43.633557 +- 0.034373 45.333132 +- 0.002277 0.023529 +- 0.000681 > cold: 207.438411 +- 0.125840 45.333132 +- 0.002277 6.976486 +- 0.004219 > > SOFTWARE FLS: > > hot: 29.576176 +- 0.028850 26.666730 +- 0.004511 0.019463 +- 0.000663 > cold: 165.947136 +- 0.188406 26.666746 +- 0.004511 6.133897 +- 0.004386 > > HARDWARE FLS: > > hot: 24.720922 +- 0.025161 20.666784 +- 0.004509 0.020836 +- 0.000677 > cold: 132.777197 +- 0.127471 20.666776 +- 0.004509 5.080285 +- 0.003874 > > Averages computed over all values <128k using a LFSR to generate > order. Cold numbers have a LFSR based branch trace buffer 'confuser' > ran between each int_sqrt() invocation. The hardware fls version works nicely for arm64, where it can be implemented using the clz instruction (via the __builtin_clzl intrinsic). Acked-by: Will Deacon Cheers, Will