From: Jeff King <peff@peff.net>
To: "SZEDER Gábor" <szeder@ira.uka.de>
Cc: "Ramsay Jones" <ramsay@ramsay1.demon.co.uk>,
"Vicent Marti" <tanoku@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Torsten Bögershausen" <tboegi@web.de>,
"GIT Mailing-list" <git@vger.kernel.org>
Subject: Re: [PATCH 1/5] compat/bswap.h: Fix build on cygwin, MinGW and msvc
Date: Fri, 8 Nov 2013 14:27:49 -0800 [thread overview]
Message-ID: <20131108222749.GA19912@sigill.intra.peff.net> (raw)
In-Reply-To: <20131108004550.GA16843@goldbirke>
On Fri, Nov 08, 2013 at 01:45:50AM +0100, SZEDER Gábor wrote:
> Hi,
>
> On Thu, Nov 07, 2013 at 09:59:38PM +0000, Ramsay Jones wrote:
> > +static inline uint64_t default_bswap64(uint64_t val)
> > +{
> > + return (((val & (uint64_t)0x00000000000000ffULL) << 56) |
> > + ((val & (uint64_t)0x000000000000ff00ULL) << 40) |
> > + ((val & (uint64_t)0x0000000000ff0000ULL) << 24) |
> > + ((val & (uint64_t)0x00000000ff000000ULL) << 8) |
> > + ((val & (uint64_t)0x000000ff00000000ULL) >> 8) |
> > + ((val & (uint64_t)0x0000ff0000000000ULL) >> 24) |
> > + ((val & (uint64_t)0x00ff000000000000ULL) >> 40) |
> > + ((val & (uint64_t)0xff00000000000000ULL) >> 56));
> > +}
>
> This got me thinking.
> To swap 8 bytes this function performs 8 bitwise shifts, 8 bitwise
> ANDs and 7 bitwise ORs plus uses 8 64bit constants. We could do
> better than that:
>
> static inline uint64_t hacked_bswap64(uint64_t val)
> {
> uint64_t tmp = val << 32 | val >> 32;
> return (((tmp & (uint64_t)0xff000000ff000000ULL) >> 24) |
> ((tmp & (uint64_t)0x00ff000000ff0000ULL) >> 8) |
> ((tmp & (uint64_t)0x0000ff000000ff00ULL) << 8) |
> ((tmp & (uint64_t)0x000000ff000000ffULL) << 24));
> }
>
> This performs only 6 shifts, 4 ANDs, 4 ORs and uses 4 64bit constants.
You can do better still by using the bswap instruction. :)
I tried timing the program below.
With -O0, your version is actually _slower_ than the naive version (14s
versus 13s), and the bswap version is better than either (8s). Oddly, in
an earlier iteration of my test, your version was faster than naive,
with bswap faster still (10s, 8s, 5s, respectively). In that version, I
was not assigning the result of the bswap anywhere.
So I think the timing is highly dependent on the code surrounding the
call, but of course the asm instruction seems to always perform better.
If I turn on -O2, all three take about 1.2s. Inspecting the generated
assembly, it's because gcc converts the raw-code cases into a bswap
instruction. We end up with something like:
.L3:
subq $1, %rdx
bswap %rax
jne .L3
And according to perf, we spend all of our time on the jump. The bswap
hardly registers.
> I doubt that in normal usage git would spend enough time bswap64ing to
> make this noticeable, but it was a fun micro-optimization on a wet
> Thursday evening nevertheless :)
We do a fair number, but it's dwarfed by the real work. And from the
numbers above, I think our best bet is to use the system bswap if it's
there, and then not worry too hard about micro-optimizing the rest.
-Peff
-- >8 --
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef NAIVE
static inline uint64_t test(uint64_t val)
{
return ((val & (uint64_t)0x00000000000000ffULL) << 56)
| ((val & (uint64_t)0x000000000000ff00ULL) << 40)
| ((val & (uint64_t)0x0000000000ff0000ULL) << 24)
| ((val & (uint64_t)0x00000000ff000000ULL) << 8)
| ((val & (uint64_t)0x000000ff00000000ULL) >> 8)
| ((val & (uint64_t)0x0000ff0000000000ULL) >> 24)
| ((val & (uint64_t)0x00ff000000000000ULL) >> 40)
| ((val & (uint64_t)0xff00000000000000ULL) >> 56);
}
#elif defined(OPTIM)
static inline uint64_t test(uint64_t val)
{
uint64_t tmp = val << 32 | val >> 32;
return (((tmp & (uint64_t)0xff000000ff000000ULL) >> 24) |
((tmp & (uint64_t)0x00ff000000ff0000ULL) >> 8) |
((tmp & (uint64_t)0x0000ff000000ff00ULL) << 8) |
((tmp & (uint64_t)0x000000ff000000ffULL) << 24));
}
#elif defined(ASM)
static inline uint64_t test(uint64_t val)
{
register uint64_t v, x = val;
__asm__("bswap %q0" : "=r" (v) : "0" (x));
return v;
}
#endif
int main(int argc, char **argv)
{
unsigned long i;
uint64_t n = strtoull(argv[1], NULL, 10);
for (i = 0; i < 1000000000; i++)
n = test(n);
/* convince gcc that we really want the output value, so
* it won't optimize out the whole program */
printf("%d\n", (int)n);
return 0;
}
prev parent reply other threads:[~2013-11-08 22:27 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-07 21:59 [PATCH 1/5] compat/bswap.h: Fix build on cygwin, MinGW and msvc Ramsay Jones
2013-11-08 0:45 ` SZEDER Gábor
2013-11-08 22:27 ` Jeff King [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131108222749.GA19912@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=ramsay@ramsay1.demon.co.uk \
--cc=szeder@ira.uka.de \
--cc=tanoku@gmail.com \
--cc=tboegi@web.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).