Re: [PATCH v2] x86/crc32: use builtins to improve code generation

From: David Laight <david.laight.linux@gmail.com>
To: Bill Wendling <morbo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
	<x86@kernel.org>, Eric Biggers <ebiggers@kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
	Justin Stitt <justinstitt@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-crypto@vger.kernel.org,
	clang-built-linux <llvm@lists.linux.dev>
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Date: Tue, 4 Mar 2025 04:32:23 +0000
Message-ID: <20250304043223.68ed310f@pumpkin>
In-Reply-To: <CAGG=3QUnUQL2=YxN2ozwSba2A_x-S7sAEUP5oGhCWOzu4Q9SQA@mail.gmail.com>

On Mon, 3 Mar 2025 16:16:43 -0800
Bill Wendling <morbo@google.com> wrote:

> On Mon, Mar 3, 2025 at 3:58 PM H. Peter Anvin <hpa@zytor.com> wrote:
> > On March 3, 2025 2:42:16 PM PST, David Laight <david.laight.linux@gmail.com> wrote:  
> > >On Mon, 3 Mar 2025 12:27:21 -0800
> > >Bill Wendling <morbo@google.com> wrote:
> > >  
> > >> On Mon, Mar 3, 2025 at 12:15 PM David Laight
> > >> <david.laight.linux@gmail.com> wrote:  
> > >> > On Thu, 27 Feb 2025 15:47:03 -0800
> > >> > Bill Wendling <morbo@google.com> wrote:
> > >> >  
> > >> > > For both gcc and clang, crc32 builtins generate better code than the
> > >> > > inline asm. GCC improves, removing unneeded "mov" instructions. Clang
> > >> > > does the same and unrolls the loops. GCC has no changes on i386, but
> > >> > > Clang's code generation is vastly improved, due to Clang's "rm"
> > >> > > constraint issue.
> > >> > >
> > >> > > The number of cycles improved by ~0.1% for GCC and ~1% for Clang, which
> > >> > > is expected because of the "rm" issue. However, Clang's performance is
> > >> > > better than GCC's by ~1.5%, most likely due to loop unrolling.  
> > >> >
> > >> > How much does it unroll?
> > >> > How much you need depends on the latency of the crc32 instruction.
> > >> > The copy of Agner's tables I have gives it a latency of 3 on
> > >> > pretty much everything.
> > >> > If you can only do one chained crc instruction every three clocks
> > >> > it is hard to see how unrolling the loop will help.
> > >> > Intel cpu (since sandy bridge) will run a two clock loop.
> > >> > With three clocks to play with it should be easy (even for a compiler)
> > >> > to generate a loop with no extra clock stalls.
> > >> >
> > >> > Clearly if Clang decides to copy arguments to the stack an extra time
> > >> > that will kill things. But in this case you want the "m" constraint
> > >> > to directly read from the buffer (with a (reg,reg,8) addressing mode).
> > >> >  
> > >> Below is what Clang generates with the builtins. From what Eric said,
> > >> this code is only run for sizes <= 512 bytes? So maybe it's not super
> > >> important to micro-optimize this. I apologize, but my ability to
> > >> measure clock loops for x86 code isn't great. (I'm sure I lack the
> > >> requisite benchmarks, etc.)  
> > >
> > >Jeepers - that is trashing the I-cache.
> > >Not to mention all the conditional branches at the bottom.
> > >Consider the basic loop:
> > >1:     crc32q  (%rcx), %rbx
> > >       addq    $8, %rcx
> > >       cmp     %rcx, %rdx
> > >       jne     1b
> > >The crc32 has latency 3 so it must take at least 3 clocks.
> > >Even naively the addq can be issued in the same clock as the crc32
> > >and the cmp and jne in the following ones.
> > >Since the jne is predicted taken, the addq can be assumed to execute
> > >in the same clock as the jne.
> > >(The cmp+jne might also get merged into a single u-op)
> > >(I've done this with adc (for IP checksum), with two adc the loop takes
> > >two clocks even with the extra memory reads.)
> > >
> > >So that loop is likely to run limited by the three clock latency of crc32.
> > >Even the memory reads will happen with all the crc32 just waiting for the
> > >previous crc32 to finish.
> > >You can take an instruction out of the loop:
> > >1:     crc32q  (%rcx,%rdx), %rbx
> > >       addq    $8, %rdx
> > >       jne     1b
> > >but that may not be necessary, and (IIRC) gcc doesn't like letting you
> > >generate it.
> > >
> > >For buffers that aren't multiples of 8 bytes 'remember' that the crc of
> > >a byte depends on how far it is from the end of the buffer, and that initial
> > >zero bytes have no effect.
> > >So (provided the buffer is 8+ bytes long) read the first 8 bytes, shift
> > >right by the number of bytes needed to make the rest of the buffer a multiple
> > >of 8 bytes (the same as reading from across the start of the buffer and masking
> > >the low bytes) then treat exactly the same as a buffer that is a multiple
> > >of 8 bytes long.
> > >Don't worry about misaligned reads, you lose less than one clock per cache
> > >line (that is with adc doing a read every clock).
> > >  
> For reference, GCC does much better with code gen, but only with the builtin:
> 
> .L39:
>         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
>         addq    $8, %rax        #, p
>         cmpq    %rcx, %rax      # _37, p
>         jne     .L39    #,

That looks reasonable. If Clang's 8 unrolled crc32q are faster per byte
then you either need to unroll once (no point doing any more) or use
the loop that indexes with negative offsets from the end of the buffer.
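
A minimal sketch of that negative-offset loop in C. Everything named
here is an assumption for illustration: crc32c_step_sw() is a
hypothetical bitwise stand-in for one crc32q so the sketch runs without
SSE4.2, and crc32c_words() is not the kernel's function. The length is
assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bitwise stand-in for one crc32 instruction (CRC32C,
 * reflected polynomial 0x82F63B78) consuming the low 'nbits' of 'data'.
 * Real code would use the compiler builtin or the instruction itself.
 */
static uint32_t crc32c_step_sw(uint32_t crc, uint64_t data, int nbits)
{
	uint64_t v = crc ^ data;

	for (int i = 0; i < nbits; i++)
		v = (v >> 1) ^ ((v & 1) ? 0x82F63B78u : 0);
	return (uint32_t)v;
}

/* Endian-independent little-endian 64-bit load (what crc32q would read). */
static uint64_t load_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Index from the end of the buffer with an offset that counts up to
 * zero, so the loop needs one add and one branch and no separate cmp.
 * 'len' is assumed to be a multiple of 8.
 */
static uint32_t crc32c_words(uint32_t crc, const unsigned char *p, size_t len)
{
	const unsigned char *end = p + len;
	ptrdiff_t off = -(ptrdiff_t)len;

	while (off) {
		crc = crc32c_step_sw(crc, load_le64(end + off), 64);
		off += 8;
	}
	return crc;
}
```

Whether a compiler actually emits the (reg,reg) addressing mode with a
bare add/jne for this is not guaranteed; the generated code needs
checking.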

>         leaq    (%rsi,%rdi,8), %rsi     #, p

That is gcc being brain-dead again.
It pretty much refuses to use a loop-updated pointer (%rax above)
and recalculates it from the count.
At least it is a single instruction here and the extra
register doesn't cause a spill to stack.

> .L38:
>         andl    $7, %edx        #, len
>         je      .L41    #,
>         addq    %rsi, %rdx      # p, _11
>         movl    %ebx, %eax      # crc, <retval>
>         .p2align 4
> .L40:
>         crc32b  (%rsi), %eax    # MEM[(const u8 *)p_45], <retval>
>         addq    $1, %rsi        #, p
>         cmpq    %rsi, %rdx      # p, _11
>         jne     .L40    #,
> 
> > >Actually measuring the performance is hard.
> > >You can't use rdtsc because the clock speed will change when the cpu gets busy.
> > >There is a 'performance counter' that is actual clocks.
> > >While you can use the library functions to set it up, you need to just read the
> > >register - the library overhead is too big.
> > >You also need the odd lfence.
> > >Having done that, and provided the buffer is in the L1 d-cache you can measure
> > >the loop time in clocks and compare against the expected value.
> > >Once you've got 3 clocks per crc32 instruction it won't get any better,
> > >which is why the 'fast' code for big buffers does crc of 3+ buffers sections
> > >in parallel.
> > >  
> Thanks for the info! It'll help a lot the next time I need to delve
> deeply into performance.
> 
> I tried using rdtsc and another programmatic way of measuring timing.
> Also tried making the task have high priority, restricting to one CPU,
> etc. But the numbers weren't as consistent as I wanted them to be. The
> times I reported were based on the fastest times / clocks /
> whatever from several runs for each build.

I'll find the code loop I use - machine isn't powered on at the moment.

> 
> > >       David
> > >  
> > >>
> > >> -bw
> > >>
> > >> .LBB1_9:                                # =>This Inner Loop Header: Depth=1
> > >>         movl    %ebx, %ebx
> > >>         crc32q  (%rcx), %rbx
> > >>         addq    $8, %rcx
> > >>         incq    %rdi
> > >>         cmpq    %rdi, %rsi
> > >>         jne     .LBB1_9
> > >> # %bb.10:
> > >>         subq    %rdi, %rax
> > >>         jmp     .LBB1_11
> > >> .LBB1_7:
> > >>         movq    %r14, %rcx
> > >> .LBB1_11:
> > >>         movq    %r15, %rsi
> > >>         andq    $-8, %rsi
> > >>         cmpq    $7, %rdx
> > >>         jb      .LBB1_14
> > >> # %bb.12:
> > >>         xorl    %edx, %edx
> > >> .LBB1_13:                               # =>This Inner Loop Header: Depth=1
> > >>         movl    %ebx, %ebx
> > >>         crc32q  (%rcx,%rdx,8), %rbx
> > >>         crc32q  8(%rcx,%rdx,8), %rbx
> > >>         crc32q  16(%rcx,%rdx,8), %rbx
> > >>         crc32q  24(%rcx,%rdx,8), %rbx
> > >>         crc32q  32(%rcx,%rdx,8), %rbx
> > >>         crc32q  40(%rcx,%rdx,8), %rbx
> > >>         crc32q  48(%rcx,%rdx,8), %rbx
> > >>         crc32q  56(%rcx,%rdx,8), %rbx
> > >>         addq    $8, %rdx
> > >>         cmpq    %rdx, %rax
> > >>         jne     .LBB1_13
> > >> .LBB1_14:
> > >>         addq    %rsi, %r14
> > >> .LBB1_15:
> > >>         andq    $7, %r15
> > >>         je      .LBB1_23
> > >> # %bb.16:
> > >>         crc32b  (%r14), %ebx
> > >>         cmpl    $1, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.17:
> > >>         crc32b  1(%r14), %ebx
> > >>         cmpl    $2, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.18:
> > >>         crc32b  2(%r14), %ebx
> > >>         cmpl    $3, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.19:
> > >>         crc32b  3(%r14), %ebx
> > >>         cmpl    $4, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.20:
> > >>         crc32b  4(%r14), %ebx
> > >>         cmpl    $5, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.21:
> > >>         crc32b  5(%r14), %ebx
> > >>         cmpl    $6, %r15d
> > >>         je      .LBB1_23
> > >> # %bb.22:
> > >>         crc32b  6(%r14), %ebx
> > >> .LBB1_23:
> > >>         movl    %ebx, %eax
> > >> .LBB1_24:  
> > >
> > >  
> >
> > The tail is *weird*. Wouldn't it be better to do a 4-2-1 stepdown?

Well, provided the branches aren't mispredicted it'll be limited by
the crc32b - so three clocks per byte, max 21 for the 7 byte tail.
The 4-2-1 stepdown needs the extra address update but that may not cost
anything, and is then max 9 clocks. Also a lot less I-cache.
The code logic may not matter unless the buffer is short.
I think the cpu will be executing the tail instructions while many
of the crc32 from the main loop are still queued waiting for results
from earlier instructions (especially if you get a loop that would
run in two clocks with (say) addq instead of crc32q).
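
For illustration, a rough C sketch of the 4-2-1 stepdown for the 0..7
leftover bytes. The names are assumptions, not code from the patch:
crc32c_bits_sw() is a hypothetical bitwise stand-in for the
crc32l/crc32w/crc32b instructions so the sketch runs anywhere, and
crc32c_tail_421() is a made-up helper name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bitwise stand-in for one crc32 instruction (CRC32C,
 * reflected polynomial 0x82F63B78) consuming the low 'nbits' of 'data'.
 */
static uint32_t crc32c_bits_sw(uint32_t crc, uint64_t data, int nbits)
{
	uint64_t v = crc ^ data;

	for (int i = 0; i < nbits; i++)
		v = (v >> 1) ^ ((v & 1) ? 0x82F63B78u : 0);
	return (uint32_t)v;
}

/* Endian-independent little-endian load of n (1..8) bytes. */
static uint64_t load_le(const unsigned char *p, int n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 8) | p[n];
	return v;
}

/*
 * Tail of 0..7 bytes: at most three dependent crc32 steps
 * (4 bytes, then 2, then 1) instead of up to seven byte steps.
 */
static uint32_t crc32c_tail_421(uint32_t crc, const unsigned char *p, size_t len)
{
	if (len & 4) {
		crc = crc32c_bits_sw(crc, load_le(p, 4), 32);	/* crc32l */
		p += 4;
	}
	if (len & 2) {
		crc = crc32c_bits_sw(crc, load_le(p, 2), 16);	/* crc32w */
		p += 2;
	}
	if (len & 1)
		crc = crc32c_bits_sw(crc, *p, 8);		/* crc32b */
	return crc;
}
```

With a latency of 3 per crc32 that is the 9 clock worst case, and the
pointer updates are off the dependency chain.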

> Definitely on the weird side! I considered hard-coding something like
> that, but thought it might be a bit convoluted, though certainly less
> convoluted than what we generate now. A simple loop is probably all
> that's needed, because it should only need to be done at most seven
> times.

The byte loop should be limited by the crc32b. So probably as fast
as that unrolled mess, although it will always have a mispredicted
branch (or two) - I suspect all loops do.

	David

Thread overview: 22+ messages
2025-02-27  6:12 [PATCH] x86/crc32: use builtins to improve code generation Bill Wendling
2025-02-27  6:28 ` Eric Biggers
2025-02-27  7:08   ` Bill Wendling
2025-02-28  2:08     ` Eric Biggers
2025-02-27 10:52   ` H. Peter Anvin
2025-02-27 12:17     ` Bill Wendling
2025-02-27 20:56       ` Bill Wendling
2025-02-27 16:26 ` Dave Hansen
2025-02-27 20:57   ` Bill Wendling
2025-02-27 21:03     ` Dave Hansen
2025-02-27 23:47 ` [PATCH v2] " Bill Wendling
2025-02-28 21:20   ` Eric Biggers
2025-02-28 21:29     ` Bill Wendling
2025-03-03 20:15   ` David Laight
2025-03-03 20:27     ` Bill Wendling
2025-03-03 22:42       ` David Laight
2025-03-03 23:57         ` H. Peter Anvin
2025-03-04  0:16           ` Bill Wendling
2025-03-04  0:43             ` H. Peter Anvin
2025-03-04  4:32             ` David Laight [this message]
2025-03-04 20:52               ` David Laight
2025-03-04 21:52                 ` Eric Biggers