From: David Laight <david.laight.linux@gmail.com>
To: Bill Wendling <morbo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
<x86@kernel.org>, Eric Biggers <ebiggers@kernel.org>,
Ard Biesheuvel <ardb@kernel.org>,
Nathan Chancellor <nathan@kernel.org>,
Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
Justin Stitt <justinstitt@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-crypto@vger.kernel.org,
clang-built-linux <llvm@lists.linux.dev>
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Date: Tue, 4 Mar 2025 20:52:52 +0000 [thread overview]
Message-ID: <20250304205252.59a04955@pumpkin> (raw)
In-Reply-To: <20250304043223.68ed310f@pumpkin>
On Tue, 4 Mar 2025 04:32:23 +0000
David Laight <david.laight.linux@gmail.com> wrote:
....
> > For reference, GCC does much better with code gen, but only with the builtin:
> >
> > .L39:
> > crc32q (%rax), %rbx # MEM[(long unsigned int *)p_40], tmp120
> > addq $8, %rax #, p
> > cmpq %rcx, %rax # _37, p
> > jne .L39 #,
>
> That looks reasonable. If Clang's 8-way unrolled crc32q is faster per
> byte, then you either need to unroll once (no point doing any more) or
> use the loop that does negative offsets from the end.
Thinking about it while properly awake: the 1% difference can't be the
difference between the above and Clang's unrolled loop.
Clang's loop will do 8 bytes every three clocks; if the above is slower
it'll be doing 8 bytes in 4 clocks (OK, you can get 3.5, but that's unlikely),
which would be a 25% or 33% difference depending on which way you measure it.
...
> I'll find the code loop I use - machine isn't powered on at the moment.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;

static void init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;
	int perf_fd;

	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}
	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}
	pmc_id = pc->index - 1;
}
static inline unsigned int rdpmc(unsigned int id)
{
	unsigned int low, high;

	// You need something to force the instruction pipeline to finish.
	// lfence might be enough.
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	// Return the low bits; the counter might be 32 or 40 bits wide.
	return low;
}
The test code is then something like:

#define PASSES 10
	unsigned int ticks[PASSES];
	unsigned int tick;
	unsigned int i;

	for (i = 0; i < PASSES; i++) {
		tick = rdpmc(pmc_id);
		test_fn(buf, len);
		ticks[i] = rdpmc(pmc_id) - tick;
	}
	for (i = 0; i < PASSES; i++)
		printf(" %5d", ticks[i]);
Make sure the data is in the L1 cache (otherwise the cache misses
dominate the measurement).
The values output for passes 2-10 are likely to be the same to within
a clock or two.
I probably tried to subtract an offset for an empty test_fn().
But you can easily work out the 'clocks per loop iteration'
(which is what you are trying to measure) by measuring two separate
loop lengths.
I did find that sometimes running the program gave slow results,
but it is usually very consistent.
It needs to be run as root.
Clearly a hardware interrupt will generate a very big number,
but in practice they don't happen.
The copy I found was used for measuring IP checksum algorithms.
Seems to output:
$ sudo ./ipcsum
0 0 160 160 160 160 160 160 160 160 160 160 overhead
3637b4f0b942c3c4 682f 316 25 26 26 26 26 26 26 26 26 csum_partial
3637b4f0b942c3c4 682f 124 79 43 25 25 25 24 26 25 24 csum_partial_1
3637b4f0b942c3c4 682f 166 43 25 25 24 24 24 24 24 24 csum_new adc pair
3637b4f0b942c3c4 682f 115 21 21 21 21 21 21 21 21 21 adc_dec_2
3637b4f0b942c3c4 682f 97 34 31 23 24 24 24 24 24 23 adc_dec_4
3637b4f0b942c3c4 682f 39 33 34 21 21 21 21 21 21 21 adc_dec_8
3637b4f0b942c3c4 682f 81 52 49 52 49 26 25 27 25 26 adc_jcxz_2
3637b4f0b942c3c4 682f 62 46 24 24 24 24 24 24 24 24 adc_jcxz_4
3637b4f0b942c3c4 682f 224 40 21 21 23 23 23 23 23 23 adc_2_pair
3637b4f0b942c3c4 682f 42 36 37 22 22 22 22 22 22 22 adc_4_pair_old
3637b4f0b942c3c4 682f 42 37 34 41 23 23 23 23 23 23 adc_4_pair
3637b4f0b942c3c4 682f 122 19 20 19 18 19 18 19 18 19 adcx_adox
bef7a78a9 682f 104 51 30 30 30 30 30 30 30 30 add_c_16
bef7a78a9 682f 143 50 50 27 27 27 27 27 27 27 add_c_32
6ef7a78ae 682f 103 91 45 34 34 34 35 34 34 34 add_c_high
I don't think the current one is in there; IIRC it is as fast as the
adcx_adox one but more portable.
David
Thread overview: 22+ messages
2025-02-27 6:12 [PATCH] x86/crc32: use builtins to improve code generation Bill Wendling
2025-02-27 6:28 ` Eric Biggers
2025-02-27 7:08 ` Bill Wendling
2025-02-28 2:08 ` Eric Biggers
2025-02-27 10:52 ` H. Peter Anvin
2025-02-27 12:17 ` Bill Wendling
2025-02-27 20:56 ` Bill Wendling
2025-02-27 16:26 ` Dave Hansen
2025-02-27 20:57 ` Bill Wendling
2025-02-27 21:03 ` Dave Hansen
2025-02-27 23:47 ` [PATCH v2] " Bill Wendling
2025-02-28 21:20 ` Eric Biggers
2025-02-28 21:29 ` Bill Wendling
2025-03-03 20:15 ` David Laight
2025-03-03 20:27 ` Bill Wendling
2025-03-03 22:42 ` David Laight
2025-03-03 23:57 ` H. Peter Anvin
2025-03-04 0:16 ` Bill Wendling
2025-03-04 0:43 ` H. Peter Anvin
2025-03-04 4:32 ` David Laight
2025-03-04 20:52 ` David Laight [this message]
2025-03-04 21:52 ` Eric Biggers