Re: [PATCH v2] x86/crc32: use builtins to improve code generation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Laight <david.laight.linux@gmail.com>
To: Bill Wendling <morbo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
	<x86@kernel.org>, Eric Biggers <ebiggers@kernel.org>,
	Ard Biesheuvel <ardb@kernel.org>,
	Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
	Justin Stitt <justinstitt@google.com>,
	LKML <linux-kernel@vger.kernel.org>,
	linux-crypto@vger.kernel.org,
	clang-built-linux <llvm@lists.linux.dev>
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Date: Tue, 4 Mar 2025 20:52:52 +0000	[thread overview]
Message-ID: <20250304205252.59a04955@pumpkin> (raw)
In-Reply-To: <20250304043223.68ed310f@pumpkin>

On Tue, 4 Mar 2025 04:32:23 +0000
David Laight <david.laight.linux@gmail.com> wrote:

....
> > For reference, GCC does much better with code gen, but only with the builtin:
> > 
> > .L39:
> >         crc32q  (%rax), %rbx    # MEM[(long unsigned int *)p_40], tmp120
> >         addq    $8, %rax        #, p
> >         cmpq    %rcx, %rax      # _37, p
> >         jne     .L39    #,  
> 
> That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> then you either need to unroll once (no point doing any more) or use
> the loop that does negative offsets from the end.

Thinking while properly awake the 1% difference isn't going to be a
difference between the above and Clang's unrolled loop.
Clang's loop will do 8 bytes every three clocks, if the above is slower
it'll be doing 8 bytes in 4 clocks (ok, you can get 3.5 - but unlikely)
which would be either 25% or 33% depending which way you measure it.

...
> I'll find the code loop I use - machine isn't powered on at the moment.

#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;
static void init_pmc(void)
{
        static struct perf_event_attr perf_attr = {
                .type = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CPU_CYCLES,
                .pinned = 1,
        };
        struct perf_event_mmap_page *pc;

        int perf_fd;
        perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
        if (perf_fd < 0) {
                fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
                exit(1);
        }
        pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
        if (pc == MAP_FAILED) {
                fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
                exit(1);
        }
        pmc_id = pc->index - 1;
}

static inline unsigned int rdpmc(id)
{
        unsigned int low, high;

// You need something to force the instruction pipeline to finish.
// lfence might be enough.
#ifndef NOFENCE
        asm volatile("mfence");
#endif
        asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
        asm volatile("mfence");
#endif

        // return low bits, counter might to 32 or 40 bits wide.
        return low;
}

The test code is then something like:
#define PASSES 10
        unsigned int ticks[PASSES];
        unsigned int tick;
        unsigned int i;

        for (i = 0; i < PASSES; i++) {
                tick = rdpmc(pmc_id);
                test_fn(buf, len);
                ticks[i] = rdpmc(pmc_id) - tick;
        }

        for (i = 0; i < PASSES; i++)
                printf(" %5d", ticks[i]);

Make sure the data is in the l1-cache (or that dominates).
The values output for passes 2-10 are likely to be the same to within
a clock or two.
I probably tried to subtract an offset for an empty test_fn().
But you can easily work out the 'clocks per loop iteration'
(which is what you are trying to measure) by measuring two separate
loop lengths.

I did find that sometimes running the program gave slow results.
But it is usually very consistent.
Needs to be run as root.
Clearly a hardware interrupt will generate a very big number.
But they don't happen.

The copy I found was used for measuring ip checksum algorithms.
Seems to output:
$ sudo ./ipcsum 
                0     0   160   160   160   160   160   160   160   160   160   160  overhead
 3637b4f0b942c3c4  682f   316    25    26    26    26    26    26    26    26    26  csum_partial
 3637b4f0b942c3c4  682f   124    79    43    25    25    25    24    26    25    24  csum_partial_1
 3637b4f0b942c3c4  682f   166    43    25    25    24    24    24    24    24    24  csum_new adc pair
 3637b4f0b942c3c4  682f   115    21    21    21    21    21    21    21    21    21  adc_dec_2
 3637b4f0b942c3c4  682f    97    34    31    23    24    24    24    24    24    23  adc_dec_4
 3637b4f0b942c3c4  682f    39    33    34    21    21    21    21    21    21    21  adc_dec_8
 3637b4f0b942c3c4  682f    81    52    49    52    49    26    25    27    25    26  adc_jcxz_2
 3637b4f0b942c3c4  682f    62    46    24    24    24    24    24    24    24    24  adc_jcxz_4
 3637b4f0b942c3c4  682f   224    40    21    21    23    23    23    23    23    23  adc_2_pair
 3637b4f0b942c3c4  682f    42    36    37    22    22    22    22    22    22    22  adc_4_pair_old
 3637b4f0b942c3c4  682f    42    37    34    41    23    23    23    23    23    23  adc_4_pair
 3637b4f0b942c3c4  682f   122    19    20    19    18    19    18    19    18    19  adcx_adox
        bef7a78a9  682f   104    51    30    30    30    30    30    30    30    30  add_c_16
        bef7a78a9  682f   143    50    50    27    27    27    27    27    27    27  add_c_32
        6ef7a78ae  682f   103    91    45    34    34    34    35    34    34    34  add_c_high

I don't think the current one is in there - IIRC it is as fast as the adcx_adox one
but more portable.


	David

next prev parent reply	other threads:[~2025-03-04 20:52 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-27  6:12 [PATCH] x86/crc32: use builtins to improve code generation Bill Wendling
2025-02-27  6:28 ` Eric Biggers
2025-02-27  7:08   ` Bill Wendling
2025-02-28  2:08     ` Eric Biggers
2025-02-27 10:52   ` H. Peter Anvin
2025-02-27 12:17     ` Bill Wendling
2025-02-27 20:56       ` Bill Wendling
2025-02-27 16:26 ` Dave Hansen
2025-02-27 20:57   ` Bill Wendling
2025-02-27 21:03     ` Dave Hansen
2025-02-27 23:47 ` [PATCH v2] " Bill Wendling
2025-02-28 21:20   ` Eric Biggers
2025-02-28 21:29     ` Bill Wendling
2025-03-03 20:15   ` David Laight
2025-03-03 20:27     ` Bill Wendling
2025-03-03 22:42       ` David Laight
2025-03-03 23:57         ` H. Peter Anvin
2025-03-04  0:16           ` Bill Wendling
2025-03-04  0:43             ` H. Peter Anvin
2025-03-04  4:32             ` David Laight
2025-03-04 20:52               ` David Laight [this message]
2025-03-04 21:52                 ` Eric Biggers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250304205252.59a04955@pumpkin \
    --to=david.laight.linux@gmail.com \
    --cc=ardb@kernel.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=ebiggers@kernel.org \
    --cc=hpa@zytor.com \
    --cc=justinstitt@google.com \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=llvm@lists.linux.dev \
    --cc=mingo@redhat.com \
    --cc=morbo@google.com \
    --cc=nathan@kernel.org \
    --cc=nick.desaulniers+lkml@gmail.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.