From: David Laight <david.laight.linux@gmail.com>
To: Bill Wendling <morbo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
<x86@kernel.org>, Eric Biggers <ebiggers@kernel.org>,
Ard Biesheuvel <ardb@kernel.org>,
Nathan Chancellor <nathan@kernel.org>,
Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
Justin Stitt <justinstitt@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-crypto@vger.kernel.org,
clang-built-linux <llvm@lists.linux.dev>
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Date: Tue, 4 Mar 2025 20:52:52 +0000
Message-ID: <20250304205252.59a04955@pumpkin>
In-Reply-To: <20250304043223.68ed310f@pumpkin>
On Tue, 4 Mar 2025 04:32:23 +0000
David Laight <david.laight.linux@gmail.com> wrote:
....
> > For reference, GCC does much better with code gen, but only with the builtin:
> >
> > .L39:
> > crc32q (%rax), %rbx # MEM[(long unsigned int *)p_40], tmp120
> > addq $8, %rax #, p
> > cmpq %rcx, %rax # _37, p
> > jne .L39 #,
>
> That looks reasonable, if Clang's 8 unrolled crc32q is faster per byte
> then you either need to unroll once (no point doing any more) or use
> the loop that does negative offsets from the end.
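
For reference, the "negative offsets from the end" form mentioned above is roughly the following. This is only a sketch: crc32c_u64() is a hypothetical portable stand-in for a single crc32q step so that it builds without -msse4.2, and crc_fwd()/crc_neg() are names made up here, not kernel functions. It assumes len is a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical portable stand-in for one crc32q step (CRC32C, reflected
 * polynomial 0x82f63b78), so the sketch builds without -msse4.2.
 */
static uint64_t crc32c_u64(uint64_t crc, uint64_t data)
{
	int i;

	crc = (uint32_t)crc ^ data;
	for (i = 0; i < 64; i++)
		crc = (crc >> 1) ^ (0x82f63b78u & -(crc & 1));
	return crc;
}

/* Pointer-compare loop, the shape of the GCC output quoted above. */
static uint32_t crc_fwd(uint32_t crc, const unsigned char *p, size_t len)
{
	const unsigned char *end = p + len;
	uint64_t c = crc, v;

	for (; p != end; p += 8) {
		memcpy(&v, p, 8);
		c = crc32c_u64(c, v);
	}
	return (uint32_t)c;
}

/* Same work, indexed by a negative offset that counts up to zero. */
static uint32_t crc_neg(uint32_t crc, const unsigned char *p, size_t len)
{
	const unsigned char *end = p + len;
	ptrdiff_t off = -(ptrdiff_t)len;
	uint64_t c = crc, v;

	for (; off != 0; off += 8) {
		memcpy(&v, end + off, 8);
		c = crc32c_u64(c, v);
	}
	return (uint32_t)c;
}
```

With the offset running up to zero, the add that advances it can also set the flags for the loop branch, so the compiler has the option of dropping the separate cmpq. Whether it actually does so is up to the compiler.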
Thinking about this while properly awake: the 1% difference isn't going to be
the difference between the loop above and Clang's unrolled loop.
Clang's loop will do 8 bytes every three clocks; if the above is slower
it'll be doing 8 bytes in 4 clocks (ok, you can get 3.5, but that is unlikely),
which would be either 25% or 33% depending on which way you measure it.
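
The 25% and 33% figures come from the same one-clock gap measured against different bases. A minimal sketch of the arithmetic (pct_diff() is a name made up for this note):

```c
/*
 * Percentage gap between two clocks-per-iteration figures, measured
 * against 'base'. Hypothetical helper, only here to show the arithmetic.
 */
static double pct_diff(unsigned int fast, unsigned int slow, unsigned int base)
{
	return 100.0 * (double)(slow - fast) / (double)base;
}
```

pct_diff(3, 4, 4) gives 25 (the 3-clock loop takes 25% less time than the 4-clock one) and pct_diff(3, 4, 3) gives about 33 (the 4-clock loop takes a third longer than the 3-clock one). Neither is anywhere near 1%.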
...
> I'll find the code loop I use - machine isn't powered on at the moment.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;

static void init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;
	int perf_fd;

	// Count cycles for the calling thread (pid 0) on any cpu (-1), no group.
	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}

	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}

	// 'index' is the hardware counter number plus one (0 means rdpmc is unusable).
	pmc_id = pc->index - 1;
}

static inline unsigned int rdpmc(unsigned int id)
{
	unsigned int low, high;

	// You need something to force the instruction pipeline to finish.
	// lfence might be enough.
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
	asm volatile("mfence");
#endif

	// Return the low bits, the counter might be 32 or 40 bits wide.
	return low;
}
The test code is then something like:

#define PASSES 10
unsigned int ticks[PASSES];
unsigned int tick;
unsigned int i;

for (i = 0; i < PASSES; i++) {
	tick = rdpmc(pmc_id);
	test_fn(buf, len);
	ticks[i] = rdpmc(pmc_id) - tick;
}

for (i = 0; i < PASSES; i++)
	printf(" %5u", ticks[i]);
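
Only the low 32 bits of the counter are used, and that is enough because the subtraction is done in unsigned arithmetic: it stays correct across a wrap as long as the measured interval is shorter than 2^32 clocks. A small sketch (tick_delta() is a made-up name, not part of the test program):

```c
#include <stdint.h>

/*
 * Elapsed clocks between two 32-bit counter reads. Unsigned subtraction
 * is modulo 2^32, so a single wrap between the reads does not matter.
 */
static inline uint32_t tick_delta(uint32_t start, uint32_t end)
{
	return end - start;
}
```

For example, a start value of 0xfffffff0 and an end value of 0x10 still give a delta of 0x20.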
Make sure the data is in the L1 cache (otherwise the cache misses dominate).
The values output for passes 2-10 are likely to be the same to within
a clock or two.
I probably tried to subtract an offset for an empty test_fn().
But you can easily work out the 'clocks per loop iteration'
(which is what you are trying to measure) by measuring two separate
loop lengths.
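
The two-length calculation can be sketched as below. The struct and function names are made up for this note, and the numbers in the example that follows are invented for illustration, not measurements.

```c
/* Result of fitting ticks = fixed + per_iter * iterations to two points. */
struct loop_cost {
	double per_iter;	/* clocks per loop iteration */
	double fixed;		/* call, setup and rdpmc overhead */
};

/*
 * n1/n2 are the iteration counts of the two runs (n1 != n2),
 * t1/t2 the steady-state tick counts measured for them.
 */
static struct loop_cost loop_cost_from_two(unsigned int n1, unsigned int t1,
					   unsigned int n2, unsigned int t2)
{
	struct loop_cost c;

	c.per_iter = ((double)t2 - (double)t1) / ((double)n2 - (double)n1);
	c.fixed = (double)t1 - c.per_iter * (double)n1;
	return c;
}
```

With 500 ticks for 100 iterations and 800 ticks for 200 iterations this gives 3 clocks per iteration and 200 clocks of fixed overhead, without needing a separate measurement of an empty test_fn().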
I did find that sometimes running the program gave slow results.
But it is usually very consistent.
Needs to be run as root.
Clearly a hardware interrupt will generate a very big number.
But they don't happen.
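
If a single figure per function is wanted, one option is to report the minimum over the passes, which ignores the cold first pass and any pass that did get hit by something slow. That is an assumption on my part about how to summarise the numbers, and min_ticks() is a made-up name:

```c
/*
 * Smallest tick count over n passes (n >= 1). The cold first pass and
 * any interrupted pass are larger, so they drop out.
 */
static unsigned int min_ticks(const unsigned int *ticks, unsigned int n)
{
	unsigned int best = ticks[0];
	unsigned int i;

	for (i = 1; i < n; i++)
		if (ticks[i] < best)
			best = ticks[i];
	return best;
}
```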
The copy I found was used for measuring ip checksum algorithms.
Seems to output:
$ sudo ./ipcsum
               0    0   160   160   160   160   160   160   160   160   160   160 overhead
3637b4f0b942c3c4 682f   316    25    26    26    26    26    26    26    26    26 csum_partial
3637b4f0b942c3c4 682f   124    79    43    25    25    25    24    26    25    24 csum_partial_1
3637b4f0b942c3c4 682f   166    43    25    25    24    24    24    24    24    24 csum_new adc pair
3637b4f0b942c3c4 682f   115    21    21    21    21    21    21    21    21    21 adc_dec_2
3637b4f0b942c3c4 682f    97    34    31    23    24    24    24    24    24    23 adc_dec_4
3637b4f0b942c3c4 682f    39    33    34    21    21    21    21    21    21    21 adc_dec_8
3637b4f0b942c3c4 682f    81    52    49    52    49    26    25    27    25    26 adc_jcxz_2
3637b4f0b942c3c4 682f    62    46    24    24    24    24    24    24    24    24 adc_jcxz_4
3637b4f0b942c3c4 682f   224    40    21    21    23    23    23    23    23    23 adc_2_pair
3637b4f0b942c3c4 682f    42    36    37    22    22    22    22    22    22    22 adc_4_pair_old
3637b4f0b942c3c4 682f    42    37    34    41    23    23    23    23    23    23 adc_4_pair
3637b4f0b942c3c4 682f   122    19    20    19    18    19    18    19    18    19 adcx_adox
       bef7a78a9 682f   104    51    30    30    30    30    30    30    30    30 add_c_16
       bef7a78a9 682f   143    50    50    27    27    27    27    27    27    27 add_c_32
       6ef7a78ae 682f   103    91    45    34    34    34    35    34    34    34 add_c_high
I don't think the current one is in there - IIRC it is as fast as the adcx_adox one
but more portable.
David