From: David Laight <david.laight.linux@gmail.com>
To: Bill Wendling <morbo@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
<x86@kernel.org>, Eric Biggers <ebiggers@kernel.org>,
Ard Biesheuvel <ardb@kernel.org>,
Nathan Chancellor <nathan@kernel.org>,
Nick Desaulniers <nick.desaulniers+lkml@gmail.com>,
Justin Stitt <justinstitt@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-crypto@vger.kernel.org,
clang-built-linux <llvm@lists.linux.dev>
Subject: Re: [PATCH v2] x86/crc32: use builtins to improve code generation
Date: Tue, 4 Mar 2025 20:52:52 +0000 [thread overview]
Message-ID: <20250304205252.59a04955@pumpkin> (raw)
In-Reply-To: <20250304043223.68ed310f@pumpkin>
On Tue, 4 Mar 2025 04:32:23 +0000
David Laight <david.laight.linux@gmail.com> wrote:
....
> > For reference, GCC does much better with code gen, but only with the builtin:
> >
> > .L39:
> > crc32q (%rax), %rbx # MEM[(long unsigned int *)p_40], tmp120
> > addq $8, %rax #, p
> > cmpq %rcx, %rax # _37, p
> > jne .L39 #,
>
> That looks reasonable. If Clang's 8-way unrolled crc32q is faster per
> byte, then you either need to unroll once (no point doing any more) or
> use the loop that does negative offsets from the end.
Thinking about it while properly awake: the 1% difference can't be the
difference between the above and Clang's unrolled loop.
Clang's loop will do 8 bytes every three clocks; if the above is slower
it'll be doing 8 bytes in 4 clocks (OK, you can get 3.5, but that's unlikely),
which would be a 25% or 33% difference depending on which way you measure it.
...
> I'll find the code loop I use - machine isn't powered on at the moment.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>

static int pmc_id;

static void init_pmc(void)
{
	static struct perf_event_attr perf_attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.pinned = 1,
	};
	struct perf_event_mmap_page *pc;
	int perf_fd;

	perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
	if (perf_fd < 0) {
		fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
		exit(1);
	}
	pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
	if (pc == MAP_FAILED) {
		fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
		exit(1);
	}
	pmc_id = pc->index - 1;
}
static inline unsigned int rdpmc(unsigned int id)
{
	unsigned int low, high;

	// You need something to force the instruction pipeline to finish.
	// lfence might be enough.
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (id));
#ifndef NOFENCE
	asm volatile("mfence");
#endif
	// Return the low bits; the counter might be 32 or 40 bits wide.
	return low;
}
The test code is then something like:

#define PASSES 10
	unsigned int ticks[PASSES];
	unsigned int tick;
	unsigned int i;

	for (i = 0; i < PASSES; i++) {
		tick = rdpmc(pmc_id);
		test_fn(buf, len);
		ticks[i] = rdpmc(pmc_id) - tick;
	}
	for (i = 0; i < PASSES; i++)
		printf(" %5d", ticks[i]);
Make sure the data is in the L1 cache (otherwise the cache misses
dominate the measurement).
The values output for passes 2-10 are likely to be the same to within
a clock or two.
I probably tried to subtract an offset for an empty test_fn().
But you can easily work out the 'clocks per loop iteration'
(which is what you are trying to measure) by measuring two separate
loop lengths.
I did find that sometimes running the program gave slow results,
but it is usually very consistent.
It needs to be run as root.
Clearly a hardware interrupt will generate a very big number,
but in practice they don't happen.
The copy I found was used for measuring IP checksum algorithms.
Seems to output:
$ sudo ./ipcsum
0 0 160 160 160 160 160 160 160 160 160 160 overhead
3637b4f0b942c3c4 682f 316 25 26 26 26 26 26 26 26 26 csum_partial
3637b4f0b942c3c4 682f 124 79 43 25 25 25 24 26 25 24 csum_partial_1
3637b4f0b942c3c4 682f 166 43 25 25 24 24 24 24 24 24 csum_new adc pair
3637b4f0b942c3c4 682f 115 21 21 21 21 21 21 21 21 21 adc_dec_2
3637b4f0b942c3c4 682f 97 34 31 23 24 24 24 24 24 23 adc_dec_4
3637b4f0b942c3c4 682f 39 33 34 21 21 21 21 21 21 21 adc_dec_8
3637b4f0b942c3c4 682f 81 52 49 52 49 26 25 27 25 26 adc_jcxz_2
3637b4f0b942c3c4 682f 62 46 24 24 24 24 24 24 24 24 adc_jcxz_4
3637b4f0b942c3c4 682f 224 40 21 21 23 23 23 23 23 23 adc_2_pair
3637b4f0b942c3c4 682f 42 36 37 22 22 22 22 22 22 22 adc_4_pair_old
3637b4f0b942c3c4 682f 42 37 34 41 23 23 23 23 23 23 adc_4_pair
3637b4f0b942c3c4 682f 122 19 20 19 18 19 18 19 18 19 adcx_adox
bef7a78a9 682f 104 51 30 30 30 30 30 30 30 30 add_c_16
bef7a78a9 682f 143 50 50 27 27 27 27 27 27 27 add_c_32
6ef7a78ae 682f 103 91 45 34 34 34 35 34 34 34 add_c_high
I don't think the current one is in there; IIRC it is as fast as the
adcx_adox one but more portable.
David
Thread overview: 22+ messages
2025-02-27 6:12 [PATCH] x86/crc32: use builtins to improve code generation Bill Wendling
2025-02-27 6:28 ` Eric Biggers
2025-02-27 7:08 ` Bill Wendling
2025-02-28 2:08 ` Eric Biggers
2025-02-27 10:52 ` H. Peter Anvin
2025-02-27 12:17 ` Bill Wendling
2025-02-27 20:56 ` Bill Wendling
2025-02-27 16:26 ` Dave Hansen
2025-02-27 20:57 ` Bill Wendling
2025-02-27 21:03 ` Dave Hansen
2025-02-27 23:47 ` [PATCH v2] " Bill Wendling
2025-02-28 21:20 ` Eric Biggers
2025-02-28 21:29 ` Bill Wendling
2025-03-03 20:15 ` David Laight
2025-03-03 20:27 ` Bill Wendling
2025-03-03 22:42 ` David Laight
2025-03-03 23:57 ` H. Peter Anvin
2025-03-04 0:16 ` Bill Wendling
2025-03-04 0:43 ` H. Peter Anvin
2025-03-04 4:32 ` David Laight
2025-03-04 20:52 ` David Laight [this message]
2025-03-04 21:52 ` Eric Biggers