From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Joe Mario <jmario@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>, qemu-devel@nongnu.org
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 09:47:43 +0100 [thread overview]
Message-ID: <YP52r9ck8jmV/RkR@work-vm> (raw)
In-Reply-To: <CA+c1WK7XCgdYTbG5FK3q=x9S1t2p2vxcreEwtiZf5L=rsg0i6g@mail.gmail.com>
* Joe Mario (jmario@redhat.com) wrote:
> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> wrote:
>
> > * Richard Henderson (richard.henderson@linaro.org) wrote:
> > > On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> > > > Hi Richard,
> > > > I think you were the last person to fiddle with the prefetching
> > > > in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> > > > prefetching still made sense on modern CPUs, and that their hardware
> > > > generally figures stuff out better on simple increments.
> > > >
> > > > What was your thinking on this, and did you actually measure
> > > > any improvement?
> > >
> > > Ah, well, that was 5 years ago so I have no particular memory of this.
> > It
> > > wouldn't surprise me if you can't measure any improvement on modern
> > > hardware.
> > >
> > > Do you now measure an improvement with the prefetches gone?
> >
> > Not tried, it just came from Joe's suggestion that it was generally a
> > bad idea these days; I do remember that the behaviour of those functions
> > is quite tricky because their performance is VERY data dependent - many
> > VMs actually have pages that are quite dirty so you never iterate the
> > loop, but then you hit others with big zero pages and you spend your
> > entire life in the loop.
> >
> >
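(To make the data-dependence concrete, here's a minimal sketch - not the qemu
code, just a hypothetical byte-wise scan with the same early-exit shape:)

```c
#include <stddef.h>

/* Hypothetical simplified scan with the same shape as buffer_zero_*:
 * it bails out at the first nonzero byte.  A dirty page usually
 * returns after a handful of iterations; an all-zero page walks the
 * whole buffer, so that's where all the loop (and any prefetch)
 * cost lives. */
static int buffer_is_zero_simple(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return 0;   /* early exit: cheap for dirty pages */
        }
    }
    return 1;           /* full scan: expensive for zero pages */
}
```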
> Dave, Richard:
> My curiosity got the best of me. So I created a small test program that
> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
Thanks for testing,
> When I run it on an Intel Cascade Lake processor, the cost of calling
> "__builtin_prefetch(p)" is in the noise range. It's always "just
> slightly" slower. I doubt it could ever be measured in qemu.
>
> Ironically, when I disabled the hardware prefetchers, the program slowed
> down over 33%. And the call to "__builtin_prefetch(p)" actually hurt
> performance by over 3%.
Yeh that's a bit odd.
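(For anyone wanting to reproduce the prefetchers-off numbers: on many recent
Intel cores the hardware prefetchers are controlled by MSR 0x1A4. The bit
layout below is an assumption taken from Intel's published prefetcher-control
note - it's Intel-specific, not valid on AMD, and not exposed on every part,
so check your processor first. A sketch of building the disable mask:)

```c
#include <stdint.h>

/* MSR 0x1A4 prefetcher-control bits on many recent Intel cores
 * (assumption - per Intel's hardware prefetcher control disclosure):
 *   bit 0: L2 hardware prefetcher disable
 *   bit 1: L2 adjacent-cache-line prefetcher disable
 *   bit 2: DCU streamer prefetcher disable
 *   bit 3: DCU IP prefetcher disable
 * A set bit disables that prefetcher. */
#define MSR_PREFETCH_CTRL 0x1a4

static uint64_t prefetch_disable_mask(int l2_hw, int l2_adj,
                                      int dcu_stream, int dcu_ip)
{
    return (uint64_t)(!!l2_hw)      << 0 |
           (uint64_t)(!!l2_adj)     << 1 |
           (uint64_t)(!!dcu_stream) << 2 |
           (uint64_t)(!!dcu_ip)     << 3;
}

/* The resulting value would then be written to the MSR on every CPU,
 * e.g. with msr-tools:  wrmsr -a 0x1a4 0xf  (disables all four),
 * or via pwrite() on /dev/cpu/N/msr as root. */
```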
> My results are below, (only with the hardware prefetchers enabled). The
> program is attached.
> Joe
>
> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
> ./a.out; done
> TSC 356144 Kcycles.
> TSC 356714 Kcycles.
> TSC 356707 Kcycles.
> TSC 356565 Kcycles.
> TSC 356853 Kcycles.
> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> TSC 355520 Kcycles.
> TSC 355961 Kcycles.
> TSC 355872 Kcycles.
> TSC 355948 Kcycles.
> TSC 355918 Kcycles.
This basically agrees with the machines I've just tried your test on -
*except* AMD EPYC 7302P's - that really like the prefetch:
[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
TSC 322162 Kcycles.
TSC 321861 Kcycles.
TSC 322212 Kcycles.
TSC 321957 Kcycles.
TSC 322085 Kcycles.
[root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
TSC 377988 Kcycles.
TSC 380125 Kcycles.
TSC 379440 Kcycles.
TSC 379689 Kcycles.
TSC 379571 Kcycles.
The 1st gen EPYCs don't seem to see much difference with/without it.
Probably best to leave this code as is!
Dave
> Dave
> > >
> > > r~
> > >
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> >
> >
> /*
> * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> *
> * Compile with either:
> * gcc -mavx buffer_zero_avx.c -O
> * or
> * gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH
> */
>
> #include <immintrin.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <stdlib.h>   /* for exit() */
> #include <stddef.h>
> #include <sys/mman.h>
> #include <string.h>
>
> #define likely(x) __builtin_expect((x),1)
> #define unlikely(x) __builtin_expect((x),0)
>
> static __inline__ u_int64_t start_clock();
> static __inline__ u_int64_t stop_clock();
> static int buffer_zero_avx2(const void *buf, size_t len);
>
> /*
> * Allocate a large chunk of anon memory, touch/zero it,
> * and then time the call to buffer_zero_avx2().
> */
> int main()
> {
> long i;
> size_t mmap_len = 2UL*1024*1024*1024;
> char *ptr = mmap(NULL, mmap_len,
> PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0L);
>
> if (ptr == MAP_FAILED) {
> perror(" mmap");
> exit(1);
> }
>
> // Touch the pages (they're already cleared)
> memset(ptr,0x0,mmap_len);
>
> u_int64_t start_rdtsc = start_clock();
>
> buffer_zero_avx2(ptr, mmap_len);
>
> u_int64_t stop_rdtsc = stop_clock();
> u_int64_t diff = stop_rdtsc - start_rdtsc;
>
> printf("TSC %lu Kcycles. \n", diff/1000);
>
> }
>
> static int
> buffer_zero_avx2(const void *buf, size_t len)
> {
> /* Begin with an unaligned head of 32 bytes. */
> __m256i t = _mm256_loadu_si256(buf);
> __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
> __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
>
> if (likely(p <= e)) {
> /* Loop over 32-byte aligned blocks of 128. */
> do {
> #ifdef DO_PREFETCH
> __builtin_prefetch(p);
> #endif
> if (unlikely(!_mm256_testz_si256(t, t))) {
> printf("In unlikely buffer_zero, p:%p \n", (void *)p);
> return 0;
> }
> t = p[-4] | p[-3] | p[-2] | p[-1];
> p += 4;
> } while (p <= e);
> } else {
> t |= _mm256_loadu_si256(buf + 32);
> if (len <= 128) {
> goto last2;
> }
> }
>
> /* Finish the last block of 128 unaligned. */
> t |= _mm256_loadu_si256(buf + len - 4 * 32);
> t |= _mm256_loadu_si256(buf + len - 3 * 32);
> last2:
> t |= _mm256_loadu_si256(buf + len - 2 * 32);
> t |= _mm256_loadu_si256(buf + len - 1 * 32);
>
> // printf("End of buffer_zero_avx2\n");
> return _mm256_testz_si256(t, t);
> }
>
> static __inline__ u_int64_t
> start_clock() {
> // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
> u_int32_t hi, lo;
> __asm__ __volatile__ (
> "CPUID\n\t"
> "RDTSC\n\t"
> "mov %%edx, %0\n\t"
> "mov %%eax, %1\n\t": "=r" (hi), "=r" (lo)::
> "%rax", "%rbx", "%rcx", "%rdx");
> return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
>
> static __inline__ u_int64_t
> stop_clock() {
> // See: Intel Doc #324264, "How to Benchmark Code Execution Times on Intel...",
> u_int32_t hi, lo;
> __asm__ __volatile__(
> "RDTSCP\n\t"
> "mov %%edx, %0\n\t"
> "mov %%eax, %1\n\t"
> "CPUID\n\t": "=r" (hi), "=r" (lo)::
> "%rax", "%rbx", "%rcx", "%rdx");
> return ( (u_int64_t)lo) | ( ((u_int64_t)hi) << 32);
> }
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Thread overview: 7 messages
2021-07-22 10:02 Prefetches in buffer_zero_* Dr. David Alan Gilbert
2021-07-22 18:53 ` Richard Henderson
2021-07-22 19:13 ` Dr. David Alan Gilbert
2021-07-24 1:43 ` Joe Mario
2021-07-26 8:47 ` Dr. David Alan Gilbert [this message]
2021-07-26 11:31 ` Philippe Mathieu-Daudé
2021-07-26 12:07 ` Dr. David Alan Gilbert