From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: "Lukáš Doktor" <ldoktor@redhat.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
qemu-devel@nongnu.org, "Joe Mario" <jmario@redhat.com>
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 13:07:50 +0100
Message-ID: <YP6llgVw1lN0U58h@work-vm>
In-Reply-To: <bcbca5e4-d804-2a2f-2dff-4b19f4ef07cf@redhat.com>
* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> +Lukáš
>
> On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> > * Joe Mario (jmario@redhat.com) wrote:
> >> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> wrote:
> >>
> >>> * Richard Henderson (richard.henderson@linaro.org) wrote:
> >>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> >>>>> Hi Richard,
> >>>>> I think you were the last person to fiddle with the prefetching
> >>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered whether explicit
> >>>>> prefetching still makes sense on modern CPUs, since their hardware
> >>>>> prefetchers generally figure out simple incrementing access patterns
> >>>>> on their own.
> >>>>>
> >>>>> What was your thinking on this, and did you actually measure
> >>>>> any improvement?
> >>>>
> >>>> Ah, well, that was 5 years ago so I have no particular memory of
> >>>> this. It wouldn't surprise me if you can't measure any improvement
> >>>> on modern hardware.
> >>>>
> >>>> Do you now measure an improvement with the prefetches gone?
> >>>
> >>> Not tried, it just came from Joe's suggestion that it was generally a
> >>> bad idea these days; I do remember that the behaviour of those functions
> >>> is quite tricky because their performance is VERY data dependent - many
> >>> VMs actually have pages that are quite dirty so you never iterate the
> >>> loop, but then you hit others with big zero pages and you spend your
> >>> entire life in the loop.
> >>>
> >>>
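For anyone who hasn't looked at bufferiszero.c recently, the loop under
discussion is shaped roughly like this - a simplified sketch rather than
the exact qemu code, but it shows where the contested prefetch sits and
why the runtime is so data dependent (buffer_zero_avx2_sketch is my
name, not qemu's):

/*
 * Simplified sketch of a buffer_zero_avx2-style loop (illustrative,
 * not the exact code in qemu's util/bufferiszero.c).  Each iteration
 * OR-accumulates a 128-byte block and tests it, so a buffer with a
 * non-zero byte near the start exits almost immediately, while an
 * all-zero buffer runs the loop to the very end.  Assumes len is a
 * multiple of 128; compile with gcc -mavx2.
 */
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

static bool buffer_zero_avx2_sketch(const void *buf, size_t len)
{
    const __m256i *p = buf;
    const __m256i *e = (const __m256i *)((const char *)buf + len);

    for (; p < e; p += 4) {
#ifdef DO_PREFETCH
        __builtin_prefetch(p);          /* the prefetch in question */
#endif
        __m256i t = _mm256_or_si256(_mm256_loadu_si256(p + 0),
                                    _mm256_loadu_si256(p + 1));
        t = _mm256_or_si256(t, _mm256_loadu_si256(p + 2));
        t = _mm256_or_si256(t, _mm256_loadu_si256(p + 3));
        if (!_mm256_testz_si256(t, t)) {
            return false;               /* early exit on a dirty block */
        }
    }
    return true;
}
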
> >> Dave, Richard:
> >> My curiosity got the best of me. So I created a small test program that
> >> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> >
> > Thanks for testing,
> >
> >> When I run it on an Intel Cascade Lake processor, the cost of calling
> >> "__builtin_prefetch(p)" is in the noise range . It's always "just
> >> slightly" slower. I doubt it could ever be measured in qemu.
> >>
> >> Ironically, when I disabled the hardware prefetchers, the program slowed
> >> down by over 33%. And the call to "__builtin_prefetch(p)" actually hurt
> >> performance by over 3%.
> >
> > Yeh that's a bit odd.
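
One thing worth keeping in mind when reading those numbers: the plain
one-argument __builtin_prefetch(p) used in the test is just GCC's
default read, maximum-locality form.  A minimal equivalent spelling,
for reference (prefetch_for_read is my name for it):

/* GCC's prefetch builtin: the optional second argument is 0 for read
 * or 1 for write, the optional third is locality 0-3 (3 = keep in all
 * cache levels).  __builtin_prefetch(p) defaults to (p, 0, 3). */
static inline void prefetch_for_read(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}
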
> >
> >> My results are below, (only with the hardware prefetchers enabled). The
> >> program is attached.
> >> Joe
> >>
> >> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
> >> ./a.out; done
> >> TSC 356144 Kcycles.
> >> TSC 356714 Kcycles.
> >> TSC 356707 Kcycles.
> >> TSC 356565 Kcycles.
> >> TSC 356853 Kcycles.
> >> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> >> TSC 355520 Kcycles.
> >> TSC 355961 Kcycles.
> >> TSC 355872 Kcycles.
> >> TSC 355948 Kcycles.
> >> TSC 355918 Kcycles.
> >
> > This basically agrees with the machines I've just tried your test on -
> > *except* AMD EPYC 7302Ps, which really like the prefetch:
> >
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> > TSC 322162 Kcycles.
> > TSC 321861 Kcycles.
> > TSC 322212 Kcycles.
> > TSC 321957 Kcycles.
> > TSC 322085 Kcycles.
> >
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> > TSC 377988 Kcycles.
> > TSC 380125 Kcycles.
> > TSC 379440 Kcycles.
> > TSC 379689 Kcycles.
> > TSC 379571 Kcycles.
> >
> > The 1st-gen EPYC doesn't seem to see much difference with/without it.
> >
> > Probably best to leave this code as is!
>
> Regardless of whether the code is changed or not, it would be
> nice to have this test committed to the repository so that
> performance regression testing can be run from time to time.
It could be, although this is a slightly odd microtest for that; it's a bit
specific (it exercises only the avx2 variant, and only really tests the
all-zero case).
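
If it were committed, it would presumably want to time the dirty-page
path as well, since that's what most real migrations hit; something
along these lines (hypothetical - not from Joe's attached program, and
buf/BUF_LEN/ITERS/rdtsc() stand in for whatever plumbing the harness
already has):

/* Hypothetical extra measurement: dirty the first byte so the check
 * takes the early-exit path instead of scanning the whole buffer. */
memset(buf, 0, BUF_LEN);
buf[0] = 1;
uint64_t start = rdtsc();
for (long i = 0; i < ITERS; i++) {
    (void)buffer_zero_avx2(buf, BUF_LEN);
}
printf("TSC %llu Kcycles (dirty page).\n",
       (unsigned long long)((rdtsc() - start) / 1000));
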
Dave
> >> /*
> >> * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> >> *
> >> * Compile with either:
> >> * gcc -mavx buffer_zero_avx.c -O
> >> * or
> >> * gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH
> >> */
> >>
> [...]
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK