From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: "Lukáš Doktor" <ldoktor@redhat.com>,
"Richard Henderson" <richard.henderson@linaro.org>,
qemu-devel@nongnu.org, "Joe Mario" <jmario@redhat.com>
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 13:07:50 +0100
Message-ID: <YP6llgVw1lN0U58h@work-vm>
In-Reply-To: <bcbca5e4-d804-2a2f-2dff-4b19f4ef07cf@redhat.com>
* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> +Lukáš
>
> On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> > * Joe Mario (jmario@redhat.com) wrote:
> >> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> wrote:
> >>
> >>> * Richard Henderson (richard.henderson@linaro.org) wrote:
> >>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> >>>>> Hi Richard,
> >>>>> I think you were the last person to fiddle with the prefetching
> >>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered whether explicit
> >>>>> prefetching still makes sense on modern CPUs, since their hardware
> >>>>> prefetchers generally figure out simple incrementing access patterns
> >>>>> on their own.
> >>>>>
> >>>>> What was your thinking on this, and did you actually measure
> >>>>> any improvement?
> >>>>
> >>>> Ah, well, that was 5 years ago so I have no particular memory of
> >>>> this. It wouldn't surprise me if you can't measure any improvement
> >>>> on modern hardware.
> >>>>
> >>>> Do you now measure an improvement with the prefetches gone?
> >>>
> >>> Not tried, it just came from Joe's suggestion that it was generally a
> >>> bad idea these days; I do remember that the behaviour of those functions
> >>> is quite tricky because their performance is VERY data dependent - many
> >>> VMs actually have pages that are quite dirty so you never iterate the
> >>> loop, but then you hit others with big zero pages and you spend your
> >>> entire life in the loop.
> >>>
> >>>
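For anyone who hasn't looked at bufferiszero.c recently, the loop under
discussion is shaped roughly like this - a simplified sketch rather than
the exact qemu code, but it shows where the contested prefetch sits and
why the runtime is so data dependent (buffer_zero_avx2_sketch is my
name, not qemu's):

/*
 * Simplified sketch of a buffer_zero_avx2-style loop (illustrative,
 * not the exact code in qemu's util/bufferiszero.c).  Each iteration
 * OR-accumulates a 128-byte block and tests it, so a buffer with a
 * non-zero byte near the start exits almost immediately, while an
 * all-zero buffer runs the loop to the very end.  Assumes len is a
 * multiple of 128; compile with gcc -mavx2.
 */
#include <immintrin.h>
#include <stdbool.h>
#include <stddef.h>

static bool buffer_zero_avx2_sketch(const void *buf, size_t len)
{
    const __m256i *p = buf;
    const __m256i *e = (const __m256i *)((const char *)buf + len);

    for (; p < e; p += 4) {
#ifdef DO_PREFETCH
        __builtin_prefetch(p);          /* the prefetch in question */
#endif
        __m256i t = _mm256_or_si256(_mm256_loadu_si256(p + 0),
                                    _mm256_loadu_si256(p + 1));
        t = _mm256_or_si256(t, _mm256_loadu_si256(p + 2));
        t = _mm256_or_si256(t, _mm256_loadu_si256(p + 3));
        if (!_mm256_testz_si256(t, t)) {
            return false;               /* early exit on a dirty block */
        }
    }
    return true;
}
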
> >> Dave, Richard:
> >> My curiosity got the best of me. So I created a small test program that
> >> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> >
> > Thanks for testing,
> >
> >> When I run it on an Intel Cascade Lake processor, the cost of calling
> >> "__builtin_prefetch(p)" is in the noise range . It's always "just
> >> slightly" slower. I doubt it could ever be measured in qemu.
> >>
> >> Ironically, when I disabled the hardware prefetchers, the program slowed
> >> down by over 33%. And the call to "__builtin_prefetch(p)" actually hurt
> >> performance by over 3%.
> >
> > Yeh that's a bit odd.
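
One thing worth keeping in mind when reading those numbers: the plain
one-argument __builtin_prefetch(p) used in the test is just GCC's
default read, maximum-locality form.  A minimal equivalent spelling,
for reference (prefetch_for_read is my name for it):

/* GCC's prefetch builtin: the optional second argument is 0 for read
 * or 1 for write, the optional third is locality 0-3 (3 = keep in all
 * cache levels).  __builtin_prefetch(p) defaults to (p, 0, 3). */
static inline void prefetch_for_read(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}
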
> >
> >> My results are below, (only with the hardware prefetchers enabled). The
> >> program is attached.
> >> Joe
> >>
> >> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
> >> ./a.out; done
> >> TSC 356144 Kcycles.
> >> TSC 356714 Kcycles.
> >> TSC 356707 Kcycles.
> >> TSC 356565 Kcycles.
> >> TSC 356853 Kcycles.
> >> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> >> TSC 355520 Kcycles.
> >> TSC 355961 Kcycles.
> >> TSC 355872 Kcycles.
> >> TSC 355948 Kcycles.
> >> TSC 355918 Kcycles.
> >
> > This basically agrees with the machines I've just tried your test on -
> > *except* AMD EPYC 7302Ps, which really like the prefetch:
> >
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> > TSC 322162 Kcycles.
> > TSC 321861 Kcycles.
> > TSC 322212 Kcycles.
> > TSC 321957 Kcycles.
> > TSC 322085 Kcycles.
> >
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> > TSC 377988 Kcycles.
> > TSC 380125 Kcycles.
> > TSC 379440 Kcycles.
> > TSC 379689 Kcycles.
> > TSC 379571 Kcycles.
> >
> > The 1st-gen EPYC doesn't seem to see much difference with/without it.
> >
> > Probably best to leave this code as is!
>
> Regardless of whether the code is changed or not, it would be
> nice to have this test committed to the repository so that
> performance regression testing can be run from time to time.
It could be, although this is a slightly odd microtest for that; it's a bit
specific (it exercises only the avx2 variant, and only really tests the
all-zero case).
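
If it were committed, it would presumably want to time the dirty-page
path as well, since that's what most real migrations hit; something
along these lines (hypothetical - not from Joe's attached program, and
buf/BUF_LEN/ITERS/rdtsc() stand in for whatever plumbing the harness
already has):

/* Hypothetical extra measurement: dirty the first byte so the check
 * takes the early-exit path instead of scanning the whole buffer. */
memset(buf, 0, BUF_LEN);
buf[0] = 1;
uint64_t start = rdtsc();
for (long i = 0; i < ITERS; i++) {
    (void)buffer_zero_avx2(buf, BUF_LEN);
}
printf("TSC %llu Kcycles (dirty page).\n",
       (unsigned long long)((rdtsc() - start) / 1000));
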
Dave
> >> /*
> >> * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> >> *
> >> * Compile with either:
> >> * gcc -mavx buffer_zero_avx.c -O
> >> * or
> >> * gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH
> >> */
> >>
> [...]
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK