Re: Prefetches in buffer_zero_* - Dr. David Alan Gilbert

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: "Philippe Mathieu-Daudé" <philmd@redhat.com>
Cc: "Lukáš Doktor" <ldoktor@redhat.com>,
	"Richard Henderson" <richard.henderson@linaro.org>,
	qemu-devel@nongnu.org, "Joe Mario" <jmario@redhat.com>
Subject: Re: Prefetches in buffer_zero_*
Date: Mon, 26 Jul 2021 13:07:50 +0100	[thread overview]
Message-ID: <YP6llgVw1lN0U58h@work-vm> (raw)
In-Reply-To: <bcbca5e4-d804-2a2f-2dff-4b19f4ef07cf@redhat.com>

* Philippe Mathieu-Daudé (philmd@redhat.com) wrote:
> +Lukáš
> 
> On 7/26/21 10:47 AM, Dr. David Alan Gilbert wrote:
> > * Joe Mario (jmario@redhat.com) wrote:
> >> On Thu, Jul 22, 2021 at 3:14 PM Dr. David Alan Gilbert <dgilbert@redhat.com>
> >> wrote:
> >>
> >>> * Richard Henderson (richard.henderson@linaro.org) wrote:
> >>>> On 7/22/21 12:02 AM, Dr. David Alan Gilbert wrote:
> >>>>> Hi Richard,
> >>>>>    I think you were the last person to fiddle with the prefetching
> >>>>> in buffer_zero_avx2 and friends; Joe (cc'd) wondered if explicit
> >>>>> prefetching still made sense on modern CPUs, and that their hardware
> >>>>> generally figures stuff out better on simple increments.
> >>>>>
> >>>>>    What was your thinking on this, and did you actually measure
> >>>>> any improvement?
> >>>>
> >>>> Ah, well, that was 5 years ago so I have no particular memory of this.
> >>> It
> >>>> wouldn't surprise me if you can't measure any improvement on modern
> >>>> hardware.
> >>>>
> >>>> Do you now measure an improvement with the prefetches gone?
> >>>
> >>> Not tried, it just came from Joe's suggestion that it was generally a
> >>> bad idea these days; I do remember that the behaviour of those functions
> >>> is quite tricky because there performance is VERY data dependent - many
> >>> VMs actually have pages that are quite dirty so you never iterate the
> >>> loop, but then you hit others with big zero pages and you spend your
> >>> entire life in the loop.
> >>>
> >>>
> >> Dave, Richard:
> >> My curiosity got the best of me.  So I created a small test program that
> >> used the buffer_zero_avx2() routine from qemu's bufferiszero.c.
> > 
> > Thanks for testing,
> > 
> >> When I run it on an Intel Cascade Lake processor, the cost of calling
> >> "__builtin_prefetch(p)" is in the noise range .  It's always "just
> >> slightly" slower.  I doubt it could ever be measured in qemu.
> >>
> >> Ironically, when I disabled the hardware prefetchers, the program slowed
> >> down over 33%.  And the call to "__builtin_prefetch(p)" actually hurt
> >> performance by over 3%.
> > 
> > Yeh that's a bit odd.
> > 
> >> My results are below, (only with the hardware prefetchers enabled).  The
> >> program is attached.
> >> Joe
> >>
> >> # gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do
> >> ./a.out; done
> >> TSC 356144 Kcycles.
> >> TSC 356714 Kcycles.
> >> TSC 356707 Kcycles.
> >> TSC 356565 Kcycles.
> >> TSC 356853 Kcycles.
> >> # gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> >> TSC 355520 Kcycles.
> >> TSC 355961 Kcycles.
> >> TSC 355872 Kcycles.
> >> TSC 355948 Kcycles.
> >> TSC 355918 Kcycles.
> > 
> > This basically agrees with the machines I've just tried your test on -
> > *except* AMD EPYC 7302P's - that really like the prefetch:
> > 
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH ; for i in {1..5}; do ./a.out; done
> > TSC 322162 Kcycles.
> > TSC 321861 Kcycles. 
> > TSC 322212 Kcycles. 
> > TSC 321957 Kcycles.
> > TSC 322085 Kcycles. 
> >  
> > [root@virtlab720 ~]# gcc -mavx buffer_zero_avx.c -O ; for i in {1..5}; do ./a.out; done
> > TSC 377988 Kcycles. 
> > TSC 380125 Kcycles. 
> > TSC 379440 Kcycles.
> > TSC 379689 Kcycles. 
> > TSC 379571 Kcycles. 
> >  
> > The 1st gen doesn't seem to see much difference with/without it.
> > 
> > Probably best to leave this code as is!
> 
> Regardless the decision of changing the code or not, it would be
> nice to have this test committed in the repository to run
> performance regression testing from time to time.

It could be, although this is a slightly odd microtest for that; it's a bit
specific (the avx2 variant, and only really testing the all zero case).


Dave

> >> /*
> >>  * Simple program to test if a prefetch helps or hurts buffer_zero_avx2.
> >>  *
> >>  * Compile with either:
> >>  *  gcc -mavx buffer_zero_avx.c -O 
> >>  * or
> >>  *  gcc -mavx buffer_zero_avx.c -O -DDO_PREFETCH 
> >>  */
> >>
> [...]
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

     prev parent reply	other threads:[~2021-07-26 12:08 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-22 10:02 Prefetches in buffer_zero_* Dr. David Alan Gilbert
2021-07-22 18:53 ` Richard Henderson
2021-07-22 19:13   ` Dr. David Alan Gilbert
2021-07-24  1:43     ` Joe Mario
2021-07-26  8:47       ` Dr. David Alan Gilbert
2021-07-26 11:31         ` Philippe Mathieu-Daudé
2021-07-26 12:07           ` Dr. David Alan Gilbert [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YP6llgVw1lN0U58h@work-vm \
    --to=dgilbert@redhat.com \
    --cc=jmario@redhat.com \
    --cc=ldoktor@redhat.com \
    --cc=philmd@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=richard.henderson@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.