Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations

From: David Gibson <david@gibson.dropbear.id.au>
To: BALATON Zoltan <balaton@eik.bme.hu>
Cc: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>,
	qemu-ppc@nongnu.org, richard.henderson@linaro.org,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [Qemu-ppc] [RFC PATCH 0/6] target/ppc: convert VMX instructions to use TCG vector operations
Date: Tue, 11 Dec 2018 12:20:27 +1100
Message-ID: <20181211012027.GA4261@umbus.fritz.box>
In-Reply-To: <alpine.BSF.2.21.9999.1812102035360.24343@zero.eik.bme.hu>

On Mon, Dec 10, 2018 at 09:54:51PM +0100, BALATON Zoltan wrote:
> On Mon, 10 Dec 2018, David Gibson wrote:
> > On Mon, Dec 10, 2018 at 01:33:53AM +0100, BALATON Zoltan wrote:
> > > On Fri, 7 Dec 2018, Mark Cave-Ayland wrote:
> > > > This patchset is an attempt at trying to improve the VMX (Altivec) instruction
> > > > performance by making use of the new TCG vector operations where possible.
> > > 
> > > This is very welcome, thanks for doing this.
> > > 
> > > > In order to use TCG vector operations, the registers must be accessible from cpu_env
> > > > whilst currently they are accessed via arrays of static TCG globals. Patches 1-3
> > > > are therefore mechanical patches which introduce access helpers for FPR, AVR and VSR
> > > > registers using the supplied TCGv_i64 parameter.
> > > 
> > > Have you tried some benchmarks or tests to measure the impact of these
> > > changes? I've tried the (very unscientific) benchmarks I've written about
> > > before here:
> > > 
> > > http://lists.nongnu.org/archive/html/qemu-ppc/2018-07/msg00261.html
> > > 
> > > (which seem to use AltiVec/VMX instructions but not sure which) on mac99
> > > with MorphOS and I could not see any performance increase. I haven't run
> > > enough tests but results with or without this series on master were mostly
> > > the same within a few percent, and I sometimes even saw lower performance
> > > with these patches than without. I haven't tried to find out why (no time
> > > for that now) so can't really draw any conclusions from this. I'm also not
> > > sure if I've actually tested what you've changed or these use instructions
> > > that your patches don't optimise yet, or the changes I've seen were just
> > > normal changes between runs; but I wonder if the increased number of
> > > temporaries could result in lower performance in some cases?
> > 
> > What was your host machine?  IIUC this change will only improve
> > performance if the host tcg backend is able to implement TCG vector
> > ops in terms of vector ops on the host.
> 
> I tried it on an i5 650, which has: sse sse2 ssse3 sse4_1 sse4_2. I assume
> x86_64 is supported, but I'm not sure what the CPU requirements are.
> 
> > In addition, this series only converts a subset of the integer and
> > logical vector instructions.  If your testcase is mostly floating
> > point (vectored or otherwise), it will still be softfloat and so not
> > see any speedup.
> 
> Yes, I don't really know what these tests use. I think the "lame" test is
> mostly floating point, but I tried it with "lame_vmx", which should at least
> use some vector ops. The "mplayer -benchmark" test seems more VMX dependent,
> based on my previous profiling and testing with hardfloat, but I'm not sure.
> (When testing these with hardfloat I found that lame benefited from
> hardfloat but mplayer didn't, and more VMX related functions showed up with
> mplayer, so I assumed it's more VMX bound.)

I should clarify here.  When I say "floating point" above, I don't
mean only things using the regular FPU rather than the vector unit.  I
mean *anything* involving floating point calculations, whether
they're done in the FPU or the vector unit.

The patches here don't convert all VMX instructions to use vector TCG
ops - they only convert a few, and those few are about using the
vector unit for integer (and logical) operations.  VMX instructions
involving floating point calculations are unaffected and will still
use soft-float.
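
To illustrate the kind of operation I mean, here is a rough standalone
sketch in plain C.  It is not QEMU code: vec128, vec_add_u8 and vec_and
are names made up for this example, and it only models the guest-visible
semantics of an integer add (as in vaddubm) and a logical op (as in vand).

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit VMX register; not QEMU's own type. */
typedef union {
    uint8_t  u8[16];
    uint64_t u64[2];
} vec128;

/*
 * Lane-wise modulo add on 16 unsigned bytes (the semantics of vaddubm).
 * Every lane is independent and simply wraps, which is also how a host
 * SIMD byte add behaves, so no per-element fixup is needed.
 */
static void vec_add_u8(vec128 *d, const vec128 *a, const vec128 *b)
{
    for (int i = 0; i < 16; i++) {
        d->u8[i] = (uint8_t)(a->u8[i] + b->u8[i]);
    }
}

/* Bitwise AND over the whole register (the semantics of vand). */
static void vec_and(vec128 *d, const vec128 *a, const vec128 *b)
{
    d->u64[0] = a->u64[0] & b->u64[0];
    d->u64[1] = a->u64[1] & b->u64[1];
}
```

Operations of this shape have no rounding modes or exception flags to
track, which is why they are the natural first candidates.  A vector
floating point add would need each element pushed through the
soft-float code instead.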

> I've tried to do some profiling again to find out what's used, but I can't
> get good results with the tools I have. (oprofile stopped working since I
> updated my machine, Linux perf gives results that are hard for me to
> interpret, and I haven't checked whether gprof works now; it didn't before.)
> I have seen some vector related helpers in the profile though, so at least
> some vector ops are used. "helper_vperm" was the highest of them, at about
> 11th place (not sure where it is called from); other vector helpers were
> lower.
> 
> I don't remember details now but previously when testing hardfloat I've
> written this: "I've looked at vperm which came out top in one of the
> profiles I've taken and on little endian hosts it has the loop backwards and
> also accesses vector elements from end to front which I wonder may be enough
> for the compiler to not be able to optimise it? But I haven't checked
> assembly. The altivec dependent mplayer video decoding test did not change
> much with hardfloat, it took 98% compared to master so likely altivec is
> dominating here." (Although this was with the PPC specific vector helpers
> before VMX patch so not sure if this is still relevant.)
> 
> The top 10 in profile were still related to low level memory access and MMU
> management stuff as I've found before:
> 
> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03609.html
> http://lists.nongnu.org/archive/html/qemu-devel/2018-07/msg03704.html
> 
> I think implementing i2c for mac99 may help this and some other
> optimisations may also be possible but I don't know enough about these to
> try that.
> 
> It also looks like with --enable-debug something is always flushing the TLB
> and blowing away the TB caches, so these will be top in the profile and
> likely dominate runtime, so I can't really use the profile to measure the
> impact of the VMX patch. Without --enable-debug I can't get call graphs, so
> I can't get a useful profile. I think I've looked at this before as well but
> can't remember now which check enabled by --enable-debug is responsible for
> the constant TB cache flush and whether that could be avoided. I just don't
> use --enable-debug unless I need to debug something.
> 
> Maybe the PPC softmmu should be reviewed and optimised by someone who knows
> it...

I'm not sure there is anyone who knows it at this point.  I probably
know it as well as anybody, and the ppc32 code scares me.  It's a
crufty mess and it would be nice to clean up, but that requires
someone with enough time and interest.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
