qemu-devel.nongnu.org archive mirror
From: "Alex Bennée" <alex.bennee@linaro.org>
To: rth@twiddle.net, cota@braap.org, batuzovk@ispras.ru
Cc: qemu-devel@nongnu.org, qemu-arm@nongnu.org,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
Date: Thu, 17 Aug 2017 19:03:55 +0100
Message-ID: <20170817180404.29334-1-alex.bennee@linaro.org>

Hi,

With upcoming work on SVE I've been looking at the way we implement
vector registers in QEMU's TCG. The current orthodoxy is to decompose
the vector into a series of TCG registers, often calling a helper
function for the calculation of each element. The result of the helper
is then stored back in the vector representation afterwards.
There are occasional outliers like simd_tbl which access elements
directly from a passed CPUFooState env pointer but these are rare.

This series introduces the concept of a TCGv_vec type. This is a
pointer to the start of the in-memory representation of an arbitrarily
long vector register. It is passed to a helper function as a pointer
along with a normal TCG register containing information about the
actual vector length and any additional information the helper needs
to do the operation. The hope* is this saves on the churn of having
the TCG do things element by element and allows the compiler to use
native vector operations to streamline the helpers.
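
As a rough illustration of that calling convention, here is a minimal
standalone sketch of a helper that takes pointers to whole vectors plus
one descriptor word. It is not the code from this series: the names
make_desc, DESC_COUNT, DESC_INDEX and sketch_smull_idx_s32, and the
descriptor bit layout, are all hypothetical and chosen only for the
example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical descriptor layout (an assumption for this sketch, not
 * the ADVSIMD flags encoding from the series):
 *   bits [7:0]  = number of result elements
 *   bits [15:8] = index of the element of m to multiply by
 */
#define DESC_COUNT(d) ((d) & 0xff)
#define DESC_INDEX(d) (((d) >> 8) & 0xff)

static inline uint32_t make_desc(unsigned count, unsigned index)
{
    assert(count <= 0xff && index <= 0xff);
    return count | (index << 8);
}

/*
 * Signed multiply long by element: d[i] = n[i] * m[index].
 * The helper sees the whole vector at once, so the loop is a
 * candidate for the host compiler's vectoriser.
 */
static void sketch_smull_idx_s32(int32_t *d, const int16_t *n,
                                 const int16_t *m, uint32_t desc)
{
    unsigned count = DESC_COUNT(desc);
    int32_t scalar = m[DESC_INDEX(desc)];

    for (unsigned i = 0; i < count; i++) {
        d[i] = (int32_t)n[i] * scalar;
    }
}
```

A caller would pack the element count and index once, for example
make_desc(4, 2), and pass it alongside the three vector pointers. The
translator then emits a single helper call per instruction rather than
one per element.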

There are some downsides to this approach. The first is you have to be
careful about register aliasing. If you are doing a same reg to same
reg operation you need to make a copy of the vector so you don't
trample your input data as you go. The second is this involves
changing some of the assumptions the TCG makes about things. I've
managed to keep all the changes within the core TCG code for now but
so far it has only been tested for the tcg_call path which is the only
place where TCGv_vecs should turn up. It is possible to do the same
thing without touching the TCG code generation by using TCGv_ptrs and
manually emitting tcg_addi ops to pass the correct address. Richard
has been exploring this approach with his series. The downside of that
is you do miss the ability to have named global vector registers which
makes reading the TCG dumps a little easier.
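
To make the aliasing hazard concrete, here is a small standalone
sketch, assuming a 128-bit register and a widening multiply where the
destination may be the same storage as the source. The names VEC_BYTES
and sketch_smull_alias_safe are hypothetical and this is not how the
series itself handles the copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16  /* assumed 128-bit register for this sketch */

static_assert(sizeof(int32_t) * 4 == VEC_BYTES,
              "four 32-bit results fill the register");

/*
 * Widening multiply where dst may be the same register as src.
 * Each 32-bit result would overwrite two 16-bit inputs, so the
 * inputs are snapshotted before any result is written back.
 */
static void sketch_smull_alias_safe(void *dst, const void *src,
                                    int16_t scalar)
{
    int16_t in[4];
    int32_t out[4];

    memcpy(in, src, sizeof(in));    /* copy the low four 16-bit lanes */
    for (int i = 0; i < 4; i++) {
        out[i] = (int32_t)in[i] * scalar;
    }
    memcpy(dst, out, sizeof(out));  /* safe even when dst == src */
}
```

Without the snapshot, writing the first 32-bit result in place would
clobber the second 16-bit input before it had been read.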

I've only patched one helper in this series which implements the
indexed smull. This is because it appears in the profiles for my test
case which was using an arm64 ffmpeg to transcode:

  ./ffmpeg.arm64 -i big_buck_bunny_480p_surround-fix.avi \
    -threads 1 -qscale:v 3 -f null -

* hope. On an earlier revision (which included sqshrn conversions) I
  had measured a minor saving but this had disappeared once I measured
  the final code. However the profile is fairly dominated by
  softfloat.

master:
     8.05%  qemu-aarch64  qemu-aarch64             [.] roundAndPackFloat32
     7.28%  qemu-aarch64  qemu-aarch64             [.] float32_mul
     6.56%  qemu-aarch64  qemu-aarch64             [.] helper_lookup_tb_ptr
     5.31%  qemu-aarch64  qemu-aarch64             [.] float32_muladd
     4.09%  qemu-aarch64  qemu-aarch64             [.] helper_neon_mull_s16
     4.00%  qemu-aarch64  qemu-aarch64             [.] addFloat32Sigs
     3.86%  qemu-aarch64  qemu-aarch64             [.] subFloat32Sigs
     2.26%  qemu-aarch64  qemu-aarch64             [.] helper_simd_tbl
     2.00%  qemu-aarch64  qemu-aarch64             [.] float32_add
     1.81%  qemu-aarch64  qemu-aarch64             [.] helper_neon_unarrow_sat8
     1.64%  qemu-aarch64  qemu-aarch64             [.] float32_sub
     1.43%  qemu-aarch64  qemu-aarch64             [.] helper_neon_subl_u32
     0.98%  qemu-aarch64  qemu-aarch64             [.] helper_neon_widen_u8

tcg-native-vectors-rfc:
     7.93%  qemu-aarch64  qemu-aarch64             [.] roundAndPackFloat32
     7.54%  qemu-aarch64  qemu-aarch64             [.] float32_mul
     6.29%  qemu-aarch64  qemu-aarch64             [.] helper_lookup_tb_ptr
     5.39%  qemu-aarch64  qemu-aarch64             [.] float32_muladd
     3.92%  qemu-aarch64  qemu-aarch64             [.] addFloat32Sigs
     3.86%  qemu-aarch64  qemu-aarch64             [.] subFloat32Sigs
     3.62%  qemu-aarch64  qemu-aarch64             [.] helper_advsimd_smull_idx_s32
     2.19%  qemu-aarch64  qemu-aarch64             [.] helper_simd_tbl
     2.09%  qemu-aarch64  qemu-aarch64             [.] helper_neon_mull_s16
     1.99%  qemu-aarch64  qemu-aarch64             [.] float32_add
     1.79%  qemu-aarch64  qemu-aarch64             [.] helper_neon_unarrow_sat8
     1.62%  qemu-aarch64  qemu-aarch64             [.] float32_sub
     1.43%  qemu-aarch64  qemu-aarch64             [.] helper_neon_subl_u32
     1.00%  qemu-aarch64  qemu-aarch64             [.] helper_neon_widen_u8
     0.98%  qemu-aarch64  qemu-aarch64             [.] helper_neon_addl_u32

At the moment the default compiler settings don't actually vectorise
the helper. I could get it to once I added some alignment guarantees
but the casting I did broke the instruction emulation so I haven't
included that patch in this series.
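
For reference, one way of expressing an alignment guarantee to GCC or
Clang without casting is the __builtin_assume_aligned builtin. The
sketch below is not the patch mentioned above; the function name
sketch_smull_aligned and the 16-byte alignment are assumptions made
for the example, and whether the loop is actually vectorised still
depends on the compiler, flags and host.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Widening multiply by a scalar over count elements, with the
 * compiler told that both vectors start on a 16-byte boundary.
 * The asserts check that promise at run time in debug builds.
 */
static void sketch_smull_aligned(int32_t *d, const int16_t *n,
                                 int16_t scalar, unsigned count)
{
    assert(((uintptr_t)d % 16) == 0);
    assert(((uintptr_t)n % 16) == 0);

    int32_t *da = __builtin_assume_aligned(d, 16);
    const int16_t *na = __builtin_assume_aligned(n, 16);

    for (unsigned i = 0; i < count; i++) {
        da[i] = (int32_t)na[i] * scalar;
    }
}
```

The builtin only makes a promise to the optimiser, so the storage
itself still has to be aligned, for example with an aligned attribute
on the register array.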

Given the results why continue investigating this? Well for one thing
vector sizes are growing: SVE vectors are up to 2048 bits long. Those
longer vectors should offer more scope for the host compiler to
generate efficient code in the helper. Also vector operations tend to
be quite complex, so being able to handle them in C code instead of
TCGOps might be preferable from a code maintainability point of view.
Finally this noddy little experiment has at least shown it doesn't
worsen performance. It would be nice if I could find a benchmark that
made heavy use of non-floating point SIMD instructions to better
measure the effect of marshalling elements vs vectorised helpers. If
anyone has any suggestions I'm all ears ;-)

Anyway questions, comments?

Alex Bennée (9):
  tcg/README: listify the TCG types.
  tcg: introduce the concepts of a TCGv_vec register type
  tcg: generate ptrs to vector registers
  helper-head: add support for vec type
  arm/cpu.h: align VFP registers
  target/arm/translate-a64: regnames -> x_regnames
  target/arm/translate-a64: register global vectors
  target/arm/helpers: introduce ADVSIMD flags
  target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]

 include/exec/helper-head.h        |  5 ++
 target/arm/advsimd_helper_flags.h | 50 ++++++++++++++++++++
 target/arm/cpu.h                  |  4 +-
 target/arm/helper-a64.c           | 18 ++++++++
 target/arm/helper-a64.h           |  2 +
 target/arm/translate-a64.c        | 97 +++++++++++++++++++++++++++++++++++++--
 tcg/README                        | 10 ++--
 tcg/tcg.c                         | 26 ++++++++++-
 tcg/tcg.h                         | 20 ++++++++
 9 files changed, 222 insertions(+), 10 deletions(-)
 create mode 100644 target/arm/advsimd_helper_flags.h

-- 
2.13.0


Thread overview: 20+ messages
2017-08-17 18:03 Alex Bennée [this message]
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 1/9] tcg/README: listify the TCG types Alex Bennée
2017-08-17 20:05   ` Richard Henderson
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 2/9] tcg: introduce the concepts of a TCGv_vec register type Alex Bennée
2017-08-17 20:07   ` Richard Henderson
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 3/9] tcg: generate ptrs to vector registers Alex Bennée
2017-08-17 20:13   ` Richard Henderson
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 4/9] helper-head: add support for vec type Alex Bennée
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 5/9] arm/cpu.h: align VFP registers Alex Bennée
2017-08-17 20:13   ` Richard Henderson
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 6/9] target/arm/translate-a64: regnames -> x_regnames Alex Bennée
2017-08-17 20:14   ` Richard Henderson
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 7/9] target/arm/translate-a64: register global vectors Alex Bennée
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 8/9] target/arm/helpers: introduce ADVSIMD flags Alex Bennée
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 9/9] target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[] Alex Bennée
2017-08-17 20:23   ` Richard Henderson
2017-08-17 18:32 ` [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion no-reply
2017-08-18 11:33 ` Kirill Batuzov
2017-08-18 13:44   ` Richard Henderson
2017-08-22  9:04     ` Kirill Batuzov
