From: Kirill Batuzov <batuzovk@ispras.ru>
To: qemu-devel@nongnu.org
Cc: "Richard Henderson" <rth@twiddle.net>,
	"Alex Bennée" <alex.bennee@linaro.org>,
	"Kirill Batuzov" <batuzovk@ispras.ru>
Subject: [Qemu-devel] [PATCH RFC 0/7] Translate guest vector operations to host vector operations
Date: Thu, 16 Oct 2014 12:56:47 +0400
Message-ID: <cover.1413286807.git.batuzovk@ispras.ru>
In-Reply-To: <87k3571pb5.fsf@linaro.org>

> (4) Consider supporting generic vector operations in the TCG?

I gave it a go and was quite happy with the result. I implemented the add_i32x4
opcode, which adds 128-bit vectors composed of four 32-bit integers, and used it
to translate the NEON vadd.i32 instruction to the SSE paddd instruction. I used
ARM as my guest because I'm familiar with the architecture and it is different
from my host.
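
As a point of reference, the intended semantics of add_i32x4 can be sketched in
plain C. This is only an illustrative model, not code from the patch series;
the vec_i32x4 type and the function below are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model: a 128-bit vector seen as four
 * independent 32-bit lanes. */
typedef struct {
    uint32_t lane[4];
} vec_i32x4;

/* Lane-wise addition. Each lane wraps around on its own and no carry
 * crosses a lane boundary, which is what vadd.i32 and paddd compute. */
static vec_i32x4 add_i32x4(vec_i32x4 a, vec_i32x4 b)
{
    vec_i32x4 r;

    /* The model is meant to be exactly 128 bits wide. */
    assert(sizeof(vec_i32x4) == 16);

    for (int i = 0; i < 4; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}
```

A scalar fallback for a host without vector support would have to perform these
four additions one by one.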

I got a 3x speedup on my testcase:

    mov         r0, #0xb0000000
loop:
    vadd.i32    q0, q0, q1
    vadd.i32    q0, q0, q1
    vadd.i32    q0, q0, q1
    vadd.i32    q0, q0, q1
    subs        r0, r0, #1
    bne         loop

Evaluation results:

master: 25.398s
patched: 7.704s

Generated code:

IN: 
0x00008298:  f2200842      vadd.i32	q0, q0, q1
0x0000829c:  f2200842      vadd.i32	q0, q0, q1
0x000082a0:  f2200842      vadd.i32	q0, q0, q1
0x000082a4:  f2200842      vadd.i32	q0, q0, q1
<...>

OP after optimization and liveness analysis:
 ld_i32 tmp5,env,$0xfffffffffffffffc
 movi_i32 tmp6,$0x0
 brcond_i32 tmp5,tmp6,ne,$0x0
 ---- 0x8298
 add_i32x4 q0,q0,q1

 ---- 0x829c
 add_i32x4 q0,q0,q1

 ---- 0x82a0
 add_i32x4 q0,q0,q1

 ---- 0x82a4
 add_i32x4 q0,q0,q1
<...>

OUT: [size=196]
0x60442450:  mov    -0x4(%r14),%ebp
0x60442454:  test   %ebp,%ebp
0x60442456:  jne    0x60442505
0x6044245c:  movdqu 0x658(%r14),%xmm0
0x60442465:  movdqu 0x668(%r14),%xmm1
0x6044246e:  paddd  %xmm1,%xmm0
0x60442472:  paddd  %xmm1,%xmm0
0x60442476:  paddd  %xmm1,%xmm0
0x6044247a:  paddd  %xmm1,%xmm0
0x6044247e:  movdqu %xmm0,0x658(%r14)
<...>

> But for target-alpha, there's one vector comparison operation that appears in
> every guest string operation, and is used heavily enough that it's in the top
> 10 functions in the profile: cmpbge (compare bytes greater or equal).

cmpbge can be translated as follows:

cmpge_i8x8      tmp0, arg1, arg2
select_msb_i8x8 res, tmp0

where cmpge is "compare greater or equal" with the following semantics:
res[i] = <111...11> if arg1[i] >= arg2[i]
res[i] = <000...00> if arg1[i] <  arg2[i]
NEON has such an operation. In SSE we can emulate it with PCMPEQB, PCMPGTB
and POR.

select_msb is "select most significant bit". It corresponds to the SSE
instruction PMOVMSKB.
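
A scalar C sketch of the two proposed opcodes may make the decomposition
clearer. It is a reference model only: the function names mirror the opcode
names above, and it assumes the bytes are compared as unsigned values, as Alpha
cmpbge does. PCMPGTB compares signed bytes, so an SSE implementation would
have to account for that.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of cmpge_i8x8: byte i of the result is 0xff where
 * a[i] >= b[i] and 0x00 otherwise. Bytes are treated as unsigned
 * (an assumption of this sketch). */
static uint64_t cmpge_i8x8(uint64_t a, uint64_t b)
{
    uint64_t res = 0;

    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        if (x >= y) {
            res |= (uint64_t)0xff << (8 * i);
        }
    }
    return res;
}

/* Reference model of select_msb_i8x8: gather the top bit of byte i
 * into bit i of the result. */
static uint64_t select_msb_i8x8(uint64_t v)
{
    uint64_t res = 0;

    for (int i = 0; i < 8; i++) {
        res |= ((v >> (8 * i + 7)) & 1) << i;
    }
    return res;
}

/* cmpbge expressed as the composition of the two opcodes. */
static uint64_t cmpbge(uint64_t a, uint64_t b)
{
    uint64_t mask = select_msb_i8x8(cmpge_i8x8(a, b));

    assert(mask <= 0xff); /* only eight result bits can be set */
    return mask;
}
```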

> While making helper functions faster is good I've wondered if they is
> enough genericsm across the various SIMD/vector operations we could add
> add TCG ops to translate them? The ops could fall back to generic helper
> functions using the GCC instrinsics if we know there is no decent
> back-end support for them?

From Valgrind experience, there is enough genericism. Valgrind can translate
SSE, AltiVec and NEON instructions to vector opcodes, and most of the opcodes
are reused between instruction sets.

But keep in mind that there are a lot of vector opcodes, many more than scalar
ones. You can see the full list in the Valgrind sources (VEX/pub/libvex_ir.h).

We can reduce the number of opcodes by turning the vector element size from
part of the opcode into a constant argument. But then we lose some of the
flexibility offered by the TARGET_HAS_opcode macro when a target supports some
sizes but not others. For example, SSE has vector minimum for sizes i8x16,
i16x8 and i32x4 but does not have one for size i64x2.
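
The trade-off can be illustrated with a small sketch. The macro and function
names below are made up for illustration and do not come from the patches. With
one opcode per element size, support is a compile-time constant per opcode;
with the size as an argument, the check has to become a lookup on that
argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Variant A: element size is part of the opcode, so each size gets its
 * own (hypothetical) capability macro that the backend header defines. */
#define TOY_TARGET_HAS_min_i8x16 1
#define TOY_TARGET_HAS_min_i16x8 1
#define TOY_TARGET_HAS_min_i32x4 1
#define TOY_TARGET_HAS_min_i64x2 0

/* Variant B: a single min opcode takes the element size as a constant
 * argument, so one macro can no longer describe the backend and the
 * front end needs a per-size query instead. */
static bool toy_target_has_min_vec(unsigned elem_bits)
{
    assert(elem_bits != 0);

    switch (elem_bits) {
    case 8:
        return TOY_TARGET_HAS_min_i8x16;
    case 16:
        return TOY_TARGET_HAS_min_i16x8;
    case 32:
        return TOY_TARGET_HAS_min_i32x4;
    case 64:
        return TOY_TARGET_HAS_min_i64x2;
    default:
        return false;
    }
}
```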

Some implementation details and concerns.

The most problematic issue was that with vector registers we have one entity
that can be accessed both as a global variable and as a memory location. I
solved it by introducing the sync_temp opcode, which instructs the register
allocator to save a global variable to its memory location if it is in a
register. If the variable is not in a register or memory is already coherent,
no store is issued, so the performance penalty is minimal. Still, this approach
has a serious drawback: we need to generate sync_temp explicitly. But I do not
know any better way to achieve consistency.
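
The rule sync_temp follows can be modelled in a few lines of C. This is a toy
model under my own assumptions, not the register allocator code from the patch:
ToyGlobal and its fields are hypothetical stand-ins for the allocator's state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the allocator's view of one 128-bit global. */
typedef struct {
    bool in_reg;         /* value currently lives in a host register */
    bool mem_coherent;   /* memory slot already holds the same value */
    uint8_t reg_val[16]; /* contents of the (modelled) host register */
    uint8_t *mem_slot;   /* the global's home location in memory */
} ToyGlobal;

/* Make memory coherent with the register copy. Returns true if a store
 * had to be issued, false if nothing needed to be done. */
static bool toy_sync_temp(ToyGlobal *g)
{
    assert(g != NULL && g->mem_slot != NULL);

    if (!g->in_reg || g->mem_coherent) {
        return false; /* nothing in a register, or memory is up to date */
    }
    memcpy(g->mem_slot, g->reg_val, sizeof(g->reg_val));
    g->mem_coherent = true;
    return true;
}
```

The value stays in the register after the sync, so later vector operations can
keep using it without a reload.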

Note that as of this RFC I have not finished the conversion of the ARM guest,
so mixing NEON with VFP code can cause a miscompile.

The second problem is that a backend may or may not support vector operations.
We do not want each frontend to check for this on every operation. I created a
wrapper that generates the vector opcode if it is supported and generates
emulation code otherwise.
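
The shape of such a wrapper, reduced to a toy op stream, might look like the
sketch below. All names are hypothetical and the real wrapper in the series
works on TCG ops; here the backend capability is passed in as a flag so both
paths are easy to see.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy opcodes: one vector add and the scalar add used to emulate it. */
typedef enum { TOY_OP_ADD_I32X4, TOY_OP_ADD_I32 } ToyOpcode;

/* Toy op stream standing in for the list of generated TCG ops. */
typedef struct {
    ToyOpcode ops[16];
    size_t n;
} ToyOpStream;

static void toy_emit(ToyOpStream *s, ToyOpcode op)
{
    assert(s->n < sizeof(s->ops) / sizeof(s->ops[0]));
    s->ops[s->n++] = op;
}

/* The front end always calls this wrapper and never checks the backend
 * itself. The loads and stores around each scalar lane are omitted. */
static void toy_gen_add_i32x4(ToyOpStream *s, bool backend_has_vector)
{
    if (backend_has_vector) {
        toy_emit(s, TOY_OP_ADD_I32X4);    /* one host vector op */
    } else {
        for (int lane = 0; lane < 4; lane++) {
            toy_emit(s, TOY_OP_ADD_I32);  /* inline per-lane emulation */
        }
    }
}
```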

For add_i32x4, the emulation code is generated inline. I tried to make it a
helper but got a very significant performance loss (5x slowdown). I'm not sure
about the cause, but I suspect that memory was the bottleneck and the extra
stores required by the calling convention mattered a lot.

The existing constraints are good enough to express that vector registers and
general purpose registers are different and cannot be used interchangeably.

One unsolved problem is global aliasing. With general purpose registers we have
no aliasing between globals. The only example I know of where registers can
alias is the x86 ah/ax/eax/rax case, and those are handled as one global. With
vector registers we have NEON, where a 128-bit Q register consists of two
64-bit D registers, each consisting of two 32-bit S registers. I think I'll
need to add an alias list to each global, listing every other global it can
clobber, and then iterate over it in the optimizer. Fortunately this list will
be static and not very long.
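
One possible shape for such a static alias list is sketched below. It is not
part of this series; the table layout and names are assumptions for
illustration, and only the registers overlapping q0 are listed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical alias entry: a global and the other globals whose
 * contents change when it is written (NULL-terminated). */
typedef struct {
    const char *name;
    const char *aliases[7];
} ToyGlobalAlias;

/* q0 = {d0, d1}, d0 = {s0, s1}, d1 = {s2, s3} */
static const ToyGlobalAlias toy_alias_table[] = {
    { "q0", { "d0", "d1", "s0", "s1", "s2", "s3", NULL } },
    { "d0", { "q0", "s0", "s1", NULL } },
    { "d1", { "q0", "s2", "s3", NULL } },
    { "s0", { "q0", "d0", NULL } },
};

/* Would a write to 'written' invalidate what is known about 'other'?
 * An optimizer pass would ask this for every tracked global. */
static bool toy_clobbers(const char *written, const char *other)
{
    size_t count = sizeof(toy_alias_table) / sizeof(toy_alias_table[0]);

    assert(written != NULL && other != NULL);

    for (size_t i = 0; i < count; i++) {
        if (strcmp(toy_alias_table[i].name, written) != 0) {
            continue;
        }
        for (size_t j = 0; toy_alias_table[i].aliases[j] != NULL; j++) {
            if (strcmp(toy_alias_table[i].aliases[j], other) == 0) {
                return true;
            }
        }
    }
    return false;
}
```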

Why I think all this is worth doing:

(1) Performance. A 3x speedup is a lot. My test was specifically crafted and
    real-life applications may not have that many vector operations on average,
    but there is a specific class of applications where it will matter a lot:
    media processing applications like ffmpeg.

(2) Some unification of common operations. Right now every target reimplements
    common vector operations (like vector add/sub/mul/min/compare etc.). We can
    do it once in the common TCG code.

Still, there are some cons, as I mentioned earlier. I think the need to support
a lot of opcodes is the most significant one in the long run. So before I
commit my time to converting more operations, I'd like to hear your opinions on
whether this approach is acceptable and worth the effort.

Kirill Batuzov (7):
  tcg: add support for 128bit vector type
  tcg: store ENV global in TCGContext
  tcg: add sync_temp opcode
  tcg: add add_i32x4 opcode
  target-arm: support access to 128-bit guest registers as globals
  target-arm: use add_i32x4 opcode to handle vadd.i32 instruction
  tcg/i386: add support for vector opcodes

 target-arm/translate.c |   30 ++++++++++-
 tcg/i386/tcg-target.c  |  103 ++++++++++++++++++++++++++++++++---
 tcg/i386/tcg-target.h  |   24 ++++++++-
 tcg/tcg-op.h           |  141 ++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg-opc.h          |   13 +++++
 tcg/tcg.c              |   36 +++++++++++++
 tcg/tcg.h              |   34 ++++++++++++
 7 files changed, 371 insertions(+), 10 deletions(-)

-- 
1.7.10.4
