qemu-devel.nongnu.org archive mirror
From: Richard Henderson <rth@twiddle.net>
To: qemu-devel <qemu-devel@nongnu.org>
Subject: [Qemu-devel] [RFC] Use of host vector operations in host helper functions
Date: Thu, 28 Aug 2014 08:40:27 -0700
Message-ID: <53FF4D6B.2050202@twiddle.net>

Most of the time, guest vector operations are rare enough that it doesn't
really matter that we implement them with a loop around integer operations.

But for target-alpha, there's one vector comparison operation that appears in
every guest string operation, and is used heavily enough that it's in the top
10 functions in the profile: cmpbge (compare bytes greater or equal).

I did some experiments, rewriting the function using gcc's "generic"
vector types and builtin operations.  Irritatingly, gcc won't use a wider
vector insn to implement a narrower operation, so I had to widen to 16 bytes
by hand in order to get vectorization for SSE2, but:

----------------------------------------------------------------------
diff --git a/target-alpha/int_helper.c b/target-alpha/int_helper.c
index c023fa1..ec71c17 100644
--- a/target-alpha/int_helper.c
+++ b/target-alpha/int_helper.c
@@ -60,6 +60,42 @@ uint64_t helper_zap(uint64_t val, uint64_t mask)

 uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
 {
+#if 1
+    uint64_t r;
+
+    /* The cmpbge instruction is heavily used in the implementation of
+       every string function on Alpha.  We can do much better than either
+       the default loop below, or even an unrolled version by using the
+       native vector support.  */
+    {
+        typedef uint64_t Q __attribute__((vector_size(16)));
+        typedef uint8_t B __attribute__((vector_size(16)));
+
+        Q q1 = (Q){ op1, 0 };
+        Q q2 = (Q){ op2, 0 };
+
+        q1 = (Q)((B)q1 >= (B)q2);
+
+        r = q1[0];
+    }
+
+    /* Select only one bit from each byte.  */
+    r &= 0x0101010101010101;
+
+    /* Collect the bits into the bottom byte.  */
+    /* .......A.......B.......C.......D.......E.......F.......G.......H */
+    r |= r >> (8 - 1);
+
+    /* .......A......AB......BC......CD......DE......EF......FG......GH */
+    r |= r >> (16 - 2);
+
+    /* .......A......AB.....ABC....ABCD....BCDE....CDEF....DEFG....EFGH */
+    r |= r >> (32 - 4);
+
+    /* .......A......AB.....ABC....ABCD...ABCDE..ABCDEF.ABCDEFGABCDEFGH */
+    /* Return only the low 8 bits.  */
+    return r & 0xff;
+#else
     uint8_t opa, opb, res;
     int i;

@@ -72,6 +108,7 @@ uint64_t helper_cmpbge(uint64_t op1, uint64_t op2)
         }
     }
     return res;
+#endif
 }

 uint64_t helper_minub8(uint64_t op1, uint64_t op2)
----------------------------------------------------------------------

allows very good optimization on x86_64:

0000000000000120 <helper_cmpbge>:
 120:   48 89 7c 24 e8          mov    %rdi,-0x18(%rsp)
 125:   48 b8 01 01 01 01 01    movabs $0x101010101010101,%rax
 12c:   01 01 01
 12f:   f3 0f 7e 5c 24 e8       movq   -0x18(%rsp),%xmm3
 135:   48 89 74 24 e8          mov    %rsi,-0x18(%rsp)
 13a:   f3 0f 7e 64 24 e8       movq   -0x18(%rsp),%xmm4
 140:   f3 0f 7e c3             movq   %xmm3,%xmm0
 144:   f3 0f 7e cc             movq   %xmm4,%xmm1
 148:   66 0f 6f d1             movdqa %xmm1,%xmm2
 14c:   66 0f d8 d0             psubusb %xmm0,%xmm2
 150:   66 0f ef c0             pxor   %xmm0,%xmm0
 154:   66 0f 74 c2             pcmpeqb %xmm2,%xmm0
 158:   66 0f 7f 44 24 e8       movdqa %xmm0,-0x18(%rsp)
 15e:   48 8b 54 24 e8          mov    -0x18(%rsp),%rdx
 163:   48 21 c2                and    %rax,%rdx
 166:   48 89 d0                mov    %rdx,%rax
 169:   48 c1 e8 07             shr    $0x7,%rax
 16d:   48 09 d0                or     %rdx,%rax
 170:   48 89 c2                mov    %rax,%rdx
 173:   48 c1 ea 0e             shr    $0xe,%rdx
 177:   48 09 c2                or     %rax,%rdx
 17a:   48 89 d0                mov    %rdx,%rax
 17d:   48 c1 e8 1c             shr    $0x1c,%rax
 181:   48 09 d0                or     %rdx,%rax
 184:   0f b6 c0                movzbl %al,%eax
 187:   c3                      retq

which is just about as good as you could hope for (modulo two extra movq insns).

Profiling a (guest) compilation of glibc, helper_cmpbge is reduced from 3% to
0.8% of emulation time, and from 7th to 11th in the ranking.

GCC doesn't do a half-bad job on other hosts either:

aarch64:
  b4:   4f000400        movi    v0.4s, #0x0
  b8:   4ea01c01        mov     v1.16b, v0.16b
  bc:   4e081c00        mov     v0.d[0], x0
  c0:   4e081c21        mov     v1.d[0], x1
  c4:   6e213c00        cmhs    v0.16b, v0.16b, v1.16b
  c8:   4e083c00        mov     x0, v0.d[0]
  cc:   9200c000        and     x0, x0, #0x101010101010101
  d0:   aa401c00        orr     x0, x0, x0, lsr #7
  d4:   aa403800        orr     x0, x0, x0, lsr #14
  d8:   aa407000        orr     x0, x0, x0, lsr #28
  dc:   53001c00        uxtb    w0, w0
  e0:   d65f03c0        ret

Of course aarch64 *does* have an 8-byte vector size that gcc knows how to use.
If I adjust the patch above to use it, only the first two insns are eliminated
-- surely not a measurable difference.
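For reference, the 8-byte adjustment would look something like this (a sketch;
cmpbge_v8 is an illustrative name, and memcpy is used to move between the
integer and vector views without aliasing trouble):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the 8-byte-vector variant: with vector_size(8) there is
   no need to widen by hand, so the zero-extending moves disappear on
   hosts whose compiler supports native 8-byte vectors.  */
static uint64_t cmpbge_v8(uint64_t op1, uint64_t op2)
{
    typedef uint8_t B8 __attribute__((vector_size(8)));
    B8 b1, b2;

    memcpy(&b1, &op1, 8);
    memcpy(&b2, &op2, 8);

    /* Unsigned per-byte >=; each mask byte is 0xff or 0x00.  */
    B8 m = (b1 >= b2);
    uint64_t r;
    memcpy(&r, &m, 8);

    /* Same bit-gather as the 16-byte version.  */
    r &= 0x0101010101010101ULL;
    r |= r >> 7;
    r |= r >> 14;
    r |= r >> 28;
    return r & 0xff;
}
```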

power7:
  ...
  vcmpgtub 13,0,1
  vcmpequb 0,0,1
  xxlor 32,45,32
  ...


But I guess the larger question here is: how much of this should we accept?

(0) Ignore this and do nothing?

(1) No general infrastructure.  Special case this one insn with #ifdef __SSE2__
and ignore anything else.

(2) Put in just enough infrastructure to know if compiler support for general
vectors is available, and then use it ad hoc when such functions are shown to
be high on the profile?

(3) Put in more infrastructure and allow it to be used to implement most guest
vector operations, possibly tidying their implementations?



r~
