Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
       [not found] <579ad8bd.8481620a.89e78.f1ce@mx.google.com>
@ 2016-07-29  4:44 ` Benjamin Herrenschmidt
  2016-07-29  6:42 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  4:44 UTC (permalink / raw)
  To: Richard Henderson; +Cc: David Gibson, qemu-devel, qemu-ppc

On Fri, 2016-07-29 at 09:46 +0530, Richard Henderson wrote:
> 
> But that doesn't yet make the leap to 128-bit types in tcg.
> I was going to raise that topic during the 2.8 cycle, since as a
> consequence I want to drop support for 32-bit hosts, at least for 64-
> bit guests, and maybe entirely.

I wonder (or is it too chancy ?) if it's worthwhile having a TCG
ops that in effect does a translate and returns a host pointer...

That or a slightly more convoluted way of asking the backend to
produce a series of load/stores. We can provide the element size
(so it can do byteswap) and the base reg. Maybe a stride too.

We can have a default impl that turns it into a loop of existing
load/stores and smart backends can check for page boundaries
and limit the translations.

That would allow a much more efficient implementation for example
of ppc store/load multiplem which are heavily used by 32-bit guest
code for example.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
       [not found] <579ad8bd.8481620a.89e78.f1ce@mx.google.com>
  2016-07-29  4:44 ` [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl Benjamin Herrenschmidt
@ 2016-07-29  6:42 ` Benjamin Herrenschmidt
  2016-07-29  6:56   ` Benjamin Herrenschmidt
  2016-07-29  6:58   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  6:42 UTC (permalink / raw)
  To: Richard Henderson; +Cc: David Gibson, qemu-devel, qemu-ppc

On Fri, 2016-07-29 at 09:46 +0530, Richard Henderson wrote:
> The form of declaration you're using takes care of that.  In order to
> not clobber, you have to use DEF_HELPER_FLAGS with *NO_RWG*.

BTW. Is that stuff (and the various flags here) somewhat documented
anywhere ? :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  6:42 ` Benjamin Herrenschmidt
@ 2016-07-29  6:56   ` Benjamin Herrenschmidt
  2016-07-29 12:30     ` Richard Henderson
  2016-07-29 12:37     ` Richard Henderson
  2016-07-29  6:58   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  6:56 UTC (permalink / raw)
  To: Richard Henderson; +Cc: David Gibson, qemu-devel, qemu-ppc

On Fri, 2016-07-29 at 16:42 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2016-07-29 at 09:46 +0530, Richard Henderson wrote:
> > 
> > The form of declaration you're using takes care of that.  In order
> > to
> > not clobber, you have to use DEF_HELPER_FLAGS with *NO_RWG*.
> 
> BTW. Is that stuff (and the various flags here) somewhat documented
> anywhere ? :-)

More specifically, looking at the flags defined:

#define TCG_CALL_NO_READ_GLOBALS    0x0010
/* Helper does not write globals */
#define TCG_CALL_NO_WRITE_GLOBALS   0x0020
/* Helper can be safely suppressed if the return value is not used. */
#define TCG_CALL_NO_SIDE_EFFECTS    0x0040

I assume by "globals" that means things defined
with tcg_global_mem_new_i32() correct ?

So I can't do something like set TCG_CALL_NO_RWG (nor TCG_CALL_NO_WG)
on something like dcbz because it can take an exception, correct ?

Now if a helper doens't access env->gpr/spr/... but instead take
everything as in/out arguments and cannot take an exception, we can use
these, right ?

I notice that sadly, all of the vector ops are helper with full
clobbers, because I assume, the "avr" is passed as pointer due to the
lack of an int128 type in TCG correct ?

Is the only option here to go down the int128 path or would there be a
reasonable way to add a mechanism for more fine grained clobbers ?

Another reason why I think that might be handy is that a whole lot of
helpers basically have full clobbers simply because they update a CR
bit (or fscr for FP ops). It would be nice to be able to mark just
that clobbered.

Maybe by declaring up to a handful of "globals" special, assigning them
an index and having a special clobber bits corresponding to them in the
flags ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  6:56   ` Benjamin Herrenschmidt
@ 2016-07-29 12:30     ` Richard Henderson
  2016-07-29 12:37     ` Richard Henderson
  1 sibling, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2016-07-29 12:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: David Gibson, qemu-devel, qemu-ppc

On 07/29/2016 12:26 PM, Benjamin Herrenschmidt wrote:
> So I can't do something like set TCG_CALL_NO_RWG (nor TCG_CALL_NO_WG)
> on something like dcbz because it can take an exception, correct ?

You can't set NO_RWG, but you can say NO_WG.  It "reads" the registers via the 
exception (and it certainly has side effects).  But it doesn't write to the 
registers (along the normal return path; a loophole I've used elsewhere).


r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  6:56   ` Benjamin Herrenschmidt
  2016-07-29 12:30     ` Richard Henderson
@ 2016-07-29 12:37     ` Richard Henderson
  1 sibling, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2016-07-29 12:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: David Gibson, qemu-devel, qemu-ppc

On 07/29/2016 12:26 PM, Benjamin Herrenschmidt wrote:
> I notice that sadly, all of the vector ops are helper with full
> clobbers, because I assume, the "avr" is passed as pointer due to the
> lack of an int128 type in TCG correct ?

Yes.  Although x86 doesn't declare the vector registers as tcg registers at 
all.  Which is both a blessing and a curse.

My opinion is that aarch64 is the model to aim for -- most operations can 
operate on 64-bit slices at a time, no registers clobbered.  There will always 
be exceptions, of course, but hopefully with rarer insns.

r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  6:42 ` Benjamin Herrenschmidt
  2016-07-29  6:56   ` Benjamin Herrenschmidt
@ 2016-07-29  6:58   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  6:58 UTC (permalink / raw)
  To: Richard Henderson; +Cc: David Gibson, qemu-devel, qemu-ppc

On Fri, 2016-07-29 at 16:42 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2016-07-29 at 09:46 +0530, Richard Henderson wrote:
> > 
> > The form of declaration you're using takes care of that.  In order
> > to
> > not clobber, you have to use DEF_HELPER_FLAGS with *NO_RWG*.
> 
> BTW. Is that stuff (and the various flags here) somewhat documented
> anywhere ? :-)

I found the README ! :-)

Sorry for the noise.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Qemu-devel] [PATCH 01/32] ppc: Fix fault PC reporting for lve*/stve* VMX instructions
@ 2016-07-26 22:20 Benjamin Herrenschmidt
  2016-07-26 22:21 ` [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-26 22:20 UTC (permalink / raw)
  To: qemu-ppc; +Cc: qemu-devel, david, Benjamin Herrenschmidt

We forgot to do gen_update_nip() for these like we do with other
helpers. Fix this, but in a more efficient way by passing the RA
to the accessors instead so the overhead is only taken on faults.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 target-ppc/mem_helper.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/target-ppc/mem_helper.c b/target-ppc/mem_helper.c
index e4de86b..e4ed377 100644
--- a/target-ppc/mem_helper.c
+++ b/target-ppc/mem_helper.c
@@ -232,16 +232,16 @@ target_ulong helper_lscbx(CPUPPCState *env, target_ulong addr, uint32_t reg,
                                                                 \
         if (needs_byteswap(env)) {                              \
             r->element[LO_IDX ? index : (adjust - index)] =     \
-                swap(access(env, addr));                        \
+                swap(access(env, addr, GETPC()));               \
         } else {                                                \
             r->element[LO_IDX ? index : (adjust - index)] =     \
-                access(env, addr);                              \
+                access(env, addr, GETPC());                     \
         }                                                       \
     }
 #define I(x) (x)
-LVE(lvebx, cpu_ldub_data, I, u8)
-LVE(lvehx, cpu_lduw_data, bswap16, u16)
-LVE(lvewx, cpu_ldl_data, bswap32, u32)
+LVE(lvebx, cpu_ldub_data_ra, I, u8)
+LVE(lvehx, cpu_lduw_data_ra, bswap16, u16)
+LVE(lvewx, cpu_ldl_data_ra, bswap32, u32)
 #undef I
 #undef LVE
 
@@ -259,16 +259,17 @@ LVE(lvewx, cpu_ldl_data, bswap32, u32)
                                                                         \
         if (needs_byteswap(env)) {                                      \
             access(env, addr, swap(r->element[LO_IDX ? index :          \
-                                              (adjust - index)]));      \
+                                              (adjust - index)]),       \
+                        GETPC());                                       \
         } else {                                                        \
             access(env, addr, r->element[LO_IDX ? index :               \
-                                         (adjust - index)]);            \
+                                         (adjust - index)], GETPC());   \
         }                                                               \
     }
 #define I(x) (x)
-STVE(stvebx, cpu_stb_data, I, u8)
-STVE(stvehx, cpu_stw_data, bswap16, u16)
-STVE(stvewx, cpu_stl_data, bswap32, u32)
+STVE(stvebx, cpu_stb_data_ra, I, u8)
+STVE(stvehx, cpu_stw_data_ra, bswap16, u16)
+STVE(stvewx, cpu_stl_data_ra, bswap32, u32)
 #undef I
 #undef LVE
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-26 22:20 [Qemu-devel] [PATCH 01/32] ppc: Fix fault PC reporting for lve*/stve* VMX instructions Benjamin Herrenschmidt
@ 2016-07-26 22:21 ` Benjamin Herrenschmidt
  2016-07-29  0:49   ` Richard Henderson
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-26 22:21 UTC (permalink / raw)
  To: qemu-ppc; +Cc: qemu-devel, david, Benjamin Herrenschmidt

Those are always naturally aligned, so cannot cross a page boundary,
thus instead of generating two 8-byte loads with translation on each
(and double swap for LE on LE), we use a helper that will do a single
translation and memcpy the result over (or do appropriate swapping
if needed).

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 target-ppc/helper.h             |  2 ++
 target-ppc/mem_helper.c         | 60 +++++++++++++++++++++++++++++++++++++++++
 target-ppc/translate/vmx-impl.c | 38 ++++++++------------------
 target-ppc/translate/vmx-ops.c  |  4 +--
 4 files changed, 75 insertions(+), 29 deletions(-)

diff --git a/target-ppc/helper.h b/target-ppc/helper.h
index 1f5cfd0..64f7d2c 100644
--- a/target-ppc/helper.h
+++ b/target-ppc/helper.h
@@ -269,9 +269,11 @@ DEF_HELPER_5(vmsumshm, void, env, avr, avr, avr, avr)
 DEF_HELPER_5(vmsumshs, void, env, avr, avr, avr, avr)
 DEF_HELPER_4(vmladduhm, void, avr, avr, avr, avr)
 DEF_HELPER_2(mtvscr, void, env, avr)
+DEF_HELPER_3(lvx, void, env, i32, tl)
 DEF_HELPER_3(lvebx, void, env, avr, tl)
 DEF_HELPER_3(lvehx, void, env, avr, tl)
 DEF_HELPER_3(lvewx, void, env, avr, tl)
+DEF_HELPER_3(stvx, void, env, i32, tl)
 DEF_HELPER_3(stvebx, void, env, avr, tl)
 DEF_HELPER_3(stvehx, void, env, avr, tl)
 DEF_HELPER_3(stvewx, void, env, avr, tl)
diff --git a/target-ppc/mem_helper.c b/target-ppc/mem_helper.c
index 6548715..da3f973 100644
--- a/target-ppc/mem_helper.c
+++ b/target-ppc/mem_helper.c
@@ -225,6 +225,66 @@ target_ulong helper_lscbx(CPUPPCState *env, target_ulong addr, uint32_t reg,
 #define LO_IDX 0
 #endif
 
+void helper_lvx(CPUPPCState *env, uint32_t vr, target_ulong addr)
+{
+    uintptr_t raddr = GETPC();
+    ppc_avr_t *haddr;
+
+    /* Align address */
+    addr &= ~(target_ulong)0xf;
+
+    /* Try fast path translate */
+    haddr = tlb_vaddr_to_host(env, addr, MMU_DATA_LOAD, env->dmmu_idx);
+    if (haddr) {
+        if (msr_le && HI_IDX) {
+            memcpy(&env->avr[vr], haddr, 16);
+        } else {
+            env->avr[vr].u64[LO_IDX] = bswap64(haddr->u64[HI_IDX]);
+            env->avr[vr].u64[HI_IDX] = bswap64(haddr->u64[LO_IDX]);
+        }
+    } else {
+        if (needs_byteswap(env)) {
+            env->avr[vr].u64[LO_IDX] =
+                bswap64(cpu_ldq_data_ra(env, addr, raddr));
+            env->avr[vr].u64[HI_IDX] =
+                bswap64(cpu_ldq_data_ra(env, addr + 8, raddr));
+        } else {
+            env->avr[vr].u64[HI_IDX] = cpu_ldq_data_ra(env, addr, raddr);
+            env->avr[vr].u64[LO_IDX] = cpu_ldq_data_ra(env, addr + 8, raddr);
+        }
+    }
+}
+
+void helper_stvx(CPUPPCState *env, uint32_t vr, target_ulong addr)
+{
+    uintptr_t raddr = GETPC();
+    ppc_avr_t *haddr;
+
+    /* Align address */
+    addr &= ~(target_ulong)0xf;
+
+    /* Try fast path translate */
+    haddr = tlb_vaddr_to_host(env, addr, MMU_DATA_STORE, env->dmmu_idx);
+    if (haddr) {
+        if (msr_le && HI_IDX) {
+            memcpy(haddr, &env->avr[vr], 16);
+        } else {
+            haddr->u64[LO_IDX] = bswap64(env->avr[vr].u64[HI_IDX]);
+            haddr->u64[HI_IDX] = bswap64(env->avr[vr].u64[LO_IDX]);
+        }
+    } else {
+        if (needs_byteswap(env)) {
+            cpu_stq_data_ra(env, addr,
+                            bswap64(env->avr[vr].u64[LO_IDX]), raddr);
+            cpu_stq_data_ra(env, addr + 8,
+                            bswap64(env->avr[vr].u64[HI_IDX]), raddr);
+        } else {
+            cpu_stq_data_ra(env, addr, env->avr[vr].u64[HI_IDX], raddr);
+            cpu_stq_data_ra(env, addr + 8, env->avr[vr].u64[LO_IDX], raddr);
+        }
+    }
+}
+
 /* We use msr_le to determine index ordering in a vector.  However,
    byteswapping is not simply controlled by msr_le.  We also need to take
    into account endianness of the target.  This is done for the little-endian
diff --git a/target-ppc/translate/vmx-impl.c b/target-ppc/translate/vmx-impl.c
index 110e19c..a58aa0c 100644
--- a/target-ppc/translate/vmx-impl.c
+++ b/target-ppc/translate/vmx-impl.c
@@ -15,55 +15,39 @@ static inline TCGv_ptr gen_avr_ptr(int reg)
 }
 
 #define GEN_VR_LDX(name, opc2, opc3)                                          \
-static void glue(gen_, name)(DisasContext *ctx)                                       \
+static void glue(gen_, name)(DisasContext *ctx)                               \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i32 t0;                                                              \
     if (unlikely(!ctx->altivec_enabled)) {                                    \
         gen_exception(ctx, POWERPC_EXCP_VPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_INT);                                     \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_const_i32(rD(ctx->opcode));                                      \
     gen_addr_reg_index(ctx, EA);                                              \
-    tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
-    /* We only need to swap high and low halves. gen_qemu_ld64 does necessary \
-       64-bit byteswap already. */                                            \
-    if (ctx->le_mode) {                                                       \
-        gen_qemu_ld64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                    \
-        tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_ld64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                    \
-    } else {                                                                  \
-        gen_qemu_ld64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                    \
-        tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_ld64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                    \
-    }                                                                         \
+    gen_helper_lvx(cpu_env, t0, EA);                                          \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i32(t0);                                                    \
 }
 
 #define GEN_VR_STX(name, opc2, opc3)                                          \
 static void gen_st##name(DisasContext *ctx)                                   \
 {                                                                             \
     TCGv EA;                                                                  \
+    TCGv_i32 t0;                                                              \
     if (unlikely(!ctx->altivec_enabled)) {                                    \
         gen_exception(ctx, POWERPC_EXCP_VPU);                                 \
         return;                                                               \
     }                                                                         \
     gen_set_access_type(ctx, ACCESS_INT);                                     \
     EA = tcg_temp_new();                                                      \
+    t0 = tcg_const_i32(rD(ctx->opcode));                                      \
     gen_addr_reg_index(ctx, EA);                                              \
-    tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
-    /* We only need to swap high and low halves. gen_qemu_st64 does necessary \
-       64-bit byteswap already. */                                            \
-    if (ctx->le_mode) {                                                       \
-        gen_qemu_st64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                    \
-        tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_st64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                    \
-    } else {                                                                  \
-        gen_qemu_st64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                    \
-        tcg_gen_addi_tl(EA, EA, 8);                                           \
-        gen_qemu_st64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                    \
-    }                                                                         \
+    gen_helper_stvx(cpu_env, t0, EA);                                         \
     tcg_temp_free(EA);                                                        \
+    tcg_temp_free_i32(t0);                                                    \
 }
 
 #define GEN_VR_LVE(name, opc2, opc3, size)                              \
@@ -116,9 +100,9 @@ GEN_VR_LVE(bx, 0x07, 0x00, 1);
 GEN_VR_LVE(hx, 0x07, 0x01, 2);
 GEN_VR_LVE(wx, 0x07, 0x02, 4);
 
-GEN_VR_STX(svx, 0x07, 0x07);
+GEN_VR_STX(vx, 0x07, 0x07);
 /* As we don't emulate the cache, stvxl is stricly equivalent to stvx */
-GEN_VR_STX(svxl, 0x07, 0x0F);
+GEN_VR_STX(vxl, 0x07, 0x0F);
 
 GEN_VR_STVE(bx, 0x07, 0x04, 1);
 GEN_VR_STVE(hx, 0x07, 0x05, 2);
diff --git a/target-ppc/translate/vmx-ops.c b/target-ppc/translate/vmx-ops.c
index b9c982a..6c7d150 100644
--- a/target-ppc/translate/vmx-ops.c
+++ b/target-ppc/translate/vmx-ops.c
@@ -11,8 +11,8 @@ GEN_VR_LDX(lvxl, 0x07, 0x0B),
 GEN_VR_LVE(bx, 0x07, 0x00),
 GEN_VR_LVE(hx, 0x07, 0x01),
 GEN_VR_LVE(wx, 0x07, 0x02),
-GEN_VR_STX(svx, 0x07, 0x07),
-GEN_VR_STX(svxl, 0x07, 0x0F),
+GEN_VR_STX(vx, 0x07, 0x07),
+GEN_VR_STX(vxl, 0x07, 0x0F),
 GEN_VR_STVE(bx, 0x07, 0x04),
 GEN_VR_STVE(hx, 0x07, 0x05),
 GEN_VR_STVE(wx, 0x07, 0x06),
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-26 22:21 ` [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl Benjamin Herrenschmidt
@ 2016-07-29  0:49   ` Richard Henderson
  2016-07-29  2:13     ` Benjamin Herrenschmidt
  2016-07-29  9:00     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 15+ messages in thread
From: Richard Henderson @ 2016-07-29  0:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, qemu-ppc; +Cc: qemu-devel, david

On 07/27/2016 03:51 AM, Benjamin Herrenschmidt wrote:
> -    tcg_gen_andi_tl(EA, EA, ~0xf);                                            \
> -    /* We only need to swap high and low halves. gen_qemu_ld64 does necessary \
> -       64-bit byteswap already. */                                            \
> -    if (ctx->le_mode) {                                                       \
> -        gen_qemu_ld64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                    \
> -        tcg_gen_addi_tl(EA, EA, 8);                                           \
> -        gen_qemu_ld64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                    \
> -    } else {                                                                  \
> -        gen_qemu_ld64(ctx, cpu_avrh[rD(ctx->opcode)], EA);                    \
> -        tcg_gen_addi_tl(EA, EA, 8);                                           \
> -        gen_qemu_ld64(ctx, cpu_avrl[rD(ctx->opcode)], EA);                    \
> -    }                                                                         \
> +    gen_helper_lvx(cpu_env, t0, EA);                                          \

This, I'm not so keen on.

(1) The helper, since it writes to registers controlled by tcg, must be 
described to clobber all registers.  Which will noticeably increase memory 
traffic to ENV.  For instance, you won't be able to hold the guest register 
holding the address in a host register across the call.

(2) We're going to have to teach tcg about 16-byte data types soon anyway, for 
the proper emulation of 16-byte atomic operations.


r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  0:49   ` Richard Henderson
@ 2016-07-29  2:13     ` Benjamin Herrenschmidt
  2016-07-29  3:34       ` David Gibson
  2016-07-29  9:00     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  2:13 UTC (permalink / raw)
  To: Richard Henderson, qemu-ppc; +Cc: qemu-devel, david

On Fri, 2016-07-29 at 06:19 +0530, Richard Henderson wrote:
> This, I'm not so keen on.
> 
> (1) The helper, since it writes to registers controlled by tcg, must be 
> described to clobber all registers.  Which will noticeably increase memory 
> traffic to ENV.  For instance, you won't be able to hold the guest register 
> holding the address in a host register across the call.

Ah I wasn't aware of this. How do you describe such a clobber ? Can I describe
specifically which one is clobbered ? I didn't think TCG kept track of the vector
halves but I must admit I'm still a bit new with TCG in general.

I noticed other constructs doing that (passing a register number to an opcode),
what do I do to ensure the right clobbers are there ?

> > (2) We're going to have to teach tcg about 16-byte data types soon anyway, for 
> the proper emulation of 16-byte atomic operations.

Is anybody working on this already ? I thought about that approach as
it would definitely make things easier for that and a couple of other
cases such as lq/stq.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  2:13     ` Benjamin Herrenschmidt
@ 2016-07-29  3:34       ` David Gibson
  2016-07-29  4:40         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: David Gibson @ 2016-07-29  3:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Richard Henderson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

On Fri, Jul 29, 2016 at 12:13:01PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2016-07-29 at 06:19 +0530, Richard Henderson wrote:
> > This, I'm not so keen on.
> > 
> > (1) The helper, since it writes to registers controlled by tcg, must be 
> > described to clobber all registers.  Which will noticeably increase memory 
> > traffic to ENV.  For instance, you won't be able to hold the guest register 
> > holding the address in a host register across the call.
> 
> Ah I wasn't aware of this. How do you describe such a clobber ? Can I describe
> specifically which one is clobbered ? I didn't think TCG kept track of the vector
> halves but I must admit I'm still a bit new with TCG in general.
> 
> I noticed other constructs doing that (passing a register number to an opcode),
> what do I do to ensure the right clobbers are there ?
> 
> > > (2) We're going to have to teach tcg about 16-byte data types soon anyway, for 
> > the proper emulation of 16-byte atomic operations.
> 
> Is anybody working on this already ? I thought about that approach as
> it would definitely make things easier for that and a couple of other
> cases such as lq/stq.

What should I do with this in the short term?  Leave it in
ppc-for-2.8, or remove it for now pending possible changes?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  3:34       ` David Gibson
@ 2016-07-29  4:40         ` Benjamin Herrenschmidt
  2016-07-29  4:58           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  4:40 UTC (permalink / raw)
  To: David Gibson; +Cc: Richard Henderson, qemu-ppc, qemu-devel

On Fri, 2016-07-29 at 13:34 +1000, David Gibson wrote:
> 
> What should I do with this in the short term?  Leave it in
> ppc-for-2.8, or remove it for now pending possible changes?

I think I'm still measuring a performance improvement with this, I'll
test a bit more and will get back to you.

It will definitely better to do it without a helper once Richard's 128-
bit stuff is in.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  4:40         ` Benjamin Herrenschmidt
@ 2016-07-29  4:58           ` Benjamin Herrenschmidt
  2016-07-29  5:42             ` David Gibson
  0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  4:58 UTC (permalink / raw)
  To: David Gibson; +Cc: Richard Henderson, qemu-ppc, qemu-devel

On Fri, 2016-07-29 at 14:40 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2016-07-29 at 13:34 +1000, David Gibson wrote:
> > 
> >  
> > What should I do with this in the short term?  Leave it in
> > ppc-for-2.8, or remove it for now pending possible changes?
> 
> I think I'm still measuring a performance improvement with this, I'll
> test a bit more and will get back to you.
> 
> It will definitely better to do it without a helper once Richard's 128-
> bit stuff is in.

Ok, after I nice'd myself on top of all the gcc's on that test machine
I confirm that this patch is a loss on qemu-user. It might still be a
win for qemu-system-ppc64 due to the translation but probably not
worth bothering.

So drop this one. I'll use the 128-bit support when it becomes
available.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  4:58           ` Benjamin Herrenschmidt
@ 2016-07-29  5:42             ` David Gibson
  0 siblings, 0 replies; 15+ messages in thread
From: David Gibson @ 2016-07-29  5:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Richard Henderson, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1113 bytes --]

On Fri, Jul 29, 2016 at 02:58:07PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2016-07-29 at 14:40 +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2016-07-29 at 13:34 +1000, David Gibson wrote:
> > > 
> > >  
> > > What should I do with this in the short term?  Leave it in
> > > ppc-for-2.8, or remove it for now pending possible changes?
> > 
> > I think I'm still measuring a performance improvement with this, I'll
> > test a bit more and will get back to you.
> > 
> > It will definitely better to do it without a helper once Richard's 128-
> > bit stuff is in.
> 
> Ok, after I nice'd myself on top of all the gcc's on that test machine
> I confirm that this patch is a loss on qemu-user. It might still be a
> win for qemu-system-ppc64 due to the translation but probably not
> worth bothering.
> 
> So drop this one. I'll use the 128-bit support when it becomes
> available.

Done.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  0:49   ` Richard Henderson
  2016-07-29  2:13     ` Benjamin Herrenschmidt
@ 2016-07-29  9:00     ` Benjamin Herrenschmidt
  2016-07-29 12:43       ` Richard Henderson
  1 sibling, 1 reply; 15+ messages in thread
From: Benjamin Herrenschmidt @ 2016-07-29  9:00 UTC (permalink / raw)
  To: Richard Henderson, qemu-ppc; +Cc: qemu-devel, david

On Fri, 2016-07-29 at 06:19 +0530, Richard Henderson wrote:
> (1) The helper, since it writes to registers controlled by tcg, must be 
> described to clobber all registers.  Which will noticeably increase memory 
> traffic to ENV.  For instance, you won't be able to hold the guest register 
> holding the address in a host register across the call.

So after fixing my test setup, I did observe indeed a small performance
loss using the helper in qemu-user. It might still win us something in
softmmu due to avoiding extra translations but I will leave that aside
as I mentioned separately.

Now out of curosity, I tried this:

--- a/target-ppc/helper.h
+++ b/target-ppc/helper.h
@@ -22,12 +22,12 @@ DEF_HELPER_1(check_tlb_flush, void, env)
 #endif
 
 DEF_HELPER_3(lmw, void, env, tl, i32)
-DEF_HELPER_3(stmw, void, env, tl, i32)
+DEF_HELPER_FLAGS_3(stmw, TCG_CALL_NO_WG, void, env, tl, i32)
 DEF_HELPER_4(lsw, void, env, tl, i32, i32)
 DEF_HELPER_5(lswx, void, env, tl, i32, i32, i32)
-DEF_HELPER_4(stsw, void, env, tl, i32, i32)
-DEF_HELPER_3(dcbz, void, env, tl, i32)
-DEF_HELPER_2(icbi, void, env, tl)
+DEF_HELPER_FLAGS_4(stsw, TCG_CALL_NO_WG, void, env, tl, i32, i32)
+DEF_HELPER_FLAGS_3(dcbz, TCG_CALL_NO_WG, void, env, tl, i32)
+DEF_HELPER_FLAGS_2(icbi, TCG_CALL_NO_WG, void, env, tl)
 DEF_HELPER_5(lscbx, tl, env, tl, i32, i32, i32)
 
 #if defined(TARGET_PPC64)

If my understanding is right, the above is correct, as none of these
instructions will write to the env, though they can read from it and/
or generate faults.

Sadly I haven't observed any performance improvement as a result in
a few micro-benchmarks I cooked up.

Cheers,
Ben
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl
  2016-07-29  9:00     ` Benjamin Herrenschmidt
@ 2016-07-29 12:43       ` Richard Henderson
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2016-07-29 12:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, qemu-ppc; +Cc: qemu-devel, david

On 07/29/2016 02:30 PM, Benjamin Herrenschmidt wrote:
>  DEF_HELPER_3(lmw, void, env, tl, i32)
> -DEF_HELPER_3(stmw, void, env, tl, i32)
> +DEF_HELPER_FLAGS_3(stmw, TCG_CALL_NO_WG, void, env, tl, i32)
>  DEF_HELPER_4(lsw, void, env, tl, i32, i32)
>  DEF_HELPER_5(lswx, void, env, tl, i32, i32, i32)
> -DEF_HELPER_4(stsw, void, env, tl, i32, i32)
> -DEF_HELPER_3(dcbz, void, env, tl, i32)
> -DEF_HELPER_2(icbi, void, env, tl)
> +DEF_HELPER_FLAGS_4(stsw, TCG_CALL_NO_WG, void, env, tl, i32, i32)
> +DEF_HELPER_FLAGS_3(dcbz, TCG_CALL_NO_WG, void, env, tl, i32)
> +DEF_HELPER_FLAGS_2(icbi, TCG_CALL_NO_WG, void, env, tl)
>  DEF_HELPER_5(lscbx, tl, env, tl, i32, i32, i32)
>
>  #if defined(TARGET_PPC64)
>
> If my understanding is right, the above is correct, as none of these
> instructions will write to the env, though they can read from it and/
> or generate faults.
>
> Sadly I haven't observed any performance improvement as a result in
> a few micro-benchmarks I cooked up.

Too bad.

But the changes look correct, so you might as well queue them.  ;-)


r~

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-07-29 12:43 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <579ad8bd.8481620a.89e78.f1ce@mx.google.com>
2016-07-29  4:44 ` [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl Benjamin Herrenschmidt
2016-07-29  6:42 ` Benjamin Herrenschmidt
2016-07-29  6:56   ` Benjamin Herrenschmidt
2016-07-29 12:30     ` Richard Henderson
2016-07-29 12:37     ` Richard Henderson
2016-07-29  6:58   ` Benjamin Herrenschmidt
2016-07-26 22:20 [Qemu-devel] [PATCH 01/32] ppc: Fix fault PC reporting for lve*/stve* VMX instructions Benjamin Herrenschmidt
2016-07-26 22:21 ` [Qemu-devel] [PATCH 28/32] ppc: Avoid double translation for lvx/lvxl/stvx/stvxl Benjamin Herrenschmidt
2016-07-29  0:49   ` Richard Henderson
2016-07-29  2:13     ` Benjamin Herrenschmidt
2016-07-29  3:34       ` David Gibson
2016-07-29  4:40         ` Benjamin Herrenschmidt
2016-07-29  4:58           ` Benjamin Herrenschmidt
2016-07-29  5:42             ` David Gibson
2016-07-29  9:00     ` Benjamin Herrenschmidt
2016-07-29 12:43       ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).