* [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE
@ 2021-09-13  9:54 Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop Peter Maydell
                   ` (11 more replies)
  0 siblings, 12 replies; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
This patchset uses the TCG vector ops for some MVE
instructions. We can only do this when we know that none
of the MVE lanes are predicated, ie when neither tail
predication nor VPT predication nor ECI partial insn
execution are happening.
Changes v1->v2:
The major change is that instead of just updating the local
s->mve_no_pred flag when we translate an insn that changes the
predication state, we end the TB with DISAS_UPDATE_NOCHAIN.
The exceptions are the code called from vfp_access_check()
(gen_preserve_fp_state() and gen_update_fp_context()). We
can definitely determine the new flag value in one of these cases,
but in the other we can't always.
So patch 1 is new, and adds support to gen_jmp_tb() for
looking at the existing value of is_jmp so it can honour
a preceding request for an UPDATE_NOCHAIN or UPDATE_EXIT.
(We already were assuming this because gen_preserve_fp_state()
can set is_jmp to DISAS_UPDATE_EXIT if icount is in use.)
Patch 2 (new) enforces that FPDSCR.LTPSIZE is 4 on inbound
migration, because we now rely on this architectural invariant.
Patch 3 is the old patch 1, updated as noted above.
Patches 4-6 have been reviewed (they have been very slightly
tweaked to use a new mve_no_predication() function that checks
both s->eci and s->mve_no_pred, rather than v1's direct check
of mve_no_pred.)
Patches 7-12 are new, and add optimized variants of VDUP, VMVN,
various shifts, the shift-and-inserts, and the 1-operand-immediate
insns.
I think this should now be the complete set of optimizations
it's worth implementing at this point.
thanks
-- PMM
Peter Maydell (12):
  target/arm: Avoid goto_tb if we're trying to exit to the main loop
  target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration
  target/arm: Add TB flag for "MVE insns not predicated"
  target/arm: Optimize MVE logic ops
  target/arm: Optimize MVE arithmetic ops
  target/arm: Optimize MVE VNEG, VABS
  target/arm: Optimize MVE VDUP
  target/arm: Optimize MVE VMVN
  target/arm: Optimize MVE VSHL, VSHR immediate forms
  target/arm: Optimize MVE VSHLL and VMOVL
  target/arm: Optimize MVE VSLI and VSRI
  target/arm: Optimize MVE 1op-immediate insns
 target/arm/cpu.h              |   4 +-
 target/arm/translate.h        |   2 +
 target/arm/helper.c           |  33 ++++
 target/arm/machine.c          |  13 ++
 target/arm/translate-m-nocp.c |   8 +-
 target/arm/translate-mve.c    | 310 ++++++++++++++++++++++++++--------
 target/arm/translate-vfp.c    |  33 +++-
 target/arm/translate.c        |  42 ++++-
 8 files changed, 361 insertions(+), 84 deletions(-)
-- 
2.20.1
* [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 13:36   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 02/12] target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration Peter Maydell
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Currently gen_jmp_tb() assumes that if it is called then the jump it
is handling is the only reason that we might be trying to end the TB,
so it will use goto_tb if it can.  This is usually the case: mostly
"we did something that means we must end the TB" happens on a
non-branch instruction.  However, there are cases where we decide
early in handling an instruction that we need to end the TB and
return to the main loop, and then the insn is a complex one that
involves gen_jmp_tb().  For instance, for M-profile FP instructions,
in gen_preserve_fp_state() which is called from vfp_access_check() we
want to force an exit to the main loop if lazy state preservation is
active and we are in icount mode.
Make gen_jmp_tb() look at the current value of is_jmp, and only use
goto_tb if the previous is_jmp was DISAS_NEXT, DISAS_TOO_MANY or
DISAS_NORETURN.
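As a standalone illustration of the decision described above, here is a
minimal sketch in plain C. It is not QEMU code: the Model* enums and the
model_jump_kind() and model_may_chain() functions are hypothetical names
invented for this example, and they only mirror which incoming is_jmp
states permit direct chaining with goto_tb.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the translator's is_jmp states. */
typedef enum {
    MODEL_DISAS_NEXT,
    MODEL_DISAS_TOO_MANY,
    MODEL_DISAS_NORETURN,
    MODEL_DISAS_UPDATE_NOCHAIN,
    MODEL_DISAS_UPDATE_EXIT,
    MODEL_DISAS_SWI,            /* example of a "special case" state */
} ModelIsJmp;

/* What kind of jump code the model would emit. */
typedef enum {
    MODEL_EMIT_GOTO_TB,         /* chain directly to the destination TB */
    MODEL_EMIT_NO_GOTO_TB,      /* set the PC and leave without goto_tb */
    MODEL_EMIT_INVALID,         /* a jump is not expected in this state */
} ModelJumpKind;

static ModelJumpKind model_jump_kind(ModelIsJmp is_jmp)
{
    switch (is_jmp) {
    case MODEL_DISAS_NEXT:
    case MODEL_DISAS_TOO_MANY:
    case MODEL_DISAS_NORETURN:
        /* Nothing else asked to end the TB: chaining is fine. */
        return MODEL_EMIT_GOTO_TB;
    case MODEL_DISAS_UPDATE_NOCHAIN:
    case MODEL_DISAS_UPDATE_EXIT:
        /* An earlier request to leave the TB must be honoured. */
        return MODEL_EMIT_NO_GOTO_TB;
    default:
        return MODEL_EMIT_INVALID;
    }
}

/* Convenience wrapper: true if goto_tb may be used for this jump. */
static bool model_may_chain(ModelIsJmp is_jmp)
{
    ModelJumpKind kind = model_jump_kind(is_jmp);

    assert(kind != MODEL_EMIT_INVALID);
    return kind == MODEL_EMIT_GOTO_TB;
}
```

For example, model_may_chain(MODEL_DISAS_UPDATE_EXIT) is false, which
corresponds to the icount case mentioned above where the exit to the
main loop must not be replaced by a chained jump.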
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 24b7f49d767..3d1ff8ba951 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -2610,8 +2610,40 @@ static inline void gen_jmp_tb(DisasContext *s, uint32_t dest, int tbno)
         /* An indirect jump so that we still trigger the debug exception.  */
         gen_set_pc_im(s, dest);
         s->base.is_jmp = DISAS_JUMP;
-    } else {
+        return;
+    }
+    switch (s->base.is_jmp) {
+    case DISAS_NEXT:
+    case DISAS_TOO_MANY:
+    case DISAS_NORETURN:
+        /*
+         * The normal case: just go to the destination TB.
+         * NB: NORETURN happens if we generate code like
+         *    gen_brcondi(l);
+         *    gen_jmp();
+         *    gen_set_label(l);
+         *    gen_jmp();
+         * on the second call to gen_jmp().
+         */
         gen_goto_tb(s, tbno, dest);
+        break;
+    case DISAS_UPDATE_NOCHAIN:
+    case DISAS_UPDATE_EXIT:
+        /*
+         * We already decided we're leaving the TB for some other reason.
+         * Avoid using goto_tb so we really do exit back to the main loop
+         * and don't chain to another TB.
+         */
+        gen_set_pc_im(s, dest);
+        gen_goto_ptr();
+        s->base.is_jmp = DISAS_NORETURN;
+        break;
+    default:
+        /*
+         * We shouldn't be emitting code for a jump and also have
+         * is_jmp set to one of the special cases like DISAS_SWI.
+         */
+        g_assert_not_reached();
     }
 }
 
-- 
2.20.1
* [PATCH v2 02/12] target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 13:39   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 03/12] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Architecturally, for an M-profile CPU with the LOB feature the
LTPSIZE field in FPDSCR is always constant 4.  QEMU's implementation
enforces this everywhere, except that we don't check that it is true
in incoming migration data.
We're going to add code in gen_update_fp_context() which relies on
the "always 4" property.  Since this is TCG-only, we don't actually
need to be robust to bogus incoming migration data, and the effect of
it being wrong would be wrong code generation rather than a QEMU
crash; but if it did ever happen somehow it would be very difficult
to track down the cause.  Add a check so that we fail the inbound
migration if the FPDSCR.LTPSIZE value is incorrect.
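The check can be pictured with the following self-contained sketch.
It is only a model: the MODEL_* constants assume that LTPSIZE occupies
bits [18:16] of the register, model_extract32() is a local bitfield
helper written for this example, and model_post_load() is a
hypothetical stand-in for the real post-load hook.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed field position for this sketch: LTPSIZE in bits [18:16]. */
#define MODEL_LTPSIZE_SHIFT  16
#define MODEL_LTPSIZE_LENGTH 3

/* Extract 'length' bits of 'value' starting at bit 'start'. */
static uint32_t model_extract32(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    return (value >> start) & (~0U >> (32 - length));
}

/* True if both banked copies carry the architectural constant 4. */
static bool model_fpdscr_ok(uint32_t fpdscr_ns, uint32_t fpdscr_s)
{
    return model_extract32(fpdscr_ns, MODEL_LTPSIZE_SHIFT,
                           MODEL_LTPSIZE_LENGTH) == 4 &&
           model_extract32(fpdscr_s, MODEL_LTPSIZE_SHIFT,
                           MODEL_LTPSIZE_LENGTH) == 4;
}

/*
 * Hypothetical post-load hook: returns 0 to accept the incoming
 * state and -1 to fail the migration.
 */
static int model_post_load(bool m_profile_with_lob,
                           uint32_t fpdscr_ns, uint32_t fpdscr_s)
{
    if (m_profile_with_lob && !model_fpdscr_ok(fpdscr_ns, fpdscr_s)) {
        return -1;
    }
    return 0;
}
```

Only CPUs that have the invariant are checked, so state for a CPU
without the LOB feature is accepted whatever the field contains.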
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/machine.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/target/arm/machine.c b/target/arm/machine.c
index 81e30de8243..c74d8c3f4b3 100644
--- a/target/arm/machine.c
+++ b/target/arm/machine.c
@@ -781,6 +781,19 @@ static int cpu_post_load(void *opaque, int version_id)
     hw_breakpoint_update_all(cpu);
     hw_watchpoint_update_all(cpu);
 
+    /*
+     * TCG gen_update_fp_context() relies on the invariant that
+     * FPDSCR.LTPSIZE is constant 4 for M-profile with the LOB extension;
+     * forbid bogus incoming data with some other value.
+     */
+    if (arm_feature(env, ARM_FEATURE_M) && cpu_isar_feature(aa32_lob, cpu)) {
+        if (extract32(env->v7m.fpdscr[M_REG_NS],
+                      FPCR_LTPSIZE_SHIFT, FPCR_LTPSIZE_LENGTH) != 4 ||
+            extract32(env->v7m.fpdscr[M_REG_S],
+                      FPCR_LTPSIZE_SHIFT, FPCR_LTPSIZE_LENGTH) != 4) {
+            return -1;
+        }
+    }
     if (!kvm_enabled()) {
         pmu_op_finish(&cpu->env);
     }
-- 
2.20.1
* [PATCH v2 03/12] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 02/12] target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 13:44   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 04/12] target/arm: Optimize MVE logic ops Peter Maydell
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Our current codegen for MVE always calls out to helper functions,
because some byte lanes might be predicated.  The common case is that
in fact there is no predication active and all lanes should be
updated together, so we can produce better code by detecting that and
using the TCG generic vector infrastructure.
Add a TB flag that is set when we can guarantee that there is no
active MVE predication, and a bool in the DisasContext.  Subsequent
patches will use this flag to generate improved code for some
instructions.
In most cases when the predication state changes we simply end the TB
after that instruction.  For the code called from vfp_access_check()
that handles lazy state preservation and creating a new FP context,
we can usually avoid having to try to end the TB because luckily the
new value of the flag following the register changes in those
sequences doesn't depend on any runtime decisions.  We do have to end
the TB if the guest has enabled lazy FP state preservation but not
automatic state preservation, but this is an odd corner case that is
not going to be common in real-world code.
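To make the reasoning about the flag value concrete, here is a small
standalone model. ModelCPU, model_mve_no_pred() and
model_new_fp_context() are hypothetical names for this sketch only;
the model assumes, as the text above does, that creating a new FP
context always leaves VPR at 0 and LTPSIZE at 4.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of the state the flag depends on. */
typedef struct {
    bool has_mve;       /* CPU implements MVE */
    uint32_t vpr;       /* VPT predication state; 0 means none active */
    uint32_t ltpsize;   /* tail predication size; 4 means none active */
} ModelCPU;

/* True only if MVE insns are definitely not predicated. */
static bool model_mve_no_pred(const ModelCPU *cpu)
{
    assert(cpu != NULL);
    if (!cpu->has_mve) {
        return false;
    }
    if (cpu->vpr) {
        return false;
    }
    if (cpu->ltpsize < 4) {
        return false;
    }
    return true;
}

/* Model of "create a new FP context": the outcome is always the same. */
static void model_new_fp_context(ModelCPU *cpu)
{
    assert(cpu != NULL);
    cpu->vpr = 0;
    cpu->ltpsize = 4;
}
```

Because model_new_fp_context() always produces the same VPR and
LTPSIZE, the flag afterwards equals has_mve regardless of the earlier
state, which is why no runtime decision is involved in that case.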
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
I renamed the mve_no_predication() function to mve_no_pred() because
I want to use the former name in patch 4 for the translate-time "no
predication of any kind including ECI", and wanted to distinguish it
from this function that is just determining the value of the TB flag
bit.  Better naming suggestions welcome.
---
 target/arm/cpu.h              |  4 +++-
 target/arm/translate.h        |  2 ++
 target/arm/helper.c           | 33 +++++++++++++++++++++++++++++++++
 target/arm/translate-m-nocp.c |  8 +++++++-
 target/arm/translate-mve.c    | 13 ++++++++++++-
 target/arm/translate-vfp.c    | 33 +++++++++++++++++++++++++++------
 target/arm/translate.c        |  8 ++++++++
 7 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 6a987f65e41..a235d21c233 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -3440,7 +3440,7 @@ typedef ARMCPU ArchCPU;
  * | TBFLAG_AM32 |          +-----+----------+
  * |             |                |TBFLAG_M32|
  * +-------------+----------------+----------+
- *  31         23                5 4        0
+ *  31         23                6 5        0
  *
  * Unless otherwise noted, these bits are cached in env->hflags.
  */
@@ -3497,6 +3497,8 @@ FIELD(TBFLAG_M32, LSPACT, 2, 1)                 /* Not cached. */
 FIELD(TBFLAG_M32, NEW_FP_CTXT_NEEDED, 3, 1)     /* Not cached. */
 /* Set if FPCCR.S does not match current security state */
 FIELD(TBFLAG_M32, FPCCR_S_WRONG, 4, 1)          /* Not cached. */
+/* Set if MVE insns are definitely not predicated by VPR or LTPSIZE */
+FIELD(TBFLAG_M32, MVE_NO_PRED, 5, 1)            /* Not cached. */
 
 /*
  * Bit usage when in AArch64 state
diff --git a/target/arm/translate.h b/target/arm/translate.h
index 8636c20c3b4..f3cc820d071 100644
--- a/target/arm/translate.h
+++ b/target/arm/translate.h
@@ -98,6 +98,8 @@ typedef struct DisasContext {
     bool hstr_active;
     /* True if memory operations require alignment */
     bool align_mem;
+    /* True if MVE insns are definitely not predicated by VPR or LTPSIZE */
+    bool mve_no_pred;
     /*
      * >= 0, a copy of PSTATE.BTYPE, which will be 0 without v8.5-BTI.
      *  < 0, set by the current instruction.
diff --git a/target/arm/helper.c b/target/arm/helper.c
index a7ae78146d4..6f8cf67d895 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -13673,6 +13673,35 @@ static inline void assert_hflags_rebuild_correctly(CPUARMState *env)
 #endif
 }
 
+static bool mve_no_pred(CPUARMState *env)
+{
+    /*
+     * Return true if there is definitely no predication of MVE
+     * instructions by VPR or LTPSIZE. (Returning false even if there
+     * isn't any predication is OK; generated code will just be
+     * a little worse.)
+     * If the CPU does not implement MVE then this TB flag is always 0.
+     *
+     * NOTE: if you change this logic, the "recalculate s->mve_no_pred"
+     * logic in gen_update_fp_context() needs to be updated to match.
+     *
+     * We do not include the effect of the ECI bits here -- they are
+     * tracked in other TB flags. This simplifies the logic for
+     * "when did we emit code that changes the MVE_NO_PRED TB flag
+     * and thus need to end the TB?".
+     */
+    if (!cpu_isar_feature(aa32_mve, env_archcpu(env))) {
+        return false;
+    }
+    if (env->v7m.vpr) {
+        return false;
+    }
+    if (env->v7m.ltpsize < 4) {
+        return false;
+    }
+    return true;
+}
+
 void cpu_get_tb_cpu_state(CPUARMState *env, target_ulong *pc,
                           target_ulong *cs_base, uint32_t *pflags)
 {
@@ -13712,6 +13741,10 @@ void cpu_get_tb_cpu_state(CPUARMState *env, target_ulong *pc,
             if (env->v7m.fpccr[is_secure] & R_V7M_FPCCR_LSPACT_MASK) {
                 DP_TBFLAG_M32(flags, LSPACT, 1);
             }
+
+            if (mve_no_pred(env)) {
+                DP_TBFLAG_M32(flags, MVE_NO_PRED, 1);
+            }
         } else {
             /*
              * Note that XSCALE_CPAR shares bits with VECSTRIDE.
diff --git a/target/arm/translate-m-nocp.c b/target/arm/translate-m-nocp.c
index 5eab04832cd..d9e144e8eb3 100644
--- a/target/arm/translate-m-nocp.c
+++ b/target/arm/translate-m-nocp.c
@@ -95,7 +95,10 @@ static bool trans_VLLDM_VLSTM(DisasContext *s, arg_VLLDM_VLSTM *a)
 
     clear_eci_state(s);
 
-    /* End the TB, because we have updated FP control bits */
+    /*
+     * End the TB, because we have updated FP control bits,
+     * and possibly VPR or LTPSIZE.
+     */
     s->base.is_jmp = DISAS_UPDATE_EXIT;
     return true;
 }
@@ -397,6 +400,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
         store_cpu_field(control, v7m.control[M_REG_S]);
         tcg_gen_andi_i32(tmp, tmp, ~FPCR_NZCV_MASK);
         gen_helper_vfp_set_fpscr(cpu_env, tmp);
+        s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
         tcg_temp_free_i32(tmp);
         tcg_temp_free_i32(sfpa);
         break;
@@ -409,6 +413,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
         }
         tmp = loadfn(s, opaque, true);
         store_cpu_field(tmp, v7m.vpr);
+        s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
         break;
     case ARM_VFP_P0:
     {
@@ -418,6 +423,7 @@ static bool gen_M_fp_sysreg_write(DisasContext *s, int regno,
         tcg_gen_deposit_i32(vpr, vpr, tmp,
                             R_V7M_VPR_P0_SHIFT, R_V7M_VPR_P0_LENGTH);
         store_cpu_field(vpr, v7m.vpr);
+        s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
         tcg_temp_free_i32(tmp);
         break;
     }
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 2ed91577ec8..0eca96e29cf 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -810,7 +810,12 @@ DO_LOGIC(VORR, gen_helper_mve_vorr)
 DO_LOGIC(VORN, gen_helper_mve_vorn)
 DO_LOGIC(VEOR, gen_helper_mve_veor)
 
-DO_LOGIC(VPSEL, gen_helper_mve_vpsel)
+static bool trans_VPSEL(DisasContext *s, arg_2op *a)
+{
+    /* This insn updates predication bits */
+    s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
+    return do_2op(s, a, gen_helper_mve_vpsel);
+}
 
 #define DO_2OP(INSN, FN) \
     static bool trans_##INSN(DisasContext *s, arg_2op *a)       \
@@ -1366,6 +1371,8 @@ static bool trans_VPNOT(DisasContext *s, arg_VPNOT *a)
     }
 
     gen_helper_mve_vpnot(cpu_env);
+    /* This insn updates predication bits */
+    s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
     mve_update_eci(s);
     return true;
 }
@@ -1852,6 +1859,8 @@ static bool do_vcmp(DisasContext *s, arg_vcmp *a, MVEGenCmpFn *fn)
         /* VPT */
         gen_vpst(s, a->mask);
     }
+    /* This insn updates predication bits */
+    s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
     mve_update_eci(s);
     return true;
 }
@@ -1883,6 +1892,8 @@ static bool do_vcmp_scalar(DisasContext *s, arg_vcmp_scalar *a,
         /* VPT */
         gen_vpst(s, a->mask);
     }
+    /* This insn updates predication bits */
+    s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
     mve_update_eci(s);
     return true;
 }
diff --git a/target/arm/translate-vfp.c b/target/arm/translate-vfp.c
index e2eb797c829..59bcaec5beb 100644
--- a/target/arm/translate-vfp.c
+++ b/target/arm/translate-vfp.c
@@ -109,7 +109,7 @@ static inline long vfp_f16_offset(unsigned reg, bool top)
  * Generate code for M-profile lazy FP state preservation if needed;
  * this corresponds to the pseudocode PreserveFPState() function.
  */
-static void gen_preserve_fp_state(DisasContext *s)
+static void gen_preserve_fp_state(DisasContext *s, bool skip_context_update)
 {
     if (s->v7m_lspact) {
         /*
@@ -128,6 +128,20 @@ static void gen_preserve_fp_state(DisasContext *s)
          * any further FP insns in this TB.
          */
         s->v7m_lspact = false;
+        /*
+         * The helper might have zeroed VPR, so we do not know the
+         * correct value for the MVE_NO_PRED TB flag any more.
+         * If we're about to create a new fp context then that
+         * will precisely determine the MVE_NO_PRED value (see
+         * gen_update_fp_context()). Otherwise, we must:
+         *  - set s->mve_no_pred to false, so this instruction
+         *    is generated to use helper functions
+         *  - end the TB now, without chaining to the next TB
+         */
+        if (skip_context_update || !s->v7m_new_fp_ctxt_needed) {
+            s->mve_no_pred = false;
+            s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
+        }
     }
 }
 
@@ -169,12 +183,19 @@ static void gen_update_fp_context(DisasContext *s)
             TCGv_i32 z32 = tcg_const_i32(0);
             store_cpu_field(z32, v7m.vpr);
         }
-
         /*
-         * We don't need to arrange to end the TB, because the only
-         * parts of FPSCR which we cache in the TB flags are the VECLEN
-         * and VECSTRIDE, and those don't exist for M-profile.
+         * We just updated the FPSCR and VPR. Some of this state is cached
+         * in the MVE_NO_PRED TB flag. We want to avoid having to end the
+         * TB here, which means we need the new value of the MVE_NO_PRED
+         * flag to be exactly known here and the same for all executions.
+         * Luckily FPDSCR.LTPSIZE is always constant 4 and the VPR is
+         * always set to 0, so the new MVE_NO_PRED flag is always 1
+         * if and only if we have MVE.
+         *
+         * (The other FPSCR state cached in TB flags is VECLEN and VECSTRIDE,
+         * but those do not exist for M-profile, so are not relevant here.)
          */
+        s->mve_no_pred = dc_isar_feature(aa32_mve, s);
 
         if (s->v8m_secure) {
             bits |= R_V7M_CONTROL_SFPA_MASK;
@@ -238,7 +259,7 @@ bool vfp_access_check_m(DisasContext *s, bool skip_context_update)
     /* Handle M-profile lazy FP state mechanics */
 
     /* Trigger lazy-state preservation if necessary */
-    gen_preserve_fp_state(s);
+    gen_preserve_fp_state(s, skip_context_update);
 
     if (!skip_context_update) {
         /* Update ownership of FP context and create new FP context if needed */
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 3d1ff8ba951..3a59de208d5 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -8496,6 +8496,7 @@ static bool trans_DLS(DisasContext *s, arg_DLS *a)
         /* DLSTP: set FPSCR.LTPSIZE */
         tmp = tcg_const_i32(a->size);
         store_cpu_field(tmp, v7m.ltpsize);
+        s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
     }
     return true;
 }
@@ -8561,6 +8562,10 @@ static bool trans_WLS(DisasContext *s, arg_WLS *a)
         assert(ok);
         tmp = tcg_const_i32(a->size);
         store_cpu_field(tmp, v7m.ltpsize);
+        /*
+         * LTPSIZE updated, but MVE_NO_PRED will always be the same thing (0)
+         * when we take this upcoming exit from this TB, so gen_jmp_tb() is OK.
+         */
     }
     gen_jmp_tb(s, s->base.pc_next, 1);
 
@@ -8743,6 +8748,8 @@ static bool trans_VCTP(DisasContext *s, arg_VCTP *a)
     gen_helper_mve_vctp(cpu_env, masklen);
     tcg_temp_free_i32(masklen);
     tcg_temp_free_i32(rn_shifted);
+    /* This insn updates predication bits */
+    s->base.is_jmp = DISAS_UPDATE_NOCHAIN;
     mve_update_eci(s);
     return true;
 }
@@ -9402,6 +9409,7 @@ static void arm_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
         dc->v7m_new_fp_ctxt_needed =
             EX_TBFLAG_M32(tb_flags, NEW_FP_CTXT_NEEDED);
         dc->v7m_lspact = EX_TBFLAG_M32(tb_flags, LSPACT);
+        dc->mve_no_pred = EX_TBFLAG_M32(tb_flags, MVE_NO_PRED);
     } else {
         dc->debug_target_el = EX_TBFLAG_ANY(tb_flags, DEBUG_TARGET_EL);
         dc->sctlr_b = EX_TBFLAG_A32(tb_flags, SCTLR__B);
-- 
2.20.1
* [PATCH v2 04/12] target/arm: Optimize MVE logic ops
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (2 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 03/12] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 05/12] target/arm: Optimize MVE arithmetic ops Peter Maydell
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
When not predicating, implement the MVE bitwise logical insns
directly using TCG vector operations.
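The difference between the two code paths can be sketched in plain C.
This is a simplified model, not the QEMU helpers: the model_vand_*()
functions are hypothetical, and the 16-bit mask with one bit per byte
lane (bit set means the lane is written) is an assumption made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VEC_BYTES 16

/* Slow path model: only byte lanes whose mask bit is set are written. */
static void model_vand_predicated(uint8_t *d, const uint8_t *n,
                                  const uint8_t *m, uint16_t mask)
{
    for (int i = 0; i < MODEL_VEC_BYTES; i++) {
        if (mask & (1u << i)) {
            d[i] = n[i] & m[i];
        }
    }
}

/* Fast path model: the whole vector is handled as two 64-bit words. */
static void model_vand_whole(uint8_t *d, const uint8_t *n, const uint8_t *m)
{
    uint64_t a[2], b[2], r[2];

    memcpy(a, n, MODEL_VEC_BYTES);
    memcpy(b, m, MODEL_VEC_BYTES);
    r[0] = a[0] & b[0];
    r[1] = a[1] & b[1];
    memcpy(d, r, MODEL_VEC_BYTES);
}

/* Dispatch: take the fast path only when no lane is masked off. */
static void model_vand(uint8_t *d, const uint8_t *n, const uint8_t *m,
                       uint16_t mask, bool no_predication)
{
    if (no_predication) {
        assert(mask == 0xffff);
        model_vand_whole(d, n, m);
    } else {
        model_vand_predicated(d, n, m, mask);
    }
}
```

With every lane enabled both paths write the same 16 bytes, so the
choice is purely one of generated-code quality; with any lane disabled
only the per-lane path preserves the old destination bytes.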
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
v1->v2: new mve_no_predication() function
---
 target/arm/translate-mve.c | 51 +++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 15 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 0eca96e29cf..77b9f0db334 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -64,6 +64,16 @@ static TCGv_ptr mve_qreg_ptr(unsigned reg)
     return ret;
 }
 
+static bool mve_no_predication(DisasContext *s)
+{
+    /*
+     * Return true if we are executing the entire MVE instruction
+     * with no predication or partial-execution, and so we can safely
+     * use an inline TCG vector implementation.
+     */
+    return s->eci == 0 && s->mve_no_pred;
+}
+
 static bool mve_check_qreg_bank(DisasContext *s, int qmask)
 {
     /*
@@ -774,7 +784,8 @@ static bool trans_VNEG_fp(DisasContext *s, arg_1op *a)
     return do_1op(s, a, fns[a->size]);
 }
 
-static bool do_2op(DisasContext *s, arg_2op *a, MVEGenTwoOpFn fn)
+static bool do_2op_vec(DisasContext *s, arg_2op *a, MVEGenTwoOpFn fn,
+                       GVecGen3Fn *vecfn)
 {
     TCGv_ptr qd, qn, qm;
 
@@ -787,28 +798,38 @@ static bool do_2op(DisasContext *s, arg_2op *a, MVEGenTwoOpFn fn)
         return true;
     }
 
-    qd = mve_qreg_ptr(a->qd);
-    qn = mve_qreg_ptr(a->qn);
-    qm = mve_qreg_ptr(a->qm);
-    fn(cpu_env, qd, qn, qm);
-    tcg_temp_free_ptr(qd);
-    tcg_temp_free_ptr(qn);
-    tcg_temp_free_ptr(qm);
+    if (vecfn && mve_no_predication(s)) {
+        vecfn(a->size, mve_qreg_offset(a->qd), mve_qreg_offset(a->qn),
+              mve_qreg_offset(a->qm), 16, 16);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        qn = mve_qreg_ptr(a->qn);
+        qm = mve_qreg_ptr(a->qm);
+        fn(cpu_env, qd, qn, qm);
+        tcg_temp_free_ptr(qd);
+        tcg_temp_free_ptr(qn);
+        tcg_temp_free_ptr(qm);
+    }
     mve_update_eci(s);
     return true;
 }
 
-#define DO_LOGIC(INSN, HELPER)                                  \
+static bool do_2op(DisasContext *s, arg_2op *a, MVEGenTwoOpFn *fn)
+{
+    return do_2op_vec(s, a, fn, NULL);
+}
+
+#define DO_LOGIC(INSN, HELPER, VECFN)                           \
     static bool trans_##INSN(DisasContext *s, arg_2op *a)       \
     {                                                           \
-        return do_2op(s, a, HELPER);                            \
+        return do_2op_vec(s, a, HELPER, VECFN);                 \
     }
 
-DO_LOGIC(VAND, gen_helper_mve_vand)
-DO_LOGIC(VBIC, gen_helper_mve_vbic)
-DO_LOGIC(VORR, gen_helper_mve_vorr)
-DO_LOGIC(VORN, gen_helper_mve_vorn)
-DO_LOGIC(VEOR, gen_helper_mve_veor)
+DO_LOGIC(VAND, gen_helper_mve_vand, tcg_gen_gvec_and)
+DO_LOGIC(VBIC, gen_helper_mve_vbic, tcg_gen_gvec_andc)
+DO_LOGIC(VORR, gen_helper_mve_vorr, tcg_gen_gvec_or)
+DO_LOGIC(VORN, gen_helper_mve_vorn, tcg_gen_gvec_orc)
+DO_LOGIC(VEOR, gen_helper_mve_veor, tcg_gen_gvec_xor)
 
 static bool trans_VPSEL(DisasContext *s, arg_2op *a)
 {
-- 
2.20.1
* [PATCH v2 05/12] target/arm: Optimize MVE arithmetic ops
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (3 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 04/12] target/arm: Optimize MVE logic ops Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 06/12] target/arm: Optimize MVE VNEG, VABS Peter Maydell
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize MVE arithmetic ops when we have a TCG
vector operation we can use.
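As a rough picture of what a lane-wise vector add computes for each
element size, here is a standalone model. model_vadd() is a
hypothetical function for this sketch; it assumes a 16-byte vector
with little-endian lane layout and an element size given as a log2
value (0 for bytes, 1 for halfwords, 2 for words).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_VEC_BYTES 16

/*
 * Add two 16-byte vectors lane by lane. Each lane wraps on overflow
 * and carries never cross a lane boundary.
 */
static void model_vadd(unsigned esize_log2, uint8_t *d,
                       const uint8_t *n, const uint8_t *m)
{
    unsigned esize;

    assert(esize_log2 <= 2);
    esize = 1u << esize_log2;

    for (unsigned off = 0; off < MODEL_VEC_BYTES; off += esize) {
        uint32_t a = 0, b = 0, r;

        /* Assemble one lane of each operand, least significant byte first. */
        for (unsigned k = 0; k < esize; k++) {
            a |= (uint32_t)n[off + k] << (8 * k);
            b |= (uint32_t)m[off + k] << (8 * k);
        }
        r = a + b;
        /* Store only the low 'esize' bytes, discarding any carry out. */
        for (unsigned k = 0; k < esize; k++) {
            d[off + k] = (uint8_t)(r >> (8 * k));
        }
    }
}
```

The same input bytes give different results depending on the element
size, which is why the size is passed through to the vector operation
rather than being fixed.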
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/translate-mve.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 77b9f0db334..255cb860fec 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -838,7 +838,7 @@ static bool trans_VPSEL(DisasContext *s, arg_2op *a)
     return do_2op(s, a, gen_helper_mve_vpsel);
 }
 
-#define DO_2OP(INSN, FN) \
+#define DO_2OP_VEC(INSN, FN, VECFN)                             \
     static bool trans_##INSN(DisasContext *s, arg_2op *a)       \
     {                                                           \
         static MVEGenTwoOpFn * const fns[] = {                  \
@@ -847,20 +847,22 @@ static bool trans_VPSEL(DisasContext *s, arg_2op *a)
             gen_helper_mve_##FN##w,                             \
             NULL,                                               \
         };                                                      \
-        return do_2op(s, a, fns[a->size]);                      \
+        return do_2op_vec(s, a, fns[a->size], VECFN);           \
     }
 
-DO_2OP(VADD, vadd)
-DO_2OP(VSUB, vsub)
-DO_2OP(VMUL, vmul)
+#define DO_2OP(INSN, FN) DO_2OP_VEC(INSN, FN, NULL)
+
+DO_2OP_VEC(VADD, vadd, tcg_gen_gvec_add)
+DO_2OP_VEC(VSUB, vsub, tcg_gen_gvec_sub)
+DO_2OP_VEC(VMUL, vmul, tcg_gen_gvec_mul)
 DO_2OP(VMULH_S, vmulhs)
 DO_2OP(VMULH_U, vmulhu)
 DO_2OP(VRMULH_S, vrmulhs)
 DO_2OP(VRMULH_U, vrmulhu)
-DO_2OP(VMAX_S, vmaxs)
-DO_2OP(VMAX_U, vmaxu)
-DO_2OP(VMIN_S, vmins)
-DO_2OP(VMIN_U, vminu)
+DO_2OP_VEC(VMAX_S, vmaxs, tcg_gen_gvec_smax)
+DO_2OP_VEC(VMAX_U, vmaxu, tcg_gen_gvec_umax)
+DO_2OP_VEC(VMIN_S, vmins, tcg_gen_gvec_smin)
+DO_2OP_VEC(VMIN_U, vminu, tcg_gen_gvec_umin)
 DO_2OP(VABD_S, vabds)
 DO_2OP(VABD_U, vabdu)
 DO_2OP(VHADD_S, vhadds)
-- 
2.20.1
* [PATCH v2 06/12] target/arm: Optimize MVE VNEG, VABS
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (4 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 05/12] target/arm: Optimize MVE arithmetic ops Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 07/12] target/arm: Optimize MVE VDUP Peter Maydell
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE VNEG and VABS insns by using TCG
vector ops when possible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/translate-mve.c | 32 ++++++++++++++++++++++----------
 1 file changed, 22 insertions(+), 10 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 255cb860fec..d30c7e57ea3 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -510,7 +510,8 @@ static bool trans_VDUP(DisasContext *s, arg_VDUP *a)
     return true;
 }
 
-static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
+static bool do_1op_vec(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn,
+                       GVecGen2Fn vecfn)
 {
     TCGv_ptr qd, qm;
 
@@ -524,16 +525,25 @@ static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
         return true;
     }
 
-    qd = mve_qreg_ptr(a->qd);
-    qm = mve_qreg_ptr(a->qm);
-    fn(cpu_env, qd, qm);
-    tcg_temp_free_ptr(qd);
-    tcg_temp_free_ptr(qm);
+    if (vecfn && mve_no_predication(s)) {
+        vecfn(a->size, mve_qreg_offset(a->qd), mve_qreg_offset(a->qm), 16, 16);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        qm = mve_qreg_ptr(a->qm);
+        fn(cpu_env, qd, qm);
+        tcg_temp_free_ptr(qd);
+        tcg_temp_free_ptr(qm);
+    }
     mve_update_eci(s);
     return true;
 }
 
-#define DO_1OP(INSN, FN)                                        \
+static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
+{
+    return do_1op_vec(s, a, fn, NULL);
+}
+
+#define DO_1OP_VEC(INSN, FN, VECFN)                             \
     static bool trans_##INSN(DisasContext *s, arg_1op *a)       \
     {                                                           \
         static MVEGenOneOpFn * const fns[] = {                  \
@@ -542,13 +552,15 @@ static bool do_1op(DisasContext *s, arg_1op *a, MVEGenOneOpFn fn)
             gen_helper_mve_##FN##w,                             \
             NULL,                                               \
         };                                                      \
-        return do_1op(s, a, fns[a->size]);                      \
+        return do_1op_vec(s, a, fns[a->size], VECFN);           \
     }
 
+#define DO_1OP(INSN, FN) DO_1OP_VEC(INSN, FN, NULL)
+
 DO_1OP(VCLZ, vclz)
 DO_1OP(VCLS, vcls)
-DO_1OP(VABS, vabs)
-DO_1OP(VNEG, vneg)
+DO_1OP_VEC(VABS, vabs, tcg_gen_gvec_abs)
+DO_1OP_VEC(VNEG, vneg, tcg_gen_gvec_neg)
 DO_1OP(VQABS, vqabs)
 DO_1OP(VQNEG, vqneg)
 DO_1OP(VMAXA, vmaxa)
-- 
2.20.1
* [PATCH v2 07/12] target/arm: Optimize MVE VDUP
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (5 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 06/12] target/arm: Optimize MVE VNEG, VABS Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 13:46   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 08/12] target/arm: Optimize MVE VMVN Peter Maydell
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE VDUP insns by using TCG vector ops when possible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index d30c7e57ea3..13de55242e2 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -500,11 +500,15 @@ static bool trans_VDUP(DisasContext *s, arg_VDUP *a)
         return true;
     }
 
-    qd = mve_qreg_ptr(a->qd);
     rt = load_reg(s, a->rt);
-    tcg_gen_dup_i32(a->size, rt, rt);
-    gen_helper_mve_vdup(cpu_env, qd, rt);
-    tcg_temp_free_ptr(qd);
+    if (mve_no_predication(s)) {
+        tcg_gen_gvec_dup_i32(a->size, mve_qreg_offset(a->qd), 16, 16, rt);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        tcg_gen_dup_i32(a->size, rt, rt);
+        gen_helper_mve_vdup(cpu_env, qd, rt);
+        tcg_temp_free_ptr(qd);
+    }
     tcg_temp_free_i32(rt);
     mve_update_eci(s);
     return true;
-- 
2.20.1
* [PATCH v2 08/12] target/arm: Optimize MVE VMVN
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (6 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 07/12] target/arm: Optimize MVE VDUP Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 13:47   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms Peter Maydell
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE VMVN insn by using TCG vector ops when possible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 13de55242e2..4583e22f21c 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -769,7 +769,7 @@ static bool trans_VREV64(DisasContext *s, arg_1op *a)
 
 static bool trans_VMVN(DisasContext *s, arg_1op *a)
 {
-    return do_1op(s, a, gen_helper_mve_vmvn);
+    return do_1op_vec(s, a, gen_helper_mve_vmvn, tcg_gen_gvec_not);
 }
 
 static bool trans_VABS_fp(DisasContext *s, arg_1op *a)
-- 
2.20.1
* [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (7 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 08/12] target/arm: Optimize MVE VMVN Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 13:56   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL Peter Maydell
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE VSHL and VSHR immediate forms by using TCG vector
ops when possible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 83 +++++++++++++++++++++++++++++---------
 1 file changed, 63 insertions(+), 20 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 4583e22f21c..00fa4379a74 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -1570,8 +1570,8 @@ static bool trans_Vimm_1r(DisasContext *s, arg_1imm *a)
     return do_1imm(s, a, fn);
 }
 
-static bool do_2shift(DisasContext *s, arg_2shift *a, MVEGenTwoOpShiftFn fn,
-                      bool negateshift)
+static bool do_2shift_vec(DisasContext *s, arg_2shift *a, MVEGenTwoOpShiftFn fn,
+                          bool negateshift, GVecGen2iFn vecfn)
 {
     TCGv_ptr qd, qm;
     int shift = a->shift;
@@ -1594,34 +1594,77 @@ static bool do_2shift(DisasContext *s, arg_2shift *a, MVEGenTwoOpShiftFn fn,
         shift = -shift;
     }
 
-    qd = mve_qreg_ptr(a->qd);
-    qm = mve_qreg_ptr(a->qm);
-    fn(cpu_env, qd, qm, tcg_constant_i32(shift));
-    tcg_temp_free_ptr(qd);
-    tcg_temp_free_ptr(qm);
+    if (vecfn && mve_no_predication(s)) {
+        vecfn(a->size, mve_qreg_offset(a->qd), mve_qreg_offset(a->qm),
+              shift, 16, 16);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        qm = mve_qreg_ptr(a->qm);
+        fn(cpu_env, qd, qm, tcg_constant_i32(shift));
+        tcg_temp_free_ptr(qd);
+        tcg_temp_free_ptr(qm);
+    }
     mve_update_eci(s);
     return true;
 }
 
-#define DO_2SHIFT(INSN, FN, NEGATESHIFT)                         \
-    static bool trans_##INSN(DisasContext *s, arg_2shift *a)    \
-    {                                                           \
-        static MVEGenTwoOpShiftFn * const fns[] = {             \
-            gen_helper_mve_##FN##b,                             \
-            gen_helper_mve_##FN##h,                             \
-            gen_helper_mve_##FN##w,                             \
-            NULL,                                               \
-        };                                                      \
-        return do_2shift(s, a, fns[a->size], NEGATESHIFT);      \
+static bool do_2shift(DisasContext *s, arg_2shift *a, MVEGenTwoOpShiftFn fn,
+                      bool negateshift)
+{
+    return do_2shift_vec(s, a, fn, negateshift, NULL);
+}
+
+#define DO_2SHIFT_VEC(INSN, FN, NEGATESHIFT, VECFN)                     \
+    static bool trans_##INSN(DisasContext *s, arg_2shift *a)            \
+    {                                                                   \
+        static MVEGenTwoOpShiftFn * const fns[] = {                     \
+            gen_helper_mve_##FN##b,                                     \
+            gen_helper_mve_##FN##h,                                     \
+            gen_helper_mve_##FN##w,                                     \
+            NULL,                                                       \
+        };                                                              \
+        return do_2shift_vec(s, a, fns[a->size], NEGATESHIFT, VECFN);   \
     }
 
-DO_2SHIFT(VSHLI, vshli_u, false)
+#define DO_2SHIFT(INSN, FN, NEGATESHIFT)        \
+    DO_2SHIFT_VEC(INSN, FN, NEGATESHIFT, NULL)
+
+static void do_gvec_shri_s(unsigned vece, uint32_t dofs, uint32_t aofs,
+                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    /*
+     * We get here with a negated shift count, and we must handle
+     * shifts by the element size, which tcg_gen_gvec_sari() does not do.
+     */
+    shift = -shift;
+    if (shift == (8 << vece)) {
+        shift--;
+    }
+    tcg_gen_gvec_sari(vece, dofs, aofs, shift, oprsz, maxsz);
+}
+
+static void do_gvec_shri_u(unsigned vece, uint32_t dofs, uint32_t aofs,
+                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    /*
+     * We get here with a negated shift count, and we must handle
+     * shifts by the element size, which tcg_gen_gvec_shri() does not do.
+     */
+    shift = -shift;
+    if (shift == (8 << vece)) {
+        tcg_gen_gvec_dup_imm(vece, dofs, oprsz, maxsz, 0);
+    } else {
+        tcg_gen_gvec_shri(vece, dofs, aofs, shift, oprsz, maxsz);
+    }
+}
+
+DO_2SHIFT_VEC(VSHLI, vshli_u, false, tcg_gen_gvec_shli)
 DO_2SHIFT(VQSHLI_S, vqshli_s, false)
 DO_2SHIFT(VQSHLI_U, vqshli_u, false)
 DO_2SHIFT(VQSHLUI, vqshlui_s, false)
 /* These right shifts use a left-shift helper with negated shift count */
-DO_2SHIFT(VSHRI_S, vshli_s, true)
-DO_2SHIFT(VSHRI_U, vshli_u, true)
+DO_2SHIFT_VEC(VSHRI_S, vshli_s, true, do_gvec_shri_s)
+DO_2SHIFT_VEC(VSHRI_U, vshli_u, true, do_gvec_shri_u)
 DO_2SHIFT(VRSHRI_S, vrshli_s, true)
 DO_2SHIFT(VRSHRI_U, vrshli_u, true)
 
-- 
2.20.1
* [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (8 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 14:04   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 11/12] target/arm: Optimize MVE VSLI and VSRI Peter Maydell
  2021-09-13  9:54 ` [PATCH v2 12/12] target/arm: Optimize MVE 1op-immediate insns Peter Maydell
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE VSHLL insns by using TCG vector ops when possible.
This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
with zero shift count".
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
The cases here that I've implemented with ANDI then shift
could also be implemented as shift-then-shift. Is one better
than another?
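
For reference, a scalar sketch of one 16-bit output lane for the two
signed variants in this patch. It is only a model, not QEMU code:
to_s16(), sar(), ref_vshll_s(), model_vshll_bs(), model_vshll_ts() and
count_signed_mismatches() are hypothetical names, and the lane is assumed
to hold two 8-bit inputs (the MO_8 case, shift count 0..8).

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed value, written out by hand
 * so that no implementation-defined conversion is involved. */
static int to_s16(uint16_t u)
{
    return u >= 0x8000u ? (int)u - 0x10000 : (int)u;
}

/* Arithmetic shift right, written portably (n is 0..15). */
static int sar(int v, unsigned n)
{
    return v < 0 ? ~(~v >> n) : v >> n;
}

/* Reference result: sign-extend the selected input byte, shift left. */
static int ref_vshll_s(uint16_t lane, unsigned shift, int top)
{
    int b = top ? (lane >> 8) : (lane & 0xff);

    if (b >= 0x80) {
        b -= 0x100;
    }
    return b * (1 << shift);
}

/* Bottom half, shaped like do_gvec_vshllbs(): shli, then sari. */
static int model_vshll_bs(uint16_t lane, unsigned shift)
{
    assert(shift <= 8);
    return sar(to_s16((uint16_t)(lane << 8)), 8 - shift);
}

/* Top half, shaped like do_gvec_vshllts(): a single sari for the
 * VMOVL case, otherwise andi with the top-byte mask, then sari. */
static int model_vshll_ts(uint16_t lane, unsigned shift)
{
    assert(shift <= 8);
    if (shift == 0) {
        return sar(to_s16(lane), 8);
    }
    return sar(to_s16(lane & 0xff00u), 8 - shift);
}

/* Compare both models against the reference for every lane and shift. */
static unsigned count_signed_mismatches(void)
{
    unsigned bad = 0;

    for (uint32_t lane = 0; lane <= 0xffffu; lane++) {
        for (unsigned shift = 0; shift <= 8; shift++) {
            bad += model_vshll_bs((uint16_t)lane, shift) !=
                   ref_vshll_s((uint16_t)lane, shift, 0);
            bad += model_vshll_ts((uint16_t)lane, shift) !=
                   ref_vshll_s((uint16_t)lane, shift, 1);
        }
    }
    return bad;
}
```

In this model the mask (or the first shift) only serves to discard the
half of the lane that is not being widened; the final arithmetic shift
does the sign extension and the left shift in one step.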
---
 target/arm/translate-mve.c | 67 +++++++++++++++++++++++++++++++++-----
 1 file changed, 59 insertions(+), 8 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 00fa4379a74..5d66f70657e 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -1735,16 +1735,67 @@ DO_2SHIFT_SCALAR(VQSHL_U_scalar, vqshli_u)
 DO_2SHIFT_SCALAR(VQRSHL_S_scalar, vqrshli_s)
 DO_2SHIFT_SCALAR(VQRSHL_U_scalar, vqrshli_u)
 
-#define DO_VSHLL(INSN, FN)                                      \
-    static bool trans_##INSN(DisasContext *s, arg_2shift *a)    \
-    {                                                           \
-        static MVEGenTwoOpShiftFn * const fns[] = {             \
-            gen_helper_mve_##FN##b,                             \
-            gen_helper_mve_##FN##h,                             \
-        };                                                      \
-        return do_2shift(s, a, fns[a->size], false);            \
+#define DO_VSHLL(INSN, FN)                                              \
+    static bool trans_##INSN(DisasContext *s, arg_2shift *a)            \
+    {                                                                   \
+        static MVEGenTwoOpShiftFn * const fns[] = {                     \
+            gen_helper_mve_##FN##b,                                     \
+            gen_helper_mve_##FN##h,                                     \
+        };                                                              \
+        return do_2shift_vec(s, a, fns[a->size], false, do_gvec_##FN);  \
     }
 
+/*
+ * For the VSHLL vector helpers, the vece is the size of the input
+ * (ie MO_8 or MO_16); the helpers want to work in the output size.
+ * The shift count can be 0..<input size>, inclusive. (0 is VMOVL.)
+ */
+static void do_gvec_vshllbs(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    unsigned ibits = vece == MO_8 ? 8 : 16;
+    tcg_gen_gvec_shli(ovece, dofs, aofs, ibits, oprsz, maxsz);
+    tcg_gen_gvec_sari(ovece, dofs, dofs, ibits - shift, oprsz, maxsz);
+}
+
+static void do_gvec_vshllbu(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    tcg_gen_gvec_andi(ovece, dofs, aofs,
+                      ovece == MO_16 ? 0xff : 0xffff, oprsz, maxsz);
+    tcg_gen_gvec_shli(ovece, dofs, dofs, shift, oprsz, maxsz);
+}
+
+static void do_gvec_vshllts(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    unsigned ibits = vece == MO_8 ? 8 : 16;
+    if (shift == 0) {
+        tcg_gen_gvec_sari(ovece, dofs, aofs, ibits, oprsz, maxsz);
+    } else {
+        tcg_gen_gvec_andi(ovece, dofs, aofs,
+                          ovece == MO_16 ? 0xff00 : 0xffff0000, oprsz, maxsz);
+        tcg_gen_gvec_sari(ovece, dofs, dofs, ibits - shift, oprsz, maxsz);
+    }
+}
+
+static void do_gvec_vshlltu(unsigned vece, uint32_t dofs, uint32_t aofs,
+                            int64_t shift, uint32_t oprsz, uint32_t maxsz)
+{
+    unsigned ovece = vece + 1;
+    unsigned ibits = vece == MO_8 ? 8 : 16;
+    if (shift == 0) {
+        tcg_gen_gvec_shri(ovece, dofs, aofs, ibits, oprsz, maxsz);
+    } else {
+        tcg_gen_gvec_andi(ovece, dofs, aofs,
+                          ovece == MO_16 ? 0xff00 : 0xffff0000, oprsz, maxsz);
+        tcg_gen_gvec_shri(ovece, dofs, dofs, ibits - shift, oprsz, maxsz);
+    }
+}
+
 DO_VSHLL(VSHLL_BS, vshllbs)
 DO_VSHLL(VSHLL_BU, vshllbu)
 DO_VSHLL(VSHLL_TS, vshllts)
-- 
2.20.1
* [PATCH v2 11/12] target/arm: Optimize MVE VSLI and VSRI
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (9 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 14:04   ` Richard Henderson
  2021-09-13  9:54 ` [PATCH v2 12/12] target/arm: Optimize MVE 1op-immediate insns Peter Maydell
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE shift-and-insert insns by using TCG
vector ops when possible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 5d66f70657e..1fd71c9a1ee 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -1668,8 +1668,8 @@ DO_2SHIFT_VEC(VSHRI_U, vshli_u, true, do_gvec_shri_u)
 DO_2SHIFT(VRSHRI_S, vrshli_s, true)
 DO_2SHIFT(VRSHRI_U, vrshli_u, true)
 
-DO_2SHIFT(VSRI, vsri, false)
-DO_2SHIFT(VSLI, vsli, false)
+DO_2SHIFT_VEC(VSRI, vsri, false, gen_gvec_sri)
+DO_2SHIFT_VEC(VSLI, vsli, false, gen_gvec_sli)
 
 #define DO_2SHIFT_FP(INSN, FN)                                  \
     static bool trans_##INSN(DisasContext *s, arg_2shift *a)    \
-- 
2.20.1
* [PATCH v2 12/12] target/arm: Optimize MVE 1op-immediate insns
  2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
                   ` (10 preceding siblings ...)
  2021-09-13  9:54 ` [PATCH v2 11/12] target/arm: Optimize MVE VSLI and VSRI Peter Maydell
@ 2021-09-13  9:54 ` Peter Maydell
  2021-09-13 14:09   ` Richard Henderson
  11 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13  9:54 UTC (permalink / raw)
  To: qemu-arm, qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
Optimize the MVE 1op-immediate insns (VORR, VBIC, VMOV) to
use TCG vector ops when possible.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/translate-mve.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/target/arm/translate-mve.c b/target/arm/translate-mve.c
index 1fd71c9a1ee..4267d43cc7c 100644
--- a/target/arm/translate-mve.c
+++ b/target/arm/translate-mve.c
@@ -1521,7 +1521,8 @@ static bool trans_VADDLV(DisasContext *s, arg_VADDLV *a)
     return true;
 }
 
-static bool do_1imm(DisasContext *s, arg_1imm *a, MVEGenOneOpImmFn *fn)
+static bool do_1imm(DisasContext *s, arg_1imm *a, MVEGenOneOpImmFn *fn,
+                    GVecGen2iFn *vecfn)
 {
     TCGv_ptr qd;
     uint64_t imm;
@@ -1537,17 +1538,29 @@ static bool do_1imm(DisasContext *s, arg_1imm *a, MVEGenOneOpImmFn *fn)
 
     imm = asimd_imm_const(a->imm, a->cmode, a->op);
 
-    qd = mve_qreg_ptr(a->qd);
-    fn(cpu_env, qd, tcg_constant_i64(imm));
-    tcg_temp_free_ptr(qd);
+    if (vecfn && mve_no_predication(s)) {
+        vecfn(MO_64, mve_qreg_offset(a->qd), mve_qreg_offset(a->qd),
+              imm, 16, 16);
+    } else {
+        qd = mve_qreg_ptr(a->qd);
+        fn(cpu_env, qd, tcg_constant_i64(imm));
+        tcg_temp_free_ptr(qd);
+    }
     mve_update_eci(s);
     return true;
 }
 
+static void gen_gvec_vmovi(unsigned vece, uint32_t dofs, uint32_t aofs,
+                           int64_t c, uint32_t oprsz, uint32_t maxsz)
+{
+    tcg_gen_gvec_dup_imm(vece, dofs, oprsz, maxsz, c);
+}
+
 static bool trans_Vimm_1r(DisasContext *s, arg_1imm *a)
 {
     /* Handle decode of cmode/op here between VORR/VBIC/VMOV */
     MVEGenOneOpImmFn *fn;
+    GVecGen2iFn *vecfn;
 
     if ((a->cmode & 1) && a->cmode < 12) {
         if (a->op) {
@@ -1556,8 +1569,10 @@ static bool trans_Vimm_1r(DisasContext *s, arg_1imm *a)
              * so the VBIC becomes a logical AND operation.
              */
             fn = gen_helper_mve_vandi;
+            vecfn = tcg_gen_gvec_andi;
         } else {
             fn = gen_helper_mve_vorri;
+            vecfn = tcg_gen_gvec_ori;
         }
     } else {
         /* There is one unallocated cmode/op combination in this space */
@@ -1566,8 +1581,9 @@ static bool trans_Vimm_1r(DisasContext *s, arg_1imm *a)
         }
         /* asimd_imm_const() sorts out VMVNI vs VMOVI for us */
         fn = gen_helper_mve_vmovi;
+        vecfn = gen_gvec_vmovi;
     }
-    return do_1imm(s, a, fn);
+    return do_1imm(s, a, fn, vecfn);
 }
 
 static bool do_2shift_vec(DisasContext *s, arg_2shift *a, MVEGenTwoOpShiftFn fn,
-- 
2.20.1
* Re: [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop
  2021-09-13  9:54 ` [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop Peter Maydell
@ 2021-09-13 13:36   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 13:36 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Currently gen_jmp_tb() assumes that if it is called then the jump it
> is handling is the only reason that we might be trying to end the TB,
> so it will use goto_tb if it can.  This is usually the case: mostly
> "we did something that means we must end the TB" happens on a
> non-branch instruction.  However, there are cases where we decide
> early in handling an instruction that we need to end the TB and
> return to the main loop, and then the insn is a complex one that
> involves gen_jmp_tb().  For instance, for M-profile FP instructions,
> in gen_preserve_fp_state() which is called from vfp_access_check() we
> want to force an exit to the main loop if lazy state preservation is
> active and we are in icount mode.
> 
> Make gen_jmp_tb() look at the current value of is_jmp, and only use
> goto_tb if the previous is_jmp was DISAS_NEXT or DISAS_TOO_MANY.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate.c | 34 +++++++++++++++++++++++++++++++++-
>   1 file changed, 33 insertions(+), 1 deletion(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 02/12] target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration
  2021-09-13  9:54 ` [PATCH v2 02/12] target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration Peter Maydell
@ 2021-09-13 13:39   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 13:39 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Architecturally, for an M-profile CPU with the LOB feature the
> LTPSIZE field in FPDSCR is always constant 4.  QEMU's implementation
> enforces this everywhere, except that we don't check that it is true
> in incoming migration data.
> 
> We're going to add come in gen_update_fp_context() which relies on
"code"
> the "always 4" property.  Since this is TCG-only, we don't actually
> need to be robust to bogus incoming migration data, and the effect of
> it being wrong would be wrong code generation rather than a QEMU
> crash; but if it did ever happen somehow it would be very difficult
> to track down the cause.  Add a check so that we fail the inbound
> migration if the FPDSCR.LTPSIZE value is incorrect.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 03/12] target/arm: Add TB flag for "MVE insns not predicated"
  2021-09-13  9:54 ` [PATCH v2 03/12] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
@ 2021-09-13 13:44   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 13:44 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Our current codegen for MVE always calls out to helper functions,
> because some byte lanes might be predicated.  The common case is that
> in fact there is no predication active and all lanes should be
> updated together, so we can produce better code by detecting that and
> using the TCG generic vector infrastructure.
> 
> Add a TB flag that is set when we can guarantee that there is no
> active MVE predication, and a bool in the DisasContext.  Subsequent
> patches will use this flag to generate improved code for some
> instructions.
> 
> In most cases when the predication state changes we simply end the TB
> after that instruction.  For the code called from vfp_access_check()
> that handles lazy state preservation and creating a new FP context,
> we can usually avoid having to try to end the TB because luckily the
> new value of the flag following the register changes in those
> sequences doesn't depend on any runtime decisions.  We do have to end
> the TB if the guest has enabled lazy FP state preservation but not
> automatic state preservation, but this is an odd corner case that is
> not going to be common in real-world code.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
> I renamed the mve_no_predication() function to mve_no_pred() because
> I want to use the former name in patch 2 for the translate-time "no
> predication of any kind including ECI", and wanted to distinguish it
> from this function that is just determining the value of the TB flag
> bit.  Better naming suggestions welcome.
> ---
>   target/arm/cpu.h              |  4 +++-
>   target/arm/translate.h        |  2 ++
>   target/arm/helper.c           | 33 +++++++++++++++++++++++++++++++++
>   target/arm/translate-m-nocp.c |  8 +++++++-
>   target/arm/translate-mve.c    | 13 ++++++++++++-
>   target/arm/translate-vfp.c    | 33 +++++++++++++++++++++++++++------
>   target/arm/translate.c        |  8 ++++++++
>   7 files changed, 92 insertions(+), 9 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 07/12] target/arm: Optimize MVE VDUP
  2021-09-13  9:54 ` [PATCH v2 07/12] target/arm: Optimize MVE VDUP Peter Maydell
@ 2021-09-13 13:46   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 13:46 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Optimize the MVE VDUP insns by using TCG vector ops when possible.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 08/12] target/arm: Optimize MVE VMVN
  2021-09-13  9:54 ` [PATCH v2 08/12] target/arm: Optimize MVE VMVN Peter Maydell
@ 2021-09-13 13:47   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 13:47 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Optimize the MVE VMVN insn by using TCG vector ops when possible.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms
  2021-09-13  9:54 ` [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms Peter Maydell
@ 2021-09-13 13:56   ` Richard Henderson
  2021-09-13 14:21     ` Peter Maydell
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 13:56 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> +static void do_gvec_shri_s(unsigned vece, uint32_t dofs, uint32_t aofs,
> +                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
> +{
> +    /*
> +     * We get here with a negated shift count, and we must handle
> +     * shifts by the element size, which tcg_gen_gvec_sari() does not do.
> +     */
> +    shift = -shift;
You've already performed the negation in do_2shift_vec.
> +    if (shift == (8 << vece)) {
> +        shift--;
> +    }
> +    tcg_gen_gvec_sari(vece, dofs, aofs, shift, oprsz, maxsz);
...
> +    if (shift == (8 << vece)) {
> +        tcg_gen_gvec_dup_imm(vece, dofs, oprsz, maxsz, 0);
> +    } else {
> +        tcg_gen_gvec_shri(vece, dofs, aofs, shift, oprsz, maxsz);
> +    }
Perhaps worth placing these functions somewhere we can share code with NEON?
Tactical error, perhaps, open-coding these tests in trans_VSHR_S_2sh and
trans_VSHR_U_2sh.
r~
* Re: [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
  2021-09-13  9:54 ` [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL Peter Maydell
@ 2021-09-13 14:04   ` Richard Henderson
  2021-09-13 14:22     ` Peter Maydell
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 14:04 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Optimize the MVE VSHLL insns by using TCG vector ops when possible.
> This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
> with zero shift count".
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> The cases here that I've implemented with ANDI then shift
> could also be implemented as shift-then-shift. Is one better
> than another?
I would expect and + shift to be preferred over shift + shift.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 11/12] target/arm: Optimize MVE VSLI and VSRI
  2021-09-13  9:54 ` [PATCH v2 11/12] target/arm: Optimize MVE VSLI and VSRI Peter Maydell
@ 2021-09-13 14:04   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 14:04 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Optimize the MVE shift-and-insert insns by using TCG
> vector ops when possible.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 12/12] target/arm: Optimize MVE 1op-immediate insns
  2021-09-13  9:54 ` [PATCH v2 12/12] target/arm: Optimize MVE 1op-immediate insns Peter Maydell
@ 2021-09-13 14:09   ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 14:09 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 9/13/21 2:54 AM, Peter Maydell wrote:
> Optimize the MVE 1op-immediate insns (VORR, VBIC, VMOV) to
> use TCG vector ops when possible.
> 
> Signed-off-by: Peter Maydell<peter.maydell@linaro.org>
> ---
>   target/arm/translate-mve.c | 26 +++++++++++++++++++++-----
>   1 file changed, 21 insertions(+), 5 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms
  2021-09-13 13:56   ` Richard Henderson
@ 2021-09-13 14:21     ` Peter Maydell
  2021-09-13 15:53       ` Richard Henderson
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13 14:21 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers, Philippe Mathieu-Daudé
On Mon, 13 Sept 2021 at 14:56, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/13/21 2:54 AM, Peter Maydell wrote:
> > +static void do_gvec_shri_s(unsigned vece, uint32_t dofs, uint32_t aofs,
> > +                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
> > +{
> > +    /*
> > +     * We get here with a negated shift count, and we must handle
> > +     * shifts by the element size, which tcg_gen_gvec_sari() does not do.
> > +     */
> > +    shift = -shift;
>
> You've already performed the negation in do_2shift_vec.
Here we are undoing the negation we did there, so as to get a
"positive means shift right" shift count back again, which is what
the instruction encoding has and what tcg_gen_gvec_shri() wants.
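
To make the round trip concrete, here is a scalar sketch of a single
8-bit lane. It is only a model: model_shri_s8() and model_shri_u8() are
hypothetical names, not QEMU functions, and the sketch assumes shift
counts of 1..8 arriving negated, as they do in do_gvec_shri_s() and
do_gvec_shri_u().

```c
#include <assert.h>
#include <stdint.h>

/* Signed right shift of one 8-bit lane, taking the negated count. */
static int8_t model_shri_s8(int8_t x, int negated_shift)
{
    int shift = -negated_shift;     /* back to "positive means right" */
    int v = x;

    assert(shift >= 1 && shift <= 8);
    if (shift == 8) {
        /* Shifting by 7 already fills the lane with the sign bit. */
        shift--;
    }
    /* Portable arithmetic shift right. */
    v = v < 0 ? ~(~v >> shift) : v >> shift;
    return (int8_t)v;
}

/* Unsigned right shift of one 8-bit lane, taking the negated count. */
static uint8_t model_shri_u8(uint8_t x, int negated_shift)
{
    int shift = -negated_shift;

    assert(shift >= 1 && shift <= 8);
    if (shift == 8) {
        /* A logical shift by the element size can only give zero. */
        return 0;
    }
    return (uint8_t)(x >> shift);
}
```

For example, model_shri_s8(-128, -8) gives -1 and model_shri_u8(0xff, -8)
gives 0, which is the behaviour the two special cases in the patch are
there to provide.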
> > +    if (shift == (8 << vece)) {
> > +        shift--;
> > +    }
> > +    tcg_gen_gvec_sari(vece, dofs, aofs, shift, oprsz, maxsz);
> ...
> > +    if (shift == (8 << vece)) {
> > +        tcg_gen_gvec_dup_imm(vece, dofs, oprsz, maxsz, 0);
> > +    } else {
> > +        tcg_gen_gvec_shri(vece, dofs, aofs, shift, oprsz, maxsz);
> > +    }
>
>
> Perhaps worth placing these functions somewhere we can share code with NEON?  Tactical
> error, perhaps, open-coding these tests in trans_VSHR_S_2sh and trans_VSHR_U_2sh.
I'm not convinced the resemblance is close enough to be worth the
effort...
-- PMM
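The shift-count handling discussed above can be sketched on a single 8-bit lane. This is a scalar illustration only, not QEMU code: the names lane_shr_s8() and lane_shr_u8() are hypothetical, and the sketch assumes the caller passes the negated count (-8..-1 for an 8-bit element), as the quoted comment describes. It shows why a signed right shift by the full element size can be clamped to "element size minus one", while the unsigned case has to produce zero explicitly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Signed right shift of one 8-bit lane (hypothetical helper).
 * 'shift' arrives negated, so valid values are -8..-1.
 */
static int8_t lane_shr_s8(int8_t x, int shift)
{
    assert(shift >= -8 && shift <= -1);
    shift = -shift;             /* back to "positive means shift right" */
    if (shift == 8) {
        shift--;                /* shifting by 7 already leaves only sign bits */
    }
    /* Portable arithmetic shift: never right-shifts a negative value. */
    int v = x;
    return (int8_t)(v < 0 ? ~(~v >> shift) : v >> shift);
}

/*
 * Unsigned right shift of one 8-bit lane (hypothetical helper).
 * A shift by the full element size moves every bit out, giving zero.
 */
static uint8_t lane_shr_u8(uint8_t x, int shift)
{
    assert(shift >= -8 && shift <= -1);
    shift = -shift;
    if (shift == 8) {
        return 0;
    }
    return (uint8_t)(x >> shift);
}
```

In the quoted patch the same two special cases appear as the "shift--" before tcg_gen_gvec_sari() and the tcg_gen_gvec_dup_imm(..., 0) branch before tcg_gen_gvec_shri().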
* Re: [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
  2021-09-13 14:04   ` Richard Henderson
@ 2021-09-13 14:22     ` Peter Maydell
  2021-09-13 15:56       ` Richard Henderson
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-13 14:22 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers, Philippe Mathieu-Daudé
On Mon, 13 Sept 2021 at 15:04, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/13/21 2:54 AM, Peter Maydell wrote:
> > Optimize the MVE VSHLL insns by using TCG vector ops when possible.
> > This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
> > with zero shift count".
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> > The cases here that I've implemented with ANDI then shift
> > could also be implemented as shift-then-shift. Is one better
> > than another?
>
> I would expect and + shift to be preferred over shift + shift.
OK. (I wasn't sure, because and + shift requires another insn
to assemble the immediate constant, I think.)
-- PMM
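The two strategies being compared above can be sketched on one 16-bit lane. This is a scalar illustration under stated assumptions, not the code from the patch: it assumes the unsigned, bottom-half, 8-to-16-bit form of the widening shift, with a shift count of 0..8 (0 being the VMOVL case), and the function names are hypothetical. Both versions give the same result; the and + shift form needs a mask constant, the shift + shift form needs two trips through the shifter.

```c
#include <assert.h>
#include <stdint.h>

/* and + shift: mask off the low byte, then shift it into place. */
static uint16_t widen_u8_and_shift(uint16_t lane, unsigned shift)
{
    assert(shift <= 8);
    return (uint16_t)((lane & 0x00ffu) << shift);
}

/*
 * shift + shift: move the low byte to the top of the lane (discarding
 * the high byte), then shift right so it lands at bit position 'shift'.
 */
static uint16_t widen_u8_shift_shift(uint16_t lane, unsigned shift)
{
    assert(shift <= 8);
    uint16_t top = (uint16_t)(lane << 8);
    return (uint16_t)(top >> (8 - shift));
}
```

For a signed source element the second step of the shift + shift form would presumably be an arithmetic right shift instead, which is one reason that form is a natural alternative to masking.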
* Re: [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms
  2021-09-13 14:21     ` Peter Maydell
@ 2021-09-13 15:53       ` Richard Henderson
  2021-09-16 10:01         ` Peter Maydell
  0 siblings, 1 reply; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 15:53 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers, Philippe Mathieu-Daudé
On 9/13/21 7:21 AM, Peter Maydell wrote:
> On Mon, 13 Sept 2021 at 14:56, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/13/21 2:54 AM, Peter Maydell wrote:
>>> +static void do_gvec_shri_s(unsigned vece, uint32_t dofs, uint32_t aofs,
>>> +                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
>>> +{
>>> +    /*
>>> +     * We get here with a negated shift count, and we must handle
>>> +     * shifts by the element size, which tcg_gen_gvec_sari() does not do.
>>> +     */
>>> +    shift = -shift;
>>
>> You've already performed the negation in do_2shift_vec.
> 
> Here we are undoing the negation we did there, so as to get a
> "positive means shift right" shift count back again, which is what
> the instruction encoding has and what tcg_gen_gvec_shri() wants.
Ah, I misinterpreted.
>> Perhaps worth placing these functions somewhere we can share code with NEON?  Tactical
>> error, perhaps, open-coding these tests in trans_VSHR_S_2sh and trans_VSHR_U_2sh.
> 
> I'm not convinced the resemblance is close enough to be worth the
> effort...
Yeah, not with the negation bit above.
r~
* Re: [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL
  2021-09-13 14:22     ` Peter Maydell
@ 2021-09-13 15:56       ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-13 15:56 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers, Philippe Mathieu-Daudé
On 9/13/21 7:22 AM, Peter Maydell wrote:
> On Mon, 13 Sept 2021 at 15:04, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/13/21 2:54 AM, Peter Maydell wrote:
>>> Optimize the MVE VSHLL insns by using TCG vector ops when possible.
>>> This includes the VMOVL insn, which we handle in mve.decode as "VSHLL
>>> with zero shift count".
>>>
>>> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
>>> ---
>>> The cases here that I've implemented with ANDI then shift
>>> could also be implemented as shift-then-shift. Is one better
>>> than another?
>>
>> I would expect and + shift to be preferred over shift + shift.
> 
> OK. (I wasn't sure, because and + shift requires another insn
> to assemble the immediate constant, I think.)
Yeah, though Arm itself is good about not requiring one.  But there's generally only one
shifter across multiple pipelines.  Not that we're doing any sort of compute resource
allocation and scheduling...
r~
* Re: [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms
  2021-09-13 15:53       ` Richard Henderson
@ 2021-09-16 10:01         ` Peter Maydell
  2021-09-16 14:39           ` Richard Henderson
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Maydell @ 2021-09-16 10:01 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, QEMU Developers, Philippe Mathieu-Daudé
On Mon, 13 Sept 2021 at 16:53, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 9/13/21 7:21 AM, Peter Maydell wrote:
> > On Mon, 13 Sept 2021 at 14:56, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> On 9/13/21 2:54 AM, Peter Maydell wrote:
> >>> +static void do_gvec_shri_s(unsigned vece, uint32_t dofs, uint32_t aofs,
> >>> +                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
> >>> +{
> >>> +    /*
> >>> +     * We get here with a negated shift count, and we must handle
> >>> +     * shifts by the element size, which tcg_gen_gvec_sari() does not do.
> >>> +     */
> >>> +    shift = -shift;
> >>
> >> You've already performed the negation in do_2shift_vec.
> >
> > Here we are undoing the negation we did there, so as to get a
> > "positive means shift right" shift count back again, which is what
> > the instruction encoding has and what tcg_gen_gvec_shri() wants.
>
> Ah, I misinterpreted.
>
> >> Perhaps worth placing these functions somewhere we can share code with NEON?  Tactical
> >> error, perhaps, open-coding these tests in trans_VSHR_S_2sh and trans_VSHR_U_2sh.
> >
> > I'm not convinced the resemblance is close enough to be worth the
> > effort...
>
> Yeah, not with the negation bit above.
Could I get a Reviewed-by for this patch, then, please?
thanks
-- PMM
* Re: [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms
  2021-09-16 10:01         ` Peter Maydell
@ 2021-09-16 14:39           ` Richard Henderson
  0 siblings, 0 replies; 28+ messages in thread
From: Richard Henderson @ 2021-09-16 14:39 UTC (permalink / raw)
  To: Peter Maydell; +Cc: qemu-arm, QEMU Developers, Philippe Mathieu-Daudé
On 9/16/21 3:01 AM, Peter Maydell wrote:
> On Mon, 13 Sept 2021 at 16:53, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 9/13/21 7:21 AM, Peter Maydell wrote:
>>> On Mon, 13 Sept 2021 at 14:56, Richard Henderson
>>> <richard.henderson@linaro.org> wrote:
>>>>
>>>> On 9/13/21 2:54 AM, Peter Maydell wrote:
>>>>> +static void do_gvec_shri_s(unsigned vece, uint32_t dofs, uint32_t aofs,
>>>>> +                           int64_t shift, uint32_t oprsz, uint32_t maxsz)
>>>>> +{
>>>>> +    /*
>>>>> +     * We get here with a negated shift count, and we must handle
>>>>> +     * shifts by the element size, which tcg_gen_gvec_sari() does not do.
>>>>> +     */
>>>>> +    shift = -shift;
>>>>
>>>> You've already performed the negation in do_2shift_vec.
>>>
>>> Here we are undoing the negation we did there, so as to get a
>>> "positive means shift right" shift count back again, which is what
>>> the instruction encoding has and what tcg_gen_gvec_shri() wants.
>>
>> Ah, I misinterpreted.
>>
>>>> Perhaps worth placing these functions somewhere we can share code with NEON?  Tactical
>>>> error, perhaps, open-coding these tests in trans_VSHR_S_2sh and trans_VSHR_U_2sh.
>>>
>>> I'm not convinced the resemblance is close enough to be worth the
>>> effort...
>>
>> Yeah, not with the negation bit above.
> 
> Could I get a Reviewed-by for this patch, then, please?
Oops, yes.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
Thread overview: 28+ messages
2021-09-13  9:54 [PATCH v2 00/12] target/arm: Use TCG vector ops for MVE Peter Maydell
2021-09-13  9:54 ` [PATCH v2 01/12] target/arm: Avoid goto_tb if we're trying to exit to the main loop Peter Maydell
2021-09-13 13:36   ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 02/12] target/arm: Enforce that FPDSCR.LTPSIZE is 4 on inbound migration Peter Maydell
2021-09-13 13:39   ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 03/12] target/arm: Add TB flag for "MVE insns not predicated" Peter Maydell
2021-09-13 13:44   ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 04/12] target/arm: Optimize MVE logic ops Peter Maydell
2021-09-13  9:54 ` [PATCH v2 05/12] target/arm: Optimize MVE arithmetic ops Peter Maydell
2021-09-13  9:54 ` [PATCH v2 06/12] target/arm: Optimize MVE VNEG, VABS Peter Maydell
2021-09-13  9:54 ` [PATCH v2 07/12] target/arm: Optimize MVE VDUP Peter Maydell
2021-09-13 13:46   ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 08/12] target/arm: Optimize MVE VMVN Peter Maydell
2021-09-13 13:47   ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 09/12] target/arm: Optimize MVE VSHL, VSHR immediate forms Peter Maydell
2021-09-13 13:56   ` Richard Henderson
2021-09-13 14:21     ` Peter Maydell
2021-09-13 15:53       ` Richard Henderson
2021-09-16 10:01         ` Peter Maydell
2021-09-16 14:39           ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 10/12] target/arm: Optimize MVE VSHLL and VMOVL Peter Maydell
2021-09-13 14:04   ` Richard Henderson
2021-09-13 14:22     ` Peter Maydell
2021-09-13 15:56       ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 11/12] target/arm: Optimize MVE VSLI and VSRI Peter Maydell
2021-09-13 14:04   ` Richard Henderson
2021-09-13  9:54 ` [PATCH v2 12/12] target/arm: Optimize MVE 1op-immediate insns Peter Maydell
2021-09-13 14:09   ` Richard Henderson