[Qemu-devel] [PULL 24/48] target/arm: Convert VFP VMLA to decodetree

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Peter Maydell <peter.maydell@linaro.org>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [PULL 24/48] target/arm: Convert VFP VMLA to decodetree
Date: Thu, 13 Jun 2019 13:14:09 +0100	[thread overview]
Message-ID: <20190613121433.5246-25-peter.maydell@linaro.org> (raw)
In-Reply-To: <20190613121433.5246-1-peter.maydell@linaro.org>

Convert the VFP VMLA instruction to decodetree.

This is the first of the VFP 3-operand data processing instructions,
so we include in this patch the code which loops over the elements
for an old-style VFP vector operation. The existing code to do this
looping uses the deprecated cpu_F0s/F0d/F1s/F1d TCG globals; since
we are going to be converting instructions one at a time anyway
we can take the opportunity to make the new loop use TCG temporaries,
which means we can do that conversion one operation at a time
rather than needing to do it all in one go.

We include an UNDEF check which was missing in the old code:
short-vector operations (with stride or length non-zero) were
deprecated in v7A and must UNDEF in v8A, so if the MVFR0 FPShVec
field does not indicate that support for short vectors is present
we UNDEF the operations that would use them. (This is a change
of behaviour for Cortex-A7, Cortex-A15 and the v8 CPUs, which
previously were all incorrectly allowing short-vector operations.)

Note that the conversion fixes a bug in the old code for the
case of VFP short-vector "mixed scalar/vector operations". These
happen where the destination register is in a vector bank but
but the second operand is in a scalar bank. For example
  vmla.f64 d10, d1, d16   with length 2 stride 2
is equivalent to the pair of scalar operations
  vmla.f64 d10, d1, d16
  vmla.f64 d8, d3, d16
where the destination and first input register cycle through
their vector but the second input is scalar (d16). In the
old decoder the gen_vfp_F1_mul() operation uses cpu_F1{s,d}
as a temporary output for the multiply, which trashes the
second input operand. For the fully-scalar case (where we
never do a second iteration) and the fully-vector case
(where the loop loads the new second input operand) this
doesn't matter, but for the mixed scalar/vector case we
will end up using the wrong value for later loop iterations.
In the new code we use TCG temporaries and so avoid the bug.
This bug is present for all the multiply-accumulate insns
that operate on short vectors: VMLA, VMLS, VNMLA, VNMLS.

Note 2: the expression used to calculate the next register
number in the vector bank is not in fact correct; we leave
this behaviour unchanged from the old decoder and will
fix this bug later in the series.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/arm/cpu.h               |   5 +
 target/arm/translate-vfp.inc.c | 205 +++++++++++++++++++++++++++++++++
 target/arm/translate.c         |  14 ++-
 target/arm/vfp.decode          |   6 +
 4 files changed, 224 insertions(+), 6 deletions(-)

diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index dc50c86e318..92298624215 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -3377,6 +3377,11 @@ static inline bool isar_feature_aa32_fp_d32(const ARMISARegisters *id)
     return FIELD_EX64(id->mvfr0, MVFR0, SIMDREG) >= 2;
 }
 
+static inline bool isar_feature_aa32_fpshvec(const ARMISARegisters *id)
+{
+    return FIELD_EX64(id->mvfr0, MVFR0, FPSHVEC) > 0;
+}
+
 /*
  * We always set the FP and SIMD FP16 fields to indicate identical
  * levels of support (assuming SIMD is implemented at all), so
diff --git a/target/arm/translate-vfp.inc.c b/target/arm/translate-vfp.inc.c
index 9729946d734..4f922dc8405 100644
--- a/target/arm/translate-vfp.inc.c
+++ b/target/arm/translate-vfp.inc.c
@@ -1098,3 +1098,208 @@ static bool trans_VLDM_VSTM_dp(DisasContext *s, arg_VLDM_VSTM_dp *a)
 
     return true;
 }
+
+/*
+ * Types for callbacks for do_vfp_3op_sp() and do_vfp_3op_dp().
+ * The callback should emit code to write a value to vd. If
+ * do_vfp_3op_{sp,dp}() was passed reads_vd then the TCGv vd
+ * will contain the old value of the relevant VFP register;
+ * otherwise it must be written to only.
+ */
+typedef void VFPGen3OpSPFn(TCGv_i32 vd,
+                           TCGv_i32 vn, TCGv_i32 vm, TCGv_ptr fpst);
+typedef void VFPGen3OpDPFn(TCGv_i64 vd,
+                           TCGv_i64 vn, TCGv_i64 vm, TCGv_ptr fpst);
+
+/*
+ * Perform a 3-operand VFP data processing instruction. fn is the
+ * callback to do the actual operation; this function deals with the
+ * code to handle looping around for VFP vector processing.
+ */
+static bool do_vfp_3op_sp(DisasContext *s, VFPGen3OpSPFn *fn,
+                          int vd, int vn, int vm, bool reads_vd)
+{
+    uint32_t delta_m = 0;
+    uint32_t delta_d = 0;
+    uint32_t bank_mask = 0;
+    int veclen = s->vec_len;
+    TCGv_i32 f0, f1, fd;
+    TCGv_ptr fpst;
+
+    if (!dc_isar_feature(aa32_fpshvec, s) &&
+        (veclen != 0 || s->vec_stride != 0)) {
+        return false;
+    }
+
+    if (!vfp_access_check(s)) {
+        return true;
+    }
+
+    if (veclen > 0) {
+        bank_mask = 0x18;
+
+        /* Figure out what type of vector operation this is.  */
+        if ((vd & bank_mask) == 0) {
+            /* scalar */
+            veclen = 0;
+        } else {
+            delta_d = s->vec_stride + 1;
+
+            if ((vm & bank_mask) == 0) {
+                /* mixed scalar/vector */
+                delta_m = 0;
+            } else {
+                /* vector */
+                delta_m = delta_d;
+            }
+        }
+    }
+
+    f0 = tcg_temp_new_i32();
+    f1 = tcg_temp_new_i32();
+    fd = tcg_temp_new_i32();
+    fpst = get_fpstatus_ptr(0);
+
+    neon_load_reg32(f0, vn);
+    neon_load_reg32(f1, vm);
+
+    for (;;) {
+        if (reads_vd) {
+            neon_load_reg32(fd, vd);
+        }
+        fn(fd, f0, f1, fpst);
+        neon_store_reg32(fd, vd);
+
+        if (veclen == 0) {
+            break;
+        }
+
+        /* Set up the operands for the next iteration */
+        veclen--;
+        vd = ((vd + delta_d) & (bank_mask - 1)) | (vd & bank_mask);
+        vn = ((vn + delta_d) & (bank_mask - 1)) | (vn & bank_mask);
+        neon_load_reg32(f0, vn);
+        if (delta_m) {
+            vm = ((vm + delta_m) & (bank_mask - 1)) | (vm & bank_mask);
+            neon_load_reg32(f1, vm);
+        }
+    }
+
+    tcg_temp_free_i32(f0);
+    tcg_temp_free_i32(f1);
+    tcg_temp_free_i32(fd);
+    tcg_temp_free_ptr(fpst);
+
+    return true;
+}
+
+static bool do_vfp_3op_dp(DisasContext *s, VFPGen3OpDPFn *fn,
+                          int vd, int vn, int vm, bool reads_vd)
+{
+    uint32_t delta_m = 0;
+    uint32_t delta_d = 0;
+    uint32_t bank_mask = 0;
+    int veclen = s->vec_len;
+    TCGv_i64 f0, f1, fd;
+    TCGv_ptr fpst;
+
+    /* UNDEF accesses to D16-D31 if they don't exist */
+    if (!dc_isar_feature(aa32_fp_d32, s) && ((vd | vn | vm) & 0x10)) {
+        return false;
+    }
+
+    if (!dc_isar_feature(aa32_fpshvec, s) &&
+        (veclen != 0 || s->vec_stride != 0)) {
+        return false;
+    }
+
+    if (!vfp_access_check(s)) {
+        return true;
+    }
+
+    if (veclen > 0) {
+        bank_mask = 0xc;
+
+        /* Figure out what type of vector operation this is.  */
+        if ((vd & bank_mask) == 0) {
+            /* scalar */
+            veclen = 0;
+        } else {
+            delta_d = (s->vec_stride >> 1) + 1;
+
+            if ((vm & bank_mask) == 0) {
+                /* mixed scalar/vector */
+                delta_m = 0;
+            } else {
+                /* vector */
+                delta_m = delta_d;
+            }
+        }
+    }
+
+    f0 = tcg_temp_new_i64();
+    f1 = tcg_temp_new_i64();
+    fd = tcg_temp_new_i64();
+    fpst = get_fpstatus_ptr(0);
+
+    neon_load_reg64(f0, vn);
+    neon_load_reg64(f1, vm);
+
+    for (;;) {
+        if (reads_vd) {
+            neon_load_reg64(fd, vd);
+        }
+        fn(fd, f0, f1, fpst);
+        neon_store_reg64(fd, vd);
+
+        if (veclen == 0) {
+            break;
+        }
+        /* Set up the operands for the next iteration */
+        veclen--;
+        vd = ((vd + delta_d) & (bank_mask - 1)) | (vd & bank_mask);
+        vn = ((vn + delta_d) & (bank_mask - 1)) | (vn & bank_mask);
+        neon_load_reg64(f0, vn);
+        if (delta_m) {
+            vm = ((vm + delta_m) & (bank_mask - 1)) | (vm & bank_mask);
+            neon_load_reg64(f1, vm);
+        }
+    }
+
+    tcg_temp_free_i64(f0);
+    tcg_temp_free_i64(f1);
+    tcg_temp_free_i64(fd);
+    tcg_temp_free_ptr(fpst);
+
+    return true;
+}
+
+static void gen_VMLA_sp(TCGv_i32 vd, TCGv_i32 vn, TCGv_i32 vm, TCGv_ptr fpst)
+{
+    /* Note that order of inputs to the add matters for NaNs */
+    TCGv_i32 tmp = tcg_temp_new_i32();
+
+    gen_helper_vfp_muls(tmp, vn, vm, fpst);
+    gen_helper_vfp_adds(vd, vd, tmp, fpst);
+    tcg_temp_free_i32(tmp);
+}
+
+static bool trans_VMLA_sp(DisasContext *s, arg_VMLA_sp *a)
+{
+    return do_vfp_3op_sp(s, gen_VMLA_sp, a->vd, a->vn, a->vm, true);
+}
+
+static void gen_VMLA_dp(TCGv_i64 vd, TCGv_i64 vn, TCGv_i64 vm, TCGv_ptr fpst)
+{
+    /* Note that order of inputs to the add matters for NaNs */
+    TCGv_i64 tmp = tcg_temp_new_i64();
+
+    gen_helper_vfp_muld(tmp, vn, vm, fpst);
+    gen_helper_vfp_addd(vd, vd, tmp, fpst);
+    tcg_temp_free_i64(tmp);
+}
+
+static bool trans_VMLA_dp(DisasContext *s, arg_VMLA_sp *a)
+{
+    return do_vfp_3op_dp(s, gen_VMLA_dp, a->vd, a->vn, a->vm, true);
+}
diff --git a/target/arm/translate.c b/target/arm/translate.c
index 1c5c87d7d01..ad8f947a13b 100644
--- a/target/arm/translate.c
+++ b/target/arm/translate.c
@@ -3133,6 +3133,14 @@ static int disas_vfp_insn(DisasContext *s, uint32_t insn)
             op = ((insn >> 20) & 8) | ((insn >> 19) & 6) | ((insn >> 6) & 1);
             rn = VFP_SREG_N(insn);
 
+            switch (op) {
+            case 0:
+                /* Already handled by decodetree */
+                return 1;
+            default:
+                break;
+            }
+
             if (op == 15) {
                 /* rn is opcode, encoded as per VFP_SREG_N. */
                 switch (rn) {
@@ -3312,12 +3320,6 @@ static int disas_vfp_insn(DisasContext *s, uint32_t insn)
             for (;;) {
                 /* Perform the calculation.  */
                 switch (op) {
-                case 0: /* VMLA: fd + (fn * fm) */
-                    /* Note that order of inputs to the add matters for NaNs */
-                    gen_vfp_F1_mul(dp);
-                    gen_mov_F0_vreg(dp, rd);
-                    gen_vfp_add(dp);
-                    break;
                 case 1: /* VMLS: fd + -(fn * fm) */
                     gen_vfp_mul(dp);
                     gen_vfp_F1_neg(dp);
diff --git a/target/arm/vfp.decode b/target/arm/vfp.decode
index 68c9ffcfd3c..9530e17ae02 100644
--- a/target/arm/vfp.decode
+++ b/target/arm/vfp.decode
@@ -96,3 +96,9 @@ VLDM_VSTM_sp ---- 1101 0.1 l:1 rn:4 .... 1010 imm:8 \
              vd=%vd_sp p=1 u=0 w=1
 VLDM_VSTM_dp ---- 1101 0.1 l:1 rn:4 .... 1011 imm:8 \
              vd=%vd_dp p=1 u=0 w=1
+
+# 3-register VFP data-processing; bits [23,21:20,6] identify the operation.
+VMLA_sp      ---- 1110 0.00 .... .... 1010 .0.0 .... \
+             vm=%vm_sp vn=%vn_sp vd=%vd_sp
+VMLA_dp      ---- 1110 0.00 .... .... 1011 .0.0 .... \
+             vm=%vm_dp vn=%vn_dp vd=%vd_dp
-- 
2.20.1

next prev parent reply	other threads:[~2019-06-13 13:13 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-13 12:13 [Qemu-devel] [PULL 00/48] target-arm queue Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 01/48] target/arm: Vectorize USHL and SSHL Peter Maydell
2019-06-13 14:11   ` Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 02/48] target/arm: Use tcg_gen_gvec_bitsel Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 03/48] target/arm: Implement NSACR gating of floating point Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 04/48] hw/arm/smmuv3: Fix decoding of ID register range Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 05/48] hw/core/bus.c: Only the main system bus can have no parent Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 06/48] target/arm: Fix output of PAuth Auth Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 07/48] decodetree: Fix comparison of Field Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 08/48] target/arm: Add stubs for AArch32 VFP decodetree Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 09/48] target/arm: Factor out VFP access checking code Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 10/48] target/arm: Fix Cortex-R5F MVFR values Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 11/48] target/arm: Explicitly enable VFP short-vectors for aarch32 -cpu max Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 12/48] target/arm: Convert the VSEL instructions to decodetree Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 13/48] target/arm: Convert VMINNM, VMAXNM " Peter Maydell
2019-06-13 12:13 ` [Qemu-devel] [PULL 14/48] target/arm: Convert VRINTA/VRINTN/VRINTP/VRINTM " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 15/48] target/arm: Convert VCVTA/VCVTN/VCVTP/VCVTM " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 16/48] target/arm: Move the VFP trans_* functions to translate-vfp.inc.c Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 17/48] target/arm: Add helpers for VFP register loads and stores Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 18/48] target/arm: Convert "double-precision" register moves to decodetree Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 19/48] target/arm: Convert "single-precision" " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 20/48] target/arm: Convert VFP two-register transfer insns " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 21/48] target/arm: Convert VFP VLDR and VSTR " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 22/48] target/arm: Convert the VFP load/store multiple insns " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 23/48] target/arm: Remove VLDR/VSTR/VLDM/VSTM use of cpu_F0s and cpu_F0d Peter Maydell
2019-06-13 12:14 ` Peter Maydell [this message]
2019-06-13 12:14 ` [Qemu-devel] [PULL 25/48] target/arm: Convert VFP VMLS to decodetree Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 26/48] target/arm: Convert VFP VNMLS " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 27/48] target/arm: Convert VFP VNMLA " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 28/48] target/arm: Convert VMUL " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 29/48] target/arm: Convert VNMUL " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 30/48] target/arm: Convert VADD " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 31/48] target/arm: Convert VSUB " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 32/48] target/arm: Convert VDIV " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 33/48] target/arm: Convert VFP fused multiply-add insns " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 34/48] target/arm: Convert VMOV (imm) " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 35/48] target/arm: Convert VABS " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 36/48] target/arm: Convert VNEG " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 37/48] target/arm: Convert VSQRT " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 38/48] target/arm: Convert VMOV (register) " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 39/48] target/arm: Convert VFP comparison insns " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 40/48] target/arm: Convert the VCVT-from-f16 " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 41/48] target/arm: Convert the VCVT-to-f16 " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 42/48] target/arm: Convert VFP round " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 43/48] target/arm: Convert double-single precision conversion " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 44/48] target/arm: Convert integer-to-float " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 45/48] target/arm: Convert VJCVT " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 46/48] target/arm: Convert VCVT fp/fixed-point conversion insns " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 47/48] target/arm: Convert float-to-integer VCVT " Peter Maydell
2019-06-13 12:14 ` [Qemu-devel] [PULL 48/48] target/arm: Fix short-vector increment behaviour Peter Maydell
2019-06-13 14:18 ` [Qemu-devel] [PULL 00/48] target-arm queue no-reply
2019-06-13 16:51 ` no-reply

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:dc50c86e31 dfblob:9229862421 dfblob:9729946d73
dfblob:4f922dc840 dfblob:1c5c87d7d0 dfblob:ad8f947a13 dfblob:68c9ffcfd3
dfblob:9530e17ae0 )
 OR (
bs:"[Qemu-devel] [PULL 24/48] target/arm: Convert VFP VMLA to decodetree" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190613121433.5246-25-peter.maydell@linaro.org \
    --to=peter.maydell@linaro.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).