[PATCH 0/8] target/arm: Implement FEAT_EBF16

* [PATCH 0/8] target/arm: Implement FEAT_EBF16
@ 2024-07-30 16:02 Peter Maydell
  2024-07-30 16:02 ` [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16 Peter Maydell
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:02 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

This patchset implements the optional FEAT_EBF16 architectural feature.
This feature only does one thing: it adds a new bit FPCR.EBF to the
floating point control register, so that the guest can enable a
slightly different set of semantics for the bfloat16 dot-product
instructions (BFDOT, BFMMLA, BFMOPA, BFMOPS; also BFVDOT when we
eventually implement SME2). When the bit is set:
 * they honour FPCR.RMode to set the rounding mode
 * they honour the FPCR bits controlling flushing of denormals
 * they can generate default NaN and infinity as intermediate
   sum-of-products
 * the intermediate rounding handling changes

In the Arm ARM these changes only affect the pseudocode BFDotAdd
function, which in QEMU we implement in bfdotadd().
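
For readers not familiar with the data layout, here is a rough,
self-contained sketch of what a dot-add over packed bfloat16 pairs
looks like. It is not the QEMU code: the names (bits_to_float,
bf16_pair_lo, bf16_pair_hi, toy_bfdotadd) are made up for illustration,
it assumes the host float is IEEE binary32, and it uses plain host
arithmetic, so it models none of the rounding, denormal or NaN
behaviour that the EBF bit controls.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE binary32 host). */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/*
 * A bfloat16 value is the top 16 bits of a float32, so widening it is
 * just a matter of placing it in the high half of a 32-bit word.
 */
static float bf16_pair_lo(uint32_t pair)
{
    return bits_to_float(pair << 16);
}

static float bf16_pair_hi(uint32_t pair)
{
    return bits_to_float(pair & 0xffff0000u);
}

/*
 * Hypothetical stand-in for a dot-add step: multiply the two low
 * elements and the two high elements, add the products together, then
 * add the result to the running sum.
 */
static float toy_bfdotadd(float sum, uint32_t e1, uint32_t e2)
{
    float lo = bf16_pair_lo(e1) * bf16_pair_lo(e2);
    float hi = bf16_pair_hi(e1) * bf16_pair_hi(e2);
    return sum + (lo + hi);
}
```

The point of the sketch is only the shape of the operation: two
products and two additions per 32-bit lane, which is where the
intermediate rounding questions come from.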

A lot of this series is plumbing -- we need the CPU env pointer
now in the helper functions which call bfdotadd(), so we need
to pass it through from the generated code. Once we have it,
we can refactor the callsites in a manner suggested by RTH,
so that we have bfdotadd() specialized for EBF=0 and bfdotadd_ebf()
specialized for EBF=1. This lets us hoist the setup out of the
inner loop:
   float_status fpst, fpst_odd;
   if (is_ebf(env, &fpst, &fpst_odd)) {
       for (...) {
           x = bfdotadd_ebf(..., &fpst, &fpst_odd);
       }
   } else {
       for (...) {
           x = bfdotadd(..., &fpst);
       }
   }
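
A compilable toy version of that pattern, as a sketch only: every
name here (ToyEnv, ToyStatus, toy_is_ebf, toy_dotadd, toy_dotadd_ebf,
toy_gvec_dot) is hypothetical, the arithmetic is a trivial integer
stand-in, and the boolean return value exists purely so the chosen
path can be observed. The only detail taken from the series is that
EBF is bit 13 of the FPCR.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FPCR_EBF (1u << 13)

typedef struct ToyEnv { uint32_t fpcr; } ToyEnv;
typedef struct ToyStatus { int rounding_mode; } ToyStatus;

/* Do the per-call setup once and report which variant to use. */
static bool toy_is_ebf(const ToyEnv *env, ToyStatus *st, ToyStatus *st_odd)
{
    st->rounding_mode = 0;
    st_odd->rounding_mode = 1;
    return (env->fpcr & TOY_FPCR_EBF) != 0;
}

/* Stand-in for the EBF=0 variant: needs only one status. */
static int32_t toy_dotadd(int32_t sum, int32_t a, int32_t b, ToyStatus *st)
{
    (void)st;
    return sum + a * b;
}

/* Stand-in for the EBF=1 variant: takes the extra status as well. */
static int32_t toy_dotadd_ebf(int32_t sum, int32_t a, int32_t b,
                              ToyStatus *st, ToyStatus *st_odd)
{
    (void)st;
    (void)st_odd;
    return sum + a * b;
}

/* The mode test is hoisted: it runs once per call, not once per element. */
static bool toy_gvec_dot(const ToyEnv *env, int32_t *d,
                         const int32_t *n, const int32_t *m, size_t len)
{
    ToyStatus st, st_odd;
    size_t i;

    if (toy_is_ebf(env, &st, &st_odd)) {
        for (i = 0; i < len; i++) {
            d[i] = toy_dotadd_ebf(d[i], n[i], m[i], &st, &st_odd);
        }
        return true;
    }
    for (i = 0; i < len; i++) {
        d[i] = toy_dotadd(d[i], n[i], m[i], &st);
    }
    return false;
}
```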

The implementation itself requires a fused paired-multiply-and-add;
we use the same trick we already have in f16_dotadd() to implement
this.

Not intended for 9.1, obviously, but I figured since I'd written
and tested it I might as well send it out to the list.

Based-on: <20240730155819.2958924-1-peter.maydell@linaro.org>
("target/arm: Handle denormals correctly for FMOPA (widening)")
both for textual reasons and because that patch introduces the
do_outprod_env() utility function we use here.

thanks
-- PMM

Peter Maydell (8):
  target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16
  target/arm: Pass env pointer through to sme_bfmopa helper
  target/arm: Pass env pointer through to gvec_bfdot helper
  target/arm: Pass env pointer through to gvec_bfdot_idx helper
  target/arm: Pass env pointer through to gvec_bfmmla helper
  target/arm: Prepare bfdotadd() callers for FEAT_EBF support
  target/arm: Implement FPCR.EBF=1 semantics for bfdotadd()
  target/arm: Enable FEAT_EBF16 in the "max" CPU

 docs/system/arm/emulation.rst   |   1 +
 target/arm/cpu-features.h       |   5 +
 target/arm/cpu.h                |   1 +
 target/arm/helper.h             |  12 +-
 target/arm/tcg/helper-sme.h     |   4 +-
 target/arm/tcg/vec_internal.h   |  37 +++++-
 target/arm/tcg/cpu64.c          |   4 +-
 target/arm/tcg/sme_helper.c     |  78 ++++++++----
 target/arm/tcg/translate-a64.c  |  40 ++++++-
 target/arm/tcg/translate-neon.c |  43 ++++++-
 target/arm/tcg/translate-sme.c  |   3 +-
 target/arm/tcg/translate-sve.c  |  25 +++-
 target/arm/tcg/vec_helper.c     | 202 +++++++++++++++++++++++++-------
 target/arm/vfp_helper.c         |   8 +-
 14 files changed, 371 insertions(+), 92 deletions(-)

-- 
2.34.1


* [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
@ 2024-07-30 16:02 ` Peter Maydell
  2024-07-31  1:30   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 2/8] target/arm: Pass env pointer through to sme_bfmopa helper Peter Maydell
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:02 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

FEAT_EBF16 adds one new bit to the FPCR floating point control
register.  Allow this bit to be read and written when the ID
registers indicate the presence of the feature.

Note that because this new bit is not in FPSCR_FPCR_MASK the bit is
not visible in the AArch32 FPSCR, and FPSCR writes do not affect it.
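
As a standalone illustration of the gating described above (a sketch,
not the code in this patch): the sk_* names are hypothetical, and the
position of the BF16 field at bits [47:44] of ID_AA64ISAR1 is stated
here as an assumption from my reading of the Arm ARM rather than
something this patch spells out. The FPCR_EBF bit position and the
"field > 1" test match the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_FPCR_EBF         (1u << 13)  /* Extended BFloat16 behaviors */
#define SK_ISAR1_BF16_SHIFT 44          /* assumed field position */
#define SK_ISAR1_BF16_MASK  0xfull

/* Extract the 4-bit BF16 field from an ID_AA64ISAR1 value. */
static unsigned sk_bf16_field(uint64_t id_aa64isar1)
{
    return (unsigned)((id_aa64isar1 >> SK_ISAR1_BF16_SHIFT)
                      & SK_ISAR1_BF16_MASK);
}

/* Any nonzero value means the base bfloat16 instructions exist. */
static bool sk_have_bf16(uint64_t id_aa64isar1)
{
    return sk_bf16_field(id_aa64isar1) != 0;
}

/* A value above 1 additionally indicates FEAT_EBF16. */
static bool sk_have_ebf16(uint64_t id_aa64isar1)
{
    return sk_bf16_field(id_aa64isar1) > 1;
}

/* Drop the EBF bit from a value being written if the feature is absent. */
static uint32_t sk_filter_fpcr_write(uint64_t id_aa64isar1, uint32_t val)
{
    if (!sk_have_ebf16(id_aa64isar1)) {
        val &= ~SK_FPCR_EBF;
    }
    return val;
}
```

With the field set to 1 a guest write of the EBF bit is silently
dropped; with the field set to 2 it sticks.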

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/cpu-features.h | 5 +++++
 target/arm/cpu.h          | 1 +
 target/arm/vfp_helper.c   | 8 ++++++--
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/target/arm/cpu-features.h b/target/arm/cpu-features.h
index c59ca104fe1..cfb82c23cad 100644
--- a/target/arm/cpu-features.h
+++ b/target/arm/cpu-features.h
@@ -556,6 +556,11 @@ static inline bool isar_feature_aa64_bf16(const ARMISARegisters *id)
     return FIELD_EX64(id->id_aa64isar1, ID_AA64ISAR1, BF16) != 0;
 }
 
+static inline bool isar_feature_aa64_ebf16(const ARMISARegisters *id)
+{
+    return FIELD_EX64(id->id_aa64isar1, ID_AA64ISAR1, BF16) > 1;
+}
+
 static inline bool isar_feature_aa64_rcpc_8_3(const ARMISARegisters *id)
 {
     return FIELD_EX64(id->id_aa64isar1, ID_AA64ISAR1, LRCPC) != 0;
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index a12859fc533..34df9d7e39b 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1707,6 +1707,7 @@ void vfp_set_fpscr(CPUARMState *env, uint32_t val);
 #define FPCR_OFE    (1 << 10)   /* Overflow exception trap enable */
 #define FPCR_UFE    (1 << 11)   /* Underflow exception trap enable */
 #define FPCR_IXE    (1 << 12)   /* Inexact exception trap enable */
+#define FPCR_EBF    (1 << 13)   /* Extended BFloat16 behaviors */
 #define FPCR_IDE    (1 << 15)   /* Input Denormal exception trap enable */
 #define FPCR_LEN_MASK (7 << 16) /* LEN, A-profile only */
 #define FPCR_FZ16   (1 << 19)   /* ARMv8.2+, FP16 flush-to-zero */
diff --git a/target/arm/vfp_helper.c b/target/arm/vfp_helper.c
index b3698da8ca7..203d37303bd 100644
--- a/target/arm/vfp_helper.c
+++ b/target/arm/vfp_helper.c
@@ -254,6 +254,10 @@ static void vfp_set_fpcr_masked(CPUARMState *env, uint32_t val, uint32_t mask)
         val &= ~FPCR_FZ16;
     }
 
+    if (!cpu_isar_feature(aa64_ebf16, cpu)) {
+        val &= ~FPCR_EBF;
+    }
+
     vfp_set_fpcr_to_host(env, val, mask);
 
     if (mask & (FPCR_LEN_MASK | FPCR_STRIDE_MASK)) {
@@ -278,12 +282,12 @@ static void vfp_set_fpcr_masked(CPUARMState *env, uint32_t val, uint32_t mask)
      * We don't implement trapped exception handling, so the
      * trap enable bits, IDE|IXE|UFE|OFE|DZE|IOE are all RAZ/WI (not RES0!)
      *
-     * The FPCR bits we keep in vfp.fpcr are AHP, DN, FZ, RMode
+     * The FPCR bits we keep in vfp.fpcr are AHP, DN, FZ, RMode, EBF
      * and FZ16. Len, Stride and LTPSIZE we just handled. Store those bits
      * there, and zero any of the other FPCR bits and the RES0 and RAZ/WI
      * bits.
      */
-    val &= FPCR_AHP | FPCR_DN | FPCR_FZ | FPCR_RMODE_MASK | FPCR_FZ16;
+    val &= FPCR_AHP | FPCR_DN | FPCR_FZ | FPCR_RMODE_MASK | FPCR_FZ16 | FPCR_EBF;
     env->vfp.fpcr &= ~mask;
     env->vfp.fpcr |= val;
 }
-- 
2.34.1




* Re: [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16
  2024-07-30 16:02 ` [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16 Peter Maydell
@ 2024-07-31  1:30   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:30 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:02, Peter Maydell wrote:
> FEAT_EBF16 adds one new bit to the FPCR floating point control
> register.  Allow this bit to be read and written when the ID
> registers indicate the presence of the feature.
> 
> Note that because this new bit is not in FPSCR_FPCR_MASK the bit is
> not visible in the AArch32 FPSCR, and FPSCR writes do not affect it.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/cpu-features.h | 5 +++++
>   target/arm/cpu.h          | 1 +
>   target/arm/vfp_helper.c   | 8 ++++++--
>   3 files changed, 12 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* [PATCH 2/8] target/arm: Pass env pointer through to sme_bfmopa helper
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
  2024-07-30 16:02 ` [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16 Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:32   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper Peter Maydell
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

To implement the FEAT_EBF16 semantics, we are going to need
the CPUARMState env pointer in every helper function which calls
bfdotadd().

Pass the env pointer through from generated code to the sme_bfmopa
helper. (We'll add the code that uses it when we've adjusted
all the helpers to have access to the env pointer.)

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/tcg/helper-sme.h    | 4 ++--
 target/arm/tcg/sme_helper.c    | 4 ++--
 target/arm/tcg/translate-sme.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 659867a1faf..f12d903aa44 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -126,8 +126,8 @@ DEF_HELPER_FLAGS_7(sme_fmopa_s, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_7(sme_fmopa_d, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_6(sme_bfmopa, TCG_CALL_NO_RWG,
-                   void, ptr, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_7(sme_bfmopa, TCG_CALL_NO_RWG,
+                   void, env, ptr, ptr, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_6(sme_smopa_s, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_6(sme_umopa_s, TCG_CALL_NO_RWG,
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 2af2b957cb6..f172225b2f2 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1080,8 +1080,8 @@ void HELPER(sme_fmopa_h)(CPUARMState *env,
     }
 }
 
-void HELPER(sme_bfmopa)(void *vza, void *vzn, void *vzm, void *vpn,
-                        void *vpm, uint32_t desc)
+void HELPER(sme_bfmopa)(CPUARMState *env, void *vza, void *vzn, void *vzm,
+                        void *vpn, void *vpm, uint32_t desc)
 {
     intptr_t row, col, oprsz = simd_maxsz(desc);
     uint32_t neg = simd_data(desc) * 0x80008000u;
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 8e9332f1898..bcb502feb05 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -355,7 +355,7 @@ TRANS_FEAT(FMOPA_d, aa64_sme_f64f64, do_outprod_fpst, a,
            MO_64, FPST_FPCR, gen_helper_sme_fmopa_d)
 
 /* TODO: FEAT_EBF16 */
-TRANS_FEAT(BFMOPA, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_bfmopa)
+TRANS_FEAT(BFMOPA, aa64_sme, do_outprod_env, a, MO_32, gen_helper_sme_bfmopa)
 
 TRANS_FEAT(SMOPA_s, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_smopa_s)
 TRANS_FEAT(UMOPA_s, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_umopa_s)
-- 
2.34.1




* Re: [PATCH 2/8] target/arm: Pass env pointer through to sme_bfmopa helper
  2024-07-30 16:03 ` [PATCH 2/8] target/arm: Pass env pointer through to sme_bfmopa helper Peter Maydell
@ 2024-07-31  1:32   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:32 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> To implement the FEAT_EBF16 semantics, we are going to need
> the CPUARMState env pointer in every helper function which calls
> bfdotadd().
> 
> Pass the env pointer through from generated code to the sme_bfmopa
> helper. (We'll add the code that uses it when we've adjusted
> all the helpers to have access to the env pointer.)
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/tcg/helper-sme.h    | 4 ++--
>   target/arm/tcg/sme_helper.c    | 4 ++--
>   target/arm/tcg/translate-sme.c | 2 +-
>   3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
> index 659867a1faf..f12d903aa44 100644
> --- a/target/arm/tcg/helper-sme.h
> +++ b/target/arm/tcg/helper-sme.h
> @@ -126,8 +126,8 @@ DEF_HELPER_FLAGS_7(sme_fmopa_s, TCG_CALL_NO_RWG,
>                      void, ptr, ptr, ptr, ptr, ptr, ptr, i32)
>   DEF_HELPER_FLAGS_7(sme_fmopa_d, TCG_CALL_NO_RWG,
>                      void, ptr, ptr, ptr, ptr, ptr, ptr, i32)
> -DEF_HELPER_FLAGS_6(sme_bfmopa, TCG_CALL_NO_RWG,
> -                   void, ptr, ptr, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_7(sme_bfmopa, TCG_CALL_NO_RWG,
> +                   void, env, ptr, ptr, ptr, ptr, ptr, i32)
>   DEF_HELPER_FLAGS_6(sme_smopa_s, TCG_CALL_NO_RWG,
>                      void, ptr, ptr, ptr, ptr, ptr, i32)
>   DEF_HELPER_FLAGS_6(sme_umopa_s, TCG_CALL_NO_RWG,
> diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
> index 2af2b957cb6..f172225b2f2 100644
> --- a/target/arm/tcg/sme_helper.c
> +++ b/target/arm/tcg/sme_helper.c
> @@ -1080,8 +1080,8 @@ void HELPER(sme_fmopa_h)(CPUARMState *env,
>       }
>   }
>   
> -void HELPER(sme_bfmopa)(void *vza, void *vzn, void *vzm, void *vpn,
> -                        void *vpm, uint32_t desc)
> +void HELPER(sme_bfmopa)(CPUARMState *env, void *vza, void *vzn, void *vzm,
> +                        void *vpn, void *vpm, uint32_t desc)

Per this morning's review of do_outprod_env, I think env should be penultimate, as for 
other gen_helper_gvec_5_ptr functions.

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

>   {
>       intptr_t row, col, oprsz = simd_maxsz(desc);
>       uint32_t neg = simd_data(desc) * 0x80008000u;
> diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
> index 8e9332f1898..bcb502feb05 100644
> --- a/target/arm/tcg/translate-sme.c
> +++ b/target/arm/tcg/translate-sme.c
> @@ -355,7 +355,7 @@ TRANS_FEAT(FMOPA_d, aa64_sme_f64f64, do_outprod_fpst, a,
>              MO_64, FPST_FPCR, gen_helper_sme_fmopa_d)
>   
>   /* TODO: FEAT_EBF16 */
> -TRANS_FEAT(BFMOPA, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_bfmopa)
> +TRANS_FEAT(BFMOPA, aa64_sme, do_outprod_env, a, MO_32, gen_helper_sme_bfmopa)
>   
>   TRANS_FEAT(SMOPA_s, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_smopa_s)
>   TRANS_FEAT(UMOPA_s, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_umopa_s)




* [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
  2024-07-30 16:02 ` [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16 Peter Maydell
  2024-07-30 16:03 ` [PATCH 2/8] target/arm: Pass env pointer through to sme_bfmopa helper Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:36   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 4/8] target/arm: Pass env pointer through to gvec_bfdot_idx helper Peter Maydell
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Pass the env pointer through to the gvec_bfdot helper,
so we can use it to add support for FEAT_EBF16.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/helper.h             |  4 ++--
 target/arm/tcg/translate-a64.c  | 27 ++++++++++++++++++++++++-
 target/arm/tcg/translate-neon.c | 35 +++++++++++++++++++++++++++++++--
 target/arm/tcg/translate-sve.c  | 15 +++++++++++++-
 target/arm/tcg/vec_helper.c     |  3 ++-
 5 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index 970d059dec5..aece9fd4aa7 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1027,8 +1027,8 @@ DEF_HELPER_FLAGS_5(gvec_ummla_b, TCG_CALL_NO_RWG,
 DEF_HELPER_FLAGS_5(gvec_usmmla_b, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, i32)
 
-DEF_HELPER_FLAGS_5(gvec_bfdot, TCG_CALL_NO_RWG,
-                   void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(gvec_bfdot, TCG_CALL_NO_RWG,
+                   void, ptr, ptr, ptr, ptr, ptr, i32)
 DEF_HELPER_FLAGS_5(gvec_bfdot_idx, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, i32)
 
diff --git a/target/arm/tcg/translate-a64.c b/target/arm/tcg/translate-a64.c
index 148be2826ec..4aef8b9211a 100644
--- a/target/arm/tcg/translate-a64.c
+++ b/target/arm/tcg/translate-a64.c
@@ -735,6 +735,22 @@ static void gen_gvec_op4_ool(DisasContext *s, bool is_q, int rd, int rn,
                        is_q ? 16 : 8, vec_full_reg_size(s), data, fn);
 }
 
+/*
+ * Expand a 4-operand operation using an out-of-line helper that takes
+ * a pointer to the CPU env.
+ */
+static void gen_gvec_op4_env(DisasContext *s, bool is_q, int rd, int rn,
+                             int rm, int ra, int data,
+                             gen_helper_gvec_4_ptr *fn)
+{
+    tcg_gen_gvec_4_ptr(vec_full_reg_offset(s, rd),
+                       vec_full_reg_offset(s, rn),
+                       vec_full_reg_offset(s, rm),
+                       vec_full_reg_offset(s, ra),
+                       tcg_env,
+                       is_q ? 16 : 8, vec_full_reg_size(s), data, fn);
+}
+
 /*
  * Expand a 4-operand + fpstatus pointer + simd data value operation using
  * an out-of-line helper.
@@ -5601,10 +5617,19 @@ static bool do_dot_vector(DisasContext *s, arg_qrrr_e *a,
     return true;
 }
 
+static bool do_dot_vector_env(DisasContext *s, arg_qrrr_e *a,
+                              gen_helper_gvec_4_ptr *fn)
+{
+    if (fp_access_check(s)) {
+        gen_gvec_op4_env(s, a->q, a->rd, a->rn, a->rm, a->rd, 0, fn);
+    }
+    return true;
+}
+
 TRANS_FEAT(SDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_sdot_b)
 TRANS_FEAT(UDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_udot_b)
 TRANS_FEAT(USDOT_v, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_usdot_b)
-TRANS_FEAT(BFDOT_v, aa64_bf16, do_dot_vector, a, gen_helper_gvec_bfdot)
+TRANS_FEAT(BFDOT_v, aa64_bf16, do_dot_vector_env, a, gen_helper_gvec_bfdot)
 TRANS_FEAT(BFMMLA, aa64_bf16, do_dot_vector, a, gen_helper_gvec_bfmmla)
 TRANS_FEAT(SMMLA, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_smmla_b)
 TRANS_FEAT(UMMLA, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_ummla_b)
diff --git a/target/arm/tcg/translate-neon.c b/target/arm/tcg/translate-neon.c
index 915c9e56db5..454380f01d7 100644
--- a/target/arm/tcg/translate-neon.c
+++ b/target/arm/tcg/translate-neon.c
@@ -148,6 +148,37 @@ static bool do_neon_ddda(DisasContext *s, int q, int vd, int vn, int vm,
     return true;
 }
 
+static bool do_neon_ddda_env(DisasContext *s, int q, int vd, int vn, int vm,
+                             int data, gen_helper_gvec_4_ptr *fn_gvec)
+{
+    /* UNDEF accesses to D16-D31 if they don't exist. */
+    if (((vd | vn | vm) & 0x10) && !dc_isar_feature(aa32_simd_r32, s)) {
+        return false;
+    }
+
+    /*
+     * UNDEF accesses to odd registers for each bit of Q.
+     * Q will be 0b111 for all Q-reg instructions, otherwise
+     * when we have mixed Q- and D-reg inputs.
+     */
+    if (((vd & 1) * 4 | (vn & 1) * 2 | (vm & 1)) & q) {
+        return false;
+    }
+
+    if (!vfp_access_check(s)) {
+        return true;
+    }
+
+    int opr_sz = q ? 16 : 8;
+    tcg_gen_gvec_4_ptr(vfp_reg_offset(1, vd),
+                       vfp_reg_offset(1, vn),
+                       vfp_reg_offset(1, vm),
+                       vfp_reg_offset(1, vd),
+                       tcg_env,
+                       opr_sz, opr_sz, data, fn_gvec);
+    return true;
+}
+
 static bool do_neon_ddda_fpst(DisasContext *s, int q, int vd, int vn, int vm,
                               int data, ARMFPStatusFlavour fp_flavour,
                               gen_helper_gvec_4_ptr *fn_gvec_ptr)
@@ -266,8 +297,8 @@ static bool trans_VDOT_b16(DisasContext *s, arg_VDOT_b16 *a)
     if (!dc_isar_feature(aa32_bf16, s)) {
         return false;
     }
-    return do_neon_ddda(s, a->q * 7, a->vd, a->vn, a->vm, 0,
-                        gen_helper_gvec_bfdot);
+    return do_neon_ddda_env(s, a->q * 7, a->vd, a->vn, a->vm, 0,
+                            gen_helper_gvec_bfdot);
 }
 
 static bool trans_VFML(DisasContext *s, arg_VFML *a)
diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c
index 798ab2bfb13..4fb0bd077b4 100644
--- a/target/arm/tcg/translate-sve.c
+++ b/target/arm/tcg/translate-sve.c
@@ -238,6 +238,19 @@ static bool gen_gvec_fpst_zzzz(DisasContext *s, gen_helper_gvec_4_ptr *fn,
     return ret;
 }
 
+static bool gen_gvec_env_zzzz(DisasContext *s, gen_helper_gvec_4_ptr *fn,
+                              int rd, int rn, int rm, int ra,
+                              int data)
+{
+    return gen_gvec_ptr_zzzz(s, fn, rd, rn, rm, ra, data, tcg_env);
+}
+
+static bool gen_gvec_env_arg_zzzz(DisasContext *s, gen_helper_gvec_4_ptr *fn,
+                                  arg_rrrr_esz *a, int data)
+{
+    return gen_gvec_env_zzzz(s, fn, a->rd, a->rn, a->rm, a->ra, data);
+}
+
 /* Invoke an out-of-line helper on 4 Zregs, 1 Preg, plus fpst. */
 static bool gen_gvec_fpst_zzzzp(DisasContext *s, gen_helper_gvec_5_ptr *fn,
                                 int rd, int rn, int rm, int ra, int pg,
@@ -7099,7 +7112,7 @@ TRANS_FEAT_NONSTREAMING(USMMLA, aa64_sve_i8mm, gen_gvec_ool_arg_zzzz,
 TRANS_FEAT_NONSTREAMING(UMMLA, aa64_sve_i8mm, gen_gvec_ool_arg_zzzz,
                         gen_helper_gvec_ummla_b, a, 0)
 
-TRANS_FEAT(BFDOT_zzzz, aa64_sve_bf16, gen_gvec_ool_arg_zzzz,
+TRANS_FEAT(BFDOT_zzzz, aa64_sve_bf16, gen_gvec_env_arg_zzzz,
            gen_helper_gvec_bfdot, a, 0)
 TRANS_FEAT(BFDOT_zzxz, aa64_sve_bf16, gen_gvec_ool_arg_zzxz,
            gen_helper_gvec_bfdot_idx, a)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 98604d170fd..37aad4be4b0 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -2814,7 +2814,8 @@ float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
     return t1;
 }
 
-void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
+void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
+                        void *envp, uint32_t desc)
 {
     intptr_t i, opr_sz = simd_oprsz(desc);
     float32 *d = vd, *a = va;
-- 
2.34.1




* Re: [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper
  2024-07-30 16:03 ` [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper Peter Maydell
@ 2024-07-31  1:36   ` Richard Henderson
  2024-07-31 12:31     ` Peter Maydell
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:36 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> Pass the env pointer through to the gvec_bfdot helper,
> so we can use it to add support for FEAT_EBF16.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/helper.h             |  4 ++--
>   target/arm/tcg/translate-a64.c  | 27 ++++++++++++++++++++++++-
>   target/arm/tcg/translate-neon.c | 35 +++++++++++++++++++++++++++++++--
>   target/arm/tcg/translate-sve.c  | 15 +++++++++++++-
>   target/arm/tcg/vec_helper.c     |  3 ++-
>   5 files changed, 77 insertions(+), 7 deletions(-)
> 
> diff --git a/target/arm/helper.h b/target/arm/helper.h
> index 970d059dec5..aece9fd4aa7 100644
> --- a/target/arm/helper.h
> +++ b/target/arm/helper.h
> @@ -1027,8 +1027,8 @@ DEF_HELPER_FLAGS_5(gvec_ummla_b, TCG_CALL_NO_RWG,
>   DEF_HELPER_FLAGS_5(gvec_usmmla_b, TCG_CALL_NO_RWG,
>                      void, ptr, ptr, ptr, ptr, i32)
>   
> -DEF_HELPER_FLAGS_5(gvec_bfdot, TCG_CALL_NO_RWG,
> -                   void, ptr, ptr, ptr, ptr, i32)
> +DEF_HELPER_FLAGS_6(gvec_bfdot, TCG_CALL_NO_RWG,
> +                   void, ptr, ptr, ptr, ptr, ptr, i32)

Because env expands to TCGv_ptr in the translation context, I suspect that you can use 
that here.  Worth a try, anyway, so that

> -void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
> +void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
> +                        void *envp, uint32_t desc)

this doesn't have to use void *.

Either way,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* Re: [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper
  2024-07-31  1:36   ` Richard Henderson
@ 2024-07-31 12:31     ` Peter Maydell
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Maydell @ 2024-07-31 12:31 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, qemu-devel

On Wed, 31 Jul 2024 at 02:36, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 7/31/24 02:03, Peter Maydell wrote:
> > Pass the env pointer through to the gvec_bfdot helper,
> > so we can use it to add support for FEAT_EBF16.
> >
> > Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> > ---
> >   target/arm/helper.h             |  4 ++--
> >   target/arm/tcg/translate-a64.c  | 27 ++++++++++++++++++++++++-
> >   target/arm/tcg/translate-neon.c | 35 +++++++++++++++++++++++++++++++--
> >   target/arm/tcg/translate-sve.c  | 15 +++++++++++++-
> >   target/arm/tcg/vec_helper.c     |  3 ++-
> >   5 files changed, 77 insertions(+), 7 deletions(-)
> >
> > diff --git a/target/arm/helper.h b/target/arm/helper.h
> > index 970d059dec5..aece9fd4aa7 100644
> > --- a/target/arm/helper.h
> > +++ b/target/arm/helper.h
> > @@ -1027,8 +1027,8 @@ DEF_HELPER_FLAGS_5(gvec_ummla_b, TCG_CALL_NO_RWG,
> >   DEF_HELPER_FLAGS_5(gvec_usmmla_b, TCG_CALL_NO_RWG,
> >                      void, ptr, ptr, ptr, ptr, i32)
> >
> > -DEF_HELPER_FLAGS_5(gvec_bfdot, TCG_CALL_NO_RWG,
> > -                   void, ptr, ptr, ptr, ptr, i32)
> > +DEF_HELPER_FLAGS_6(gvec_bfdot, TCG_CALL_NO_RWG,
> > +                   void, ptr, ptr, ptr, ptr, ptr, i32)
>
> Because env expands to TCGv_ptr in the translation context, I suspect that you can use
> that here.  Worth a try, anyway, so that
>
> > -void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
> > +void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
> > +                        void *envp, uint32_t desc)
>
> this doesn't have to use void *.

I thought I'd tried that, but obviously I didn't hit on the
right combination of types in the prototype/definition.
This does work, so I've changed the patchset to use it.

thanks
-- PMM



* [PATCH 4/8] target/arm: Pass env pointer through to gvec_bfdot_idx helper
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
                   ` (2 preceding siblings ...)
  2024-07-30 16:03 ` [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:37   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 5/8] target/arm: Pass env pointer through to gvec_bfmmla helper Peter Maydell
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Pass the env pointer through to the gvec_bfdot_idx helper,
so we can use it to add support for FEAT_EBF16.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/helper.h             |  4 ++--
 target/arm/tcg/translate-a64.c  | 11 ++++++++++-
 target/arm/tcg/translate-neon.c |  4 ++--
 target/arm/tcg/translate-sve.c  |  8 +++++++-
 target/arm/tcg/vec_helper.c     |  2 +-
 5 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index aece9fd4aa7..386cf8686ea 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1029,8 +1029,8 @@ DEF_HELPER_FLAGS_5(gvec_usmmla_b, TCG_CALL_NO_RWG,
 
 DEF_HELPER_FLAGS_6(gvec_bfdot, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_bfdot_idx, TCG_CALL_NO_RWG,
-                   void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(gvec_bfdot_idx, TCG_CALL_NO_RWG,
+                   void, ptr, ptr, ptr, ptr, ptr, i32)
 
 DEF_HELPER_FLAGS_5(gvec_bfmmla, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/translate-a64.c b/target/arm/tcg/translate-a64.c
index 4aef8b9211a..a4e9740c921 100644
--- a/target/arm/tcg/translate-a64.c
+++ b/target/arm/tcg/translate-a64.c
@@ -6403,13 +6403,22 @@ static bool do_dot_vector_idx(DisasContext *s, arg_qrrx_e *a,
     return true;
 }
 
+static bool do_dot_vector_idx_env(DisasContext *s, arg_qrrx_e *a,
+                                  gen_helper_gvec_4_ptr *fn)
+{
+    if (fp_access_check(s)) {
+        gen_gvec_op4_env(s, a->q, a->rd, a->rn, a->rm, a->rd, a->idx, fn);
+    }
+    return true;
+}
+
 TRANS_FEAT(SDOT_vi, aa64_dp, do_dot_vector_idx, a, gen_helper_gvec_sdot_idx_b)
 TRANS_FEAT(UDOT_vi, aa64_dp, do_dot_vector_idx, a, gen_helper_gvec_udot_idx_b)
 TRANS_FEAT(SUDOT_vi, aa64_i8mm, do_dot_vector_idx, a,
            gen_helper_gvec_sudot_idx_b)
 TRANS_FEAT(USDOT_vi, aa64_i8mm, do_dot_vector_idx, a,
            gen_helper_gvec_usdot_idx_b)
-TRANS_FEAT(BFDOT_vi, aa64_bf16, do_dot_vector_idx, a,
+TRANS_FEAT(BFDOT_vi, aa64_bf16, do_dot_vector_idx_env, a,
            gen_helper_gvec_bfdot_idx)
 
 static bool trans_BFMLAL_vi(DisasContext *s, arg_qrrx_e *a)
diff --git a/target/arm/tcg/translate-neon.c b/target/arm/tcg/translate-neon.c
index 454380f01d7..7de157c539c 100644
--- a/target/arm/tcg/translate-neon.c
+++ b/target/arm/tcg/translate-neon.c
@@ -391,8 +391,8 @@ static bool trans_VDOT_b16_scal(DisasContext *s, arg_VDOT_b16_scal *a)
     if (!dc_isar_feature(aa32_bf16, s)) {
         return false;
     }
-    return do_neon_ddda(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
-                        gen_helper_gvec_bfdot_idx);
+    return do_neon_ddda_env(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
+                            gen_helper_gvec_bfdot_idx);
 }
 
 static bool trans_VFML_scalar(DisasContext *s, arg_VFML_scalar *a)
diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c
index 4fb0bd077b4..8876d1f91a9 100644
--- a/target/arm/tcg/translate-sve.c
+++ b/target/arm/tcg/translate-sve.c
@@ -251,6 +251,12 @@ static bool gen_gvec_env_arg_zzzz(DisasContext *s, gen_helper_gvec_4_ptr *fn,
     return gen_gvec_env_zzzz(s, fn, a->rd, a->rn, a->rm, a->ra, data);
 }
 
+static bool gen_gvec_env_arg_zzxz(DisasContext *s, gen_helper_gvec_4_ptr *fn,
+                                  arg_rrxr_esz *a)
+{
+    return gen_gvec_env_zzzz(s, fn, a->rd, a->rn, a->rm, a->ra, a->index);
+}
+
 /* Invoke an out-of-line helper on 4 Zregs, 1 Preg, plus fpst. */
 static bool gen_gvec_fpst_zzzzp(DisasContext *s, gen_helper_gvec_5_ptr *fn,
                                 int rd, int rn, int rm, int ra, int pg,
@@ -7114,7 +7120,7 @@ TRANS_FEAT_NONSTREAMING(UMMLA, aa64_sve_i8mm, gen_gvec_ool_arg_zzzz,
 
 TRANS_FEAT(BFDOT_zzzz, aa64_sve_bf16, gen_gvec_env_arg_zzzz,
            gen_helper_gvec_bfdot, a, 0)
-TRANS_FEAT(BFDOT_zzxz, aa64_sve_bf16, gen_gvec_ool_arg_zzxz,
+TRANS_FEAT(BFDOT_zzxz, aa64_sve_bf16, gen_gvec_env_arg_zzxz,
            gen_helper_gvec_bfdot_idx, a)
 
 TRANS_FEAT_NONSTREAMING(BFMMLA, aa64_sve_bf16, gen_gvec_ool_arg_zzzz,
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 37aad4be4b0..1edde9792f0 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -2828,7 +2828,7 @@ void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
 }
 
 void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
-                            void *va, uint32_t desc)
+                            void *va, void *envp, uint32_t desc)
 {
     intptr_t i, j, opr_sz = simd_oprsz(desc);
     intptr_t index = simd_data(desc);
-- 
2.34.1




* Re: [PATCH 4/8] target/arm: Pass env pointer through to gvec_bfdot_idx helper
  2024-07-30 16:03 ` [PATCH 4/8] target/arm: Pass env pointer through to gvec_bfdot_idx helper Peter Maydell
@ 2024-07-31  1:37   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:37 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> Pass the env pointer through to the gvec_bfdot_idx helper,
> so we can use it to add support for FEAT_EBF16.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/helper.h             |  4 ++--
>   target/arm/tcg/translate-a64.c  | 11 ++++++++++-
>   target/arm/tcg/translate-neon.c |  4 ++--
>   target/arm/tcg/translate-sve.c  |  8 +++++++-
>   target/arm/tcg/vec_helper.c     |  2 +-
>   5 files changed, 22 insertions(+), 7 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~



* [PATCH 5/8] target/arm: Pass env pointer through to gvec_bfmmla helper
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
                   ` (3 preceding siblings ...)
  2024-07-30 16:03 ` [PATCH 4/8] target/arm: Pass env pointer through to gvec_bfdot_idx helper Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:38   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support Peter Maydell
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Pass the env pointer through to the gvec_bfmmla helper,
so we can use it to add support for FEAT_EBF16.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/helper.h             | 4 ++--
 target/arm/tcg/translate-a64.c  | 2 +-
 target/arm/tcg/translate-neon.c | 4 ++--
 target/arm/tcg/translate-sve.c  | 2 +-
 target/arm/tcg/vec_helper.c     | 3 ++-
 5 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/target/arm/helper.h b/target/arm/helper.h
index 386cf8686ea..93b830d2cce 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1032,8 +1032,8 @@ DEF_HELPER_FLAGS_6(gvec_bfdot, TCG_CALL_NO_RWG,
 DEF_HELPER_FLAGS_6(gvec_bfdot_idx, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, ptr, i32)
 
-DEF_HELPER_FLAGS_5(gvec_bfmmla, TCG_CALL_NO_RWG,
-                   void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(gvec_bfmmla, TCG_CALL_NO_RWG,
+                   void, ptr, ptr, ptr, ptr, ptr, i32)
 
 DEF_HELPER_FLAGS_6(gvec_bfmlal, TCG_CALL_NO_RWG,
                    void, ptr, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/translate-a64.c b/target/arm/tcg/translate-a64.c
index a4e9740c921..33d49f524f4 100644
--- a/target/arm/tcg/translate-a64.c
+++ b/target/arm/tcg/translate-a64.c
@@ -5630,7 +5630,7 @@ TRANS_FEAT(SDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_sdot_b)
 TRANS_FEAT(UDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_udot_b)
 TRANS_FEAT(USDOT_v, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_usdot_b)
 TRANS_FEAT(BFDOT_v, aa64_bf16, do_dot_vector_env, a, gen_helper_gvec_bfdot)
-TRANS_FEAT(BFMMLA, aa64_bf16, do_dot_vector, a, gen_helper_gvec_bfmmla)
+TRANS_FEAT(BFMMLA, aa64_bf16, do_dot_vector_env, a, gen_helper_gvec_bfmmla)
 TRANS_FEAT(SMMLA, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_smmla_b)
 TRANS_FEAT(UMMLA, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_ummla_b)
 TRANS_FEAT(USMMLA, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_usmmla_b)
diff --git a/target/arm/tcg/translate-neon.c b/target/arm/tcg/translate-neon.c
index 7de157c539c..13cd31aad42 100644
--- a/target/arm/tcg/translate-neon.c
+++ b/target/arm/tcg/translate-neon.c
@@ -3730,8 +3730,8 @@ static bool trans_VMMLA_b16(DisasContext *s, arg_VMMLA_b16 *a)
     if (!dc_isar_feature(aa32_bf16, s)) {
         return false;
     }
-    return do_neon_ddda(s, 7, a->vd, a->vn, a->vm, 0,
-                        gen_helper_gvec_bfmmla);
+    return do_neon_ddda_env(s, 7, a->vd, a->vn, a->vm, 0,
+                            gen_helper_gvec_bfmmla);
 }
 
 static bool trans_VFMA_b16(DisasContext *s, arg_VFMA_b16 *a)
diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c
index 8876d1f91a9..95e938662ed 100644
--- a/target/arm/tcg/translate-sve.c
+++ b/target/arm/tcg/translate-sve.c
@@ -7123,7 +7123,7 @@ TRANS_FEAT(BFDOT_zzzz, aa64_sve_bf16, gen_gvec_env_arg_zzzz,
 TRANS_FEAT(BFDOT_zzxz, aa64_sve_bf16, gen_gvec_env_arg_zzxz,
            gen_helper_gvec_bfdot_idx, a)
 
-TRANS_FEAT_NONSTREAMING(BFMMLA, aa64_sve_bf16, gen_gvec_ool_arg_zzzz,
+TRANS_FEAT_NONSTREAMING(BFMMLA, aa64_sve_bf16, gen_gvec_env_arg_zzzz,
                         gen_helper_gvec_bfmmla, a, 0)
 
 static bool do_BFMLAL_zzzw(DisasContext *s, arg_rrrr_esz *a, bool sel)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 1edde9792f0..77efb5f47d8 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -2847,7 +2847,8 @@ void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
     clear_tail(d, opr_sz, simd_maxsz(desc));
 }
 
-void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
+void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va,
+                         void *envp, uint32_t desc)
 {
     intptr_t s, opr_sz = simd_oprsz(desc);
     float32 *d = vd, *a = va;
-- 
2.34.1




* Re: [PATCH 5/8] target/arm: Pass env pointer through to gvec_bfmmla helper
  2024-07-30 16:03 ` [PATCH 5/8] target/arm: Pass env pointer through to gvec_bfmmla helper Peter Maydell
@ 2024-07-31  1:38   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:38 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> Pass the env pointer through to the gvec_bfmmla helper,
> so we can use it to add support for FEAT_EBF16.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/helper.h             | 4 ++--
>   target/arm/tcg/translate-a64.c  | 2 +-
>   target/arm/tcg/translate-neon.c | 4 ++--
>   target/arm/tcg/translate-sve.c  | 2 +-
>   target/arm/tcg/vec_helper.c     | 3 ++-
>   5 files changed, 8 insertions(+), 7 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
                   ` (4 preceding siblings ...)
  2024-07-30 16:03 ` [PATCH 5/8] target/arm: Pass env pointer through to gvec_bfmmla helper Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:43   ` Richard Henderson
  2024-07-31  1:48   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 7/8] target/arm: Implement FPCR.EBF=1 semantics for bfdotadd() Peter Maydell
  2024-07-30 16:03 ` [PATCH 8/8] target/arm: Enable FEAT_EBF16 in the "max" CPU Peter Maydell
  7 siblings, 2 replies; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

We use bfdotadd() in four callsites for various helper functions. Currently
this all assumes that we have the FPCR.EBF=0 semantics. For FPCR.EBF=1
we will need to:
 * call a different routine to bfdotadd() because we need to do a
   fused multiply-add rather than separate multiply and add steps
 * use a different float_status that honours the FPCR rounding mode
   and denormal-flushing fields
 * pass in an extra float_status that has been set up to perform
   round-to-odd rounding

To prepare for this, refactor all the callsites so that instead of
   for (...) {
       x = bfdotadd(...);
   }

they are:
   float_status fpst, fpst_odd;
   if (is_ebf(env, &fpst, &fpst_odd)) {
       for (...) {
           x = bfdotadd_ebf(..., &fpst, &fpst_odd);
       }
   } else {
       for (...) {
           x = bfdotadd(..., &fpst);
       }
   }

For the moment the is_ebf() function always returns false, sets up
fpst for EBF=0 semantics and never sets up fpst_odd; bfdotadd_ebf()
will assert if called. We'll fill in the handling for EBF=1 in the
next commit.

This change should be a zero-behaviour-change refactor.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/tcg/vec_internal.h |  37 ++++++++-
 target/arm/tcg/sme_helper.c   |  74 ++++++++++++------
 target/arm/tcg/vec_helper.c   | 141 +++++++++++++++++++++++++---------
 3 files changed, 192 insertions(+), 60 deletions(-)

diff --git a/target/arm/tcg/vec_internal.h b/target/arm/tcg/vec_internal.h
index 3ca1b94ccf9..094f5c169ca 100644
--- a/target/arm/tcg/vec_internal.h
+++ b/target/arm/tcg/vec_internal.h
@@ -223,13 +223,46 @@ int64_t do_sqrdmlah_d(int64_t, int64_t, int64_t, bool, bool);
  * bfdotadd:
  * @sum: addend
  * @e1, @e2: multiplicand vectors
+ * @fpst: floating-point status to use
  *
  * BFloat16 2-way dot product of @e1 & @e2, accumulating with @sum.
  * The @e1 and @e2 operands correspond to the 32-bit source vector
  * slots and contain two Bfloat16 values each.
  *
- * Corresponds to the ARM pseudocode function BFDotAdd.
+ * Corresponds to the ARM pseudocode function BFDotAdd, specialized
+ * for the FPCR.EBF == 0 case.
  */
-float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2);
+float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2, float_status *fpst);
+/**
+ * bfdotadd_ebf:
+ * @sum: addend
+ * @e1, @e2: multiplicand vectors
+ * @fpst: floating-point status to use
+ * @fpst_odd: floating-point status to use for round-to-odd operations
+ *
+ * BFloat16 2-way dot product of @e1 & @e2, accumulating with @sum.
+ * The @e1 and @e2 operands correspond to the 32-bit source vector
+ * slots and contain two Bfloat16 values each.
+ *
+ * Corresponds to the ARM pseudocode function BFDotAdd, specialized
+ * for the FPCR.EBF == 1 case.
+ */
+float32 bfdotadd_ebf(float32 sum, uint32_t e1, uint32_t e2,
+                     float_status *fpst, float_status *fpst_odd);
+
+/**
+ * is_ebf:
+ * @env: CPU state
+ * @statusp: pointer to floating point status to fill in
+ * @oddstatusp: pointer to floating point status to fill in for round-to-odd
+ *
+ * Determine whether a BFDotAdd operation should use FPCR.EBF = 0
+ * or FPCR.EBF = 1 semantics. On return, has initialized *statusp
+ * and *oddstatusp to suitable float_status arguments to use with either
+ * bfdotadd() or bfdotadd_ebf().
+ * Returns true for EBF = 1, false for EBF = 0. (The caller should use this
+ * to decide whether to call bfdotadd() or bfdotadd_ebf().)
+ */
+bool is_ebf(CPUARMState *env, float_status *statusp, float_status *oddstatusp);
 
 #endif /* TARGET_ARM_VEC_INTERNAL_H */
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index f172225b2f2..e3fbfa98fa5 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1086,32 +1086,62 @@ void HELPER(sme_bfmopa)(CPUARMState *env, void *vza, void *vzn, void *vzm,
     intptr_t row, col, oprsz = simd_maxsz(desc);
     uint32_t neg = simd_data(desc) * 0x80008000u;
     uint16_t *pn = vpn, *pm = vpm;
+    float_status fpst, fpst_odd;
 
-    for (row = 0; row < oprsz; ) {
-        uint16_t prow = pn[H2(row >> 4)];
-        do {
-            void *vza_row = vza + tile_vslice_offset(row);
-            uint32_t n = *(uint32_t *)(vzn + H1_4(row));
+    if (is_ebf(env, &fpst, &fpst_odd)) {
+        for (row = 0; row < oprsz; ) {
+            uint16_t prow = pn[H2(row >> 4)];
+            do {
+                void *vza_row = vza + tile_vslice_offset(row);
+                uint32_t n = *(uint32_t *)(vzn + H1_4(row));
 
-            n = f16mop_adj_pair(n, prow, neg);
+                n = f16mop_adj_pair(n, prow, neg);
 
-            for (col = 0; col < oprsz; ) {
-                uint16_t pcol = pm[H2(col >> 4)];
-                do {
-                    if (prow & pcol & 0b0101) {
-                        uint32_t *a = vza_row + H1_4(col);
-                        uint32_t m = *(uint32_t *)(vzm + H1_4(col));
+                for (col = 0; col < oprsz; ) {
+                    uint16_t pcol = pm[H2(col >> 4)];
+                    do {
+                        if (prow & pcol & 0b0101) {
+                            uint32_t *a = vza_row + H1_4(col);
+                            uint32_t m = *(uint32_t *)(vzm + H1_4(col));
 
-                        m = f16mop_adj_pair(m, pcol, 0);
-                        *a = bfdotadd(*a, n, m);
-                    }
-                    col += 4;
-                    pcol >>= 4;
-                } while (col & 15);
-            }
-            row += 4;
-            prow >>= 4;
-        } while (row & 15);
+                            m = f16mop_adj_pair(m, pcol, 0);
+                            *a = bfdotadd_ebf(*a, n, m, &fpst, &fpst_odd);
+                        }
+                        col += 4;
+                        pcol >>= 4;
+                    } while (col & 15);
+                }
+                row += 4;
+                prow >>= 4;
+            } while (row & 15);
+        }
+    } else {
+        for (row = 0; row < oprsz; ) {
+            uint16_t prow = pn[H2(row >> 4)];
+            do {
+                void *vza_row = vza + tile_vslice_offset(row);
+                uint32_t n = *(uint32_t *)(vzn + H1_4(row));
+
+                n = f16mop_adj_pair(n, prow, neg);
+
+                for (col = 0; col < oprsz; ) {
+                    uint16_t pcol = pm[H2(col >> 4)];
+                    do {
+                        if (prow & pcol & 0b0101) {
+                            uint32_t *a = vza_row + H1_4(col);
+                            uint32_t m = *(uint32_t *)(vzm + H1_4(col));
+
+                            m = f16mop_adj_pair(m, pcol, 0);
+                            *a = bfdotadd(*a, n, m, &fpst);
+                        }
+                        col += 4;
+                        pcol >>= 4;
+                    } while (col & 15);
+                }
+                row += 4;
+                prow >>= 4;
+            } while (row & 15);
+        }
     }
 }
 
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 77efb5f47d8..baf04a0561b 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -2790,7 +2790,7 @@ DO_MMLA_B(gvec_usmmla_b, do_usmmla_b)
  * BFloat16 Dot Product
  */
 
-float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
+bool is_ebf(CPUARMState *env, float_status *statusp, float_status *oddstatusp)
 {
     /* FPCR is ignored for BFDOT and BFMMLA. */
     float_status bf_status = {
@@ -2800,29 +2800,50 @@ float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
         .flush_inputs_to_zero = true,
         .default_nan_mode = true,
     };
+
+    *statusp = bf_status;
+    return false;
+}
+
+float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2, float_status *fpst)
+{
     float32 t1, t2;
 
     /*
      * Extract each BFloat16 from the element pair, and shift
      * them such that they become float32.
      */
-    t1 = float32_mul(e1 << 16, e2 << 16, &bf_status);
-    t2 = float32_mul(e1 & 0xffff0000u, e2 & 0xffff0000u, &bf_status);
-    t1 = float32_add(t1, t2, &bf_status);
-    t1 = float32_add(sum, t1, &bf_status);
+    t1 = float32_mul(e1 << 16, e2 << 16, fpst);
+    t2 = float32_mul(e1 & 0xffff0000u, e2 & 0xffff0000u, fpst);
+    t1 = float32_add(t1, t2, fpst);
+    t1 = float32_add(sum, t1, fpst);
 
     return t1;
 }
 
+float32 bfdotadd_ebf(float32 sum, uint32_t e1, uint32_t e2,
+                     float_status *fpst, float_status *fpst_odd)
+{
+    g_assert_not_reached();
+}
+
 void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
                         void *envp, uint32_t desc)
 {
+    CPUARMState *env = envp;
     intptr_t i, opr_sz = simd_oprsz(desc);
     float32 *d = vd, *a = va;
     uint32_t *n = vn, *m = vm;
+    float_status fpst, fpst_odd;
 
-    for (i = 0; i < opr_sz / 4; ++i) {
-        d[i] = bfdotadd(a[i], n[i], m[i]);
+    if (is_ebf(env, &fpst, &fpst_odd)) {
+        for (i = 0; i < opr_sz / 4; ++i) {
+            d[i] = bfdotadd_ebf(a[i], n[i], m[i], &fpst, &fpst_odd);
+        }
+    } else {
+        for (i = 0; i < opr_sz / 4; ++i) {
+            d[i] = bfdotadd(a[i], n[i], m[i], &fpst);
+        }
     }
     clear_tail(d, opr_sz, simd_maxsz(desc));
 }
@@ -2830,18 +2851,30 @@ void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
 void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
                             void *va, void *envp, uint32_t desc)
 {
+    CPUARMState *env = envp;
     intptr_t i, j, opr_sz = simd_oprsz(desc);
     intptr_t index = simd_data(desc);
     intptr_t elements = opr_sz / 4;
     intptr_t eltspersegment = MIN(16 / 4, elements);
     float32 *d = vd, *a = va;
     uint32_t *n = vn, *m = vm;
+    float_status fpst, fpst_odd;
 
-    for (i = 0; i < elements; i += eltspersegment) {
-        uint32_t m_idx = m[i + H4(index)];
+    if (is_ebf(env, &fpst, &fpst_odd)) {
+        for (i = 0; i < elements; i += eltspersegment) {
+            uint32_t m_idx = m[i + H4(index)];
 
-        for (j = i; j < i + eltspersegment; j++) {
-            d[j] = bfdotadd(a[j], n[j], m_idx);
+            for (j = i; j < i + eltspersegment; j++) {
+                d[j] = bfdotadd_ebf(a[j], n[j], m_idx, &fpst, &fpst_odd);
+            }
+        }
+    } else {
+        for (i = 0; i < elements; i += eltspersegment) {
+            uint32_t m_idx = m[i + H4(index)];
+
+            for (j = i; j < i + eltspersegment; j++) {
+                d[j] = bfdotadd(a[j], n[j], m_idx, &fpst);
+            }
         }
     }
     clear_tail(d, opr_sz, simd_maxsz(desc));
@@ -2850,40 +2883,76 @@ void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
 void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va,
                          void *envp, uint32_t desc)
 {
+    CPUARMState *env = envp;
     intptr_t s, opr_sz = simd_oprsz(desc);
     float32 *d = vd, *a = va;
     uint32_t *n = vn, *m = vm;
+    float_status fpst, fpst_odd;
 
-    for (s = 0; s < opr_sz / 4; s += 4) {
-        float32 sum00, sum01, sum10, sum11;
+    if (is_ebf(env, &fpst, &fpst_odd)) {
+        for (s = 0; s < opr_sz / 4; s += 4) {
+            float32 sum00, sum01, sum10, sum11;
 
-        /*
-         * Process the entire segment at once, writing back the
-         * results only after we've consumed all of the inputs.
-         *
-         * Key to indices by column:
-         *               i   j           i   k             j   k
-         */
-        sum00 = a[s + H4(0 + 0)];
-        sum00 = bfdotadd(sum00, n[s + H4(0 + 0)], m[s + H4(0 + 0)]);
-        sum00 = bfdotadd(sum00, n[s + H4(0 + 1)], m[s + H4(0 + 1)]);
+            /*
+             * Process the entire segment at once, writing back the
+             * results only after we've consumed all of the inputs.
+             *
+             * Key to indices by column:
+             *               i   j               i   k             j   k
+             */
+            sum00 = a[s + H4(0 + 0)];
+            sum00 = bfdotadd_ebf(sum00, n[s + H4(0 + 0)], m[s + H4(0 + 0)], &fpst, &fpst_odd);
+            sum00 = bfdotadd_ebf(sum00, n[s + H4(0 + 1)], m[s + H4(0 + 1)], &fpst, &fpst_odd);
 
-        sum01 = a[s + H4(0 + 1)];
-        sum01 = bfdotadd(sum01, n[s + H4(0 + 0)], m[s + H4(2 + 0)]);
-        sum01 = bfdotadd(sum01, n[s + H4(0 + 1)], m[s + H4(2 + 1)]);
+            sum01 = a[s + H4(0 + 1)];
+            sum01 = bfdotadd_ebf(sum01, n[s + H4(0 + 0)], m[s + H4(2 + 0)], &fpst, &fpst_odd);
+            sum01 = bfdotadd_ebf(sum01, n[s + H4(0 + 1)], m[s + H4(2 + 1)], &fpst, &fpst_odd);
 
-        sum10 = a[s + H4(2 + 0)];
-        sum10 = bfdotadd(sum10, n[s + H4(2 + 0)], m[s + H4(0 + 0)]);
-        sum10 = bfdotadd(sum10, n[s + H4(2 + 1)], m[s + H4(0 + 1)]);
+            sum10 = a[s + H4(2 + 0)];
+            sum10 = bfdotadd_ebf(sum10, n[s + H4(2 + 0)], m[s + H4(0 + 0)], &fpst, &fpst_odd);
+            sum10 = bfdotadd_ebf(sum10, n[s + H4(2 + 1)], m[s + H4(0 + 1)], &fpst, &fpst_odd);
 
-        sum11 = a[s + H4(2 + 1)];
-        sum11 = bfdotadd(sum11, n[s + H4(2 + 0)], m[s + H4(2 + 0)]);
-        sum11 = bfdotadd(sum11, n[s + H4(2 + 1)], m[s + H4(2 + 1)]);
+            sum11 = a[s + H4(2 + 1)];
+            sum11 = bfdotadd_ebf(sum11, n[s + H4(2 + 0)], m[s + H4(2 + 0)], &fpst, &fpst_odd);
+            sum11 = bfdotadd_ebf(sum11, n[s + H4(2 + 1)], m[s + H4(2 + 1)], &fpst, &fpst_odd);
 
-        d[s + H4(0 + 0)] = sum00;
-        d[s + H4(0 + 1)] = sum01;
-        d[s + H4(2 + 0)] = sum10;
-        d[s + H4(2 + 1)] = sum11;
+            d[s + H4(0 + 0)] = sum00;
+            d[s + H4(0 + 1)] = sum01;
+            d[s + H4(2 + 0)] = sum10;
+            d[s + H4(2 + 1)] = sum11;
+        }
+    } else {
+        for (s = 0; s < opr_sz / 4; s += 4) {
+            float32 sum00, sum01, sum10, sum11;
+
+            /*
+             * Process the entire segment at once, writing back the
+             * results only after we've consumed all of the inputs.
+             *
+             * Key to indices by column:
+             *               i   j           i   k             j   k
+             */
+            sum00 = a[s + H4(0 + 0)];
+            sum00 = bfdotadd(sum00, n[s + H4(0 + 0)], m[s + H4(0 + 0)], &fpst);
+            sum00 = bfdotadd(sum00, n[s + H4(0 + 1)], m[s + H4(0 + 1)], &fpst);
+
+            sum01 = a[s + H4(0 + 1)];
+            sum01 = bfdotadd(sum01, n[s + H4(0 + 0)], m[s + H4(2 + 0)], &fpst);
+            sum01 = bfdotadd(sum01, n[s + H4(0 + 1)], m[s + H4(2 + 1)], &fpst);
+
+            sum10 = a[s + H4(2 + 0)];
+            sum10 = bfdotadd(sum10, n[s + H4(2 + 0)], m[s + H4(0 + 0)], &fpst);
+            sum10 = bfdotadd(sum10, n[s + H4(2 + 1)], m[s + H4(0 + 1)], &fpst);
+
+            sum11 = a[s + H4(2 + 1)];
+            sum11 = bfdotadd(sum11, n[s + H4(2 + 0)], m[s + H4(2 + 0)], &fpst);
+            sum11 = bfdotadd(sum11, n[s + H4(2 + 1)], m[s + H4(2 + 1)], &fpst);
+
+            d[s + H4(0 + 0)] = sum00;
+            d[s + H4(0 + 1)] = sum01;
+            d[s + H4(2 + 0)] = sum10;
+            d[s + H4(2 + 1)] = sum11;
+        }
     }
     clear_tail(d, opr_sz, simd_maxsz(desc));
 }
-- 
2.34.1
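
The refactor described in the commit message of this patch (decide the mode once, then run one of two specialised loops) can be sketched in standalone C as follows. This is only an illustration of the control-flow shape: demo_env, demo_status, demo_is_ebf(), demo_dotadd(), demo_dotadd_ebf() and demo_helper() are hypothetical stand-ins, not QEMU types or functions, and both arithmetic variants use plain host floats rather than softfloat.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for float_status: records only a rounding choice. */
typedef struct {
    int rounding_mode;
} demo_status;

/* Hypothetical stand-in for the CPU state: carries only the mode bit. */
typedef struct {
    bool ebf_enabled;
} demo_env;

/*
 * Set up the status structures once per helper call and report which
 * specialised routine the caller should use.
 */
static bool demo_is_ebf(const demo_env *env, demo_status *st,
                        demo_status *st_odd)
{
    st->rounding_mode = 0;
    if (env->ebf_enabled) {
        /* The second status is only initialised when it will be used. */
        st_odd->rounding_mode = 1;
    }
    return env->ebf_enabled;
}

/* Variant for the mode-off case: takes a single status argument. */
static float demo_dotadd(float sum, float a, float b, const demo_status *st)
{
    (void)st;
    return sum + a * b;
}

/* Variant for the mode-on case: takes both status arguments. */
static float demo_dotadd_ebf(float sum, float a, float b,
                             const demo_status *st,
                             const demo_status *st_odd)
{
    (void)st;
    (void)st_odd;
    return sum + a * b;
}

/* The mode test is hoisted out of the loop; each branch has its own loop. */
static void demo_helper(const demo_env *env, float *d, const float *a,
                        const float *n, const float *m, size_t count)
{
    demo_status st, st_odd;
    size_t i;

    if (demo_is_ebf(env, &st, &st_odd)) {
        for (i = 0; i < count; i++) {
            d[i] = demo_dotadd_ebf(a[i], n[i], m[i], &st, &st_odd);
        }
    } else {
        for (i = 0; i < count; i++) {
            d[i] = demo_dotadd(a[i], n[i], m[i], &st);
        }
    }
}
```

The cost of this shape is a duplicated loop body, as visible in the diff above; the benefit is that the status setup and the mode test run once per helper call instead of once per element.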




* Re: [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support
  2024-07-30 16:03 ` [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support Peter Maydell
@ 2024-07-31  1:43   ` Richard Henderson
  2024-07-31  1:48   ` Richard Henderson
  1 sibling, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:43 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> We use bfdotadd() in four callsites for various helper functions. Currently
> this all assumes that we have the FPCR.EBF=0 semantics. For FPCR.EBF=1
> we will need to:
>   * call a different routine to bfdotadd() because we need to do a
>     fused multiply-add rather than separate multiply and add steps
>   * use a different float_status that honours the FPCR rounding mode
>     and denormal-flushing fields
>   * pass in an extra float_status that has been set up to perform
>     round-to-odd rounding
> 
> To prepare for this, refactor all the callsites so that instead of
>     for (...) {
>         x = bfdotadd(...);
>     }
> 
> they are:
>     float_status fpst, fpst_odd;
>     if (is_ebf(env, &fpst, &fpst_odd)) {
>         for (...) {
>             x = bfdotadd_ebf(..., &fpst, &fpst_odd);
>         }
>     } else {
>         for (...) {
>             x = bfdotadd(..., &fpst);
>         }
>     }
> 
> For the moment the is_ebf() function always returns false, sets up
> fpst for EBF=0 semantics and never sets up fpst_odd; bfdotadd_ebf()
> will assert if called. We'll fill in the handling for EBF=1 in the
> next commit.
> 
> This change should be a zero-behaviour-change refactor.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/tcg/vec_internal.h |  37 ++++++++-
>   target/arm/tcg/sme_helper.c   |  74 ++++++++++++------
>   target/arm/tcg/vec_helper.c   | 141 +++++++++++++++++++++++++---------
>   3 files changed, 192 insertions(+), 60 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* Re: [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support
  2024-07-30 16:03 ` [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support Peter Maydell
  2024-07-31  1:43   ` Richard Henderson
@ 2024-07-31  1:48   ` Richard Henderson
  2024-07-31 12:32     ` Peter Maydell
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:48 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> @@ -2790,7 +2790,7 @@ DO_MMLA_B(gvec_usmmla_b, do_usmmla_b)
>    * BFloat16 Dot Product
>    */
>   
> -float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
> +bool is_ebf(CPUARMState *env, float_status *statusp, float_status *oddstatusp)
>   {
>       /* FPCR is ignored for BFDOT and BFMMLA. */
>       float_status bf_status = {
> @@ -2800,29 +2800,50 @@ float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
>           .flush_inputs_to_zero = true,
>           .default_nan_mode = true,
>       };
> +
> +    *statusp = bf_status;
> +    return false;
> +}

Looking at the next patch, I think dropping the local variable is better.

   *statusp = (float_status){
       ...
   };


r~
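
The suggestion in this reply, assigning a C99 compound literal through the output pointer instead of filling a named local and copying it, can be sketched in standalone C as below. demo_status and both functions are hypothetical illustrations; the field names are borrowed from the quoted diff, but the struct is not QEMU's float_status.

```c
#include <stdbool.h>

/* Hypothetical struct standing in for float_status. */
typedef struct {
    int rounding_mode;
    bool flush_to_zero;
    bool flush_inputs_to_zero;
    bool default_nan_mode;
} demo_status;

/* Shape of the code as posted: build a local, then copy it out. */
static bool demo_fill_with_local(demo_status *statusp)
{
    demo_status tmp = {
        .rounding_mode = 1,
        .flush_to_zero = true,
        .flush_inputs_to_zero = true,
        .default_nan_mode = true,
    };

    *statusp = tmp;
    return false;
}

/* Shape of the suggested code: assign a compound literal directly. */
static bool demo_fill_with_literal(demo_status *statusp)
{
    *statusp = (demo_status){
        .rounding_mode = 1,
        .flush_to_zero = true,
        .flush_inputs_to_zero = true,
        .default_nan_mode = true,
    };
    return false;
}
```

Both forms leave the same field values in *statusp; the compound-literal form simply avoids naming a temporary.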



* Re: [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support
  2024-07-31  1:48   ` Richard Henderson
@ 2024-07-31 12:32     ` Peter Maydell
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Maydell @ 2024-07-31 12:32 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-arm, qemu-devel

On Wed, 31 Jul 2024 at 02:48, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 7/31/24 02:03, Peter Maydell wrote:
> > @@ -2790,7 +2790,7 @@ DO_MMLA_B(gvec_usmmla_b, do_usmmla_b)
> >    * BFloat16 Dot Product
> >    */
> >
> > -float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
> > +bool is_ebf(CPUARMState *env, float_status *statusp, float_status *oddstatusp)
> >   {
> >       /* FPCR is ignored for BFDOT and BFMMLA. */
> >       float_status bf_status = {
> > @@ -2800,29 +2800,50 @@ float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
> >           .flush_inputs_to_zero = true,
> >           .default_nan_mode = true,
> >       };
> > +
> > +    *statusp = bf_status;
> > +    return false;
> > +}
>
> Looking at the next patch, I think dropping the local variable is better.
>
>    *statusp = (float_status){
>        ...
>    };

Yes, I agree; I've updated this patch and the next accordingly.

-- PMM



* [PATCH 7/8] target/arm: Implement FPCR.EBF=1 semantics for bfdotadd()
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
                   ` (5 preceding siblings ...)
  2024-07-30 16:03 ` [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:50   ` Richard Henderson
  2024-07-30 16:03 ` [PATCH 8/8] target/arm: Enable FEAT_EBF16 in the "max" CPU Peter Maydell
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Implement the FPCR.EBF=1 semantics for bfdotadd() operations:
 * is_ebf() sets up fpst and fpst_odd
 * bfdotadd_ebf() implements the fused paired-multiply-and-add
   operation that we need

The paired-multiply-and-add is similar to f16_dotadd() and
we use the same trick here as in that function, but the inputs
here are bfloat16 rather than float16.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 target/arm/tcg/vec_helper.c | 57 +++++++++++++++++++++++++++++++++++--
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index baf04a0561b..64076c1c595 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -2792,7 +2792,20 @@ DO_MMLA_B(gvec_usmmla_b, do_usmmla_b)
 
 bool is_ebf(CPUARMState *env, float_status *statusp, float_status *oddstatusp)
 {
-    /* FPCR is ignored for BFDOT and BFMMLA. */
+    /*
+     * For BFDOT, BFMMLA, etc, the behaviour depends on FPCR.EBF.
+     * For EBF = 0, we ignore the FPCR bits which determine rounding
+     * mode and denormal-flushing, and we do unfused multiplies and
+     * additions with intermediate rounding of all products and sums.
+     * For EBF = 1, we honour FPCR rounding mode and denormal-flushing bits,
+     * and we perform a fused two-way sum-of-products without intermediate
+     * rounding of the products.
+     * In either case, we don't set fp exception flags.
+     *
+     * EBF is AArch64 only, so even if it's set in the FPCR it has
+     * no effect on AArch32 instructions.
+     */
+    bool ebf = is_a64(env) && env->vfp.fpcr & FPCR_EBF;
     float_status bf_status = {
         .tininess_before_rounding = float_tininess_before_rounding,
         .float_rounding_mode = float_round_to_odd_inf,
@@ -2801,8 +2814,19 @@ bool is_ebf(CPUARMState *env, float_status *statusp, float_status *oddstatusp)
         .default_nan_mode = true,
     };
 
+    if (ebf) {
+        float_status *fpst = &env->vfp.fp_status;
+        set_flush_to_zero(get_flush_to_zero(fpst), &bf_status);
+        set_flush_inputs_to_zero(get_flush_inputs_to_zero(fpst), &bf_status);
+        set_float_rounding_mode(get_float_rounding_mode(fpst), &bf_status);
+
+        /* EBF=1 needs to do a step with round-to-odd semantics */
+        *oddstatusp = bf_status;
+        set_float_rounding_mode(float_round_to_odd, oddstatusp);
+    }
+
     *statusp = bf_status;
-    return false;
+    return ebf;
 }
 
 float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2, float_status *fpst)
@@ -2824,7 +2848,34 @@ float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2, float_status *fpst)
 float32 bfdotadd_ebf(float32 sum, uint32_t e1, uint32_t e2,
                      float_status *fpst, float_status *fpst_odd)
 {
-    g_assert_not_reached();
+    /*
+     * Compare f16_dotadd() in sme_helper.c, but here we have
+     * bfloat16 inputs. In particular that means that we do not
+     * want the FPCR.FZ16 flush semantics, so we use the normal
+     * float_status for the input handling here.
+     */
+    float64 e1r = float32_to_float64(e1 << 16, fpst);
+    float64 e1c = float32_to_float64(e1 & 0xffff0000u, fpst);
+    float64 e2r = float32_to_float64(e2 << 16, fpst);
+    float64 e2c = float32_to_float64(e2 & 0xffff0000u, fpst);
+    float64 t64;
+    float32 t32;
+
+    /*
+     * The ARM pseudocode function FPDot performs both multiplies
+     * and the add with a single rounding operation.  Emulate this
+     * by performing the first multiply in round-to-odd, then doing
+     * the second multiply as fused multiply-add, and rounding to
+     * float32 all in one step.
+     */
+    t64 = float64_mul(e1r, e2r, fpst_odd);
+    t64 = float64r32_muladd(e1c, e2c, t64, 0, fpst);
+
+    /* This conversion is exact, because we've already rounded. */
+    t32 = float64_to_float32(t64, fpst);
+
+    /* The final accumulation step is not fused. */
+    return float32_add(sum, t32, fpst);
 }
 
 void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va,
-- 
2.34.1
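
For readers following the input handling in bfdotadd_ebf(), here is a minimal standalone sketch of how the two bfloat16 elements packed in each 32-bit operand map onto float32 bit patterns (`e1 << 16` for the low element, `e1 & 0xffff0000u` for the high one). The helper names below are hypothetical and are not part of the patch. The reference sum uses host double arithmetic, not QEMU's softfloat, so it ignores the FPCR rounding mode, flush-to-zero and default-NaN handling that the patch implements. It is only meant to show which elements get multiplied together.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as an IEEE single. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

/* Low bfloat16 element of a packed pair: shift it into the top half. */
static float bf16_pair_lo(uint32_t pair)
{
    return bits_to_float(pair << 16);
}

/* High bfloat16 element: already in the top half, so clear the low half. */
static float bf16_pair_hi(uint32_t pair)
{
    return bits_to_float(pair & 0xffff0000u);
}

/*
 * Reference sum of the two element products, computed in host double.
 * For finite inputs each bfloat16 x bfloat16 product is exact in double
 * (8-bit significands give at most a 16-bit product), so only the final
 * addition can round. This is NOT bit-exact with the EBF=1 path, which
 * rounds once to float32 precision under guest-controlled modes.
 */
static double bf16_pair_dot(uint32_t e1, uint32_t e2)
{
    double lo = (double)bf16_pair_lo(e1) * (double)bf16_pair_lo(e2);
    double hi = (double)bf16_pair_hi(e1) * (double)bf16_pair_hi(e2);
    return lo + hi;
}
```

As a usage example, the operand 0x40003f80 holds 1.0 in its low half and 2.0 in its high half, and 0x40404080 holds 4.0 and 3.0, so the pairwise sum is 1.0 * 4.0 + 2.0 * 3.0.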




* Re: [PATCH 7/8] target/arm: Implement FPCR.EBF=1 semantics for bfdotadd()
  2024-07-30 16:03 ` [PATCH 7/8] target/arm: Implement FPCR.EBF=1 semantics for bfdotadd() Peter Maydell
@ 2024-07-31  1:50   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:50 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> Implement the FPCR.EBF=1 semantics for bfdotadd() operations:
>   * is_ebf() sets up fpst and fpst_odd
>   * bfdotadd_ebf() implements the fused paired-multiply-and-add
>     operation that we need
> 
> The paired-multiply-and-add is similar to f16_dotadd() and
> we use the same trick here as in that function, but the inputs
> here are bfloat16 rather than float16.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   target/arm/tcg/vec_helper.c | 57 +++++++++++++++++++++++++++++++++++--
>   1 file changed, 54 insertions(+), 3 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



* [PATCH 8/8] target/arm: Enable FEAT_EBF16 in the "max" CPU
  2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
                   ` (6 preceding siblings ...)
  2024-07-30 16:03 ` [PATCH 7/8] target/arm: Implement FPCR.EBF=1 semantics for bfdotadd() Peter Maydell
@ 2024-07-30 16:03 ` Peter Maydell
  2024-07-31  1:51   ` Richard Henderson
  7 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2024-07-30 16:03 UTC (permalink / raw)
  To: qemu-arm, qemu-devel

Now that we've implemented the required behaviour for FEAT_EBF16, we
can enable it for the "max" CPU type, list it in our documentation,
and delete a TODO comment about it being missing.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
 docs/system/arm/emulation.rst  | 1 +
 target/arm/tcg/cpu64.c         | 4 ++--
 target/arm/tcg/translate-sme.c | 1 -
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/system/arm/emulation.rst b/docs/system/arm/emulation.rst
index 3ab6e726679..35f52a54b1c 100644
--- a/docs/system/arm/emulation.rst
+++ b/docs/system/arm/emulation.rst
@@ -45,6 +45,7 @@ the following architecture extensions:
 - FEAT_DotProd (Advanced SIMD dot product instructions)
 - FEAT_DoubleFault (Double Fault Extension)
 - FEAT_E0PD (Preventing EL0 access to halves of address maps)
+- FEAT_EBF16 (AArch64 Extended BFloat16 instructions)
 - FEAT_ECV (Enhanced Counter Virtualization)
 - FEAT_EL0 (Support for execution at EL0)
 - FEAT_EL1 (Support for execution at EL1)
diff --git a/target/arm/tcg/cpu64.c b/target/arm/tcg/cpu64.c
index fe232eb3069..79258a7c928 100644
--- a/target/arm/tcg/cpu64.c
+++ b/target/arm/tcg/cpu64.c
@@ -1160,7 +1160,7 @@ void aarch64_max_tcg_initfn(Object *obj)
     t = FIELD_DP64(t, ID_AA64ISAR1, FRINTTS, 1);  /* FEAT_FRINTTS */
     t = FIELD_DP64(t, ID_AA64ISAR1, SB, 1);       /* FEAT_SB */
     t = FIELD_DP64(t, ID_AA64ISAR1, SPECRES, 1);  /* FEAT_SPECRES */
-    t = FIELD_DP64(t, ID_AA64ISAR1, BF16, 1);     /* FEAT_BF16 */
+    t = FIELD_DP64(t, ID_AA64ISAR1, BF16, 2);     /* FEAT_BF16, FEAT_EBF16 */
     t = FIELD_DP64(t, ID_AA64ISAR1, DGH, 1);      /* FEAT_DGH */
     t = FIELD_DP64(t, ID_AA64ISAR1, I8MM, 1);     /* FEAT_I8MM */
     cpu->isar.id_aa64isar1 = t;
@@ -1244,7 +1244,7 @@ void aarch64_max_tcg_initfn(Object *obj)
     t = FIELD_DP64(t, ID_AA64ZFR0, SVEVER, 1);
     t = FIELD_DP64(t, ID_AA64ZFR0, AES, 2);       /* FEAT_SVE_PMULL128 */
     t = FIELD_DP64(t, ID_AA64ZFR0, BITPERM, 1);   /* FEAT_SVE_BitPerm */
-    t = FIELD_DP64(t, ID_AA64ZFR0, BFLOAT16, 1);  /* FEAT_BF16 */
+    t = FIELD_DP64(t, ID_AA64ZFR0, BFLOAT16, 2);  /* FEAT_BF16, FEAT_EBF16 */
     t = FIELD_DP64(t, ID_AA64ZFR0, SHA3, 1);      /* FEAT_SVE_SHA3 */
     t = FIELD_DP64(t, ID_AA64ZFR0, SM4, 1);       /* FEAT_SVE_SM4 */
     t = FIELD_DP64(t, ID_AA64ZFR0, I8MM, 1);      /* FEAT_I8MM */
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index bcb502feb05..760c200e622 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -354,7 +354,6 @@ TRANS_FEAT(FMOPA_s, aa64_sme, do_outprod_fpst, a,
 TRANS_FEAT(FMOPA_d, aa64_sme_f64f64, do_outprod_fpst, a,
            MO_64, FPST_FPCR, gen_helper_sme_fmopa_d)
 
-/* TODO: FEAT_EBF16 */
 TRANS_FEAT(BFMOPA, aa64_sme, do_outprod_env, a, MO_32, gen_helper_sme_bfmopa)
 
 TRANS_FEAT(SMOPA_s, aa64_sme, do_outprod, a, MO_32, gen_helper_sme_smopa_s)
-- 
2.34.1
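
As an illustration of what the ID register change means for software that probes for the feature, here is a small sketch that decodes the 4-bit BF16 fields, where a value of 1 indicates FEAT_BF16 and a value of 2 or more also indicates FEAT_EBF16, matching the comments in the diff above. The function names are hypothetical. The bit positions (bits [47:44] of ID_AA64ISAR1_EL1 and bits [23:20] of ID_AA64ZFR0_EL1) are stated here as an assumption from the Arm ARM register descriptions and should be checked against it. The sketch takes the register value as a plain integer and does not show how it is read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field positions; verify against the Arm ARM before relying on them. */
#define ID_AA64ISAR1_BF16_SHIFT 44
#define ID_AA64ZFR0_BF16_SHIFT  20

/* Extract a 4-bit ID register field starting at bit 'shift'. */
static unsigned id_field4(uint64_t reg, unsigned shift)
{
    return (unsigned)((reg >> shift) & 0xf);
}

/* True if the AdvSIMD BF16 field advertises FEAT_EBF16 (field value >= 2). */
static bool isar1_has_ebf16(uint64_t id_aa64isar1)
{
    return id_field4(id_aa64isar1, ID_AA64ISAR1_BF16_SHIFT) >= 2;
}

/* True if the SVE BF16 field advertises FEAT_EBF16 (field value >= 2). */
static bool zfr0_has_ebf16(uint64_t id_aa64zfr0)
{
    return id_field4(id_aa64zfr0, ID_AA64ZFR0_BF16_SHIFT) >= 2;
}
```

ID register fields of this kind are compared with >= rather than == so that a future, larger field value is still treated as including the feature.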




* Re: [PATCH 8/8] target/arm: Enable FEAT_EBF16 in the "max" CPU
  2024-07-30 16:03 ` [PATCH 8/8] target/arm: Enable FEAT_EBF16 in the "max" CPU Peter Maydell
@ 2024-07-31  1:51   ` Richard Henderson
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2024-07-31  1:51 UTC (permalink / raw)
  To: Peter Maydell, qemu-arm, qemu-devel

On 7/31/24 02:03, Peter Maydell wrote:
> Now that we've implemented the required behaviour for FEAT_EBF16, we
> can enable it for the "max" CPU type, list it in our documentation,
> and delete a TODO comment about it being missing.
> 
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
>   docs/system/arm/emulation.rst  | 1 +
>   target/arm/tcg/cpu64.c         | 4 ++--
>   target/arm/tcg/translate-sme.c | 1 -
>   3 files changed, 3 insertions(+), 3 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~



Thread overview: 20+ messages
2024-07-30 16:02 [PATCH 0/8] target/arm: Implement FEAT_EBF16 Peter Maydell
2024-07-30 16:02 ` [PATCH 1/8] target/arm: Allow setting the FPCR.EBF bit for FEAT_EBF16 Peter Maydell
2024-07-31  1:30   ` Richard Henderson
2024-07-30 16:03 ` [PATCH 2/8] target/arm: Pass env pointer through to sme_bfmopa helper Peter Maydell
2024-07-31  1:32   ` Richard Henderson
2024-07-30 16:03 ` [PATCH 3/8] target/arm: Pass env pointer through to gvec_bfdot helper Peter Maydell
2024-07-31  1:36   ` Richard Henderson
2024-07-31 12:31     ` Peter Maydell
2024-07-30 16:03 ` [PATCH 4/8] target/arm: Pass env pointer through to gvec_bfdot_idx helper Peter Maydell
2024-07-31  1:37   ` Richard Henderson
2024-07-30 16:03 ` [PATCH 5/8] target/arm: Pass env pointer through to gvec_bfmmla helper Peter Maydell
2024-07-31  1:38   ` Richard Henderson
2024-07-30 16:03 ` [PATCH 6/8] target/arm: Prepare bfdotadd() callers for FEAT_EBF support Peter Maydell
2024-07-31  1:43   ` Richard Henderson
2024-07-31  1:48   ` Richard Henderson
2024-07-31 12:32     ` Peter Maydell
2024-07-30 16:03 ` [PATCH 7/8] target/arm: Implement FPCR.EBF=1 semantics for bfdotadd() Peter Maydell
2024-07-31  1:50   ` Richard Henderson
2024-07-30 16:03 ` [PATCH 8/8] target/arm: Enable FEAT_EBF16 in the "max" CPU Peter Maydell
2024-07-31  1:51   ` Richard Henderson
