* [PATCH 00/61] target/arm: Implement FEAT_SME2
@ 2025-02-06 19:56 Richard Henderson
2025-02-06 19:56 ` [PATCH 01/61] tcg: Add dbase argument to do_dup_store Richard Henderson
` (61 more replies)
0 siblings, 62 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Based-on: 20250201164012.1660228-1-peter.maydell@linaro.org
("[PATCH v2 00/69] target/arm: FEAT_AFP and FEAT_RPRES")
This implements the Scalar Matrix Extensions, version 2, plus two
trivial extensions for float16 and bfloat16.
This hasn't been tested much at all; I need to either get FVP up and
running for RISU comparison, or write some stand-alone test cases.
But in the meantime this could use some eyes.
SME2 is the first vector-like extension we've had that has dynamic
indexing of registers: ZArray[(rv + offset) % svl], where RV is a
general register. So the first thing I do is extend TCG's gvec
support to handle TCGv_ptr base + offset addressing. I only changed
enough to handle what I needed within SME2; changing it all would be
a big job, and it would (at least for the moment) remain unused.
Still to-do are few more extensions for SME2p1.
r~
Richard Henderson (61):
tcg: Add dbase argument to do_dup_store
tcg: Add dbase argument to do_dup
tcg: Add dbase argument to expand_clr
tcg: Add base arguments to check_overlap_[234]
tcg: Split out tcg_gen_gvec_2_var
tcg: Split out tcg_gen_gvec_3_var
tcg: Split out tcg_gen_gvec_mov_var
tcg: Split out tcg_gen_gvec_{add,sub}_var
target/arm: Introduce FPST_ZA, FPST_ZA_F16
target/arm: Use FPST_ZA for sme_fmopa_[hsd]
target/arm: Rename zarray to za_state.za
target/arm: Add isar_feature_aa64_sme2*
target/arm: Add ZT0
target/arm: Add zt0_excp_el to DisasContext
target/arm: Implement SME2 ZERO ZT0
target/arm: Implement SME2 LDR/STR ZT0
target/arm: Implement SME2 MOVT
target/arm: Split get_tile_rowcol argument tile_index
target/arm: Rename MOVA for translate
target/arm: Implement SME2 MOVA to/from tile, multiple registers
target/arm: Split out get_zarray
target/arm: Implement SME2 MOVA to/from array, multiple registers
target/arm: Implement SME2 BMOPA
target/arm: Implement SME2 SMOPS, UMOPS (2-way)
target/arm: Introduce gen_gvec_sve2_sqdmulh
target/arm: Implement SME2 Multiple and Single SVE Destructive
target/arm: Implement SME2 Multiple Vectors SVE Destructive
target/arm: Implement SME2 ADD/SUB (array results, multiple and single
vector)
target/arm: Implement SME2 ADD/SUB (array results, multiple vectors)
target/arm: Pass ZA to helper_sve2_fmlal_zz[zx]w_s
target/arm: Implement SME2 FMLAL, BFMLAL
target/arm: Implement SME2 FDOT
target/arm: Implement SME2 BFDOT
target/arm: Implement SME2 FVDOT, BFVDOT
target/arm: Rename helper_gvec_*dot_[bh] to *_4[bh]
target/arm: Remove helper_gvec_sudot_idx_4b
target/arm: Implemement SME2 SDOT, UDOT, USDOT, SUDOT
target/arm: Implement SME2 SVDOT, UVDOT, SUVDOT, USVDOT
target/arm: Implement SME2 SMLAL, SMLSL, UMLAL, UMLSL
target/arm: Implement SME2 SMLALL, SMLSLL, UMLALL, UMLSLL
target/arm: Rename gvec_fml[as]_[hs] with _nf_ infix
target/arm: Implement SME2 FMLA, FMLS
target/arm: Implement SME2 BFMLA, BFMLS
target/arm: Implement SME2 FADD, FSUB, BFADD, BFSUB
target/arm: Remove CPUARMState.vfp.scratch
target/arm: Implement SME2 BFCVT, BFCVTN, FCVT, FCVTN
target/arm: Implement SME2 FCVT (widening), FCVTL
target/arm: Implement SME2 FCVTZS, FCVTZU
target/arm: Implement SME2 SCVTF, UCVTF
target/arm: Implement SME2 FRINTN, FRINTP, FRINTM, FRINTA
target/arm: Introduce do_[us]sat_[bhs] macros
target/arm: Use do_[us]sat_[bhs] in sve_helper.c
target/arm: Implement SME2 SQCVT, UQCVT, SQCVTU
target/arm: Implement SME2 SUNPK, UUNPK
target/arm: Implement SME2 ZIP, UZP (four registers)
target/arm: Move do_urshr, do_srshr to vec_internal.h
target/arm: Implement SME2 SQRSHR, UQRSHR, SQRSHRN
target/arm: Implement SME2 ZIP, UZP (two registers)
target/arm: Implement SME2 FCLAMP, SCLAMP, UCLAMP
target/arm: Implement SME2 SEL
target/arm: Enable FEAT_SME2, FEAT_SME_F16F16, FEAT_SVE_B16B16 on -cpu
max
include/tcg/tcg-op-gvec-common.h | 20 +
target/arm/cpu-features.h | 35 +
target/arm/cpu.h | 68 +-
target/arm/helper.h | 47 +-
target/arm/syndrome.h | 1 +
target/arm/tcg/helper-sme.h | 164 ++++
target/arm/tcg/translate-a64.h | 4 +
target/arm/tcg/translate.h | 11 +
target/arm/tcg/vec_internal.h | 32 +
linux-user/aarch64/signal.c | 4 +-
target/arm/cpu.c | 11 +-
target/arm/helper.c | 2 +-
target/arm/machine.c | 23 +-
target/arm/tcg/cpu64.c | 7 +-
target/arm/tcg/gengvec.c | 6 +
target/arm/tcg/gengvec64.c | 11 +
target/arm/tcg/helper-a64.c | 2 +
target/arm/tcg/hflags.c | 34 +-
target/arm/tcg/mve_helper.c | 21 -
target/arm/tcg/sme_helper.c | 867 +++++++++++++++++++--
target/arm/tcg/sve_helper.c | 137 ++--
target/arm/tcg/translate-a64.c | 15 +-
target/arm/tcg/translate-neon.c | 18 +-
target/arm/tcg/translate-sme.c | 1238 +++++++++++++++++++++++++++++-
target/arm/tcg/translate-sve.c | 28 +-
target/arm/tcg/vec_helper.c | 162 +++-
target/arm/vfp_helper.c | 10 +
tcg/tcg-op-gvec.c | 363 +++++----
docs/system/arm/emulation.rst | 3 +
target/arm/tcg/sme.decode | 823 +++++++++++++++++++-
30 files changed, 3719 insertions(+), 448 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 64+ messages in thread
* [PATCH 01/61] tcg: Add dbase argument to do_dup_store
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 02/61] tcg: Add dbase argument to do_dup Richard Henderson
` (60 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op-gvec.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index d32a4f146d..1aad7b0864 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -483,8 +483,8 @@ static TCGType choose_vector_type(const TCGOpcode *list, unsigned vece,
return 0;
}
-static void do_dup_store(TCGType type, uint32_t dofs, uint32_t oprsz,
- uint32_t maxsz, TCGv_vec t_vec)
+static void do_dup_store(TCGType type, TCGv_ptr dbase, uint32_t dofs,
+ uint32_t oprsz, uint32_t maxsz, TCGv_vec t_vec)
{
uint32_t i = 0;
@@ -496,7 +496,7 @@ static void do_dup_store(TCGType type, uint32_t dofs, uint32_t oprsz,
* are misaligned wrt the maximum vector size, so do that first.
*/
if (dofs & 8) {
- tcg_gen_stl_vec(t_vec, tcg_env, dofs + i, TCG_TYPE_V64);
+ tcg_gen_stl_vec(t_vec, dbase, dofs + i, TCG_TYPE_V64);
i += 8;
}
@@ -508,17 +508,17 @@ static void do_dup_store(TCGType type, uint32_t dofs, uint32_t oprsz,
* that e.g. size == 80 would be expanded with 2x32 + 1x16.
*/
for (; i + 32 <= oprsz; i += 32) {
- tcg_gen_stl_vec(t_vec, tcg_env, dofs + i, TCG_TYPE_V256);
+ tcg_gen_stl_vec(t_vec, dbase, dofs + i, TCG_TYPE_V256);
}
/* fallthru */
case TCG_TYPE_V128:
for (; i + 16 <= oprsz; i += 16) {
- tcg_gen_stl_vec(t_vec, tcg_env, dofs + i, TCG_TYPE_V128);
+ tcg_gen_stl_vec(t_vec, dbase, dofs + i, TCG_TYPE_V128);
}
break;
case TCG_TYPE_V64:
for (; i < oprsz; i += 8) {
- tcg_gen_stl_vec(t_vec, tcg_env, dofs + i, TCG_TYPE_V64);
+ tcg_gen_stl_vec(t_vec, dbase, dofs + i, TCG_TYPE_V64);
}
break;
default:
@@ -574,7 +574,7 @@ static void do_dup(unsigned vece, uint32_t dofs, uint32_t oprsz,
} else {
tcg_gen_dupi_vec(vece, t_vec, in_c);
}
- do_dup_store(type, dofs, oprsz, maxsz, t_vec);
+ do_dup_store(type, tcg_env, dofs, oprsz, maxsz, t_vec);
return;
}
@@ -1731,7 +1731,7 @@ void tcg_gen_gvec_dup_mem(unsigned vece, uint32_t dofs, uint32_t aofs,
if (type != 0) {
TCGv_vec t_vec = tcg_temp_new_vec(type);
tcg_gen_dup_mem_vec(vece, t_vec, tcg_env, aofs);
- do_dup_store(type, dofs, oprsz, maxsz, t_vec);
+ do_dup_store(type, tcg_env, dofs, oprsz, maxsz, t_vec);
} else if (vece <= MO_32) {
TCGv_i32 in = tcg_temp_ebb_new_i32();
switch (vece) {
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 02/61] tcg: Add dbase argument to do_dup
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
2025-02-06 19:56 ` [PATCH 01/61] tcg: Add dbase argument to do_dup_store Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 03/61] tcg: Add dbase argument to expand_clr Richard Henderson
` (59 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op-gvec.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 1aad7b0864..451091753d 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -534,9 +534,9 @@ static void do_dup_store(TCGType type, TCGv_ptr dbase, uint32_t dofs,
* Only one of IN_32 or IN_64 may be set;
* IN_C is used if IN_32 and IN_64 are unset.
*/
-static void do_dup(unsigned vece, uint32_t dofs, uint32_t oprsz,
- uint32_t maxsz, TCGv_i32 in_32, TCGv_i64 in_64,
- uint64_t in_c)
+static void do_dup(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ uint32_t oprsz, uint32_t maxsz,
+ TCGv_i32 in_32, TCGv_i64 in_64, uint64_t in_c)
{
TCGType type;
TCGv_i64 t_64;
@@ -574,7 +574,7 @@ static void do_dup(unsigned vece, uint32_t dofs, uint32_t oprsz,
} else {
tcg_gen_dupi_vec(vece, t_vec, in_c);
}
- do_dup_store(type, tcg_env, dofs, oprsz, maxsz, t_vec);
+ do_dup_store(type, dbase, dofs, oprsz, maxsz, t_vec);
return;
}
@@ -618,14 +618,14 @@ static void do_dup(unsigned vece, uint32_t dofs, uint32_t oprsz,
/* Implement inline if we picked an implementation size above. */
if (t_32) {
for (i = 0; i < oprsz; i += 4) {
- tcg_gen_st_i32(t_32, tcg_env, dofs + i);
+ tcg_gen_st_i32(t_32, dbase, dofs + i);
}
tcg_temp_free_i32(t_32);
goto done;
}
if (t_64) {
for (i = 0; i < oprsz; i += 8) {
- tcg_gen_st_i64(t_64, tcg_env, dofs + i);
+ tcg_gen_st_i64(t_64, dbase, dofs + i);
}
tcg_temp_free_i64(t_64);
goto done;
@@ -634,7 +634,7 @@ static void do_dup(unsigned vece, uint32_t dofs, uint32_t oprsz,
/* Otherwise implement out of line. */
t_ptr = tcg_temp_ebb_new_ptr();
- tcg_gen_addi_ptr(t_ptr, tcg_env, dofs);
+ tcg_gen_addi_ptr(t_ptr, dbase, dofs);
/*
* This may be expand_clr for the tail of an operation, e.g.
@@ -710,7 +710,7 @@ static void do_dup(unsigned vece, uint32_t dofs, uint32_t oprsz,
/* Likewise, but with zero. */
static void expand_clr(uint32_t dofs, uint32_t maxsz)
{
- do_dup(MO_8, dofs, maxsz, maxsz, NULL, NULL, 0);
+ do_dup(MO_8, tcg_env, dofs, maxsz, maxsz, NULL, NULL, 0);
}
/* Expand OPSZ bytes worth of two-operand operations using i32 elements. */
@@ -1711,7 +1711,7 @@ void tcg_gen_gvec_dup_i32(unsigned vece, uint32_t dofs, uint32_t oprsz,
{
check_size_align(oprsz, maxsz, dofs);
tcg_debug_assert(vece <= MO_32);
- do_dup(vece, dofs, oprsz, maxsz, in, NULL, 0);
+ do_dup(vece, tcg_env, dofs, oprsz, maxsz, in, NULL, 0);
}
void tcg_gen_gvec_dup_i64(unsigned vece, uint32_t dofs, uint32_t oprsz,
@@ -1719,7 +1719,7 @@ void tcg_gen_gvec_dup_i64(unsigned vece, uint32_t dofs, uint32_t oprsz,
{
check_size_align(oprsz, maxsz, dofs);
tcg_debug_assert(vece <= MO_64);
- do_dup(vece, dofs, oprsz, maxsz, NULL, in, 0);
+ do_dup(vece, tcg_env, dofs, oprsz, maxsz, NULL, in, 0);
}
void tcg_gen_gvec_dup_mem(unsigned vece, uint32_t dofs, uint32_t aofs,
@@ -1745,12 +1745,12 @@ void tcg_gen_gvec_dup_mem(unsigned vece, uint32_t dofs, uint32_t aofs,
tcg_gen_ld_i32(in, tcg_env, aofs);
break;
}
- do_dup(vece, dofs, oprsz, maxsz, in, NULL, 0);
+ do_dup(vece, tcg_env, dofs, oprsz, maxsz, in, NULL, 0);
tcg_temp_free_i32(in);
} else {
TCGv_i64 in = tcg_temp_ebb_new_i64();
tcg_gen_ld_i64(in, tcg_env, aofs);
- do_dup(vece, dofs, oprsz, maxsz, NULL, in, 0);
+ do_dup(vece, tcg_env, dofs, oprsz, maxsz, NULL, in, 0);
tcg_temp_free_i64(in);
}
} else if (vece == 4) {
@@ -1833,7 +1833,7 @@ void tcg_gen_gvec_dup_imm(unsigned vece, uint32_t dofs, uint32_t oprsz,
uint32_t maxsz, uint64_t x)
{
check_size_align(oprsz, maxsz, dofs);
- do_dup(vece, dofs, oprsz, maxsz, NULL, NULL, x);
+ do_dup(vece, tcg_env, dofs, oprsz, maxsz, NULL, NULL, x);
}
void tcg_gen_gvec_not(unsigned vece, uint32_t dofs, uint32_t aofs,
@@ -3772,7 +3772,7 @@ void tcg_gen_gvec_cmp(TCGCond cond, unsigned vece, uint32_t dofs,
check_overlap_3(dofs, aofs, bofs, maxsz);
if (cond == TCG_COND_NEVER || cond == TCG_COND_ALWAYS) {
- do_dup(MO_8, dofs, oprsz, maxsz,
+ do_dup(MO_8, tcg_env, dofs, oprsz, maxsz,
NULL, NULL, -(cond == TCG_COND_ALWAYS));
return;
}
@@ -3892,7 +3892,7 @@ void tcg_gen_gvec_cmps(TCGCond cond, unsigned vece, uint32_t dofs,
check_overlap_2(dofs, aofs, maxsz);
if (cond == TCG_COND_NEVER || cond == TCG_COND_ALWAYS) {
- do_dup(MO_8, dofs, oprsz, maxsz,
+ do_dup(MO_8, tcg_env, dofs, oprsz, maxsz,
NULL, NULL, -(cond == TCG_COND_ALWAYS));
return;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 03/61] tcg: Add dbase argument to expand_clr
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
2025-02-06 19:56 ` [PATCH 01/61] tcg: Add dbase argument to do_dup_store Richard Henderson
2025-02-06 19:56 ` [PATCH 02/61] tcg: Add dbase argument to do_dup Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 04/61] tcg: Add base arguments to check_overlap_[234] Richard Henderson
` (58 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op-gvec.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 451091753d..c26cfb24cc 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -380,7 +380,7 @@ static inline bool check_size_impl(uint32_t oprsz, uint32_t lnsz)
return q <= MAX_UNROLL;
}
-static void expand_clr(uint32_t dofs, uint32_t maxsz);
+static void expand_clr(TCGv_ptr dbase, uint32_t dofs, uint32_t maxsz);
/* Duplicate C as per VECE. */
uint64_t (dup_const)(unsigned vece, uint64_t c)
@@ -526,7 +526,7 @@ static void do_dup_store(TCGType type, TCGv_ptr dbase, uint32_t dofs,
}
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(dbase, dofs + oprsz, maxsz - oprsz);
}
}
@@ -703,14 +703,14 @@ static void do_dup(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
done:
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(dbase, dofs + oprsz, maxsz - oprsz);
}
}
/* Likewise, but with zero. */
-static void expand_clr(uint32_t dofs, uint32_t maxsz)
+static void expand_clr(TCGv_ptr dbase, uint32_t dofs, uint32_t maxsz)
{
- do_dup(MO_8, tcg_env, dofs, maxsz, maxsz, NULL, NULL, 0);
+ do_dup(MO_8, dbase, dofs, maxsz, maxsz, NULL, NULL, 0);
}
/* Expand OPSZ bytes worth of two-operand operations using i32 elements. */
@@ -1255,7 +1255,7 @@ void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1324,7 +1324,7 @@ void tcg_gen_gvec_2i(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1401,7 +1401,7 @@ void tcg_gen_gvec_2s(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
}
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1467,7 +1467,7 @@ void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1536,7 +1536,7 @@ void tcg_gen_gvec_3i(uint32_t dofs, uint32_t aofs, uint32_t bofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1605,7 +1605,7 @@ void tcg_gen_gvec_4(uint32_t dofs, uint32_t aofs, uint32_t bofs, uint32_t cofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1674,7 +1674,7 @@ void tcg_gen_gvec_4i(uint32_t dofs, uint32_t aofs, uint32_t bofs, uint32_t cofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -1701,7 +1701,7 @@ void tcg_gen_gvec_mov(unsigned vece, uint32_t dofs, uint32_t aofs,
} else {
check_size_align(oprsz, maxsz, dofs);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
}
@@ -1779,7 +1779,7 @@ void tcg_gen_gvec_dup_mem(unsigned vece, uint32_t dofs, uint32_t aofs,
tcg_temp_free_i64(in1);
}
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
} else if (vece == 5) {
/* 256-bit duplicate. */
@@ -1822,7 +1822,7 @@ void tcg_gen_gvec_dup_mem(unsigned vece, uint32_t dofs, uint32_t aofs,
}
}
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
} else {
g_assert_not_reached();
@@ -3255,7 +3255,7 @@ do_gvec_shifts(unsigned vece, uint32_t dofs, uint32_t aofs, TCGv_i32 shift,
clear_tail:
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -3834,7 +3834,7 @@ void tcg_gen_gvec_cmp(TCGCond cond, unsigned vece, uint32_t dofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
@@ -3975,7 +3975,7 @@ void tcg_gen_gvec_cmps(TCGCond cond, unsigned vece, uint32_t dofs,
}
if (oprsz < maxsz) {
- expand_clr(dofs + oprsz, maxsz - oprsz);
+ expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 04/61] tcg: Add base arguments to check_overlap_[234]
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (2 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 03/61] tcg: Add dbase argument to expand_clr Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 05/61] tcg: Split out tcg_gen_gvec_2_var Richard Henderson
` (57 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op-gvec.c | 55 ++++++++++++++++++++++++++---------------------
1 file changed, 31 insertions(+), 24 deletions(-)
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index c26cfb24cc..54304d08cc 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -58,29 +58,34 @@ static void check_size_align(uint32_t oprsz, uint32_t maxsz, uint32_t ofs)
}
/* Verify vector overlap rules for two operands. */
-static void check_overlap_2(uint32_t d, uint32_t a, uint32_t s)
+static void check_overlap_2(TCGv_ptr dbase, uint32_t d,
+ TCGv_ptr abase, uint32_t a, uint32_t s)
{
- tcg_debug_assert(d == a || d + s <= a || a + s <= d);
+ tcg_debug_assert(dbase != abase || d == a || d + s <= a || a + s <= d);
}
/* Verify vector overlap rules for three operands. */
-static void check_overlap_3(uint32_t d, uint32_t a, uint32_t b, uint32_t s)
+static void check_overlap_3(TCGv_ptr dbase, uint32_t d,
+ TCGv_ptr abase, uint32_t a,
+ TCGv_ptr bbase, uint32_t b, uint32_t s)
{
- check_overlap_2(d, a, s);
- check_overlap_2(d, b, s);
- check_overlap_2(a, b, s);
+ check_overlap_2(dbase, d, abase, a, s);
+ check_overlap_2(dbase, d, bbase, b, s);
+ check_overlap_2(abase, a, bbase, b, s);
}
/* Verify vector overlap rules for four operands. */
-static void check_overlap_4(uint32_t d, uint32_t a, uint32_t b,
- uint32_t c, uint32_t s)
+static void check_overlap_4(TCGv_ptr dbase, uint32_t d,
+ TCGv_ptr abase, uint32_t a,
+ TCGv_ptr bbase, uint32_t b,
+ TCGv_ptr cbase, uint32_t c, uint32_t s)
{
- check_overlap_2(d, a, s);
- check_overlap_2(d, b, s);
- check_overlap_2(d, c, s);
- check_overlap_2(a, b, s);
- check_overlap_2(a, c, s);
- check_overlap_2(b, c, s);
+ check_overlap_2(dbase, d, abase, a, s);
+ check_overlap_2(dbase, d, bbase, b, s);
+ check_overlap_2(dbase, d, cbase, c, s);
+ check_overlap_2(abase, a, bbase, b, s);
+ check_overlap_2(abase, a, cbase, c, s);
+ check_overlap_2(bbase, b, cbase, c, s);
}
/* Create a descriptor from components. */
@@ -1205,7 +1210,7 @@ void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs);
- check_overlap_2(dofs, aofs, maxsz);
+ check_overlap_2(tcg_env, dofs, tcg_env, aofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1269,7 +1274,7 @@ void tcg_gen_gvec_2i(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs);
- check_overlap_2(dofs, aofs, maxsz);
+ check_overlap_2(tcg_env, dofs, tcg_env, aofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1335,7 +1340,7 @@ void tcg_gen_gvec_2s(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
TCGType type;
check_size_align(oprsz, maxsz, dofs | aofs);
- check_overlap_2(dofs, aofs, maxsz);
+ check_overlap_2(tcg_env, dofs, tcg_env, aofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1415,7 +1420,7 @@ void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs | bofs);
- check_overlap_3(dofs, aofs, bofs, maxsz);
+ check_overlap_3(tcg_env, dofs, tcg_env, aofs, tcg_env, bofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1482,7 +1487,7 @@ void tcg_gen_gvec_3i(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs | bofs);
- check_overlap_3(dofs, aofs, bofs, maxsz);
+ check_overlap_3(tcg_env, dofs, tcg_env, aofs, tcg_env, bofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1550,7 +1555,8 @@ void tcg_gen_gvec_4(uint32_t dofs, uint32_t aofs, uint32_t bofs, uint32_t cofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs | bofs | cofs);
- check_overlap_4(dofs, aofs, bofs, cofs, maxsz);
+ check_overlap_4(tcg_env, dofs, tcg_env, aofs,
+ tcg_env, bofs, tcg_env, cofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1620,7 +1626,8 @@ void tcg_gen_gvec_4i(uint32_t dofs, uint32_t aofs, uint32_t bofs, uint32_t cofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs | bofs | cofs);
- check_overlap_4(dofs, aofs, bofs, cofs, maxsz);
+ check_overlap_4(tcg_env, dofs, tcg_env, aofs,
+ tcg_env, bofs, tcg_env, cofs, maxsz);
type = 0;
if (g->fniv) {
@@ -3149,7 +3156,7 @@ do_gvec_shifts(unsigned vece, uint32_t dofs, uint32_t aofs, TCGv_i32 shift,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs);
- check_overlap_2(dofs, aofs, maxsz);
+ check_overlap_2(tcg_env, dofs, tcg_env, aofs, maxsz);
/* If the backend has a scalar expansion, great. */
type = choose_vector_type(g->s_list, vece, oprsz, vece == MO_64);
@@ -3769,7 +3776,7 @@ void tcg_gen_gvec_cmp(TCGCond cond, unsigned vece, uint32_t dofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs | bofs);
- check_overlap_3(dofs, aofs, bofs, maxsz);
+ check_overlap_3(tcg_env, dofs, tcg_env, aofs, tcg_env, bofs, maxsz);
if (cond == TCG_COND_NEVER || cond == TCG_COND_ALWAYS) {
do_dup(MO_8, tcg_env, dofs, oprsz, maxsz,
@@ -3889,7 +3896,7 @@ void tcg_gen_gvec_cmps(TCGCond cond, unsigned vece, uint32_t dofs,
TCGType type;
check_size_align(oprsz, maxsz, dofs | aofs);
- check_overlap_2(dofs, aofs, maxsz);
+ check_overlap_2(tcg_env, dofs, tcg_env, aofs, maxsz);
if (cond == TCG_COND_NEVER || cond == TCG_COND_ALWAYS) {
do_dup(MO_8, tcg_env, dofs, oprsz, maxsz,
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 05/61] tcg: Split out tcg_gen_gvec_2_var
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (3 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 04/61] tcg: Add base arguments to check_overlap_[234] Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 06/61] tcg: Split out tcg_gen_gvec_3_var Richard Henderson
` (56 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
include/tcg/tcg-op-gvec-common.h | 3 ++
tcg/tcg-op-gvec.c | 85 ++++++++++++++++++++------------
2 files changed, 56 insertions(+), 32 deletions(-)
diff --git a/include/tcg/tcg-op-gvec-common.h b/include/tcg/tcg-op-gvec-common.h
index 65553f5f97..877871c101 100644
--- a/include/tcg/tcg-op-gvec-common.h
+++ b/include/tcg/tcg-op-gvec-common.h
@@ -227,6 +227,9 @@ typedef struct {
bool prefer_i64;
} GVecGen4i;
+void tcg_gen_gvec_2_var(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz, const GVecGen2 *);
void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
uint32_t oprsz, uint32_t maxsz, const GVecGen2 *);
void tcg_gen_gvec_2i(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 54304d08cc..08019421f2 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -129,9 +129,10 @@ uint32_t simd_desc(uint32_t oprsz, uint32_t maxsz, int32_t data)
}
/* Generate a call to a gvec-style helper with two vector operands. */
-void tcg_gen_gvec_2_ool(uint32_t dofs, uint32_t aofs,
- uint32_t oprsz, uint32_t maxsz, int32_t data,
- gen_helper_gvec_2 *fn)
+static void expand_2_ool(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz,
+ int32_t data, gen_helper_gvec_2 *fn)
{
TCGv_ptr a0, a1;
TCGv_i32 desc = tcg_constant_i32(simd_desc(oprsz, maxsz, data));
@@ -139,8 +140,8 @@ void tcg_gen_gvec_2_ool(uint32_t dofs, uint32_t aofs,
a0 = tcg_temp_ebb_new_ptr();
a1 = tcg_temp_ebb_new_ptr();
- tcg_gen_addi_ptr(a0, tcg_env, dofs);
- tcg_gen_addi_ptr(a1, tcg_env, aofs);
+ tcg_gen_addi_ptr(a0, dbase, dofs);
+ tcg_gen_addi_ptr(a1, abase, aofs);
fn(a0, a1, desc);
@@ -148,6 +149,13 @@ void tcg_gen_gvec_2_ool(uint32_t dofs, uint32_t aofs,
tcg_temp_free_ptr(a1);
}
+void tcg_gen_gvec_2_ool(uint32_t dofs, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz, int32_t data,
+ gen_helper_gvec_2 *fn)
+{
+ expand_2_ool(tcg_env, dofs, tcg_env, aofs, oprsz, maxsz, data, fn);
+}
+
/* Generate a call to a gvec-style helper with two vector operands
and one scalar operand. */
void tcg_gen_gvec_2i_ool(uint32_t dofs, uint32_t aofs, TCGv_i64 c,
@@ -719,20 +727,21 @@ static void expand_clr(TCGv_ptr dbase, uint32_t dofs, uint32_t maxsz)
}
/* Expand OPSZ bytes worth of two-operand operations using i32 elements. */
-static void expand_2_i32(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
- bool load_dest, void (*fni)(TCGv_i32, TCGv_i32))
+static void expand_2_i32(TCGv_ptr dbase, uint32_t dofs, TCGv_ptr abase,
+ uint32_t aofs, uint32_t oprsz, bool load_dest,
+ void (*fni)(TCGv_i32, TCGv_i32))
{
TCGv_i32 t0 = tcg_temp_new_i32();
TCGv_i32 t1 = tcg_temp_new_i32();
uint32_t i;
for (i = 0; i < oprsz; i += 4) {
- tcg_gen_ld_i32(t0, tcg_env, aofs + i);
+ tcg_gen_ld_i32(t0, abase, aofs + i);
if (load_dest) {
- tcg_gen_ld_i32(t1, tcg_env, dofs + i);
+ tcg_gen_ld_i32(t1, dbase, dofs + i);
}
fni(t1, t0);
- tcg_gen_st_i32(t1, tcg_env, dofs + i);
+ tcg_gen_st_i32(t1, dbase, dofs + i);
}
tcg_temp_free_i32(t0);
tcg_temp_free_i32(t1);
@@ -882,20 +891,21 @@ static void expand_4i_i32(uint32_t dofs, uint32_t aofs, uint32_t bofs,
}
/* Expand OPSZ bytes worth of two-operand operations using i64 elements. */
-static void expand_2_i64(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
- bool load_dest, void (*fni)(TCGv_i64, TCGv_i64))
+static void expand_2_i64(TCGv_ptr dbase, uint32_t dofs, TCGv_ptr abase,
+ uint32_t aofs, uint32_t oprsz, bool load_dest,
+ void (*fni)(TCGv_i64, TCGv_i64))
{
TCGv_i64 t0 = tcg_temp_new_i64();
TCGv_i64 t1 = tcg_temp_new_i64();
uint32_t i;
for (i = 0; i < oprsz; i += 8) {
- tcg_gen_ld_i64(t0, tcg_env, aofs + i);
+ tcg_gen_ld_i64(t0, abase, aofs + i);
if (load_dest) {
- tcg_gen_ld_i64(t1, tcg_env, dofs + i);
+ tcg_gen_ld_i64(t1, dbase, dofs + i);
}
fni(t1, t0);
- tcg_gen_st_i64(t1, tcg_env, dofs + i);
+ tcg_gen_st_i64(t1, dbase, dofs + i);
}
tcg_temp_free_i64(t0);
tcg_temp_free_i64(t1);
@@ -1045,7 +1055,8 @@ static void expand_4i_i64(uint32_t dofs, uint32_t aofs, uint32_t bofs,
}
/* Expand OPSZ bytes worth of two-operand operations using host vectors. */
-static void expand_2_vec(unsigned vece, uint32_t dofs, uint32_t aofs,
+static void expand_2_vec(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
uint32_t oprsz, uint32_t tysz, TCGType type,
bool load_dest,
void (*fni)(unsigned, TCGv_vec, TCGv_vec))
@@ -1054,12 +1065,12 @@ static void expand_2_vec(unsigned vece, uint32_t dofs, uint32_t aofs,
TCGv_vec t0 = tcg_temp_new_vec(type);
TCGv_vec t1 = tcg_temp_new_vec(type);
- tcg_gen_ld_vec(t0, tcg_env, aofs + i);
+ tcg_gen_ld_vec(t0, abase, aofs + i);
if (load_dest) {
- tcg_gen_ld_vec(t1, tcg_env, dofs + i);
+ tcg_gen_ld_vec(t1, dbase, dofs + i);
}
fni(vece, t1, t0);
- tcg_gen_st_vec(t1, tcg_env, dofs + i);
+ tcg_gen_st_vec(t1, dbase, dofs + i);
}
}
@@ -1201,8 +1212,9 @@ static void expand_4i_vec(unsigned vece, uint32_t dofs, uint32_t aofs,
}
/* Expand a vector two-operand operation. */
-void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
- uint32_t oprsz, uint32_t maxsz, const GVecGen2 *g)
+void tcg_gen_gvec_2_var(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz, const GVecGen2 *g)
{
const TCGOpcode *this_list = g->opt_opc ? : vecop_list_empty;
const TCGOpcode *hold_list = tcg_swap_vecop_list(this_list);
@@ -1210,7 +1222,7 @@ void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs);
- check_overlap_2(tcg_env, dofs, tcg_env, aofs, maxsz);
+ check_overlap_2(dbase, dofs, abase, aofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1223,8 +1235,8 @@ void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
* that e.g. size == 80 would be expanded with 2x32 + 1x16.
*/
some = QEMU_ALIGN_DOWN(oprsz, 32);
- expand_2_vec(g->vece, dofs, aofs, some, 32, TCG_TYPE_V256,
- g->load_dest, g->fniv);
+ expand_2_vec(g->vece, dbase, dofs, abase, aofs, some, 32,
+ TCG_TYPE_V256, g->load_dest, g->fniv);
if (some == oprsz) {
break;
}
@@ -1234,22 +1246,25 @@ void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
maxsz -= some;
/* fallthru */
case TCG_TYPE_V128:
- expand_2_vec(g->vece, dofs, aofs, oprsz, 16, TCG_TYPE_V128,
- g->load_dest, g->fniv);
+ expand_2_vec(g->vece, dbase, dofs, abase, aofs, oprsz, 16,
+ TCG_TYPE_V128, g->load_dest, g->fniv);
break;
case TCG_TYPE_V64:
- expand_2_vec(g->vece, dofs, aofs, oprsz, 8, TCG_TYPE_V64,
- g->load_dest, g->fniv);
+ expand_2_vec(g->vece, dbase, dofs, abase, aofs, oprsz, 8,
+ TCG_TYPE_V64, g->load_dest, g->fniv);
break;
case 0:
if (g->fni8 && check_size_impl(oprsz, 8)) {
- expand_2_i64(dofs, aofs, oprsz, g->load_dest, g->fni8);
+ expand_2_i64(dbase, dofs, abase, aofs,
+ oprsz, g->load_dest, g->fni8);
} else if (g->fni4 && check_size_impl(oprsz, 4)) {
- expand_2_i32(dofs, aofs, oprsz, g->load_dest, g->fni4);
+ expand_2_i32(dbase, dofs, abase, aofs,
+ oprsz, g->load_dest, g->fni4);
} else {
assert(g->fno != NULL);
- tcg_gen_gvec_2_ool(dofs, aofs, oprsz, maxsz, g->data, g->fno);
+ expand_2_ool(dbase, dofs, abase, aofs,
+ oprsz, maxsz, g->data, g->fno);
oprsz = maxsz;
}
break;
@@ -1260,10 +1275,16 @@ void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
+ expand_clr(dbase, dofs + oprsz, maxsz - oprsz);
}
}
+void tcg_gen_gvec_2(uint32_t dofs, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz, const GVecGen2 *g)
+{
+ tcg_gen_gvec_2_var(tcg_env, dofs, tcg_env, aofs, oprsz, maxsz, g);
+}
+
/* Expand a vector operation with two vectors and an immediate. */
void tcg_gen_gvec_2i(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
uint32_t maxsz, int64_t c, const GVecGen2i *g)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 06/61] tcg: Split out tcg_gen_gvec_3_var
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (4 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 05/61] tcg: Split out tcg_gen_gvec_2_var Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 07/61] tcg: Split out tcg_gen_gvec_mov_var Richard Henderson
` (55 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
include/tcg/tcg-op-gvec-common.h | 4 ++
tcg/tcg-op-gvec.c | 102 +++++++++++++++++++------------
2 files changed, 68 insertions(+), 38 deletions(-)
diff --git a/include/tcg/tcg-op-gvec-common.h b/include/tcg/tcg-op-gvec-common.h
index 877871c101..6e8fccad01 100644
--- a/include/tcg/tcg-op-gvec-common.h
+++ b/include/tcg/tcg-op-gvec-common.h
@@ -236,6 +236,10 @@ void tcg_gen_gvec_2i(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
uint32_t maxsz, int64_t c, const GVecGen2i *);
void tcg_gen_gvec_2s(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
uint32_t maxsz, TCGv_i64 c, const GVecGen2s *);
+void tcg_gen_gvec_3_var(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz, const GVecGen3 *);
void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t oprsz, uint32_t maxsz, const GVecGen3 *);
void tcg_gen_gvec_3i(uint32_t dofs, uint32_t aofs, uint32_t bofs,
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 08019421f2..3e53e43354 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -178,9 +178,11 @@ void tcg_gen_gvec_2i_ool(uint32_t dofs, uint32_t aofs, TCGv_i64 c,
}
/* Generate a call to a gvec-style helper with three vector operands. */
-void tcg_gen_gvec_3_ool(uint32_t dofs, uint32_t aofs, uint32_t bofs,
- uint32_t oprsz, uint32_t maxsz, int32_t data,
- gen_helper_gvec_3 *fn)
+static void expand_3_ool(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz,
+ int32_t data, gen_helper_gvec_3 *fn)
{
TCGv_ptr a0, a1, a2;
TCGv_i32 desc = tcg_constant_i32(simd_desc(oprsz, maxsz, data));
@@ -189,9 +191,9 @@ void tcg_gen_gvec_3_ool(uint32_t dofs, uint32_t aofs, uint32_t bofs,
a1 = tcg_temp_ebb_new_ptr();
a2 = tcg_temp_ebb_new_ptr();
- tcg_gen_addi_ptr(a0, tcg_env, dofs);
- tcg_gen_addi_ptr(a1, tcg_env, aofs);
- tcg_gen_addi_ptr(a2, tcg_env, bofs);
+ tcg_gen_addi_ptr(a0, dbase, dofs);
+ tcg_gen_addi_ptr(a1, abase, aofs);
+ tcg_gen_addi_ptr(a2, bbase, bofs);
fn(a0, a1, a2, desc);
@@ -200,6 +202,14 @@ void tcg_gen_gvec_3_ool(uint32_t dofs, uint32_t aofs, uint32_t bofs,
tcg_temp_free_ptr(a2);
}
+void tcg_gen_gvec_3_ool(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz, int32_t data,
+ gen_helper_gvec_3 *fn)
+{
+ expand_3_ool(tcg_env, dofs, tcg_env, aofs, tcg_env, bofs,
+ oprsz, maxsz, data, fn);
+}
+
/* Generate a call to a gvec-style helper with four vector operands. */
void tcg_gen_gvec_4_ool(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t cofs, uint32_t oprsz, uint32_t maxsz,
@@ -789,8 +799,10 @@ static void expand_2s_i32(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
}
/* Expand OPSZ bytes worth of three-operand operations using i32 elements. */
-static void expand_3_i32(uint32_t dofs, uint32_t aofs,
- uint32_t bofs, uint32_t oprsz, bool load_dest,
+static void expand_3_i32(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, bool load_dest,
void (*fni)(TCGv_i32, TCGv_i32, TCGv_i32))
{
TCGv_i32 t0 = tcg_temp_new_i32();
@@ -799,13 +811,13 @@ static void expand_3_i32(uint32_t dofs, uint32_t aofs,
uint32_t i;
for (i = 0; i < oprsz; i += 4) {
- tcg_gen_ld_i32(t0, tcg_env, aofs + i);
- tcg_gen_ld_i32(t1, tcg_env, bofs + i);
+ tcg_gen_ld_i32(t0, abase, aofs + i);
+ tcg_gen_ld_i32(t1, bbase, bofs + i);
if (load_dest) {
- tcg_gen_ld_i32(t2, tcg_env, dofs + i);
+ tcg_gen_ld_i32(t2, dbase, dofs + i);
}
fni(t2, t0, t1);
- tcg_gen_st_i32(t2, tcg_env, dofs + i);
+ tcg_gen_st_i32(t2, dbase, dofs + i);
}
tcg_temp_free_i32(t2);
tcg_temp_free_i32(t1);
@@ -953,8 +965,10 @@ static void expand_2s_i64(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
}
/* Expand OPSZ bytes worth of three-operand operations using i64 elements. */
-static void expand_3_i64(uint32_t dofs, uint32_t aofs,
- uint32_t bofs, uint32_t oprsz, bool load_dest,
+static void expand_3_i64(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, bool load_dest,
void (*fni)(TCGv_i64, TCGv_i64, TCGv_i64))
{
TCGv_i64 t0 = tcg_temp_new_i64();
@@ -963,13 +977,13 @@ static void expand_3_i64(uint32_t dofs, uint32_t aofs,
uint32_t i;
for (i = 0; i < oprsz; i += 8) {
- tcg_gen_ld_i64(t0, tcg_env, aofs + i);
- tcg_gen_ld_i64(t1, tcg_env, bofs + i);
+ tcg_gen_ld_i64(t0, abase, aofs + i);
+ tcg_gen_ld_i64(t1, bbase, bofs + i);
if (load_dest) {
- tcg_gen_ld_i64(t2, tcg_env, dofs + i);
+ tcg_gen_ld_i64(t2, dbase, dofs + i);
}
fni(t2, t0, t1);
- tcg_gen_st_i64(t2, tcg_env, dofs + i);
+ tcg_gen_st_i64(t2, dbase, dofs + i);
}
tcg_temp_free_i64(t2);
tcg_temp_free_i64(t1);
@@ -1114,8 +1128,9 @@ static void expand_2s_vec(unsigned vece, uint32_t dofs, uint32_t aofs,
}
/* Expand OPSZ bytes worth of three-operand operations using host vectors. */
-static void expand_3_vec(unsigned vece, uint32_t dofs, uint32_t aofs,
- uint32_t bofs, uint32_t oprsz,
+static void expand_3_vec(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs, uint32_t oprsz,
uint32_t tysz, TCGType type, bool load_dest,
void (*fni)(unsigned, TCGv_vec, TCGv_vec, TCGv_vec))
{
@@ -1124,13 +1139,13 @@ static void expand_3_vec(unsigned vece, uint32_t dofs, uint32_t aofs,
TCGv_vec t1 = tcg_temp_new_vec(type);
TCGv_vec t2 = tcg_temp_new_vec(type);
- tcg_gen_ld_vec(t0, tcg_env, aofs + i);
- tcg_gen_ld_vec(t1, tcg_env, bofs + i);
+ tcg_gen_ld_vec(t0, abase, aofs + i);
+ tcg_gen_ld_vec(t1, bbase, bofs + i);
if (load_dest) {
- tcg_gen_ld_vec(t2, tcg_env, dofs + i);
+ tcg_gen_ld_vec(t2, dbase, dofs + i);
}
fni(vece, t2, t0, t1);
- tcg_gen_st_vec(t2, tcg_env, dofs + i);
+ tcg_gen_st_vec(t2, dbase, dofs + i);
}
}
@@ -1432,8 +1447,10 @@ void tcg_gen_gvec_2s(uint32_t dofs, uint32_t aofs, uint32_t oprsz,
}
/* Expand a vector three-operand operation. */
-void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
- uint32_t oprsz, uint32_t maxsz, const GVecGen3 *g)
+void tcg_gen_gvec_3_var(TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz, const GVecGen3 *g)
{
const TCGOpcode *this_list = g->opt_opc ? : vecop_list_empty;
const TCGOpcode *hold_list = tcg_swap_vecop_list(this_list);
@@ -1441,7 +1458,7 @@ void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t some;
check_size_align(oprsz, maxsz, dofs | aofs | bofs);
- check_overlap_3(tcg_env, dofs, tcg_env, aofs, tcg_env, bofs, maxsz);
+ check_overlap_3(dbase, dofs, abase, aofs, bbase, bofs, maxsz);
type = 0;
if (g->fniv) {
@@ -1454,8 +1471,8 @@ void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
* that e.g. size == 80 would be expanded with 2x32 + 1x16.
*/
some = QEMU_ALIGN_DOWN(oprsz, 32);
- expand_3_vec(g->vece, dofs, aofs, bofs, some, 32, TCG_TYPE_V256,
- g->load_dest, g->fniv);
+ expand_3_vec(g->vece, dbase, dofs, abase, aofs, bbase, bofs,
+ some, 32, TCG_TYPE_V256, g->load_dest, g->fniv);
if (some == oprsz) {
break;
}
@@ -1466,23 +1483,25 @@ void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
maxsz -= some;
/* fallthru */
case TCG_TYPE_V128:
- expand_3_vec(g->vece, dofs, aofs, bofs, oprsz, 16, TCG_TYPE_V128,
- g->load_dest, g->fniv);
+ expand_3_vec(g->vece, dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, 16, TCG_TYPE_V128, g->load_dest, g->fniv);
break;
case TCG_TYPE_V64:
- expand_3_vec(g->vece, dofs, aofs, bofs, oprsz, 8, TCG_TYPE_V64,
- g->load_dest, g->fniv);
+ expand_3_vec(g->vece, dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, 8, TCG_TYPE_V64, g->load_dest, g->fniv);
break;
case 0:
if (g->fni8 && check_size_impl(oprsz, 8)) {
- expand_3_i64(dofs, aofs, bofs, oprsz, g->load_dest, g->fni8);
+ expand_3_i64(dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, g->load_dest, g->fni8);
} else if (g->fni4 && check_size_impl(oprsz, 4)) {
- expand_3_i32(dofs, aofs, bofs, oprsz, g->load_dest, g->fni4);
+ expand_3_i32(dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, g->load_dest, g->fni4);
} else {
assert(g->fno != NULL);
- tcg_gen_gvec_3_ool(dofs, aofs, bofs, oprsz,
- maxsz, g->data, g->fno);
+ expand_3_ool(dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, maxsz, g->data, g->fno);
oprsz = maxsz;
}
break;
@@ -1493,10 +1512,17 @@ void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
tcg_swap_vecop_list(hold_list);
if (oprsz < maxsz) {
- expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
+ expand_clr(dbase, dofs + oprsz, maxsz - oprsz);
}
}
+void tcg_gen_gvec_3(uint32_t dofs, uint32_t aofs, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz, const GVecGen3 *g)
+{
+ tcg_gen_gvec_3_var(tcg_env, dofs, tcg_env, aofs, tcg_env, bofs,
+ oprsz, maxsz, g);
+}
+
/* Expand a vector operation with three vectors and an immediate. */
void tcg_gen_gvec_3i(uint32_t dofs, uint32_t aofs, uint32_t bofs,
uint32_t oprsz, uint32_t maxsz, int64_t c,
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 07/61] tcg: Split out tcg_gen_gvec_mov_var
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (5 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 06/61] tcg: Split out tcg_gen_gvec_3_var Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 08/61] tcg: Split out tcg_gen_gvec_{add,sub}_var Richard Henderson
` (54 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
include/tcg/tcg-op-gvec-common.h | 4 ++++
tcg/tcg-op-gvec.c | 21 +++++++++++++++------
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/include/tcg/tcg-op-gvec-common.h b/include/tcg/tcg-op-gvec-common.h
index 6e8fccad01..cabbe957c8 100644
--- a/include/tcg/tcg-op-gvec-common.h
+++ b/include/tcg/tcg-op-gvec-common.h
@@ -253,6 +253,10 @@ void tcg_gen_gvec_4i(uint32_t dofs, uint32_t aofs, uint32_t bofs, uint32_t cofs,
/* Expand a specific vector operation. */
+void tcg_gen_gvec_mov_var(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz);
+
void tcg_gen_gvec_mov(unsigned vece, uint32_t dofs, uint32_t aofs,
uint32_t oprsz, uint32_t maxsz);
void tcg_gen_gvec_not(unsigned vece, uint32_t dofs, uint32_t aofs,
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 3e53e43354..5e58b1cc75 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -1741,8 +1741,9 @@ static void vec_mov2(unsigned vece, TCGv_vec a, TCGv_vec b)
tcg_gen_mov_vec(a, b);
}
-void tcg_gen_gvec_mov(unsigned vece, uint32_t dofs, uint32_t aofs,
- uint32_t oprsz, uint32_t maxsz)
+void tcg_gen_gvec_mov_var(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz)
{
static const GVecGen2 g = {
.fni8 = tcg_gen_mov_i64,
@@ -1750,14 +1751,22 @@ void tcg_gen_gvec_mov(unsigned vece, uint32_t dofs, uint32_t aofs,
.fno = gen_helper_gvec_mov,
.prefer_i64 = TCG_TARGET_REG_BITS == 64,
};
- if (dofs != aofs) {
- tcg_gen_gvec_2(dofs, aofs, oprsz, maxsz, &g);
- } else {
+
+ if (dofs == aofs && dbase == abase) {
check_size_align(oprsz, maxsz, dofs);
if (oprsz < maxsz) {
- expand_clr(tcg_env, dofs + oprsz, maxsz - oprsz);
+ expand_clr(dbase, dofs + oprsz, maxsz - oprsz);
}
+ return;
}
+
+ tcg_gen_gvec_2_var(dbase, dofs, abase, aofs, oprsz, maxsz, &g);
+}
+
+void tcg_gen_gvec_mov(unsigned vece, uint32_t dofs, uint32_t aofs,
+ uint32_t oprsz, uint32_t maxsz)
+{
+ tcg_gen_gvec_mov_var(vece, tcg_env, dofs, tcg_env, aofs, oprsz, maxsz);
}
void tcg_gen_gvec_dup_i32(unsigned vece, uint32_t dofs, uint32_t oprsz,
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 08/61] tcg: Split out tcg_gen_gvec_{add,sub}_var
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (6 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 07/61] tcg: Split out tcg_gen_gvec_mov_var Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 09/61] target/arm: Introduce FPST_ZA, FPST_ZA_F16 Richard Henderson
` (53 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
include/tcg/tcg-op-gvec-common.h | 9 +++++++++
tcg/tcg-op-gvec.c | 32 ++++++++++++++++++++++++++------
2 files changed, 35 insertions(+), 6 deletions(-)
diff --git a/include/tcg/tcg-op-gvec-common.h b/include/tcg/tcg-op-gvec-common.h
index cabbe957c8..fbe5a68a7e 100644
--- a/include/tcg/tcg-op-gvec-common.h
+++ b/include/tcg/tcg-op-gvec-common.h
@@ -266,6 +266,15 @@ void tcg_gen_gvec_neg(unsigned vece, uint32_t dofs, uint32_t aofs,
void tcg_gen_gvec_abs(unsigned vece, uint32_t dofs, uint32_t aofs,
uint32_t oprsz, uint32_t maxsz);
+void tcg_gen_gvec_add_var(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz);
+void tcg_gen_gvec_sub_var(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz);
+
void tcg_gen_gvec_add(unsigned vece, uint32_t dofs, uint32_t aofs,
uint32_t bofs, uint32_t oprsz, uint32_t maxsz);
void tcg_gen_gvec_sub(unsigned vece, uint32_t dofs, uint32_t aofs,
diff --git a/tcg/tcg-op-gvec.c b/tcg/tcg-op-gvec.c
index 5e58b1cc75..d5fbd4e885 100644
--- a/tcg/tcg-op-gvec.c
+++ b/tcg/tcg-op-gvec.c
@@ -1994,8 +1994,10 @@ void tcg_gen_vec_add32_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
static const TCGOpcode vecop_list_add[] = { INDEX_op_add_vec, 0 };
-void tcg_gen_gvec_add(unsigned vece, uint32_t dofs, uint32_t aofs,
- uint32_t bofs, uint32_t oprsz, uint32_t maxsz)
+void tcg_gen_gvec_add_var(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz)
{
static const GVecGen3 g[4] = {
{ .fni8 = tcg_gen_vec_add8_i64,
@@ -2022,7 +2024,15 @@ void tcg_gen_gvec_add(unsigned vece, uint32_t dofs, uint32_t aofs,
};
tcg_debug_assert(vece <= MO_64);
- tcg_gen_gvec_3(dofs, aofs, bofs, oprsz, maxsz, &g[vece]);
+ tcg_gen_gvec_3_var(dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, maxsz, &g[vece]);
+}
+
+void tcg_gen_gvec_add(unsigned vece, uint32_t dofs, uint32_t aofs,
+ uint32_t bofs, uint32_t oprsz, uint32_t maxsz)
+{
+ tcg_gen_gvec_add_var(vece, tcg_env, dofs, tcg_env, aofs, tcg_env, bofs,
+ oprsz, maxsz);
}
void tcg_gen_gvec_adds(unsigned vece, uint32_t dofs, uint32_t aofs,
@@ -2175,8 +2185,10 @@ void tcg_gen_vec_sub32_i64(TCGv_i64 d, TCGv_i64 a, TCGv_i64 b)
tcg_temp_free_i64(t2);
}
-void tcg_gen_gvec_sub(unsigned vece, uint32_t dofs, uint32_t aofs,
- uint32_t bofs, uint32_t oprsz, uint32_t maxsz)
+void tcg_gen_gvec_sub_var(unsigned vece, TCGv_ptr dbase, uint32_t dofs,
+ TCGv_ptr abase, uint32_t aofs,
+ TCGv_ptr bbase, uint32_t bofs,
+ uint32_t oprsz, uint32_t maxsz)
{
static const GVecGen3 g[4] = {
{ .fni8 = tcg_gen_vec_sub8_i64,
@@ -2203,7 +2215,15 @@ void tcg_gen_gvec_sub(unsigned vece, uint32_t dofs, uint32_t aofs,
};
tcg_debug_assert(vece <= MO_64);
- tcg_gen_gvec_3(dofs, aofs, bofs, oprsz, maxsz, &g[vece]);
+ tcg_gen_gvec_3_var(dbase, dofs, abase, aofs, bbase, bofs,
+ oprsz, maxsz, &g[vece]);
+}
+
+void tcg_gen_gvec_sub(unsigned vece, uint32_t dofs, uint32_t aofs,
+ uint32_t bofs, uint32_t oprsz, uint32_t maxsz)
+{
+ tcg_gen_gvec_sub_var(vece, tcg_env, dofs, tcg_env, aofs, tcg_env, bofs,
+ oprsz, maxsz);
}
static const TCGOpcode vecop_list_mul[] = { INDEX_op_mul_vec, 0 };
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 09/61] target/arm: Introduce FPST_ZA, FPST_ZA_F16
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (7 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 08/61] tcg: Split out tcg_gen_gvec_{add,sub}_var Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 10/61] target/arm: Use FPST_ZA for sme_fmopa_[hsd] Richard Henderson
` (52 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Rather than repeatedly copying FPST_FPCR to locals
and setting default nan mode, create dedicated float_status.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.h | 12 +++++++++++-
target/arm/cpu.c | 4 ++++
target/arm/vfp_helper.c | 10 ++++++++++
3 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index e6513ef1e5..42c39ac6bd 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -219,6 +219,8 @@ typedef struct NVICState NVICState;
* when FPCR.AH == 1 (bfloat16 conversions and multiplies,
* and the reciprocal and square root estimate/step insns);
* for half-precision
+ * ZA: the "streaming sve" fp status.
+ * ZA_F16: likewise for half-precision.
*
* Half-precision operations are governed by a separate
* flush-to-zero control bit in FPSCR:FZ16. We pass a separate
@@ -239,6 +241,12 @@ typedef struct NVICState NVICState;
* they ignore FPCR.RMode. But they don't ignore FPCR.FZ16,
* which means we need an FPST_AH_F16 as well.
*
+ * The "ZA" float_status are for Streaming SVE operations which use
+ * default-NaN and do not generate fp exceptions, which means that they
+ * do not accumulate exception bits back into FPCR.
+ * See e.g. FPAdd vs FPAdd_ZA pseudocode functions, and the setting
+ * of fpcr.DN and fpexec parameters.
+ *
* To avoid having to transfer exception bits around, we simply
* say that the FPSCR cumulative exception flags are the logical
* OR of the flags in the four fp statuses. This relies on the
@@ -252,10 +260,12 @@ typedef enum ARMFPStatusFlavour {
FPST_A64_F16,
FPST_AH,
FPST_AH_F16,
+ FPST_ZA,
+ FPST_ZA_F16,
FPST_STD,
FPST_STD_F16,
} ARMFPStatusFlavour;
-#define FPST_COUNT 8
+#define FPST_COUNT 10
typedef struct CPUArchState {
/* Regs for current mode. */
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 180e11c5d7..ba08c05ec6 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -550,11 +550,15 @@ static void arm_cpu_reset_hold(Object *obj, ResetType type)
set_flush_inputs_to_zero(1, &env->vfp.fp_status[FPST_STD]);
set_default_nan_mode(1, &env->vfp.fp_status[FPST_STD]);
set_default_nan_mode(1, &env->vfp.fp_status[FPST_STD_F16]);
+ set_default_nan_mode(1, &env->vfp.fp_status[FPST_ZA]);
+ set_default_nan_mode(1, &env->vfp.fp_status[FPST_ZA_F16]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_A32]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_A64]);
+ arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_ZA]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_STD]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_A32_F16]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_A64_F16]);
+ arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_ZA_F16]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_STD_F16]);
arm_set_ah_fp_behaviours(&env->vfp.fp_status[FPST_AH]);
set_flush_to_zero(1, &env->vfp.fp_status[FPST_AH]);
diff --git a/target/arm/vfp_helper.c b/target/arm/vfp_helper.c
index ff4b25bf28..f26c0d211e 100644
--- a/target/arm/vfp_helper.c
+++ b/target/arm/vfp_helper.c
@@ -202,6 +202,8 @@ static void vfp_set_fpcr_to_host(CPUARMState *env, uint32_t val, uint32_t mask)
set_float_rounding_mode(i, &env->vfp.fp_status[FPST_A64]);
set_float_rounding_mode(i, &env->vfp.fp_status[FPST_A32_F16]);
set_float_rounding_mode(i, &env->vfp.fp_status[FPST_A64_F16]);
+ set_float_rounding_mode(i, &env->vfp.fp_status[FPST_ZA]);
+ set_float_rounding_mode(i, &env->vfp.fp_status[FPST_ZA_F16]);
}
if (changed & FPCR_FZ16) {
bool ftz_enabled = val & FPCR_FZ16;
@@ -209,15 +211,18 @@ static void vfp_set_fpcr_to_host(CPUARMState *env, uint32_t val, uint32_t mask)
set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_A64_F16]);
set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_STD_F16]);
set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_AH_F16]);
+ set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_ZA_F16]);
set_flush_inputs_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_A32_F16]);
set_flush_inputs_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_A64_F16]);
set_flush_inputs_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_STD_F16]);
set_flush_inputs_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_AH_F16]);
+ set_flush_inputs_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_ZA_F16]);
}
if (changed & FPCR_FZ) {
bool ftz_enabled = val & FPCR_FZ;
set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_A32]);
set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_A64]);
+ set_flush_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_ZA]);
/* FIZ is A64 only so FZ always makes A32 code flush inputs to zero */
set_flush_inputs_to_zero(ftz_enabled, &env->vfp.fp_status[FPST_A32]);
}
@@ -229,6 +234,7 @@ static void vfp_set_fpcr_to_host(CPUARMState *env, uint32_t val, uint32_t mask)
bool fitz_enabled = (val & FPCR_FIZ) ||
(val & (FPCR_FZ | FPCR_AH)) == FPCR_FZ;
set_flush_inputs_to_zero(fitz_enabled, &env->vfp.fp_status[FPST_A64]);
+ set_flush_inputs_to_zero(fitz_enabled, &env->vfp.fp_status[FPST_ZA]);
}
if (changed & FPCR_DN) {
bool dnan_enabled = val & FPCR_DN;
@@ -246,9 +252,13 @@ static void vfp_set_fpcr_to_host(CPUARMState *env, uint32_t val, uint32_t mask)
/* Change behaviours for A64 FP operations */
arm_set_ah_fp_behaviours(&env->vfp.fp_status[FPST_A64]);
arm_set_ah_fp_behaviours(&env->vfp.fp_status[FPST_A64_F16]);
+ arm_set_ah_fp_behaviours(&env->vfp.fp_status[FPST_ZA]);
+ arm_set_ah_fp_behaviours(&env->vfp.fp_status[FPST_ZA_F16]);
} else {
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_A64]);
arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_A64_F16]);
+ arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_ZA]);
+ arm_set_default_fp_behaviours(&env->vfp.fp_status[FPST_ZA_F16]);
}
}
/*
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 10/61] target/arm: Use FPST_ZA for sme_fmopa_[hsd]
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (8 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 09/61] target/arm: Introduce FPST_ZA, FPST_ZA_F16 Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 11/61] target/arm: Rename zarray to za_state.za Richard Henderson
` (51 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/sme_helper.c | 37 ++++++++--------------------------
target/arm/tcg/translate-sme.c | 4 ++--
2 files changed, 10 insertions(+), 31 deletions(-)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index dcc48e43db..d4562502dd 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -904,20 +904,11 @@ void HELPER(sme_addva_d)(void *vzda, void *vzn, void *vpn,
}
void HELPER(sme_fmopa_s)(void *vza, void *vzn, void *vzm, void *vpn,
- void *vpm, float_status *fpst_in, uint32_t desc)
+ void *vpm, float_status *fpst, uint32_t desc)
{
intptr_t row, col, oprsz = simd_maxsz(desc);
uint32_t neg = simd_data(desc) << 31;
uint16_t *pn = vpn, *pm = vpm;
- float_status fpst;
-
- /*
- * Make a copy of float_status because this operation does not
- * update the cumulative fp exception status. It also produces
- * default nans.
- */
- fpst = *fpst_in;
- set_default_nan_mode(true, &fpst);
for (row = 0; row < oprsz; ) {
uint16_t pa = pn[H2(row >> 4)];
@@ -932,7 +923,7 @@ void HELPER(sme_fmopa_s)(void *vza, void *vzn, void *vzm, void *vpn,
if (pb & 1) {
uint32_t *a = vza_row + H1_4(col);
uint32_t *m = vzm + H1_4(col);
- *a = float32_muladd(n, *m, *a, 0, &fpst);
+ *a = float32_muladd(n, *m, *a, 0, fpst);
}
col += 4;
pb >>= 4;
@@ -946,15 +937,12 @@ void HELPER(sme_fmopa_s)(void *vza, void *vzn, void *vzm, void *vpn,
}
void HELPER(sme_fmopa_d)(void *vza, void *vzn, void *vzm, void *vpn,
- void *vpm, float_status *fpst_in, uint32_t desc)
+ void *vpm, float_status *fpst, uint32_t desc)
{
intptr_t row, col, oprsz = simd_oprsz(desc) / 8;
uint64_t neg = (uint64_t)simd_data(desc) << 63;
uint64_t *za = vza, *zn = vzn, *zm = vzm;
uint8_t *pn = vpn, *pm = vpm;
- float_status fpst = *fpst_in;
-
- set_default_nan_mode(true, &fpst);
for (row = 0; row < oprsz; ++row) {
if (pn[H1(row)] & 1) {
@@ -964,7 +952,7 @@ void HELPER(sme_fmopa_d)(void *vza, void *vzn, void *vzm, void *vpn,
for (col = 0; col < oprsz; ++col) {
if (pm[H1(col)] & 1) {
uint64_t *a = &za_row[col];
- *a = float64_muladd(n, zm[col], *a, 0, &fpst);
+ *a = float64_muladd(n, zm[col], *a, 0, fpst);
}
}
}
@@ -1035,19 +1023,8 @@ void HELPER(sme_fmopa_h)(void *vza, void *vzn, void *vzm, void *vpn,
intptr_t row, col, oprsz = simd_maxsz(desc);
uint32_t neg = simd_data(desc) * 0x80008000u;
uint16_t *pn = vpn, *pm = vpm;
- float_status fpst_odd, fpst_std, fpst_f16;
+ float_status fpst_odd = env->vfp.fp_status[FPST_ZA];
- /*
- * Make copies of the fp status fields we use, because this operation
- * does not update the cumulative fp exception status. It also
- * produces default NaNs. We also need a second copy of fp_status with
- * round-to-odd -- see above.
- */
- fpst_f16 = env->vfp.fp_status[FPST_A64_F16];
- fpst_std = env->vfp.fp_status[FPST_A64];
- set_default_nan_mode(true, &fpst_std);
- set_default_nan_mode(true, &fpst_f16);
- fpst_odd = fpst_std;
set_float_rounding_mode(float_round_to_odd, &fpst_odd);
for (row = 0; row < oprsz; ) {
@@ -1067,7 +1044,9 @@ void HELPER(sme_fmopa_h)(void *vza, void *vzn, void *vzm, void *vpn,
m = f16mop_adj_pair(m, pcol, 0);
*a = f16_dotadd(*a, n, m,
- &fpst_f16, &fpst_std, &fpst_odd);
+ &env->vfp.fp_status[FPST_ZA_F16],
+ &env->vfp.fp_status[FPST_ZA],
+ &fpst_odd);
}
col += 4;
pcol >>= 4;
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index fcbb350016..51175c923e 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -358,9 +358,9 @@ static bool do_outprod_env(DisasContext *s, arg_op *a, MemOp esz,
TRANS_FEAT(FMOPA_h, aa64_sme, do_outprod_env, a,
MO_32, gen_helper_sme_fmopa_h)
TRANS_FEAT(FMOPA_s, aa64_sme, do_outprod_fpst, a,
- MO_32, FPST_A64, gen_helper_sme_fmopa_s)
+ MO_32, FPST_ZA, gen_helper_sme_fmopa_s)
TRANS_FEAT(FMOPA_d, aa64_sme_f64f64, do_outprod_fpst, a,
- MO_64, FPST_A64, gen_helper_sme_fmopa_d)
+ MO_64, FPST_ZA, gen_helper_sme_fmopa_d)
TRANS_FEAT(BFMOPA, aa64_sme, do_outprod_env, a, MO_32, gen_helper_sme_bfmopa)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 11/61] target/arm: Rename zarray to za_state.za
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (9 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 10/61] target/arm: Use FPST_ZA for sme_fmopa_[hsd] Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 12/61] target/arm: Add isar_feature_aa64_sme2* Richard Henderson
` (50 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
The whole ZA state will also contain ZT0.
Make things easier in aarch64_set_svcr to zero both
by wrapping them in a common structure.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.h | 48 +++++++++++++++++++---------------
linux-user/aarch64/signal.c | 4 +--
target/arm/cpu.c | 4 +--
target/arm/helper.c | 2 +-
target/arm/machine.c | 2 +-
target/arm/tcg/sme_helper.c | 6 ++---
target/arm/tcg/translate-sme.c | 4 +--
7 files changed, 38 insertions(+), 32 deletions(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 42c39ac6bd..938c990854 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -733,27 +733,33 @@ typedef struct CPUArchState {
uint64_t scxtnum_el[4];
- /*
- * SME ZA storage -- 256 x 256 byte array, with bytes in host word order,
- * as we do with vfp.zregs[]. This corresponds to the architectural ZA
- * array, where ZA[N] is in the least-significant bytes of env->zarray[N].
- * When SVL is less than the architectural maximum, the accessible
- * storage is restricted, such that if the SVL is X bytes the guest can
- * see only the bottom X elements of zarray[], and only the least
- * significant X bytes of each element of the array. (In other words,
- * the observable part is always square.)
- *
- * The ZA storage can also be considered as a set of square tiles of
- * elements of different sizes. The mapping from tiles to the ZA array
- * is architecturally defined, such that for tiles of elements of esz
- * bytes, the Nth row (or "horizontal slice") of tile T is in
- * ZA[T + N * esz]. Note that this means that each tile is not contiguous
- * in the ZA storage, because its rows are striped through the ZA array.
- *
- * Because this is so large, keep this toward the end of the reset area,
- * to keep the offsets into the rest of the structure smaller.
- */
- ARMVectorReg zarray[ARM_MAX_VQ * 16];
+ struct {
+ /*
+ * SME ZA storage -- 256 x 256 byte array, with bytes in host
+ * word order, as we do with vfp.zregs[]. This corresponds to
+ * the architectural ZA array, where ZA[N] is in the least
+ * significant bytes of env->za_state.za[N].
+ *
+ * When SVL is less than the architectural maximum, the accessible
+ * storage is restricted, such that if the SVL is X bytes the guest
+ * can see only the bottom X elements of zarray[], and only the least
+ * significant X bytes of each element of the array. (In other words,
+ * the observable part is always square.)
+ *
+ * The ZA storage can also be considered as a set of square tiles of
+ * elements of different sizes. The mapping from tiles to the ZA array
+ * is architecturally defined, such that for tiles of elements of esz
+ * bytes, the Nth row (or "horizontal slice") of tile T is in
+ * ZA[T + N * esz]. Note that this means that each tile is not
+ * contiguous in the ZA storage, because its rows are striped through
+ * the ZA array.
+ *
+ * Because this is so large, keep this toward the end of the
+ * reset area, to keep the offsets into the rest of the structure
+ * smaller.
+ */
+ ARMVectorReg za[ARM_MAX_VQ * 16];
+ } za_state;
#endif
struct CPUBreakpoint *cpu_breakpoint[16];
diff --git a/linux-user/aarch64/signal.c b/linux-user/aarch64/signal.c
index bc7a13800d..d50cab78d8 100644
--- a/linux-user/aarch64/signal.c
+++ b/linux-user/aarch64/signal.c
@@ -248,7 +248,7 @@ static void target_setup_za_record(struct target_za_context *za,
for (i = 0; i < vl; ++i) {
uint64_t *z = (void *)za + TARGET_ZA_SIG_ZAV_OFFSET(vq, i);
for (j = 0; j < vq * 2; ++j) {
- __put_user_e(env->zarray[i].d[j], z + j, le);
+ __put_user_e(env->za_state.za[i].d[j], z + j, le);
}
}
}
@@ -397,7 +397,7 @@ static bool target_restore_za_record(CPUARMState *env,
for (i = 0; i < vl; ++i) {
uint64_t *z = (void *)za + TARGET_ZA_SIG_ZAV_OFFSET(vq, i);
for (j = 0; j < vq * 2; ++j) {
- __get_user_e(env->zarray[i].d[j], z + j, le);
+ __get_user_e(env->za_state.za[i].d[j], z + j, le);
}
}
return true;
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index ba08c05ec6..813cb45276 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1369,8 +1369,8 @@ static void aarch64_cpu_dump_state(CPUState *cs, FILE *f, int flags)
qemu_fprintf(f, "ZA[%0*d]=", svl_lg10, i);
for (j = zcr_len; j >= 0; --j) {
qemu_fprintf(f, "%016" PRIx64 ":%016" PRIx64 "%c",
- env->zarray[i].d[2 * j + 1],
- env->zarray[i].d[2 * j],
+ env->za_state.za[i].d[2 * j + 1],
+ env->za_state.za[i].d[2 * j],
j ? ':' : '\n');
}
}
diff --git a/target/arm/helper.c b/target/arm/helper.c
index 7d95eae997..e5f06bc288 100644
--- a/target/arm/helper.c
+++ b/target/arm/helper.c
@@ -6438,7 +6438,7 @@ void aarch64_set_svcr(CPUARMState *env, uint64_t new, uint64_t mask)
* when disabled either.
*/
if (change & new & R_SVCR_ZA_MASK) {
- memset(env->zarray, 0, sizeof(env->zarray));
+ memset(&env->za_state, 0, sizeof(env->za_state));
}
if (tcg_enabled()) {
diff --git a/target/arm/machine.c b/target/arm/machine.c
index 978249fb71..d41da414b3 100644
--- a/target/arm/machine.c
+++ b/target/arm/machine.c
@@ -315,7 +315,7 @@ static const VMStateDescription vmstate_za = {
.minimum_version_id = 1,
.needed = za_needed,
.fields = (const VMStateField[]) {
- VMSTATE_STRUCT_ARRAY(env.zarray, ARMCPU, ARM_MAX_VQ * 16, 0,
+ VMSTATE_STRUCT_ARRAY(env.za_state.za, ARMCPU, ARM_MAX_VQ * 16, 0,
vmstate_vreg, ARMVectorReg),
VMSTATE_END_OF_LIST()
}
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index d4562502dd..45f6cdfcb4 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -39,12 +39,12 @@ void helper_sme_zero(CPUARMState *env, uint32_t imm, uint32_t svl)
uint32_t i;
/*
- * Special case clearing the entire ZA space.
+ * Special case clearing the entire ZArray.
* This falls into the CONSTRAINED UNPREDICTABLE zeroing of any
* parts of the ZA storage outside of SVL.
*/
if (imm == 0xff) {
- memset(env->zarray, 0, sizeof(env->zarray));
+ memset(env->za_state.za, 0, sizeof(env->za_state.za));
return;
}
@@ -54,7 +54,7 @@ void helper_sme_zero(CPUARMState *env, uint32_t imm, uint32_t svl)
*/
for (i = 0; i < svl; i++) {
if (imm & (1 << (i % 8))) {
- memset(&env->zarray[i], 0, svl);
+ memset(&env->za_state.za[i], 0, svl);
}
}
}
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 51175c923e..e8b3578174 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -92,7 +92,7 @@ static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
offset = tile * sizeof(ARMVectorReg);
/* Include the byte offset of zarray to make this relative to env. */
- offset += offsetof(CPUARMState, zarray);
+ offset += offsetof(CPUARMState, za_state.za);
tcg_gen_addi_i32(tmp, tmp, offset);
/* Add the byte offset to env to produce the final pointer. */
@@ -112,7 +112,7 @@ static TCGv_ptr get_tile(DisasContext *s, int esz, int tile)
TCGv_ptr addr = tcg_temp_new_ptr();
int offset;
- offset = tile * sizeof(ARMVectorReg) + offsetof(CPUARMState, zarray);
+ offset = tile * sizeof(ARMVectorReg) + offsetof(CPUARMState, za_state.za);
tcg_gen_addi_ptr(addr, tcg_env, offset);
return addr;
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 12/61] target/arm: Add isar_feature_aa64_sme2*
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (10 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 11/61] target/arm: Rename zarray to za_state.za Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 13/61] target/arm: Add ZT0 Richard Henderson
` (49 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu-features.h | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/target/arm/cpu-features.h b/target/arm/cpu-features.h
index 525e4cee12..5383d6b2d7 100644
--- a/target/arm/cpu-features.h
+++ b/target/arm/cpu-features.h
@@ -930,6 +930,11 @@ static inline bool isar_feature_aa64_sve2(const ARMISARegisters *id)
return FIELD_EX64(id->id_aa64zfr0, ID_AA64ZFR0, SVEVER) != 0;
}
+static inline bool isar_feature_aa64_sve2p1(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64zfr0, ID_AA64ZFR0, SVEVER) >= 2;
+}
+
static inline bool isar_feature_aa64_sve2_aes(const ARMISARegisters *id)
{
return FIELD_EX64(id->id_aa64zfr0, ID_AA64ZFR0, AES) != 0;
@@ -990,6 +995,36 @@ static inline bool isar_feature_aa64_sme_fa64(const ARMISARegisters *id)
return FIELD_EX64(id->id_aa64smfr0, ID_AA64SMFR0, FA64);
}
+static inline bool isar_feature_aa64_sme2(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64smfr0, ID_AA64SMFR0, SMEVER) != 0;
+}
+
+static inline bool isar_feature_aa64_sme2p1(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64smfr0, ID_AA64SMFR0, SMEVER) >= 2;
+}
+
+static inline bool isar_feature_aa64_sme2_i16i64(const ARMISARegisters *id)
+{
+ return isar_feature_aa64_sme2(id) && isar_feature_aa64_sme_i16i64(id);
+}
+
+static inline bool isar_feature_aa64_sme2_f64f64(const ARMISARegisters *id)
+{
+ return isar_feature_aa64_sme2(id) && isar_feature_aa64_sme_f64f64(id);
+}
+
+static inline bool isar_feature_aa64_sme2_b16b16(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64smfr0, ID_AA64SMFR0, B16B16) != 0;
+}
+
+static inline bool isar_feature_aa64_sme2_f16f16(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64smfr0, ID_AA64SMFR0, F16F16) != 0;
+}
+
/*
* Feature tests for "does this exist in either 32-bit or 64-bit?"
*/
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 13/61] target/arm: Add ZT0
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (11 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 12/61] target/arm: Add isar_feature_aa64_sme2* Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 14/61] target/arm: Add zt0_excp_el to DisasContext Richard Henderson
` (48 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
This is a 512-bit array introduced with SME2.
Save it only when ZA is in use.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.h | 3 +++
target/arm/machine.c | 21 +++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 938c990854..091a517a93 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -734,6 +734,9 @@ typedef struct CPUArchState {
uint64_t scxtnum_el[4];
struct {
+ /* SME2 ZT0 -- 512 bit array, with data ordered like ARMVectorReg. */
+ uint64_t zt0[512 / 64] QEMU_ALIGNED(16);
+
/*
* SME ZA storage -- 256 x 256 byte array, with bytes in host
* word order, as we do with vfp.zregs[]. This corresponds to
diff --git a/target/arm/machine.c b/target/arm/machine.c
index d41da414b3..416fe1b7be 100644
--- a/target/arm/machine.c
+++ b/target/arm/machine.c
@@ -320,6 +320,26 @@ static const VMStateDescription vmstate_za = {
VMSTATE_END_OF_LIST()
}
};
+
+static bool zt0_needed(void *opaque)
+{
+ ARMCPU *cpu = opaque;
+
+ return za_needed(cpu) && cpu_isar_feature(aa64_sme2, cpu);
+}
+
+static const VMStateDescription vmstate_zt0 = {
+ .name = "cpu/zt0",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .needed = zt0_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT64_ARRAY(env.za_state.zt0, ARMCPU,
+ ARRAY_SIZE(((CPUARMState *)0)->za_state.zt0)),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
#endif /* AARCH64 */
static bool serror_needed(void *opaque)
@@ -1104,6 +1124,7 @@ const VMStateDescription vmstate_arm_cpu = {
#ifdef TARGET_AARCH64
&vmstate_sve,
&vmstate_za,
+ &vmstate_zt0,
#endif
&vmstate_serror,
&vmstate_irq_line_state,
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 14/61] target/arm: Add zt0_excp_el to DisasContext
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (12 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 13/61] target/arm: Add ZT0 Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 15/61] target/arm: Implement SME2 ZERO ZT0 Richard Henderson
` (47 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Pipe the value through from SMCR_ELx through hflags
and into the disassembly context.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.h | 2 ++
target/arm/tcg/translate.h | 1 +
target/arm/cpu.c | 3 +++
target/arm/tcg/hflags.c | 34 +++++++++++++++++++++++++++++++++-
target/arm/tcg/translate-a64.c | 1 +
5 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 091a517a93..61f959af8b 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -1531,6 +1531,7 @@ FIELD(SVCR, ZA, 1, 1)
/* Fields for SMCR_ELx. */
FIELD(SMCR, LEN, 0, 4)
+FIELD(SMCR, EZT0, 30, 1)
FIELD(SMCR, FA64, 31, 1)
/* Write a new value to v7m.exception, thus transitioning into or out
@@ -3240,6 +3241,7 @@ FIELD(TBFLAG_A64, NV2_MEM_E20, 35, 1)
FIELD(TBFLAG_A64, NV2_MEM_BE, 36, 1)
FIELD(TBFLAG_A64, AH, 37, 1) /* FPCR.AH */
FIELD(TBFLAG_A64, NEP, 38, 1) /* FPCR.NEP */
+FIELD(TBFLAG_A64, ZT0EXC_EL, 39, 2)
/*
* Helpers for using the above. Note that only the A64 accessors use
diff --git a/target/arm/tcg/translate.h b/target/arm/tcg/translate.h
index f8dc2f0d4b..3021902ce1 100644
--- a/target/arm/tcg/translate.h
+++ b/target/arm/tcg/translate.h
@@ -71,6 +71,7 @@ typedef struct DisasContext {
int fp_excp_el; /* FP exception EL or 0 if enabled */
int sve_excp_el; /* SVE exception EL or 0 if enabled */
int sme_excp_el; /* SME exception EL or 0 if enabled */
+ int zt0_excp_el; /* ZT0 exception EL or 0 if enabled */
int vl; /* current vector length in bytes */
int svl; /* current streaming vector length in bytes */
bool vfp_enabled; /* FP enabled via FPSCR.EN */
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 813cb45276..0cbc3b10c2 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -631,6 +631,9 @@ void arm_emulate_firmware_reset(CPUState *cpustate, int target_el)
env->cp15.cptr_el[3] |= R_CPTR_EL3_ESM_MASK;
env->cp15.scr_el3 |= SCR_ENTP2;
env->vfp.smcr_el[3] = 0xf;
+ if (cpu_isar_feature(aa64_sme2, cpu)) {
+ env->vfp.smcr_el[3] |= R_SMCR_EZT0_MASK;
+ }
}
if (cpu_isar_feature(aa64_hcx, cpu)) {
env->cp15.scr_el3 |= SCR_HXEN;
diff --git a/target/arm/tcg/hflags.c b/target/arm/tcg/hflags.c
index 9e6a1869f9..e8823c380f 100644
--- a/target/arm/tcg/hflags.c
+++ b/target/arm/tcg/hflags.c
@@ -201,6 +201,31 @@ static CPUARMTBFlags rebuild_hflags_a32(CPUARMState *env, int fp_el,
return rebuild_hflags_common_32(env, fp_el, mmu_idx, flags);
}
+/*
+ * Return the exception level to which exceptions should be taken for ZT0.
+ * C.f. the ARM pseudocode function CheckSMEZT0Enabled, after the ZA check.
+ */
+static int zt0_exception_el(CPUARMState *env, int el)
+{
+#ifndef CONFIG_USER_ONLY
+ if (el <= 1
+ && !el_is_in_host(env, el)
+ && !FIELD_EX64(env->vfp.smcr_el[1], SMCR, EZT0)) {
+ return 1;
+ }
+ if (el <= 2
+ && arm_is_el2_enabled(env)
+ && !FIELD_EX64(env->vfp.smcr_el[2], SMCR, EZT0)) {
+ return 2;
+ }
+ if (arm_feature(env, ARM_FEATURE_EL3)
+ && !FIELD_EX64(env->vfp.smcr_el[3], SMCR, EZT0)) {
+ return 3;
+ }
+#endif
+ return 0;
+}
+
static CPUARMTBFlags rebuild_hflags_a64(CPUARMState *env, int el, int fp_el,
ARMMMUIdx mmu_idx)
{
@@ -256,7 +281,14 @@ static CPUARMTBFlags rebuild_hflags_a64(CPUARMState *env, int el, int fp_el,
DP_TBFLAG_A64(flags, PSTATE_SM, 1);
DP_TBFLAG_A64(flags, SME_TRAP_NONSTREAMING, !sme_fa64(env, el));
}
- DP_TBFLAG_A64(flags, PSTATE_ZA, FIELD_EX64(env->svcr, SVCR, ZA));
+
+ if (FIELD_EX64(env->svcr, SVCR, ZA)) {
+ DP_TBFLAG_A64(flags, PSTATE_ZA, 1);
+ if (cpu_isar_feature(aa64_sme2, env_archcpu(env))) {
+ int zt0_el = zt0_exception_el(env, el);
+ DP_TBFLAG_A64(flags, ZT0EXC_EL, zt0_el);
+ }
+ }
}
sctlr = regime_sctlr(env, stage1);
diff --git a/target/arm/tcg/translate-a64.c b/target/arm/tcg/translate-a64.c
index 1ee57ebf66..bc96cee273 100644
--- a/target/arm/tcg/translate-a64.c
+++ b/target/arm/tcg/translate-a64.c
@@ -10143,6 +10143,7 @@ static void aarch64_tr_init_disas_context(DisasContextBase *dcbase,
dc->trap_eret = EX_TBFLAG_A64(tb_flags, TRAP_ERET);
dc->sve_excp_el = EX_TBFLAG_A64(tb_flags, SVEEXC_EL);
dc->sme_excp_el = EX_TBFLAG_A64(tb_flags, SMEEXC_EL);
+ dc->zt0_excp_el = EX_TBFLAG_A64(tb_flags, ZT0EXC_EL);
dc->vl = (EX_TBFLAG_A64(tb_flags, VL) + 1) * 16;
dc->svl = (EX_TBFLAG_A64(tb_flags, SVL) + 1) * 16;
dc->pauth_active = EX_TBFLAG_A64(tb_flags, PAUTH_ACTIVE);
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 15/61] target/arm: Implement SME2 ZERO ZT0
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (13 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 14/61] target/arm: Add zt0_excp_el to DisasContext Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 16/61] target/arm: Implement SME2 LDR/STR ZT0 Richard Henderson
` (46 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/syndrome.h | 1 +
target/arm/tcg/translate-sme.c | 26 ++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 1 +
3 files changed, 28 insertions(+)
diff --git a/target/arm/syndrome.h b/target/arm/syndrome.h
index 3244e0740d..c48d3b8587 100644
--- a/target/arm/syndrome.h
+++ b/target/arm/syndrome.h
@@ -80,6 +80,7 @@ typedef enum {
SME_ET_Streaming,
SME_ET_NotStreaming,
SME_ET_InactiveZA,
+ SME_ET_InaccessibleZT0,
} SMEExceptionType;
#define ARM_EL_EC_LENGTH 6
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index e8b3578174..37f4d341f0 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -27,6 +27,19 @@
#include "decode-sme.c.inc"
+static bool sme2_zt0_enabled_check(DisasContext *s)
+{
+ if (!sme_za_enabled_check(s)) {
+ return false;
+ }
+ if (s->zt0_excp_el) {
+ gen_exception_insn_el(s, 0, EXCP_UDEF,
+ syn_smetrap(SME_ET_InaccessibleZT0, false),
+ s->zt0_excp_el);
+ return false;
+ }
+ return true;
+}
/*
* Resolve tile.size[index] to a host pointer, where tile and index
@@ -130,6 +143,19 @@ static bool trans_ZERO(DisasContext *s, arg_ZERO *a)
return true;
}
+static bool trans_ZERO_zt0(DisasContext *s, arg_ZERO_zt0 *a)
+{
+ if (!dc_isar_feature(aa64_sme2, s)) {
+ return false;
+ }
+ if (sme2_zt0_enabled_check(s)) {
+ tcg_gen_gvec_dup_imm(MO_64, offsetof(CPUARMState, za_state.zt0),
+ sizeof_field(CPUARMState, za_state.zt0),
+ sizeof_field(CPUARMState, za_state.zt0), 0);
+ }
+ return true;
+}
+
static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
{
static gen_helper_gvec_4 * const h_fns[5] = {
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 628804e37a..dd1f983941 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -22,6 +22,7 @@
### SME Misc
ZERO 11000000 00 001 00000000000 imm:8
+ZERO_zt0 11000000 01 001 00000000000 00000001
### SME Move into/from Array
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 16/61] target/arm: Implement SME2 LDR/STR ZT0
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (14 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 15/61] target/arm: Implement SME2 ZERO ZT0 Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 17/61] target/arm: Implement SME2 MOVT Richard Henderson
` (45 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 12 ++++++++++++
target/arm/tcg/sme.decode | 6 ++++++
2 files changed, 18 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 37f4d341f0..8b0a33e2ae 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -291,6 +291,18 @@ static bool do_ldst_r(DisasContext *s, arg_ldstr *a, GenLdStR *fn)
TRANS_FEAT(LDR, aa64_sme, do_ldst_r, a, gen_sve_ldr)
TRANS_FEAT(STR, aa64_sme, do_ldst_r, a, gen_sve_str)
+static bool do_ldst_zt0(DisasContext *s, arg_ldstzt0 *a, GenLdStR *fn)
+{
+ if (sme2_zt0_enabled_check(s)) {
+ fn(s, tcg_env, offsetof(CPUARMState, za_state.zt0),
+ sizeof_field(CPUARMState, za_state.zt0), a->rn, 0);
+ }
+ return true;
+}
+
+TRANS_FEAT(LDR_zt0, aa64_sme2, do_ldst_zt0, a, gen_sve_ldr)
+TRANS_FEAT(STR_zt0, aa64_sme2, do_ldst_zt0, a, gen_sve_str)
+
static bool do_adda(DisasContext *s, arg_adda *a, MemOp esz,
gen_helper_gvec_4 *fn)
{
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index dd1f983941..cef49c3b29 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -55,6 +55,12 @@ LDST1 1110000 111 st:1 rm:5 v:1 .. pg:3 rn:5 0 za_imm:4 \
LDR 1110000 100 0 000000 .. 000 ..... 0 .... @ldstr
STR 1110000 100 1 000000 .. 000 ..... 0 .... @ldstr
+&ldstzt0 rn
+@ldstzt0 ....... ... . ...... .. ... rn:5 ..... &ldstzt0
+
+LDR_zt0 1110000 100 0 111111 00 000 ..... 00000 @ldstzt0
+STR_zt0 1110000 100 1 111111 00 000 ..... 00000 @ldstzt0
+
### SME Add Vector to Array
&adda zad zn pm pn
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 17/61] target/arm: Implement SME2 MOVT
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (15 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 16/61] target/arm: Implement SME2 LDR/STR ZT0 Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 18/61] target/arm: Split get_tile_rowcol argument tile_index Richard Henderson
` (44 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 13 +++++++++++++
target/arm/tcg/sme.decode | 5 +++++
2 files changed, 18 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 8b0a33e2ae..13314c5cd7 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -210,6 +210,19 @@ static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
return true;
}
+static bool do_movt(DisasContext *s, arg_MOVT_rzt *a,
+ void (*func)(TCGv_i64, TCGv_ptr, tcg_target_long))
+{
+ if (sme2_zt0_enabled_check(s)) {
+ func(cpu_reg(s, a->rt), tcg_env,
+ offsetof(CPUARMState, za_state.zt0) + a->off * 8);
+ }
+ return true;
+}
+
+TRANS_FEAT(MOVT_rzt, aa64_sme2, do_movt, a, tcg_gen_ld_i64)
+TRANS_FEAT(MOVT_ztr, aa64_sme2, do_movt, a, tcg_gen_st_i64)
+
static bool trans_LDST1(DisasContext *s, arg_LDST1 *a)
{
typedef void GenLdSt1(TCGv_env, TCGv_ptr, TCGv_ptr, TCGv, TCGv_i32);
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index cef49c3b29..83ca6a9104 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -39,6 +39,11 @@ MOVA 11000000 esz:2 00001 0 v:1 .. pg:3 0 za_imm:4 zr:5 \
MOVA 11000000 11 00001 1 v:1 .. pg:3 0 za_imm:4 zr:5 \
&mova to_vec=1 rs=%mova_rs esz=4
+### SME Move into/from ZT0
+
+MOVT_rzt 1100 0000 0100 1100 0 off:3 00 11111 rt:5
+MOVT_ztr 1100 0000 0100 1110 0 off:3 00 11111 rt:5
+
### SME Memory
&ldst esz rs pg rn rm za_imm v:bool st:bool
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 18/61] target/arm: Split get_tile_rowcol argument tile_index
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (16 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 17/61] target/arm: Implement SME2 MOVT Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 19/61] target/arm: Rename MOVA for translate Richard Henderson
` (43 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Decode tile number and index offset beforehand and separately.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 17 +++++--------
target/arm/tcg/sme.decode | 46 +++++++++++++++++++++++-----------
2 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 13314c5cd7..bd6095ffb6 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -41,15 +41,10 @@ static bool sme2_zt0_enabled_check(DisasContext *s)
return true;
}
-/*
- * Resolve tile.size[index] to a host pointer, where tile and index
- * are always decoded together, dependent on the element size.
- */
+/* Resolve tile.size[rs+imm] to a host pointer. */
static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
- int tile_index, bool vertical)
+ int tile, int imm, bool vertical)
{
- int tile = tile_index >> (4 - esz);
- int index = esz == MO_128 ? 0 : extract32(tile_index, 0, 4 - esz);
int pos, len, offset;
TCGv_i32 tmp;
TCGv_ptr addr;
@@ -57,7 +52,7 @@ static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
/* Compute the final index, which is Rs+imm. */
tmp = tcg_temp_new_i32();
tcg_gen_trunc_tl_i32(tmp, cpu_reg(s, rs));
- tcg_gen_addi_i32(tmp, tmp, index);
+ tcg_gen_addi_i32(tmp, tmp, imm);
/* Prepare a power-of-two modulo via extraction of @len bits. */
len = ctz32(streaming_vec_reg_size(s)) - esz;
@@ -185,7 +180,7 @@ static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
return true;
}
- t_za = get_tile_rowcol(s, a->esz, a->rs, a->za_imm, a->v);
+ t_za = get_tile_rowcol(s, a->esz, a->rs, a->za, a->off, a->v);
t_zr = vec_full_reg_ptr(s, a->zr);
t_pg = pred_full_reg_ptr(s, a->pg);
@@ -264,7 +259,7 @@ static bool trans_LDST1(DisasContext *s, arg_LDST1 *a)
return true;
}
- t_za = get_tile_rowcol(s, a->esz, a->rs, a->za_imm, a->v);
+ t_za = get_tile_rowcol(s, a->esz, a->rs, a->za, a->off, a->v);
t_pg = pred_full_reg_ptr(s, a->pg);
addr = tcg_temp_new_i64();
@@ -295,7 +290,7 @@ static bool do_ldst_r(DisasContext *s, arg_ldstr *a, GenLdStR *fn)
}
/* ZA[n] equates to ZA0H.B[n]. */
- base = get_tile_rowcol(s, MO_8, a->rv, imm, false);
+ base = get_tile_rowcol(s, MO_8, a->rv, 0, imm, false);
fn(s, base, 0, svl, a->rn, imm * svl);
return true;
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 83ca6a9104..efe369e079 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -27,17 +27,29 @@ ZERO_zt0 11000000 01 001 00000000000 00000001
### SME Move into/from Array
%mova_rs 13:2 !function=plus_12
-&mova esz rs pg zr za_imm v:bool to_vec:bool
+&mova esz rs pg zr za off v:bool to_vec:bool
-MOVA 11000000 esz:2 00000 0 v:1 .. pg:3 zr:5 0 za_imm:4 \
- &mova to_vec=0 rs=%mova_rs
-MOVA 11000000 11 00000 1 v:1 .. pg:3 zr:5 0 za_imm:4 \
- &mova to_vec=0 rs=%mova_rs esz=4
+MOVA 11000000 00 00000 0 v:1 .. pg:3 zr:5 0 off:4 \
+ &mova to_vec=0 rs=%mova_rs esz=0 za=0
+MOVA 11000000 01 00000 0 v:1 .. pg:3 zr:5 0 za:1 off:3 \
+ &mova to_vec=0 rs=%mova_rs esz=1
+MOVA 11000000 10 00000 0 v:1 .. pg:3 zr:5 0 za:2 off:2 \
+ &mova to_vec=0 rs=%mova_rs esz=2
+MOVA 11000000 11 00000 0 v:1 .. pg:3 zr:5 0 za:3 off:1 \
+ &mova to_vec=0 rs=%mova_rs esz=3
+MOVA 11000000 11 00000 1 v:1 .. pg:3 zr:5 0 za:4 \
+ &mova to_vec=0 rs=%mova_rs esz=4 off=0
-MOVA 11000000 esz:2 00001 0 v:1 .. pg:3 0 za_imm:4 zr:5 \
- &mova to_vec=1 rs=%mova_rs
-MOVA 11000000 11 00001 1 v:1 .. pg:3 0 za_imm:4 zr:5 \
- &mova to_vec=1 rs=%mova_rs esz=4
+MOVA 11000000 00 00001 0 v:1 .. pg:3 0 off:4 zr:5 \
+ &mova to_vec=1 rs=%mova_rs esz=0 za=0
+MOVA 11000000 01 00001 0 v:1 .. pg:3 0 za:1 off:3 zr:5 \
+ &mova to_vec=1 rs=%mova_rs esz=1
+MOVA 11000000 10 00001 0 v:1 .. pg:3 0 za:2 off:2 zr:5 \
+ &mova to_vec=1 rs=%mova_rs esz=2
+MOVA 11000000 11 00001 0 v:1 .. pg:3 0 za:3 off:1 zr:5 \
+ &mova to_vec=1 rs=%mova_rs esz=3
+MOVA 11000000 11 00001 1 v:1 .. pg:3 0 za:4 zr:5 \
+ &mova to_vec=1 rs=%mova_rs esz=4 off=0
### SME Move into/from ZT0
@@ -46,12 +58,18 @@ MOVT_ztr 1100 0000 0100 1110 0 off:3 00 11111 rt:5
### SME Memory
-&ldst esz rs pg rn rm za_imm v:bool st:bool
+&ldst esz rs pg rn rm za off v:bool st:bool
-LDST1 1110000 0 esz:2 st:1 rm:5 v:1 .. pg:3 rn:5 0 za_imm:4 \
- &ldst rs=%mova_rs
-LDST1 1110000 111 st:1 rm:5 v:1 .. pg:3 rn:5 0 za_imm:4 \
- &ldst esz=4 rs=%mova_rs
+LDST1 1110000 0 00 st:1 rm:5 v:1 .. pg:3 rn:5 0 off:4 \
+ &ldst rs=%mova_rs esz=0 za=0
+LDST1 1110000 0 01 st:1 rm:5 v:1 .. pg:3 rn:5 0 za:1 off:3 \
+ &ldst rs=%mova_rs esz=1
+LDST1 1110000 0 10 st:1 rm:5 v:1 .. pg:3 rn:5 0 za:2 off:2 \
+ &ldst rs=%mova_rs esz=2
+LDST1 1110000 0 11 st:1 rm:5 v:1 .. pg:3 rn:5 0 za:3 off:1 \
+ &ldst rs=%mova_rs esz=3
+LDST1 1110000 1 11 st:1 rm:5 v:1 .. pg:3 rn:5 0 za:4 \
+ &ldst rs=%mova_rs esz=4 off=0
&ldstr rv rn imm
@ldstr ....... ... . ...... .. ... rn:5 . imm:4 \
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 19/61] target/arm: Rename MOVA for translate
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (17 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 18/61] target/arm: Split get_tile_rowcol argument tile_index Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 20/61] target/arm: Implement SME2 MOVA to/from tile, multiple registers Richard Henderson
` (42 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Prepare for more kinds of MOVA from SME2 by renaming the
existing SME1 MOVA to indicate tile to/from vector.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 12 +++++-----
target/arm/tcg/sme.decode | 42 +++++++++++++++++-----------------
2 files changed, 27 insertions(+), 27 deletions(-)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index bd6095ffb6..908c3e8dd6 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -151,7 +151,7 @@ static bool trans_ZERO_zt0(DisasContext *s, arg_ZERO_zt0 *a)
return true;
}
-static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
+static bool do_mova_tile(DisasContext *s, arg_mova_p *a, bool to_vec)
{
static gen_helper_gvec_4 * const h_fns[5] = {
gen_helper_sve_sel_zpzz_b, gen_helper_sve_sel_zpzz_h,
@@ -173,9 +173,6 @@ static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
TCGv_i32 t_desc;
int svl;
- if (!dc_isar_feature(aa64_sme, s)) {
- return false;
- }
if (!sme_smza_enabled_check(s)) {
return true;
}
@@ -189,14 +186,14 @@ static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
if (a->v) {
/* Vertical slice -- use sme mova helpers. */
- if (a->to_vec) {
+ if (to_vec) {
zc_fns[a->esz](t_zr, t_za, t_pg, t_desc);
} else {
cz_fns[a->esz](t_za, t_zr, t_pg, t_desc);
}
} else {
/* Horizontal slice -- reuse sve sel helpers. */
- if (a->to_vec) {
+ if (to_vec) {
h_fns[a->esz](t_zr, t_za, t_zr, t_pg, t_desc);
} else {
h_fns[a->esz](t_za, t_zr, t_za, t_pg, t_desc);
@@ -205,6 +202,9 @@ static bool trans_MOVA(DisasContext *s, arg_MOVA *a)
return true;
}
+TRANS_FEAT(MOVA_tz, aa64_sme, do_mova_tile, a, false)
+TRANS_FEAT(MOVA_zt, aa64_sme, do_mova_tile, a, true)
+
static bool do_movt(DisasContext *s, arg_MOVT_rzt *a,
void (*func)(TCGv_i64, TCGv_ptr, tcg_target_long))
{
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index efe369e079..459b96805f 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -27,29 +27,29 @@ ZERO_zt0 11000000 01 001 00000000000 00000001
### SME Move into/from Array
%mova_rs 13:2 !function=plus_12
-&mova esz rs pg zr za off v:bool to_vec:bool
+&mova_p esz rs pg zr za off v:bool
-MOVA 11000000 00 00000 0 v:1 .. pg:3 zr:5 0 off:4 \
- &mova to_vec=0 rs=%mova_rs esz=0 za=0
-MOVA 11000000 01 00000 0 v:1 .. pg:3 zr:5 0 za:1 off:3 \
- &mova to_vec=0 rs=%mova_rs esz=1
-MOVA 11000000 10 00000 0 v:1 .. pg:3 zr:5 0 za:2 off:2 \
- &mova to_vec=0 rs=%mova_rs esz=2
-MOVA 11000000 11 00000 0 v:1 .. pg:3 zr:5 0 za:3 off:1 \
- &mova to_vec=0 rs=%mova_rs esz=3
-MOVA 11000000 11 00000 1 v:1 .. pg:3 zr:5 0 za:4 \
- &mova to_vec=0 rs=%mova_rs esz=4 off=0
+MOVA_tz 11000000 00 00000 0 v:1 .. pg:3 zr:5 0 off:4 \
+ &mova_p rs=%mova_rs esz=0 za=0
+MOVA_tz 11000000 01 00000 0 v:1 .. pg:3 zr:5 0 za:1 off:3 \
+ &mova_p rs=%mova_rs esz=1
+MOVA_tz 11000000 10 00000 0 v:1 .. pg:3 zr:5 0 za:2 off:2 \
+ &mova_p rs=%mova_rs esz=2
+MOVA_tz 11000000 11 00000 0 v:1 .. pg:3 zr:5 0 za:3 off:1 \
+ &mova_p rs=%mova_rs esz=3
+MOVA_tz 11000000 11 00000 1 v:1 .. pg:3 zr:5 0 za:4 \
+ &mova_p rs=%mova_rs esz=4 off=0
-MOVA 11000000 00 00001 0 v:1 .. pg:3 0 off:4 zr:5 \
- &mova to_vec=1 rs=%mova_rs esz=0 za=0
-MOVA 11000000 01 00001 0 v:1 .. pg:3 0 za:1 off:3 zr:5 \
- &mova to_vec=1 rs=%mova_rs esz=1
-MOVA 11000000 10 00001 0 v:1 .. pg:3 0 za:2 off:2 zr:5 \
- &mova to_vec=1 rs=%mova_rs esz=2
-MOVA 11000000 11 00001 0 v:1 .. pg:3 0 za:3 off:1 zr:5 \
- &mova to_vec=1 rs=%mova_rs esz=3
-MOVA 11000000 11 00001 1 v:1 .. pg:3 0 za:4 zr:5 \
- &mova to_vec=1 rs=%mova_rs esz=4 off=0
+MOVA_zt 11000000 00 00001 0 v:1 .. pg:3 0 off:4 zr:5 \
+ &mova_p rs=%mova_rs esz=0 za=0
+MOVA_zt 11000000 01 00001 0 v:1 .. pg:3 0 za:1 off:3 zr:5 \
+ &mova_p rs=%mova_rs esz=1
+MOVA_zt 11000000 10 00001 0 v:1 .. pg:3 0 za:2 off:2 zr:5 \
+ &mova_p rs=%mova_rs esz=2
+MOVA_zt 11000000 11 00001 0 v:1 .. pg:3 0 za:3 off:1 zr:5 \
+ &mova_p rs=%mova_rs esz=3
+MOVA_zt 11000000 11 00001 1 v:1 .. pg:3 0 za:4 zr:5 \
+ &mova_p rs=%mova_rs esz=4 off=0
### SME Move into/from ZT0
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 20/61] target/arm: Implement SME2 MOVA to/from tile, multiple registers
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (18 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 19/61] target/arm: Rename MOVA for translate Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 21/61] target/arm: Split out get_zarray Richard Henderson
` (41 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 9 +++++
target/arm/tcg/sme_helper.c | 64 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 56 +++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 37 ++++++++++++++++++++
4 files changed, 166 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 858d69188f..8246ce774c 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -33,6 +33,15 @@ DEF_HELPER_FLAGS_4(sme_mova_zc_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_4(sme_mova_cz_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_4(sme_mova_zc_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_cz_b, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_zc_b, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_cz_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_zc_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_cz_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_zc_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_cz_d, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_mova_zc_d, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
DEF_HELPER_FLAGS_5(sme_ld1b_h, TCG_CALL_NO_WG, void, env, ptr, ptr, tl, i32)
DEF_HELPER_FLAGS_5(sme_ld1b_v, TCG_CALL_NO_WG, void, env, ptr, ptr, tl, i32)
DEF_HELPER_FLAGS_5(sme_ld1b_h_mte, TCG_CALL_NO_WG, void, env, ptr, ptr, tl, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 45f6cdfcb4..ee81bece12 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -206,6 +206,50 @@ void HELPER(sme_mova_zc_q)(void *vd, void *za, void *vg, uint32_t desc)
#undef DO_MOVA_Z
+void HELPER(sme2_mova_zc_b)(void *vdst, void *vsrc, uint32_t desc)
+{
+ const uint8_t *src = vsrc;
+ uint8_t *dst = vdst;
+ size_t i, n = simd_oprsz(desc);
+
+ for (i = 0; i < n; ++i) {
+ dst[i] = src[tile_vslice_index(i)];
+ }
+}
+
+void HELPER(sme2_mova_zc_h)(void *vdst, void *vsrc, uint32_t desc)
+{
+ const uint16_t *src = vsrc;
+ uint16_t *dst = vdst;
+ size_t i, n = simd_oprsz(desc) / 2;
+
+ for (i = 0; i < n; ++i) {
+ dst[i] = src[tile_vslice_index(i)];
+ }
+}
+
+void HELPER(sme2_mova_zc_s)(void *vdst, void *vsrc, uint32_t desc)
+{
+ const uint32_t *src = vsrc;
+ uint32_t *dst = vdst;
+ size_t i, n = simd_oprsz(desc) / 4;
+
+ for (i = 0; i < n; ++i) {
+ dst[i] = src[tile_vslice_index(i)];
+ }
+}
+
+void HELPER(sme2_mova_zc_d)(void *vdst, void *vsrc, uint32_t desc)
+{
+ const uint64_t *src = vsrc;
+ uint64_t *dst = vdst;
+ size_t i, n = simd_oprsz(desc) / 8;
+
+ for (i = 0; i < n; ++i) {
+ dst[i] = src[tile_vslice_index(i)];
+ }
+}
+
/*
* Clear elements in a tile slice comprising len bytes.
*/
@@ -314,6 +358,26 @@ static void copy_vertical_q(void *vdst, const void *vsrc, size_t len)
}
}
+void HELPER(sme2_mova_cz_b)(void *vdst, void *vsrc, uint32_t desc)
+{
+ copy_vertical_b(vdst, vsrc, simd_oprsz(desc));
+}
+
+void HELPER(sme2_mova_cz_h)(void *vdst, void *vsrc, uint32_t desc)
+{
+ copy_vertical_h(vdst, vsrc, simd_oprsz(desc));
+}
+
+void HELPER(sme2_mova_cz_s)(void *vdst, void *vsrc, uint32_t desc)
+{
+ copy_vertical_s(vdst, vsrc, simd_oprsz(desc));
+}
+
+void HELPER(sme2_mova_cz_d)(void *vdst, void *vsrc, uint32_t desc)
+{
+ copy_vertical_d(vdst, vsrc, simd_oprsz(desc));
+}
+
/*
* Host and TLB primitives for vertical tile slice addressing.
*/
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 908c3e8dd6..eed9345651 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -205,6 +205,62 @@ static bool do_mova_tile(DisasContext *s, arg_mova_p *a, bool to_vec)
TRANS_FEAT(MOVA_tz, aa64_sme, do_mova_tile, a, false)
TRANS_FEAT(MOVA_zt, aa64_sme, do_mova_tile, a, true)
+static bool do_mova_tile_n(DisasContext *s, arg_mova_t *a, int n, bool to_vec)
+{
+ static gen_helper_gvec_2 * const cz_fns[] = {
+ gen_helper_sme2_mova_cz_b, gen_helper_sme2_mova_cz_h,
+ gen_helper_sme2_mova_cz_s, gen_helper_sme2_mova_cz_d,
+ };
+ static gen_helper_gvec_2 * const zc_fns[] = {
+ gen_helper_sme2_mova_zc_b, gen_helper_sme2_mova_zc_h,
+ gen_helper_sme2_mova_zc_s, gen_helper_sme2_mova_zc_d,
+ };
+ TCGv_ptr t_za;
+ int svl;
+
+ if (!sme_smza_enabled_check(s)) {
+ return true;
+ }
+
+ svl = streaming_vec_reg_size(s);
+ if (svl == 16 && n == 4 && a->esz == MO_64) {
+ unallocated_encoding(s);
+ return true;
+ }
+
+ if (a->v) {
+ TCGv_i32 t_desc = tcg_constant_i32(simd_desc(svl, svl, 0));
+
+ for (int i = 0; i < n; ++i) {
+ TCGv_ptr t_zr = vec_full_reg_ptr(s, a->zr * n + i);
+ t_za = get_tile_rowcol(s, a->esz, a->rs, a->za,
+ a->off * n + i, a->v);
+ if (to_vec) {
+ zc_fns[a->esz](t_zr, t_za, t_desc);
+ } else {
+ cz_fns[a->esz](t_za, t_zr, t_desc);
+ }
+ }
+ } else {
+ for (int i = 0; i < n; ++i) {
+ int zr_ofs = vec_full_reg_offset(s, a->zr * n + i);
+ t_za = get_tile_rowcol(s, a->esz, a->rs, a->za,
+ a->off * n + i, a->v);
+ if (to_vec) {
+ tcg_gen_gvec_mov_var(MO_8, tcg_env, zr_ofs, t_za, 0, svl, svl);
+ } else {
+ tcg_gen_gvec_mov_var(MO_8, t_za, 0, tcg_env, zr_ofs, svl, svl);
+ }
+ }
+ }
+ return true;
+}
+
+TRANS_FEAT(MOVA_tz2, aa64_sme2, do_mova_tile_n, a, 2, false)
+TRANS_FEAT(MOVA_tz4, aa64_sme2, do_mova_tile_n, a, 4, false)
+TRANS_FEAT(MOVA_zt2, aa64_sme2, do_mova_tile_n, a, 2, true)
+TRANS_FEAT(MOVA_zt4, aa64_sme2, do_mova_tile_n, a, 4, true)
+
static bool do_movt(DisasContext *s, arg_MOVT_rzt *a,
void (*func)(TCGv_i64, TCGv_ptr, tcg_target_long))
{
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 459b96805f..5eca5f4acf 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -28,6 +28,7 @@ ZERO_zt0 11000000 01 001 00000000000 00000001
%mova_rs 13:2 !function=plus_12
&mova_p esz rs pg zr za off v:bool
+&mova_t esz rs zr za off v:bool
MOVA_tz 11000000 00 00000 0 v:1 .. pg:3 zr:5 0 off:4 \
&mova_p rs=%mova_rs esz=0 za=0
@@ -51,6 +52,42 @@ MOVA_zt 11000000 11 00001 0 v:1 .. pg:3 0 za:3 off:1 zr:5 \
MOVA_zt 11000000 11 00001 1 v:1 .. pg:3 0 za:4 zr:5 \
&mova_p rs=%mova_rs esz=4 off=0
+MOVA_tz2 11000000 00 00010 0 v:1 .. 000 zr:4 0 00 off:3 \
+ &mova_t rs=%mova_rs esz=0 za=0
+MOVA_tz2 11000000 01 00010 0 v:1 .. 000 zr:4 0 00 za:1 off:2 \
+ &mova_t rs=%mova_rs esz=1
+MOVA_tz2 11000000 10 00010 0 v:1 .. 000 zr:4 0 00 za:2 off:1 \
+ &mova_t rs=%mova_rs esz=2
+MOVA_tz2 11000000 11 00010 0 v:1 .. 000 zr:4 0 00 za:3 \
+ &mova_t rs=%mova_rs esz=3 off=0
+
+MOVA_zt2 11000000 00 00011 0 v:1 .. 000 00 off:3 zr:4 0 \
+ &mova_t rs=%mova_rs esz=0 za=0
+MOVA_zt2 11000000 01 00011 0 v:1 .. 000 00 za:1 off:2 zr:4 0 \
+ &mova_t rs=%mova_rs esz=1
+MOVA_zt2 11000000 10 00011 0 v:1 .. 000 00 za:2 off:1 zr:4 0 \
+ &mova_t rs=%mova_rs esz=2
+MOVA_zt2 11000000 11 00011 0 v:1 .. 000 00 za:3 zr:4 0 \
+ &mova_t rs=%mova_rs esz=3 off=0
+
+MOVA_tz4 11000000 00 00010 0 v:1 .. 001 zr:3 00 000 off:2 \
+ &mova_t rs=%mova_rs esz=0 za=0
+MOVA_tz4 11000000 01 00010 0 v:1 .. 001 zr:3 00 000 za:1 off:1 \
+ &mova_t rs=%mova_rs esz=1
+MOVA_tz4 11000000 10 00010 0 v:1 .. 001 zr:3 00 000 za:2 \
+ &mova_t rs=%mova_rs esz=2 off=0
+MOVA_tz4 11000000 11 00010 0 v:1 .. 001 zr:3 00 00 za:3 \
+ &mova_t rs=%mova_rs esz=3 off=0
+
+MOVA_zt4 11000000 00 00011 0 v:1 .. 001 000 off:2 zr:3 00 \
+ &mova_t rs=%mova_rs esz=0 za=0
+MOVA_zt4 11000000 01 00011 0 v:1 .. 001 000 za:1 off:1 zr:3 00 \
+ &mova_t rs=%mova_rs esz=1
+MOVA_zt4 11000000 10 00011 0 v:1 .. 001 000 za:2 zr:3 00 \
+ &mova_t rs=%mova_rs esz=2 off=0
+MOVA_zt4 11000000 11 00011 0 v:1 .. 001 00 za:3 zr:3 00 \
+ &mova_t rs=%mova_rs esz=3 off=0
+
### SME Move into/from ZT0
MOVT_rzt 1100 0000 0100 1100 0 off:3 00 11111 rt:5
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 21/61] target/arm: Split out get_zarray
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (19 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 20/61] target/arm: Implement SME2 MOVA to/from tile, multiple registers Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 22/61] target/arm: Implement SME2 MOVA to/from array, multiple registers Richard Henderson
` (40 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Prepare for MOVA array to/from vector with multiple registers
by adding a div_len parameter, herein always 1.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 40 ++++++++++++++++++----------------
1 file changed, 21 insertions(+), 19 deletions(-)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index eed9345651..a818a549cb 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -43,7 +43,7 @@ static bool sme2_zt0_enabled_check(DisasContext *s)
/* Resolve tile.size[rs+imm] to a host pointer. */
static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
- int tile, int imm, bool vertical)
+ int tile, int imm, int div_len, bool vertical)
{
int pos, len, offset;
TCGv_i32 tmp;
@@ -55,7 +55,7 @@ static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
tcg_gen_addi_i32(tmp, tmp, imm);
/* Prepare a power-of-two modulo via extraction of @len bits. */
- len = ctz32(streaming_vec_reg_size(s)) - esz;
+ len = ctz32(streaming_vec_reg_size(s) / div_len) - esz;
if (!len) {
/*
@@ -111,6 +111,13 @@ static TCGv_ptr get_tile_rowcol(DisasContext *s, int esz, int rs,
return addr;
}
+/* Resolve ZArray[rs+imm] to a host pointer. */
+static TCGv_ptr get_zarray(DisasContext *s, int rs, int imm, int div_len)
+{
+ /* ZA[n] equates to ZA0H.B[n]. */
+ return get_tile_rowcol(s, MO_8, rs, 0, imm, div_len, false);
+}
+
/*
* Resolve tile.size[0] to a host pointer.
* Used by e.g. outer product insns where we require the entire tile.
@@ -177,7 +184,7 @@ static bool do_mova_tile(DisasContext *s, arg_mova_p *a, bool to_vec)
return true;
}
- t_za = get_tile_rowcol(s, a->esz, a->rs, a->za, a->off, a->v);
+ t_za = get_tile_rowcol(s, a->esz, a->rs, a->za, a->off, 1, a->v);
t_zr = vec_full_reg_ptr(s, a->zr);
t_pg = pred_full_reg_ptr(s, a->pg);
@@ -234,7 +241,7 @@ static bool do_mova_tile_n(DisasContext *s, arg_mova_t *a, int n, bool to_vec)
for (int i = 0; i < n; ++i) {
TCGv_ptr t_zr = vec_full_reg_ptr(s, a->zr * n + i);
t_za = get_tile_rowcol(s, a->esz, a->rs, a->za,
- a->off * n + i, a->v);
+ a->off * n + i, 1, a->v);
if (to_vec) {
zc_fns[a->esz](t_zr, t_za, t_desc);
} else {
@@ -243,13 +250,13 @@ static bool do_mova_tile_n(DisasContext *s, arg_mova_t *a, int n, bool to_vec)
}
} else {
for (int i = 0; i < n; ++i) {
- int zr_ofs = vec_full_reg_offset(s, a->zr * n + i);
+ int o_zr = vec_full_reg_offset(s, a->zr * n + i);
t_za = get_tile_rowcol(s, a->esz, a->rs, a->za,
- a->off * n + i, a->v);
+ a->off * n + i, 1, a->v);
if (to_vec) {
- tcg_gen_gvec_mov_var(MO_8, tcg_env, zr_ofs, t_za, 0, svl, svl);
+ tcg_gen_gvec_mov_var(MO_8, tcg_env, o_zr, t_za, 0, svl, svl);
} else {
- tcg_gen_gvec_mov_var(MO_8, t_za, 0, tcg_env, zr_ofs, svl, svl);
+ tcg_gen_gvec_mov_var(MO_8, t_za, 0, tcg_env, o_zr, svl, svl);
}
}
}
@@ -315,7 +322,7 @@ static bool trans_LDST1(DisasContext *s, arg_LDST1 *a)
return true;
}
- t_za = get_tile_rowcol(s, a->esz, a->rs, a->za, a->off, a->v);
+ t_za = get_tile_rowcol(s, a->esz, a->rs, a->za, a->off, 1, a->v);
t_pg = pred_full_reg_ptr(s, a->pg);
addr = tcg_temp_new_i64();
@@ -337,18 +344,13 @@ typedef void GenLdStR(DisasContext *, TCGv_ptr, int, int, int, int);
static bool do_ldst_r(DisasContext *s, arg_ldstr *a, GenLdStR *fn)
{
- int svl = streaming_vec_reg_size(s);
- int imm = a->imm;
- TCGv_ptr base;
+ if (sme_za_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ int imm = a->imm;
+ TCGv_ptr base = get_zarray(s, a->rv, imm, 1);
- if (!sme_za_enabled_check(s)) {
- return true;
+ fn(s, base, 0, svl, a->rn, imm * svl);
}
-
- /* ZA[n] equates to ZA0H.B[n]. */
- base = get_tile_rowcol(s, MO_8, a->rv, 0, imm, false);
-
- fn(s, base, 0, svl, a->rn, imm * svl);
return true;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 22/61] target/arm: Implement SME2 MOVA to/from array, multiple registers
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (20 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 21/61] target/arm: Split out get_zarray Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 23/61] target/arm: Implement SME2 BMOPA Richard Henderson
` (39 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate.h | 5 +++++
target/arm/tcg/translate-sme.c | 30 ++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 12 ++++++++++++
3 files changed, 47 insertions(+)
diff --git a/target/arm/tcg/translate.h b/target/arm/tcg/translate.h
index 3021902ce1..c364d977f3 100644
--- a/target/arm/tcg/translate.h
+++ b/target/arm/tcg/translate.h
@@ -206,6 +206,11 @@ static inline int plus_2(DisasContext *s, int x)
return x + 2;
}
+static inline int plus_8(DisasContext *s, int x)
+{
+ return x + 8;
+}
+
static inline int plus_12(DisasContext *s, int x)
{
return x + 12;
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index a818a549cb..8a901f158c 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -268,6 +268,36 @@ TRANS_FEAT(MOVA_tz4, aa64_sme2, do_mova_tile_n, a, 4, false)
TRANS_FEAT(MOVA_zt2, aa64_sme2, do_mova_tile_n, a, 2, true)
TRANS_FEAT(MOVA_zt4, aa64_sme2, do_mova_tile_n, a, 4, true)
+static bool do_mova_array_n(DisasContext *s, arg_mova_a *a, int n, bool to_vec)
+{
+ TCGv_ptr t_za;
+ int svl;
+
+ if (!sme_smza_enabled_check(s)) {
+ return true;
+ }
+
+ svl = streaming_vec_reg_size(s);
+ t_za = get_zarray(s, a->rv, a->off, n);
+
+ for (int i = 0; i < n; ++i) {
+ int o_za = (svl / n * sizeof(ARMVectorReg)) * i;
+ int o_zr = vec_full_reg_offset(s, a->zr * n + i);
+
+ if (to_vec) {
+ tcg_gen_gvec_mov_var(MO_8, tcg_env, o_zr, t_za, o_za, svl, svl);
+ } else {
+ tcg_gen_gvec_mov_var(MO_8, t_za, o_za, tcg_env, o_zr, svl, svl);
+ }
+ }
+ return true;
+}
+
+TRANS_FEAT(MOVA_az2, aa64_sme2, do_mova_array_n, a, 2, false)
+TRANS_FEAT(MOVA_az4, aa64_sme2, do_mova_array_n, a, 4, false)
+TRANS_FEAT(MOVA_za2, aa64_sme2, do_mova_array_n, a, 2, true)
+TRANS_FEAT(MOVA_za4, aa64_sme2, do_mova_array_n, a, 4, true)
+
static bool do_movt(DisasContext *s, arg_MOVT_rzt *a,
void (*func)(TCGv_i64, TCGv_ptr, tcg_target_long))
{
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 5eca5f4acf..37bd0c6198 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -27,6 +27,8 @@ ZERO_zt0 11000000 01 001 00000000000 00000001
### SME Move into/from Array
%mova_rs 13:2 !function=plus_12
+%mova_rv 13:2 !function=plus_8
+&mova_a rv zr off
&mova_p esz rs pg zr za off v:bool
&mova_t esz rs zr za off v:bool
@@ -88,6 +90,16 @@ MOVA_zt4 11000000 10 00011 0 v:1 .. 001 000 za:2 zr:3 00 \
MOVA_zt4 11000000 11 00011 0 v:1 .. 001 00 za:3 zr:3 00 \
&mova_t rs=%mova_rs esz=3 off=0
+MOVA_az2 11000000 00 00010 00 .. 010 zr:4 000 off:3 \
+ &mova_a rv=%mova_rv
+MOVA_az4 11000000 00 00010 00 .. 011 zr:3 0000 off:3 \
+ &mova_a rv=%mova_rv
+
+MOVA_za2 11000000 00 00011 00 .. 010 00 off:3 zr:4 0 \
+ &mova_a rv=%mova_rv
+MOVA_za4 11000000 00 00011 00 .. 011 00 off:3 zr:3 00 \
+ &mova_a rv=%mova_rv
+
### SME Move into/from ZT0
MOVT_rzt 1100 0000 0100 1100 0 off:3 00 11111 rt:5
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 23/61] target/arm: Implement SME2 BMOPA
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (21 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 22/61] target/arm: Implement SME2 MOVA to/from array, multiple registers Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 24/61] target/arm: Implement SME2 SMOPS, UMOPS (2-way) Richard Henderson
` (38 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 3 +++
target/arm/tcg/sme_helper.c | 34 ++++++++++++++++++++++++----------
target/arm/tcg/translate-sme.c | 2 ++
target/arm/tcg/sme.decode | 2 ++
4 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 8246ce774c..17d1a7c102 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -153,3 +153,6 @@ DEF_HELPER_FLAGS_6(sme_sumopa_d, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_6(sme_usmopa_d, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_6(sme2_bmopa_s, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index ee81bece12..6d8e2cedc9 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1267,17 +1267,31 @@ DEF_IMOP_64(umopa_d, uint16_t, uint16_t)
DEF_IMOP_64(sumopa_d, int16_t, uint16_t)
DEF_IMOP_64(usmopa_d, uint16_t, int16_t)
-#define DEF_IMOPH(NAME, S) \
- void HELPER(sme_##NAME##_##S)(void *vza, void *vzn, void *vzm, \
+#define DEF_IMOPH(P, NAME, S) \
+ void HELPER(P##_##NAME##_##S)(void *vza, void *vzn, void *vzm, \
void *vpn, void *vpm, uint32_t desc) \
{ do_imopa_##S(vza, vzn, vzm, vpn, vpm, desc, NAME##_##S); }
-DEF_IMOPH(smopa, s)
-DEF_IMOPH(umopa, s)
-DEF_IMOPH(sumopa, s)
-DEF_IMOPH(usmopa, s)
+DEF_IMOPH(sme, smopa, s)
+DEF_IMOPH(sme, umopa, s)
+DEF_IMOPH(sme, sumopa, s)
+DEF_IMOPH(sme, usmopa, s)
-DEF_IMOPH(smopa, d)
-DEF_IMOPH(umopa, d)
-DEF_IMOPH(sumopa, d)
-DEF_IMOPH(usmopa, d)
+DEF_IMOPH(sme, smopa, d)
+DEF_IMOPH(sme, umopa, d)
+DEF_IMOPH(sme, sumopa, d)
+DEF_IMOPH(sme, usmopa, d)
+
+static uint32_t bmopa_s(uint32_t n, uint32_t m, uint32_t a, uint8_t p, bool neg)
+{
+ uint32_t sum = ctpop32(~(n ^ m));
+ if (neg) {
+ sum = -sum;
+ }
+ if (!(p & 1)) {
+ sum = 0;
+ }
+ return a + sum;
+}
+
+DEF_IMOPH(sme2, bmopa, s)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 8a901f158c..25139cb7aa 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -507,3 +507,5 @@ TRANS_FEAT(SMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_smopa_
TRANS_FEAT(UMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_umopa_d)
TRANS_FEAT(SUMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_sumopa_d)
TRANS_FEAT(USMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_usmopa_d)
+
+TRANS_FEAT(BMOPA, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_bmopa_s)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 37bd0c6198..de8d04cb87 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -165,3 +165,5 @@ SMOPA_d 1010000 0 11 0 ..... ... ... ..... . 0 ... @op_64
SUMOPA_d 1010000 0 11 1 ..... ... ... ..... . 0 ... @op_64
USMOPA_d 1010000 1 11 0 ..... ... ... ..... . 0 ... @op_64
UMOPA_d 1010000 1 11 1 ..... ... ... ..... . 0 ... @op_64
+
+BMOPA 1000000 0 10 0 ..... ... ... ..... . 10 .. @op_32
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 24/61] target/arm: Implement SME2 SMOPS, UMOPS (2-way)
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (22 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 23/61] target/arm: Implement SME2 BMOPA Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 25/61] target/arm: Introduce gen_gvec_sve2_sqdmulh Richard Henderson
` (37 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 4 ++++
target/arm/tcg/sme_helper.c | 37 +++++++++++++++++++++++++---------
target/arm/tcg/translate-sme.c | 2 ++
target/arm/tcg/sme.decode | 2 ++
4 files changed, 35 insertions(+), 10 deletions(-)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 17d1a7c102..ecd06f2cd1 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -156,3 +156,7 @@ DEF_HELPER_FLAGS_6(sme_usmopa_d, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_6(sme2_bmopa_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(sme2_smopa2_s, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(sme2_umopa2_s, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 6d8e2cedc9..d973d83957 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1231,7 +1231,7 @@ static inline void do_imopa_d(uint64_t *za, uint64_t *zn, uint64_t *zm,
}
}
-#define DEF_IMOP_32(NAME, NTYPE, MTYPE) \
+#define DEF_IMOP_8x4_32(NAME, NTYPE, MTYPE) \
static uint32_t NAME(uint32_t n, uint32_t m, uint32_t a, uint8_t p, bool neg) \
{ \
uint32_t sum = 0; \
@@ -1244,7 +1244,7 @@ static uint32_t NAME(uint32_t n, uint32_t m, uint32_t a, uint8_t p, bool neg) \
return neg ? a - sum : a + sum; \
}
-#define DEF_IMOP_64(NAME, NTYPE, MTYPE) \
+#define DEF_IMOP_16x4_64(NAME, NTYPE, MTYPE) \
static uint64_t NAME(uint64_t n, uint64_t m, uint64_t a, uint8_t p, bool neg) \
{ \
uint64_t sum = 0; \
@@ -1257,15 +1257,15 @@ static uint64_t NAME(uint64_t n, uint64_t m, uint64_t a, uint8_t p, bool neg) \
return neg ? a - sum : a + sum; \
}
-DEF_IMOP_32(smopa_s, int8_t, int8_t)
-DEF_IMOP_32(umopa_s, uint8_t, uint8_t)
-DEF_IMOP_32(sumopa_s, int8_t, uint8_t)
-DEF_IMOP_32(usmopa_s, uint8_t, int8_t)
+DEF_IMOP_8x4_32(smopa_s, int8_t, int8_t)
+DEF_IMOP_8x4_32(umopa_s, uint8_t, uint8_t)
+DEF_IMOP_8x4_32(sumopa_s, int8_t, uint8_t)
+DEF_IMOP_8x4_32(usmopa_s, uint8_t, int8_t)
-DEF_IMOP_64(smopa_d, int16_t, int16_t)
-DEF_IMOP_64(umopa_d, uint16_t, uint16_t)
-DEF_IMOP_64(sumopa_d, int16_t, uint16_t)
-DEF_IMOP_64(usmopa_d, uint16_t, int16_t)
+DEF_IMOP_16x4_64(smopa_d, int16_t, int16_t)
+DEF_IMOP_16x4_64(umopa_d, uint16_t, uint16_t)
+DEF_IMOP_16x4_64(sumopa_d, int16_t, uint16_t)
+DEF_IMOP_16x4_64(usmopa_d, uint16_t, int16_t)
#define DEF_IMOPH(P, NAME, S) \
void HELPER(P##_##NAME##_##S)(void *vza, void *vzn, void *vzm, \
@@ -1295,3 +1295,20 @@ static uint32_t bmopa_s(uint32_t n, uint32_t m, uint32_t a, uint8_t p, bool neg)
}
DEF_IMOPH(sme2, bmopa, s)
+
+#define DEF_IMOP_16x2_32(NAME, NTYPE, MTYPE) \
+static uint32_t NAME(uint32_t n, uint32_t m, uint32_t a, uint8_t p, bool neg) \
+{ \
+ uint32_t sum = 0; \
+ /* Apply P to N as a mask, making the inactive elements 0. */ \
+ n &= expand_pred_h(p); \
+ sum += (NTYPE)(n >> 0) * (MTYPE)(m >> 0); \
+ sum += (NTYPE)(n >> 16) * (MTYPE)(m >> 16); \
+ return neg ? a - sum : a + sum; \
+}
+
+DEF_IMOP_16x2_32(smopa2_s, int16_t, int16_t)
+DEF_IMOP_16x2_32(umopa2_s, uint16_t, uint16_t)
+
+DEF_IMOPH(sme2, smopa2, s)
+DEF_IMOPH(sme2, umopa2, s)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 25139cb7aa..57c7aacb6d 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -509,3 +509,5 @@ TRANS_FEAT(SUMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_sumop
TRANS_FEAT(USMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_usmopa_d)
TRANS_FEAT(BMOPA, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_bmopa_s)
+TRANS_FEAT(SMOPA2_s, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_smopa2_s)
+TRANS_FEAT(UMOPA2_s, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_umopa2_s)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index de8d04cb87..36f369d02a 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -167,3 +167,5 @@ USMOPA_d 1010000 1 11 0 ..... ... ... ..... . 0 ... @op_64
UMOPA_d 1010000 1 11 1 ..... ... ... ..... . 0 ... @op_64
BMOPA 1000000 0 10 0 ..... ... ... ..... . 10 .. @op_32
+SMOPA2_s 1010000 0 10 0 ..... ... ... ..... . 10 .. @op_32
+UMOPA2_s 1010000 1 10 0 ..... ... ... ..... . 10 .. @op_32
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 25/61] target/arm: Introduce gen_gvec_sve2_sqdmulh
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (23 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 24/61] target/arm: Implement SME2 SMOPS, UMOPS (2-way) Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 26/61] target/arm: Implement SME2 Multiple and Single SVE Destructive Richard Henderson
` (36 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
To be used by both SVE2 and SME2.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-a64.h | 4 ++++
target/arm/tcg/gengvec64.c | 11 +++++++++++
target/arm/tcg/translate-sve.c | 8 +-------
3 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/target/arm/tcg/translate-a64.h b/target/arm/tcg/translate-a64.h
index 7d3b59ccd9..481dfeb965 100644
--- a/target/arm/tcg/translate-a64.h
+++ b/target/arm/tcg/translate-a64.h
@@ -225,6 +225,10 @@ void gen_gvec_usqadd_qc(unsigned vece, uint32_t rd_ofs,
uint32_t rn_ofs, uint32_t rm_ofs,
uint32_t opr_sz, uint32_t max_sz);
+void gen_gvec_sve2_sqdmulh(unsigned vece, uint32_t rd_ofs,
+ uint32_t rn_ofs, uint32_t rm_ofs,
+ uint32_t opr_sz, uint32_t max_sz);
+
void gen_sve_ldr(DisasContext *s, TCGv_ptr, int vofs, int len, int rn, int imm);
void gen_sve_str(DisasContext *s, TCGv_ptr, int vofs, int len, int rn, int imm);
diff --git a/target/arm/tcg/gengvec64.c b/target/arm/tcg/gengvec64.c
index 2617cde0a5..2429cab1b8 100644
--- a/target/arm/tcg/gengvec64.c
+++ b/target/arm/tcg/gengvec64.c
@@ -369,3 +369,14 @@ void gen_gvec_usqadd_qc(unsigned vece, uint32_t rd_ofs,
tcg_gen_gvec_4(rd_ofs, offsetof(CPUARMState, vfp.qc),
rn_ofs, rm_ofs, opr_sz, max_sz, &ops[vece]);
}
+
+void gen_gvec_sve2_sqdmulh(unsigned vece, uint32_t rd_ofs,
+ uint32_t rn_ofs, uint32_t rm_ofs,
+ uint32_t opr_sz, uint32_t max_sz)
+{
+ static gen_helper_gvec_3 * const fns[4] = {
+ gen_helper_sve2_sqdmulh_b, gen_helper_sve2_sqdmulh_h,
+ gen_helper_sve2_sqdmulh_s, gen_helper_sve2_sqdmulh_d,
+ };
+ tcg_gen_gvec_3_ool(rd_ofs, rn_ofs, rm_ofs, opr_sz, max_sz, 0, fns[vece]);
+}
diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c
index d23be477b4..0907a4e9e9 100644
--- a/target/arm/tcg/translate-sve.c
+++ b/target/arm/tcg/translate-sve.c
@@ -5911,6 +5911,7 @@ TRANS_FEAT(MOVPRFX_z, aa64_sve, do_movz_zpz, a->rd, a->rn, a->pg, a->esz, false)
*/
TRANS_FEAT(MUL_zzz, aa64_sve2, gen_gvec_fn_arg_zzz, tcg_gen_gvec_mul, a)
+TRANS_FEAT(SQDMULH_zzz, aa64_sve2, gen_gvec_fn_arg_zzz, gen_gvec_sve2_sqdmulh, a)
static gen_helper_gvec_3 * const smulh_zzz_fns[4] = {
gen_helper_gvec_smulh_b, gen_helper_gvec_smulh_h,
@@ -5929,13 +5930,6 @@ TRANS_FEAT(UMULH_zzz, aa64_sve2, gen_gvec_ool_arg_zzz,
TRANS_FEAT(PMUL_zzz, aa64_sve2, gen_gvec_ool_arg_zzz,
gen_helper_gvec_pmul_b, a, 0)
-static gen_helper_gvec_3 * const sqdmulh_zzz_fns[4] = {
- gen_helper_sve2_sqdmulh_b, gen_helper_sve2_sqdmulh_h,
- gen_helper_sve2_sqdmulh_s, gen_helper_sve2_sqdmulh_d,
-};
-TRANS_FEAT(SQDMULH_zzz, aa64_sve2, gen_gvec_ool_arg_zzz,
- sqdmulh_zzz_fns[a->esz], a, 0)
-
static gen_helper_gvec_3 * const sqrdmulh_zzz_fns[4] = {
gen_helper_sve2_sqrdmulh_b, gen_helper_sve2_sqrdmulh_h,
gen_helper_sve2_sqrdmulh_s, gen_helper_sve2_sqrdmulh_d,
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 26/61] target/arm: Implement SME2 Multiple and Single SVE Destructive
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (24 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 25/61] target/arm: Introduce gen_gvec_sve2_sqdmulh Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 27/61] target/arm: Implement SME2 Multiple Vectors " Richard Henderson
` (35 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 13 ++++
target/arm/tcg/vec_internal.h | 4 ++
target/arm/tcg/helper-a64.c | 2 +
target/arm/tcg/translate-sme.c | 115 +++++++++++++++++++++++++++++++++
target/arm/tcg/vec_helper.c | 7 ++
target/arm/tcg/sme.decode | 40 ++++++++++++
6 files changed, 181 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index ecd06f2cd1..cdd7058aed 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -160,3 +160,16 @@ DEF_HELPER_FLAGS_6(sme2_smopa2_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_6(sme2_umopa2_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_5(gvec_fmax_b16, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fmin_b16, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_ah_fmax_b16, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_ah_fmin_b16, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fmaxnum_b16, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fminnum_b16, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/vec_internal.h b/target/arm/tcg/vec_internal.h
index 6b93b5aeb9..205f85b8d3 100644
--- a/target/arm/tcg/vec_internal.h
+++ b/target/arm/tcg/vec_internal.h
@@ -300,4 +300,8 @@ static inline float64 float64_maybe_ah_chs(float64 a, bool fpcr_ah)
return fpcr_ah && float64_is_any_nan(a) ? a : float64_chs(a);
}
+/* Not actually called directly as a helper, but uses similar machinery. */
+bfloat16 helper_sme2_ah_fmax_b16(bfloat16 a, bfloat16 b, float_status *fpst);
+bfloat16 helper_sme2_ah_fmin_b16(bfloat16 a, bfloat16 b, float_status *fpst);
+
#endif /* TARGET_ARM_VEC_INTERNAL_H */
diff --git a/target/arm/tcg/helper-a64.c b/target/arm/tcg/helper-a64.c
index 32f0647ca4..9685ef65a0 100644
--- a/target/arm/tcg/helper-a64.c
+++ b/target/arm/tcg/helper-a64.c
@@ -399,6 +399,8 @@ AH_MINMAX_HELPER(vfp_ah_mind, float64, float64, min)
AH_MINMAX_HELPER(vfp_ah_maxh, dh_ctype_f16, float16, max)
AH_MINMAX_HELPER(vfp_ah_maxs, float32, float32, max)
AH_MINMAX_HELPER(vfp_ah_maxd, float64, float64, max)
+AH_MINMAX_HELPER(sme2_ah_fmax_b16, bfloat16, bfloat16, max)
+AH_MINMAX_HELPER(sme2_ah_fmin_b16, bfloat16, bfloat16, min)
/* 64-bit versions of the CRC helpers. Note that although the operation
* (and the prototypes of crc32c() and crc32() mean that only the bottom
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 57c7aacb6d..0e05153924 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -511,3 +511,118 @@ TRANS_FEAT(USMOPA_d, aa64_sme_i16i64, do_outprod, a, MO_64, gen_helper_sme_usmop
TRANS_FEAT(BMOPA, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_bmopa_s)
TRANS_FEAT(SMOPA2_s, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_smopa2_s)
TRANS_FEAT(UMOPA2_s, aa64_sme2, do_outprod, a, MO_32, gen_helper_sme2_umopa2_s)
+
+static bool do_z2z_n1(DisasContext *s, arg_z2z_en *a, GVecGen3Fn *fn)
+{
+ int esz, dn, vsz, mofs, n;
+ bool overlap = false;
+
+ if (!sme_sm_enabled_check(s)) {
+ return true;
+ }
+
+ esz = a->esz;
+ n = a->n;
+ dn = a->zdn;
+ mofs = vec_full_reg_offset(s, a->zm);
+ vsz = streaming_vec_reg_size(s);
+
+ for (int i = 0; i < n; i++) {
+ int dofs = vec_full_reg_offset(s, dn + i);
+ if (dofs == mofs) {
+ overlap = true;
+ } else {
+ fn(esz, dofs, dofs, mofs, vsz, vsz);
+ }
+ }
+ if (overlap) {
+ fn(esz, mofs, mofs, mofs, vsz, vsz);
+ }
+ return true;
+}
+
+TRANS_FEAT(ADD_n1, aa64_sme2, do_z2z_n1, a, tcg_gen_gvec_add)
+TRANS_FEAT(SMAX_n1, aa64_sme2, do_z2z_n1, a, tcg_gen_gvec_smax)
+TRANS_FEAT(SMIN_n1, aa64_sme2, do_z2z_n1, a, tcg_gen_gvec_smin)
+TRANS_FEAT(UMAX_n1, aa64_sme2, do_z2z_n1, a, tcg_gen_gvec_umax)
+TRANS_FEAT(UMIN_n1, aa64_sme2, do_z2z_n1, a, tcg_gen_gvec_umin)
+TRANS_FEAT(SRSHL_n1, aa64_sme2, do_z2z_n1, a, gen_gvec_srshl)
+TRANS_FEAT(URSHL_n1, aa64_sme2, do_z2z_n1, a, gen_gvec_urshl)
+TRANS_FEAT(SQDMULH_n1, aa64_sme2, do_z2z_n1, a, gen_gvec_sve2_sqdmulh)
+
+static bool do_z2z_n1_fpst(DisasContext *s, arg_z2z_en *a,
+ gen_helper_gvec_3_ptr * const fns[4])
+{
+ int esz = a->esz, n, dn, vsz, mofs;
+ bool overlap = false;
+ gen_helper_gvec_3_ptr *fn;
+ TCGv_ptr fpst;
+
+ /* These insns use MO_8 to encode BFloat16. */
+ if (esz == MO_8 && !dc_isar_feature(aa64_sme2_b16b16, s)) {
+ return false;
+ }
+ if (!sme_sm_enabled_check(s)) {
+ return true;
+ }
+
+ fpst = fpstatus_ptr(esz == MO_16 ? FPST_A64_F16 : FPST_A64);
+ fn = fns[esz];
+ n = a->n;
+ dn = a->zdn;
+ mofs = vec_full_reg_offset(s, a->zm);
+ vsz = streaming_vec_reg_size(s);
+
+ for (int i = 0; i < n; i++) {
+ int dofs = vec_full_reg_offset(s, dn + i);
+ if (dofs == mofs) {
+ overlap = true;
+ } else {
+ tcg_gen_gvec_3_ptr(dofs, dofs, mofs, fpst, vsz, vsz, 0, fn);
+ }
+ }
+ if (overlap) {
+ tcg_gen_gvec_3_ptr(mofs, mofs, mofs, fpst, vsz, vsz, 0, fn);
+ }
+ return true;
+}
+
+static gen_helper_gvec_3_ptr * const f_vector_fmax[2][4] = {
+ { gen_helper_gvec_fmax_b16,
+ gen_helper_gvec_fmax_h,
+ gen_helper_gvec_fmax_s,
+ gen_helper_gvec_fmax_d },
+ { gen_helper_gvec_ah_fmax_b16,
+ gen_helper_gvec_ah_fmax_h,
+ gen_helper_gvec_ah_fmax_s,
+ gen_helper_gvec_ah_fmax_d },
+};
+TRANS_FEAT(FMAX_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fmax[s->fpcr_ah])
+
+static gen_helper_gvec_3_ptr * const f_vector_fmin[2][4] = {
+ { gen_helper_gvec_fmin_b16,
+ gen_helper_gvec_fmin_h,
+ gen_helper_gvec_fmin_s,
+ gen_helper_gvec_fmin_d },
+ { gen_helper_gvec_ah_fmin_b16,
+ gen_helper_gvec_ah_fmin_h,
+ gen_helper_gvec_ah_fmin_s,
+ gen_helper_gvec_ah_fmin_d },
+};
+TRANS_FEAT(FMIN_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fmin[s->fpcr_ah])
+
+static gen_helper_gvec_3_ptr * const f_vector_fmaxnm[4] = {
+ gen_helper_gvec_fmaxnum_b16,
+ gen_helper_gvec_fmaxnum_h,
+ gen_helper_gvec_fmaxnum_s,
+ gen_helper_gvec_fmaxnum_d,
+};
+TRANS_FEAT(FMAXNM_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fmaxnm)
+
+static gen_helper_gvec_3_ptr * const f_vector_fminnm[4] = {
+ gen_helper_gvec_fminnum_b16,
+ gen_helper_gvec_fminnum_h,
+ gen_helper_gvec_fminnum_s,
+ gen_helper_gvec_fminnum_d,
+};
+TRANS_FEAT(FMINNM_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fminnm)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 986eaf8ffa..671777ce52 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -1515,6 +1515,13 @@ DO_3OP(gvec_ah_fmin_h, helper_vfp_ah_minh, float16)
DO_3OP(gvec_ah_fmin_s, helper_vfp_ah_mins, float32)
DO_3OP(gvec_ah_fmin_d, helper_vfp_ah_mind, float64)
+DO_3OP(gvec_fmax_b16, bfloat16_max, bfloat16)
+DO_3OP(gvec_fmin_b16, bfloat16_min, bfloat16)
+DO_3OP(gvec_fmaxnum_b16, bfloat16_maxnum, bfloat16)
+DO_3OP(gvec_fminnum_b16, bfloat16_minnum, bfloat16)
+DO_3OP(gvec_ah_fmax_b16, helper_sme2_ah_fmax_b16, bfloat16)
+DO_3OP(gvec_ah_fmin_b16, helper_sme2_ah_fmin_b16, bfloat16)
+
#endif
#undef DO_3OP
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 36f369d02a..005f87777b 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -169,3 +169,43 @@ UMOPA_d 1010000 1 11 1 ..... ... ... ..... . 0 ... @op_64
BMOPA 1000000 0 10 0 ..... ... ... ..... . 10 .. @op_32
SMOPA2_s 1010000 0 10 0 ..... ... ... ..... . 10 .. @op_32
UMOPA2_s 1010000 1 10 0 ..... ... ... ..... . 10 .. @op_32
+
+### SME2 Multi-vector Multiple and Single SVE Destructive
+
+%zd_ax2 1:4 !function=times_2
+%zd_ax4 2:3 !function=times_4
+
+&z2z_en zdn zm esz n
+@z2z_2x1 ....... . esz:2 .. zm:4 ....0. ..... .... . \
+ &z2z_en n=2 zdn=%zd_ax2
+@z2z_4x1 ....... . esz:2 .. zm:4 ....1. ..... ...0 . \
+ &z2z_en n=4 zdn=%zd_ax4
+
+SMAX_n1 1100000 1 .. 10 .... 1010.0 00000 .... 0 @z2z_2x1
+SMAX_n1 1100000 1 .. 10 .... 1010.0 00000 .... 0 @z2z_4x1
+UMAX_n1 1100000 1 .. 10 .... 1010.0 00000 .... 1 @z2z_2x1
+UMAX_n1 1100000 1 .. 10 .... 1010.0 00000 .... 1 @z2z_4x1
+SMIN_n1 1100000 1 .. 10 .... 1010.0 00001 .... 0 @z2z_2x1
+SMIN_n1 1100000 1 .. 10 .... 1010.0 00001 .... 0 @z2z_4x1
+UMIN_n1 1100000 1 .. 10 .... 1010.0 00001 .... 1 @z2z_2x1
+UMIN_n1 1100000 1 .. 10 .... 1010.0 00001 .... 1 @z2z_4x1
+
+FMAX_n1 1100000 1 .. 10 .... 1010.0 01000 .... 0 @z2z_2x1
+FMAX_n1 1100000 1 .. 10 .... 1010.0 01000 .... 0 @z2z_4x1
+FMIN_n1 1100000 1 .. 10 .... 1010.0 01000 .... 1 @z2z_2x1
+FMIN_n1 1100000 1 .. 10 .... 1010.0 01000 .... 1 @z2z_4x1
+FMAXNM_n1 1100000 1 .. 10 .... 1010.0 01001 .... 0 @z2z_2x1
+FMAXNM_n1 1100000 1 .. 10 .... 1010.0 01001 .... 0 @z2z_4x1
+FMINNM_n1 1100000 1 .. 10 .... 1010.0 01001 .... 1 @z2z_2x1
+FMINNM_n1 1100000 1 .. 10 .... 1010.0 01001 .... 1 @z2z_4x1
+
+SRSHL_n1 1100000 1 .. 10 .... 1010.0 10001 .... 0 @z2z_2x1
+SRSHL_n1 1100000 1 .. 10 .... 1010.0 10001 .... 0 @z2z_4x1
+URSHL_n1 1100000 1 .. 10 .... 1010.0 10001 .... 1 @z2z_2x1
+URSHL_n1 1100000 1 .. 10 .... 1010.0 10001 .... 1 @z2z_4x1
+
+ADD_n1 1100000 1 .. 10 .... 1010.0 11000 .... 0 @z2z_2x1
+ADD_n1 1100000 1 .. 10 .... 1010.0 11000 .... 0 @z2z_4x1
+
+SQDMULH_n1 1100000 1 .. 10 .... 1010.1 00000 .... 0 @z2z_2x1
+SQDMULH_n1 1100000 1 .. 10 .... 1010.1 00000 .... 0 @z2z_4x1
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 27/61] target/arm: Implement SME2 Multiple Vectors SVE Destructive
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (25 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 26/61] target/arm: Implement SME2 Multiple and Single SVE Destructive Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 28/61] target/arm: Implement SME2 ADD/SUB (array results, multiple and single vector) Richard Henderson
` (34 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 65 ++++++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 36 +++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 0e05153924..617621d663 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -550,6 +550,37 @@ TRANS_FEAT(SRSHL_n1, aa64_sme2, do_z2z_n1, a, gen_gvec_srshl)
TRANS_FEAT(URSHL_n1, aa64_sme2, do_z2z_n1, a, gen_gvec_urshl)
TRANS_FEAT(SQDMULH_n1, aa64_sme2, do_z2z_n1, a, gen_gvec_sve2_sqdmulh)
+static bool do_z2z_nn(DisasContext *s, arg_z2z_en *a, GVecGen3Fn *fn)
+{
+ int esz, dn, dm, vsz, n;
+
+ if (!sme_sm_enabled_check(s)) {
+ return true;
+ }
+
+ esz = a->esz;
+ n = a->n;
+ dn = a->zdn;
+ dm = a->zm;
+ vsz = streaming_vec_reg_size(s);
+
+ for (int i = 0; i < n; i++) {
+ int dofs = vec_full_reg_offset(s, dn + i);
+ int mofs = vec_full_reg_offset(s, dm + i);
+
+ fn(esz, dofs, dofs, mofs, vsz, vsz);
+ }
+ return true;
+}
+
+TRANS_FEAT(SMAX_nn, aa64_sme2, do_z2z_nn, a, tcg_gen_gvec_smax)
+TRANS_FEAT(SMIN_nn, aa64_sme2, do_z2z_nn, a, tcg_gen_gvec_smin)
+TRANS_FEAT(UMAX_nn, aa64_sme2, do_z2z_nn, a, tcg_gen_gvec_umax)
+TRANS_FEAT(UMIN_nn, aa64_sme2, do_z2z_nn, a, tcg_gen_gvec_umin)
+TRANS_FEAT(SRSHL_nn, aa64_sme2, do_z2z_nn, a, gen_gvec_srshl)
+TRANS_FEAT(URSHL_nn, aa64_sme2, do_z2z_nn, a, gen_gvec_urshl)
+TRANS_FEAT(SQDMULH_nn, aa64_sme2, do_z2z_nn, a, gen_gvec_sve2_sqdmulh)
+
static bool do_z2z_n1_fpst(DisasContext *s, arg_z2z_en *a,
gen_helper_gvec_3_ptr * const fns[4])
{
@@ -587,6 +618,36 @@ static bool do_z2z_n1_fpst(DisasContext *s, arg_z2z_en *a,
return true;
}
+static bool do_z2z_nn_fpst(DisasContext *s, arg_z2z_en *a,
+ gen_helper_gvec_3_ptr * const fns[4])
+{
+ int esz = a->esz, n, dn, dm, vsz;
+ gen_helper_gvec_3_ptr *fn;
+ TCGv_ptr fpst;
+
+ if (esz == MO_8 && !dc_isar_feature(aa64_sme2_b16b16, s)) {
+ return false;
+ }
+ if (!sme_sm_enabled_check(s)) {
+ return true;
+ }
+
+ fpst = fpstatus_ptr(esz == MO_16 ? FPST_A64_F16 : FPST_A64);
+ fn = fns[esz];
+ n = a->n;
+ dn = a->zdn;
+ dm = a->zm;
+ vsz = streaming_vec_reg_size(s);
+
+ for (int i = 0; i < n; i++) {
+ int dofs = vec_full_reg_offset(s, dn + i);
+ int mofs = vec_full_reg_offset(s, dm + i);
+
+ tcg_gen_gvec_3_ptr(dofs, dofs, mofs, fpst, vsz, vsz, 0, fn);
+ }
+ return true;
+}
+
static gen_helper_gvec_3_ptr * const f_vector_fmax[2][4] = {
{ gen_helper_gvec_fmax_b16,
gen_helper_gvec_fmax_h,
@@ -598,6 +659,7 @@ static gen_helper_gvec_3_ptr * const f_vector_fmax[2][4] = {
gen_helper_gvec_ah_fmax_d },
};
TRANS_FEAT(FMAX_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fmax[s->fpcr_ah])
+TRANS_FEAT(FMAX_nn, aa64_sme2, do_z2z_nn_fpst, a, f_vector_fmax[s->fpcr_ah])
static gen_helper_gvec_3_ptr * const f_vector_fmin[2][4] = {
{ gen_helper_gvec_fmin_b16,
@@ -610,6 +672,7 @@ static gen_helper_gvec_3_ptr * const f_vector_fmin[2][4] = {
gen_helper_gvec_ah_fmin_d },
};
TRANS_FEAT(FMIN_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fmin[s->fpcr_ah])
+TRANS_FEAT(FMIN_nn, aa64_sme2, do_z2z_nn_fpst, a, f_vector_fmin[s->fpcr_ah])
static gen_helper_gvec_3_ptr * const f_vector_fmaxnm[4] = {
gen_helper_gvec_fmaxnum_b16,
@@ -618,6 +681,7 @@ static gen_helper_gvec_3_ptr * const f_vector_fmaxnm[4] = {
gen_helper_gvec_fmaxnum_d,
};
TRANS_FEAT(FMAXNM_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fmaxnm)
+TRANS_FEAT(FMAXNM_nn, aa64_sme2, do_z2z_nn_fpst, a, f_vector_fmaxnm)
static gen_helper_gvec_3_ptr * const f_vector_fminnm[4] = {
gen_helper_gvec_fminnum_b16,
@@ -626,3 +690,4 @@ static gen_helper_gvec_3_ptr * const f_vector_fminnm[4] = {
gen_helper_gvec_fminnum_d,
};
TRANS_FEAT(FMINNM_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fminnm)
+TRANS_FEAT(FMINNM_nn, aa64_sme2, do_z2z_nn_fpst, a, f_vector_fminnm)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 005f87777b..470592f4c0 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -209,3 +209,39 @@ ADD_n1 1100000 1 .. 10 .... 1010.0 11000 .... 0 @z2z_4x1
SQDMULH_n1 1100000 1 .. 10 .... 1010.1 00000 .... 0 @z2z_2x1
SQDMULH_n1 1100000 1 .. 10 .... 1010.1 00000 .... 0 @z2z_4x1
+
+### SME2 Multi-vector Multiple Vectors SVE Destructive
+
+%zm_ax2 17:4 !function=times_2
+%zm_ax4 18:3 !function=times_4
+
+@z2z_2x2 ....... . esz:2 . ....0 ....0. ..... .... . \
+ &z2z_en n=2 zdn=%zd_ax2 zm=%zm_ax2
+@z2z_4x4 ....... . esz:2 . ...00 ....1. ..... ...0 . \
+ &z2z_en n=4 zdn=%zd_ax4 zm=%zm_ax4
+
+SMAX_nn 1100000 1 .. 1 ..... 1011.0 00000 .... 0 @z2z_2x2
+SMAX_nn 1100000 1 .. 1 ..... 1011.0 00000 .... 0 @z2z_4x4
+UMAX_nn 1100000 1 .. 1 ..... 1011.0 00000 .... 1 @z2z_2x2
+UMAX_nn 1100000 1 .. 1 ..... 1011.0 00000 .... 1 @z2z_4x4
+SMIN_nn 1100000 1 .. 1 ..... 1011.0 00001 .... 0 @z2z_2x2
+SMIN_nn 1100000 1 .. 1 ..... 1011.0 00001 .... 0 @z2z_4x4
+UMIN_nn 1100000 1 .. 1 ..... 1011.0 00001 .... 1 @z2z_2x2
+UMIN_nn 1100000 1 .. 1 ..... 1011.0 00001 .... 1 @z2z_4x4
+
+FMAX_nn 1100000 1 .. 1 ..... 1011.0 01000 .... 0 @z2z_2x2
+FMAX_nn 1100000 1 .. 1 ..... 1011.0 01000 .... 0 @z2z_4x4
+FMIN_nn 1100000 1 .. 1 ..... 1011.0 01000 .... 1 @z2z_2x2
+FMIN_nn 1100000 1 .. 1 ..... 1011.0 01000 .... 1 @z2z_4x4
+FMAXNM_nn 1100000 1 .. 1 ..... 1011.0 01001 .... 0 @z2z_2x2
+FMAXNM_nn 1100000 1 .. 1 ..... 1011.0 01001 .... 0 @z2z_4x4
+FMINNM_nn 1100000 1 .. 1 ..... 1011.0 01001 .... 1 @z2z_2x2
+FMINNM_nn 1100000 1 .. 1 ..... 1011.0 01001 .... 1 @z2z_4x4
+
+SRSHL_nn 1100000 1 .. 1 ..... 1011.0 10001 .... 0 @z2z_2x2
+SRSHL_nn 1100000 1 .. 1 ..... 1011.0 10001 .... 0 @z2z_4x4
+URSHL_nn 1100000 1 .. 1 ..... 1011.0 10001 .... 1 @z2z_2x2
+URSHL_nn 1100000 1 .. 1 ..... 1011.0 10001 .... 1 @z2z_4x4
+
+SQDMULH_nn 1100000 1 .. 1 ..... 1011.1 00000 .... 0 @z2z_2x2
+SQDMULH_nn 1100000 1 .. 1 ..... 1011.1 00000 .... 0 @z2z_4x4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 28/61] target/arm: Implement SME2 ADD/SUB (array results, multiple and single vector)
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (26 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 27/61] target/arm: Implement SME2 Multiple Vectors " Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 29/61] target/arm: Implement SME2 ADD/SUB (array results, multiple vectors) Richard Henderson
` (33 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate.h | 2 ++
target/arm/tcg/translate-sme.c | 29 +++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 15 +++++++++++++++
3 files changed, 46 insertions(+)
diff --git a/target/arm/tcg/translate.h b/target/arm/tcg/translate.h
index c364d977f3..be39adfa86 100644
--- a/target/arm/tcg/translate.h
+++ b/target/arm/tcg/translate.h
@@ -638,6 +638,8 @@ typedef void GVecGen3Fn(unsigned, uint32_t, uint32_t,
uint32_t, uint32_t, uint32_t);
typedef void GVecGen4Fn(unsigned, uint32_t, uint32_t, uint32_t,
uint32_t, uint32_t, uint32_t);
+typedef void GVecGen3FnVar(unsigned, TCGv_ptr, uint32_t, TCGv_ptr, uint32_t,
+ TCGv_ptr, uint32_t, uint32_t, uint32_t);
/* Function prototype for gen_ functions for calling Neon helpers */
typedef void NeonGenOneOpFn(TCGv_i32, TCGv_i32);
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 617621d663..09a4da1725 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -691,3 +691,32 @@ static gen_helper_gvec_3_ptr * const f_vector_fminnm[4] = {
};
TRANS_FEAT(FMINNM_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fminnm)
TRANS_FEAT(FMINNM_nn, aa64_sme2, do_z2z_nn_fpst, a, f_vector_fminnm)
+
+static bool do_azz_n1(DisasContext *s, arg_azz_n *a, int esz,
+ GVecGen3FnVar *fn)
+{
+ TCGv_ptr t_za;
+ int svl, n, o_zm;
+
+ if (!sme_smza_enabled_check(s)) {
+ return true;
+ }
+
+ n = a->n;
+ t_za = get_zarray(s, a->rv, a->off, n);
+ o_zm = vec_full_reg_offset(s, a->zm);
+ svl = streaming_vec_reg_size(s);
+
+ for (int i = 0; i < n; ++i) {
+ int o_za = (svl / n * sizeof(ARMVectorReg)) * i;
+ int o_zn = vec_full_reg_offset(s, (a->zn + i) % 32);
+
+ fn(esz, t_za, o_za, tcg_env, o_zn, tcg_env, o_zm, svl, svl);
+ }
+ return true;
+}
+
+TRANS_FEAT(ADD_azz_n1_s, aa64_sme2, do_azz_n1, a, MO_32, tcg_gen_gvec_add_var)
+TRANS_FEAT(SUB_azz_n1_s, aa64_sme2, do_azz_n1, a, MO_32, tcg_gen_gvec_sub_var)
+TRANS_FEAT(ADD_azz_n1_d, aa64_sme2_i16i64, do_azz_n1, a, MO_64, tcg_gen_gvec_add_var)
+TRANS_FEAT(SUB_azz_n1_d, aa64_sme2_i16i64, do_azz_n1, a, MO_64, tcg_gen_gvec_sub_var)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 470592f4c0..8b81c0a0ce 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -245,3 +245,18 @@ URSHL_nn 1100000 1 .. 1 ..... 1011.0 10001 .... 1 @z2z_4x4
SQDMULH_nn 1100000 1 .. 1 ..... 1011.1 00000 .... 0 @z2z_2x2
SQDMULH_nn 1100000 1 .. 1 ..... 1011.1 00000 .... 0 @z2z_4x4
+
+### SME2 Multi-vector Multiple and Single Array Vectors
+
+&azz_n n off rv zn zm
+@azz_nx1_o3 ........ .... zm:4 ...... zn:5 .. off:3 &azz_n rv=%mova_rv
+
+ADD_azz_n1_s 11000001 0010 .... 0 .. 110 ..... 10 ... @azz_nx1_o3 n=2
+ADD_azz_n1_s 11000001 0011 .... 0 .. 110 ..... 10 ... @azz_nx1_o3 n=4
+ADD_azz_n1_d 11000001 0110 .... 0 .. 110 ..... 10 ... @azz_nx1_o3 n=2
+ADD_azz_n1_d 11000001 0111 .... 0 .. 110 ..... 10 ... @azz_nx1_o3 n=4
+
+SUB_azz_n1_s 11000001 0010 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=2
+SUB_azz_n1_s 11000001 0011 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=4
+SUB_azz_n1_d 11000001 0110 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=2
+SUB_azz_n1_d 11000001 0111 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 29/61] target/arm: Implement SME2 ADD/SUB (array results, multiple vectors)
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (27 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 28/61] target/arm: Implement SME2 ADD/SUB (array results, multiple and single vector) Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 30/61] target/arm: Pass ZA to helper_sve2_fmlal_zz[zx]w_s Richard Henderson
` (32 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 31 +++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 20 ++++++++++++++++++++
2 files changed, 51 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 09a4da1725..8aae70201c 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -692,6 +692,7 @@ static gen_helper_gvec_3_ptr * const f_vector_fminnm[4] = {
TRANS_FEAT(FMINNM_n1, aa64_sme2, do_z2z_n1_fpst, a, f_vector_fminnm)
TRANS_FEAT(FMINNM_nn, aa64_sme2, do_z2z_nn_fpst, a, f_vector_fminnm)
+/* Add/Sub vector Z[m] to each Z[n*N] with result in ZA[d*N]. */
static bool do_azz_n1(DisasContext *s, arg_azz_n *a, int esz,
GVecGen3FnVar *fn)
{
@@ -720,3 +721,33 @@ TRANS_FEAT(ADD_azz_n1_s, aa64_sme2, do_azz_n1, a, MO_32, tcg_gen_gvec_add_var)
TRANS_FEAT(SUB_azz_n1_s, aa64_sme2, do_azz_n1, a, MO_32, tcg_gen_gvec_sub_var)
TRANS_FEAT(ADD_azz_n1_d, aa64_sme2_i16i64, do_azz_n1, a, MO_64, tcg_gen_gvec_add_var)
TRANS_FEAT(SUB_azz_n1_d, aa64_sme2_i16i64, do_azz_n1, a, MO_64, tcg_gen_gvec_sub_var)
+
+/* Add/Sub each vector Z[m*N] to each Z[n*N] with result in ZA[d*N]. */
+static bool do_azz_nn(DisasContext *s, arg_azz_n *a, int esz,
+ GVecGen3FnVar *fn)
+{
+ TCGv_ptr t_za;
+ int svl, n;
+
+ if (!sme_smza_enabled_check(s)) {
+ return true;
+ }
+
+ n = a->n;
+ t_za = get_zarray(s, a->rv, a->off, n);
+ svl = streaming_vec_reg_size(s);
+
+ for (int i = 0; i < n; ++i) {
+ int o_za = (svl / n * sizeof(ARMVectorReg)) * i;
+ int o_zn = vec_full_reg_offset(s, a->zn + i);
+ int o_zm = vec_full_reg_offset(s, a->zm + i);
+
+ fn(esz, t_za, o_za, tcg_env, o_zn, tcg_env, o_zm, svl, svl);
+ }
+ return true;
+}
+
+TRANS_FEAT(ADD_azz_nn_s, aa64_sme2, do_azz_nn, a, MO_32, tcg_gen_gvec_add_var)
+TRANS_FEAT(SUB_azz_nn_s, aa64_sme2, do_azz_nn, a, MO_32, tcg_gen_gvec_sub_var)
+TRANS_FEAT(ADD_azz_nn_d, aa64_sme2_i16i64, do_azz_nn, a, MO_64, tcg_gen_gvec_add_var)
+TRANS_FEAT(SUB_azz_nn_d, aa64_sme2_i16i64, do_azz_nn, a, MO_64, tcg_gen_gvec_sub_var)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 8b81c0a0ce..a6dee08661 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -260,3 +260,23 @@ SUB_azz_n1_s 11000001 0010 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=2
SUB_azz_n1_s 11000001 0011 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=4
SUB_azz_n1_d 11000001 0110 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=2
SUB_azz_n1_d 11000001 0111 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=4
+
+### SME2 Multi-vector Multiple Array Vectors
+
+%zn_ax2 6:4 !function=times_2
+%zn_ax4 7:3 !function=times_4
+
+@azz_2x2_o3 ........ ... ..... . .. ... ..... .. off:3 \
+ &azz_n n=2 rv=%mova_rv zn=%zn_ax2 zm=%zm_ax2
+@azz_4x4_o3 ........ ... ..... . .. ... ..... .. off:3 \
+ &azz_n n=4 rv=%mova_rv zn=%zn_ax4 zm=%zm_ax4
+
+ADD_azz_nn_s 11000001 101 ....0 0 .. 110 ....0 10 ... @azz_2x2_o3
+ADD_azz_nn_s 11000001 101 ...01 0 .. 110 ...00 10 ... @azz_4x4_o3
+ADD_azz_nn_d 11000001 111 ....0 0 .. 110 ....0 10 ... @azz_2x2_o3
+ADD_azz_nn_d 11000001 111 ...01 0 .. 110 ...00 10 ... @azz_4x4_o3
+
+SUB_azz_nn_s 11000001 101 ....0 0 .. 110 ....0 11 ... @azz_2x2_o3
+SUB_azz_nn_s 11000001 101 ...01 0 .. 110 ...00 11 ... @azz_4x4_o3
+SUB_azz_nn_d 11000001 111 ....0 0 .. 110 ....0 11 ... @azz_2x2_o3
+SUB_azz_nn_d 11000001 111 ...01 0 .. 110 ...00 11 ... @azz_4x4_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 30/61] target/arm: Pass ZA to helper_sve2_fmlal_zz[zx]w_s
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (28 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 29/61] target/arm: Implement SME2 ADD/SUB (array results, multiple vectors) Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 31/61] target/arm: Implement SME2 FMLAL, BFMLAL Richard Henderson
` (31 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Indicate whether to use FPST_FPCR or FPST_ZA via bit 2 of
simd_data(desc). For SVE, this bit remains zero.
For do_FMLAL_zzzw, this requires no change.
For do_FMLAL_zzxw, move the index up one bit.
Read fz16 directly from env->fpcr.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sve.c | 2 +-
target/arm/tcg/vec_helper.c | 8 +++++---
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c
index 0907a4e9e9..e56ef6ad68 100644
--- a/target/arm/tcg/translate-sve.c
+++ b/target/arm/tcg/translate-sve.c
@@ -7168,7 +7168,7 @@ static bool do_FMLAL_zzxw(DisasContext *s, arg_rrxr_esz *a, bool sub, bool sel)
{
return gen_gvec_ptr_zzzz(s, gen_helper_sve2_fmlal_zzxw_s,
a->rd, a->rn, a->rm, a->ra,
- (a->index << 2) | (sel << 1) | sub, tcg_env);
+ (a->index << 3) | (sel << 1) | sub, tcg_env);
}
TRANS_FEAT(FMLALB_zzxw, aa64_sve2, do_FMLAL_zzxw, a, false, false)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 671777ce52..16ddd35239 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -2191,7 +2191,8 @@ void HELPER(sve2_fmlal_zzzw_s)(void *vd, void *vn, void *vm, void *va,
intptr_t i, oprsz = simd_oprsz(desc);
bool is_s = extract32(desc, SIMD_DATA_SHIFT, 1);
intptr_t sel = extract32(desc, SIMD_DATA_SHIFT + 1, 1) * sizeof(float16);
- float_status *status = &env->vfp.fp_status[FPST_A64];
+ bool za = extract32(desc, SIMD_DATA_SHIFT + 2, 1);
+ float_status *status = &env->vfp.fp_status[za ? FPST_ZA : FPST_A64];
bool fz16 = env->vfp.fpcr & FPCR_FZ16;
int negx = 0, negf = 0;
@@ -2274,8 +2275,9 @@ void HELPER(sve2_fmlal_zzxw_s)(void *vd, void *vn, void *vm, void *va,
intptr_t i, j, oprsz = simd_oprsz(desc);
bool is_s = extract32(desc, SIMD_DATA_SHIFT, 1);
intptr_t sel = extract32(desc, SIMD_DATA_SHIFT + 1, 1) * sizeof(float16);
- intptr_t idx = extract32(desc, SIMD_DATA_SHIFT + 2, 3) * sizeof(float16);
- float_status *status = &env->vfp.fp_status[FPST_A64];
+ bool za = extract32(desc, SIMD_DATA_SHIFT + 2, 1);
+ intptr_t idx = extract32(desc, SIMD_DATA_SHIFT + 3, 3) * sizeof(float16);
+ float_status *status = &env->vfp.fp_status[za ? FPST_ZA : FPST_A64];
bool fz16 = env->vfp.fpcr & FPCR_FZ16;
int negx = 0, negf = 0;
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 31/61] target/arm: Implement SME2 FMLAL, BFMLAL
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (29 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 30/61] target/arm: Pass ZA to helper_sve2_fmlal_zz[zx]w_s Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 32/61] target/arm: Implement SME2 FDOT Richard Henderson
` (30 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 93 ++++++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 71 ++++++++++++++++++++++++++
2 files changed, 164 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 8aae70201c..ec12bfd089 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -751,3 +751,96 @@ TRANS_FEAT(ADD_azz_nn_s, aa64_sme2, do_azz_nn, a, MO_32, tcg_gen_gvec_add_var)
TRANS_FEAT(SUB_azz_nn_s, aa64_sme2, do_azz_nn, a, MO_32, tcg_gen_gvec_sub_var)
TRANS_FEAT(ADD_azz_nn_d, aa64_sme2_i16i64, do_azz_nn, a, MO_64, tcg_gen_gvec_add_var)
TRANS_FEAT(SUB_azz_nn_d, aa64_sme2_i16i64, do_azz_nn, a, MO_64, tcg_gen_gvec_sub_var)
+
+/*
+ * Expand array multi-vector single (n1), array multi-vector (nn),
+ * and array multi-vector indexed (nx), for floating-point accumulate.
+ * multi: true for nn, false for n1.
+ * fpst: >= 0 to set ptr argument for FPST_*, < 0 for ENV.
+ * data: stuff for simd_data, including any index.
+ */
+#define FPST_ENV -1
+
+static bool do_azz_acc_fp(DisasContext *s, int nreg, int nsel,
+ int rv, int off, int zn, int zm,
+ int data, int shsel, bool multi, int fpst,
+ gen_helper_gvec_4_ptr *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ int vstride = svl / nreg;
+ TCGv_ptr t_za = get_zarray(s, rv, off, nreg);
+ TCGv_ptr t, ptr;
+
+ if (fpst >= 0) {
+ ptr = fpstatus_ptr(fpst);
+ } else {
+ ptr = tcg_env;
+ }
+ t = tcg_temp_new_ptr();
+
+ for (int r = 0; r < nreg; ++r) {
+ TCGv_ptr t_zn = vec_full_reg_ptr(s, zn);
+ TCGv_ptr t_zm = vec_full_reg_ptr(s, zm);
+
+ for (int i = 0; i < nsel; ++i) {
+ int o_za = (r * vstride + i) * sizeof(ARMVectorReg);
+ int desc = simd_desc(svl, svl, data | (i << shsel));
+
+ tcg_gen_addi_ptr(t, t_za, o_za);
+ fn(t, t_zn, t_zm, t, ptr, tcg_constant_i32(desc));
+ }
+
+ /*
+ * For multiple-and-single vectors, Zn may wrap.
+ * For multiple vectors, both Zn and Zm are aligned.
+ */
+ zn = (zn + 1) % 32;
+ zm += multi;
+ }
+ }
+ return true;
+}
+
+static bool do_fmlal(DisasContext *s, arg_azz_n *a, bool sub, bool multi)
+{
+ return do_azz_acc_fp(s, a->n, 2, a->rv, a->off, a->zn, a->zm,
+ (1 << 2) | sub, 1,
+ multi, FPST_ENV, gen_helper_sve2_fmlal_zzzw_s);
+}
+
+TRANS_FEAT(FMLAL_n1, aa64_sme2, do_fmlal, a, false, false)
+TRANS_FEAT(FMLSL_n1, aa64_sme2, do_fmlal, a, true, false)
+TRANS_FEAT(FMLAL_nn, aa64_sme2, do_fmlal, a, false, true)
+TRANS_FEAT(FMLSL_nn, aa64_sme2, do_fmlal, a, true, true)
+
+static bool do_fmlal_nx(DisasContext *s, arg_azx_n *a, bool sub)
+{
+ return do_azz_acc_fp(s, a->n, 2, a->rv, a->off, a->zn, a->zm,
+ (a->idx << 3) | (1 << 2) | sub, 1,
+ false, FPST_ENV, gen_helper_sve2_fmlal_zzxw_s);
+}
+
+TRANS_FEAT(FMLAL_nx, aa64_sme2, do_fmlal_nx, a, false)
+TRANS_FEAT(FMLSL_nx, aa64_sme2, do_fmlal_nx, a, true)
+
+static bool do_bfmlal(DisasContext *s, arg_azz_n *a, bool sub, bool multi)
+{
+ return do_azz_acc_fp(s, a->n, 2, a->rv, a->off, a->zn, a->zm, sub, 1,
+ multi, FPST_ZA, gen_helper_gvec_bfmlal);
+}
+
+TRANS_FEAT(BFMLAL_n1, aa64_sme2, do_bfmlal, a, false, false)
+TRANS_FEAT(BFMLSL_n1, aa64_sme2, do_bfmlal, a, true, false)
+TRANS_FEAT(BFMLAL_nn, aa64_sme2, do_bfmlal, a, false, true)
+TRANS_FEAT(BFMLSL_nn, aa64_sme2, do_bfmlal, a, true, true)
+
+static bool do_bfmlal_nx(DisasContext *s, arg_azx_n *a, bool sub)
+{
+ return do_azz_acc_fp(s, a->n, 2, a->rv, a->off, a->zn, a->zm,
+ (a->idx << 2) | sub, 1,
+ false, FPST_ZA, gen_helper_gvec_bfmlal_idx);
+}
+
+TRANS_FEAT(BFMLAL_nx, aa64_sme2, do_bfmlal_nx, a, false)
+TRANS_FEAT(BFMLSL_nx, aa64_sme2, do_bfmlal_nx, a, true)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index a6dee08661..9850c19d90 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -261,6 +261,30 @@ SUB_azz_n1_s 11000001 0011 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=4
SUB_azz_n1_d 11000001 0110 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=2
SUB_azz_n1_d 11000001 0111 .... 0 .. 110 ..... 11 ... @azz_nx1_o3 n=4
+%off3_x2 0:3 !function=times_2
+%off2_x2 0:2 !function=times_2
+
+@azz_nx1_o3x2 ........ ... . zm:4 . .. ... zn:5 .. ... \
+ &azz_n off=%off3_x2 rv=%mova_rv
+@azz_nx1_o2x2 ........ ... . zm:4 . .. ... zn:5 ... .. \
+ &azz_n off=%off2_x2 rv=%mova_rv
+
+FMLAL_n1 11000001 001 0 .... 0 .. 011 ..... 00 ... @azz_nx1_o3x2 n=1
+FMLAL_n1 11000001 001 0 .... 0 .. 010 ..... 000 .. @azz_nx1_o2x2 n=2
+FMLAL_n1 11000001 001 1 .... 0 .. 010 ..... 000 .. @azz_nx1_o2x2 n=4
+
+FMLSL_n1 11000001 001 0 .... 0 .. 011 ..... 01 ... @azz_nx1_o3x2 n=1
+FMLSL_n1 11000001 001 0 .... 0 .. 010 ..... 010 .. @azz_nx1_o2x2 n=2
+FMLSL_n1 11000001 001 1 .... 0 .. 010 ..... 010 .. @azz_nx1_o2x2 n=4
+
+BFMLAL_n1 11000001 001 0 .... 0 .. 011 ..... 10 ... @azz_nx1_o3x2 n=1
+BFMLAL_n1 11000001 001 0 .... 0 .. 010 ..... 100 .. @azz_nx1_o2x2 n=2
+BFMLAL_n1 11000001 001 1 .... 0 .. 010 ..... 100 .. @azz_nx1_o2x2 n=4
+
+BFMLSL_n1 11000001 001 0 .... 0 .. 011 ..... 11 ... @azz_nx1_o3x2 n=1
+BFMLSL_n1 11000001 001 0 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=2
+BFMLSL_n1 11000001 001 1 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -280,3 +304,50 @@ SUB_azz_nn_s 11000001 101 ....0 0 .. 110 ....0 11 ... @azz_2x2_o3
SUB_azz_nn_s 11000001 101 ...01 0 .. 110 ...00 11 ... @azz_4x4_o3
SUB_azz_nn_d 11000001 111 ....0 0 .. 110 ....0 11 ... @azz_2x2_o3
SUB_azz_nn_d 11000001 111 ...01 0 .. 110 ...00 11 ... @azz_4x4_o3
+
+@azz_2x2_o2x2 ........ ... ..... . .. ... ..... ... .. \
+ &azz_n n=2 rv=%mova_rv zn=%zn_ax2 zm=%zm_ax2 off=%off2_x2
+@azz_4x4_o2x2 ........ ... ..... . .. ... ..... ... .. \
+ &azz_n n=4 rv=%mova_rv zn=%zn_ax4 zm=%zm_ax4 off=%off2_x2
+
+FMLAL_nn 11000001 101 ....0 0 .. 010 ....0 000 .. @azz_2x2_o2x2
+FMLAL_nn 11000001 101 ...01 0 .. 010 ...00 000 .. @azz_4x4_o2x2
+
+FMLSL_nn 11000001 101 ....0 0 .. 010 ....0 010 .. @azz_2x2_o2x2
+FMLSL_nn 11000001 101 ...01 0 .. 010 ...00 010 .. @azz_4x4_o2x2
+
+BFMLAL_nn 11000001 101 ....0 0 .. 010 ....0 100 .. @azz_2x2_o2x2
+BFMLAL_nn 11000001 101 ...01 0 .. 010 ...00 100 .. @azz_4x4_o2x2
+
+BFMLSL_nn 11000001 101 ....0 0 .. 010 ....0 110 .. @azz_2x2_o2x2
+BFMLSL_nn 11000001 101 ...01 0 .. 010 ...00 110 .. @azz_4x4_o2x2
+
+### SME2 Multi-vector Indexed
+
+&azx_n n off rv zn zm idx
+
+%idx3_15_10 15:1 10:2
+%idx2_10_2 10:2 2:1
+
+@azx_1x1_o3x2 ........ .... zm:4 . .. . .. zn:5 .. ... \
+ &azx_n n=1 rv=%mova_rv off=%off3_x2 idx=%idx3_15_10
+@azx_2x1_o2x2 ........ .... zm:4 . .. . .. ..... .. ... \
+ &azx_n n=2 rv=%mova_rv off=%off2_x2 zn=%zn_ax2 idx=%idx2_10_2
+@azx_4x1_o2x2 ........ .... zm:4 . .. . .. ..... .. ... \
+ &azx_n n=4 rv=%mova_rv off=%off2_x2 zn=%zn_ax4 idx=%idx2_10_2
+
+FMLAL_nx 11000001 1000 .... . .. 1 .. ..... 00 ... @azx_1x1_o3x2
+FMLAL_nx 11000001 1001 .... 0 .. 1 .. ....0 00 ... @azx_2x1_o2x2
+FMLAL_nx 11000001 1001 .... 1 .. 1 .. ...00 00 ... @azx_4x1_o2x2
+
+FMLSL_nx 11000001 1000 .... . .. 1 .. ..... 01 ... @azx_1x1_o3x2
+FMLSL_nx 11000001 1001 .... 0 .. 1 .. ....0 01 ... @azx_2x1_o2x2
+FMLSL_nx 11000001 1001 .... 1 .. 1 .. ...00 01 ... @azx_4x1_o2x2
+
+BFMLAL_nx 11000001 1000 .... . .. 1 .. ..... 10 ... @azx_1x1_o3x2
+BFMLAL_nx 11000001 1001 .... 0 .. 1 .. ....0 10 ... @azx_2x1_o2x2
+BFMLAL_nx 11000001 1001 .... 1 .. 1 .. ...00 10 ... @azx_4x1_o2x2
+
+BFMLSL_nx 11000001 1000 .... . .. 1 .. ..... 11 ... @azx_1x1_o3x2
+BFMLSL_nx 11000001 1001 .... 0 .. 1 .. ....0 11 ... @azx_2x1_o2x2
+BFMLSL_nx 11000001 1001 .... 1 .. 1 .. ...00 11 ... @azx_4x1_o2x2
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 32/61] target/arm: Implement SME2 FDOT
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (30 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 31/61] target/arm: Implement SME2 FMLAL, BFMLAL Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 33/61] target/arm: Implement SME2 BFDOT Richard Henderson
` (29 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 5 ++++
target/arm/tcg/sme_helper.c | 44 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 18 ++++++++++++++
target/arm/tcg/sme.decode | 14 +++++++++++
4 files changed, 81 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index cdd7058aed..ec93ff57ff 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -173,3 +173,8 @@ DEF_HELPER_FLAGS_5(gvec_fmaxnum_b16, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fminnum_b16, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, fpst, i32)
+
+DEF_HELPER_FLAGS_6(sme2_fdot_h, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, env, i32)
+DEF_HELPER_FLAGS_6(sme2_fdot_idx_h, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, env, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index d973d83957..18b17e9fb2 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1122,6 +1122,50 @@ void HELPER(sme_fmopa_h)(void *vza, void *vzn, void *vzm, void *vpn,
}
}
+void HELPER(sme2_fdot_h)(void *vd, void *vn, void *vm, void *va,
+ CPUARMState *env, uint32_t desc)
+{
+ intptr_t i, oprsz = simd_maxsz(desc);
+ float_status fpst_odd, *fpst_std, *fpst_f16;
+ float32 *d = vd, *a = va;
+ uint32_t *n = vn, *m = vm;
+
+ fpst_std = &env->vfp.fp_status[FPST_ZA];
+ fpst_f16 = &env->vfp.fp_status[FPST_ZA_F16];
+ fpst_odd = *fpst_std;
+ set_float_rounding_mode(float_round_to_odd, &fpst_odd);
+
+ for (i = 0; i < oprsz / sizeof(float32); ++i) {
+ d[H4(i)] = f16_dotadd(a[H4(i)], n[H4(i)], m[H4(i)],
+ fpst_f16, fpst_std, &fpst_odd);
+ }
+}
+
+void HELPER(sme2_fdot_idx_h)(void *vd, void *vn, void *vm, void *va,
+ CPUARMState *env, uint32_t desc)
+{
+ intptr_t i, j, oprsz = simd_maxsz(desc);
+ intptr_t elements = oprsz / sizeof(float32);
+ intptr_t eltspersegment = MIN(4, elements);
+ int idx = extract32(desc, SIMD_DATA_SHIFT, 2);
+ float_status fpst_odd, *fpst_std, *fpst_f16;
+ float32 *d = vd, *a = va;
+ uint32_t *n = vn, *m = (uint32_t *)vm + H4(idx);
+
+ fpst_std = &env->vfp.fp_status[FPST_ZA];
+ fpst_f16 = &env->vfp.fp_status[FPST_ZA_F16];
+ fpst_odd = *fpst_std;
+ set_float_rounding_mode(float_round_to_odd, &fpst_odd);
+
+ for (i = 0; i < elements; i += eltspersegment) {
+ uint32_t mm = m[i];
+ for (j = 0; j < eltspersegment; ++j) {
+ d[H4(i + j)] = f16_dotadd(a[H4(i + j)], n[H4(i + j)], mm,
+ fpst_f16, fpst_std, &fpst_odd);
+ }
+ }
+}
+
void HELPER(sme_bfmopa)(void *vza, void *vzn, void *vzm,
void *vpn, void *vpm, CPUARMState *env, uint32_t desc)
{
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index ec12bfd089..2885655cc5 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -844,3 +844,21 @@ static bool do_bfmlal_nx(DisasContext *s, arg_azx_n *a, bool sub)
TRANS_FEAT(BFMLAL_nx, aa64_sme2, do_bfmlal_nx, a, false)
TRANS_FEAT(BFMLSL_nx, aa64_sme2, do_bfmlal_nx, a, true)
+
+static bool do_fdot(DisasContext *s, arg_azz_n *a, bool multi)
+{
+ return do_azz_acc_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm, 0, 0,
+ multi, FPST_ENV, gen_helper_sme2_fdot_h);
+}
+
+TRANS_FEAT(FDOT_n1, aa64_sme2, do_fdot, a, false)
+TRANS_FEAT(FDOT_nn, aa64_sme2, do_fdot, a, true)
+
+static bool do_fdot_nx(DisasContext *s, arg_azx_n *a)
+{
+ return do_azz_acc_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm,
+ a->idx, 0, false, FPST_ENV,
+ gen_helper_sme2_fdot_idx_h);
+}
+
+TRANS_FEAT(FDOT_nx, aa64_sme2, do_fdot_nx, a)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 9850c19d90..a2b93519c4 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -285,6 +285,9 @@ BFMLSL_n1 11000001 001 0 .... 0 .. 011 ..... 11 ... @azz_nx1_o3x2 n=1
BFMLSL_n1 11000001 001 0 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=2
BFMLSL_n1 11000001 001 1 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=4
+FDOT_n1 11000001 001 0 .... 0 .. 100 ..... 00 ... @azz_nx1_o3 n=2
+FDOT_n1 11000001 001 1 .... 0 .. 100 ..... 00 ... @azz_nx1_o3 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -322,6 +325,9 @@ BFMLAL_nn 11000001 101 ...01 0 .. 010 ...00 100 .. @azz_4x4_o2x2
BFMLSL_nn 11000001 101 ....0 0 .. 010 ....0 110 .. @azz_2x2_o2x2
BFMLSL_nn 11000001 101 ...01 0 .. 010 ...00 110 .. @azz_4x4_o2x2
+FDOT_nn 11000001 101 ....0 0 .. 100 ....0 00 ... @azz_2x2_o3
+FDOT_nn 11000001 101 ...01 0 .. 100 ...00 00 ... @azz_4x4_o3
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
@@ -351,3 +357,11 @@ BFMLAL_nx 11000001 1001 .... 1 .. 1 .. ...00 10 ... @azx_4x1_o2x2
BFMLSL_nx 11000001 1000 .... . .. 1 .. ..... 11 ... @azx_1x1_o3x2
BFMLSL_nx 11000001 1001 .... 0 .. 1 .. ....0 11 ... @azx_2x1_o2x2
BFMLSL_nx 11000001 1001 .... 1 .. 1 .. ...00 11 ... @azx_4x1_o2x2
+
+@azx_2x1_i2_o3 ........ .... zm:4 . .. . idx:2 .... ... off:3 \
+ &azx_n n=2 rv=%mova_rv zn=%zn_ax2
+@azx_4x1_i2_o3 ........ .... zm:4 . .. . idx:2 .... ... off:3 \
+ &azx_n n=2 rv=%mova_rv zn=%zn_ax4
+
+FDOT_nx 11000001 0101 .... 0 .. 1 .. ....0 01 ... @azx_2x1_i2_o3
+FDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 01 ... @azx_4x1_i2_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 33/61] target/arm: Implement SME2 BFDOT
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (31 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 32/61] target/arm: Implement SME2 FDOT Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 34/61] target/arm: Implement SME2 FVDOT, BFVDOT Richard Henderson
` (28 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 17 +++++++++++++++++
target/arm/tcg/sme.decode | 9 +++++++++
2 files changed, 26 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 2885655cc5..c03daa535d 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -862,3 +862,20 @@ static bool do_fdot_nx(DisasContext *s, arg_azx_n *a)
}
TRANS_FEAT(FDOT_nx, aa64_sme2, do_fdot_nx, a)
+
+static bool do_bfdot(DisasContext *s, arg_azz_n *a, bool multi)
+{
+ return do_azz_acc_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm, 0, 0,
+ multi, FPST_ENV, gen_helper_gvec_bfdot);
+}
+
+TRANS_FEAT(BFDOT_n1, aa64_sme2, do_bfdot, a, false)
+TRANS_FEAT(BFDOT_nn, aa64_sme2, do_bfdot, a, true)
+
+static bool do_bfdot_nx(DisasContext *s, arg_azx_n *a)
+{
+ return do_azz_acc_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm, a->idx, 0,
+ false, FPST_ENV, gen_helper_gvec_bfdot_idx);
+}
+
+TRANS_FEAT(BFDOT_nx, aa64_sme2, do_bfdot_nx, a)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index a2b93519c4..18e625605f 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -288,6 +288,9 @@ BFMLSL_n1 11000001 001 1 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=4
FDOT_n1 11000001 001 0 .... 0 .. 100 ..... 00 ... @azz_nx1_o3 n=2
FDOT_n1 11000001 001 1 .... 0 .. 100 ..... 00 ... @azz_nx1_o3 n=4
+BFDOT_n1 11000001 001 0 .... 0 .. 100 ..... 10 ... @azz_nx1_o3 n=2
+BFDOT_n1 11000001 001 1 .... 0 .. 100 ..... 10 ... @azz_nx1_o3 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -328,6 +331,9 @@ BFMLSL_nn 11000001 101 ...01 0 .. 010 ...00 110 .. @azz_4x4_o2x2
FDOT_nn 11000001 101 ....0 0 .. 100 ....0 00 ... @azz_2x2_o3
FDOT_nn 11000001 101 ...01 0 .. 100 ...00 00 ... @azz_4x4_o3
+BFDOT_nn 11000001 101 ....0 0 .. 100 ....0 10 ... @azz_2x2_o3
+BFDOT_nn 11000001 101 ...01 0 .. 100 ...00 10 ... @azz_4x4_o3
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
@@ -365,3 +371,6 @@ BFMLSL_nx 11000001 1001 .... 1 .. 1 .. ...00 11 ... @azx_4x1_o2x2
FDOT_nx 11000001 0101 .... 0 .. 1 .. ....0 01 ... @azx_2x1_i2_o3
FDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 01 ... @azx_4x1_i2_o3
+
+BFDOT_nx 11000001 0101 .... 0 .. 1 .. ....0 11 ... @azx_2x1_i2_o3
+BFDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 11 ... @azx_4x1_i2_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 34/61] target/arm: Implement SME2 FVDOT, BFVDOT
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (32 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 33/61] target/arm: Implement SME2 BFDOT Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 35/61] target/arm: Rename helper_gvec_*dot_[bh] to *_4[bh] Richard Henderson
` (27 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 2 ++
target/arm/tcg/helper-sme.h | 2 ++
target/arm/tcg/sme_helper.c | 30 ++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 16 ++++++++++++++
target/arm/tcg/vec_helper.c | 39 ++++++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 3 +++
6 files changed, 92 insertions(+)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 0907505839..e64ddd68b2 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1079,6 +1079,8 @@ DEF_HELPER_FLAGS_6(gvec_bfdot, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
DEF_HELPER_FLAGS_6(gvec_bfdot_idx, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
+DEF_HELPER_FLAGS_6(sme2_bfvdot_idx, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, env, i32)
DEF_HELPER_FLAGS_6(gvec_bfmmla, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index ec93ff57ff..8f5a1b3c90 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -178,3 +178,5 @@ DEF_HELPER_FLAGS_6(sme2_fdot_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
DEF_HELPER_FLAGS_6(sme2_fdot_idx_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
+DEF_HELPER_FLAGS_6(sme2_fvdot_idx_h, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, env, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 18b17e9fb2..6fb7e6b306 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1166,6 +1166,36 @@ void HELPER(sme2_fdot_idx_h)(void *vd, void *vn, void *vm, void *va,
}
}
+void HELPER(sme2_fvdot_idx_h)(void *vd, void *vn, void *vm, void *va,
+ CPUARMState *env, uint32_t desc)
+{
+ intptr_t i, j, oprsz = simd_maxsz(desc);
+ intptr_t elements = oprsz / sizeof(float32);
+ intptr_t eltspersegment = MIN(4, elements);
+ int idx = extract32(desc, SIMD_DATA_SHIFT, 2);
+ int sel = extract32(desc, SIMD_DATA_SHIFT + 2, 1);
+ float_status fpst_odd, *fpst_std, *fpst_f16;
+ float32 *d = vd, *a = va;
+ uint16_t *n0 = vn;
+ uint16_t *n1 = vn + sizeof(ARMVectorReg);
+ uint32_t *m = (uint32_t *)vm + H4(idx);
+
+ fpst_std = &env->vfp.fp_status[FPST_ZA];
+ fpst_f16 = &env->vfp.fp_status[FPST_ZA_F16];
+ fpst_odd = *fpst_std;
+ set_float_rounding_mode(float_round_to_odd, &fpst_odd);
+
+ for (i = 0; i < elements; i += eltspersegment) {
+ uint32_t mm = m[i];
+ for (j = 0; j < eltspersegment; ++j) {
+ uint32_t nn = n0[i + H2(2 * j + sel)]
+ | (n1[i + H2(2 * j + sel)] << 16);
+ d[i + H4(j)] = f16_dotadd(a[i + H4(j)], nn, mm,
+ fpst_f16, fpst_std, &fpst_odd);
+ }
+ }
+}
+
void HELPER(sme_bfmopa)(void *vza, void *vzn, void *vzm,
void *vpn, void *vpm, CPUARMState *env, uint32_t desc)
{
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index c03daa535d..f4a684f6a4 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -863,6 +863,14 @@ static bool do_fdot_nx(DisasContext *s, arg_azx_n *a)
TRANS_FEAT(FDOT_nx, aa64_sme2, do_fdot_nx, a)
+static bool trans_FVDOT(DisasContext *s, arg_azx_n *a)
+{
+ return dc_isar_feature(aa64_sme2, s) &&
+ do_azz_acc_fp(s, 1, 2, a->rv, a->off, a->zn, a->zm,
+ a->idx, 2, false, FPST_ENV,
+ gen_helper_sme2_fvdot_idx_h);
+}
+
static bool do_bfdot(DisasContext *s, arg_azz_n *a, bool multi)
{
return do_azz_acc_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm, 0, 0,
@@ -879,3 +887,11 @@ static bool do_bfdot_nx(DisasContext *s, arg_azx_n *a)
}
TRANS_FEAT(BFDOT_nx, aa64_sme2, do_bfdot_nx, a)
+
+static bool trans_BFVDOT(DisasContext *s, arg_azx_n *a)
+{
+ return dc_isar_feature(aa64_sme2, s) &&
+ do_azz_acc_fp(s, 1, 2, a->rv, a->off, a->zn, a->zm,
+ a->idx, 2, false, FPST_ENV,
+ gen_helper_sme2_bfvdot_idx);
+}
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 16ddd35239..cfd219ed13 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -3079,6 +3079,45 @@ void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
clear_tail(d, opr_sz, simd_maxsz(desc));
}
+void HELPER(sme2_bfvdot_idx)(void *vd, void *vn, void *vm,
+ void *va, CPUARMState *env, uint32_t desc)
+{
+ intptr_t i, j, opr_sz = simd_oprsz(desc);
+ intptr_t idx = extract32(desc, SIMD_DATA_SHIFT, 2);
+ intptr_t sel = extract32(desc, SIMD_DATA_SHIFT + 2, 1);
+ intptr_t elements = opr_sz / 4;
+ intptr_t eltspersegment = MIN(16 / 4, elements);
+ float32 *d = vd, *a = va;
+ uint16_t *n0 = vn;
+ uint16_t *n1 = vn + sizeof(ARMVectorReg);
+ uint32_t *m = vm;
+ float_status fpst, fpst_odd;
+
+ if (is_ebf(env, &fpst, &fpst_odd)) {
+ for (i = 0; i < elements; i += eltspersegment) {
+ uint32_t m_idx = m[i + H4(idx)];
+
+ for (j = 0; j < eltspersegment; j++) {
+ uint32_t nn = n0[i + H2(2 * j + sel)]
+ | (n1[i + H2(2 * j + sel)] << 16);
+ d[i + H4(j)] = bfdotadd_ebf(a[i + H4(j)], nn, m_idx,
+ &fpst, &fpst_odd);
+ }
+ }
+ } else {
+ for (i = 0; i < elements; i += eltspersegment) {
+ uint32_t m_idx = m[i + H4(idx)];
+
+ for (j = 0; j < eltspersegment; j++) {
+ uint32_t nn = n0[i + H2(2 * j + sel)]
+ | (n1[i + H2(2 * j + sel)] << 16);
+ d[i + H4(j)] = bfdotadd(a[i + H4(j)], nn, m_idx, &fpst);
+ }
+ }
+ }
+ clear_tail(d, opr_sz, simd_maxsz(desc));
+}
+
void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va,
CPUARMState *env, uint32_t desc)
{
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 18e625605f..7c057bcad2 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -374,3 +374,6 @@ FDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 01 ... @azx_4x1_i2_o3
BFDOT_nx 11000001 0101 .... 0 .. 1 .. ....0 11 ... @azx_2x1_i2_o3
BFDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 11 ... @azx_4x1_i2_o3
+
+FVDOT 11000001 0101 .... 0 .. 0 .. ....0 01 ... @azx_2x1_i2_o3
+BFVDOT 11000001 0101 .... 0 .. 0 .. ....0 11 ... @azx_2x1_i2_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 35/61] target/arm: Rename helper_gvec_*dot_[bh] to *_4[bh]
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (33 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 34/61] target/arm: Implement SME2 FVDOT, BFVDOT Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 36/61] target/arm: Remove helper_gvec_sudot_idx_4b Richard Henderson
` (26 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Emphasize that these are 4-way dot products.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 22 +++++++++++-----------
target/arm/tcg/translate-a64.c | 14 +++++++-------
target/arm/tcg/translate-neon.c | 14 +++++++-------
target/arm/tcg/translate-sve.c | 18 +++++++++---------
target/arm/tcg/vec_helper.c | 22 +++++++++++-----------
5 files changed, 45 insertions(+), 45 deletions(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index e64ddd68b2..5772ea98bd 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -606,23 +606,23 @@ DEF_HELPER_FLAGS_5(sve2_sqrdmlah_d, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_5(sve2_sqrdmlsh_d, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_sdot_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_udot_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_sdot_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_udot_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_usdot_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_sdot_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_udot_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_sdot_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_udot_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_usdot_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_sdot_idx_b, TCG_CALL_NO_RWG,
+DEF_HELPER_FLAGS_5(gvec_sdot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_udot_idx_b, TCG_CALL_NO_RWG,
+DEF_HELPER_FLAGS_5(gvec_udot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_sdot_idx_h, TCG_CALL_NO_RWG,
+DEF_HELPER_FLAGS_5(gvec_sdot_idx_4h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_udot_idx_h, TCG_CALL_NO_RWG,
+DEF_HELPER_FLAGS_5(gvec_udot_idx_4h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_sudot_idx_b, TCG_CALL_NO_RWG,
+DEF_HELPER_FLAGS_5(gvec_sudot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_usdot_idx_b, TCG_CALL_NO_RWG,
+DEF_HELPER_FLAGS_5(gvec_usdot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_fcaddh, TCG_CALL_NO_RWG,
diff --git a/target/arm/tcg/translate-a64.c b/target/arm/tcg/translate-a64.c
index bc96cee273..46c378fd08 100644
--- a/target/arm/tcg/translate-a64.c
+++ b/target/arm/tcg/translate-a64.c
@@ -6118,9 +6118,9 @@ static bool do_dot_vector_env(DisasContext *s, arg_qrrr_e *a,
return true;
}
-TRANS_FEAT(SDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_sdot_b)
-TRANS_FEAT(UDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_udot_b)
-TRANS_FEAT(USDOT_v, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_usdot_b)
+TRANS_FEAT(SDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_sdot_4b)
+TRANS_FEAT(UDOT_v, aa64_dp, do_dot_vector, a, gen_helper_gvec_udot_4b)
+TRANS_FEAT(USDOT_v, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_usdot_4b)
TRANS_FEAT(BFDOT_v, aa64_bf16, do_dot_vector_env, a, gen_helper_gvec_bfdot)
TRANS_FEAT(BFMMLA, aa64_bf16, do_dot_vector_env, a, gen_helper_gvec_bfmmla)
TRANS_FEAT(SMMLA, aa64_i8mm, do_dot_vector, a, gen_helper_gvec_smmla_b)
@@ -6880,12 +6880,12 @@ static bool do_dot_vector_idx_env(DisasContext *s, arg_qrrx_e *a,
return true;
}
-TRANS_FEAT(SDOT_vi, aa64_dp, do_dot_vector_idx, a, gen_helper_gvec_sdot_idx_b)
-TRANS_FEAT(UDOT_vi, aa64_dp, do_dot_vector_idx, a, gen_helper_gvec_udot_idx_b)
+TRANS_FEAT(SDOT_vi, aa64_dp, do_dot_vector_idx, a, gen_helper_gvec_sdot_idx_4b)
+TRANS_FEAT(UDOT_vi, aa64_dp, do_dot_vector_idx, a, gen_helper_gvec_udot_idx_4b)
TRANS_FEAT(SUDOT_vi, aa64_i8mm, do_dot_vector_idx, a,
- gen_helper_gvec_sudot_idx_b)
+ gen_helper_gvec_sudot_idx_4b)
TRANS_FEAT(USDOT_vi, aa64_i8mm, do_dot_vector_idx, a,
- gen_helper_gvec_usdot_idx_b)
+ gen_helper_gvec_usdot_idx_4b)
TRANS_FEAT(BFDOT_vi, aa64_bf16, do_dot_vector_idx_env, a,
gen_helper_gvec_bfdot_idx)
diff --git a/target/arm/tcg/translate-neon.c b/target/arm/tcg/translate-neon.c
index c4fecb8fd6..ea04336797 100644
--- a/target/arm/tcg/translate-neon.c
+++ b/target/arm/tcg/translate-neon.c
@@ -271,7 +271,7 @@ static bool trans_VSDOT(DisasContext *s, arg_VSDOT *a)
return false;
}
return do_neon_ddda(s, a->q * 7, a->vd, a->vn, a->vm, 0,
- gen_helper_gvec_sdot_b);
+ gen_helper_gvec_sdot_4b);
}
static bool trans_VUDOT(DisasContext *s, arg_VUDOT *a)
@@ -280,7 +280,7 @@ static bool trans_VUDOT(DisasContext *s, arg_VUDOT *a)
return false;
}
return do_neon_ddda(s, a->q * 7, a->vd, a->vn, a->vm, 0,
- gen_helper_gvec_udot_b);
+ gen_helper_gvec_udot_4b);
}
static bool trans_VUSDOT(DisasContext *s, arg_VUSDOT *a)
@@ -289,7 +289,7 @@ static bool trans_VUSDOT(DisasContext *s, arg_VUSDOT *a)
return false;
}
return do_neon_ddda(s, a->q * 7, a->vd, a->vn, a->vm, 0,
- gen_helper_gvec_usdot_b);
+ gen_helper_gvec_usdot_4b);
}
static bool trans_VDOT_b16(DisasContext *s, arg_VDOT_b16 *a)
@@ -356,7 +356,7 @@ static bool trans_VSDOT_scalar(DisasContext *s, arg_VSDOT_scalar *a)
return false;
}
return do_neon_ddda(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
- gen_helper_gvec_sdot_idx_b);
+ gen_helper_gvec_sdot_idx_4b);
}
static bool trans_VUDOT_scalar(DisasContext *s, arg_VUDOT_scalar *a)
@@ -365,7 +365,7 @@ static bool trans_VUDOT_scalar(DisasContext *s, arg_VUDOT_scalar *a)
return false;
}
return do_neon_ddda(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
- gen_helper_gvec_udot_idx_b);
+ gen_helper_gvec_udot_idx_4b);
}
static bool trans_VUSDOT_scalar(DisasContext *s, arg_VUSDOT_scalar *a)
@@ -374,7 +374,7 @@ static bool trans_VUSDOT_scalar(DisasContext *s, arg_VUSDOT_scalar *a)
return false;
}
return do_neon_ddda(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
- gen_helper_gvec_usdot_idx_b);
+ gen_helper_gvec_usdot_idx_4b);
}
static bool trans_VSUDOT_scalar(DisasContext *s, arg_VSUDOT_scalar *a)
@@ -383,7 +383,7 @@ static bool trans_VSUDOT_scalar(DisasContext *s, arg_VSUDOT_scalar *a)
return false;
}
return do_neon_ddda(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
- gen_helper_gvec_sudot_idx_b);
+ gen_helper_gvec_sudot_idx_4b);
}
static bool trans_VDOT_b16_scal(DisasContext *s, arg_VDOT_b16_scal *a)
diff --git a/target/arm/tcg/translate-sve.c b/target/arm/tcg/translate-sve.c
index e56ef6ad68..3c8c018af3 100644
--- a/target/arm/tcg/translate-sve.c
+++ b/target/arm/tcg/translate-sve.c
@@ -3385,8 +3385,8 @@ DO_ZZI(UMIN, umin)
#undef DO_ZZI
static gen_helper_gvec_4 * const dot_fns[2][2] = {
- { gen_helper_gvec_sdot_b, gen_helper_gvec_sdot_h },
- { gen_helper_gvec_udot_b, gen_helper_gvec_udot_h }
+ { gen_helper_gvec_sdot_4b, gen_helper_gvec_sdot_4h },
+ { gen_helper_gvec_udot_4b, gen_helper_gvec_udot_4h }
};
TRANS_FEAT(DOT_zzzz, aa64_sve, gen_gvec_ool_zzzz,
dot_fns[a->u][a->sz], a->rd, a->rn, a->rm, a->ra, 0)
@@ -3396,18 +3396,18 @@ TRANS_FEAT(DOT_zzzz, aa64_sve, gen_gvec_ool_zzzz,
*/
TRANS_FEAT(SDOT_zzxw_s, aa64_sve, gen_gvec_ool_arg_zzxz,
- gen_helper_gvec_sdot_idx_b, a)
+ gen_helper_gvec_sdot_idx_4b, a)
TRANS_FEAT(SDOT_zzxw_d, aa64_sve, gen_gvec_ool_arg_zzxz,
- gen_helper_gvec_sdot_idx_h, a)
+ gen_helper_gvec_sdot_idx_4h, a)
TRANS_FEAT(UDOT_zzxw_s, aa64_sve, gen_gvec_ool_arg_zzxz,
- gen_helper_gvec_udot_idx_b, a)
+ gen_helper_gvec_udot_idx_4b, a)
TRANS_FEAT(UDOT_zzxw_d, aa64_sve, gen_gvec_ool_arg_zzxz,
- gen_helper_gvec_udot_idx_h, a)
+ gen_helper_gvec_udot_idx_4h, a)
TRANS_FEAT(SUDOT_zzxw_s, aa64_sve_i8mm, gen_gvec_ool_arg_zzxz,
- gen_helper_gvec_sudot_idx_b, a)
+ gen_helper_gvec_sudot_idx_4b, a)
TRANS_FEAT(USDOT_zzxw_s, aa64_sve_i8mm, gen_gvec_ool_arg_zzxz,
- gen_helper_gvec_usdot_idx_b, a)
+ gen_helper_gvec_usdot_idx_4b, a)
#define DO_SVE2_RRX(NAME, FUNC) \
TRANS_FEAT(NAME, aa64_sve, gen_gvec_ool_zzz, FUNC, \
@@ -7106,7 +7106,7 @@ TRANS_FEAT(SQRDCMLAH_zzzz, aa64_sve2, gen_gvec_ool_zzzz,
sqrdcmlah_fns[a->esz], a->rd, a->rn, a->rm, a->ra, a->rot)
TRANS_FEAT(USDOT_zzzz, aa64_sve_i8mm, gen_gvec_ool_arg_zzzz,
- a->esz == 2 ? gen_helper_gvec_usdot_b : NULL, a, 0)
+ a->esz == 2 ? gen_helper_gvec_usdot_4b : NULL, a, 0)
TRANS_FEAT_NONSTREAMING(AESMC, aa64_sve2_aes, gen_gvec_ool_zz,
gen_helper_crypto_aesmc, a->rd, a->rd, 0)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index cfd219ed13..8beace8147 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -825,11 +825,11 @@ void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
clear_tail(d, opr_sz, simd_maxsz(desc)); \
}
-DO_DOT(gvec_sdot_b, int32_t, int8_t, int8_t)
-DO_DOT(gvec_udot_b, uint32_t, uint8_t, uint8_t)
-DO_DOT(gvec_usdot_b, uint32_t, uint8_t, int8_t)
-DO_DOT(gvec_sdot_h, int64_t, int16_t, int16_t)
-DO_DOT(gvec_udot_h, uint64_t, uint16_t, uint16_t)
+DO_DOT(gvec_sdot_4b, int32_t, int8_t, int8_t)
+DO_DOT(gvec_udot_4b, uint32_t, uint8_t, uint8_t)
+DO_DOT(gvec_usdot_4b, uint32_t, uint8_t, int8_t)
+DO_DOT(gvec_sdot_4h, int64_t, int16_t, int16_t)
+DO_DOT(gvec_udot_4h, uint64_t, uint16_t, uint16_t)
#define DO_DOT_IDX(NAME, TYPED, TYPEN, TYPEM, HD) \
void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
@@ -865,12 +865,12 @@ void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
clear_tail(d, opr_sz, simd_maxsz(desc)); \
}
-DO_DOT_IDX(gvec_sdot_idx_b, int32_t, int8_t, int8_t, H4)
-DO_DOT_IDX(gvec_udot_idx_b, uint32_t, uint8_t, uint8_t, H4)
-DO_DOT_IDX(gvec_sudot_idx_b, int32_t, int8_t, uint8_t, H4)
-DO_DOT_IDX(gvec_usdot_idx_b, int32_t, uint8_t, int8_t, H4)
-DO_DOT_IDX(gvec_sdot_idx_h, int64_t, int16_t, int16_t, H8)
-DO_DOT_IDX(gvec_udot_idx_h, uint64_t, uint16_t, uint16_t, H8)
+DO_DOT_IDX(gvec_sdot_idx_4b, int32_t, int8_t, int8_t, H4)
+DO_DOT_IDX(gvec_udot_idx_4b, uint32_t, uint8_t, uint8_t, H4)
+DO_DOT_IDX(gvec_sudot_idx_4b, int32_t, int8_t, uint8_t, H4)
+DO_DOT_IDX(gvec_usdot_idx_4b, int32_t, uint8_t, int8_t, H4)
+DO_DOT_IDX(gvec_sdot_idx_4h, int64_t, int16_t, int16_t, H8)
+DO_DOT_IDX(gvec_udot_idx_4h, uint64_t, uint16_t, uint16_t, H8)
void HELPER(gvec_fcaddh)(void *vd, void *vn, void *vm,
float_status *fpst, uint32_t desc)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 36/61] target/arm: Remove helper_gvec_sudot_idx_4b
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (34 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 35/61] target/arm: Rename helper_gvec_*dot_[bh] to *_4[bh] Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 37/61] target/arm: Implemement SME2 SDOT, UDOT, USDOT, SUDOT Richard Henderson
` (25 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Add gen_helper_gvec_sudot_idx_4b as an expander which
swaps arguments and uses helper_gvec_usdot_idx_4b.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 2 --
target/arm/tcg/translate.h | 3 +++
target/arm/tcg/gengvec.c | 6 ++++++
target/arm/tcg/vec_helper.c | 1 -
4 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 5772ea98bd..7ae4cddb3c 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -620,8 +620,6 @@ DEF_HELPER_FLAGS_5(gvec_sdot_idx_4h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_udot_idx_4h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
-DEF_HELPER_FLAGS_5(gvec_sudot_idx_4b, TCG_CALL_NO_RWG,
- void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_usdot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/translate.h b/target/arm/tcg/translate.h
index be39adfa86..a6e6181421 100644
--- a/target/arm/tcg/translate.h
+++ b/target/arm/tcg/translate.h
@@ -624,6 +624,9 @@ void gen_gvec_urecpe(unsigned vece, uint32_t rd_ofs, uint32_t rn_ofs,
void gen_gvec_ursqrte(unsigned vece, uint32_t rd_ofs, uint32_t rn_ofs,
uint32_t opr_sz, uint32_t max_sz);
+void gen_helper_gvec_sudot_idx_4b(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc);
+
/*
* Forward to the isar_feature_* tests given a DisasContext pointer.
*/
diff --git a/target/arm/tcg/gengvec.c b/target/arm/tcg/gengvec.c
index 01867f8ace..c182e5ab6f 100644
--- a/target/arm/tcg/gengvec.c
+++ b/target/arm/tcg/gengvec.c
@@ -23,6 +23,12 @@
#include "translate.h"
+void gen_helper_gvec_sudot_idx_4b(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc)
+{
+ gen_helper_gvec_usdot_idx_4b(d, m, n, a, desc);
+}
+
static void gen_gvec_fn3_qc(uint32_t rd_ofs, uint32_t rn_ofs, uint32_t rm_ofs,
uint32_t opr_sz, uint32_t max_sz,
gen_helper_gvec_3_ptr *fn)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 8beace8147..7904159d57 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -867,7 +867,6 @@ void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
DO_DOT_IDX(gvec_sdot_idx_4b, int32_t, int8_t, int8_t, H4)
DO_DOT_IDX(gvec_udot_idx_4b, uint32_t, uint8_t, uint8_t, H4)
-DO_DOT_IDX(gvec_sudot_idx_4b, int32_t, int8_t, uint8_t, H4)
DO_DOT_IDX(gvec_usdot_idx_4b, int32_t, uint8_t, int8_t, H4)
DO_DOT_IDX(gvec_sdot_idx_4h, int64_t, int16_t, int16_t, H8)
DO_DOT_IDX(gvec_udot_idx_4h, uint64_t, uint16_t, uint16_t, H8)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 37/61] target/arm: Implemement SME2 SDOT, UDOT, USDOT, SUDOT
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (35 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 36/61] target/arm: Remove helper_gvec_sudot_idx_4b Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 38/61] target/arm: Implement SME2 SVDOT, UVDOT, SUVDOT, USVDOT Richard Henderson
` (24 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 6 +++
target/arm/tcg/translate-sme.c | 85 ++++++++++++++++++++++++++++++++++
target/arm/tcg/vec_helper.c | 51 ++++++++++++++++++++
target/arm/tcg/sme.decode | 63 ++++++++++++++++++++++++-
4 files changed, 204 insertions(+), 1 deletion(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 7ae4cddb3c..5509486aa7 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -611,6 +611,8 @@ DEF_HELPER_FLAGS_5(gvec_udot_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_sdot_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_udot_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_usdot_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_sdot_2h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_udot_2h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_sdot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
@@ -622,6 +624,10 @@ DEF_HELPER_FLAGS_5(gvec_udot_idx_4h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_usdot_idx_4b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_sdot_idx_2h, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_udot_idx_2h, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(gvec_fcaddh, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index f4a684f6a4..d859910f93 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -895,3 +895,88 @@ static bool trans_BFVDOT(DisasContext *s, arg_azx_n *a)
a->idx, 2, false, FPST_ENV,
gen_helper_sme2_bfvdot_idx);
}
+
+/*
+ * Expand array multi-vector single (n1), array multi-vector (nn),
+ * and array multi-vector indexed (nx), for integer accumulate.
+ * multi: true for nn, false for n1.
+ * data: stuff for simd_data, including any index.
+ */
+static bool do_azz_acc(DisasContext *s, int nreg, int nsel,
+ int rv, int off, int zn, int zm,
+ int data, int shsel, bool multi,
+ gen_helper_gvec_4 *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ int vstride = svl / nreg;
+ TCGv_ptr t_za = get_zarray(s, rv, off, nreg);
+ TCGv_ptr t = tcg_temp_new_ptr();
+
+ for (int r = 0; r < nreg; ++r) {
+ TCGv_ptr t_zn = vec_full_reg_ptr(s, zn);
+ TCGv_ptr t_zm = vec_full_reg_ptr(s, zm);
+
+ for (int i = 0; i < nsel; ++i) {
+ int o_za = (r * vstride + i) * sizeof(ARMVectorReg);
+ int desc = simd_desc(svl, svl, data | (i << shsel));
+
+ tcg_gen_addi_ptr(t, t_za, o_za);
+ fn(t, t_zn, t_zm, t, tcg_constant_i32(desc));
+ }
+
+ /*
+ * For multiple-and-single vectors, Zn may wrap.
+ * For multiple vectors, both Zn and Zm are aligned.
+ */
+ zn = (zn + 1) % 32;
+ zm += multi;
+ }
+ }
+ return true;
+}
+
+static bool do_dot(DisasContext *s, arg_azz_n *a, bool multi,
+ gen_helper_gvec_4 *fn)
+{
+ return do_azz_acc(s, a->n, 1, a->rv, a->off, a->zn, a->zm,
+ 0, 0, multi, fn);
+}
+
+static void gen_helper_gvec_sudot_4b(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc)
+{
+ gen_helper_gvec_usdot_4b(d, m, n, a, desc);
+}
+
+TRANS_FEAT(USDOT_n1, aa64_sme, do_dot, a, false, gen_helper_gvec_usdot_4b)
+TRANS_FEAT(SUDOT_n1, aa64_sme, do_dot, a, false, gen_helper_gvec_sudot_4b)
+TRANS_FEAT(SDOT_n1_2h, aa64_sme, do_dot, a, false, gen_helper_gvec_sdot_2h)
+TRANS_FEAT(UDOT_n1_2h, aa64_sme, do_dot, a, false, gen_helper_gvec_udot_2h)
+TRANS_FEAT(SDOT_n1_4b, aa64_sme, do_dot, a, false, gen_helper_gvec_sdot_4b)
+TRANS_FEAT(UDOT_n1_4b, aa64_sme, do_dot, a, false, gen_helper_gvec_udot_4b)
+TRANS_FEAT(SDOT_n1_4h, aa64_sme2_i16i64, do_dot, a, false, gen_helper_gvec_sdot_4h)
+TRANS_FEAT(UDOT_n1_4h, aa64_sme2_i16i64, do_dot, a, false, gen_helper_gvec_udot_4h)
+
+TRANS_FEAT(USDOT_nn, aa64_sme, do_dot, a, true, gen_helper_gvec_usdot_4b)
+TRANS_FEAT(SDOT_nn_2h, aa64_sme, do_dot, a, true, gen_helper_gvec_sdot_2h)
+TRANS_FEAT(UDOT_nn_2h, aa64_sme, do_dot, a, true, gen_helper_gvec_udot_2h)
+TRANS_FEAT(SDOT_nn_4b, aa64_sme, do_dot, a, true, gen_helper_gvec_sdot_4b)
+TRANS_FEAT(UDOT_nn_4b, aa64_sme, do_dot, a, true, gen_helper_gvec_udot_4b)
+TRANS_FEAT(SDOT_nn_4h, aa64_sme2_i16i64, do_dot, a, true, gen_helper_gvec_sdot_4h)
+TRANS_FEAT(UDOT_nn_4h, aa64_sme2_i16i64, do_dot, a, true, gen_helper_gvec_udot_4h)
+
+static bool do_dot_nx(DisasContext *s, arg_azx_n *a, gen_helper_gvec_4 *fn)
+{
+ return do_azz_acc(s, a->n, 1, a->rv, a->off, a->zn, a->zm,
+ a->idx, 0, false, fn);
+}
+
+TRANS_FEAT(USDOT_nx, aa64_sme, do_dot_nx, a, gen_helper_gvec_usdot_idx_4b)
+TRANS_FEAT(SUDOT_nx, aa64_sme, do_dot_nx, a, gen_helper_gvec_sudot_idx_4b)
+TRANS_FEAT(SDOT_nx_2h, aa64_sme, do_dot_nx, a, gen_helper_gvec_sdot_idx_2h)
+TRANS_FEAT(UDOT_nx_2h, aa64_sme, do_dot_nx, a, gen_helper_gvec_udot_idx_2h)
+TRANS_FEAT(SDOT_nx_4b, aa64_sme, do_dot_nx, a, gen_helper_gvec_sdot_idx_4b)
+TRANS_FEAT(UDOT_nx_4b, aa64_sme, do_dot_nx, a, gen_helper_gvec_udot_idx_4b)
+TRANS_FEAT(SDOT_nx_4h, aa64_sme2_i16i64, do_dot_nx, a, gen_helper_gvec_sdot_idx_4h)
+TRANS_FEAT(UDOT_nx_4h, aa64_sme2_i16i64, do_dot_nx, a, gen_helper_gvec_udot_idx_4h)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 7904159d57..ce7f55bfc8 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -871,6 +871,57 @@ DO_DOT_IDX(gvec_usdot_idx_4b, int32_t, uint8_t, int8_t, H4)
DO_DOT_IDX(gvec_sdot_idx_4h, int64_t, int16_t, int16_t, H8)
DO_DOT_IDX(gvec_udot_idx_4h, uint64_t, uint16_t, uint16_t, H8)
+#undef DO_DOT
+#undef DO_DOT_IDX
+
+/* Similar for 2-way dot product */
+#define DO_DOT(NAME, TYPED, TYPEN, TYPEM) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
+{ \
+ intptr_t i, opr_sz = simd_oprsz(desc); \
+ TYPED *d = vd, *a = va; \
+ TYPEN *n = vn; \
+ TYPEM *m = vm; \
+ for (i = 0; i < opr_sz / sizeof(TYPED); ++i) { \
+ d[i] = (a[i] + \
+ (TYPED)n[i * 2 + 0] * m[i * 2 + 0] + \
+ (TYPED)n[i * 2 + 1] * m[i * 2 + 1]); \
+ } \
+ clear_tail(d, opr_sz, simd_maxsz(desc)); \
+}
+
+#define DO_DOT_IDX(NAME, TYPED, TYPEN, TYPEM, HD) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
+{ \
+ intptr_t i = 0, opr_sz = simd_oprsz(desc); \
+ intptr_t opr_sz_n = opr_sz / sizeof(TYPED); \
+ intptr_t segend = MIN(16 / sizeof(TYPED), opr_sz_n); \
+ intptr_t index = simd_data(desc); \
+ TYPED *d = vd, *a = va; \
+ TYPEN *n = vn; \
+ TYPEM *m_indexed = (TYPEM *)vm + HD(index) * 2; \
+ do { \
+ TYPED m0 = m_indexed[i * 2 + 0]; \
+ TYPED m1 = m_indexed[i * 2 + 1]; \
+ do { \
+ d[i] = (a[i] + \
+ n[i * 4 + 0] * m0 + \
+ n[i * 4 + 1] * m1); \
+ } while (++i < segend); \
+ segend = i + (16 / sizeof(TYPED)); \
+ } while (i < opr_sz_n); \
+ clear_tail(d, opr_sz, simd_maxsz(desc)); \
+}
+
+DO_DOT(gvec_sdot_2h, int32_t, int16_t, int16_t)
+DO_DOT(gvec_udot_2h, uint32_t, uint16_t, uint16_t)
+
+DO_DOT_IDX(gvec_sdot_idx_2h, int32_t, int16_t, int16_t, H4)
+DO_DOT_IDX(gvec_udot_idx_2h, uint32_t, uint16_t, uint16_t, H4)
+
+#undef DO_DOT
+#undef DO_DOT_IDX
+
void HELPER(gvec_fcaddh)(void *vd, void *vn, void *vm,
float_status *fpst, uint32_t desc)
{
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 7c057bcad2..338637decd 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -291,6 +291,26 @@ FDOT_n1 11000001 001 1 .... 0 .. 100 ..... 00 ... @azz_nx1_o3 n=4
BFDOT_n1 11000001 001 0 .... 0 .. 100 ..... 10 ... @azz_nx1_o3 n=2
BFDOT_n1 11000001 001 1 .... 0 .. 100 ..... 10 ... @azz_nx1_o3 n=4
+USDOT_n1 11000001 001 0 .... 0 .. 101 ..... 01 ... @azz_nx1_o3 n=2
+USDOT_n1 11000001 001 1 .... 0 .. 101 ..... 01 ... @azz_nx1_o3 n=4
+
+SUDOT_n1 11000001 001 0 .... 0 .. 101 ..... 11 ... @azz_nx1_o3 n=2
+SUDOT_n1 11000001 001 1 .... 0 .. 101 ..... 11 ... @azz_nx1_o3 n=4
+
+SDOT_n1_4b 11000001 001 0 .... 0 .. 101 ..... 00 ... @azz_nx1_o3 n=2
+SDOT_n1_4b 11000001 001 1 .... 0 .. 101 ..... 00 ... @azz_nx1_o3 n=4
+SDOT_n1_4h 11000001 011 0 .... 0 .. 101 ..... 00 ... @azz_nx1_o3 n=2
+SDOT_n1_4h 11000001 011 1 .... 0 .. 101 ..... 00 ... @azz_nx1_o3 n=4
+SDOT_n1_2h 11000001 011 0 .... 0 .. 101 ..... 01 ... @azz_nx1_o3 n=2
+SDOT_n1_2h 11000001 011 1 .... 0 .. 101 ..... 01 ... @azz_nx1_o3 n=4
+
+UDOT_n1_4b 11000001 001 0 .... 0 .. 101 ..... 10 ... @azz_nx1_o3 n=2
+UDOT_n1_4b 11000001 001 1 .... 0 .. 101 ..... 10 ... @azz_nx1_o3 n=4
+UDOT_n1_4h 11000001 011 0 .... 0 .. 101 ..... 10 ... @azz_nx1_o3 n=2
+UDOT_n1_4h 11000001 011 1 .... 0 .. 101 ..... 10 ... @azz_nx1_o3 n=4
+UDOT_n1_2h 11000001 011 0 .... 0 .. 101 ..... 11 ... @azz_nx1_o3 n=2
+UDOT_n1_2h 11000001 011 1 .... 0 .. 101 ..... 11 ... @azz_nx1_o3 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -334,6 +354,23 @@ FDOT_nn 11000001 101 ...01 0 .. 100 ...00 00 ... @azz_4x4_o3
BFDOT_nn 11000001 101 ....0 0 .. 100 ....0 10 ... @azz_2x2_o3
BFDOT_nn 11000001 101 ...01 0 .. 100 ...00 10 ... @azz_4x4_o3
+USDOT_nn 11000001 101 ....0 0 .. 101 ....0 01 ... @azz_2x2_o3
+USDOT_nn 11000001 101 ...01 0 .. 101 ...00 01 ... @azz_4x4_o3
+
+SDOT_nn_4b 11000001 101 ....0 0 .. 101 ....0 00 ... @azz_2x2_o3
+SDOT_nn_4b 11000001 101 ...01 0 .. 101 ...00 00 ... @azz_4x4_o3
+SDOT_nn_4h 11000001 111 ....0 0 .. 101 ....0 00 ... @azz_2x2_o3
+SDOT_nn_4h 11000001 111 ...01 0 .. 101 ...00 00 ... @azz_4x4_o3
+SDOT_nn_2h 11000001 111 ....0 0 .. 101 ....0 01 ... @azz_2x2_o3
+SDOT_nn_2h 11000001 111 ...01 0 .. 101 ...00 01 ... @azz_4x4_o3
+
+UDOT_nn_4b 11000001 101 ....0 0 .. 101 ....0 10 ... @azz_2x2_o3
+UDOT_nn_4b 11000001 101 ...01 0 .. 101 ...00 10 ... @azz_4x4_o3
+UDOT_nn_4h 11000001 111 ....0 0 .. 101 ....0 10 ... @azz_2x2_o3
+UDOT_nn_4h 11000001 111 ...01 0 .. 101 ...00 10 ... @azz_4x4_o3
+UDOT_nn_2h 11000001 111 ....0 0 .. 101 ....0 11 ... @azz_2x2_o3
+UDOT_nn_2h 11000001 111 ...01 0 .. 101 ...00 11 ... @azz_4x4_o3
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
@@ -367,7 +404,11 @@ BFMLSL_nx 11000001 1001 .... 1 .. 1 .. ...00 11 ... @azx_4x1_o2x2
@azx_2x1_i2_o3 ........ .... zm:4 . .. . idx:2 .... ... off:3 \
&azx_n n=2 rv=%mova_rv zn=%zn_ax2
@azx_4x1_i2_o3 ........ .... zm:4 . .. . idx:2 .... ... off:3 \
- &azx_n n=2 rv=%mova_rv zn=%zn_ax4
+ &azx_n n=4 rv=%mova_rv zn=%zn_ax4
+@azx_2x1_i1_o3 ........ .... zm:4 . .. .. idx:1 .... ... off:3 \
+ &azx_n n=2 rv=%mova_rv zn=%zn_ax2
+@azx_4x1_i1_o3 ........ .... zm:4 . .. .. idx:1 .... ... off:3 \
+ &azx_n n=4 rv=%mova_rv zn=%zn_ax4
FDOT_nx 11000001 0101 .... 0 .. 1 .. ....0 01 ... @azx_2x1_i2_o3
FDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 01 ... @azx_4x1_i2_o3
@@ -377,3 +418,23 @@ BFDOT_nx 11000001 0101 .... 1 .. 1 .. ...00 11 ... @azx_4x1_i2_o3
FVDOT 11000001 0101 .... 0 .. 0 .. ....0 01 ... @azx_2x1_i2_o3
BFVDOT 11000001 0101 .... 0 .. 0 .. ....0 11 ... @azx_2x1_i2_o3
+
+SDOT_nx_2h 11000001 0101 .... 0 .. 1 .. ....0 00 ... @azx_2x1_i2_o3
+SDOT_nx_2h 11000001 0101 .... 1 .. 1 .. ...00 00 ... @azx_4x1_i2_o3
+SDOT_nx_4b 11000001 0101 .... 0 .. 1 .. ....1 00 ... @azx_2x1_i2_o3
+SDOT_nx_4b 11000001 0101 .... 1 .. 1 .. ...01 00 ... @azx_4x1_i2_o3
+SDOT_nx_4h 11000001 1101 .... 0 .. 00 . ....0 01 ... @azx_2x1_i1_o3
+SDOT_nx_4h 11000001 1101 .... 1 .. 00 . ...00 01 ... @azx_4x1_i1_o3
+
+UDOT_nx_2h 11000001 0101 .... 0 .. 1 .. ....0 10 ... @azx_2x1_i2_o3
+UDOT_nx_2h 11000001 0101 .... 1 .. 1 .. ...00 10 ... @azx_4x1_i2_o3
+UDOT_nx_4b 11000001 0101 .... 0 .. 1 .. ....1 10 ... @azx_2x1_i2_o3
+UDOT_nx_4b 11000001 0101 .... 1 .. 1 .. ...01 10 ... @azx_4x1_i2_o3
+UDOT_nx_4h 11000001 1101 .... 0 .. 00 . ....0 11 ... @azx_2x1_i1_o3
+UDOT_nx_4h 11000001 1101 .... 1 .. 00 . ...00 11 ... @azx_4x1_i1_o3
+
+USDOT_nx 11000001 0101 .... 0 .. 1 .. ....1 01 ... @azx_2x1_i2_o3
+USDOT_nx 11000001 0101 .... 1 .. 1 .. ...01 01 ... @azx_4x1_i2_o3
+
+SUDOT_nx 11000001 0101 .... 0 .. 1 .. ....1 11 ... @azx_2x1_i2_o3
+SUDOT_nx 11000001 0101 .... 1 .. 1 .. ...01 11 ... @azx_4x1_i2_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 38/61] target/arm: Implement SME2 SVDOT, UVDOT, SUVDOT, USVDOT
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (36 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 37/61] target/arm: Implemement SME2 SDOT, UDOT, USDOT, SUDOT Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 39/61] target/arm: Implement SME2 SMLAL, SMLSL, UMLAL, UMLSL Richard Henderson
` (23 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 11 +++++++++
target/arm/tcg/sme_helper.c | 42 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 23 +++++++++++++++++++
target/arm/tcg/sme.decode | 11 +++++++++
4 files changed, 87 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 8f5a1b3c90..464877516b 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -180,3 +180,14 @@ DEF_HELPER_FLAGS_6(sme2_fdot_idx_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
DEF_HELPER_FLAGS_6(sme2_fvdot_idx_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, env, i32)
+
+DEF_HELPER_FLAGS_4(sme2_svdot_idx_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uvdot_idx_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_suvdot_idx_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_usvdot_idx_4b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(sme2_svdot_idx_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uvdot_idx_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(sme2_svdot_idx_2h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uvdot_idx_2h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 6fb7e6b306..2976c0b33d 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1386,3 +1386,45 @@ DEF_IMOP_16x2_32(umopa2_s, uint16_t, uint16_t)
DEF_IMOPH(sme2, smopa2, s)
DEF_IMOPH(sme2, umopa2, s)
+
+#define DO_VDOT_IDX(NAME, TYPED, TYPEN, TYPEM, HD, HN) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, uint32_t desc) \
+{ \
+ intptr_t svl = simd_oprsz(desc); \
+ intptr_t elements = svl / sizeof(TYPED); \
+ intptr_t eltperseg = 16 / sizeof(TYPED); \
+ intptr_t vstride = (svl / 4) * sizeof(ARMVectorReg); \
+ intptr_t zstride = sizeof(ARMVectorReg) / sizeof(TYPEN); \
+ intptr_t nreg = sizeof(TYPED) / sizeof(TYPEN); \
+ intptr_t idx = extract32(desc, SIMD_DATA_SHIFT, 2); \
+ TYPEN *n = vn; \
+ TYPEM *m = vm; \
+ for (intptr_t r = 0; r < nreg; r++) { \
+ TYPED *d = vd + r * vstride; \
+ for (intptr_t seg = 0; seg < elements; seg += eltperseg) { \
+ intptr_t s = seg + idx; \
+ for (intptr_t e = seg; e < seg + eltperseg; e++) { \
+ TYPED sum = d[HD(e)]; \
+ for (intptr_t i = 0; i < nreg; i++) { \
+ TYPED nn = n[i * zstride + HN(nreg * e + r)]; \
+ TYPED mm = m[HN(nreg * s + i)]; \
+ sum += nn * mm; \
+ } \
+ d[HD(e)] = sum; \
+ } \
+ } \
+ } \
+}
+
+DO_VDOT_IDX(sme2_svdot_idx_4b, int32_t, int8_t, int8_t, H4, H1)
+DO_VDOT_IDX(sme2_uvdot_idx_4b, uint32_t, uint8_t, uint8_t, H4, H1)
+DO_VDOT_IDX(sme2_suvdot_idx_4b, int32_t, int8_t, uint8_t, H4, H1)
+DO_VDOT_IDX(sme2_usvdot_idx_4b, int32_t, uint8_t, int8_t, H4, H1)
+
+DO_VDOT_IDX(sme2_svdot_idx_4h, int64_t, int16_t, int16_t, H8, H2)
+DO_VDOT_IDX(sme2_uvdot_idx_4h, uint64_t, uint16_t, uint16_t, H8, H2)
+
+DO_VDOT_IDX(sme2_svdot_idx_2h, int32_t, int16_t, int16_t, H4, H2)
+DO_VDOT_IDX(sme2_uvdot_idx_2h, uint32_t, uint16_t, uint16_t, H4, H2)
+
+#undef DO_VDOT_IDX
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index d859910f93..492933d42d 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -980,3 +980,26 @@ TRANS_FEAT(SDOT_nx_4b, aa64_sme, do_dot_nx, a, gen_helper_gvec_sdot_idx_4b)
TRANS_FEAT(UDOT_nx_4b, aa64_sme, do_dot_nx, a, gen_helper_gvec_udot_idx_4b)
TRANS_FEAT(SDOT_nx_4h, aa64_sme2_i16i64, do_dot_nx, a, gen_helper_gvec_sdot_idx_4h)
TRANS_FEAT(UDOT_nx_4h, aa64_sme2_i16i64, do_dot_nx, a, gen_helper_gvec_udot_idx_4h)
+
+static bool do_vdot_nx(DisasContext *s, arg_azx_n *a, gen_helper_gvec_3 *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ fn(get_zarray(s, a->rv, a->off, a->n),
+ vec_full_reg_ptr(s, a->zn),
+ vec_full_reg_ptr(s, a->zm),
+ tcg_constant_i32(simd_desc(svl, svl, a->idx)));
+ }
+ return true;
+}
+
+TRANS_FEAT(SVDOT_nx_2h, aa64_sme, do_vdot_nx, a, gen_helper_sme2_svdot_idx_2h)
+TRANS_FEAT(SVDOT_nx_4b, aa64_sme, do_vdot_nx, a, gen_helper_sme2_svdot_idx_4b)
+TRANS_FEAT(SVDOT_nx_4h, aa64_sme, do_vdot_nx, a, gen_helper_sme2_svdot_idx_4h)
+
+TRANS_FEAT(UVDOT_nx_2h, aa64_sme, do_vdot_nx, a, gen_helper_sme2_uvdot_idx_2h)
+TRANS_FEAT(UVDOT_nx_4b, aa64_sme, do_vdot_nx, a, gen_helper_sme2_uvdot_idx_4b)
+TRANS_FEAT(UVDOT_nx_4h, aa64_sme, do_vdot_nx, a, gen_helper_sme2_uvdot_idx_4h)
+
+TRANS_FEAT(SUVDOT_nx_4b, aa64_sme, do_vdot_nx, a, gen_helper_sme2_suvdot_idx_4b)
+TRANS_FEAT(USVDOT_nx_4b, aa64_sme, do_vdot_nx, a, gen_helper_sme2_usvdot_idx_4b)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 338637decd..4146744a46 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -438,3 +438,14 @@ USDOT_nx 11000001 0101 .... 1 .. 1 .. ...01 01 ... @azx_4x1_i2_o3
SUDOT_nx 11000001 0101 .... 0 .. 1 .. ....1 11 ... @azx_2x1_i2_o3
SUDOT_nx 11000001 0101 .... 1 .. 1 .. ...01 11 ... @azx_4x1_i2_o3
+
+SVDOT_nx_2h 11000001 0101 .... 0 .. 0 .. ....1 00 ... @azx_2x1_i2_o3
+SVDOT_nx_4b 11000001 0101 .... 1 .. 0 .. ...01 00 ... @azx_4x1_i2_o3
+SVDOT_nx_4h 11000001 1101 .... 1 .. 01 . ...00 01 ... @azx_4x1_i1_o3
+
+UVDOT_nx_2h 11000001 0101 .... 0 .. 0 .. ....1 10 ... @azx_2x1_i2_o3
+UVDOT_nx_4b 11000001 0101 .... 1 .. 0 .. ...01 10 ... @azx_4x1_i2_o3
+UVDOT_nx_4h 11000001 1101 .... 1 .. 01 . ...00 11 ... @azx_4x1_i1_o3
+
+SUVDOT_nx_4b 11000001 0101 .... 1 .. 0 .. ...01 11 ... @azx_4x1_i2_o3
+USVDOT_nx_4b 11000001 0101 .... 1 .. 0 .. ...01 01 ... @azx_4x1_i2_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 39/61] target/arm: Implement SME2 SMLAL, SMLSL, UMLAL, UMLSL
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (37 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 38/61] target/arm: Implement SME2 SVDOT, UVDOT, SUVDOT, USVDOT Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 40/61] target/arm: Implement SME2 SMLALL, SMLSLL, UMLALL, UMLSLL Richard Henderson
` (22 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 29 ++++++++++++++++++++++
target/arm/tcg/sme.decode | 44 ++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 492933d42d..f43511335f 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1003,3 +1003,32 @@ TRANS_FEAT(UVDOT_nx_4h, aa64_sme, do_vdot_nx, a, gen_helper_sme2_uvdot_idx_4h)
TRANS_FEAT(SUVDOT_nx_4b, aa64_sme, do_vdot_nx, a, gen_helper_sme2_suvdot_idx_4b)
TRANS_FEAT(USVDOT_nx_4b, aa64_sme, do_vdot_nx, a, gen_helper_sme2_usvdot_idx_4b)
+
+static bool do_smlal(DisasContext *s, arg_azz_n *a, bool multi,
+ gen_helper_gvec_4 *fn)
+{
+ return do_azz_acc(s, a->n, 2, a->rv, a->off, a->zn, a->zm,
+ 0, 0, multi, fn);
+}
+
+TRANS_FEAT(SMLAL_n1, aa64_sme, do_smlal, a, false, gen_helper_sve2_smlal_zzzw_s)
+TRANS_FEAT(SMLSL_n1, aa64_sme, do_smlal, a, false, gen_helper_sve2_smlsl_zzzw_s)
+TRANS_FEAT(UMLAL_n1, aa64_sme, do_smlal, a, false, gen_helper_sve2_umlal_zzzw_s)
+TRANS_FEAT(UMLSL_n1, aa64_sme, do_smlal, a, false, gen_helper_sve2_umlsl_zzzw_s)
+
+TRANS_FEAT(SMLAL_nn, aa64_sme, do_smlal, a, true, gen_helper_sve2_smlal_zzzw_s)
+TRANS_FEAT(SMLSL_nn, aa64_sme, do_smlal, a, true, gen_helper_sve2_smlsl_zzzw_s)
+TRANS_FEAT(UMLAL_nn, aa64_sme, do_smlal, a, true, gen_helper_sve2_umlal_zzzw_s)
+TRANS_FEAT(UMLSL_nn, aa64_sme, do_smlal, a, true, gen_helper_sve2_umlsl_zzzw_s)
+
+static bool do_smlal_nx(DisasContext *s, arg_azx_n *a,
+ gen_helper_gvec_4 *fn)
+{
+ return do_azz_acc(s, a->n, 2, a->rv, a->off, a->zn, a->zm,
+ a->idx << 1, 0, false, fn);
+}
+
+TRANS_FEAT(SMLAL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_smlal_idx_s)
+TRANS_FEAT(SMLSL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_smlsl_idx_s)
+TRANS_FEAT(UMLAL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_umlal_idx_s)
+TRANS_FEAT(UMLSL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_umlsl_idx_s)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 4146744a46..ec011d9382 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -311,6 +311,22 @@ UDOT_n1_4h 11000001 011 1 .... 0 .. 101 ..... 10 ... @azz_nx1_o3 n=4
UDOT_n1_2h 11000001 011 0 .... 0 .. 101 ..... 11 ... @azz_nx1_o3 n=2
UDOT_n1_2h 11000001 011 1 .... 0 .. 101 ..... 11 ... @azz_nx1_o3 n=4
+SMLAL_n1 11000001 011 0 .... 0 .. 011 ..... 00 ... @azz_nx1_o3x2 n=1
+SMLAL_n1 11000001 011 0 .... 0 .. 010 ..... 000 .. @azz_nx1_o2x2 n=2
+SMLAL_n1 11000001 011 1 .... 0 .. 010 ..... 000 .. @azz_nx1_o2x2 n=4
+
+SMLSL_n1 11000001 011 0 .... 0 .. 011 ..... 01 ... @azz_nx1_o3x2 n=1
+SMLSL_n1 11000001 011 0 .... 0 .. 010 ..... 010 .. @azz_nx1_o2x2 n=2
+SMLSL_n1 11000001 011 1 .... 0 .. 010 ..... 010 .. @azz_nx1_o2x2 n=4
+
+UMLAL_n1 11000001 011 0 .... 0 .. 011 ..... 10 ... @azz_nx1_o3x2 n=1
+UMLAL_n1 11000001 011 0 .... 0 .. 010 ..... 100 .. @azz_nx1_o2x2 n=2
+UMLAL_n1 11000001 011 1 .... 0 .. 010 ..... 100 .. @azz_nx1_o2x2 n=4
+
+UMLSL_n1 11000001 011 0 .... 0 .. 011 ..... 11 ... @azz_nx1_o3x2 n=1
+UMLSL_n1 11000001 011 0 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=2
+UMLSL_n1 11000001 011 1 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -371,6 +387,18 @@ UDOT_nn_4h 11000001 111 ...01 0 .. 101 ...00 10 ... @azz_4x4_o3
UDOT_nn_2h 11000001 111 ....0 0 .. 101 ....0 11 ... @azz_2x2_o3
UDOT_nn_2h 11000001 111 ...01 0 .. 101 ...00 11 ... @azz_4x4_o3
+SMLAL_nn 11000001 111 ....0 0 .. 010 ....0 000 .. @azz_2x2_o2x2
+SMLAL_nn 11000001 111 ...01 0 .. 010 ...00 000 .. @azz_4x4_o2x2
+
+SMLSL_nn 11000001 111 ....0 0 .. 010 ....0 010 .. @azz_2x2_o2x2
+SMLSL_nn 11000001 111 ...01 0 .. 010 ...00 010 .. @azz_4x4_o2x2
+
+UMLAL_nn 11000001 111 ....0 0 .. 010 ....0 100 .. @azz_2x2_o2x2
+UMLAL_nn 11000001 111 ...01 0 .. 010 ...00 100 .. @azz_4x4_o2x2
+
+UMLSL_nn 11000001 111 ....0 0 .. 010 ....0 110 .. @azz_2x2_o2x2
+UMLSL_nn 11000001 111 ...01 0 .. 010 ...00 110 .. @azz_4x4_o2x2
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
@@ -449,3 +477,19 @@ UVDOT_nx_4h 11000001 1101 .... 1 .. 01 . ...00 11 ... @azx_4x1_i1_o3
SUVDOT_nx_4b 11000001 0101 .... 1 .. 0 .. ...01 11 ... @azx_4x1_i2_o3
USVDOT_nx_4b 11000001 0101 .... 1 .. 0 .. ...01 01 ... @azx_4x1_i2_o3
+
+SMLAL_nx 11000001 1100 .... . .. 1 .. ..... 00 ... @azx_1x1_o3x2
+SMLAL_nx 11000001 1101 .... 0 .. 1 .. ....0 00 ... @azx_2x1_o2x2
+SMLAL_nx 11000001 1101 .... 1 .. 1 .. ...00 00 ... @azx_4x1_o2x2
+
+SMLSL_nx 11000001 1100 .... . .. 1 .. ..... 01 ... @azx_1x1_o3x2
+SMLSL_nx 11000001 1101 .... 0 .. 1 .. ....0 01 ... @azx_2x1_o2x2
+SMLSL_nx 11000001 1101 .... 1 .. 1 .. ...00 01 ... @azx_4x1_o2x2
+
+UMLAL_nx 11000001 1100 .... . .. 1 .. ..... 10 ... @azx_1x1_o3x2
+UMLAL_nx 11000001 1101 .... 0 .. 1 .. ....0 10 ... @azx_2x1_o2x2
+UMLAL_nx 11000001 1101 .... 1 .. 1 .. ...00 10 ... @azx_4x1_o2x2
+
+UMLSL_nx 11000001 1100 .... . .. 1 .. ..... 11 ... @azx_1x1_o3x2
+UMLSL_nx 11000001 1101 .... 0 .. 1 .. ....0 11 ... @azx_2x1_o2x2
+UMLSL_nx 11000001 1101 .... 1 .. 1 .. ...00 11 ... @azx_4x1_o2x2
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 40/61] target/arm: Implement SME2 SMLALL, SMLSLL, UMLALL, UMLSLL
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (38 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 39/61] target/arm: Implement SME2 SMLAL, SMLSL, UMLAL, UMLSL Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 41/61] target/arm: Rename gvec_fml[as]_[hs] with _nf_ infix Richard Henderson
` (21 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 22 ++++++
target/arm/tcg/sme_helper.c | 60 +++++++++++++++
target/arm/tcg/translate-sme.c | 78 +++++++++++++++++++
target/arm/tcg/sme.decode | 135 +++++++++++++++++++++++++++++++++
4 files changed, 295 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 464877516b..673aa347bc 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -191,3 +191,25 @@ DEF_HELPER_FLAGS_4(sme2_uvdot_idx_4h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_4(sme2_svdot_idx_2h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_4(sme2_uvdot_idx_2h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_5(sme2_smlall_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_smlall_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_smlsll_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_smlsll_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlall_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlall_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlsll_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlsll_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_usmlall_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_usmlall_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_5(sme2_smlall_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_smlall_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_smlsll_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_smlsll_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlall_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlall_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlsll_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_umlsll_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_usmlall_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sme2_usmlall_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 2976c0b33d..6fbf49e77d 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1428,3 +1428,63 @@ DO_VDOT_IDX(sme2_svdot_idx_2h, int32_t, int16_t, int16_t, H4, H2)
DO_VDOT_IDX(sme2_uvdot_idx_2h, uint32_t, uint16_t, uint16_t, H4, H2)
#undef DO_VDOT_IDX
+
+#define DO_MLALL(NAME, TYPEW, TYPEN, TYPEM, HW, HN, OP) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
+{ \
+ intptr_t elements = simd_oprsz(desc) / sizeof(TYPEW); \
+ intptr_t sel = extract32(desc, SIMD_DATA_SHIFT, 2); \
+ TYPEW *d = vd, *a = va; TYPEN *n = vn; TYPEM *m = vm; \
+ for (intptr_t i = 0; i < elements; ++i) { \
+ TYPEW nn = n[HN(i * 4 + sel)]; \
+ TYPEN mm = m[HN(i * 4 + sel)]; \
+ d[HW(i)] = a[HW(i)] OP (nn * mm); \
+ } \
+}
+
+DO_MLALL(sme2_smlall_s, int32_t, int8_t, int8_t, H4, H1, +)
+DO_MLALL(sme2_smlall_d, int64_t, int16_t, int16_t, H8, H2, +)
+DO_MLALL(sme2_smlsll_s, int32_t, int8_t, int8_t, H4, H1, -)
+DO_MLALL(sme2_smlsll_d, int64_t, int16_t, int16_t, H8, H2, -)
+
+DO_MLALL(sme2_umlall_s, uint32_t, uint8_t, uint8_t, H4, H1, +)
+DO_MLALL(sme2_umlall_d, uint64_t, uint16_t, uint16_t, H8, H2, +)
+DO_MLALL(sme2_umlsll_s, uint32_t, uint8_t, uint8_t, H4, H1, -)
+DO_MLALL(sme2_umlsll_d, uint64_t, uint16_t, uint16_t, H8, H2, -)
+
+DO_MLALL(sme2_usmlall_s, uint32_t, uint8_t, int8_t, H4, H1, +)
+DO_MLALL(sme2_usmlall_d, uint64_t, uint16_t, int16_t, H8, H2, +)
+
+#undef DO_MLALL
+
+#define DO_MLALL_IDX(NAME, TYPEW, TYPEN, TYPEM, HW, HN, OP) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, uint32_t desc) \
+{ \
+ intptr_t elements = simd_oprsz(desc) / sizeof(TYPEW); \
+ intptr_t eltspersegment = 16 / sizeof(TYPEW); \
+ intptr_t sel = extract32(desc, SIMD_DATA_SHIFT, 2); \
+ intptr_t idx = extract32(desc, SIMD_DATA_SHIFT + 2, 4); \
+ TYPEW *d = vd, *a = va; TYPEN *n = vn; TYPEM *m = vm; \
+ for (intptr_t i = 0; i < elements; i += eltspersegment) { \
+ TYPEW mm = m[HN(i * 4 + idx)]; \
+ for (intptr_t j = 0; j < eltspersegment; ++j) { \
+ TYPEN nn = n[HN((i + j) * 4 + sel)]; \
+ d[HW(i + j)] = a[HW(i + j)] OP (nn * mm); \
+ } \
+ } \
+}
+
+DO_MLALL_IDX(sme2_smlall_idx_s, int32_t, int8_t, int8_t, H4, H1, +)
+DO_MLALL_IDX(sme2_smlall_idx_d, int64_t, int16_t, int16_t, H8, H2, +)
+DO_MLALL_IDX(sme2_smlsll_idx_s, int32_t, int8_t, int8_t, H4, H1, -)
+DO_MLALL_IDX(sme2_smlsll_idx_d, int64_t, int16_t, int16_t, H8, H2, -)
+
+DO_MLALL_IDX(sme2_umlall_idx_s, uint32_t, uint8_t, uint8_t, H4, H1, +)
+DO_MLALL_IDX(sme2_umlall_idx_d, uint64_t, uint16_t, uint16_t, H8, H2, +)
+DO_MLALL_IDX(sme2_umlsll_idx_s, uint32_t, uint8_t, uint8_t, H4, H1, -)
+DO_MLALL_IDX(sme2_umlsll_idx_d, uint64_t, uint16_t, uint16_t, H8, H2, -)
+
+DO_MLALL_IDX(sme2_usmlall_idx_s, uint32_t, uint8_t, int8_t, H4, H1, +)
+DO_MLALL_IDX(sme2_usmlall_idx_d, uint64_t, uint16_t, int16_t, H8, H2, +)
+
+#undef DO_MLALL_IDX
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index f43511335f..9c0989baff 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1032,3 +1032,81 @@ TRANS_FEAT(SMLAL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_smlal_idx_s)
TRANS_FEAT(SMLSL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_smlsl_idx_s)
TRANS_FEAT(UMLAL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_umlal_idx_s)
TRANS_FEAT(UMLSL_nx, aa64_sme, do_smlal_nx, a, gen_helper_sve2_umlsl_idx_s)
+
+static bool do_smlall(DisasContext *s, arg_azz_n *a, bool multi,
+ gen_helper_gvec_4 *fn)
+{
+ return do_azz_acc(s, a->n, 4, a->rv, a->off, a->zn, a->zm,
+ 0, 0, multi, fn);
+}
+
+static void gen_helper_sme2_sumlall_s(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc)
+{
+ gen_helper_sme2_usmlall_s(d, m, n, a, desc);
+}
+
+static void gen_helper_sme2_sumlall_d(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc)
+{
+ gen_helper_sme2_usmlall_d(d, m, n, a, desc);
+}
+
+TRANS_FEAT(SMLALL_n1_s, aa64_sme, do_smlall, a, false, gen_helper_sme2_smlall_s)
+TRANS_FEAT(SMLSLL_n1_s, aa64_sme, do_smlall, a, false, gen_helper_sme2_smlsll_s)
+TRANS_FEAT(UMLALL_n1_s, aa64_sme, do_smlall, a, false, gen_helper_sme2_umlall_s)
+TRANS_FEAT(UMLSLL_n1_s, aa64_sme, do_smlall, a, false, gen_helper_sme2_umlsll_s)
+TRANS_FEAT(USMLALL_n1_s, aa64_sme, do_smlall, a, false, gen_helper_sme2_usmlall_s)
+TRANS_FEAT(SUMLALL_n1_s, aa64_sme, do_smlall, a, false, gen_helper_sme2_sumlall_s)
+
+TRANS_FEAT(SMLALL_n1_d, aa64_sme_i16i64, do_smlall, a, false, gen_helper_sme2_smlall_d)
+TRANS_FEAT(SMLSLL_n1_d, aa64_sme_i16i64, do_smlall, a, false, gen_helper_sme2_smlsll_d)
+TRANS_FEAT(UMLALL_n1_d, aa64_sme_i16i64, do_smlall, a, false, gen_helper_sme2_umlall_d)
+TRANS_FEAT(UMLSLL_n1_d, aa64_sme_i16i64, do_smlall, a, false, gen_helper_sme2_umlsll_d)
+TRANS_FEAT(USMLALL_n1_d, aa64_sme_i16i64, do_smlall, a, false, gen_helper_sme2_usmlall_d)
+TRANS_FEAT(SUMLALL_n1_d, aa64_sme_i16i64, do_smlall, a, false, gen_helper_sme2_sumlall_d)
+
+TRANS_FEAT(SMLALL_nn_s, aa64_sme, do_smlall, a, true, gen_helper_sme2_smlall_s)
+TRANS_FEAT(SMLSLL_nn_s, aa64_sme, do_smlall, a, true, gen_helper_sme2_smlsll_s)
+TRANS_FEAT(UMLALL_nn_s, aa64_sme, do_smlall, a, true, gen_helper_sme2_umlall_s)
+TRANS_FEAT(UMLSLL_nn_s, aa64_sme, do_smlall, a, true, gen_helper_sme2_umlsll_s)
+TRANS_FEAT(USMLALL_nn_s, aa64_sme, do_smlall, a, true, gen_helper_sme2_usmlall_s)
+
+TRANS_FEAT(SMLALL_nn_d, aa64_sme_i16i64, do_smlall, a, true, gen_helper_sme2_smlall_d)
+TRANS_FEAT(SMLSLL_nn_d, aa64_sme_i16i64, do_smlall, a, true, gen_helper_sme2_smlsll_d)
+TRANS_FEAT(UMLALL_nn_d, aa64_sme_i16i64, do_smlall, a, true, gen_helper_sme2_umlall_d)
+TRANS_FEAT(UMLSLL_nn_d, aa64_sme_i16i64, do_smlall, a, true, gen_helper_sme2_umlsll_d)
+TRANS_FEAT(USMLALL_nn_d, aa64_sme_i16i64, do_smlall, a, true, gen_helper_sme2_usmlall_d)
+
+static bool do_smlall_nx(DisasContext *s, arg_azx_n *a,
+ gen_helper_gvec_4 *fn)
+{
+ return do_azz_acc(s, a->n, 4, a->rv, a->off, a->zn, a->zm,
+ a->idx << 2, 0, false, fn);
+}
+
+static void gen_helper_sme2_sumlall_idx_s(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc)
+{
+ gen_helper_sme2_usmlall_idx_s(d, m, n, a, desc);
+}
+
+static void gen_helper_sme2_sumlall_idx_d(TCGv_ptr d, TCGv_ptr n, TCGv_ptr m,
+ TCGv_ptr a, TCGv_i32 desc)
+{
+ gen_helper_sme2_usmlall_idx_d(d, m, n, a, desc);
+}
+
+TRANS_FEAT(SMLALL_nx_s, aa64_sme, do_smlall_nx, a, gen_helper_sme2_smlall_idx_s)
+TRANS_FEAT(SMLSLL_nx_s, aa64_sme, do_smlall_nx, a, gen_helper_sme2_smlsll_idx_s)
+TRANS_FEAT(UMLALL_nx_s, aa64_sme, do_smlall_nx, a, gen_helper_sme2_umlall_idx_s)
+TRANS_FEAT(UMLSLL_nx_s, aa64_sme, do_smlall_nx, a, gen_helper_sme2_smlsll_idx_s)
+TRANS_FEAT(USMLALL_nx_s, aa64_sme, do_smlall_nx, a, gen_helper_sme2_usmlall_idx_s)
+TRANS_FEAT(SUMLALL_nx_s, aa64_sme, do_smlall_nx, a, gen_helper_sme2_sumlall_idx_s)
+
+TRANS_FEAT(SMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_smlall_idx_d)
+TRANS_FEAT(SMLSLL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_smlsll_idx_d)
+TRANS_FEAT(UMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_umlall_idx_d)
+TRANS_FEAT(UMLSLL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_smlsll_idx_d)
+TRANS_FEAT(USMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_usmlall_idx_d)
+TRANS_FEAT(SUMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_sumlall_idx_d)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index ec011d9382..49a45612fd 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -327,6 +327,52 @@ UMLSL_n1 11000001 011 0 .... 0 .. 011 ..... 11 ... @azz_nx1_o3x2 n=1
UMLSL_n1 11000001 011 0 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=2
UMLSL_n1 11000001 011 1 .... 0 .. 010 ..... 110 .. @azz_nx1_o2x2 n=4
+%off2_x4 0:2 !function=times_4
+%off1_x4 0:1 !function=times_4
+
+@azz_nx1_o2x4 ........ ... . zm:4 . .. ... zn:5 ... .. \
+ &azz_n off=%off2_x4 rv=%mova_rv
+@azz_nx1_o1x4 ........ ... . zm:4 . .. ... zn:5 .... . \
+ &azz_n off=%off1_x4 rv=%mova_rv
+
+SMLALL_n1_s 11000001 001 0 .... 0 .. 001 ..... 000 .. @azz_nx1_o2x4 n=1
+SMLALL_n1_d 11000001 011 0 .... 0 .. 001 ..... 000 .. @azz_nx1_o2x4 n=1
+SMLALL_n1_s 11000001 001 0 .... 0 .. 000 ..... 0000 . @azz_nx1_o1x4 n=2
+SMLALL_n1_d 11000001 011 0 .... 0 .. 000 ..... 0000 . @azz_nx1_o1x4 n=2
+SMLALL_n1_s 11000001 001 1 .... 0 .. 000 ..... 0000 . @azz_nx1_o1x4 n=4
+SMLALL_n1_d 11000001 011 1 .... 0 .. 000 ..... 0000 . @azz_nx1_o1x4 n=4
+
+SMLSLL_n1_s 11000001 001 0 .... 0 .. 001 ..... 010 .. @azz_nx1_o2x4 n=1
+SMLSLL_n1_d 11000001 011 0 .... 0 .. 001 ..... 010 .. @azz_nx1_o2x4 n=1
+SMLSLL_n1_s 11000001 001 0 .... 0 .. 000 ..... 0100 . @azz_nx1_o1x4 n=2
+SMLSLL_n1_d 11000001 011 0 .... 0 .. 000 ..... 0100 . @azz_nx1_o1x4 n=2
+SMLSLL_n1_s 11000001 001 1 .... 0 .. 000 ..... 0100 . @azz_nx1_o1x4 n=4
+SMLSLL_n1_d 11000001 011 1 .... 0 .. 000 ..... 0100 . @azz_nx1_o1x4 n=4
+
+UMLALL_n1_s 11000001 001 0 .... 0 .. 001 ..... 100 .. @azz_nx1_o2x4 n=1
+UMLALL_n1_d 11000001 011 0 .... 0 .. 001 ..... 100 .. @azz_nx1_o2x4 n=1
+UMLALL_n1_s 11000001 001 0 .... 0 .. 000 ..... 1000 . @azz_nx1_o1x4 n=2
+UMLALL_n1_d 11000001 011 0 .... 0 .. 000 ..... 1000 . @azz_nx1_o1x4 n=2
+UMLALL_n1_s 11000001 001 1 .... 0 .. 000 ..... 1000 . @azz_nx1_o1x4 n=4
+UMLALL_n1_d 11000001 011 1 .... 0 .. 000 ..... 1000 . @azz_nx1_o1x4 n=4
+
+UMLSLL_n1_s 11000001 001 0 .... 0 .. 001 ..... 110 .. @azz_nx1_o2x4 n=1
+UMLSLL_n1_d 11000001 011 0 .... 0 .. 001 ..... 110 .. @azz_nx1_o2x4 n=1
+UMLSLL_n1_s 11000001 001 0 .... 0 .. 000 ..... 1100 . @azz_nx1_o1x4 n=2
+UMLSLL_n1_d 11000001 011 0 .... 0 .. 000 ..... 1100 . @azz_nx1_o1x4 n=2
+UMLSLL_n1_s 11000001 001 1 .... 0 .. 000 ..... 1100 . @azz_nx1_o1x4 n=4
+UMLSLL_n1_d 11000001 011 1 .... 0 .. 000 ..... 1100 . @azz_nx1_o1x4 n=4
+
+USMLALL_n1_s 11000001 001 0 .... 0 .. 000 ..... 0010 . @azz_nx1_o1x4 n=2
+USMLALL_n1_d 11000001 011 0 .... 0 .. 000 ..... 0010 . @azz_nx1_o1x4 n=2
+USMLALL_n1_s 11000001 001 1 .... 0 .. 000 ..... 0010 . @azz_nx1_o1x4 n=4
+USMLALL_n1_d 11000001 011 1 .... 0 .. 000 ..... 0010 . @azz_nx1_o1x4 n=4
+
+SUMLALL_n1_s 11000001 001 0 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=2
+SUMLALL_n1_d 11000001 011 0 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=2
+SUMLALL_n1_s 11000001 001 1 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=4
+SUMLALL_n1_d 11000001 011 1 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -399,6 +445,36 @@ UMLAL_nn 11000001 111 ...01 0 .. 010 ...00 100 .. @azz_4x4_o2x2
UMLSL_nn 11000001 111 ....0 0 .. 010 ....0 110 .. @azz_2x2_o2x2
UMLSL_nn 11000001 111 ...01 0 .. 010 ...00 110 .. @azz_4x4_o2x2
+@azz_2x2_o1x4 ........ ... ..... . .. ... ..... ... .. \
+ &azz_n n=2 rv=%mova_rv zn=%zn_ax2 zm=%zm_ax2 off=%off1_x4
+@azz_4x4_o1x4 ........ ... ..... . .. ... ..... ... .. \
+ &azz_n n=4 rv=%mova_rv zn=%zn_ax4 zm=%zm_ax4 off=%off1_x4
+
+SMLALL_nn_s 11000001 101 ....0 0 .. 000 ....0 0000 . @azz_2x2_o1x4
+SMLALL_nn_d 11000001 111 ....0 0 .. 000 ....0 0000 . @azz_2x2_o1x4
+SMLALL_nn_s 11000001 101 ...01 0 .. 000 ...00 0000 . @azz_4x4_o1x4
+SMLALL_nn_d 11000001 111 ...01 0 .. 000 ...00 0000 . @azz_4x4_o1x4
+
+SMLSLL_nn_s 11000001 101 ....0 0 .. 000 ....0 0100 . @azz_2x2_o1x4
+SMLSLL_nn_d 11000001 111 ....0 0 .. 000 ....0 0100 . @azz_2x2_o1x4
+SMLSLL_nn_s 11000001 101 ...01 0 .. 000 ...00 0100 . @azz_4x4_o1x4
+SMLSLL_nn_d 11000001 111 ...01 0 .. 000 ...00 0100 . @azz_4x4_o1x4
+
+UMLALL_nn_s 11000001 101 ....0 0 .. 000 ....0 1000 . @azz_2x2_o1x4
+UMLALL_nn_d 11000001 111 ....0 0 .. 000 ....0 1000 . @azz_2x2_o1x4
+UMLALL_nn_s 11000001 101 ...01 0 .. 000 ...00 1000 . @azz_4x4_o1x4
+UMLALL_nn_d 11000001 111 ...01 0 .. 000 ...00 1000 . @azz_4x4_o1x4
+
+UMLSLL_nn_s 11000001 101 ....0 0 .. 000 ....0 1100 . @azz_2x2_o1x4
+UMLSLL_nn_d 11000001 111 ....0 0 .. 000 ....0 1100 . @azz_2x2_o1x4
+UMLSLL_nn_s 11000001 101 ...01 0 .. 000 ...00 1100 . @azz_4x4_o1x4
+UMLSLL_nn_d 11000001 111 ...01 0 .. 000 ...00 1100 . @azz_4x4_o1x4
+
+USMLALL_nn_s 11000001 101 ....0 0 .. 000 ....0 1010 . @azz_2x2_o1x4
+USMLALL_nn_d 11000001 111 ....0 0 .. 000 ....0 1010 . @azz_2x2_o1x4
+USMLALL_nn_s 11000001 101 ...01 0 .. 000 ...00 1010 . @azz_4x4_o1x4
+USMLALL_nn_d 11000001 111 ...01 0 .. 000 ...00 1010 . @azz_4x4_o1x4
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
@@ -493,3 +569,62 @@ UMLAL_nx 11000001 1101 .... 1 .. 1 .. ...00 10 ... @azx_4x1_o2x2
UMLSL_nx 11000001 1100 .... . .. 1 .. ..... 11 ... @azx_1x1_o3x2
UMLSL_nx 11000001 1101 .... 0 .. 1 .. ....0 11 ... @azx_2x1_o2x2
UMLSL_nx 11000001 1101 .... 1 .. 1 .. ...00 11 ... @azx_4x1_o2x2
+
+%idx4_15_10 15:1 10:3
+%idx4_10_1 10:2 1:2
+%idx3_10_1 10:1 1:2
+
+@azx_1x1_i4_o2 ........ .... zm:4 . .. ... zn:5 ... .. \
+ &azx_n n=1 rv=%mova_rv off=%off2_x4 idx=%idx4_15_10
+@azx_1x1_i3_o2 ........ .... zm:4 . .. ... zn:5 ... .. \
+ &azx_n n=1 rv=%mova_rv off=%off2_x4 idx=%idx3_15_10
+@azx_2x1_i4_o1 ........ .... zm:4 . .. ... ..... ... .. \
+ &azx_n n=2 rv=%mova_rv off=%off1_x4 zn=%zn_ax2 idx=%idx4_10_1
+@azx_2x1_i3_o1 ........ .... zm:4 . .. ... ..... ... .. \
+ &azx_n n=2 rv=%mova_rv off=%off1_x4 zn=%zn_ax2 idx=%idx3_10_1
+@azx_4x1_i4_o1 ........ .... zm:4 . .. ... ..... ... .. \
+ &azx_n n=4 rv=%mova_rv off=%off1_x4 zn=%zn_ax4 idx=%idx4_10_1
+@azx_4x1_i3_o1 ........ .... zm:4 . .. ... ..... ... .. \
+ &azx_n n=4 rv=%mova_rv off=%off1_x4 zn=%zn_ax4 idx=%idx3_10_1
+
+SMLALL_nx_s 11000001 0000 .... . .. ... ..... 000 .. @azx_1x1_i4_o2
+SMLALL_nx_d 11000001 1000 .... . .. 0.. ..... 000 .. @azx_1x1_i3_o2
+SMLALL_nx_s 11000001 0001 .... 0 .. 0.. ....0 00 ... @azx_2x1_i4_o1
+SMLALL_nx_d 11000001 1001 .... 0 .. 00. ....0 00 ... @azx_2x1_i3_o1
+SMLALL_nx_s 11000001 0001 .... 1 .. 0.. ...00 00 ... @azx_4x1_i4_o1
+SMLALL_nx_d 11000001 1001 .... 1 .. 00. ...00 00 ... @azx_4x1_i3_o1
+
+SMLSLL_nx_s 11000001 0000 .... . .. ... ..... 010 .. @azx_1x1_i4_o2
+SMLSLL_nx_d 11000001 1000 .... . .. 0.. ..... 010 .. @azx_1x1_i3_o2
+SMLSLL_nx_s 11000001 0001 .... 0 .. 0.. ....0 01 ... @azx_2x1_i4_o1
+SMLSLL_nx_d 11000001 1001 .... 0 .. 00. ....0 01 ... @azx_2x1_i3_o1
+SMLSLL_nx_s 11000001 0001 .... 1 .. 0.. ...00 01 ... @azx_4x1_i4_o1
+SMLSLL_nx_d 11000001 1001 .... 1 .. 00. ...00 01 ... @azx_4x1_i3_o1
+
+UMLALL_nx_s 11000001 0000 .... . .. ... ..... 100 .. @azx_1x1_i4_o2
+UMLALL_nx_d 11000001 1000 .... . .. 0.. ..... 100 .. @azx_1x1_i3_o2
+UMLALL_nx_s 11000001 0001 .... 0 .. 0.. ....0 10 ... @azx_2x1_i4_o1
+UMLALL_nx_d 11000001 1001 .... 0 .. 00. ....0 10 ... @azx_2x1_i3_o1
+UMLALL_nx_s 11000001 0001 .... 1 .. 0.. ...00 10 ... @azx_4x1_i4_o1
+UMLALL_nx_d 11000001 1001 .... 1 .. 00. ...00 10 ... @azx_4x1_i3_o1
+
+UMLSLL_nx_s 11000001 0000 .... . .. ... ..... 110 .. @azx_1x1_i4_o2
+UMLSLL_nx_d 11000001 1000 .... . .. 0.. ..... 110 .. @azx_1x1_i3_o2
+UMLSLL_nx_s 11000001 0001 .... 0 .. 0.. ....0 11 ... @azx_2x1_i4_o1
+UMLSLL_nx_d 11000001 1001 .... 0 .. 00. ....0 11 ... @azx_2x1_i3_o1
+UMLSLL_nx_s 11000001 0001 .... 1 .. 0.. ...00 11 ... @azx_4x1_i4_o1
+UMLSLL_nx_d 11000001 1001 .... 1 .. 00. ...00 11 ... @azx_4x1_i3_o1
+
+USMLALL_nx_s 11000001 0000 .... . .. ... ..... 001 .. @azx_1x1_i4_o2
+USMLALL_nx_d 11000001 1000 .... . .. 0.. ..... 001 .. @azx_1x1_i3_o2
+USMLALL_nx_s 11000001 0001 .... 0 .. 0.. ....1 00 ... @azx_2x1_i4_o1
+USMLALL_nx_d 11000001 1001 .... 0 .. 00. ....1 00 ... @azx_2x1_i3_o1
+USMLALL_nx_s 11000001 0001 .... 1 .. 0.. ...01 00 ... @azx_4x1_i4_o1
+USMLALL_nx_d 11000001 1001 .... 1 .. 00. ...01 00 ... @azx_4x1_i3_o1
+
+SUMLALL_nx_s 11000001 0000 .... . .. ... ..... 101 .. @azx_1x1_i4_o2
+SUMLALL_nx_d 11000001 1000 .... . .. 0.. ..... 101 .. @azx_1x1_i3_o2
+SUMLALL_nx_s 11000001 0001 .... 0 .. 0.. ....1 10 ... @azx_2x1_i4_o1
+SUMLALL_nx_d 11000001 1001 .... 0 .. 00. ....1 10 ... @azx_2x1_i3_o1
+SUMLALL_nx_s 11000001 0001 .... 1 .. 0.. ...01 10 ... @azx_4x1_i4_o1
+SUMLALL_nx_d 11000001 1001 .... 1 .. 00. ...01 10 ... @azx_4x1_i3_o1
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 41/61] target/arm: Rename gvec_fml[as]_[hs] with _nf_ infix
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (39 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 40/61] target/arm: Implement SME2 SMLALL, SMLSLL, UMLALL, UMLSLL Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 42/61] target/arm: Implement SME2 FMLA, FMLS Richard Henderson
` (20 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Emphasize the non-fused nature of these multiply-add.
Matches other helpers such as gvec_rsqrts_nf_[hs].
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 8 ++++----
target/arm/tcg/translate-neon.c | 4 ++--
target/arm/tcg/vec_helper.c | 8 ++++----
3 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 5509486aa7..b00b79c12c 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -776,11 +776,11 @@ DEF_HELPER_FLAGS_5(gvec_recps_nf_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst,
DEF_HELPER_FLAGS_5(gvec_rsqrts_nf_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_rsqrts_nf_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
-DEF_HELPER_FLAGS_5(gvec_fmla_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
-DEF_HELPER_FLAGS_5(gvec_fmla_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fmla_nf_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fmla_nf_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
-DEF_HELPER_FLAGS_5(gvec_fmls_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
-DEF_HELPER_FLAGS_5(gvec_fmls_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fmls_nf_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_fmls_nf_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfma_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfma_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/translate-neon.c b/target/arm/tcg/translate-neon.c
index ea04336797..844d2e29e4 100644
--- a/target/arm/tcg/translate-neon.c
+++ b/target/arm/tcg/translate-neon.c
@@ -1010,8 +1010,8 @@ DO_3S_FP_GVEC(VACGE, gen_helper_gvec_facge_s, gen_helper_gvec_facge_h)
DO_3S_FP_GVEC(VACGT, gen_helper_gvec_facgt_s, gen_helper_gvec_facgt_h)
DO_3S_FP_GVEC(VMAX, gen_helper_gvec_fmax_s, gen_helper_gvec_fmax_h)
DO_3S_FP_GVEC(VMIN, gen_helper_gvec_fmin_s, gen_helper_gvec_fmin_h)
-DO_3S_FP_GVEC(VMLA, gen_helper_gvec_fmla_s, gen_helper_gvec_fmla_h)
-DO_3S_FP_GVEC(VMLS, gen_helper_gvec_fmls_s, gen_helper_gvec_fmls_h)
+DO_3S_FP_GVEC(VMLA, gen_helper_gvec_fmla_nf_s, gen_helper_gvec_fmla_nf_h)
+DO_3S_FP_GVEC(VMLS, gen_helper_gvec_fmls_nf_s, gen_helper_gvec_fmls_nf_h)
DO_3S_FP_GVEC(VFMA, gen_helper_gvec_vfma_s, gen_helper_gvec_vfma_h)
DO_3S_FP_GVEC(VFMS, gen_helper_gvec_vfms_s, gen_helper_gvec_vfms_h)
DO_3S_FP_GVEC(VRECPS, gen_helper_gvec_recps_nf_s, gen_helper_gvec_recps_nf_h)
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index ce7f55bfc8..2d2a000a4a 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -1667,11 +1667,11 @@ void HELPER(NAME)(void *vd, void *vn, void *vm, \
clear_tail(d, oprsz, simd_maxsz(desc)); \
}
-DO_MULADD(gvec_fmla_h, float16_muladd_nf, float16)
-DO_MULADD(gvec_fmla_s, float32_muladd_nf, float32)
+DO_MULADD(gvec_fmla_nf_h, float16_muladd_nf, float16)
+DO_MULADD(gvec_fmla_nf_s, float32_muladd_nf, float32)
-DO_MULADD(gvec_fmls_h, float16_mulsub_nf, float16)
-DO_MULADD(gvec_fmls_s, float32_mulsub_nf, float32)
+DO_MULADD(gvec_fmls_nf_h, float16_mulsub_nf, float16)
+DO_MULADD(gvec_fmls_nf_s, float32_mulsub_nf, float32)
DO_MULADD(gvec_vfma_h, float16_muladd_f, float16)
DO_MULADD(gvec_vfma_s, float32_muladd_f, float32)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 42/61] target/arm: Implement SME2 FMLA, FMLS
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (40 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 41/61] target/arm: Rename gvec_fml[as]_[hs] with _nf_ infix Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 43/61] target/arm: Implement SME2 BFMLA, BFMLS Richard Henderson
` (19 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 95 ++++++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 48 +++++++++++++++++
2 files changed, 143 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 9c0989baff..1989dedab5 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -761,6 +761,47 @@ TRANS_FEAT(SUB_azz_nn_d, aa64_sme2_i16i64, do_azz_nn, a, MO_64, tcg_gen_gvec_sub
*/
#define FPST_ENV -1
+static bool do_azz_fp(DisasContext *s, int nreg, int nsel,
+ int rv, int off, int zn, int zm,
+ int data, int shsel, bool multi, int fpst,
+ gen_helper_gvec_3_ptr *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ int vstride = svl / nreg;
+ TCGv_ptr t_za = get_zarray(s, rv, off, nreg);
+ TCGv_ptr t, ptr;
+
+ if (fpst >= 0) {
+ ptr = fpstatus_ptr(fpst);
+ } else {
+ ptr = tcg_env;
+ }
+ t = tcg_temp_new_ptr();
+
+ for (int r = 0; r < nreg; ++r) {
+ TCGv_ptr t_zn = vec_full_reg_ptr(s, zn);
+ TCGv_ptr t_zm = vec_full_reg_ptr(s, zm);
+
+ for (int i = 0; i < nsel; ++i) {
+ int o_za = (r * vstride + i) * sizeof(ARMVectorReg);
+ int desc = simd_desc(svl, svl, data | (i << shsel));
+
+ tcg_gen_addi_ptr(t, t_za, o_za);
+ fn(t, t_zn, t_zm, ptr, tcg_constant_i32(desc));
+ }
+
+ /*
+ * For multiple-and-single vectors, Zn may wrap.
+ * For multiple vectors, both Zn and Zm are aligned.
+ */
+ zn = (zn + 1) % 32;
+ zm += multi;
+ }
+ }
+ return true;
+}
+
static bool do_azz_acc_fp(DisasContext *s, int nreg, int nsel,
int rv, int off, int zn, int zm,
int data, int shsel, bool multi, int fpst,
@@ -896,6 +937,60 @@ static bool trans_BFVDOT(DisasContext *s, arg_azx_n *a)
gen_helper_sme2_bfvdot_idx);
}
+static bool do_fmla(DisasContext *s, arg_azz_n *a, bool multi,
+ ARMFPStatusFlavour fpst, gen_helper_gvec_3_ptr *fn)
+{
+ return do_azz_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm,
+ 0, 0, multi, fpst, fn);
+}
+
+TRANS_FEAT(FMLA_n1_h, aa64_sme2_f16f16, do_fmla, a, false, FPST_ZA_F16,
+ gen_helper_gvec_vfma_h)
+TRANS_FEAT(FMLS_n1_h, aa64_sme2_f16f16, do_fmla, a, false, FPST_ZA_F16,
+ s->fpcr_ah ? gen_helper_gvec_ah_vfms_h : gen_helper_gvec_vfms_h)
+TRANS_FEAT(FMLA_nn_h, aa64_sme2_f16f16, do_fmla, a, true, FPST_ZA_F16,
+ gen_helper_gvec_vfma_h)
+TRANS_FEAT(FMLS_nn_h, aa64_sme2_f16f16, do_fmla, a, true, FPST_ZA_F16,
+ s->fpcr_ah ? gen_helper_gvec_ah_vfms_h : gen_helper_gvec_vfms_h)
+
+TRANS_FEAT(FMLA_n1_s, aa64_sme2, do_fmla, a, false, FPST_ZA,
+ gen_helper_gvec_vfma_s)
+TRANS_FEAT(FMLS_n1_s, aa64_sme2, do_fmla, a, false, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_vfms_s : gen_helper_gvec_vfms_s)
+TRANS_FEAT(FMLA_nn_s, aa64_sme2, do_fmla, a, true, FPST_ZA,
+ gen_helper_gvec_vfma_s)
+TRANS_FEAT(FMLS_nn_s, aa64_sme2, do_fmla, a, true, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_vfms_s : gen_helper_gvec_vfms_s)
+
+TRANS_FEAT(FMLA_n1_d, aa64_sme2_f64f64, do_fmla, a, false, FPST_ZA,
+ gen_helper_gvec_vfma_d)
+TRANS_FEAT(FMLS_n1_d, aa64_sme2_f64f64, do_fmla, a, false, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_vfms_d : gen_helper_gvec_vfms_d)
+TRANS_FEAT(FMLA_nn_d, aa64_sme2_f64f64, do_fmla, a, true, FPST_ZA,
+ gen_helper_gvec_vfma_d)
+TRANS_FEAT(FMLS_nn_d, aa64_sme2_f64f64, do_fmla, a, true, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_vfms_d : gen_helper_gvec_vfms_d)
+
+static bool do_fmla_nx(DisasContext *s, arg_azx_n *a,
+ ARMFPStatusFlavour fpst, gen_helper_gvec_4_ptr *fn)
+{
+ return do_azz_acc_fp(s, a->n, 1, a->rv, a->off, a->zn, a->zm,
+ a->idx, 0, false, fpst, fn);
+}
+
+TRANS_FEAT(FMLA_nx_h, aa64_sme2_f16f16, do_fmla_nx, a, FPST_ZA_F16,
+ gen_helper_gvec_fmla_idx_h)
+TRANS_FEAT(FMLS_nx_h, aa64_sme2_f16f16, do_fmla_nx, a, FPST_ZA_F16,
+ s->fpcr_ah ? gen_helper_gvec_ah_fmls_idx_h : gen_helper_gvec_fmls_idx_h)
+TRANS_FEAT(FMLA_nx_s, aa64_sme2, do_fmla_nx, a, FPST_ZA,
+ gen_helper_gvec_fmla_idx_s)
+TRANS_FEAT(FMLS_nx_s, aa64_sme2, do_fmla_nx, a, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_fmls_idx_s : gen_helper_gvec_fmls_idx_s)
+TRANS_FEAT(FMLA_nx_d, aa64_sme2_f64f64, do_fmla_nx, a, FPST_ZA,
+ gen_helper_gvec_fmla_idx_d)
+TRANS_FEAT(FMLS_nx_d, aa64_sme2_f64f64, do_fmla_nx, a, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_fmls_idx_d : gen_helper_gvec_fmls_idx_d)
+
/*
* Expand array multi-vector single (n1), array multi-vector (nn),
* and array multi-vector indexed (nx), for integer accumulate.
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 49a45612fd..30fa60f9a0 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -373,6 +373,20 @@ SUMLALL_n1_d 11000001 011 0 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=2
SUMLALL_n1_s 11000001 001 1 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=4
SUMLALL_n1_d 11000001 011 1 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=4
+FMLA_n1_h 11000001 001 0 .... 0 .. 111 ..... 00 ... @azz_nx1_o3 n=2
+FMLA_n1_s 11000001 001 0 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=2
+FMLA_n1_d 11000001 011 0 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=2
+FMLA_n1_h 11000001 001 1 .... 0 .. 111 ..... 00 ... @azz_nx1_o3 n=4
+FMLA_n1_s 11000001 001 1 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=4
+FMLA_n1_d 11000001 011 1 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=4
+
+FMLS_n1_h 11000001 001 0 .... 0 .. 111 ..... 01 ... @azz_nx1_o3 n=2
+FMLS_n1_s 11000001 001 0 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=2
+FMLS_n1_d 11000001 011 0 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=2
+FMLS_n1_h 11000001 001 1 .... 0 .. 111 ..... 01 ... @azz_nx1_o3 n=4
+FMLS_n1_s 11000001 001 1 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=4
+FMLS_n1_d 11000001 011 1 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=4
+
### SME2 Multi-vector Multiple Array Vectors
%zn_ax2 6:4 !function=times_2
@@ -475,6 +489,20 @@ USMLALL_nn_d 11000001 111 ....0 0 .. 000 ....0 1010 . @azz_2x2_o1x4
USMLALL_nn_s 11000001 101 ...01 0 .. 000 ...00 1010 . @azz_4x4_o1x4
USMLALL_nn_d 11000001 111 ...01 0 .. 000 ...00 1010 . @azz_4x4_o1x4
+FMLA_nn_h 11000001 101 ....0 0 .. 100 ....0 01 ... @azz_2x2_o3
+FMLA_nn_s 11000001 101 ....0 0 .. 110 ....0 00 ... @azz_2x2_o3
+FMLA_nn_d 11000001 111 ....0 0 .. 110 ....0 00 ... @azz_2x2_o3
+FMLA_nn_h 11000001 101 ...01 0 .. 100 ...00 01 ... @azz_4x4_o3
+FMLA_nn_s 11000001 101 ...01 0 .. 110 ...00 00 ... @azz_4x4_o3
+FMLA_nn_d 11000001 111 ...01 0 .. 110 ...00 00 ... @azz_4x4_o3
+
+FMLS_nn_h 11000001 101 ....0 0 .. 100 ....0 11 ... @azz_2x2_o3
+FMLS_nn_s 11000001 101 ....0 0 .. 110 ....0 01 ... @azz_2x2_o3
+FMLS_nn_d 11000001 111 ....0 0 .. 110 ....0 01 ... @azz_2x2_o3
+FMLS_nn_h 11000001 101 ...01 0 .. 100 ...00 11 ... @azz_4x4_o3
+FMLS_nn_s 11000001 101 ...01 0 .. 110 ...00 01 ... @azz_4x4_o3
+FMLS_nn_d 11000001 111 ...01 0 .. 110 ...00 01 ... @azz_4x4_o3
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
@@ -628,3 +656,23 @@ SUMLALL_nx_s 11000001 0001 .... 0 .. 0.. ....1 10 ... @azx_2x1_i4_o1
SUMLALL_nx_d 11000001 1001 .... 0 .. 00. ....1 10 ... @azx_2x1_i3_o1
SUMLALL_nx_s 11000001 0001 .... 1 .. 0.. ...01 10 ... @azx_4x1_i4_o1
SUMLALL_nx_d 11000001 1001 .... 1 .. 00. ...01 10 ... @azx_4x1_i3_o1
+
+%idx3_10_3 10:2 3:1
+@azx_2x1_i3_o3 ........ .... zm:4 . .. ... ..... .. off:3 \
+ &azx_n n=2 rv=%mova_rv zn=%zn_ax2 idx=%idx3_10_3
+@azx_4x1_i3_o3 ........ .... zm:4 . .. ... ..... .. off:3 \
+ &azx_n n=4 rv=%mova_rv zn=%zn_ax4 idx=%idx3_10_3
+
+FMLA_nx_h 11000001 0001 .... 0 .. 1.. ....0 0 .... @azx_2x1_i3_o3
+FMLA_nx_s 11000001 0101 .... 0 .. 0.. ....0 00 ... @azx_2x1_i2_o3
+FMLA_nx_d 11000001 1101 .... 0 .. 00. ....0 00 ... @azx_2x1_i1_o3
+FMLA_nx_h 11000001 0001 .... 1 .. 1.. ...00 0 .... @azx_4x1_i3_o3
+FMLA_nx_s 11000001 0101 .... 1 .. 0.. ...00 00 ... @azx_4x1_i2_o3
+FMLA_nx_d 11000001 1101 .... 1 .. 00. ...00 00 ... @azx_4x1_i1_o3
+
+FMLS_nx_h 11000001 0001 .... 0 .. 1.. ....0 1 .... @azx_2x1_i3_o3
+FMLS_nx_s 11000001 0101 .... 0 .. 0.. ....0 10 ... @azx_2x1_i2_o3
+FMLS_nx_d 11000001 1101 .... 0 .. 00. ....0 10 ... @azx_2x1_i1_o3
+FMLS_nx_h 11000001 0001 .... 1 .. 1.. ...00 1 .... @azx_4x1_i3_o3
+FMLS_nx_s 11000001 0101 .... 1 .. 0.. ...00 10 ... @azx_4x1_i2_o3
+FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 43/61] target/arm: Implement SME2 BFMLA, BFMLS
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (41 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 42/61] target/arm: Implement SME2 FMLA, FMLS Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 44/61] target/arm: Implement SME2 FADD, FSUB, BFADD, BFSUB Richard Henderson
` (18 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 9 +++++++++
target/arm/tcg/translate-sme.c | 14 ++++++++++++++
target/arm/tcg/vec_helper.c | 26 ++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 18 ++++++++++++++++++
4 files changed, 67 insertions(+)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index b00b79c12c..ac2372bbe7 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -785,14 +785,17 @@ DEF_HELPER_FLAGS_5(gvec_fmls_nf_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i
DEF_HELPER_FLAGS_5(gvec_vfma_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfma_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfma_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_bfmla, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfms_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfms_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_vfms_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_bfmls, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_ah_vfms_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_ah_vfms_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_ah_vfms_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_ah_bfmls, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_ftsmul_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, fpst, i32)
@@ -824,6 +827,8 @@ DEF_HELPER_FLAGS_6(gvec_fmla_idx_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_6(gvec_fmla_idx_d, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_6(gvec_bfmla_idx, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_6(gvec_fmls_idx_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
@@ -831,6 +836,8 @@ DEF_HELPER_FLAGS_6(gvec_fmls_idx_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_6(gvec_fmls_idx_d, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_6(gvec_bfmls_idx, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_6(gvec_ah_fmls_idx_h, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
@@ -838,6 +845,8 @@ DEF_HELPER_FLAGS_6(gvec_ah_fmls_idx_s, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_6(gvec_ah_fmls_idx_d, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_6(gvec_ah_bfmls_idx, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_uqadd_b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 1989dedab5..5dce2dfd15 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -971,6 +971,15 @@ TRANS_FEAT(FMLA_nn_d, aa64_sme2_f64f64, do_fmla, a, true, FPST_ZA,
TRANS_FEAT(FMLS_nn_d, aa64_sme2_f64f64, do_fmla, a, true, FPST_ZA,
s->fpcr_ah ? gen_helper_gvec_ah_vfms_d : gen_helper_gvec_vfms_d)
+TRANS_FEAT(BFMLA_n1, aa64_sme2_b16b16, do_fmla, a, false, FPST_ZA,
+ gen_helper_gvec_bfmla)
+TRANS_FEAT(BFMLS_n1, aa64_sme2_b16b16, do_fmla, a, false, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_bfmls : gen_helper_gvec_bfmls)
+TRANS_FEAT(BFMLA_nn, aa64_sme2_b16b16, do_fmla, a, true, FPST_ZA,
+ gen_helper_gvec_bfmla)
+TRANS_FEAT(BFMLS_nn, aa64_sme2_b16b16, do_fmla, a, true, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_bfmls : gen_helper_gvec_bfmls)
+
static bool do_fmla_nx(DisasContext *s, arg_azx_n *a,
ARMFPStatusFlavour fpst, gen_helper_gvec_4_ptr *fn)
{
@@ -991,6 +1000,11 @@ TRANS_FEAT(FMLA_nx_d, aa64_sme2_f64f64, do_fmla_nx, a, FPST_ZA,
TRANS_FEAT(FMLS_nx_d, aa64_sme2_f64f64, do_fmla_nx, a, FPST_ZA,
s->fpcr_ah ? gen_helper_gvec_ah_fmls_idx_d : gen_helper_gvec_fmls_idx_d)
+TRANS_FEAT(BFMLA_nx, aa64_sme2_b16b16, do_fmla_nx, a, FPST_ZA,
+ gen_helper_gvec_bfmla_idx)
+TRANS_FEAT(BFMLS_nx, aa64_sme2_b16b16, do_fmla_nx, a, FPST_ZA,
+ s->fpcr_ah ? gen_helper_gvec_ah_bfmls_idx : gen_helper_gvec_bfmls_idx)
+
/*
* Expand array multi-vector single (n1), array multi-vector (nn),
* and array multi-vector indexed (nx), for integer accumulate.
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index 2d2a000a4a..e5a4f56ef7 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -1607,6 +1607,12 @@ static float16 float16_muladd_f(float16 dest, float16 op1, float16 op2,
return float16_muladd(op1, op2, dest, 0, stat);
}
+static bfloat16 bfloat16_muladd_f(bfloat16 dest, bfloat16 op1, bfloat16 op2,
+ float_status *stat)
+{
+ return bfloat16_muladd(op1, op2, dest, 0, stat);
+}
+
static float32 float32_muladd_f(float32 dest, float32 op1, float32 op2,
float_status *stat)
{
@@ -1625,6 +1631,12 @@ static float16 float16_mulsub_f(float16 dest, float16 op1, float16 op2,
return float16_muladd(float16_chs(op1), op2, dest, 0, stat);
}
+static bfloat16 bfloat16_mulsub_f(bfloat16 dest, bfloat16 op1, bfloat16 op2,
+ float_status *stat)
+{
+ return bfloat16_muladd(bfloat16_chs(op1), op2, dest, 0, stat);
+}
+
static float32 float32_mulsub_f(float32 dest, float32 op1, float32 op2,
float_status *stat)
{
@@ -1643,6 +1655,12 @@ static float16 float16_ah_mulsub_f(float16 dest, float16 op1, float16 op2,
return float16_muladd(op1, op2, dest, float_muladd_negate_product, stat);
}
+static bfloat16 bfloat16_ah_mulsub_f(bfloat16 dest, bfloat16 op1, bfloat16 op2,
+ float_status *stat)
+{
+ return bfloat16_muladd(op1, op2, dest, float_muladd_negate_product, stat);
+}
+
static float32 float32_ah_mulsub_f(float32 dest, float32 op1, float32 op2,
float_status *stat)
{
@@ -1676,14 +1694,19 @@ DO_MULADD(gvec_fmls_nf_s, float32_mulsub_nf, float32)
DO_MULADD(gvec_vfma_h, float16_muladd_f, float16)
DO_MULADD(gvec_vfma_s, float32_muladd_f, float32)
DO_MULADD(gvec_vfma_d, float64_muladd_f, float64)
+DO_MULADD(gvec_bfmla, bfloat16_muladd_f, bfloat16)
DO_MULADD(gvec_vfms_h, float16_mulsub_f, float16)
DO_MULADD(gvec_vfms_s, float32_mulsub_f, float32)
DO_MULADD(gvec_vfms_d, float64_mulsub_f, float64)
+DO_MULADD(gvec_bfmls, bfloat16_mulsub_f, bfloat16)
DO_MULADD(gvec_ah_vfms_h, float16_ah_mulsub_f, float16)
DO_MULADD(gvec_ah_vfms_s, float32_ah_mulsub_f, float32)
DO_MULADD(gvec_ah_vfms_d, float64_ah_mulsub_f, float64)
+DO_MULADD(gvec_ah_bfmls, bfloat16_ah_mulsub_f, bfloat16)
+
+#undef DO_MULADD
/* For the indexed ops, SVE applies the index per 128-bit vector segment.
* For AdvSIMD, there is of course only one such vector segment.
@@ -1802,14 +1825,17 @@ void HELPER(NAME)(void *vd, void *vn, void *vm, void *va, \
DO_FMLA_IDX(gvec_fmla_idx_h, float16, H2, 0, 0)
DO_FMLA_IDX(gvec_fmla_idx_s, float32, H4, 0, 0)
DO_FMLA_IDX(gvec_fmla_idx_d, float64, H8, 0, 0)
+DO_FMLA_IDX(gvec_bfmla_idx, bfloat16, H2, 0, 0)
DO_FMLA_IDX(gvec_fmls_idx_h, float16, H2, INT16_MIN, 0)
DO_FMLA_IDX(gvec_fmls_idx_s, float32, H4, INT32_MIN, 0)
DO_FMLA_IDX(gvec_fmls_idx_d, float64, H8, INT64_MIN, 0)
+DO_FMLA_IDX(gvec_bfmls_idx, bfloat16, H2, INT16_MIN, 0)
DO_FMLA_IDX(gvec_ah_fmls_idx_h, float16, H2, 0, float_muladd_negate_product)
DO_FMLA_IDX(gvec_ah_fmls_idx_s, float32, H4, 0, float_muladd_negate_product)
DO_FMLA_IDX(gvec_ah_fmls_idx_d, float64, H8, 0, float_muladd_negate_product)
+DO_FMLA_IDX(gvec_ah_bfmls_idx, bfloat16, H2, 0, float_muladd_negate_product)
#undef DO_FMLA_IDX
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 30fa60f9a0..0d592bb467 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -373,16 +373,22 @@ SUMLALL_n1_d 11000001 011 0 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=2
SUMLALL_n1_s 11000001 001 1 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=4
SUMLALL_n1_d 11000001 011 1 .... 0 .. 000 ..... 1010 . @azz_nx1_o1x4 n=4
+BFMLA_n1 11000001 011 0 .... 0 .. 111 ..... 00 ... @azz_nx1_o3 n=2
FMLA_n1_h 11000001 001 0 .... 0 .. 111 ..... 00 ... @azz_nx1_o3 n=2
FMLA_n1_s 11000001 001 0 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=2
FMLA_n1_d 11000001 011 0 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=2
+
+BFMLA_n1 11000001 011 1 .... 0 .. 111 ..... 00 ... @azz_nx1_o3 n=4
FMLA_n1_h 11000001 001 1 .... 0 .. 111 ..... 00 ... @azz_nx1_o3 n=4
FMLA_n1_s 11000001 001 1 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=4
FMLA_n1_d 11000001 011 1 .... 0 .. 110 ..... 00 ... @azz_nx1_o3 n=4
+BFMLS_n1 11000001 011 0 .... 0 .. 111 ..... 01 ... @azz_nx1_o3 n=2
FMLS_n1_h 11000001 001 0 .... 0 .. 111 ..... 01 ... @azz_nx1_o3 n=2
FMLS_n1_s 11000001 001 0 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=2
FMLS_n1_d 11000001 011 0 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=2
+
+BFMLS_n1 11000001 011 1 .... 0 .. 111 ..... 01 ... @azz_nx1_o3 n=4
FMLS_n1_h 11000001 001 1 .... 0 .. 111 ..... 01 ... @azz_nx1_o3 n=4
FMLS_n1_s 11000001 001 1 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=4
FMLS_n1_d 11000001 011 1 .... 0 .. 110 ..... 01 ... @azz_nx1_o3 n=4
@@ -489,16 +495,22 @@ USMLALL_nn_d 11000001 111 ....0 0 .. 000 ....0 1010 . @azz_2x2_o1x4
USMLALL_nn_s 11000001 101 ...01 0 .. 000 ...00 1010 . @azz_4x4_o1x4
USMLALL_nn_d 11000001 111 ...01 0 .. 000 ...00 1010 . @azz_4x4_o1x4
+BFMLA_nn 11000001 111 ....0 0 .. 100 ....0 01 ... @azz_2x2_o3
FMLA_nn_h 11000001 101 ....0 0 .. 100 ....0 01 ... @azz_2x2_o3
FMLA_nn_s 11000001 101 ....0 0 .. 110 ....0 00 ... @azz_2x2_o3
FMLA_nn_d 11000001 111 ....0 0 .. 110 ....0 00 ... @azz_2x2_o3
+
+BFMLA_nn 11000001 111 ...01 0 .. 100 ...00 01 ... @azz_4x4_o3
FMLA_nn_h 11000001 101 ...01 0 .. 100 ...00 01 ... @azz_4x4_o3
FMLA_nn_s 11000001 101 ...01 0 .. 110 ...00 00 ... @azz_4x4_o3
FMLA_nn_d 11000001 111 ...01 0 .. 110 ...00 00 ... @azz_4x4_o3
+BFMLS_nn 11000001 111 ....0 0 .. 100 ....0 11 ... @azz_2x2_o3
FMLS_nn_h 11000001 101 ....0 0 .. 100 ....0 11 ... @azz_2x2_o3
FMLS_nn_s 11000001 101 ....0 0 .. 110 ....0 01 ... @azz_2x2_o3
FMLS_nn_d 11000001 111 ....0 0 .. 110 ....0 01 ... @azz_2x2_o3
+
+BFMLS_nn 11000001 111 ...01 0 .. 100 ...00 11 ... @azz_4x4_o3
FMLS_nn_h 11000001 101 ...01 0 .. 100 ...00 11 ... @azz_4x4_o3
FMLS_nn_s 11000001 101 ...01 0 .. 110 ...00 01 ... @azz_4x4_o3
FMLS_nn_d 11000001 111 ...01 0 .. 110 ...00 01 ... @azz_4x4_o3
@@ -663,16 +675,22 @@ SUMLALL_nx_d 11000001 1001 .... 1 .. 00. ...01 10 ... @azx_4x1_i3_o1
@azx_4x1_i3_o3 ........ .... zm:4 . .. ... ..... .. off:3 \
&azx_n n=4 rv=%mova_rv zn=%zn_ax4 idx=%idx3_10_3
+BFMLA_nx 11000001 0001 .... 0 .. 1.. ....1 0 .... @azx_2x1_i3_o3
FMLA_nx_h 11000001 0001 .... 0 .. 1.. ....0 0 .... @azx_2x1_i3_o3
FMLA_nx_s 11000001 0101 .... 0 .. 0.. ....0 00 ... @azx_2x1_i2_o3
FMLA_nx_d 11000001 1101 .... 0 .. 00. ....0 00 ... @azx_2x1_i1_o3
+
+BFMLA_nx 11000001 0001 .... 1 .. 1.. ...01 0 .... @azx_4x1_i3_o3
FMLA_nx_h 11000001 0001 .... 1 .. 1.. ...00 0 .... @azx_4x1_i3_o3
FMLA_nx_s 11000001 0101 .... 1 .. 0.. ...00 00 ... @azx_4x1_i2_o3
FMLA_nx_d 11000001 1101 .... 1 .. 00. ...00 00 ... @azx_4x1_i1_o3
+BFMLS_nx 11000001 0001 .... 0 .. 1.. ....1 1 .... @azx_2x1_i3_o3
FMLS_nx_h 11000001 0001 .... 0 .. 1.. ....0 1 .... @azx_2x1_i3_o3
FMLS_nx_s 11000001 0101 .... 0 .. 0.. ....0 10 ... @azx_2x1_i2_o3
FMLS_nx_d 11000001 1101 .... 0 .. 00. ....0 10 ... @azx_2x1_i1_o3
+
+BFMLS_nx 11000001 0001 .... 1 .. 1.. ...01 1 .... @azx_4x1_i3_o3
FMLS_nx_h 11000001 0001 .... 1 .. 1.. ...00 1 .... @azx_4x1_i3_o3
FMLS_nx_s 11000001 0101 .... 1 .. 0.. ...00 10 ... @azx_4x1_i2_o3
FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 44/61] target/arm: Implement SME2 FADD, FSUB, BFADD, BFSUB
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (42 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 43/61] target/arm: Implement SME2 BFMLA, BFMLS Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:56 ` [PATCH 45/61] target/arm: Remove CPUARMState.vfp.scratch Richard Henderson
` (17 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 2 ++
target/arm/tcg/translate-sme.c | 44 ++++++++++++++++++++++++++++++++++
target/arm/tcg/vec_helper.c | 2 ++
target/arm/tcg/sme.decode | 25 +++++++++++++++++++
4 files changed, 73 insertions(+)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index ac2372bbe7..4fa0730716 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -717,10 +717,12 @@ DEF_HELPER_FLAGS_4(gvec_fclt0_d, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fadd_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fadd_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fadd_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_bfadd, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fsub_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fsub_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fsub_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(gvec_bfsub, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fmul_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_5(gvec_fmul_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 5dce2dfd15..f41b4343e4 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1005,6 +1005,50 @@ TRANS_FEAT(BFMLA_nx, aa64_sme2_b16b16, do_fmla_nx, a, FPST_ZA,
TRANS_FEAT(BFMLS_nx, aa64_sme2_b16b16, do_fmla_nx, a, FPST_ZA,
s->fpcr_ah ? gen_helper_gvec_ah_bfmls_idx : gen_helper_gvec_bfmls_idx)
+static bool do_faddsub(DisasContext *s, arg_az_n *a, ARMFPStatusFlavour fpst,
+ gen_helper_gvec_3_ptr *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ int n = a->n;
+ int zm = a->zm;
+ int vstride = svl / n;
+ TCGv_ptr t_za = get_zarray(s, a->rv, a->off, n);
+ TCGv_ptr ptr = fpstatus_ptr(fpst);
+ TCGv_ptr t = tcg_temp_new_ptr();
+
+ for (int r = 0; r < n; ++r) {
+ TCGv_ptr t_zm = vec_full_reg_ptr(s, zm + r);
+ int o_za = r * vstride * sizeof(ARMVectorReg);
+ int desc = simd_desc(svl, svl, 0);
+
+ tcg_gen_addi_ptr(t, t_za, o_za);
+ fn(t, t, t_zm, ptr, tcg_constant_i32(desc));
+ }
+ }
+ return true;
+}
+
+TRANS_FEAT(FADD_nn_h, aa64_sme2_f16f16, do_faddsub, a,
+ FPST_ZA_F16, gen_helper_gvec_fadd_h)
+TRANS_FEAT(FSUB_nn_h, aa64_sme2_f16f16, do_faddsub, a,
+ FPST_ZA_F16, gen_helper_gvec_fsub_h)
+
+TRANS_FEAT(FADD_nn_s, aa64_sme2, do_faddsub, a,
+ FPST_ZA, gen_helper_gvec_fadd_s)
+TRANS_FEAT(FSUB_nn_s, aa64_sme2, do_faddsub, a,
+ FPST_ZA, gen_helper_gvec_fsub_s)
+
+TRANS_FEAT(FADD_nn_d, aa64_sme2_f64f64, do_faddsub, a,
+ FPST_ZA, gen_helper_gvec_fadd_d)
+TRANS_FEAT(FSUB_nn_d, aa64_sme2_f64f64, do_faddsub, a,
+ FPST_ZA, gen_helper_gvec_fsub_d)
+
+TRANS_FEAT(BFADD_nn, aa64_sme2_b16b16, do_faddsub, a,
+ FPST_ZA, gen_helper_gvec_bfadd)
+TRANS_FEAT(BFSUB_nn, aa64_sme2_b16b16, do_faddsub, a,
+ FPST_ZA, gen_helper_gvec_bfsub)
+
/*
* Expand array multi-vector single (n1), array multi-vector (nn),
* and array multi-vector indexed (nx), for integer accumulate.
diff --git a/target/arm/tcg/vec_helper.c b/target/arm/tcg/vec_helper.c
index e5a4f56ef7..e41386861e 100644
--- a/target/arm/tcg/vec_helper.c
+++ b/target/arm/tcg/vec_helper.c
@@ -1469,10 +1469,12 @@ void HELPER(NAME)(void *vd, void *vn, void *vm, \
DO_3OP(gvec_fadd_h, float16_add, float16)
DO_3OP(gvec_fadd_s, float32_add, float32)
DO_3OP(gvec_fadd_d, float64_add, float64)
+DO_3OP(gvec_bfadd, bfloat16_add, bfloat16)
DO_3OP(gvec_fsub_h, float16_sub, float16)
DO_3OP(gvec_fsub_s, float32_sub, float32)
DO_3OP(gvec_fsub_d, float64_sub, float64)
+DO_3OP(gvec_bfsub, bfloat16_sub, bfloat16)
DO_3OP(gvec_fmul_h, float16_mul, float16)
DO_3OP(gvec_fmul_s, float32_mul, float32)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 0d592bb467..91df2068cf 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -515,6 +515,31 @@ FMLS_nn_h 11000001 101 ...01 0 .. 100 ...00 11 ... @azz_4x4_o3
FMLS_nn_s 11000001 101 ...01 0 .. 110 ...00 01 ... @azz_4x4_o3
FMLS_nn_d 11000001 111 ...01 0 .. 110 ...00 01 ... @azz_4x4_o3
+&az_n n off rv zm
+@az_2x2_o3 ........ ... ..... . .. ... ..... .. off:3 \
+ &az_n n=2 rv=%mova_rv zm=%zn_ax2
+@az_4x4_o3 ........ ... ..... . .. ... ..... .. off:3 \
+ &az_n n=4 rv=%mova_rv zm=%zn_ax4
+
+FADD_nn_h 11000001 101 00100 0 .. 111 ....0 00 ... @az_2x2_o3
+FADD_nn_s 11000001 101 00000 0 .. 111 ....0 00 ... @az_2x2_o3
+FADD_nn_d 11000001 111 00000 0 .. 111 ....0 00 ... @az_2x2_o3
+FADD_nn_h 11000001 101 00101 0 .. 111 ...00 00 ... @az_4x4_o3
+FADD_nn_s 11000001 101 00001 0 .. 111 ...00 00 ... @az_4x4_o3
+FADD_nn_d 11000001 111 00001 0 .. 111 ...00 00 ... @az_4x4_o3
+
+FSUB_nn_h 11000001 101 00100 0 .. 111 ....0 01 ... @az_2x2_o3
+FSUB_nn_s 11000001 101 00000 0 .. 111 ....0 01 ... @az_2x2_o3
+FSUB_nn_d 11000001 111 00000 0 .. 111 ....0 01 ... @az_2x2_o3
+FSUB_nn_h 11000001 101 00101 0 .. 111 ...00 01 ... @az_4x4_o3
+FSUB_nn_s 11000001 101 00001 0 .. 111 ...00 01 ... @az_4x4_o3
+FSUB_nn_d 11000001 111 00001 0 .. 111 ...00 01 ... @az_4x4_o3
+
+BFADD_nn 11000001 111 00100 0 .. 111 ....0 00 ... @az_2x2_o3
+BFADD_nn 11000001 111 00101 0 .. 111 ...00 00 ... @az_4x4_o3
+BFSUB_nn 11000001 111 00100 0 .. 111 ....0 01 ... @az_2x2_o3
+BFSUB_nn 11000001 111 00101 0 .. 111 ...00 01 ... @az_4x4_o3
+
### SME2 Multi-vector Indexed
&azx_n n off rv zn zm idx
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 45/61] target/arm: Remove CPUARMState.vfp.scratch
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (43 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 44/61] target/arm: Implement SME2 FADD, FSUB, BFADD, BFSUB Richard Henderson
@ 2025-02-06 19:56 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 46/61] target/arm: Implement SME2 BFCVT, BFCVTN, FCVT, FCVTN Richard Henderson
` (16 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:56 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
The last use of this field was removed in b2fc7be972b9.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.h | 3 ---
1 file changed, 3 deletions(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 61f959af8b..91edeae9ad 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -693,9 +693,6 @@ typedef struct CPUArchState {
uint32_t xregs[16];
- /* Scratch space for aa32 neon expansion. */
- uint32_t scratch[8];
-
/* There are a number of distinct float control structures. */
float_status fp_status[FPST_COUNT];
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 46/61] target/arm: Implement SME2 BFCVT, BFCVTN, FCVT, FCVTN
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (44 preceding siblings ...)
2025-02-06 19:56 ` [PATCH 45/61] target/arm: Remove CPUARMState.vfp.scratch Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 47/61] target/arm: Implement SME2 FCVT (widening), FCVTL Richard Henderson
` (15 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 5 +++
target/arm/tcg/sme_helper.c | 74 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 25 ++++++++++++
target/arm/tcg/sme.decode | 12 ++++++
4 files changed, 116 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 673aa347bc..cb81f89fb3 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -213,3 +213,8 @@ DEF_HELPER_FLAGS_5(sme2_umlsll_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr,
DEF_HELPER_FLAGS_5(sme2_umlsll_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sme2_usmlall_idx_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sme2_usmlall_idx_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(sme2_bfcvt, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_4(sme2_bfcvtn, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_4(sme2_fcvt_n, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_4(sme2_fcvtn, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 6fbf49e77d..6d2b83d26a 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1488,3 +1488,77 @@ DO_MLALL_IDX(sme2_usmlall_idx_s, uint32_t, uint8_t, int8_t, H4, H1, +)
DO_MLALL_IDX(sme2_usmlall_idx_d, uint64_t, uint16_t, int16_t, H8, H2, +)
#undef DO_MLALL_IDX
+
+/* Convert and compress */
+void HELPER(sme2_bfcvt)(void *vd, void *vs, float_status *fpst, uint32_t desc)
+{
+ ARMVectorReg scratch __attribute__((uninitialized));
+ size_t oprsz = simd_oprsz(desc);
+ size_t i, n = oprsz / 4;
+ float32 *s0 = vs;
+ float32 *s1 = vs + sizeof(ARMVectorReg);
+ bfloat16 *d = vd;
+
+ if (vd == s1) {
+ s1 = memcpy(&scratch, s1, oprsz);
+ }
+
+ for (i = 0; i < n; ++i) {
+ d[H2(i)] = float32_to_bfloat16(s0[H4(i)], fpst);
+ }
+ for (i = 0; i < n; ++i) {
+ d[H2(i) + n] = float32_to_bfloat16(s1[H4(i)], fpst);
+ }
+}
+
+void HELPER(sme2_fcvt_n)(void *vd, void *vs, float_status *fpst, uint32_t desc)
+{
+ ARMVectorReg scratch __attribute__((uninitialized));
+ size_t oprsz = simd_oprsz(desc);
+ size_t i, n = oprsz / 4;
+ float32 *s0 = vs;
+ float32 *s1 = vs + sizeof(ARMVectorReg);
+ float16 *d = vd;
+
+ if (vd == s1) {
+ s1 = memcpy(&scratch, s1, oprsz);
+ }
+
+ for (i = 0; i < n; ++i) {
+ d[H2(i)] = float32_to_float16(s0[H4(i)], true, fpst);
+ }
+ for (i = 0; i < n; ++i) {
+ d[H2(i) + n] = float32_to_float16(s1[H4(i)], true, fpst);
+ }
+}
+
+/* Convert and interleave */
+void HELPER(sme2_bfcvtn)(void *vd, void *vs, float_status *fpst, uint32_t desc)
+{
+ size_t i, n = simd_oprsz(desc) / 4;
+ float32 *s0 = vs;
+ float32 *s1 = vs + sizeof(ARMVectorReg);
+ bfloat16 *d = vd;
+
+ for (i = 0; i < n; ++i) {
+ bfloat16 d0 = float32_to_bfloat16(s0[H4(i)], fpst);
+ bfloat16 d1 = float32_to_bfloat16(s1[H4(i)], fpst);
+ d[H2(i * 2 + 0)] = d0;
+ d[H2(i * 2 + 1)] = d1;
+ }
+}
+
+void HELPER(sme2_fcvtn)(void *vd, void *vs, float_status *fpst, uint32_t desc)
+{
+ size_t i, n = simd_oprsz(desc) / 4;
+ float32 *s0 = vs;
+ float32 *s1 = vs + sizeof(ARMVectorReg);
+ bfloat16 *d = vd;
+
+ for (i = 0; i < n; ++i) {
+ bfloat16 d0 = float32_to_float16(s0[H4(i)], true, fpst);
+ bfloat16 d1 = float32_to_float16(s1[H4(i)], true, fpst);
+ d[H2(i * 2 + 0)] = d0;
+ d[H2(i * 2 + 1)] = d1;
+ }
+}
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index f41b4343e4..777ea15a80 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1263,3 +1263,28 @@ TRANS_FEAT(UMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_umlall
TRANS_FEAT(UMLSLL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_smlsll_idx_d)
TRANS_FEAT(USMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_usmlall_idx_d)
TRANS_FEAT(SUMLALL_nx_d, aa64_sme_i16i64, do_smlall_nx, a, gen_helper_sme2_sumlall_idx_d)
+
+static bool do_zz_fpst(DisasContext *s, arg_zz_n *a, int data,
+ ARMFPStatusFlavour type, gen_helper_gvec_2_ptr *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ TCGv_ptr fpst = fpstatus_ptr(type);
+
+ for (int i = 0, n = a->n; i < n; ++i) {
+ tcg_gen_gvec_2_ptr(vec_full_reg_offset(s, a->zd + i),
+ vec_full_reg_offset(s, a->zn + i),
+ fpst, svl, svl, data, fn);
+ }
+ }
+ return true;
+}
+
+TRANS_FEAT(BFCVT, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_sme2_bfcvt)
+TRANS_FEAT(BFCVTN, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_sme2_bfcvtn)
+TRANS_FEAT(FCVT_n, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_sme2_fcvt_n)
+TRANS_FEAT(FCVTN, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_sme2_fcvtn)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 91df2068cf..8cca7d0d46 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -719,3 +719,15 @@ BFMLS_nx 11000001 0001 .... 1 .. 1.. ...01 1 .... @azx_4x1_i3_o3
FMLS_nx_h 11000001 0001 .... 1 .. 1.. ...00 1 .... @azx_4x1_i3_o3
FMLS_nx_s 11000001 0101 .... 1 .. 0.. ...00 10 ... @azx_4x1_i2_o3
FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
+
+### SME2 Multi-vector SVE Constructive Unary
+
+&zz_n zd zn n
+@zz_1x2 ........ ... ..... ...... ..... zd:5 \
+ &zz_n n=1 zn=%zn_ax2
+
+BFCVT 11000001 011 00000 111000 ....0 ..... @zz_1x2
+BFCVTN 11000001 011 00000 111000 ....1 ..... @zz_1x2
+
+FCVT_n 11000001 001 00000 111000 ....0 ..... @zz_1x2
+FCVTN 11000001 001 00000 111000 ....1 ..... @zz_1x2
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 47/61] target/arm: Implement SME2 FCVT (widening), FCVTL
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (45 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 46/61] target/arm: Implement SME2 BFCVT, BFCVTN, FCVT, FCVTN Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 48/61] target/arm: Implement SME2 FCVTZS, FCVTZU Richard Henderson
` (14 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 2 ++
target/arm/tcg/sme_helper.c | 38 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 5 +++++
target/arm/tcg/sme.decode | 5 +++++
4 files changed, 50 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index cb81f89fb3..04db920299 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -218,3 +218,5 @@ DEF_HELPER_FLAGS_4(sme2_bfcvt, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_4(sme2_bfcvtn, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_4(sme2_fcvt_n, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_4(sme2_fcvtn, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_4(sme2_fcvt_w, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_4(sme2_fcvtl, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 6d2b83d26a..8289a02bfa 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1562,3 +1562,41 @@ void HELPER(sme2_fcvtn)(void *vd, void *vs, float_status *fpst, uint32_t desc)
d[H2(i * 2 + 1)] = d1;
}
}
+
+/* Expand and convert */
+void HELPER(sme2_fcvt_w)(void *vd, void *vs, float_status *fpst, uint32_t desc)
+{
+ ARMVectorReg scratch __attribute__((uninitialized));
+ size_t oprsz = simd_oprsz(desc);
+ size_t i, n = oprsz / 4;
+ float16 *s = vs;
+ float32 *d0 = vd;
+ float32 *d1 = vd + sizeof(ARMVectorReg);
+
+ if (vd == vs) {
+ s = memcpy(&scratch, s, oprsz);
+ }
+
+ for (i = 0; i < n; ++i) {
+ d0[H4(i)] = float16_to_float32(s[H2(i)], true, fpst);
+ }
+ for (i = 0; i < n; ++i) {
+ d1[H4(i)] = float16_to_float32(s[H2(n + i)], true, fpst);
+ }
+}
+
+/* Deinterleave and convert. */
+void HELPER(sme2_fcvtl)(void *vd, void *vs, float_status *fpst, uint32_t desc)
+{
+ size_t i, n = simd_oprsz(desc) / 4;
+ float16 *s = vs;
+ float32 *d0 = vd;
+ float32 *d1 = vd + sizeof(ARMVectorReg);
+
+ for (i = 0; i < n; ++i) {
+ float32 v0 = float16_to_float32(s[H2(i * 2 + 0)], true, fpst);
+ float32 v1 = float16_to_float32(s[H2(i * 2 + 1)], true, fpst);
+ d0[H4(i)] = v0;
+ d1[H4(i)] = v1;
+ }
+}
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 777ea15a80..2b45244e23 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1288,3 +1288,8 @@ TRANS_FEAT(FCVT_n, aa64_sme2, do_zz_fpst, a, 0,
FPST_A64, gen_helper_sme2_fcvt_n)
TRANS_FEAT(FCVTN, aa64_sme2, do_zz_fpst, a, 0,
FPST_A64, gen_helper_sme2_fcvtn)
+
+TRANS_FEAT(FCVT_w, aa64_sme2_f16f16, do_zz_fpst, a, 0,
+ FPST_A64_F16, gen_helper_sme2_fcvt_w)
+TRANS_FEAT(FCVTL, aa64_sme2_f16f16, do_zz_fpst, a, 0,
+ FPST_A64_F16, gen_helper_sme2_fcvtl)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 8cca7d0d46..644794bdc1 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -725,9 +725,14 @@ FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
&zz_n zd zn n
@zz_1x2 ........ ... ..... ...... ..... zd:5 \
&zz_n n=1 zn=%zn_ax2
+@zz_2x1 ........ ... ..... ...... zn:5 ..... \
+ &zz_n n=1 zd=%zd_ax2
BFCVT 11000001 011 00000 111000 ....0 ..... @zz_1x2
BFCVTN 11000001 011 00000 111000 ....1 ..... @zz_1x2
FCVT_n 11000001 001 00000 111000 ....0 ..... @zz_1x2
FCVTN 11000001 001 00000 111000 ....1 ..... @zz_1x2
+
+FCVT_w 11000001 101 00000 111000 ..... ....0 @zz_2x1
+FCVTL 11000001 101 00000 111000 ..... ....1 @zz_2x1
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 48/61] target/arm: Implement SME2 FCVTZS, FCVTZU
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (46 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 47/61] target/arm: Implement SME2 FCVT (widening), FCVTL Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 49/61] target/arm: Implement SME2 SCVTF, UCVTF Richard Henderson
` (13 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 5 +++++
target/arm/tcg/sme.decode | 9 +++++++++
2 files changed, 14 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 2b45244e23..4b45459e77 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1293,3 +1293,8 @@ TRANS_FEAT(FCVT_w, aa64_sme2_f16f16, do_zz_fpst, a, 0,
FPST_A64_F16, gen_helper_sme2_fcvt_w)
TRANS_FEAT(FCVTL, aa64_sme2_f16f16, do_zz_fpst, a, 0,
FPST_A64_F16, gen_helper_sme2_fcvtl)
+
+TRANS_FEAT(FCVTZS, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_gvec_vcvt_rz_fs)
+TRANS_FEAT(FCVTZU, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_gvec_vcvt_rz_fu)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 644794bdc1..bb985f6f61 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -727,6 +727,10 @@ FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
&zz_n n=1 zn=%zn_ax2
@zz_2x1 ........ ... ..... ...... zn:5 ..... \
&zz_n n=1 zd=%zd_ax2
+@zz_2x2 ........ ... ..... ...... .... . ..... \
+ &zz_n n=2 zd=%zd_ax2 zn=%zn_ax2
+@zz_4x4 ........ ... ..... ...... .... . ..... \
+ &zz_n n=4 zd=%zd_ax4 zn=%zn_ax4
BFCVT 11000001 011 00000 111000 ....0 ..... @zz_1x2
BFCVTN 11000001 011 00000 111000 ....1 ..... @zz_1x2
@@ -736,3 +740,8 @@ FCVTN 11000001 001 00000 111000 ....1 ..... @zz_1x2
FCVT_w 11000001 101 00000 111000 ..... ....0 @zz_2x1
FCVTL 11000001 101 00000 111000 ..... ....1 @zz_2x1
+
+FCVTZS 11000001 001 00001 111000 ....0 ....0 @zz_2x2
+FCVTZS 11000001 001 10001 111000 ...00 ...00 @zz_4x4
+FCVTZU 11000001 001 00001 111000 ....1 ....0 @zz_2x2
+FCVTZU 11000001 001 10001 111000 ...01 ...00 @zz_4x4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 49/61] target/arm: Implement SME2 SCVTF, UCVTF
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (47 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 48/61] target/arm: Implement SME2 FCVTZS, FCVTZU Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 50/61] target/arm: Implement SME2 FRINTN, FRINTP, FRINTM, FRINTA Richard Henderson
` (12 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 5 +++++
target/arm/tcg/sme.decode | 5 +++++
2 files changed, 10 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 4b45459e77..a993870812 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1298,3 +1298,8 @@ TRANS_FEAT(FCVTZS, aa64_sme2, do_zz_fpst, a, 0,
FPST_A64, gen_helper_gvec_vcvt_rz_fs)
TRANS_FEAT(FCVTZU, aa64_sme2, do_zz_fpst, a, 0,
FPST_A64, gen_helper_gvec_vcvt_rz_fu)
+
+TRANS_FEAT(SCVTF, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_gvec_vcvt_sf)
+TRANS_FEAT(UCVTF, aa64_sme2, do_zz_fpst, a, 0,
+ FPST_A64, gen_helper_gvec_vcvt_uf)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index bb985f6f61..e2d3668567 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -745,3 +745,8 @@ FCVTZS 11000001 001 00001 111000 ....0 ....0 @zz_2x2
FCVTZS 11000001 001 10001 111000 ...00 ...00 @zz_4x4
FCVTZU 11000001 001 00001 111000 ....1 ....0 @zz_2x2
FCVTZU 11000001 001 10001 111000 ...01 ...00 @zz_4x4
+
+SCVTF 11000001 001 00010 111000 ....0 ....0 @zz_2x2
+SCVTF 11000001 001 10010 111000 ...00 ...00 @zz_4x4
+UCVTF 11000001 001 00010 111000 ....1 ....0 @zz_2x2
+UCVTF 11000001 001 10010 111000 ...01 ...00 @zz_4x4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 50/61] target/arm: Implement SME2 FRINTN, FRINTP, FRINTM, FRINTA
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (48 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 49/61] target/arm: Implement SME2 SCVTF, UCVTF Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 51/61] target/arm: Introduce do_[us]sat_[bhs] macros Richard Henderson
` (11 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 9 +++++++++
target/arm/tcg/sme.decode | 9 +++++++++
2 files changed, 18 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index a993870812..a4c683e12f 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1303,3 +1303,12 @@ TRANS_FEAT(SCVTF, aa64_sme2, do_zz_fpst, a, 0,
FPST_A64, gen_helper_gvec_vcvt_sf)
TRANS_FEAT(UCVTF, aa64_sme2, do_zz_fpst, a, 0,
FPST_A64, gen_helper_gvec_vcvt_uf)
+
+TRANS_FEAT(FRINTN, aa64_sme2, do_zz_fpst, a, float_round_nearest_even,
+ FPST_A64, gen_helper_gvec_vrint_rm_s)
+TRANS_FEAT(FRINTP, aa64_sme2, do_zz_fpst, a, float_round_up,
+ FPST_A64, gen_helper_gvec_vrint_rm_s)
+TRANS_FEAT(FRINTM, aa64_sme2, do_zz_fpst, a, float_round_down,
+ FPST_A64, gen_helper_gvec_vrint_rm_s)
+TRANS_FEAT(FRINTA, aa64_sme2, do_zz_fpst, a, float_round_ties_away,
+ FPST_A64, gen_helper_gvec_vrint_rm_s)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index e2d3668567..dffd4f5ff8 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -750,3 +750,12 @@ SCVTF 11000001 001 00010 111000 ....0 ....0 @zz_2x2
SCVTF 11000001 001 10010 111000 ...00 ...00 @zz_4x4
UCVTF 11000001 001 00010 111000 ....1 ....0 @zz_2x2
UCVTF 11000001 001 10010 111000 ...01 ...00 @zz_4x4
+
+FRINTN 11000001 101 01000 111000 ....0 ....0 @zz_2x2
+FRINTN 11000001 101 11000 111000 ...00 ...00 @zz_4x4
+FRINTP 11000001 101 01001 111000 ....0 ....0 @zz_2x2
+FRINTP 11000001 101 11001 111000 ...00 ...00 @zz_4x4
+FRINTM 11000001 101 01010 111000 ....0 ....0 @zz_2x2
+FRINTM 11000001 101 11010 111000 ...00 ...00 @zz_4x4
+FRINTA 11000001 101 01100 111000 ....0 ....0 @zz_2x2
+FRINTA 11000001 101 11100 111000 ...00 ...00 @zz_4x4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 51/61] target/arm: Introduce do_[us]sat_[bhs] macros
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (49 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 50/61] target/arm: Implement SME2 FRINTN, FRINTP, FRINTM, FRINTA Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 52/61] target/arm: Use do_[us]sat_[bhs] in sve_helper.c Richard Henderson
` (10 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Inputs are a wider type of indeterminate sign.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/vec_internal.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/target/arm/tcg/vec_internal.h b/target/arm/tcg/vec_internal.h
index 205f85b8d3..ad6fef03e6 100644
--- a/target/arm/tcg/vec_internal.h
+++ b/target/arm/tcg/vec_internal.h
@@ -221,6 +221,13 @@ int16_t do_sqrdmlah_h(int16_t, int16_t, int16_t, bool, bool, uint32_t *);
int32_t do_sqrdmlah_s(int32_t, int32_t, int32_t, bool, bool, uint32_t *);
int64_t do_sqrdmlah_d(int64_t, int64_t, int64_t, bool, bool);
+#define do_ssat_b(val) MIN(MAX(val, INT8_MIN), INT8_MAX)
+#define do_ssat_h(val) MIN(MAX(val, INT16_MIN), INT16_MAX)
+#define do_ssat_s(val) MIN(MAX(val, INT32_MIN), INT32_MAX)
+#define do_usat_b(val) MIN(MAX(val, 0), UINT8_MAX)
+#define do_usat_h(val) MIN(MAX(val, 0), UINT16_MAX)
+#define do_usat_s(val) MIN(MAX(val, 0), UINT32_MAX)
+
/**
* bfdotadd:
* @sum: addend
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 52/61] target/arm: Use do_[us]sat_[bhs] in sve_helper.c
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (50 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 51/61] target/arm: Introduce do_[us]sat_[bhs] macros Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 53/61] target/arm: Implement SME2 SQCVT, UQCVT, SQCVTU Richard Henderson
` (9 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Replace and remove do_sat_bhs.
This avoids multiple repetitions of INT*_MIN/MAX.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/sve_helper.c | 116 +++++++++++++++---------------------
1 file changed, 48 insertions(+), 68 deletions(-)
diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c
index c206ca65ce..e84cffd0b0 100644
--- a/target/arm/tcg/sve_helper.c
+++ b/target/arm/tcg/sve_helper.c
@@ -523,14 +523,9 @@ DO_ZPZZ(sve2_uhsub_zpzz_h, uint16_t, H1_2, DO_HSUB_BHS)
DO_ZPZZ(sve2_uhsub_zpzz_s, uint32_t, H1_4, DO_HSUB_BHS)
DO_ZPZZ_D(sve2_uhsub_zpzz_d, uint64_t, DO_HSUB_D)
-static inline int32_t do_sat_bhs(int64_t val, int64_t min, int64_t max)
-{
- return val >= max ? max : val <= min ? min : val;
-}
-
-#define DO_SQADD_B(n, m) do_sat_bhs((int64_t)n + m, INT8_MIN, INT8_MAX)
-#define DO_SQADD_H(n, m) do_sat_bhs((int64_t)n + m, INT16_MIN, INT16_MAX)
-#define DO_SQADD_S(n, m) do_sat_bhs((int64_t)n + m, INT32_MIN, INT32_MAX)
+#define DO_SQADD_B(n, m) do_ssat_b((int64_t)n + m)
+#define DO_SQADD_H(n, m) do_ssat_h((int64_t)n + m)
+#define DO_SQADD_S(n, m) do_ssat_s((int64_t)n + m)
static inline int64_t do_sqadd_d(int64_t n, int64_t m)
{
@@ -547,9 +542,9 @@ DO_ZPZZ(sve2_sqadd_zpzz_h, int16_t, H1_2, DO_SQADD_H)
DO_ZPZZ(sve2_sqadd_zpzz_s, int32_t, H1_4, DO_SQADD_S)
DO_ZPZZ_D(sve2_sqadd_zpzz_d, int64_t, do_sqadd_d)
-#define DO_UQADD_B(n, m) do_sat_bhs((int64_t)n + m, 0, UINT8_MAX)
-#define DO_UQADD_H(n, m) do_sat_bhs((int64_t)n + m, 0, UINT16_MAX)
-#define DO_UQADD_S(n, m) do_sat_bhs((int64_t)n + m, 0, UINT32_MAX)
+#define DO_UQADD_B(n, m) do_usat_b((int64_t)n + m)
+#define DO_UQADD_H(n, m) do_usat_h((int64_t)n + m)
+#define DO_UQADD_S(n, m) do_usat_s((int64_t)n + m)
static inline uint64_t do_uqadd_d(uint64_t n, uint64_t m)
{
@@ -562,9 +557,9 @@ DO_ZPZZ(sve2_uqadd_zpzz_h, uint16_t, H1_2, DO_UQADD_H)
DO_ZPZZ(sve2_uqadd_zpzz_s, uint32_t, H1_4, DO_UQADD_S)
DO_ZPZZ_D(sve2_uqadd_zpzz_d, uint64_t, do_uqadd_d)
-#define DO_SQSUB_B(n, m) do_sat_bhs((int64_t)n - m, INT8_MIN, INT8_MAX)
-#define DO_SQSUB_H(n, m) do_sat_bhs((int64_t)n - m, INT16_MIN, INT16_MAX)
-#define DO_SQSUB_S(n, m) do_sat_bhs((int64_t)n - m, INT32_MIN, INT32_MAX)
+#define DO_SQSUB_B(n, m) do_ssat_b((int64_t)n - m)
+#define DO_SQSUB_H(n, m) do_ssat_h((int64_t)n - m)
+#define DO_SQSUB_S(n, m) do_ssat_s((int64_t)n - m)
static inline int64_t do_sqsub_d(int64_t n, int64_t m)
{
@@ -581,9 +576,9 @@ DO_ZPZZ(sve2_sqsub_zpzz_h, int16_t, H1_2, DO_SQSUB_H)
DO_ZPZZ(sve2_sqsub_zpzz_s, int32_t, H1_4, DO_SQSUB_S)
DO_ZPZZ_D(sve2_sqsub_zpzz_d, int64_t, do_sqsub_d)
-#define DO_UQSUB_B(n, m) do_sat_bhs((int64_t)n - m, 0, UINT8_MAX)
-#define DO_UQSUB_H(n, m) do_sat_bhs((int64_t)n - m, 0, UINT16_MAX)
-#define DO_UQSUB_S(n, m) do_sat_bhs((int64_t)n - m, 0, UINT32_MAX)
+#define DO_UQSUB_B(n, m) do_usat_b((int64_t)n - m)
+#define DO_UQSUB_H(n, m) do_usat_h((int64_t)n - m)
+#define DO_UQSUB_S(n, m) do_usat_s((int64_t)n - m)
static inline uint64_t do_uqsub_d(uint64_t n, uint64_t m)
{
@@ -595,12 +590,9 @@ DO_ZPZZ(sve2_uqsub_zpzz_h, uint16_t, H1_2, DO_UQSUB_H)
DO_ZPZZ(sve2_uqsub_zpzz_s, uint32_t, H1_4, DO_UQSUB_S)
DO_ZPZZ_D(sve2_uqsub_zpzz_d, uint64_t, do_uqsub_d)
-#define DO_SUQADD_B(n, m) \
- do_sat_bhs((int64_t)(int8_t)n + m, INT8_MIN, INT8_MAX)
-#define DO_SUQADD_H(n, m) \
- do_sat_bhs((int64_t)(int16_t)n + m, INT16_MIN, INT16_MAX)
-#define DO_SUQADD_S(n, m) \
- do_sat_bhs((int64_t)(int32_t)n + m, INT32_MIN, INT32_MAX)
+#define DO_SUQADD_B(n, m) do_ssat_b((int64_t)(int8_t)n + m)
+#define DO_SUQADD_H(n, m) do_ssat_h((int64_t)(int16_t)n + m)
+#define DO_SUQADD_S(n, m) do_ssat_s((int64_t)(int32_t)n + m)
static inline int64_t do_suqadd_d(int64_t n, uint64_t m)
{
@@ -630,12 +622,9 @@ DO_ZPZZ(sve2_suqadd_zpzz_h, uint16_t, H1_2, DO_SUQADD_H)
DO_ZPZZ(sve2_suqadd_zpzz_s, uint32_t, H1_4, DO_SUQADD_S)
DO_ZPZZ_D(sve2_suqadd_zpzz_d, uint64_t, do_suqadd_d)
-#define DO_USQADD_B(n, m) \
- do_sat_bhs((int64_t)n + (int8_t)m, 0, UINT8_MAX)
-#define DO_USQADD_H(n, m) \
- do_sat_bhs((int64_t)n + (int16_t)m, 0, UINT16_MAX)
-#define DO_USQADD_S(n, m) \
- do_sat_bhs((int64_t)n + (int32_t)m, 0, UINT32_MAX)
+#define DO_USQADD_B(n, m) do_usat_b((int64_t)n + (int8_t)m)
+#define DO_USQADD_H(n, m) do_usat_h((int64_t)n + (int16_t)m)
+#define DO_USQADD_S(n, m) do_usat_s((int64_t)n + (int32_t)m)
static inline uint64_t do_usqadd_d(uint64_t n, int64_t m)
{
@@ -1222,37 +1211,29 @@ void HELPER(NAME)(void *vd, void *vn, uint32_t desc) \
} \
}
-#define DO_SQXTN_H(n) do_sat_bhs(n, INT8_MIN, INT8_MAX)
-#define DO_SQXTN_S(n) do_sat_bhs(n, INT16_MIN, INT16_MAX)
-#define DO_SQXTN_D(n) do_sat_bhs(n, INT32_MIN, INT32_MAX)
+DO_XTNB(sve2_sqxtnb_h, int16_t, do_ssat_b)
+DO_XTNB(sve2_sqxtnb_s, int32_t, do_ssat_h)
+DO_XTNB(sve2_sqxtnb_d, int64_t, do_ssat_s)
-DO_XTNB(sve2_sqxtnb_h, int16_t, DO_SQXTN_H)
-DO_XTNB(sve2_sqxtnb_s, int32_t, DO_SQXTN_S)
-DO_XTNB(sve2_sqxtnb_d, int64_t, DO_SQXTN_D)
+DO_XTNT(sve2_sqxtnt_h, int16_t, int8_t, H1, do_ssat_b)
+DO_XTNT(sve2_sqxtnt_s, int32_t, int16_t, H1_2, do_ssat_h)
+DO_XTNT(sve2_sqxtnt_d, int64_t, int32_t, H1_4, do_ssat_s)
-DO_XTNT(sve2_sqxtnt_h, int16_t, int8_t, H1, DO_SQXTN_H)
-DO_XTNT(sve2_sqxtnt_s, int32_t, int16_t, H1_2, DO_SQXTN_S)
-DO_XTNT(sve2_sqxtnt_d, int64_t, int32_t, H1_4, DO_SQXTN_D)
+DO_XTNB(sve2_uqxtnb_h, uint16_t, do_usat_b)
+DO_XTNB(sve2_uqxtnb_s, uint32_t, do_usat_h)
+DO_XTNB(sve2_uqxtnb_d, uint64_t, do_usat_s)
-#define DO_UQXTN_H(n) do_sat_bhs(n, 0, UINT8_MAX)
-#define DO_UQXTN_S(n) do_sat_bhs(n, 0, UINT16_MAX)
-#define DO_UQXTN_D(n) do_sat_bhs(n, 0, UINT32_MAX)
+DO_XTNT(sve2_uqxtnt_h, uint16_t, uint8_t, H1, do_usat_b)
+DO_XTNT(sve2_uqxtnt_s, uint32_t, uint16_t, H1_2, do_usat_h)
+DO_XTNT(sve2_uqxtnt_d, uint64_t, uint32_t, H1_4, do_usat_s)
-DO_XTNB(sve2_uqxtnb_h, uint16_t, DO_UQXTN_H)
-DO_XTNB(sve2_uqxtnb_s, uint32_t, DO_UQXTN_S)
-DO_XTNB(sve2_uqxtnb_d, uint64_t, DO_UQXTN_D)
+DO_XTNB(sve2_sqxtunb_h, int16_t, do_usat_b)
+DO_XTNB(sve2_sqxtunb_s, int32_t, do_usat_h)
+DO_XTNB(sve2_sqxtunb_d, int64_t, do_usat_s)
-DO_XTNT(sve2_uqxtnt_h, uint16_t, uint8_t, H1, DO_UQXTN_H)
-DO_XTNT(sve2_uqxtnt_s, uint32_t, uint16_t, H1_2, DO_UQXTN_S)
-DO_XTNT(sve2_uqxtnt_d, uint64_t, uint32_t, H1_4, DO_UQXTN_D)
-
-DO_XTNB(sve2_sqxtunb_h, int16_t, DO_UQXTN_H)
-DO_XTNB(sve2_sqxtunb_s, int32_t, DO_UQXTN_S)
-DO_XTNB(sve2_sqxtunb_d, int64_t, DO_UQXTN_D)
-
-DO_XTNT(sve2_sqxtunt_h, int16_t, int8_t, H1, DO_UQXTN_H)
-DO_XTNT(sve2_sqxtunt_s, int32_t, int16_t, H1_2, DO_UQXTN_S)
-DO_XTNT(sve2_sqxtunt_d, int64_t, int32_t, H1_4, DO_UQXTN_D)
+DO_XTNT(sve2_sqxtunt_h, int16_t, int8_t, H1, do_usat_b)
+DO_XTNT(sve2_sqxtunt_s, int32_t, int16_t, H1_2, do_usat_h)
+DO_XTNT(sve2_sqxtunt_d, int64_t, int32_t, H1_4, do_usat_s)
#undef DO_XTNB
#undef DO_XTNT
@@ -2183,10 +2164,9 @@ DO_SHRNT(sve2_rshrnt_h, uint16_t, uint8_t, H1_2, H1, do_urshr)
DO_SHRNT(sve2_rshrnt_s, uint32_t, uint16_t, H1_4, H1_2, do_urshr)
DO_SHRNT(sve2_rshrnt_d, uint64_t, uint32_t, H1_8, H1_4, do_urshr)
-#define DO_SQSHRUN_H(x, sh) do_sat_bhs((int64_t)(x) >> sh, 0, UINT8_MAX)
-#define DO_SQSHRUN_S(x, sh) do_sat_bhs((int64_t)(x) >> sh, 0, UINT16_MAX)
-#define DO_SQSHRUN_D(x, sh) \
- do_sat_bhs((int64_t)(x) >> (sh < 64 ? sh : 63), 0, UINT32_MAX)
+#define DO_SQSHRUN_H(x, sh) do_usat_b((int64_t)(x) >> sh)
+#define DO_SQSHRUN_S(x, sh) do_usat_h((int64_t)(x) >> sh)
+#define DO_SQSHRUN_D(x, sh) do_usat_s((int64_t)(x) >> (sh < 64 ? sh : 63))
DO_SHRNB(sve2_sqshrunb_h, int16_t, uint8_t, DO_SQSHRUN_H)
DO_SHRNB(sve2_sqshrunb_s, int32_t, uint16_t, DO_SQSHRUN_S)
@@ -2196,9 +2176,9 @@ DO_SHRNT(sve2_sqshrunt_h, int16_t, uint8_t, H1_2, H1, DO_SQSHRUN_H)
DO_SHRNT(sve2_sqshrunt_s, int32_t, uint16_t, H1_4, H1_2, DO_SQSHRUN_S)
DO_SHRNT(sve2_sqshrunt_d, int64_t, uint32_t, H1_8, H1_4, DO_SQSHRUN_D)
-#define DO_SQRSHRUN_H(x, sh) do_sat_bhs(do_srshr(x, sh), 0, UINT8_MAX)
-#define DO_SQRSHRUN_S(x, sh) do_sat_bhs(do_srshr(x, sh), 0, UINT16_MAX)
-#define DO_SQRSHRUN_D(x, sh) do_sat_bhs(do_srshr(x, sh), 0, UINT32_MAX)
+#define DO_SQRSHRUN_H(x, sh) do_usat_b(do_srshr(x, sh))
+#define DO_SQRSHRUN_S(x, sh) do_usat_h(do_srshr(x, sh))
+#define DO_SQRSHRUN_D(x, sh) do_usat_s(do_srshr(x, sh))
DO_SHRNB(sve2_sqrshrunb_h, int16_t, uint8_t, DO_SQRSHRUN_H)
DO_SHRNB(sve2_sqrshrunb_s, int32_t, uint16_t, DO_SQRSHRUN_S)
@@ -2208,9 +2188,9 @@ DO_SHRNT(sve2_sqrshrunt_h, int16_t, uint8_t, H1_2, H1, DO_SQRSHRUN_H)
DO_SHRNT(sve2_sqrshrunt_s, int32_t, uint16_t, H1_4, H1_2, DO_SQRSHRUN_S)
DO_SHRNT(sve2_sqrshrunt_d, int64_t, uint32_t, H1_8, H1_4, DO_SQRSHRUN_D)
-#define DO_SQSHRN_H(x, sh) do_sat_bhs(x >> sh, INT8_MIN, INT8_MAX)
-#define DO_SQSHRN_S(x, sh) do_sat_bhs(x >> sh, INT16_MIN, INT16_MAX)
-#define DO_SQSHRN_D(x, sh) do_sat_bhs(x >> sh, INT32_MIN, INT32_MAX)
+#define DO_SQSHRN_H(x, sh) do_ssat_b(x >> sh)
+#define DO_SQSHRN_S(x, sh) do_ssat_h(x >> sh)
+#define DO_SQSHRN_D(x, sh) do_ssat_s(x >> sh)
DO_SHRNB(sve2_sqshrnb_h, int16_t, uint8_t, DO_SQSHRN_H)
DO_SHRNB(sve2_sqshrnb_s, int32_t, uint16_t, DO_SQSHRN_S)
@@ -2220,9 +2200,9 @@ DO_SHRNT(sve2_sqshrnt_h, int16_t, uint8_t, H1_2, H1, DO_SQSHRN_H)
DO_SHRNT(sve2_sqshrnt_s, int32_t, uint16_t, H1_4, H1_2, DO_SQSHRN_S)
DO_SHRNT(sve2_sqshrnt_d, int64_t, uint32_t, H1_8, H1_4, DO_SQSHRN_D)
-#define DO_SQRSHRN_H(x, sh) do_sat_bhs(do_srshr(x, sh), INT8_MIN, INT8_MAX)
-#define DO_SQRSHRN_S(x, sh) do_sat_bhs(do_srshr(x, sh), INT16_MIN, INT16_MAX)
-#define DO_SQRSHRN_D(x, sh) do_sat_bhs(do_srshr(x, sh), INT32_MIN, INT32_MAX)
+#define DO_SQRSHRN_H(x, sh) do_ssat_b(do_srshr(x, sh))
+#define DO_SQRSHRN_S(x, sh) do_ssat_h(do_srshr(x, sh))
+#define DO_SQRSHRN_D(x, sh) do_ssat_s(do_srshr(x, sh))
DO_SHRNB(sve2_sqrshrnb_h, int16_t, uint8_t, DO_SQRSHRN_H)
DO_SHRNB(sve2_sqrshrnb_s, int32_t, uint16_t, DO_SQRSHRN_S)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 53/61] target/arm: Implement SME2 SQCVT, UQCVT, SQCVTU
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (51 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 52/61] target/arm: Use do_[us]sat_[bhs] in sve_helper.c Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 54/61] target/arm: Implement SME2 SUNPK, UUNPK Richard Henderson
` (8 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 18 +++++++
target/arm/tcg/sme_helper.c | 91 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 35 +++++++++++++
target/arm/tcg/sme.decode | 22 ++++++++
4 files changed, 166 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 04db920299..7ae3fd172d 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -220,3 +220,21 @@ DEF_HELPER_FLAGS_4(sme2_fcvt_n, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_4(sme2_fcvtn, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_4(sme2_fcvt_w, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
DEF_HELPER_FLAGS_4(sme2_fcvtl, TCG_CALL_NO_RWG, void, ptr, ptr, fpst, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sqcvt_sh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqcvt_sh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvtu_sh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sqcvt_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqcvt_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvtu_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvt_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqcvt_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvtu_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sqcvtn_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqcvtn_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvtun_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvtn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqcvtn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqcvtun_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 8289a02bfa..a9adc8589f 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1532,6 +1532,64 @@ void HELPER(sme2_fcvt_n)(void *vd, void *vs, float_status *fpst, uint32_t desc)
}
}
+#define SQCVT2(NAME, TW, TN, HW, HN, SAT) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc), n = oprsz / sizeof(TW); \
+ TW *s0 = vs, *s1 = vs + sizeof(ARMVectorReg); \
+ TN *d = vd; \
+ if ((vd - vs) < 2 * sizeof(ARMVectorReg)) { \
+ d = (TN *)&scratch; \
+ } \
+ for (size_t i = 0; i < n; ++i) { \
+ d[HN(i)] = SAT(s0[HW(i)]); \
+ d[HN(i) + n] = SAT(s1[HW(i)]); \
+ } \
+ if (d != vd) { \
+ memcpy(vd, d, oprsz); \
+ } \
+}
+
+SQCVT2(sme2_sqcvt_sh, int32_t, int16_t, H4, H2, do_ssat_h)
+SQCVT2(sme2_uqcvt_sh, uint32_t, uint16_t, H4, H2, do_usat_h)
+SQCVT2(sme2_sqcvtu_sh, int32_t, uint16_t, H4, H2, do_usat_h)
+
+#undef SQCVT2
+
+#define SQCVT4(NAME, TW, TN, HW, HN, SAT) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc), n = oprsz / sizeof(TW); \
+ TW *s0 = vs, *s1 = vs + sizeof(ARMVectorReg); \
+ TW *s2 = vs + 2 * sizeof(ARMVectorReg); \
+ TW *s3 = vs + 3 * sizeof(ARMVectorReg); \
+ TN *d = vd; \
+ if ((vd - vs) < 4 * sizeof(ARMVectorReg)) { \
+ d = (TN *)&scratch; \
+ } \
+ for (size_t i = 0; i < n; ++i) { \
+ d[HN(i)] = SAT(s0[HW(i)]); \
+ d[HN(i) + n] = SAT(s1[HW(i)]); \
+ d[HN(i) + 2 * n] = SAT(s2[HW(i)]); \
+ d[HN(i) + 3 * n] = SAT(s3[HW(i)]); \
+ } \
+ if (d != vd) { \
+ memcpy(vd, d, oprsz); \
+ } \
+}
+
+SQCVT4(sme2_sqcvt_sb, int32_t, int8_t, H4, H2, do_ssat_b)
+SQCVT4(sme2_uqcvt_sb, uint32_t, uint8_t, H4, H2, do_usat_b)
+SQCVT4(sme2_sqcvtu_sb, int32_t, uint8_t, H4, H2, do_usat_b)
+
+SQCVT4(sme2_sqcvt_dh, int64_t, int16_t, H8, H2, do_ssat_h)
+SQCVT4(sme2_uqcvt_dh, uint64_t, uint16_t, H8, H2, do_usat_h)
+SQCVT4(sme2_sqcvtu_dh, int64_t, uint16_t, H8, H2, do_usat_h)
+
+#undef SQCVT4
+
/* Convert and interleave */
void HELPER(sme2_bfcvtn)(void *vd, void *vs, float_status *fpst, uint32_t desc)
{
@@ -1563,6 +1621,39 @@ void HELPER(sme2_fcvtn)(void *vd, void *vs, float_status *fpst, uint32_t desc)
}
}
+#define SQCVTN4(NAME, TW, TN, HW, HN, SAT) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc), n = oprsz / sizeof(TW); \
+ TW *s0 = vs, *s1 = vs + sizeof(ARMVectorReg); \
+ TW *s2 = vs + 2 * sizeof(ARMVectorReg); \
+ TW *s3 = vs + 3 * sizeof(ARMVectorReg); \
+ TN *d = vd; \
+ if ((vd - vs) < 4 * sizeof(ARMVectorReg)) { \
+ d = (TN *)&scratch; \
+ } \
+ for (size_t i = 0; i < n; ++i) { \
+ d[HN(4 * i + 0)] = SAT(s0[HW(i)]); \
+ d[HN(4 * i + 1)] = SAT(s1[HW(i)]); \
+ d[HN(4 * i + 2)] = SAT(s2[HW(i)]); \
+ d[HN(4 * i + 3)] = SAT(s3[HW(i)]); \
+ } \
+ if (d != vd) { \
+ memcpy(vd, d, oprsz); \
+ } \
+}
+
+SQCVTN4(sme2_sqcvtn_sb, int32_t, int8_t, H4, H1, do_ssat_b)
+SQCVTN4(sme2_uqcvtn_sb, uint32_t, uint8_t, H4, H1, do_usat_b)
+SQCVTN4(sme2_sqcvtun_sb, int32_t, uint8_t, H4, H1, do_usat_b)
+
+SQCVTN4(sme2_sqcvtn_dh, int64_t, int16_t, H8, H2, do_ssat_h)
+SQCVTN4(sme2_uqcvtn_dh, uint64_t, uint16_t, H8, H2, do_usat_h)
+SQCVTN4(sme2_sqcvtun_dh, int64_t, uint16_t, H8, H2, do_usat_h)
+
+#undef SQCVTN4
+
/* Expand and convert */
void HELPER(sme2_fcvt_w)(void *vd, void *vs, float_status *fpst, uint32_t desc)
{
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index a4c683e12f..ec88a91cf1 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1312,3 +1312,38 @@ TRANS_FEAT(FRINTM, aa64_sme2, do_zz_fpst, a, float_round_down,
FPST_A64, gen_helper_gvec_vrint_rm_s)
TRANS_FEAT(FRINTA, aa64_sme2, do_zz_fpst, a, float_round_ties_away,
FPST_A64, gen_helper_gvec_vrint_rm_s)
+
+static bool do_zz(DisasContext *s, arg_zz_n *a, int data,
+ gen_helper_gvec_2 *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+
+ for (int i = 0, n = a->n; i < n; ++i) {
+ tcg_gen_gvec_2_ool(vec_full_reg_offset(s, a->zd + i),
+ vec_full_reg_offset(s, a->zn + i),
+ svl, svl, data, fn);
+ }
+ }
+ return true;
+}
+
+TRANS_FEAT(SQCVT_sh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvt_sh)
+TRANS_FEAT(UQCVT_sh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uqcvt_sh)
+TRANS_FEAT(SQCVTU_sh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtu_sh)
+
+TRANS_FEAT(SQCVT_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvt_sb)
+TRANS_FEAT(UQCVT_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uqcvt_sb)
+TRANS_FEAT(SQCVTU_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtu_sb)
+
+TRANS_FEAT(SQCVT_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvt_dh)
+TRANS_FEAT(UQCVT_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uqcvt_dh)
+TRANS_FEAT(SQCVTU_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtu_dh)
+
+TRANS_FEAT(SQCVTN_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtn_sb)
+TRANS_FEAT(UQCVTN_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uqcvtn_sb)
+TRANS_FEAT(SQCVTUN_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtun_sb)
+
+TRANS_FEAT(SQCVTN_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtn_dh)
+TRANS_FEAT(UQCVTN_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uqcvtn_dh)
+TRANS_FEAT(SQCVTUN_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtun_dh)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index dffd4f5ff8..491c6231ea 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -725,6 +725,8 @@ FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
&zz_n zd zn n
@zz_1x2 ........ ... ..... ...... ..... zd:5 \
&zz_n n=1 zn=%zn_ax2
+@zz_1x4 ........ ... ..... ...... ..... zd:5 \
+ &zz_n n=1 zn=%zn_ax4
@zz_2x1 ........ ... ..... ...... zn:5 ..... \
&zz_n n=1 zd=%zd_ax2
@zz_2x2 ........ ... ..... ...... .... . ..... \
@@ -759,3 +761,23 @@ FRINTM 11000001 101 01010 111000 ....0 ....0 @zz_2x2
FRINTM 11000001 101 11010 111000 ...00 ...00 @zz_4x4
FRINTA 11000001 101 01100 111000 ....0 ....0 @zz_2x2
FRINTA 11000001 101 11100 111000 ...00 ...00 @zz_4x4
+
+SQCVT_sh 11000001 001 00011 111000 ....0 ..... @zz_1x2
+UQCVT_sh 11000001 001 00011 111000 ....1 ..... @zz_1x2
+SQCVTU_sh 11000001 011 00011 111000 ....0 ..... @zz_1x2
+
+SQCVT_sb 11000001 001 10011 111000 ...00 ..... @zz_1x4
+UQCVT_sb 11000001 001 10011 111000 ...01 ..... @zz_1x4
+SQCVTU_sb 11000001 011 10011 111000 ...00 ..... @zz_1x4
+
+SQCVT_dh 11000001 101 10011 111000 ...00 ..... @zz_1x4
+UQCVT_dh 11000001 101 10011 111000 ...01 ..... @zz_1x4
+SQCVTU_dh 11000001 111 10011 111000 ...00 ..... @zz_1x4
+
+SQCVTN_sb 11000001 001 10011 111000 ...10 ..... @zz_1x4
+UQCVTN_sb 11000001 001 10011 111000 ...11 ..... @zz_1x4
+SQCVTUN_sb 11000001 011 10011 111000 ...10 ..... @zz_1x4
+
+SQCVTN_dh 11000001 101 10011 111000 ...10 ..... @zz_1x4
+UQCVTN_dh 11000001 101 10011 111000 ...11 ..... @zz_1x4
+SQCVTUN_dh 11000001 111 10011 111000 ...10 ..... @zz_1x4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 54/61] target/arm: Implement SME2 SUNPK, UUNPK
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (52 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 53/61] target/arm: Implement SME2 SQCVT, UQCVT, SQCVTU Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 55/61] target/arm: Implement SME2 ZIP, UZP (four registers) Richard Henderson
` (7 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 13 ++++++++++++
target/arm/tcg/sme_helper.c | 38 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 16 ++++++++++++++
target/arm/tcg/sme.decode | 18 ++++++++++++++++
4 files changed, 85 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 7ae3fd172d..3d3eb393b3 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -238,3 +238,16 @@ DEF_HELPER_FLAGS_3(sme2_sqcvtun_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_sqcvtn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uqcvtn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_sqcvtun_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sunpk2_bh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sunpk2_hs, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sunpk2_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sunpk4_bh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sunpk4_hs, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sunpk4_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uunpk2_bh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uunpk2_hs, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uunpk2_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uunpk4_bh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uunpk4_hs, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uunpk4_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index a9adc8589f..cf4f09dbc4 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1676,6 +1676,44 @@ void HELPER(sme2_fcvt_w)(void *vd, void *vs, float_status *fpst, uint32_t desc)
}
}
+#define UNPK(NAME, SREG, TW, TN, HW, HN) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch[SREG] __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc); \
+ size_t n = oprsz / sizeof(TW); \
+ if ((vs - vd) < 2 * SREG * sizeof(ARMVectorReg)) { \
+ vs = memcpy(scratch, vs, sizeof(scratch)); \
+ } \
+ for (size_t r = 0; r < SREG; ++r) { \
+ TN *s = vs + r * sizeof(ARMVectorReg); \
+ for (size_t i = 0; i < 2; ++i) { \
+ TW *d = vd + (2 * r + i) * sizeof(ARMVectorReg); \
+ for (size_t e = 0; e < n; ++e) { \
+ d[HW(e)] = s[i * n + HN(e)]; \
+ } \
+ } \
+ } \
+}
+
+UNPK(sme2_sunpk2_bh, 1, int16_t, int8_t, H2, H1)
+UNPK(sme2_sunpk2_hs, 1, int32_t, int16_t, H4, H2)
+UNPK(sme2_sunpk2_sd, 1, int64_t, int32_t, H8, H4)
+
+UNPK(sme2_sunpk4_bh, 2, int16_t, int8_t, H2, H1)
+UNPK(sme2_sunpk4_hs, 2, int32_t, int16_t, H4, H2)
+UNPK(sme2_sunpk4_sd, 2, int64_t, int32_t, H8, H4)
+
+UNPK(sme2_uunpk2_bh, 1, uint16_t, uint8_t, H2, H1)
+UNPK(sme2_uunpk2_hs, 1, uint32_t, uint16_t, H4, H2)
+UNPK(sme2_uunpk2_sd, 1, uint64_t, uint32_t, H8, H4)
+
+UNPK(sme2_uunpk4_bh, 2, uint16_t, uint8_t, H2, H1)
+UNPK(sme2_uunpk4_hs, 2, uint32_t, uint16_t, H4, H2)
+UNPK(sme2_uunpk4_sd, 2, uint64_t, uint32_t, H8, H4)
+
+#undef UNPK
+
/* Deinterleave and convert. */
void HELPER(sme2_fcvtl)(void *vd, void *vs, float_status *fpst, uint32_t desc)
{
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index ec88a91cf1..9d7a10aa6b 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1347,3 +1347,19 @@ TRANS_FEAT(SQCVTUN_sb, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtun_sb)
TRANS_FEAT(SQCVTN_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtn_dh)
TRANS_FEAT(UQCVTN_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uqcvtn_dh)
TRANS_FEAT(SQCVTUN_dh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sqcvtun_dh)
+
+TRANS_FEAT(SUNPK_2bh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sunpk2_bh)
+TRANS_FEAT(SUNPK_2hs, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sunpk2_hs)
+TRANS_FEAT(SUNPK_2sd, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sunpk2_sd)
+
+TRANS_FEAT(SUNPK_4bh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sunpk4_bh)
+TRANS_FEAT(SUNPK_4hs, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sunpk4_hs)
+TRANS_FEAT(SUNPK_4sd, aa64_sme2, do_zz, a, 0, gen_helper_sme2_sunpk4_sd)
+
+TRANS_FEAT(UUNPK_2bh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk2_bh)
+TRANS_FEAT(UUNPK_2hs, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk2_hs)
+TRANS_FEAT(UUNPK_2sd, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk2_sd)
+
+TRANS_FEAT(UUNPK_4bh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk4_bh)
+TRANS_FEAT(UUNPK_4hs, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk4_hs)
+TRANS_FEAT(UUNPK_4sd, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk4_sd)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 491c6231ea..1da64c5258 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -733,6 +733,8 @@ FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
&zz_n n=2 zd=%zd_ax2 zn=%zn_ax2
@zz_4x4 ........ ... ..... ...... .... . ..... \
&zz_n n=4 zd=%zd_ax4 zn=%zn_ax4
+@zz_4x2_n1 ........ ... ..... ...... .... . ..... \
+ &zz_n n=1 zd=%zd_ax4 zn=%zn_ax2
BFCVT 11000001 011 00000 111000 ....0 ..... @zz_1x2
BFCVTN 11000001 011 00000 111000 ....1 ..... @zz_1x2
@@ -781,3 +783,19 @@ SQCVTUN_sb 11000001 011 10011 111000 ...10 ..... @zz_1x4
SQCVTN_dh 11000001 101 10011 111000 ...10 ..... @zz_1x4
UQCVTN_dh 11000001 101 10011 111000 ...11 ..... @zz_1x4
SQCVTUN_dh 11000001 111 10011 111000 ...10 ..... @zz_1x4
+
+SUNPK_2bh 11000001 011 00101 111000 ..... ....0 @zz_2x1
+SUNPK_2hs 11000001 101 00101 111000 ..... ....0 @zz_2x1
+SUNPK_2sd 11000001 111 00101 111000 ..... ....0 @zz_2x1
+
+UUNPK_2bh 11000001 011 00101 111000 ..... ....1 @zz_2x1
+UUNPK_2hs 11000001 101 00101 111000 ..... ....1 @zz_2x1
+UUNPK_2sd 11000001 111 00101 111000 ..... ....1 @zz_2x1
+
+SUNPK_4bh 11000001 011 10101 111000 ....0 ...00 @zz_4x2_n1
+SUNPK_4hs 11000001 101 10101 111000 ....0 ...00 @zz_4x2_n1
+SUNPK_4sd 11000001 111 10101 111000 ....0 ...00 @zz_4x2_n1
+
+UUNPK_4bh 11000001 011 10101 111000 ....0 ...01 @zz_4x2_n1
+UUNPK_4hs 11000001 101 10101 111000 ....0 ...01 @zz_4x2_n1
+UUNPK_4sd 11000001 111 10101 111000 ....0 ...01 @zz_4x2_n1
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 55/61] target/arm: Implement SME2 ZIP, UZP (four registers)
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (53 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 54/61] target/arm: Implement SME2 SUNPK, UUNPK Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 56/61] target/arm: Move do_urshr, do_srshr to vec_internal.h Richard Henderson
` (6 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 12 ++++++
target/arm/tcg/sme_helper.c | 68 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 35 +++++++++++++++++
target/arm/tcg/sme.decode | 11 ++++++
4 files changed, 126 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 3d3eb393b3..1438a2a5b8 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -251,3 +251,15 @@ DEF_HELPER_FLAGS_3(sme2_uunpk2_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uunpk4_bh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uunpk4_hs, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uunpk4_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_zip4_b, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_zip4_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_zip4_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_zip4_d, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_zip4_q, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_uzp4_b, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uzp4_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uzp4_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uzp4_d, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uzp4_q, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index cf4f09dbc4..de0fec272d 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1729,3 +1729,71 @@ void HELPER(sme2_fcvtl)(void *vd, void *vs, float_status *fpst, uint32_t desc)
d1[H4(i)] = v1;
}
}
+
+#define ZIP4(NAME, TYPE, H) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch[4] __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc); \
+ size_t quads = oprsz / (sizeof(TYPE) * 4); \
+ TYPE *s0, *s1, *s2, *s3; \
+ if (vs == vd) { \
+ vs = memcpy(scratch, vs, sizeof(scratch)); \
+ } \
+ s0 = vs; \
+ s1 = vs + sizeof(ARMVectorReg); \
+ s2 = vs + 2 * sizeof(ARMVectorReg); \
+ s3 = vs + 3 * sizeof(ARMVectorReg); \
+ for (size_t r = 0; r < 4; ++r) { \
+ TYPE *d = vd + r * sizeof(ARMVectorReg); \
+ size_t base = r * quads; \
+ for (size_t q = 0; q < quads; ++q) { \
+ d[H(4 * q + 0)] = s0[base + H(q)]; \
+ d[H(4 * q + 1)] = s1[base + H(q)]; \
+ d[H(4 * q + 2)] = s2[base + H(q)]; \
+ d[H(4 * q + 3)] = s3[base + H(q)]; \
+ } \
+ } \
+}
+
+ZIP4(sme2_zip4_b, uint8_t, H1)
+ZIP4(sme2_zip4_h, uint16_t, H2)
+ZIP4(sme2_zip4_s, uint32_t, H4)
+ZIP4(sme2_zip4_d, uint64_t, )
+ZIP4(sme2_zip4_q, Int128, )
+
+#undef ZIP4
+
+#define UZP4(NAME, TYPE, H) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch[4] __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc); \
+ size_t quads = oprsz / (sizeof(TYPE) * 4); \
+ TYPE *d0, *d1, *d2, *d3; \
+ if (vs == vd) { \
+ vs = memcpy(scratch, vs, sizeof(scratch)); \
+ } \
+ d0 = vd; \
+ d1 = vd + sizeof(ARMVectorReg); \
+ d2 = vd + 2 * sizeof(ARMVectorReg); \
+ d3 = vd + 3 * sizeof(ARMVectorReg); \
+ for (size_t r = 0; r < 4; ++r) { \
+ TYPE *s = vs + r * sizeof(ARMVectorReg); \
+ size_t base = r * quads; \
+ for (size_t q = 0; q < quads; ++q) { \
+ d0[base + H(q)] = s[H(4 * q + 0)]; \
+ d1[base + H(q)] = s[H(4 * q + 1)]; \
+ d2[base + H(q)] = s[H(4 * q + 2)]; \
+ d3[base + H(q)] = s[H(4 * q + 3)]; \
+ } \
+ } \
+}
+
+UZP4(sme2_uzp4_b, uint8_t, H1)
+UZP4(sme2_uzp4_h, uint16_t, H2)
+UZP4(sme2_uzp4_s, uint32_t, H4)
+UZP4(sme2_uzp4_d, uint64_t, )
+UZP4(sme2_uzp4_q, Int128, )
+
+#undef UZP4
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 9d7a10aa6b..01542e5330 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1363,3 +1363,38 @@ TRANS_FEAT(UUNPK_2sd, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk2_sd)
TRANS_FEAT(UUNPK_4bh, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk4_bh)
TRANS_FEAT(UUNPK_4hs, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk4_hs)
TRANS_FEAT(UUNPK_4sd, aa64_sme2, do_zz, a, 0, gen_helper_sme2_uunpk4_sd)
+
+static bool do_zipuzp_4(DisasContext *s, arg_zz_e *a,
+ gen_helper_gvec_2 * const fn[5])
+{
+ int svl = streaming_vec_reg_size(s);
+
+ /* Both MO_64 and MO_128 can fail the size test. */
+ if (svl < (4 << a->esz)) {
+ return false;
+ }
+ if (sme_sm_enabled_check(s)) {
+ tcg_gen_gvec_2_ool(vec_full_reg_offset(s, a->zd),
+ vec_full_reg_offset(s, a->zn),
+ svl, svl, 0, fn[a->esz]);
+ }
+ return true;
+}
+
+static gen_helper_gvec_2 * const zip4_fns[] = {
+ gen_helper_sme2_zip4_b,
+ gen_helper_sme2_zip4_h,
+ gen_helper_sme2_zip4_s,
+ gen_helper_sme2_zip4_d,
+ gen_helper_sme2_zip4_q,
+};
+TRANS_FEAT(ZIP_4, aa64_sme2, do_zipuzp_4, a, zip4_fns)
+
+static gen_helper_gvec_2 * const uzp4_fns[] = {
+ gen_helper_sme2_uzp4_b,
+ gen_helper_sme2_uzp4_h,
+ gen_helper_sme2_uzp4_s,
+ gen_helper_sme2_uzp4_d,
+ gen_helper_sme2_uzp4_q,
+};
+TRANS_FEAT(UZP_4, aa64_sme2, do_zipuzp_4, a, uzp4_fns)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 1da64c5258..a43edb625c 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -722,6 +722,7 @@ FMLS_nx_d 11000001 1101 .... 1 .. 00. ...00 10 ... @azx_4x1_i1_o3
### SME2 Multi-vector SVE Constructive Unary
+&zz_e zd zn esz
&zz_n zd zn n
@zz_1x2 ........ ... ..... ...... ..... zd:5 \
&zz_n n=1 zn=%zn_ax2
@@ -799,3 +800,13 @@ SUNPK_4sd 11000001 111 10101 111000 ....0 ...00 @zz_4x2_n1
UUNPK_4bh 11000001 011 10101 111000 ....0 ...01 @zz_4x2_n1
UUNPK_4hs 11000001 101 10101 111000 ....0 ...01 @zz_4x2_n1
UUNPK_4sd 11000001 111 10101 111000 ....0 ...01 @zz_4x2_n1
+
+ZIP_4 11000001 esz:2 1 10110 111000 ...00 ... 00 \
+ &zz_e zd=%zd_ax4 zn=%zn_ax4
+ZIP_4 11000001 001 10111 111000 ...00 ... 00 \
+ &zz_e esz=4 zd=%zd_ax4 zn=%zn_ax4
+
+UZP_4 11000001 esz:2 1 10110 111000 ...00 ... 10 \
+ &zz_e zd=%zd_ax4 zn=%zn_ax4
+UZP_4 11000001 001 10111 111000 ...00 ... 10 \
+ &zz_e esz=4 zd=%zd_ax4 zn=%zn_ax4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 56/61] target/arm: Move do_urshr, do_srshr to vec_internal.h
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (54 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 55/61] target/arm: Implement SME2 ZIP, UZP (four registers) Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 57/61] target/arm: Implement SME2 SQRSHR, UQRSHR, SQRSHRN Richard Henderson
` (5 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Unify two copies of these inline functions.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/vec_internal.h | 21 +++++++++++++++++++++
target/arm/tcg/mve_helper.c | 21 ---------------------
target/arm/tcg/sve_helper.c | 21 ---------------------
3 files changed, 21 insertions(+), 42 deletions(-)
diff --git a/target/arm/tcg/vec_internal.h b/target/arm/tcg/vec_internal.h
index ad6fef03e6..b33a0ec2fc 100644
--- a/target/arm/tcg/vec_internal.h
+++ b/target/arm/tcg/vec_internal.h
@@ -228,6 +228,27 @@ int64_t do_sqrdmlah_d(int64_t, int64_t, int64_t, bool, bool);
#define do_usat_h(val) MIN(MAX(val, 0), UINT16_MAX)
#define do_usat_s(val) MIN(MAX(val, 0), UINT32_MAX)
+static inline uint64_t do_urshr(uint64_t x, unsigned sh)
+{
+ if (likely(sh < 64)) {
+ return (x >> sh) + ((x >> (sh - 1)) & 1);
+ } else if (sh == 64) {
+ return x >> 63;
+ } else {
+ return 0;
+ }
+}
+
+static inline int64_t do_srshr(int64_t x, unsigned sh)
+{
+ if (likely(sh < 64)) {
+ return (x >> sh) + ((x >> (sh - 1)) & 1);
+ } else {
+ /* Rounding the sign bit always produces 0. */
+ return 0;
+ }
+}
+
/**
* bfdotadd:
* @sum: addend
diff --git a/target/arm/tcg/mve_helper.c b/target/arm/tcg/mve_helper.c
index 274003e2e5..49353369fe 100644
--- a/target/arm/tcg/mve_helper.c
+++ b/target/arm/tcg/mve_helper.c
@@ -2165,27 +2165,6 @@ DO_VSHLL_ALL(vshllt, true)
DO_VSHRN(OP##tb, true, 1, uint8_t, 2, uint16_t, FN) \
DO_VSHRN(OP##th, true, 2, uint16_t, 4, uint32_t, FN)
-static inline uint64_t do_urshr(uint64_t x, unsigned sh)
-{
- if (likely(sh < 64)) {
- return (x >> sh) + ((x >> (sh - 1)) & 1);
- } else if (sh == 64) {
- return x >> 63;
- } else {
- return 0;
- }
-}
-
-static inline int64_t do_srshr(int64_t x, unsigned sh)
-{
- if (likely(sh < 64)) {
- return (x >> sh) + ((x >> (sh - 1)) & 1);
- } else {
- /* Rounding the sign bit always produces 0. */
- return 0;
- }
-}
-
DO_VSHRN_ALL(vshrn, DO_SHR)
DO_VSHRN_ALL(vrshrn, do_urshr)
diff --git a/target/arm/tcg/sve_helper.c b/target/arm/tcg/sve_helper.c
index e84cffd0b0..0973463735 100644
--- a/target/arm/tcg/sve_helper.c
+++ b/target/arm/tcg/sve_helper.c
@@ -2046,27 +2046,6 @@ void HELPER(NAME)(void *vd, void *vn, void *vg, uint32_t desc) \
when N is negative, add 2**M-1. */
#define DO_ASRD(N, M) ((N + (N < 0 ? ((__typeof(N))1 << M) - 1 : 0)) >> M)
-static inline uint64_t do_urshr(uint64_t x, unsigned sh)
-{
- if (likely(sh < 64)) {
- return (x >> sh) + ((x >> (sh - 1)) & 1);
- } else if (sh == 64) {
- return x >> 63;
- } else {
- return 0;
- }
-}
-
-static inline int64_t do_srshr(int64_t x, unsigned sh)
-{
- if (likely(sh < 64)) {
- return (x >> sh) + ((x >> (sh - 1)) & 1);
- } else {
- /* Rounding the sign bit always produces 0. */
- return 0;
- }
-}
-
DO_ZPZI(sve_asr_zpzi_b, int8_t, H1, DO_SHR)
DO_ZPZI(sve_asr_zpzi_h, int16_t, H1_2, DO_SHR)
DO_ZPZI(sve_asr_zpzi_s, int32_t, H1_4, DO_SHR)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 57/61] target/arm: Implement SME2 SQRSHR, UQRSHR, SQRSHRN
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (55 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 56/61] target/arm: Move do_urshr, do_srshr to vec_internal.h Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 58/61] target/arm: Implement SME2 ZIP, UZP (two registers) Richard Henderson
` (4 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 18 +++++++
target/arm/tcg/sme_helper.c | 94 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 29 +++++++++++
target/arm/tcg/sme.decode | 33 ++++++++++++
4 files changed, 174 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index 1438a2a5b8..b9e1de63e6 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -263,3 +263,21 @@ DEF_HELPER_FLAGS_3(sme2_uzp4_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uzp4_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uzp4_d, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uzp4_q, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sqrshr_sh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqrshr_sh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshru_sh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sqrshr_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqrshr_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshru_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshr_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqrshr_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshru_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_3(sme2_sqrshrn_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqrshrn_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshrun_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshrn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_uqrshrn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_3(sme2_sqrshrun_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index de0fec272d..9a92302b8b 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1590,6 +1590,66 @@ SQCVT4(sme2_sqcvtu_dh, int64_t, uint16_t, H8, H2, do_usat_h)
#undef SQCVT4
+#define SQRSHR2(NAME, TW, TN, HW, HN, RSHR, SAT) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc), n = oprsz / sizeof(TW); \
+ int shift = simd_data(desc); \
+ TW *s0 = vs, *s1 = vs + sizeof(ARMVectorReg); \
+ TN *d = vd; \
+ if ((vd - vs) < 2 * sizeof(ARMVectorReg)) { \
+ d = (TN *)&scratch; \
+ } \
+ for (size_t i = 0; i < n; ++i) { \
+ d[HN(i)] = SAT(RSHR(s0[HW(i)], shift)); \
+ d[HN(i) + n] = SAT(RSHR(s1[HW(i)], shift)); \
+ } \
+ if (d != vd) { \
+ memcpy(vd, d, oprsz); \
+ } \
+}
+
+SQRSHR2(sme2_sqrshr_sh, int32_t, int16_t, H4, H2, do_srshr, do_ssat_b)
+SQRSHR2(sme2_uqrshr_sh, uint32_t, uint16_t, H4, H2, do_urshr, do_usat_b)
+SQRSHR2(sme2_sqrshru_sh, int32_t, uint16_t, H4, H2, do_srshr, do_usat_b)
+
+#undef SQRSHR2
+
+#define SQRSHR4(NAME, TW, TN, HW, HN, RSHR, SAT) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc), n = oprsz / sizeof(TW); \
+ int shift = simd_data(desc); \
+ TW *s0 = vs, *s1 = vs + sizeof(ARMVectorReg); \
+ TW *s2 = vs + 2 * sizeof(ARMVectorReg); \
+ TW *s3 = vs + 3 * sizeof(ARMVectorReg); \
+ TN *d = vd; \
+ if ((vd - vs) < 4 * sizeof(ARMVectorReg)) { \
+ d = (TN *)&scratch; \
+ } \
+ for (size_t i = 0; i < n; ++i) { \
+ d[HN(i)] = SAT(RSHR(s0[HW(i)], shift)); \
+ d[HN(i) + n] = SAT(RSHR(s1[HW(i)], shift)); \
+ d[HN(i) + 2 * n] = SAT(RSHR(s2[HW(i)], shift)); \
+ d[HN(i) + 3 * n] = SAT(RSHR(s3[HW(i)], shift)); \
+ } \
+ if (d != vd) { \
+ memcpy(vd, d, oprsz); \
+ } \
+}
+
+SQRSHR4(sme2_sqrshr_sb, int32_t, int8_t, H4, H2, do_srshr, do_ssat_b)
+SQRSHR4(sme2_uqrshr_sb, uint32_t, uint8_t, H4, H2, do_urshr, do_usat_b)
+SQRSHR4(sme2_sqrshru_sb, int32_t, uint8_t, H4, H2, do_srshr, do_usat_b)
+
+SQRSHR4(sme2_sqrshr_dh, int64_t, int16_t, H8, H2, do_srshr, do_ssat_h)
+SQRSHR4(sme2_uqrshr_dh, uint64_t, uint16_t, H8, H2, do_urshr, do_usat_h)
+SQRSHR4(sme2_sqrshru_dh, int64_t, uint16_t, H8, H2, do_srshr, do_usat_h)
+
+#undef SQRSHR4
+
/* Convert and interleave */
void HELPER(sme2_bfcvtn)(void *vd, void *vs, float_status *fpst, uint32_t desc)
{
@@ -1654,6 +1714,40 @@ SQCVTN4(sme2_sqcvtun_dh, int64_t, uint16_t, H8, H2, do_usat_h)
#undef SQCVTN4
+#define SQRSHRN4(NAME, TW, TN, HW, HN, RSHR, SAT) \
+void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc), n = oprsz / sizeof(TW); \
+ int shift = simd_data(desc); \
+ TW *s0 = vs, *s1 = vs + sizeof(ARMVectorReg); \
+ TW *s2 = vs + 2 * sizeof(ARMVectorReg); \
+ TW *s3 = vs + 3 * sizeof(ARMVectorReg); \
+ TN *d = vd; \
+ if ((vd - vs) < 4 * sizeof(ARMVectorReg)) { \
+ d = (TN *)&scratch; \
+ } \
+ for (size_t i = 0; i < n; ++i) { \
+ d[HN(4 * i + 0)] = SAT(RSHR(s0[HW(i)], shift)); \
+ d[HN(4 * i + 1)] = SAT(RSHR(s1[HW(i)], shift)); \
+ d[HN(4 * i + 2)] = SAT(RSHR(s2[HW(i)], shift)); \
+ d[HN(4 * i + 3)] = SAT(RSHR(s3[HW(i)], shift)); \
+ } \
+ if (d != vd) { \
+ memcpy(vd, d, oprsz); \
+ } \
+}
+
+SQRSHRN4(sme2_sqrshrn_sb, int32_t, int8_t, H4, H1, do_srshr, do_ssat_b)
+SQRSHRN4(sme2_uqrshrn_sb, uint32_t, uint8_t, H4, H1, do_urshr, do_usat_b)
+SQRSHRN4(sme2_sqrshrun_sb, int32_t, uint8_t, H4, H1, do_srshr, do_usat_b)
+
+SQRSHRN4(sme2_sqrshrn_dh, int64_t, int16_t, H8, H2, do_srshr, do_ssat_h)
+SQRSHRN4(sme2_uqrshrn_dh, uint64_t, uint16_t, H8, H2, do_urshr, do_usat_h)
+SQRSHRN4(sme2_sqrshrun_dh, int64_t, uint16_t, H8, H2, do_srshr, do_usat_h)
+
+#undef SQRSHRN4
+
/* Expand and convert */
void HELPER(sme2_fcvt_w)(void *vd, void *vs, float_status *fpst, uint32_t desc)
{
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 01542e5330..55c3310e9b 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1398,3 +1398,32 @@ static gen_helper_gvec_2 * const uzp4_fns[] = {
gen_helper_sme2_uzp4_q,
};
TRANS_FEAT(UZP_4, aa64_sme2, do_zipuzp_4, a, uzp4_fns)
+
+static bool do_zz_rshr(DisasContext *s, arg_rshr *a, gen_helper_gvec_2 *fn)
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ tcg_gen_gvec_2_ool(vec_full_reg_offset(s, a->zd),
+ vec_full_reg_offset(s, a->zn),
+ svl, svl, a->shift, fn);
+ }
+ return true;
+}
+
+TRANS_FEAT(SQRSHR_sh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshr_sh)
+TRANS_FEAT(UQRSHR_sh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshr_sh)
+TRANS_FEAT(SQRSHRU_sh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshru_sh)
+
+TRANS_FEAT(SQRSHR_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshr_sb)
+TRANS_FEAT(SQRSHR_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshr_dh)
+TRANS_FEAT(UQRSHR_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshr_sb)
+TRANS_FEAT(UQRSHR_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshr_dh)
+TRANS_FEAT(SQRSHRU_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshru_sb)
+TRANS_FEAT(SQRSHRU_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshru_dh)
+
+TRANS_FEAT(SQRSHRN_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshrn_sb)
+TRANS_FEAT(SQRSHRN_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshrn_dh)
+TRANS_FEAT(UQRSHRN_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshrn_sb)
+TRANS_FEAT(UQRSHRN_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshrn_dh)
+TRANS_FEAT(SQRSHRUN_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshrun_sb)
+TRANS_FEAT(SQRSHRUN_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshrun_dh)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index a43edb625c..b28113a883 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -810,3 +810,36 @@ UZP_4 11000001 esz:2 1 10110 111000 ...00 ... 10 \
&zz_e zd=%zd_ax4 zn=%zn_ax4
UZP_4 11000001 001 10111 111000 ...00 ... 10 \
&zz_e esz=4 zd=%zd_ax4 zn=%zn_ax4
+
+### SME2 Multi-vector SVE Constructive Binary
+
+&rshr zd zn shift
+
+%rshr_sh_shift 16:4 !function=rsub_16
+%rshr_sb_shift 16:5 !function=rsub_32
+%rshr_dh_shift 22:1 16:5 !function=rsub_64
+
+@rshr_sh ........ .... .... ...... ..... zd:5 \
+ &rshr zn=%zn_ax2 shift=%rshr_sh_shift
+@rshr_sb ........ ... ..... ...... ..... zd:5 \
+ &rshr zn=%zn_ax4 shift=%rshr_sb_shift
+@rshr_dh ........ ... ..... ...... ..... zd:5 \
+ &rshr zn=%zn_ax4 shift=%rshr_dh_shift
+
+SQRSHR_sh 11000001 1110 .... 110101 ....0 ..... @rshr_sh
+UQRSHR_sh 11000001 1110 .... 110101 ....1 ..... @rshr_sh
+SQRSHRU_sh 11000001 1111 .... 110101 ....0 ..... @rshr_sh
+
+SQRSHR_sb 11000001 011 ..... 110110 ...00 ..... @rshr_sb
+SQRSHR_dh 11000001 1.1 ..... 110110 ...00 ..... @rshr_dh
+UQRSHR_sb 11000001 011 ..... 110110 ...01 ..... @rshr_sb
+UQRSHR_dh 11000001 1.1 ..... 110110 ...01 ..... @rshr_dh
+SQRSHRU_sb 11000001 011 ..... 110110 ...10 ..... @rshr_sb
+SQRSHRU_dh 11000001 1.1 ..... 110110 ...10 ..... @rshr_dh
+
+SQRSHRN_sb 11000001 011 ..... 110111 ...00 ..... @rshr_sb
+SQRSHRN_dh 11000001 1.1 ..... 110111 ...00 ..... @rshr_dh
+UQRSHRN_sb 11000001 011 ..... 110111 ...01 ..... @rshr_sb
+UQRSHRN_dh 11000001 1.1 ..... 110111 ...01 ..... @rshr_dh
+SQRSHRUN_sb 11000001 011 ..... 110111 ...10 ..... @rshr_sb
+SQRSHRUN_dh 11000001 1.1 ..... 110111 ...10 ..... @rshr_dh
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 58/61] target/arm: Implement SME2 ZIP, UZP (two registers)
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (56 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 57/61] target/arm: Implement SME2 SQRSHR, UQRSHR, SQRSHRN Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 59/61] target/arm: Implement SME2 FCLAMP, SCLAMP, UCLAMP Richard Henderson
` (3 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 12 +++++++
target/arm/tcg/sme_helper.c | 58 ++++++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 37 ++++++++++++++++++++++
target/arm/tcg/sme.decode | 12 +++++++
4 files changed, 119 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index b9e1de63e6..a4216c8f3b 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -252,6 +252,18 @@ DEF_HELPER_FLAGS_3(sme2_uunpk4_bh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uunpk4_hs, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uunpk4_sd, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_zip2_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_zip2_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_zip2_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_zip2_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_zip2_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(sme2_uzp2_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uzp2_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uzp2_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uzp2_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uzp2_q, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
DEF_HELPER_FLAGS_3(sme2_zip4_b, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_zip4_h, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_zip4_s, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index 9a92302b8b..a88b602314 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1824,6 +1824,37 @@ void HELPER(sme2_fcvtl)(void *vd, void *vs, float_status *fpst, uint32_t desc)
}
}
+#define ZIP2(NAME, TYPE, H) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, uint32_t desc) \
+{ \
+ ARMVectorReg scratch[2] __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc); \
+ size_t pairs = oprsz / (sizeof(TYPE) * 2); \
+ TYPE *n = vn, *m = vm; \
+ if (vd >= vn && vn < vd + 2 * sizeof(ARMVectorReg)) { \
+ n = memcpy(&scratch[0], vn, oprsz); \
+ } \
+ if (vd >= vm && vm < vd + 2 * sizeof(ARMVectorReg)) { \
+ m = memcpy(&scratch[1], vm, oprsz); \
+ } \
+ for (size_t r = 0; r < 2; ++r) { \
+ TYPE *d = vd + r * sizeof(ARMVectorReg); \
+ size_t base = r * pairs; \
+ for (size_t p = 0; p < pairs; ++p) { \
+ d[H(2 * p + 0)] = n[base + H(p)]; \
+ d[H(2 * p + 1)] = m[base + H(p)]; \
+ } \
+ } \
+}
+
+ZIP2(sme2_zip2_b, uint8_t, H1)
+ZIP2(sme2_zip2_h, uint16_t, H2)
+ZIP2(sme2_zip2_s, uint32_t, H4)
+ZIP2(sme2_zip2_d, uint64_t, )
+ZIP2(sme2_zip2_q, Int128, )
+
+#undef ZIP2
+
#define ZIP4(NAME, TYPE, H) \
void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
{ \
@@ -1858,6 +1889,33 @@ ZIP4(sme2_zip4_q, Int128, )
#undef ZIP4
+#define UZP2(NAME, TYPE, H) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, uint32_t desc) \
+{ \
+ ARMVectorReg scratch __attribute__((uninitialized)); \
+ size_t oprsz = simd_oprsz(desc); \
+ size_t pairs = oprsz / (sizeof(TYPE) * 2); \
+ TYPE *d0 = vd, *d1 = vd + sizeof(ARMVectorReg); \
+ if (vd >= vm && vm < vd + 2 * sizeof(ARMVectorReg)) { \
+ vm = memcpy(&scratch, vm, oprsz); \
+ } \
+ for (size_t r = 0; r < 2; ++r) { \
+ TYPE *s = r ? vm : vn; \
+ for (size_t p = 0; p < pairs; ++p) { \
+ d0[H(p)] = s[H(2 * p + 0)]; \
+ d1[H(p)] = s[H(2 * p + 1)]; \
+ } \
+ } \
+}
+
+UZP2(sme2_uzp2_b, uint8_t, H1)
+UZP2(sme2_uzp2_h, uint16_t, H2)
+UZP2(sme2_uzp2_s, uint32_t, H4)
+UZP2(sme2_uzp2_d, uint64_t, )
+UZP2(sme2_uzp2_q, Int128, )
+
+#undef UZP2
+
#define UZP4(NAME, TYPE, H) \
void HELPER(NAME)(void *vd, void *vs, uint32_t desc) \
{ \
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index 55c3310e9b..c449760add 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1427,3 +1427,40 @@ TRANS_FEAT(UQRSHRN_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshrn_sb)
TRANS_FEAT(UQRSHRN_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_uqrshrn_dh)
TRANS_FEAT(SQRSHRUN_sb, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshrun_sb)
TRANS_FEAT(SQRSHRUN_dh, aa64_sme2, do_zz_rshr, a, gen_helper_sme2_sqrshrun_dh)
+
+static bool do_zipuzp_2(DisasContext *s, arg_zzz_e *a,
+ gen_helper_gvec_3 * const fn[5])
+{
+ int svl = streaming_vec_reg_size(s);
+
+ /* MO_128 can fail the size test. */
+ if (svl < (2 << a->esz)) {
+ return false;
+ }
+ if (sme_sm_enabled_check(s)) {
+ tcg_gen_gvec_3_ool(vec_full_reg_offset(s, a->zd),
+ vec_full_reg_offset(s, a->zn),
+ vec_full_reg_offset(s, a->zm),
+ svl, svl, 0, fn[a->esz]);
+ }
+ return true;
+}
+
+static gen_helper_gvec_3 * const zip2_fns[] = {
+ gen_helper_sme2_zip2_b,
+ gen_helper_sme2_zip2_h,
+ gen_helper_sme2_zip2_s,
+ gen_helper_sme2_zip2_d,
+ gen_helper_sme2_zip2_q,
+};
+TRANS_FEAT(ZIP_2, aa64_sme2, do_zipuzp_2, a, zip2_fns)
+
+static gen_helper_gvec_3 * const uzp2_fns[] = {
+ gen_helper_sme2_uzp2_b,
+ gen_helper_sme2_uzp2_h,
+ gen_helper_sme2_uzp2_s,
+ gen_helper_sme2_uzp2_d,
+ gen_helper_sme2_uzp2_q,
+};
+TRANS_FEAT(UZP_2, aa64_sme2, do_zipuzp_2, a, uzp2_fns)
+
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index b28113a883..bd485a26f0 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -843,3 +843,15 @@ UQRSHRN_sb 11000001 011 ..... 110111 ...01 ..... @rshr_sb
UQRSHRN_dh 11000001 1.1 ..... 110111 ...01 ..... @rshr_dh
SQRSHRUN_sb 11000001 011 ..... 110111 ...10 ..... @rshr_sb
SQRSHRUN_dh 11000001 1.1 ..... 110111 ...10 ..... @rshr_dh
+
+&zzz_e zd zn zm esz
+
+ZIP_2 11000001 esz:2 1 zm:5 110100 zn:5 .... 0 \
+ &zzz_e zd=%zd_ax2
+ZIP_2 11000001 00 1 zm:5 110101 zn:5 .... 0 \
+ &zzz_e zd=%zd_ax2 esz=4
+
+UZP_2 11000001 esz:2 1 zm:5 110100 zn:5 .... 1 \
+ &zzz_e zd=%zd_ax2
+UZP_2 11000001 00 1 zm:5 110101 zn:5 .... 1 \
+ &zzz_e zd=%zd_ax2 esz=4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 59/61] target/arm: Implement SME2 FCLAMP, SCLAMP, UCLAMP
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (57 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 58/61] target/arm: Implement SME2 ZIP, UZP (two registers) Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 60/61] target/arm: Implement SME2 SEL Richard Henderson
` (2 subsequent siblings)
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/helper-sme.h | 15 +++++++++
target/arm/tcg/sme_helper.c | 52 +++++++++++++++++++++++++++++++
target/arm/tcg/translate-sme.c | 56 ++++++++++++++++++++++++++++++++++
target/arm/tcg/sme.decode | 17 +++++++++++
4 files changed, 140 insertions(+)
diff --git a/target/arm/tcg/helper-sme.h b/target/arm/tcg/helper-sme.h
index a4216c8f3b..6f66e84f38 100644
--- a/target/arm/tcg/helper-sme.h
+++ b/target/arm/tcg/helper-sme.h
@@ -293,3 +293,18 @@ DEF_HELPER_FLAGS_3(sme2_sqrshrun_sb, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_sqrshrn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_uqrshrn_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
DEF_HELPER_FLAGS_3(sme2_sqrshrun_dh, TCG_CALL_NO_RWG, void, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(sme2_sclamp_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_sclamp_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_sclamp_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_sclamp_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_4(sme2_uclamp_b, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uclamp_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uclamp_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_4(sme2_uclamp_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, i32)
+
+DEF_HELPER_FLAGS_5(sme2_fclamp_h, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(sme2_fclamp_s, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(sme2_fclamp_d, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
+DEF_HELPER_FLAGS_5(sme2_bfclamp, TCG_CALL_NO_RWG, void, ptr, ptr, ptr, fpst, i32)
diff --git a/target/arm/tcg/sme_helper.c b/target/arm/tcg/sme_helper.c
index a88b602314..850d130512 100644
--- a/target/arm/tcg/sme_helper.c
+++ b/target/arm/tcg/sme_helper.c
@@ -1949,3 +1949,55 @@ UZP4(sme2_uzp4_d, uint64_t, )
UZP4(sme2_uzp4_q, Int128, )
#undef UZP4
+
+#define ICLAMP(NAME, TYPE, H) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, uint32_t desc) \
+{ \
+ size_t stride = sizeof(ARMVectorReg) / sizeof(TYPE); \
+ size_t elements = simd_oprsz(desc) / sizeof(TYPE); \
+ size_t nreg = simd_data(desc); \
+ TYPE *d = vd, *n = vn, *m = vm; \
+ for (size_t e = 0; e < elements; e++) { \
+ TYPE nn = n[H(e)], mm = m[H(e)]; \
+ for (size_t r = 0; r < nreg; r++) { \
+ TYPE *dd = &d[r * stride + H(e)]; \
+ *dd = MIN(MAX(*dd, nn), mm); \
+ } \
+ } \
+}
+
+ICLAMP(sme2_sclamp_b, int8_t, H1)
+ICLAMP(sme2_sclamp_h, int16_t, H2)
+ICLAMP(sme2_sclamp_s, int32_t, H4)
+ICLAMP(sme2_sclamp_d, int64_t, H8)
+
+ICLAMP(sme2_uclamp_b, uint8_t, H1)
+ICLAMP(sme2_uclamp_h, uint16_t, H2)
+ICLAMP(sme2_uclamp_s, uint32_t, H4)
+ICLAMP(sme2_uclamp_d, uint64_t, H8)
+
+#undef ICLAMP
+
+#define FCLAMP(NAME, TYPE, H) \
+void HELPER(NAME)(void *vd, void *vn, void *vm, \
+ float_status *fpst, uint32_t desc) \
+{ \
+ size_t stride = sizeof(ARMVectorReg) / sizeof(TYPE); \
+ size_t elements = simd_oprsz(desc) / sizeof(TYPE); \
+ size_t nreg = simd_data(desc); \
+ TYPE *d = vd, *n = vn, *m = vm; \
+ for (size_t e = 0; e < elements; e++) { \
+ TYPE nn = n[H(e)], mm = m[H(e)]; \
+ for (size_t r = 0; r < nreg; r++) { \
+ TYPE *dd = &d[r * stride + H(e)]; \
+ *dd = TYPE##_minnum(TYPE##_maxnum(*dd, nn, fpst), mm, fpst); \
+ } \
+ } \
+}
+
+FCLAMP(sme2_fclamp_h, float16, H2)
+FCLAMP(sme2_fclamp_s, float32, H4)
+FCLAMP(sme2_fclamp_d, float64, H8)
+FCLAMP(sme2_bfclamp, bfloat16, H2)
+
+#undef FCLAMP
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index c449760add..e71e3ec8e3 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1464,3 +1464,59 @@ static gen_helper_gvec_3 * const uzp2_fns[] = {
};
TRANS_FEAT(UZP_2, aa64_sme2, do_zipuzp_2, a, uzp2_fns)
+static bool trans_FCLAMP(DisasContext *s, arg_zzz_en *a)
+{
+ static gen_helper_gvec_3_ptr * const fn[] = {
+ gen_helper_sme2_bfclamp,
+ gen_helper_sme2_fclamp_h,
+ gen_helper_sme2_fclamp_s,
+ gen_helper_sme2_fclamp_d,
+ };
+
+ /* This insn uses MO_8 to encode BFloat16. */
+ if (!(a->esz == MO_8
+ ? dc_isar_feature(aa64_sme2_b16b16, s)
+ : dc_isar_feature(aa64_sme2, s))) {
+ return false;
+ }
+
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ TCGv_ptr fpst = fpstatus_ptr(a->esz == MO_16 ? FPST_A64_F16 : FPST_A64);
+
+ tcg_gen_gvec_3_ptr(vec_full_reg_offset(s, a->zd),
+ vec_full_reg_offset(s, a->zn),
+ vec_full_reg_offset(s, a->zm),
+ fpst, svl, svl, a->n, fn[a->esz]);
+ }
+ return true;
+}
+
+static bool do_clamp(DisasContext *s, arg_zzz_en *a,
+ gen_helper_gvec_3 * const fn[4])
+{
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ tcg_gen_gvec_3_ool(vec_full_reg_offset(s, a->zd),
+ vec_full_reg_offset(s, a->zn),
+ vec_full_reg_offset(s, a->zm),
+ svl, svl, a->n, fn[a->esz]);
+ }
+ return true;
+}
+
+static gen_helper_gvec_3 * const sclamp_fns[] = {
+ gen_helper_sme2_sclamp_b,
+ gen_helper_sme2_sclamp_h,
+ gen_helper_sme2_sclamp_s,
+ gen_helper_sme2_sclamp_d,
+};
+TRANS_FEAT(SCLAMP, aa64_sme2, do_clamp, a, sclamp_fns)
+
+static gen_helper_gvec_3 * const uclamp_fns[] = {
+ gen_helper_sme2_uclamp_b,
+ gen_helper_sme2_uclamp_h,
+ gen_helper_sme2_uclamp_s,
+ gen_helper_sme2_uclamp_d,
+};
+TRANS_FEAT(UCLAMP, aa64_sme2, do_clamp, a, uclamp_fns)
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index bd485a26f0..9bbf89b15c 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -855,3 +855,20 @@ UZP_2 11000001 esz:2 1 zm:5 110100 zn:5 .... 1 \
&zzz_e zd=%zd_ax2
UZP_2 11000001 00 1 zm:5 110101 zn:5 .... 1 \
&zzz_e zd=%zd_ax2 esz=4
+
+&zzz_en zd zn zm esz n
+
+FCLAMP 11000001 esz:2 1 zm:5 110000 zn:5 .... 0 \
+ &zzz_en zd=%zd_ax2 n=2
+FCLAMP 11000001 esz:2 1 zm:5 110010 zn:5 ...0 0 \
+ &zzz_en zd=%zd_ax4 n=4
+
+SCLAMP 11000001 esz:2 1 zm:5 110001 zn:5 .... 0 \
+ &zzz_en zd=%zd_ax2 n=2
+SCLAMP 11000001 esz:2 1 zm:5 110011 zn:5 ...0 0 \
+ &zzz_en zd=%zd_ax4 n=4
+
+UCLAMP 11000001 esz:2 1 zm:5 110001 zn:5 .... 1 \
+ &zzz_en zd=%zd_ax2 n=2
+UCLAMP 11000001 esz:2 1 zm:5 110011 zn:5 ...0 1 \
+ &zzz_en zd=%zd_ax4 n=4
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 60/61] target/arm: Implement SME2 SEL
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (58 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 59/61] target/arm: Implement SME2 FCLAMP, SCLAMP, UCLAMP Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-06 19:57 ` [PATCH 61/61] target/arm: Enable FEAT_SME2, FEAT_SME_F16F16, FEAT_SVE_B16B16 on -cpu max Richard Henderson
2025-02-24 20:27 ` [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/translate-sme.c | 25 +++++++++++++++++++++++++
target/arm/tcg/sme.decode | 9 +++++++++
2 files changed, 34 insertions(+)
diff --git a/target/arm/tcg/translate-sme.c b/target/arm/tcg/translate-sme.c
index e71e3ec8e3..78bd750701 100644
--- a/target/arm/tcg/translate-sme.c
+++ b/target/arm/tcg/translate-sme.c
@@ -1520,3 +1520,28 @@ static gen_helper_gvec_3 * const uclamp_fns[] = {
gen_helper_sme2_uclamp_d,
};
TRANS_FEAT(UCLAMP, aa64_sme2, do_clamp, a, uclamp_fns)
+
+static bool trans_SEL(DisasContext *s, arg_SEL *a)
+{
+ static gen_helper_gvec_4 * const fns[4] = {
+ gen_helper_sve_sel_zpzz_b, gen_helper_sve_sel_zpzz_h,
+ gen_helper_sve_sel_zpzz_s, gen_helper_sve_sel_zpzz_d
+ };
+
+ if (!dc_isar_feature(aa64_sme2, s)) {
+ return false;
+ }
+ if (sme_sm_enabled_check(s)) {
+ int svl = streaming_vec_reg_size(s);
+ gen_helper_gvec_4 *f = fns[a->esz];
+
+ for (int i = 0; i < a->n; ++i) {
+ tcg_gen_gvec_4_ool(vec_full_reg_offset(s, a->zd + i),
+ vec_full_reg_offset(s, a->zn + i),
+ vec_full_reg_offset(s, a->zm + i),
+ pred_full_reg_offset(s, a->pg),
+ svl, svl, 0, f);
+ }
+ }
+ return true;
+}
diff --git a/target/arm/tcg/sme.decode b/target/arm/tcg/sme.decode
index 9bbf89b15c..1d0ac4f09b 100644
--- a/target/arm/tcg/sme.decode
+++ b/target/arm/tcg/sme.decode
@@ -872,3 +872,12 @@ UCLAMP 11000001 esz:2 1 zm:5 110001 zn:5 .... 1 \
&zzz_en zd=%zd_ax2 n=2
UCLAMP 11000001 esz:2 1 zm:5 110011 zn:5 ...0 1 \
&zzz_en zd=%zd_ax4 n=4
+
+### SME2 Multi-vector SVE Select
+
+%sel_pg 10:3 !function=plus_8
+
+SEL 11000001 esz:2 1 ....0 100 ... ....0 ....0 \
+ n=2 zd=%zd_ax2 zn=%zn_ax2 zm=%zm_ax2 pg=%sel_pg
+SEL 11000001 esz:2 1 ...01 100 ... ...00 ...00 \
+ n=4 zd=%zd_ax4 zn=%zn_ax4 zm=%zm_ax4 pg=%sel_pg
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* [PATCH 61/61] target/arm: Enable FEAT_SME2, FEAT_SME_F16F16, FEAT_SVE_B16B16 on -cpu max
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (59 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 60/61] target/arm: Implement SME2 SEL Richard Henderson
@ 2025-02-06 19:57 ` Richard Henderson
2025-02-24 20:27 ` [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
61 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-06 19:57 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/tcg/cpu64.c | 7 ++++++-
docs/system/arm/emulation.rst | 3 +++
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/target/arm/tcg/cpu64.c b/target/arm/tcg/cpu64.c
index 29ab0ac79d..6fb821ad9a 100644
--- a/target/arm/tcg/cpu64.c
+++ b/target/arm/tcg/cpu64.c
@@ -1194,7 +1194,7 @@ void aarch64_max_tcg_initfn(Object *obj)
*/
t = FIELD_DP64(t, ID_AA64PFR1, MTE, 3); /* FEAT_MTE3 */
t = FIELD_DP64(t, ID_AA64PFR1, RAS_FRAC, 0); /* FEAT_RASv1p1 + FEAT_DoubleFault */
- t = FIELD_DP64(t, ID_AA64PFR1, SME, 1); /* FEAT_SME */
+ t = FIELD_DP64(t, ID_AA64PFR1, SME, 2); /* FEAT_SME2 */
t = FIELD_DP64(t, ID_AA64PFR1, CSV2_FRAC, 0); /* FEAT_CSV2_3 */
t = FIELD_DP64(t, ID_AA64PFR1, NMI, 1); /* FEAT_NMI */
cpu->isar.id_aa64pfr1 = t;
@@ -1264,11 +1264,16 @@ void aarch64_max_tcg_initfn(Object *obj)
t = cpu->isar.id_aa64smfr0;
t = FIELD_DP64(t, ID_AA64SMFR0, F32F32, 1); /* FEAT_SME */
+ t = FIELD_DP64(t, ID_AA64SMFR0, BI32I32, 1); /* FEAT_SME2 */
t = FIELD_DP64(t, ID_AA64SMFR0, B16F32, 1); /* FEAT_SME */
t = FIELD_DP64(t, ID_AA64SMFR0, F16F32, 1); /* FEAT_SME */
t = FIELD_DP64(t, ID_AA64SMFR0, I8I32, 0xf); /* FEAT_SME */
+ t = FIELD_DP64(t, ID_AA64SMFR0, F16F16, 1); /* FEAT_SME_F16F16 */
+ t = FIELD_DP64(t, ID_AA64SMFR0, B16B16, 1); /* FEAT_SVE_B16B16 */
+ t = FIELD_DP64(t, ID_AA64SMFR0, I16I32, 5); /* FEAT_SME2 */
t = FIELD_DP64(t, ID_AA64SMFR0, F64F64, 1); /* FEAT_SME_F64F64 */
t = FIELD_DP64(t, ID_AA64SMFR0, I16I64, 0xf); /* FEAT_SME_I16I64 */
+ t = FIELD_DP64(t, ID_AA64SMFR0, SMEVER, 1); /* FEAT_SME2 */
t = FIELD_DP64(t, ID_AA64SMFR0, FA64, 1); /* FEAT_SME_FA64 */
cpu->isar.id_aa64smfr0 = t;
diff --git a/docs/system/arm/emulation.rst b/docs/system/arm/emulation.rst
index 78c2fd2113..d063571556 100644
--- a/docs/system/arm/emulation.rst
+++ b/docs/system/arm/emulation.rst
@@ -129,11 +129,14 @@ the following architecture extensions:
- FEAT_SM3 (Advanced SIMD SM3 instructions)
- FEAT_SM4 (Advanced SIMD SM4 instructions)
- FEAT_SME (Scalable Matrix Extension)
+- FEAT_SME2 (Scalable Matrix Extension version 2)
- FEAT_SME_FA64 (Full A64 instruction set in Streaming SVE mode)
+- FEAT_SME_F16F16 (Non-widening half-precision FP16 arithmetic for SME2)
- FEAT_SME_F64F64 (Double-precision floating-point outer product instructions)
- FEAT_SME_I16I64 (16-bit to 64-bit integer widening outer product instructions)
- FEAT_SVE (Scalable Vector Extension)
- FEAT_SVE_AES (Scalable Vector AES instructions)
+- FEAT_SVE_B16B16 (Non-widening BFloat16 arithmetic for SVE2 and SME2)
- FEAT_SVE_BitPerm (Scalable Vector Bit Permutes instructions)
- FEAT_SVE_PMULL128 (Scalable Vector PMULL instructions)
- FEAT_SVE_SHA3 (Scalable Vector SHA3 instructions)
--
2.43.0
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH 00/61] target/arm: Implement FEAT_SME2
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
` (60 preceding siblings ...)
2025-02-06 19:57 ` [PATCH 61/61] target/arm: Enable FEAT_SME2, FEAT_SME_F16F16, FEAT_SVE_B16B16 on -cpu max Richard Henderson
@ 2025-02-24 20:27 ` Richard Henderson
2025-02-24 20:35 ` Richard Henderson
61 siblings, 1 reply; 64+ messages in thread
From: Richard Henderson @ 2025-02-24 20:27 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
On 2/6/25 11:56, Richard Henderson wrote:
> Based-on:20250201164012.1660228-1-peter.maydell@linaro.org
> ("[PATCH v2 00/69] target/arm: FEAT_AFP and FEAT_RPRES")
>
> This implements the Scalar Matrix Extensions, version 2, plus two
> trivial extensions for float16 and bfloat16.
>
> This hasn't been tested much at all; I need to either get FVP up and
> running for RISU comparison, or write some stand-alone test cases.
> But in the meantime this could use some eyes.
>
> SME2 is the first vector-like extension we've had that has dynamic
> indexing of registers: ZArray[(rv + offset) % svl], where RV is a
> general register. So the first thing I do is extend TCG's gvec
> support to handle TCGv_ptr base + offset addressing. I only changed
> enough to handle what I needed within SME2; changing it all would be
> a big job, and it would (at least for the moment) remain unused.
>
> Still to-do are few more extensions for SME2p1.
Ho hum. I've missed the entire set of counted predicate insns.
Hooray for Arm sample code!
r~
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH 00/61] target/arm: Implement FEAT_SME2
2025-02-24 20:27 ` [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
@ 2025-02-24 20:35 ` Richard Henderson
0 siblings, 0 replies; 64+ messages in thread
From: Richard Henderson @ 2025-02-24 20:35 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
On 2/24/25 12:27, Richard Henderson wrote:
> On 2/6/25 11:56, Richard Henderson wrote:
>> Based-on:20250201164012.1660228-1-peter.maydell@linaro.org
>> ("[PATCH v2 00/69] target/arm: FEAT_AFP and FEAT_RPRES")
>>
>> This implements the Scalar Matrix Extensions, version 2, plus two
>> trivial extensions for float16 and bfloat16.
>>
>> This hasn't been tested much at all; I need to either get FVP up and
>> running for RISU comparison, or write some stand-alone test cases.
>> But in the meantime this could use some eyes.
>>
>> SME2 is the first vector-like extension we've had that has dynamic
>> indexing of registers: ZArray[(rv + offset) % svl], where RV is a
>> general register. So the first thing I do is extend TCG's gvec
>> support to handle TCGv_ptr base + offset addressing. I only changed
>> enough to handle what I needed within SME2; changing it all would be
>> a big job, and it would (at least for the moment) remain unused.
>>
>> Still to-do are few more extensions for SME2p1.
>
> Ho hum. I've missed the entire set of counted predicate insns.
> Hooray for Arm sample code!
Oh, duh -- they're in SME2p1, which is I guess not as optional as I assumed.
r~
^ permalink raw reply [flat|nested] 64+ messages in thread
end of thread, other threads:[~2025-02-24 20:36 UTC | newest]
Thread overview: 64+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-06 19:56 [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
2025-02-06 19:56 ` [PATCH 01/61] tcg: Add dbase argument to do_dup_store Richard Henderson
2025-02-06 19:56 ` [PATCH 02/61] tcg: Add dbase argument to do_dup Richard Henderson
2025-02-06 19:56 ` [PATCH 03/61] tcg: Add dbase argument to expand_clr Richard Henderson
2025-02-06 19:56 ` [PATCH 04/61] tcg: Add base arguments to check_overlap_[234] Richard Henderson
2025-02-06 19:56 ` [PATCH 05/61] tcg: Split out tcg_gen_gvec_2_var Richard Henderson
2025-02-06 19:56 ` [PATCH 06/61] tcg: Split out tcg_gen_gvec_3_var Richard Henderson
2025-02-06 19:56 ` [PATCH 07/61] tcg: Split out tcg_gen_gvec_mov_var Richard Henderson
2025-02-06 19:56 ` [PATCH 08/61] tcg: Split out tcg_gen_gvec_{add,sub}_var Richard Henderson
2025-02-06 19:56 ` [PATCH 09/61] target/arm: Introduce FPST_ZA, FPST_ZA_F16 Richard Henderson
2025-02-06 19:56 ` [PATCH 10/61] target/arm: Use FPST_ZA for sme_fmopa_[hsd] Richard Henderson
2025-02-06 19:56 ` [PATCH 11/61] target/arm: Rename zarray to za_state.za Richard Henderson
2025-02-06 19:56 ` [PATCH 12/61] target/arm: Add isar_feature_aa64_sme2* Richard Henderson
2025-02-06 19:56 ` [PATCH 13/61] target/arm: Add ZT0 Richard Henderson
2025-02-06 19:56 ` [PATCH 14/61] target/arm: Add zt0_excp_el to DisasContext Richard Henderson
2025-02-06 19:56 ` [PATCH 15/61] target/arm: Implement SME2 ZERO ZT0 Richard Henderson
2025-02-06 19:56 ` [PATCH 16/61] target/arm: Implement SME2 LDR/STR ZT0 Richard Henderson
2025-02-06 19:56 ` [PATCH 17/61] target/arm: Implement SME2 MOVT Richard Henderson
2025-02-06 19:56 ` [PATCH 18/61] target/arm: Split get_tile_rowcol argument tile_index Richard Henderson
2025-02-06 19:56 ` [PATCH 19/61] target/arm: Rename MOVA for translate Richard Henderson
2025-02-06 19:56 ` [PATCH 20/61] target/arm: Implement SME2 MOVA to/from tile, multiple registers Richard Henderson
2025-02-06 19:56 ` [PATCH 21/61] target/arm: Split out get_zarray Richard Henderson
2025-02-06 19:56 ` [PATCH 22/61] target/arm: Implement SME2 MOVA to/from array, multiple registers Richard Henderson
2025-02-06 19:56 ` [PATCH 23/61] target/arm: Implement SME2 BMOPA Richard Henderson
2025-02-06 19:56 ` [PATCH 24/61] target/arm: Implement SME2 SMOPS, UMOPS (2-way) Richard Henderson
2025-02-06 19:56 ` [PATCH 25/61] target/arm: Introduce gen_gvec_sve2_sqdmulh Richard Henderson
2025-02-06 19:56 ` [PATCH 26/61] target/arm: Implement SME2 Multiple and Single SVE Destructive Richard Henderson
2025-02-06 19:56 ` [PATCH 27/61] target/arm: Implement SME2 Multiple Vectors " Richard Henderson
2025-02-06 19:56 ` [PATCH 28/61] target/arm: Implement SME2 ADD/SUB (array results, multiple and single vector) Richard Henderson
2025-02-06 19:56 ` [PATCH 29/61] target/arm: Implement SME2 ADD/SUB (array results, multiple vectors) Richard Henderson
2025-02-06 19:56 ` [PATCH 30/61] target/arm: Pass ZA to helper_sve2_fmlal_zz[zx]w_s Richard Henderson
2025-02-06 19:56 ` [PATCH 31/61] target/arm: Implement SME2 FMLAL, BFMLAL Richard Henderson
2025-02-06 19:56 ` [PATCH 32/61] target/arm: Implement SME2 FDOT Richard Henderson
2025-02-06 19:56 ` [PATCH 33/61] target/arm: Implement SME2 BFDOT Richard Henderson
2025-02-06 19:56 ` [PATCH 34/61] target/arm: Implement SME2 FVDOT, BFVDOT Richard Henderson
2025-02-06 19:56 ` [PATCH 35/61] target/arm: Rename helper_gvec_*dot_[bh] to *_4[bh] Richard Henderson
2025-02-06 19:56 ` [PATCH 36/61] target/arm: Remove helper_gvec_sudot_idx_4b Richard Henderson
2025-02-06 19:56 ` [PATCH 37/61] target/arm: Implemement SME2 SDOT, UDOT, USDOT, SUDOT Richard Henderson
2025-02-06 19:56 ` [PATCH 38/61] target/arm: Implement SME2 SVDOT, UVDOT, SUVDOT, USVDOT Richard Henderson
2025-02-06 19:56 ` [PATCH 39/61] target/arm: Implement SME2 SMLAL, SMLSL, UMLAL, UMLSL Richard Henderson
2025-02-06 19:56 ` [PATCH 40/61] target/arm: Implement SME2 SMLALL, SMLSLL, UMLALL, UMLSLL Richard Henderson
2025-02-06 19:56 ` [PATCH 41/61] target/arm: Rename gvec_fml[as]_[hs] with _nf_ infix Richard Henderson
2025-02-06 19:56 ` [PATCH 42/61] target/arm: Implement SME2 FMLA, FMLS Richard Henderson
2025-02-06 19:56 ` [PATCH 43/61] target/arm: Implement SME2 BFMLA, BFMLS Richard Henderson
2025-02-06 19:56 ` [PATCH 44/61] target/arm: Implement SME2 FADD, FSUB, BFADD, BFSUB Richard Henderson
2025-02-06 19:56 ` [PATCH 45/61] target/arm: Remove CPUARMState.vfp.scratch Richard Henderson
2025-02-06 19:57 ` [PATCH 46/61] target/arm: Implement SME2 BFCVT, BFCVTN, FCVT, FCVTN Richard Henderson
2025-02-06 19:57 ` [PATCH 47/61] target/arm: Implement SME2 FCVT (widening), FCVTL Richard Henderson
2025-02-06 19:57 ` [PATCH 48/61] target/arm: Implement SME2 FCVTZS, FCVTZU Richard Henderson
2025-02-06 19:57 ` [PATCH 49/61] target/arm: Implement SME2 SCVTF, UCVTF Richard Henderson
2025-02-06 19:57 ` [PATCH 50/61] target/arm: Implement SME2 FRINTN, FRINTP, FRINTM, FRINTA Richard Henderson
2025-02-06 19:57 ` [PATCH 51/61] target/arm: Introduce do_[us]sat_[bhs] macros Richard Henderson
2025-02-06 19:57 ` [PATCH 52/61] target/arm: Use do_[us]sat_[bhs] in sve_helper.c Richard Henderson
2025-02-06 19:57 ` [PATCH 53/61] target/arm: Implement SME2 SQCVT, UQCVT, SQCVTU Richard Henderson
2025-02-06 19:57 ` [PATCH 54/61] target/arm: Implement SME2 SUNPK, UUNPK Richard Henderson
2025-02-06 19:57 ` [PATCH 55/61] target/arm: Implement SME2 ZIP, UZP (four registers) Richard Henderson
2025-02-06 19:57 ` [PATCH 56/61] target/arm: Move do_urshr, do_srshr to vec_internal.h Richard Henderson
2025-02-06 19:57 ` [PATCH 57/61] target/arm: Implement SME2 SQRSHR, UQRSHR, SQRSHRN Richard Henderson
2025-02-06 19:57 ` [PATCH 58/61] target/arm: Implement SME2 ZIP, UZP (two registers) Richard Henderson
2025-02-06 19:57 ` [PATCH 59/61] target/arm: Implement SME2 FCLAMP, SCLAMP, UCLAMP Richard Henderson
2025-02-06 19:57 ` [PATCH 60/61] target/arm: Implement SME2 SEL Richard Henderson
2025-02-06 19:57 ` [PATCH 61/61] target/arm: Enable FEAT_SME2, FEAT_SME_F16F16, FEAT_SVE_B16B16 on -cpu max Richard Henderson
2025-02-24 20:27 ` [PATCH 00/61] target/arm: Implement FEAT_SME2 Richard Henderson
2025-02-24 20:35 ` Richard Henderson
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).