* [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
@ 2017-08-17 18:03 Alex Bennée
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 1/9] tcg/README: listify the TCG types Alex Bennée
` (10 more replies)
0 siblings, 11 replies; 20+ messages in thread
From: Alex Bennée @ 2017-08-17 18:03 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée
Hi,
With upcoming work on SVE I've been looking at the way we implement
vector registers in QEMU's TCG. The current orthodoxy is to decompose
the vector into a series of TCG registers, often calling a helper
function for the calculation of each element. The result of the helper
is then stored back in the vector representation afterwards.
There are occasional outliers like simd_tbl which access elements
directly from a passed CPUFooState env pointer but these are rare.
This series introduces the concept of a TCGv_vec type. This is a pointer
to the start of the in memory representation of an arbitrarily long
vector register. This is passed to a helper function as a pointer
along with a normal TCG register containing information about the
actual vector length and any additional information the helper needs
to do the operation. The hope* is this saves on the churn of having
the TCG do things element by element and allows the compiler to use
native vector operations to streamline the helpers.
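As a rough illustration of the calling convention described above (not
code from this series): the helper receives pointers to the in-memory
vectors plus an ordinary integer carrying the element count. The name
sketch_vec_add_u32 and the plain element-count argument are assumptions
made for this example only.

```c
#include <stdint.h>

/*
 * Hypothetical whole-vector helper: d, n and m point at the in-memory
 * representation of three vector registers, nelem is the number of
 * 32-bit lanes to process. A real helper would more likely unpack
 * nelem from a descriptor word than take it directly.
 */
static void sketch_vec_add_u32(void *d, const void *n, const void *m,
                               uint32_t nelem)
{
    uint32_t *rd = d;
    const uint32_t *rn = n;
    const uint32_t *rm = m;
    uint32_t i;

    /* One plain loop over lanes; the host compiler may vectorise it. */
    for (i = 0; i < nelem; i++) {
        rd[i] = rn[i] + rm[i];
    }
}
```

Calling it with nelem = 4 on three 16-byte buffers adds four lanes in a
single helper call instead of one call per element.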
There are some downsides to this approach. The first is you have to be
careful about register aliasing. If you are doing a same reg to same
reg operation you need to make a copy of the vector so you don't
trample your input data as you go. The second is this involves
changing some of the assumptions the TCG makes about things. I've
managed to keep all the changes within the core TCG code for now but
so far it has only been tested for the tcg_call path which is the only
place where TCGv_vecs should turn up. It is possible to do the same
thing without touching the TCG code generation by using TCGv_ptrs and
manually emitting tcg_addi ops to pass the correct address. Richard
has been exploring this approach with his series. The downside of that
is you do miss the ability to have named global vector registers which
makes reading the TCG dumps a little easier.
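The aliasing hazard mentioned above can be sketched as follows. This is
an illustration under assumptions, not code from the series: the name
sketch_widen_mul, the fixed 128-bit register size and the unconditional
copy are all choices made for the example.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_VEC_BYTES 16 /* assumed 128-bit register for this example */

/*
 * Hypothetical widening multiply: the low four 16-bit lanes of n times
 * a scalar, written as four 32-bit lanes of d. If d and n are the same
 * register, writing d[0] would overwrite n[0] and n[1] before n[1] has
 * been read, so the input is snapshotted first.
 */
static void sketch_widen_mul(void *d, const void *n, int16_t scalar)
{
    int16_t tmp[SKETCH_VEC_BYTES / sizeof(int16_t)];
    int32_t *rd = d;
    int i;

    memcpy(tmp, n, sizeof(tmp)); /* copy taken before any lane of d is written */
    for (i = 0; i < 4; i++) {
        rd[i] = (int32_t)tmp[i] * scalar;
    }
}
```

A real helper would probably only take the copy when the two pointers
actually overlap; the sketch copies unconditionally to stay short.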
I've only patched one helper in this series which implements the
indexed smull. This is because it appears in the profiles for my test
case which was using an arm64 ffmpeg to transcode:
./ffmpeg.arm64 -i big_buck_bunny_480p_surround-fix.avi \
-threads 1 -qscale:v 3 -f null -
* hope. On an earlier revision (which included sqshrn conversions) I
had measured a minor saving but this had disappeared once I measured
the final code. However the profile is fairly dominated by
softfloat.
master:
8.05% qemu-aarch64 qemu-aarch64 [.] roundAndPackFloat32
7.28% qemu-aarch64 qemu-aarch64 [.] float32_mul
6.56% qemu-aarch64 qemu-aarch64 [.] helper_lookup_tb_ptr
5.31% qemu-aarch64 qemu-aarch64 [.] float32_muladd
4.09% qemu-aarch64 qemu-aarch64 [.] helper_neon_mull_s16
4.00% qemu-aarch64 qemu-aarch64 [.] addFloat32Sigs
3.86% qemu-aarch64 qemu-aarch64 [.] subFloat32Sigs
2.26% qemu-aarch64 qemu-aarch64 [.] helper_simd_tbl
2.00% qemu-aarch64 qemu-aarch64 [.] float32_add
1.81% qemu-aarch64 qemu-aarch64 [.] helper_neon_unarrow_sat8
1.64% qemu-aarch64 qemu-aarch64 [.] float32_sub
1.43% qemu-aarch64 qemu-aarch64 [.] helper_neon_subl_u32
0.98% qemu-aarch64 qemu-aarch64 [.] helper_neon_widen_u8
tcg-native-vectors-rfc:
7.93% qemu-aarch64 qemu-aarch64 [.] roundAndPackFloat32
7.54% qemu-aarch64 qemu-aarch64 [.] float32_mul
6.29% qemu-aarch64 qemu-aarch64 [.] helper_lookup_tb_ptr
5.39% qemu-aarch64 qemu-aarch64 [.] float32_muladd
3.92% qemu-aarch64 qemu-aarch64 [.] addFloat32Sigs
3.86% qemu-aarch64 qemu-aarch64 [.] subFloat32Sigs
3.62% qemu-aarch64 qemu-aarch64 [.] helper_advsimd_smull_idx_s32
2.19% qemu-aarch64 qemu-aarch64 [.] helper_simd_tbl
2.09% qemu-aarch64 qemu-aarch64 [.] helper_neon_mull_s16
1.99% qemu-aarch64 qemu-aarch64 [.] float32_add
1.79% qemu-aarch64 qemu-aarch64 [.] helper_neon_unarrow_sat8
1.62% qemu-aarch64 qemu-aarch64 [.] float32_sub
1.43% qemu-aarch64 qemu-aarch64 [.] helper_neon_subl_u32
1.00% qemu-aarch64 qemu-aarch64 [.] helper_neon_widen_u8
0.98% qemu-aarch64 qemu-aarch64 [.] helper_neon_addl_u32
At the moment the default compiler settings don't actually vectorise
the helper. I could get it to once I added some alignment guarantees
but the casting I did broke the instruction emulation so I haven't
included that patch in this series.
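For readers unfamiliar with alignment hints, one common way to give
GCC or Clang such a guarantee is __builtin_assume_aligned. The sketch
below only illustrates that builtin; it is not the patch that was
dropped from this series, and the function name is made up. Whether
the loop is actually vectorised still depends on the optimisation
level and target flags.

```c
#include <stdint.h>

/*
 * Sketch only: promise the compiler that both buffers are 16-byte
 * aligned. The promise must really hold for every caller, otherwise
 * the behaviour is undefined.
 */
static void sketch_scale_s16_to_s32(int32_t *d, const int16_t *n,
                                    int16_t scalar, int count)
{
    int32_t *rd = __builtin_assume_aligned(d, 16);
    const int16_t *rn = __builtin_assume_aligned(n, 16);
    int i;

    for (i = 0; i < count; i++) {
        rd[i] = (int32_t)rn[i] * scalar;
    }
}
```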
Given the results why continue investigating this? Well for one thing
vector sizes are growing: SVE vectors are up to 2048 bits long. Those
longer vectors should offer more scope for the host compiler to
generate efficient code in the helper. Also vector operations tend to
be quite complex, so being able to handle them in C code instead of
TCGOps might be preferable from a code maintainability point of view.
Finally this noddy little experiment has at least shown it doesn't
worsen performance. It would be nice if I could find a benchmark that
made heavy use of non-floating point SIMD instructions to better
measure the effect of marshalling elements vs vectorised helpers. If
anyone has any suggestions I'm all ears ;-)
Anyway questions, comments?
Alex Bennée (9):
tcg/README: listify the TCG types.
tcg: introduce the concepts of a TCGv_vec register type
tcg: generate ptrs to vector registers
helper-head: add support for vec type
arm/cpu.h: align VFP registers
target/arm/translate-a64: regnames -> x_regnames
target/arm/translate-a64: register global vectors
target/arm/helpers: introduce ADVSIMD flags
target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]
include/exec/helper-head.h | 5 ++
target/arm/advsimd_helper_flags.h | 50 ++++++++++++++++++++
target/arm/cpu.h | 4 +-
target/arm/helper-a64.c | 18 ++++++++
target/arm/helper-a64.h | 2 +
target/arm/translate-a64.c | 97 +++++++++++++++++++++++++++++++++++++--
tcg/README | 10 ++--
tcg/tcg.c | 26 ++++++++++-
tcg/tcg.h | 20 ++++++++
9 files changed, 222 insertions(+), 10 deletions(-)
create mode 100644 target/arm/advsimd_helper_flags.h
--
2.13.0
* [Qemu-devel] [RFC PATCH 1/9] tcg/README: listify the TCG types.
From: Alex Bennée @ 2017-08-17 18:03 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée
Although the other types are aliases let's make it clear what TCG types
are available.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
tcg/README | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/tcg/README b/tcg/README
index 03bfb6acd4..f116b7b694 100644
--- a/tcg/README
+++ b/tcg/README
@@ -53,9 +53,12 @@ an "undefined result".
TCG instructions operate on variables which are temporaries, local
temporaries or globals. TCG instructions and variables are strongly
-typed. Two types are supported: 32 bit integers and 64 bit
-integers. Pointers are defined as an alias to 32 bit or 64 bit
-integers depending on the TCG target word size.
+typed. A number of types are supported:
+
+ TCGv_i32 - 32 bit integer
+ TCGv_i64 - 64 bit integer
+ TCGv - target pointer (aliased to 32 or 64 bit integer)
+ TCGv_ptr - host pointer (used for direct access to host structures)
Each instruction has a fixed number of output variable operands, input
variable operands and always constant operands.
--
2.13.0
* [Qemu-devel] [RFC PATCH 2/9] tcg: introduce the concepts of a TCGv_vec register type
From: Alex Bennée @ 2017-08-17 18:03 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée
Currently it only makes sense for globals - i.e. registers directly
mapped to CPUEnv.
---
tcg/README | 1 +
tcg/tcg.h | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/tcg/README b/tcg/README
index f116b7b694..e0868d95b4 100644
--- a/tcg/README
+++ b/tcg/README
@@ -57,6 +57,7 @@ typed. A number of types are supported:
TCGv_i32 - 32 bit integer
TCGv_i64 - 64 bit integer
+ TCGv_vec - an arbitrary sized vector register
TCGv - target pointer (aliased to 32 or 64 bit integer)
TCGv_ptr - host pointer (used for direct access to host structures)
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 17b7750ee6..d75636b6ab 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -256,6 +256,7 @@ typedef struct TCGPool {
typedef enum TCGType {
TCG_TYPE_I32,
TCG_TYPE_I64,
+ TCG_TYPE_VECTOR,
TCG_TYPE_COUNT, /* number of different types */
/* An alias for the size of the host register. */
@@ -431,6 +432,7 @@ typedef tcg_target_ulong TCGArg;
typedef struct TCGv_i32_d *TCGv_i32;
typedef struct TCGv_i64_d *TCGv_i64;
typedef struct TCGv_ptr_d *TCGv_ptr;
+typedef struct TCGv_vec_d *TCGv_vec;
typedef TCGv_ptr TCGv_env;
#if TARGET_LONG_BITS == 32
#define TCGv TCGv_i32
@@ -450,6 +452,11 @@ static inline TCGv_i64 QEMU_ARTIFICIAL MAKE_TCGV_I64(intptr_t i)
return (TCGv_i64)i;
}
+static inline TCGv_vec QEMU_ARTIFICIAL MAKE_TCGV_VEC(intptr_t i)
+{
+ return (TCGv_vec)i;
+}
+
static inline TCGv_ptr QEMU_ARTIFICIAL MAKE_TCGV_PTR(intptr_t i)
{
return (TCGv_ptr)i;
@@ -465,6 +472,11 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_I64(TCGv_i64 t)
return (intptr_t)t;
}
+static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_VEC(TCGv_vec t)
+{
+ return (intptr_t)t;
+}
+
static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
{
return (intptr_t)t;
@@ -788,6 +800,7 @@ int tcg_global_mem_new_internal(TCGType, TCGv_ptr, intptr_t, const char *);
TCGv_i32 tcg_global_reg_new_i32(TCGReg reg, const char *name);
TCGv_i64 tcg_global_reg_new_i64(TCGReg reg, const char *name);
+TCGv_vec tcg_global_reg_new_vec(TCGReg reg, const char *name);
TCGv_i32 tcg_temp_new_internal_i32(int temp_local);
TCGv_i64 tcg_temp_new_internal_i64(int temp_local);
@@ -829,6 +842,13 @@ static inline TCGv_i64 tcg_temp_local_new_i64(void)
return tcg_temp_new_internal_i64(1);
}
+static inline TCGv_vec tcg_global_mem_new_vec(TCGv_ptr reg, intptr_t offset,
+ const char *name)
+{
+ int idx = tcg_global_mem_new_internal(TCG_TYPE_VECTOR, reg, offset, name);
+ return MAKE_TCGV_VEC(idx);
+}
+
#if defined(CONFIG_DEBUG_TCG)
/* If you call tcg_clear_temp_count() at the start of a section of
* code which is not supposed to leak any TCG temporaries, then
--
2.13.0
* [Qemu-devel] [RFC PATCH 3/9] tcg: generate ptrs to vector registers
From: Alex Bennée @ 2017-08-17 18:03 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée
As we operate directly on the vectors in memory we pass around the
address for TCG_TYPE_VECTOR. Currently only helpers ever see these
values but if we were to generate simd backend instructions they would
load directly from the backing store.
We also need to ensure when copying from one temp register to the
other the right size is used.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
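A small standalone illustration of the distinction this patch makes
(not part of the patch itself): an integer temp is filled by loading
the value stored at base + offset, whereas a vector temp only receives
the address base + offset. The struct and function names below are
hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical CPU state, standing in for the real env structure. */
struct sketch_env {
    uint64_t xregs[4];
    uint64_t vregs[4][2]; /* each vector register is 128 bits here */
};

/* Integer temp: the value stored at env + offset is loaded. */
static uint64_t sketch_load_i64(const struct sketch_env *env, size_t offset)
{
    uint64_t val;

    memcpy(&val, (const char *)env + offset, sizeof(val));
    return val;
}

/* Vector temp: only the address env + offset is formed, nothing is loaded. */
static void *sketch_vec_addr(struct sketch_env *env, size_t offset)
{
    return (char *)env + offset;
}
```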
tcg/tcg.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 35598296c5..e16811d68d 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -2034,7 +2034,21 @@ static void temp_load(TCGContext *s, TCGTemp *ts, TCGRegSet desired_regs,
break;
case TEMP_VAL_MEM:
reg = tcg_reg_alloc(s, desired_regs, allocated_regs, ts->indirect_base);
- tcg_out_ld(s, ts->type, reg, ts->mem_base->reg, ts->mem_offset);
+ if (ts->type == TCG_TYPE_VECTOR) {
+ /* Vector registers are ptr's to the memory representation */
+ TCGArg args[TCG_MAX_OP_ARGS];
+ int const_args[TCG_MAX_OP_ARGS];
+ args[0] = reg;
+ args[1] = ts->mem_base->reg;
+ args[2] = ts->mem_offset;
+ const_args[0] = 0;
+ const_args[1] = 0;
+ const_args[2] = 1;
+ /* FIXME: needs to be host_ptr centric */
+ tcg_out_op(s, INDEX_op_add_i64, args, const_args);
+ } else {
+ tcg_out_ld(s, ts->type, reg, ts->mem_base->reg, ts->mem_offset);
+ }
ts->mem_coherent = 1;
break;
case TEMP_VAL_DEAD:
@@ -2196,6 +2210,10 @@ static void tcg_reg_alloc_mov(TCGContext *s, const TCGOpDef *def,
ots->reg = tcg_reg_alloc(s, tcg_target_available_regs[otype],
allocated_regs, ots->indirect_base);
}
+ /* For the purposes of moving stuff about it is a host ptr */
+ if (otype == TCG_TYPE_VECTOR) {
+ otype = TCG_TYPE_PTR;
+ }
tcg_out_mov(s, otype, ots->reg, ts->reg);
}
ots->val_type = TEMP_VAL_REG;
@@ -2440,7 +2458,11 @@ static void tcg_reg_alloc_call(TCGContext *s, int nb_oargs, int nb_iargs,
if (ts->val_type == TEMP_VAL_REG) {
if (ts->reg != reg) {
- tcg_out_mov(s, ts->type, reg, ts->reg);
+ if (ts->type == TCG_TYPE_VECTOR) {
+ tcg_out_mov(s, TCG_TYPE_PTR, reg, ts->reg);
+ } else {
+ tcg_out_mov(s, ts->type, reg, ts->reg);
+ }
}
} else {
TCGRegSet arg_set;
--
2.13.0
* [Qemu-devel] [RFC PATCH 4/9] helper-head: add support for vec type
From: Alex Bennée @ 2017-08-17 18:03 UTC (permalink / raw)
To: rth, cota, batuzovk
Cc: qemu-devel, qemu-arm, Alex Bennée, Paolo Bonzini,
Peter Crosthwaite
---
include/exec/helper-head.h | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/include/exec/helper-head.h b/include/exec/helper-head.h
index 1cfc43b9ff..3fb4c3fc39 100644
--- a/include/exec/helper-head.h
+++ b/include/exec/helper-head.h
@@ -23,6 +23,7 @@
#define GET_TCGV_i32 GET_TCGV_I32
#define GET_TCGV_i64 GET_TCGV_I64
#define GET_TCGV_ptr GET_TCGV_PTR
+#define GET_TCGV_vec GET_TCGV_VEC
/* Some types that make sense in C, but not for TCG. */
#define dh_alias_i32 i32
@@ -33,6 +34,7 @@
#define dh_alias_f32 i32
#define dh_alias_f64 i64
#define dh_alias_ptr ptr
+#define dh_alias_vec vec
#define dh_alias_void void
#define dh_alias_noreturn noreturn
#define dh_alias(t) glue(dh_alias_, t)
@@ -45,6 +47,7 @@
#define dh_ctype_f32 float32
#define dh_ctype_f64 float64
#define dh_ctype_ptr void *
+#define dh_ctype_vec void *
#define dh_ctype_void void
#define dh_ctype_noreturn void QEMU_NORETURN
#define dh_ctype(t) dh_ctype_##t
@@ -90,6 +93,7 @@
#define dh_is_64bit_i32 0
#define dh_is_64bit_i64 1
#define dh_is_64bit_ptr (sizeof(void *) == 8)
+#define dh_is_64bit_vec (sizeof(void *) == 8)
#define dh_is_64bit(t) glue(dh_is_64bit_, dh_alias(t))
#define dh_is_signed_void 0
@@ -106,6 +110,7 @@
extension instructions that may be required, e.g. ia64's addp4. But
for now we don't support any 64-bit targets with 32-bit pointers. */
#define dh_is_signed_ptr 0
+#define dh_is_signed_vec dh_is_signed_ptr
#define dh_is_signed_env dh_is_signed_ptr
#define dh_is_signed(t) dh_is_signed_##t
--
2.13.0
* [Qemu-devel] [RFC PATCH 5/9] arm/cpu.h: align VFP registers
From: Alex Bennée @ 2017-08-17 18:04 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée, Peter Maydell
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
target/arm/cpu.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index b39d64aa0b..cdd47cb868 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -457,8 +457,8 @@ typedef struct CPUARMState {
* the two execution states, and means we do not need to explicitly
* map these registers when changing states.
*/
- float64 regs[64];
-
+ float64 regs[64] __attribute__((aligned(16)));
+ /* VFP system registers */
uint32_t xregs[16];
/* We store these fpcsr fields separately for convenience. */
int vec_len;
--
2.13.0
* [Qemu-devel] [RFC PATCH 6/9] target/arm/translate-a64: regnames -> x_regnames
From: Alex Bennée @ 2017-08-17 18:04 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée, Peter Maydell
These are the integer registers as will become clear when we start
declaring the vector ones.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
target/arm/translate-a64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2200e25be0..805af51900 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -43,7 +43,7 @@ static TCGv_i64 cpu_pc;
static TCGv_i64 cpu_exclusive_high;
static TCGv_i64 cpu_reg(DisasContext *s, int reg);
-static const char *regnames[] = {
+static const char *x_regnames[] = {
"x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
"x8", "x9", "x10", "x11", "x12", "x13", "x14", "x15",
"x16", "x17", "x18", "x19", "x20", "x21", "x22", "x23",
--
2.13.0
* [Qemu-devel] [RFC PATCH 7/9] target/arm/translate-a64: register global vectors
From: Alex Bennée @ 2017-08-17 18:04 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée, Peter Maydell
Register the vector registers with TCG.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
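The registration loop in this patch indexes the register file with
i * 2 because each 128-bit V register spans two 64-bit slots. A
standalone sketch of that offset arithmetic follows; the struct is a
hypothetical stand-in for the relevant part of the CPU state, with
uint64_t in place of float64.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the VFP part of the CPU state. */
struct sketch_vfp {
    uint32_t before;                                /* forces padding */
    uint64_t regs[64] __attribute__((aligned(16))); /* 32 x 128 bits */
};

/* Byte offset of 128-bit register i: two 64-bit slots per register. */
static size_t sketch_vreg_offset(int i)
{
    return offsetof(struct sketch_vfp, regs) +
           (size_t)i * 2 * sizeof(uint64_t);
}
```

With the array itself aligned to 16 bytes, every register offset in
this sketch is a multiple of 16 as well.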
target/arm/translate-a64.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 805af51900..b5f48605a7 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -36,8 +36,10 @@
#include "trace-tcg.h"
+/* Global registers */
static TCGv_i64 cpu_X[32];
static TCGv_i64 cpu_pc;
+static TCGv_vec cpu_V[32];
/* Load/store exclusive handling */
static TCGv_i64 cpu_exclusive_high;
@@ -50,6 +52,13 @@ static const char *x_regnames[] = {
"x24", "x25", "x26", "x27", "x28", "x29", "lr", "sp"
};
+static const char *v_regnames[] = {
+ "v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+ "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+ "v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23",
+ "v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31"
+};
+
enum a64_shift_type {
A64_SHIFT_TYPE_LSL = 0,
A64_SHIFT_TYPE_LSR = 1,
@@ -91,10 +100,18 @@ void a64_translate_init(void)
cpu_pc = tcg_global_mem_new_i64(cpu_env,
offsetof(CPUARMState, pc),
"pc");
- for (i = 0; i < 32; i++) {
+
+ for (i = 0; i < ARRAY_SIZE(cpu_X); i++) {
cpu_X[i] = tcg_global_mem_new_i64(cpu_env,
offsetof(CPUARMState, xregs[i]),
- regnames[i]);
+ x_regnames[i]);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(cpu_V); i++) {
+ cpu_V[i] = tcg_global_mem_new_vec(cpu_env,
+ offsetof(CPUARMState,
+ vfp.regs[i * 2]),
+ v_regnames[i]);
}
cpu_exclusive_high = tcg_global_mem_new_i64(cpu_env,
--
2.13.0
* [Qemu-devel] [RFC PATCH 8/9] target/arm/helpers: introduce ADVSIMD flags
From: Alex Bennée @ 2017-08-17 18:04 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée, Peter Maydell
This is used to pass constant information to the helper. This includes
immediate data and element counts/offsets.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
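To make the packing concrete, here is a standalone sketch of how three
of the fields could be written into and read back from one 32-bit
word. The shift and width values are copied from the ADVSIMD_*
definitions in this patch; the sketch_* functions are local stand-ins
written for this example rather than QEMU's own extract32() and
deposit32().

```c
#include <stdint.h>

/* Local stand-in for a bitfield read; 0 < length < 32 is assumed. */
static uint32_t sketch_extract32(uint32_t value, int start, int length)
{
    return (value >> start) & ((1u << length) - 1);
}

/* Local stand-in for a bitfield write; 0 < length < 32 is assumed. */
static uint32_t sketch_deposit32(uint32_t value, int start, int length,
                                 uint32_t field)
{
    uint32_t mask = ((1u << length) - 1) << start;

    return (value & ~mask) | ((field << start) & mask);
}

/* Field layout copied from the ADVSIMD_* definitions in this patch. */
#define SKETCH_OPR_ELT_SHIFT   0
#define SKETCH_OPR_ELT_BITS    5
#define SKETCH_DOFF_ELT_SHIFT  10
#define SKETCH_DOFF_ELT_BITS   5
#define SKETCH_DATA_SHIFT      16
#define SKETCH_DATA_BITS       16

/* Pack element count, element offset and an immediate into one word. */
static uint32_t sketch_pack(uint32_t opr_elt, uint32_t doff_elt,
                            uint32_t data)
{
    uint32_t d = 0;

    d = sketch_deposit32(d, SKETCH_OPR_ELT_SHIFT, SKETCH_OPR_ELT_BITS,
                         opr_elt);
    d = sketch_deposit32(d, SKETCH_DOFF_ELT_SHIFT, SKETCH_DOFF_ELT_BITS,
                         doff_elt);
    d = sketch_deposit32(d, SKETCH_DATA_SHIFT, SKETCH_DATA_BITS, data);
    return d;
}
```

Note that a 5-bit field can only hold values from 0 to 31, so element
counts above that would not fit this layout.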
target/arm/advsimd_helper_flags.h | 50 +++++++++++++++++++++++++++++++++++++++
target/arm/helper-a64.c | 1 +
target/arm/translate-a64.c | 2 ++
3 files changed, 53 insertions(+)
create mode 100644 target/arm/advsimd_helper_flags.h
diff --git a/target/arm/advsimd_helper_flags.h b/target/arm/advsimd_helper_flags.h
new file mode 100644
index 0000000000..47429e6fd1
--- /dev/null
+++ b/target/arm/advsimd_helper_flags.h
@@ -0,0 +1,50 @@
+/*
+ * AArch64 Vector Flags
+ *
+ * Copyright (c) 2017 Linaro
+ * Author: Alex Bennée <alex.bennee@linaro.org>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* AdvSIMD element data
+ *
+ * We pack all the additional information for elements into a single
+ * 32 bit constant passed by register. Hopefully for groups of
+ * identical operations on different registers this should propagate
+ * nicely in the TCG.
+ *
+ * The following control element iteration:
+ * ADVSIMD_OPR_ELT - the count of elements affected
+ * ADVSIMD_ALL_ELT - the total count of elements (e.g. clear all-opr elements)
+ * ADVSIMD_DOFF_ELT - the offset for the destination register (e.g. foo2 ops)
+ *
+ * We encode immediate data in:
+ * ADVSIMD_DATA
+ *
+ * Typically this is things like shift counts and the like.
+ */
+
+#define ADVSIMD_OPR_ELT_BITS 5
+#define ADVSIMD_OPR_ELT_SHIFT 0
+#define ADVSIMD_ALL_ELT_BITS 5
+#define ADVSIMD_ALL_ELT_SHIFT 5
+#define ADVSIMD_DOFF_ELT_BITS 5
+#define ADVSIMD_DOFF_ELT_SHIFT 10
+#define ADVSIMD_DATA_BITS 16
+#define ADVSIMD_DATA_SHIFT 16
+
+#define GET_SIMD_DATA(t, d) extract32(d, \
+ ADVSIMD_ ## t ## _SHIFT, \
+ ADVSIMD_ ## t ## _BITS)
diff --git a/target/arm/helper-a64.c b/target/arm/helper-a64.c
index d9df82cff5..17b1edfb5f 100644
--- a/target/arm/helper-a64.c
+++ b/target/arm/helper-a64.c
@@ -30,6 +30,7 @@
#include "exec/exec-all.h"
#include "exec/cpu_ldst.h"
#include "qemu/int128.h"
+#include "advsimd_helper_flags.h"
#include "tcg.h"
#include <zlib.h> /* For crc32 */
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index b5f48605a7..f474c5008b 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -34,6 +34,8 @@
#include "exec/helper-gen.h"
#include "exec/log.h"
+#include "advsimd_helper_flags.h"
+
#include "trace-tcg.h"
/* Global registers */
--
2.13.0
* [Qemu-devel] [RFC PATCH 9/9] target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]
From: Alex Bennée @ 2017-08-17 18:04 UTC (permalink / raw)
To: rth, cota, batuzovk; +Cc: qemu-devel, qemu-arm, Alex Bennée, Peter Maydell
These instructions show up in the ffmpeg profile from the
ff_simple_idct_put_neon function.
WARNING: this is experimental and essentially shortcuts to the
vectorised helper for the one instruction that shows up a lot in the
ffmpeg trace. Otherwise it falls through to the normal code
generation. We also skip where rd == rn to avoid having to explicitly
deal with the aliasing in the helper.
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
---
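A standalone sketch of the two pieces of arithmetic this patch relies
on: building the element index from the instruction bits, and the
by-element widening multiply with a source lane offset. The sketch_*
names are made up for the example, and the bit positions are taken
from the translate-a64.c hunk in this patch rather than re-derived
from the architecture manual.

```c
#include <stdint.h>

/* Local stand-in for a bitfield read; 0 < length < 32 is assumed. */
static uint32_t sketch_extract32(uint32_t value, int start, int length)
{
    return (value >> start) & ((1u << length) - 1);
}

/*
 * Element index for 16-bit elements, built as in the hunk: bit 11
 * supplies the top bit, bits 21:20 supply the low two bits.
 */
static int sketch_h_index(uint32_t insn)
{
    int h = sketch_extract32(insn, 11, 1);
    int lm = sketch_extract32(insn, 20, 2);

    return h << 2 | lm;
}

/*
 * Widening multiply by one scalar element. doff_elt is applied to the
 * source lanes, as in the helper in this patch: 0 selects the lower
 * four 16-bit lanes, 4 the upper four.
 */
static void sketch_smull_idx(int32_t *rd, const int16_t *rn, int16_t rm,
                             int opr_elt, int doff_elt)
{
    int i;

    for (i = 0; i < opr_elt; i++) {
        rd[i] = (int32_t)rn[i + doff_elt] * rm;
    }
}
```

The destination and source buffers must not overlap here, which
matches the rd == rn bail-out in the translator code.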
target/arm/helper-a64.c | 17 +++++++++++
target/arm/helper-a64.h | 2 ++
target/arm/translate-a64.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 91 insertions(+)
diff --git a/target/arm/helper-a64.c b/target/arm/helper-a64.c
index 17b1edfb5f..ae0f8da5c4 100644
--- a/target/arm/helper-a64.c
+++ b/target/arm/helper-a64.c
@@ -538,3 +538,20 @@ uint64_t HELPER(paired_cmpxchg64_be)(CPUARMState *env, uint64_t addr,
return !success;
}
+
+/* Multiply Long (vector, by element) */
+void HELPER(advsimd_smull_idx_s32)(void *d, void *n, uint32_t m,
+ uint32_t simd_data)
+{
+ int opr_elt = GET_SIMD_DATA(OPR_ELT, simd_data);
+ int doff_elt = GET_SIMD_DATA(DOFF_ELT, simd_data);
+ int32_t *rd = (int32_t *) d;
+ int16_t *rn = (int16_t *) n;
+ int16_t rm = (int16_t) m;
+ int i;
+
+ #pragma GCC ivdep
+ for (i = 0; i < opr_elt; ++i) {
+ rd[i] = rn[i + doff_elt] * rm;
+ }
+}
diff --git a/target/arm/helper-a64.h b/target/arm/helper-a64.h
index 6f9eaba533..0bd7942cec 100644
--- a/target/arm/helper-a64.h
+++ b/target/arm/helper-a64.h
@@ -44,3 +44,5 @@ DEF_HELPER_FLAGS_3(crc32_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
DEF_HELPER_FLAGS_3(crc32c_64, TCG_CALL_NO_RWG_SE, i64, i64, i64, i32)
DEF_HELPER_FLAGS_4(paired_cmpxchg64_le, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
DEF_HELPER_FLAGS_4(paired_cmpxchg64_be, TCG_CALL_NO_WG, i64, env, i64, i64, i64)
+
+DEF_HELPER_4(advsimd_smull_idx_s32, void, vec, vec, i32, i32)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index f474c5008b..3a609e571c 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -10466,6 +10466,74 @@ static void disas_simd_two_reg_misc(DisasContext *s, uint32_t insn)
}
}
+typedef void AdvSIMDGenTwoPlusOneVectorFn(TCGv_vec, TCGv_vec, TCGv_i32, TCGv_i32);
+
+/* Handle [U/S]ML[S/A]L instructions
+ *
+ * This splits off from below only to aid experimentation.
+ */
+static bool handle_vec_simd_mul_addsub(DisasContext *s, uint32_t insn, int opcode, int size, bool is_q, bool u, int rn, int rm, int rd)
+{
+ /* fprintf(stderr, "%s: %#04x op:%x sz:%d rn:%d rm:%d rd:%d\n", __func__, */
+ /* insn, opcode, size, rn, rm, rd); */
+
+ if (size == 1) {
+ AdvSIMDGenTwoPlusOneVectorFn *fn = NULL;
+ uint32_t simd_info = 0;
+
+ switch (opcode) {
+ case 0x2: /* SMLAL, SMLAL2, UMLAL, UMLAL2 */
+ break;
+ case 0x6: /* SMLSL, SMLSL2, UMLSL, UMLSL2 */
+ break;
+ case 0xa: /* SMULL, SMULL2, UMULL, UMULL2 */
+ if (!u)
+ {
+ /* helper assumes no aliasing */
+ if (rd == rn) {
+ return false;
+ }
+
+ fn = gen_helper_advsimd_smull_idx_s32;
+ simd_info = deposit32(simd_info,
+ ADVSIMD_OPR_ELT_SHIFT, ADVSIMD_OPR_ELT_BITS, 4);
+
+ if (is_q) {
+ simd_info = deposit32(simd_info,
+ ADVSIMD_DOFF_ELT_SHIFT, ADVSIMD_DOFF_ELT_BITS, 4);
+ }
+ };
+ break;
+ default:
+ break;
+ }
+
+ /* assert(fn); */
+
+ if (fn) {
+ TCGv_i32 tcg_idx = tcg_temp_new_i32();
+ TCGv_i32 tcg_simd_info = tcg_const_i32(simd_info);
+ int h = extract32(insn, 11, 1);
+ int lm = extract32(insn, 20, 2);
+ int index = h << 2 | lm;
+
+ if (!fp_access_check(s)) {
+ return false;
+ }
+
+ read_vec_element_i32(s, tcg_idx, rm, index, size);
+
+ fn(cpu_V[rd], cpu_V[rn], tcg_idx, tcg_simd_info);
+
+ tcg_temp_free_i32(tcg_simd_info);
+ tcg_temp_free_i32(tcg_idx);
+ return true;
+ }
+ }
+
+ return false;
+}
+
/* C3.6.13 AdvSIMD scalar x indexed element
* 31 30 29 28 24 23 22 21 20 19 16 15 12 11 10 9 5 4 0
* +-----+---+-----------+------+---+---+------+-----+---+---+------+------+
@@ -10518,6 +10586,10 @@ static void disas_simd_indexed(DisasContext *s, uint32_t insn)
unallocated_encoding(s);
return;
}
+ /* Shortcut if we have a vectorised helper */
+ if (handle_vec_simd_mul_addsub(s, insn, opcode, size, is_q, u, rn, rm, rd)) {
+ return;
+ }
is_long = true;
break;
case 0x3: /* SQDMLAL, SQDMLAL2 */
--
2.13.0
* Re: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
From: no-reply @ 2017-08-17 18:32 UTC (permalink / raw)
To: alex.bennee; +Cc: famz, rth, cota, batuzovk, qemu-arm, qemu-devel
Hi,
This series seems to have some coding style problems. See output below for
more information:
Type: series
Message-id: 20170817180404.29334-1-alex.bennee@linaro.org
Subject: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
=== TEST SCRIPT BEGIN ===
#!/bin/bash
BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0
git config --local diff.renamelimit 0
git config --local diff.renames True
commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done
exit $failed
=== TEST SCRIPT END ===
Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
bab61192a4 target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]
c5c733a1f9 target/arm/helpers: introduce ADVSIMD flags
e27691f07a target/arm/translate-a64: register global vectors
6a55e60454 target/arm/translate-a64: regnames -> x_regnames
e7a0e2466b arm/cpu.h: align VFP registers
efa94c04ce helper-head: add support for vec type
d8c96ebdd2 tcg: generate ptrs to vector registers
26ec07c3d7 tcg: introduce the concepts of a TCGv_vec register type
9ec3b4754d tcg/README: listify the TCG types.
=== OUTPUT BEGIN ===
Checking PATCH 1/9: tcg/README: listify the TCG types....
Checking PATCH 2/9: tcg: introduce the concepts of a TCGv_vec register type...
Checking PATCH 3/9: tcg: generate ptrs to vector registers...
Checking PATCH 4/9: helper-head: add support for vec type...
Checking PATCH 5/9: arm/cpu.h: align VFP registers...
Checking PATCH 6/9: target/arm/translate-a64: regnames -> x_regnames...
Checking PATCH 7/9: target/arm/translate-a64: register global vectors...
Checking PATCH 8/9: target/arm/helpers: introduce ADVSIMD flags...
WARNING: line over 80 characters
#50: FILE: target/arm/advsimd_helper_flags.h:30:
+ * ADVSIMD_ALL_ELT - the total count of elements (e.g. clear all-opr elements)
total: 0 errors, 1 warnings, 65 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 9/9: target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]...
WARNING: line over 80 characters
#65: FILE: target/arm/translate-a64.c:10469:
+typedef void AdvSIMDGenTwoPlusOneVectorFn(TCGv_vec, TCGv_vec, TCGv_i32, TCGv_i32);
ERROR: line over 90 characters
#71: FILE: target/arm/translate-a64.c:10475:
+static bool handle_vec_simd_mul_addsub(DisasContext *s, uint32_t insn, int opcode, int size, bool is_q, bool u, int rn, int rm, int rd)
ERROR: that open brace { should be on the previous line
#86: FILE: target/arm/translate-a64.c:10490:
+ if (!u)
+ {
WARNING: line over 80 characters
#95: FILE: target/arm/translate-a64.c:10499:
+ ADVSIMD_OPR_ELT_SHIFT, ADVSIMD_OPR_ELT_BITS, 4);
ERROR: line over 90 characters
#99: FILE: target/arm/translate-a64.c:10503:
+ ADVSIMD_DOFF_ELT_SHIFT, ADVSIMD_DOFF_ELT_BITS, 4);
WARNING: line over 80 characters
#141: FILE: target/arm/translate-a64.c:10590:
+ if (handle_vec_simd_mul_addsub(s, insn, opcode, size, is_q, u, rn, rm, rd)) {
total: 3 errors, 3 warnings, 109 lines checked
Your patch has style problems, please review. If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
=== OUTPUT END ===
Test command exited with code: 1
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org
* Re: [Qemu-devel] [RFC PATCH 1/9] tcg/README: listify the TCG types.
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 1/9] tcg/README: listify the TCG types Alex Bennée
@ 2017-08-17 20:05 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-08-17 20:05 UTC (permalink / raw)
To: Alex Bennée, rth, cota, batuzovk; +Cc: qemu-arm, qemu-devel
On 08/17/2017 11:03 AM, Alex Bennée wrote:
> Although the other types are aliases lets make it clear what TCG types
> are available.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> tcg/README | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [Qemu-devel] [RFC PATCH 2/9] tcg: introduce the concepts of a TCGv_vec register type
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 2/9] tcg: introduce the concepts of a TCGv_vec register type Alex Bennée
@ 2017-08-17 20:07 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-08-17 20:07 UTC (permalink / raw)
To: Alex Bennée, rth, cota, batuzovk; +Cc: qemu-arm, qemu-devel
On 08/17/2017 11:03 AM, Alex Bennée wrote:
> Currently it only makes sense for globals - i.e. registers directly
> mapped to CPUEnv.
> ---
> tcg/README | 1 +
> tcg/tcg.h | 20 ++++++++++++++++++++
> 2 files changed, 21 insertions(+)
I'm not keen on this. I know it makes for nicer intermediate dumps, but I'd
rather expose the pointer addition directly (or not, and fold it into a memory
offset).
r~
* Re: [Qemu-devel] [RFC PATCH 3/9] tcg: generate ptrs to vector registers
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 3/9] tcg: generate ptrs to vector registers Alex Bennée
@ 2017-08-17 20:13 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-08-17 20:13 UTC (permalink / raw)
To: Alex Bennée, rth, cota, batuzovk; +Cc: qemu-arm, qemu-devel
On 08/17/2017 11:03 AM, Alex Bennée wrote:
> As we operate directly on the vectors in memory we pass around the
> address for TCG_TYPE_VECTOR. Currently only helpers ever see these
> values but if we were to generate simd backend instructions they would
> load directly from the backing store.
>
> We also need to ensure when copying from one temp register to the
> other the right size is used.
>
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> tcg/tcg.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/tcg/tcg.c b/tcg/tcg.c
> index 35598296c5..e16811d68d 100644
> --- a/tcg/tcg.c
> +++ b/tcg/tcg.c
> @@ -2034,7 +2034,21 @@ static void temp_load(TCGContext *s, TCGTemp *ts, TCGRegSet desired_regs,
> break;
> case TEMP_VAL_MEM:
> reg = tcg_reg_alloc(s, desired_regs, allocated_regs, ts->indirect_base);
> - tcg_out_ld(s, ts->type, reg, ts->mem_base->reg, ts->mem_offset);
> + if (ts->type == TCG_TYPE_VECTOR) {
> + /* Vector registers are ptr's to the memory representation */
> + TCGArg args[TCG_MAX_OP_ARGS];
> + int const_args[TCG_MAX_OP_ARGS];
> + args[0] = reg;
> + args[1] = ts->mem_base->reg;
> + args[2] = ts->mem_offset;
> + const_args[0] = 0;
> + const_args[1] = 0;
> + const_args[2] = 1;
> + /* FIXME: needs to by host_ptr centric */
> + tcg_out_op(s, INDEX_op_add_i64, args, const_args);
This fails when the offset is out of range for the addition, and technically if
the backend does not support 3-operand addition. You didn't see this because
the x86 backend does use lea, and has a 32-bit offset.
Once upon a time we had a tcg_out_addi; if we go this way with TCG_TYPE_VECTOR,
we should re-introduce that.
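
To illustrate the range concern above, here is a minimal standalone sketch
(not QEMU code) of the decision a tcg_out_addi-style routine would have to
make. The names add_plan, fits_simm32 and plan_addi are hypothetical, and
the signed 32-bit immediate is an assumption modelled on the x86 lea case
mentioned above; other hosts have narrower immediates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: ways an "add constant to pointer" could be lowered. */
typedef enum {
    ADD_NOTHING,      /* offset is zero: a plain register move suffices */
    ADD_IMMEDIATE,    /* offset fits the host's immediate field */
    ADD_VIA_SCRATCH   /* load the offset into a scratch reg, then add */
} add_plan;

/* Assumption: the host add/lea accepts a signed 32-bit immediate. */
static bool fits_simm32(int64_t offset)
{
    return offset >= INT32_MIN && offset <= INT32_MAX;
}

/* Pick a lowering for base + offset without assuming the offset fits. */
static add_plan plan_addi(int64_t offset)
{
    if (offset == 0) {
        return ADD_NOTHING;
    }
    return fits_simm32(offset) ? ADD_IMMEDIATE : ADD_VIA_SCRATCH;
}
```

The point of the third case is that the emitter never silently truncates an
offset that the host instruction cannot encode.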
r~
* Re: [Qemu-devel] [RFC PATCH 5/9] arm/cpu.h: align VFP registers
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 5/9] arm/cpu.h: align VFP registers Alex Bennée
@ 2017-08-17 20:13 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-08-17 20:13 UTC (permalink / raw)
To: Alex Bennée, rth, cota, batuzovk; +Cc: Peter Maydell, qemu-arm, qemu-devel
On 08/17/2017 11:04 AM, Alex Bennée wrote:
> Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
> ---
> target/arm/cpu.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
r~
* Re: [Qemu-devel] [RFC PATCH 6/9] target/arm/translate-a64: regnames -> x_regnames
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 6/9] target/arm/translate-a64: regnames -> x_regnames Alex Bennée
@ 2017-08-17 20:14 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-08-17 20:14 UTC (permalink / raw)
To: Alex Bennée, rth, cota, batuzovk; +Cc: Peter Maydell, qemu-arm, qemu-devel
On 08/17/2017 11:04 AM, Alex Bennée wrote:
> -static const char *regnames[] = {
> +static const char *x_regnames[] = {
> "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7",
> "x8", "x9", "x10", "x11", "x12", "x13", "x14", "x15",
> "x16", "x17", "x18", "x19", "x20", "x21", "x22", "x23",
Mis-patch? There should be uses of this array to be renamed too.
r~
* Re: [Qemu-devel] [RFC PATCH 9/9] target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[]
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 9/9] target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[] Alex Bennée
@ 2017-08-17 20:23 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2017-08-17 20:23 UTC (permalink / raw)
To: Alex Bennée, rth, cota, batuzovk; +Cc: Peter Maydell, qemu-arm, qemu-devel
On 08/17/2017 11:04 AM, Alex Bennée wrote:
> + int32_t *rd = (int32_t *) d;
> + int16_t *rn = (int16_t *) n;
> + int16_t rm = (int16_t) m;
> + int i;
> +
> + #pragma GCC ivdep
> + for (i = 0; i < opr_elt; ++i) {
> + rd[i] = rn[i + doff_elt] * rm;
> + }
You need to run this loop backward to avoid clobbering data when rd == rn.
I thought you'd put m into ADVSIMD_DATA.
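
The aliasing hazard can be shown with a small standalone sketch. This is
not the helper from the patch: smull_idx_sketch and MAX_ELTS are
hypothetical names, and instead of choosing a loop direction it copies the
16-bit inputs out before writing any 32-bit result. Running the loop
backward covers the doff_elt == 0 case; with a non-zero doff_elt the safe
direction appears to differ, which is why the sketch sidesteps ordering
altogether.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_ELTS 8  /* hypothetical cap: 8 x int16_t = one 128-bit register */

/*
 * Widening multiply-by-scalar that stays correct when d and n point at
 * the same storage. Loads and stores go through memcpy so the sketch
 * does not rely on pointer-cast aliasing.
 */
static void smull_idx_sketch(void *d, const void *n, int16_t m,
                             int opr_elt, int doff_elt)
{
    int16_t src[MAX_ELTS];
    int i;

    assert(opr_elt >= 0 && doff_elt >= 0 && opr_elt + doff_elt <= MAX_ELTS);

    /* Snapshot the inputs first; d may overlap n. */
    memcpy(src, (const char *)n + doff_elt * sizeof(int16_t),
           opr_elt * sizeof(int16_t));

    for (i = 0; i < opr_elt; ++i) {
        int32_t r = (int32_t)src[i] * m;
        memcpy((char *)d + i * sizeof(int32_t), &r, sizeof(r));
    }
}
```

The copy costs a few bytes of stack per call; whether that is cheaper than
special-casing rd == rn in the translator is a separate question.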
>
> + if (is_q) {
> + simd_info = deposit32(simd_info,
> + ADVSIMD_DOFF_ELT_SHIFT, ADVSIMD_DOFF_ELT_BITS, 4);
> + }
It'd probably be useful to have a macro to clean this up:
#define PUT_SIMD_DATA(t, d) \
deposit32(0, ADVSIMD_ ## t ## _SHIFT, ADVSIMD_ ## t ## _BITS, (d))
simd_info |= PUT_SIMD_DATA(DOFF_ELT, 4);
that said, folding DOFF into the pointer that gets passed in the first place
seems a better solution to me.
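
For readers without the tree at hand, a standalone sketch of how such a
macro behaves. The deposit32 below is a local stand-in written for this
example, the SHIFT/BITS values are assumed (the real ones are defined in
the patch's advsimd_helper_flags.h), and GET_SIMD_DATA is a hypothetical
counterpart added only to show the round trip.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in: insert fieldval into value at [start, start + length). */
static inline uint32_t deposit32(uint32_t value, int start, int length,
                                 uint32_t fieldval)
{
    uint32_t mask;

    assert(start >= 0 && length > 0 && length <= 32 - start);
    mask = (~0U >> (32 - length)) << start;
    return (value & ~mask) | ((fieldval << start) & mask);
}

/* Assumed field layout, for illustration only. */
#define ADVSIMD_OPR_ELT_SHIFT   0
#define ADVSIMD_OPR_ELT_BITS    8
#define ADVSIMD_DOFF_ELT_SHIFT  8
#define ADVSIMD_DOFF_ELT_BITS   8

/* Build a word holding only field t set to d; OR the results together. */
#define PUT_SIMD_DATA(t, d) \
    deposit32(0, ADVSIMD_ ## t ## _SHIFT, ADVSIMD_ ## t ## _BITS, (d))

/* Hypothetical inverse: extract field t from descriptor word v. */
#define GET_SIMD_DATA(t, v) \
    (((v) >> ADVSIMD_ ## t ## _SHIFT) & ((1U << ADVSIMD_ ## t ## _BITS) - 1))
```

With this layout, PUT_SIMD_DATA(OPR_ELT, 4) | PUT_SIMD_DATA(DOFF_ELT, 4)
packs both fields in one expression, and the helper side can unpack them
with the matching GET form.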
r~
* Re: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
2017-08-17 18:03 [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion Alex Bennée
` (9 preceding siblings ...)
2017-08-17 18:32 ` [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion no-reply
@ 2017-08-18 11:33 ` Kirill Batuzov
2017-08-18 13:44 ` Richard Henderson
10 siblings, 1 reply; 20+ messages in thread
From: Kirill Batuzov @ 2017-08-18 11:33 UTC (permalink / raw)
To: Alex Bennée; +Cc: rth, cota, qemu-devel, qemu-arm
On Thu, 17 Aug 2017, Alex Bennée wrote:
> Hi,
>
> With upcoming work on SVE I've been looking at the way we implement
> vector registers in QEMU's TCG. The current orthodoxy is to decompose
> the vector into a series of TCG registers, often calling a helper
> function the calculation of each element. The result of the helper is
> then is then stored back in the vector representation afterwards.
> There are occasional outliers like simd_tbl which access elements
> directly from a passed CPUFooState env pointer but these are rare.
>
> This series introduces the concept of TCGv_vec type. This is a pointer
> to the start of the in memory representation of an arbitrarily long
> vector register. This is passed to a helper function as a pointer
> along with a normal TCG register containing information about the
> actual vector length and any additional information the helper needs
> to do the operation. The hope* is this saves on the churn of having
> the TCG do things element by element and allows the compiler to use
> native vector operations to streamline the helpers.
>
> There are some downsides to this approach. The first is you have to be
> careful about register aliasing. If you are doing a same reg to same
> reg operation you need to make a copy of the vector so you don't
> trample your input data as you go. The second is this involves
> changing some of the assumptions the TCG makes about things. I've
> managed to keep all the changes within the core TCG code for now but
> so far it has only been tested for the tcg_call path which is the only
> place where TCGv_vec's should turn up. It is possible to do the same
> thing without touching the TCG code generation by using TCGv_ptrs and
> manually emitting tcg_addi ops to pass the correct address. Richard
> has been exploring this approach with his series. The downside of that
> is you do miss the ability to have named global vector registers which
> makes reading the TCG dumps a little easier.
>
> I've only patched one helper in this series which implements the
> indexed smull. This is because it appears in the profiles for my test
> case which was using an arm64 ffmpeg to transcode:
>
> ./ffmpeg.arm64 -i big_buck_bunny_480p_surround-fix.avi \
> -threads 1 -qscale:v 3 -f null -
>
> * hope. On an earlier revision (which included sqshrn conversions) I
> had measured a minor saving but this had disappeared once I measured
> the final code. However the profile is fairly dominated by
> softfloat.
>
> master:
> 8.05% qemu-aarch64 qemu-aarch64 [.] roundAndPackFloat32
> 7.28% qemu-aarch64 qemu-aarch64 [.] float32_mul
> 6.56% qemu-aarch64 qemu-aarch64 [.] helper_lookup_tb_ptr
> 5.31% qemu-aarch64 qemu-aarch64 [.] float32_muladd
> 4.09% qemu-aarch64 qemu-aarch64 [.] helper_neon_mull_s16
> 4.00% qemu-aarch64 qemu-aarch64 [.] addFloat32Sigs
> 3.86% qemu-aarch64 qemu-aarch64 [.] subFloat32Sigs
> 2.26% qemu-aarch64 qemu-aarch64 [.] helper_simd_tbl
> 2.00% qemu-aarch64 qemu-aarch64 [.] float32_add
> 1.81% qemu-aarch64 qemu-aarch64 [.] helper_neon_unarrow_sat8
> 1.64% qemu-aarch64 qemu-aarch64 [.] float32_sub
> 1.43% qemu-aarch64 qemu-aarch64 [.] helper_neon_subl_u32
> 0.98% qemu-aarch64 qemu-aarch64 [.] helper_neon_widen_u8
>
> tcg-native-vectors-rfc:
> 7.93% qemu-aarch64 qemu-aarch64 [.] roundAndPackFloat32
> 7.54% qemu-aarch64 qemu-aarch64 [.] float32_mul
> 6.29% qemu-aarch64 qemu-aarch64 [.] helper_lookup_tb_ptr
> 5.39% qemu-aarch64 qemu-aarch64 [.] float32_muladd
> 3.92% qemu-aarch64 qemu-aarch64 [.] addFloat32Sigs
> 3.86% qemu-aarch64 qemu-aarch64 [.] subFloat32Sigs
> 3.62% qemu-aarch64 qemu-aarch64 [.] helper_advsimd_smull_idx_s32
> 2.19% qemu-aarch64 qemu-aarch64 [.] helper_simd_tbl
> 2.09% qemu-aarch64 qemu-aarch64 [.] helper_neon_mull_s16
> 1.99% qemu-aarch64 qemu-aarch64 [.] float32_add
> 1.79% qemu-aarch64 qemu-aarch64 [.] helper_neon_unarrow_sat8
> 1.62% qemu-aarch64 qemu-aarch64 [.] float32_sub
> 1.43% qemu-aarch64 qemu-aarch64 [.] helper_neon_subl_u32
> 1.00% qemu-aarch64 qemu-aarch64 [.] helper_neon_widen_u8
> 0.98% qemu-aarch64 qemu-aarch64 [.] helper_neon_addl_u32
>
> At the moment the default compiler settings don't actually vectorise
> the helper. I could get it to once I added some alignment guarantees
> but the casting I did broke the instruction emulation so I haven't
> included that patch in this series.
>
> Given the results why continue investigating this? Well for one thing
> vector sizes are growing, SVE vectors are up to 2048 bits long. Those
> longer vectors should offer more scope for the host compiler to
> generate efficient code in the helper. Also vector operations tend to
> be quite complex operations, being able to handle this in C code
> instead of TCGOps might be more preferable from a code maintainability
> point of view. Finally this noddy little experiment has at least shown
> it doesn't worsen performance. It would be nice if I could find a
> benchmark that made heavy use if non-floating point SIMD instructions
> to better measure the effect of marshalling elements vs vectorised
> helpers. If anyone has any suggestions I'm all ears ;-)
While working on my own vector register series I was using:
1. Handwritten example (it's for ARM32 NEON, not aarch64)
        .cpu cortex-a8
        .fpu neon
        .text
        .global test
test:
        vld1.32 d0, [r0]!
        vld1.32 d1, [r0]
        vld1.32 d2, [r1]!
        vld1.32 d3, [r1]
        mov r0, #0xb0000000
loop:
        vadd.i32 q0, q0, q1
        vadd.i32 q0, q0, q1
        vadd.i32 q0, q0, q1
        vadd.i32 q0, q0, q1
        subs r0, r0, #1
        bne loop
        vpadd.i32 d0, d0, d1
        vpadd.i32 d0, d0, d1
        vmov.i32 r0, d0[0]
        bx lr
It can be adapted for aarch64 without much trouble. This example shows
the potential speed-up you can expect, as it is a nearly ideal case for
the optimization in question.
2. x264 video encoder. It has a lot of handwritten vector assembler for
different architectures, including aarch64. You can probably access it
as libx264 from within ffmpeg, if support for this library was compiled in.
>
> Anyway questions, comments?
>
From my own experiments some time ago,
(1) translating vector instructions to vector instructions in TCG is faster than
(2) translating vector instructions to series of scalar instructions in TCG,
which is faster than*
(3) translating vector instructions to single helper calls, which is faster
than*
(4) translating vector instructions to helper calls for each vector element.
(*) (2) and (3) may change their respective places in case of some
complicated instructions.
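
The difference between schemes (3) and (4) can be pictured with a toy
model in plain C. Nothing here is QEMU API: ELTS, helper_calls and the
function names are made up for the illustration, and the counter only
stands in for the per-call overhead that the translated code pays each
time it leaves generated code for a helper.

```c
#include <stdint.h>

#define ELTS 4  /* hypothetical: four 32-bit lanes of a 128-bit register */

static int helper_calls;  /* counts helper invocations, illustration only */

/* Scheme (4): the translator marshals each element into its own call. */
static uint32_t helper_add_elt(uint32_t a, uint32_t b)
{
    helper_calls++;
    return a + b;
}

static void add_per_element(uint32_t *d, const uint32_t *a, const uint32_t *b)
{
    for (int i = 0; i < ELTS; i++) {
        d[i] = helper_add_elt(a[i], b[i]);
    }
}

/* Scheme (3): one call per instruction; the loop lives inside the helper,
 * where the host compiler is free to vectorise it. */
static void helper_add_vec(uint32_t *d, const uint32_t *a, const uint32_t *b)
{
    helper_calls++;
    for (int i = 0; i < ELTS; i++) {
        d[i] = a[i] + b[i];
    }
}
```

Both paths compute the same result; the first makes ELTS helper calls per
guest instruction and the second makes one.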
ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
last, the slowest scheme. As far as I understand, you want to change it
to the third approach. This approach is used in SSE emulation; maybe you
can use a similar structure of helpers?
I still hope to finish my own series implementing the first approach. I
apologize for the long delay since the last update and hope to send the
next version sometime next week. I do not think our series contradict
each other: you are trying to optimize the existing general purpose case
while I'm trying to optimize the case where both host and guest support
vector instructions. Since I'm experimenting on ARM32, we won't have
many merge conflicts either.
--
Kirill
* Re: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
2017-08-18 11:33 ` Kirill Batuzov
@ 2017-08-18 13:44 ` Richard Henderson
2017-08-22 9:04 ` Kirill Batuzov
0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2017-08-18 13:44 UTC (permalink / raw)
To: Kirill Batuzov, Alex Bennée; +Cc: cota, qemu-devel, qemu-arm, rth
On 08/18/2017 04:33 AM, Kirill Batuzov wrote:
> From my own experimentations some times ago,
>
> (1) translating vector instructions to vector instructions in TCG is faster than
>
> (2) translating vector instructions to series of scalar instructions in TCG,
> which is faster than*
>
> (3) translating vector instructions to single helper calls, which is faster
> than*
>
> (4) translating vector instructions to helper calls for each vector element.
>
> (*) (2) and (3) may change their respective places in case of some
> complicated instructions.
This was my gut feeling as well. With the caveat that for the ARM SVE case of
2048-bit registers we cannot afford to expand inline due to generated code size.
> ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
> last, the slowest scheme. As far as I understand, you are want to change
> it to the third approach. This approach is used in SSE emulation, may be
> you can use similar structure of helpers?
>
> I still hope to finish my own series about implementation of the first
> approach. I apologize for the long delay since last update and hope to
> send next version somewhere next week. I do not think our series
> contradict each other: you are trying to optimize existing general
> purpose case while I'm trying to optimize case where both host and guest
> support vector instructions. Since I'm experimenting on ARM32, we'll not
> have much merge conflicts either.
I posted my own, different, take on vectorization yesterday as well.
http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html
The primary difference between my version and your version is that I do not
allow target/cpu/translate*.c to create vector types. All of the host vector
expansion is done within tcg/*.c.
We would also like to settle on a common style for out-of-line helpers, if
that is possible. One caveat about borrowing from our current SSE emulation:
we do not yet support the AVX, AVX2, or AVX512 extensions, so the current
construction of helpers within target/i386/ doesn't really take into account
everything that will be required.
The thing that's common between AVX512 and SVE is that we have multiple vector
lengths, and that elements beyond the operation length are zeroed. Both Alex
and I have packed operation length + full vector length into a descriptor given
to the helper. (Alex allows for some other bits too; I'm not sure about that.)
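
A minimal sketch of the descriptor idea, assuming a made-up layout: the low
16 bits hold the operation length in bytes and the high 16 bits hold the
full register length. Neither series necessarily uses this encoding, and
make_desc, desc_opr_sz, desc_max_sz and helper_vec_add32 are hypothetical
names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: bits [15:0] = operation bytes, [31:16] = register bytes. */
static uint32_t make_desc(uint32_t opr_sz, uint32_t max_sz)
{
    assert(opr_sz <= max_sz && max_sz <= 0xffff);
    return opr_sz | (max_sz << 16);
}

static uint32_t desc_opr_sz(uint32_t desc) { return desc & 0xffff; }
static uint32_t desc_max_sz(uint32_t desc) { return desc >> 16; }

/* Add 32-bit lanes over the operation length, then zero the tail. */
static void helper_vec_add32(void *d, const void *a, const void *b,
                             uint32_t desc)
{
    uint32_t opr_sz = desc_opr_sz(desc);
    uint32_t max_sz = desc_max_sz(desc);

    assert(opr_sz % sizeof(uint32_t) == 0);
    for (uint32_t i = 0; i < opr_sz; i += sizeof(uint32_t)) {
        uint32_t x, y, r;
        memcpy(&x, (const char *)a + i, sizeof(x));
        memcpy(&y, (const char *)b + i, sizeof(y));
        r = x + y;
        memcpy((char *)d + i, &r, sizeof(r));
    }
    /* Elements beyond the operation length are defined to be zero. */
    memset((char *)d + opr_sz, 0, max_sz - opr_sz);
}
```

One helper then serves every vector length: a 128-bit operation on a
256-bit register is just make_desc(16, 32).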
r~
* Re: [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion
2017-08-18 13:44 ` Richard Henderson
@ 2017-08-22 9:04 ` Kirill Batuzov
0 siblings, 0 replies; 20+ messages in thread
From: Kirill Batuzov @ 2017-08-22 9:04 UTC (permalink / raw)
To: Richard Henderson; +Cc: Alex Bennée, cota, qemu-devel, qemu-arm, rth
On Fri, 18 Aug 2017, Richard Henderson wrote:
> On 08/18/2017 04:33 AM, Kirill Batuzov wrote:
> > From my own experimentations some times ago,
> >
> > (1) translating vector instructions to vector instructions in TCG is faster than
> >
> > (2) translating vector instructions to series of scalar instructions in TCG,
> > which is faster than*
> >
> > (3) translating vector instructions to single helper calls, which is faster
> > than*
> >
> > (4) translating vector instructions to helper calls for each vector element.
> >
> > (*) (2) and (3) may change their respective places in case of some
> > complicated instructions.
>
> This was my gut feeling as well. With the caveat that for the ARM SVE case of
> 2048-bit registers we cannot afford to expand inline due to generated code size.
>
> > ARM (at least ARM32, I have not checked aarch64 in this regard) uses the
> > last, the slowest scheme. As far as I understand, you are want to change
> > it to the third approach. This approach is used in SSE emulation, may be
> > you can use similar structure of helpers?
> >
> > I still hope to finish my own series about implementation of the first
> > approach. I apologize for the long delay since last update and hope to
> > send next version somewhere next week. I do not think our series
> > contradict each other: you are trying to optimize existing general
> > purpose case while I'm trying to optimize case where both host and guest
> > support vector instructions. Since I'm experimenting on ARM32, we'll not
> > have much merge conflicts either.
>
> I posted my own, different, take on vectorization yesterday as well.
>
> http://lists.nongnu.org/archive/html/qemu-devel/2017-08/msg03272.html
>
> The primary difference between my version and your version is that I do not
> allow target/cpu/translate*.c to create vector types. All of the host vector
> expansion is done within tcg/*.c.
I took a look at your approach. The only problem is that the current
implementation cannot keep vector variables in registers across
consecutive guest instructions. But this can be changed. To do it we
need to make copy propagation work with memory locations as well, and
make dead code elimination able to remove excess stores to memory. In
the general case these can be troublesome, but if we limit the analysis
to addresses of the form [env + Const] it becomes relatively easy. I've
done a similar thing in my series to track interference between memory
operations and vector global variables. In the case of your series this
affects only performance, so it does not need to be part of the initial
series and can be added later as a separate patch. I can take care of
this once the initial series is pulled to master.
Overall I like your approach the most out of the three:
- it handles different representations of guest vectors with host
vectors seamlessly (unlike my approach where I still do not know how
to get it right),
- it provides better performance than Alex's (and the same as mine once
we add a bit of alias analysis),
- it moves in the direction of representing guest vectors not as
globals, but as a pair (offset, size) in a special address space
(this approach was successfully used in Valgrind and it handles
intersecting registers much better than what we have now; we are
moving in this direction anyway).
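
The (offset, size) view makes the interference test a plain interval
check. A small sketch, with vec_loc and vec_overlaps as hypothetical names
and the byte offsets in the usage note chosen only for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* A guest vector register seen as a byte range inside the CPU state. */
typedef struct {
    size_t offset;  /* byte offset from env */
    size_t size;    /* length in bytes */
} vec_loc;

/* Two locations interfere when they share at least one byte. */
static bool vec_overlaps(vec_loc a, vec_loc b)
{
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}
```

On ARM32, q0 is the pair d0:d1, so with q0 = {0, 16}, d1 = {8, 8} and
d2 = {16, 8} a store to d1 interferes with q0 while a store to d2 does
not. That is the kind of [env + Const] alias query the copy propagation
described above would need.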
--
Kirill
end of thread, other threads:[~2017-08-22 9:04 UTC | newest]
Thread overview: 20+ messages
2017-08-17 18:03 [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion Alex Bennée
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 1/9] tcg/README: listify the TCG types Alex Bennée
2017-08-17 20:05 ` Richard Henderson
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 2/9] tcg: introduce the concepts of a TCGv_vec register type Alex Bennée
2017-08-17 20:07 ` Richard Henderson
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 3/9] tcg: generate ptrs to vector registers Alex Bennée
2017-08-17 20:13 ` Richard Henderson
2017-08-17 18:03 ` [Qemu-devel] [RFC PATCH 4/9] helper-head: add support for vec type Alex Bennée
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 5/9] arm/cpu.h: align VFP registers Alex Bennée
2017-08-17 20:13 ` Richard Henderson
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 6/9] target/arm/translate-a64: regnames -> x_regnames Alex Bennée
2017-08-17 20:14 ` Richard Henderson
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 7/9] target/arm/translate-a64: register global vectors Alex Bennée
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 8/9] target/arm/helpers: introduce ADVSIMD flags Alex Bennée
2017-08-17 18:04 ` [Qemu-devel] [RFC PATCH 9/9] target/arm/translate-a64: vectorise smull vD.4s, vN.[48]s, vM.h[] Alex Bennée
2017-08-17 20:23 ` Richard Henderson
2017-08-17 18:32 ` [Qemu-devel] [RFC PATCH 0/9] TCG Vector types and example conversion no-reply
2017-08-18 11:33 ` Kirill Batuzov
2017-08-18 13:44 ` Richard Henderson
2017-08-22 9:04 ` Kirill Batuzov