* [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions
@ 2024-05-31 17:44 Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 1/6] target/riscv: Separate vector segment " Max Chou
` (5 more replies)
0 siblings, 6 replies; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv; +Cc: dbarboza, Max Chou
Hi,
This RFC patch set tries to fix the performance issue reported at
https://gitlab.com/qemu-project/qemu/-/issues/2137.
In this new version, we added patches that load/store more data at a time
for part of the contiguous vector load/store (unit-stride/whole register)
instructions, under some assumptions suggested by Richard Henderson (e.g.
no masking, no tail agnostic, virtual address resolution performed once
for the entire vector).
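For illustration, the shape of the translation-time fast-path guard added in
patch 5 is roughly the following (condensed from that patch; the actual copy
loop is elided here):

    if (!HOST_BIG_ENDIAN && s->vstart_eq_zero && s->vta == 0 && a->vm) {
        /* probe the whole range once, then copy 16 bytes per iteration */
    } else {
        return ld_us_op(s, a, MO_8);   /* fall back to the existing helper */
    }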
This version improves the performance of the test case provided in
https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369 from
~13.5 sec to ~1.5 sec in QEMU user mode.
PS: This RFC patch set only covers the vle8.v/vse8.v/vl8re8.v/vs8r.v
instructions. The next version will extend the optimization to the remaining
instructions.
The series is based on the riscv-to-apply.next branch (commit 1806da7).
Max Chou (6):
target/riscv: Separate vector segment ld/st instructions
accel/tcg: Avoid unnecessary call overhead from
qemu_plugin_vcpu_mem_cb
target/riscv: Inline vext_ldst_us and corresponding function for
performance
target/riscv: Add check_probe_[read|write] helper functions
target/riscv: rvv: Optimize v[l|s]e8.v with limitations
target/riscv: rvv: Optimize vl8re8.v/vs8r.v with limitations
accel/tcg/ldst_common.c.inc | 8 +-
target/riscv/helper.h | 8 +
target/riscv/insn32.decode | 11 +-
target/riscv/insn_trans/trans_rvv.c.inc | 454 +++++++++++++++++++++++-
target/riscv/vector_helper.c | 142 ++++++--
5 files changed, 591 insertions(+), 32 deletions(-)
--
2.34.1
^ permalink raw reply [flat|nested] 10+ messages in thread
* [RFC PATCH v2 1/6] target/riscv: Separate vector segment ld/st instructions
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
@ 2024-05-31 17:44 ` Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
` (4 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv
Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng,
Weiwei Li, Liu Zhiwei, Richard Henderson
This commit separates the helper function implementations of the vector
segment load/store instructions from the other vector load/store
instructions.
This can improve performance by avoiding unnecessary segment handling
when NF = 1.
Signed-off-by: Max Chou <max.chou@sifive.com>
---
target/riscv/helper.h | 4 +
target/riscv/insn32.decode | 11 ++-
target/riscv/insn_trans/trans_rvv.c.inc | 61 +++++++++++++++
target/riscv/vector_helper.c | 100 +++++++++++++++++++++---
4 files changed, 164 insertions(+), 12 deletions(-)
diff --git a/target/riscv/helper.h b/target/riscv/helper.h
index 451261ce5a4..aaf68eadfb7 100644
--- a/target/riscv/helper.h
+++ b/target/riscv/helper.h
@@ -158,18 +158,22 @@ DEF_HELPER_FLAGS_3(hyp_hsv_d, TCG_CALL_NO_WG, void, env, tl, tl)
/* Vector functions */
DEF_HELPER_3(vsetvl, tl, env, tl, tl)
DEF_HELPER_5(vle8_v, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vlsege8_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle16_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle32_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle64_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle8_v_mask, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vlsege8_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle16_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle32_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vle64_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse8_v, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vssege8_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse16_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse32_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse64_v, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse8_v_mask, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vssege8_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse16_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse32_v_mask, void, ptr, ptr, tl, env, i32)
DEF_HELPER_5(vse64_v_mask, void, ptr, ptr, tl, env, i32)
diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode
index f22df04cfd1..0712e9f6314 100644
--- a/target/riscv/insn32.decode
+++ b/target/riscv/insn32.decode
@@ -77,6 +77,7 @@
@r2 ....... ..... ..... ... ..... ....... &r2 %rs1 %rd
@r2_vm_1 ...... . ..... ..... ... ..... ....... &rmr vm=1 %rs2 %rd
@r2_nfvm ... ... vm:1 ..... ..... ... ..... ....... &r2nfvm %nf %rs1 %rd
+@r2_nf_1_vm ... ... vm:1 ..... ..... ... ..... ....... &r2nfvm nf=1 %rs1 %rd
@r2_vm ...... vm:1 ..... ..... ... ..... ....... &rmr %rs2 %rd
@r1_vm ...... vm:1 ..... ..... ... ..... ....... %rd
@r_nfvm ... ... vm:1 ..... ..... ... ..... ....... &rnfvm %nf %rs2 %rs1 %rd
@@ -349,11 +350,17 @@ hsv_d 0110111 ..... ..... 100 00000 1110011 @r2_s
# *** Vector loads and stores are encoded within LOADFP/STORE-FP ***
# Vector unit-stride load/store insns.
-vle8_v ... 000 . 00000 ..... 000 ..... 0000111 @r2_nfvm
+{
+ vle8_v 000 000 . 00000 ..... 000 ..... 0000111 @r2_nf_1_vm
+ vlsege8_v ... 000 . 00000 ..... 000 ..... 0000111 @r2_nfvm
+}
vle16_v ... 000 . 00000 ..... 101 ..... 0000111 @r2_nfvm
vle32_v ... 000 . 00000 ..... 110 ..... 0000111 @r2_nfvm
vle64_v ... 000 . 00000 ..... 111 ..... 0000111 @r2_nfvm
-vse8_v ... 000 . 00000 ..... 000 ..... 0100111 @r2_nfvm
+{
+ vse8_v 000 000 . 00000 ..... 000 ..... 0100111 @r2_nf_1_vm
+ vssege8_v ... 000 . 00000 ..... 000 ..... 0100111 @r2_nfvm
+}
vse16_v ... 000 . 00000 ..... 101 ..... 0100111 @r2_nfvm
vse32_v ... 000 . 00000 ..... 110 ..... 0100111 @r2_nfvm
vse64_v ... 000 . 00000 ..... 111 ..... 0100111 @r2_nfvm
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 3a3896ba06c..1e4fa797a86 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -719,6 +719,40 @@ GEN_VEXT_TRANS(vle16_v, MO_16, r2nfvm, ld_us_op, ld_us_check)
GEN_VEXT_TRANS(vle32_v, MO_32, r2nfvm, ld_us_op, ld_us_check)
GEN_VEXT_TRANS(vle64_v, MO_64, r2nfvm, ld_us_op, ld_us_check)
+static bool ld_us_seg_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew)
+{
+ uint32_t data = 0;
+ gen_helper_ldst_us *fn;
+ static gen_helper_ldst_us * const fns[2][4] = {
+ /* masked unit stride load */
+ { gen_helper_vlsege8_v_mask, gen_helper_vle16_v_mask,
+ gen_helper_vle32_v_mask, gen_helper_vle64_v_mask },
+ /* unmasked unit stride load */
+ { gen_helper_vlsege8_v, gen_helper_vle16_v,
+ gen_helper_vle32_v, gen_helper_vle64_v }
+ };
+
+ fn = fns[a->vm][eew];
+ if (fn == NULL) {
+ return false;
+ }
+
+ /*
+ * Vector load/store instructions have the EEW encoded
+ * directly in the instructions. The maximum vector size is
+ * calculated with EMUL rather than LMUL.
+ */
+ uint8_t emul = vext_get_emul(s, eew);
+ data = FIELD_DP32(data, VDATA, VM, a->vm);
+ data = FIELD_DP32(data, VDATA, LMUL, emul);
+ data = FIELD_DP32(data, VDATA, NF, a->nf);
+ data = FIELD_DP32(data, VDATA, VTA, s->vta);
+ data = FIELD_DP32(data, VDATA, VMA, s->vma);
+ return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
+}
+
+GEN_VEXT_TRANS(vlsege8_v, MO_8, r2nfvm, ld_us_seg_op, ld_us_check)
+
static bool st_us_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew)
{
uint32_t data = 0;
@@ -756,6 +790,33 @@ GEN_VEXT_TRANS(vse16_v, MO_16, r2nfvm, st_us_op, st_us_check)
GEN_VEXT_TRANS(vse32_v, MO_32, r2nfvm, st_us_op, st_us_check)
GEN_VEXT_TRANS(vse64_v, MO_64, r2nfvm, st_us_op, st_us_check)
+static bool st_us_seg_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew)
+{
+ uint32_t data = 0;
+ gen_helper_ldst_us *fn;
+ static gen_helper_ldst_us * const fns[2][4] = {
+ /* masked unit stride store */
+ { gen_helper_vssege8_v_mask, gen_helper_vse16_v_mask,
+ gen_helper_vse32_v_mask, gen_helper_vse64_v_mask },
+ /* unmasked unit stride store */
+ { gen_helper_vssege8_v, gen_helper_vse16_v,
+ gen_helper_vse32_v, gen_helper_vse64_v }
+ };
+
+ fn = fns[a->vm][eew];
+ if (fn == NULL) {
+ return false;
+ }
+
+ uint8_t emul = vext_get_emul(s, eew);
+ data = FIELD_DP32(data, VDATA, VM, a->vm);
+ data = FIELD_DP32(data, VDATA, LMUL, emul);
+ data = FIELD_DP32(data, VDATA, NF, a->nf);
+ return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
+}
+
+GEN_VEXT_TRANS(vssege8_v, MO_8, r2nfvm, st_us_seg_op, st_us_check)
+
/*
*** unit stride mask load and store
*/
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 1b4d5a8e378..440c33c141b 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -201,6 +201,32 @@ vext_ldst_stride(void *vd, void *v0, target_ulong base,
uint32_t desc, uint32_t vm,
vext_ldst_elem_fn *ldst_elem,
uint32_t log2_esz, uintptr_t ra)
+{
+ uint32_t i;
+ uint32_t max_elems = vext_max_elems(desc, log2_esz);
+ uint32_t esz = 1 << log2_esz;
+ uint32_t vma = vext_vma(desc);
+
+ for (i = env->vstart; i < env->vl; i++, env->vstart++) {
+ if (!vm && !vext_elem_mask(v0, i)) {
+ /* set masked-off elements to 1s */
+ vext_set_elems_1s(vd, vma, i * esz, (i + 1) * esz);
+ continue;
+ }
+ target_ulong addr = base + stride * i;
+ ldst_elem(env, adjust_addr(env, addr), i, vd, ra);
+ }
+ env->vstart = 0;
+
+ vext_set_tail_elems_1s(env->vl, vd, desc, 1, esz, max_elems);
+}
+
+static void
+vext_ldst_stride_segment(void *vd, void *v0, target_ulong base,
+ target_ulong stride, CPURISCVState *env,
+ uint32_t desc, uint32_t vm,
+ vext_ldst_elem_fn *ldst_elem,
+ uint32_t log2_esz, uintptr_t ra)
{
uint32_t i, k;
uint32_t nf = vext_nf(desc);
@@ -236,8 +262,8 @@ void HELPER(NAME)(void *vd, void * v0, target_ulong base, \
uint32_t desc) \
{ \
uint32_t vm = vext_vm(desc); \
- vext_ldst_stride(vd, v0, base, stride, env, desc, vm, LOAD_FN, \
- ctzl(sizeof(ETYPE)), GETPC()); \
+ vext_ldst_stride_segment(vd, v0, base, stride, env, desc, vm, \
+ LOAD_FN, ctzl(sizeof(ETYPE)), GETPC()); \
}
GEN_VEXT_LD_STRIDE(vlse8_v, int8_t, lde_b)
@@ -251,8 +277,8 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base, \
uint32_t desc) \
{ \
uint32_t vm = vext_vm(desc); \
- vext_ldst_stride(vd, v0, base, stride, env, desc, vm, STORE_FN, \
- ctzl(sizeof(ETYPE)), GETPC()); \
+ vext_ldst_stride_segment(vd, v0, base, stride, env, desc, vm, \
+ STORE_FN, ctzl(sizeof(ETYPE)), GETPC()); \
}
GEN_VEXT_ST_STRIDE(vsse8_v, int8_t, ste_b)
@@ -269,6 +295,26 @@ static void
vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
uintptr_t ra)
+{
+ uint32_t i;
+ uint32_t max_elems = vext_max_elems(desc, log2_esz);
+ uint32_t esz = 1 << log2_esz;
+
+ /* load bytes from guest memory */
+ for (i = env->vstart; i < evl; i++, env->vstart++) {
+ target_ulong addr = base + (i << log2_esz);
+ ldst_elem(env, adjust_addr(env, addr), i, vd, ra);
+ }
+ env->vstart = 0;
+
+ vext_set_tail_elems_1s(evl, vd, desc, 1, esz, max_elems);
+}
+
+/* unmasked unit-stride segment load and store operation */
+static void
+vext_ldst_us_segment(void *vd, target_ulong base, CPURISCVState *env,
+ uint32_t desc, vext_ldst_elem_fn *ldst_elem,
+ uint32_t log2_esz, uint32_t evl, uintptr_t ra)
{
uint32_t i, k;
uint32_t nf = vext_nf(desc);
@@ -312,10 +358,27 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base, \
ctzl(sizeof(ETYPE)), env->vl, GETPC()); \
}
+#define GEN_VEXT_LD_US_SEG(NAME, ETYPE, LOAD_FN) \
+void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base, \
+ CPURISCVState *env, uint32_t desc) \
+{ \
+ uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE)); \
+ vext_ldst_stride_segment(vd, v0, base, stride, env, desc, false, \
+ LOAD_FN, ctzl(sizeof(ETYPE)), GETPC()); \
+} \
+ \
+void HELPER(NAME)(void *vd, void *v0, target_ulong base, \
+ CPURISCVState *env, uint32_t desc) \
+{ \
+ vext_ldst_us_segment(vd, base, env, desc, LOAD_FN, \
+ ctzl(sizeof(ETYPE)), env->vl, GETPC()); \
+}
+
GEN_VEXT_LD_US(vle8_v, int8_t, lde_b)
-GEN_VEXT_LD_US(vle16_v, int16_t, lde_h)
-GEN_VEXT_LD_US(vle32_v, int32_t, lde_w)
-GEN_VEXT_LD_US(vle64_v, int64_t, lde_d)
+GEN_VEXT_LD_US_SEG(vlsege8_v, int8_t, lde_b)
+GEN_VEXT_LD_US_SEG(vle16_v, int16_t, lde_h)
+GEN_VEXT_LD_US_SEG(vle32_v, int32_t, lde_w)
+GEN_VEXT_LD_US_SEG(vle64_v, int64_t, lde_d)
#define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN) \
void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base, \
@@ -333,10 +396,27 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base, \
ctzl(sizeof(ETYPE)), env->vl, GETPC()); \
}
+#define GEN_VEXT_ST_US_SEG(NAME, ETYPE, STORE_FN) \
+void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base, \
+ CPURISCVState *env, uint32_t desc) \
+{ \
+ uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE)); \
+ vext_ldst_stride_segment(vd, v0, base, stride, env, desc, false, \
+ STORE_FN, ctzl(sizeof(ETYPE)), GETPC()); \
+} \
+ \
+void HELPER(NAME)(void *vd, void *v0, target_ulong base, \
+ CPURISCVState *env, uint32_t desc) \
+{ \
+ vext_ldst_us_segment(vd, base, env, desc, STORE_FN, \
+ ctzl(sizeof(ETYPE)), env->vl, GETPC()); \
+}
+
GEN_VEXT_ST_US(vse8_v, int8_t, ste_b)
-GEN_VEXT_ST_US(vse16_v, int16_t, ste_h)
-GEN_VEXT_ST_US(vse32_v, int32_t, ste_w)
-GEN_VEXT_ST_US(vse64_v, int64_t, ste_d)
+GEN_VEXT_ST_US_SEG(vssege8_v, int8_t, ste_b)
+GEN_VEXT_ST_US_SEG(vse16_v, int16_t, ste_h)
+GEN_VEXT_ST_US_SEG(vse32_v, int32_t, ste_w)
+GEN_VEXT_ST_US_SEG(vse64_v, int64_t, ste_d)
/*
* unit stride mask load and store, EEW = 1
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH v2 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 1/6] target/riscv: Separate vector segment " Max Chou
@ 2024-05-31 17:44 ` Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
` (3 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv
Cc: dbarboza, Max Chou, Richard Henderson, Paolo Bonzini
If no QEMU plugin memory callbacks are registered, checking before
calling qemu_plugin_vcpu_mem_cb avoids the function call overhead.
Signed-off-by: Max Chou <max.chou@sifive.com>
---
accel/tcg/ldst_common.c.inc | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
index c82048e377e..87ceb954873 100644
--- a/accel/tcg/ldst_common.c.inc
+++ b/accel/tcg/ldst_common.c.inc
@@ -125,7 +125,9 @@ void helper_st_i128(CPUArchState *env, uint64_t addr, Int128 val, MemOpIdx oi)
static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
{
- qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+ if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+ qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+ }
}
uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
@@ -188,7 +190,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr,
static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
{
- qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+ if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+ qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+ }
}
void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH v2 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 1/6] target/riscv: Separate vector segment " Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
@ 2024-05-31 17:44 ` Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 4/6] target/riscv: Add check_probe_[read|write] helper functions Max Chou
` (2 subsequent siblings)
5 siblings, 0 replies; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv
Cc: dbarboza, Max Chou, Richard Henderson, Palmer Dabbelt,
Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei
In the vector unit-stride load/store helper functions, the vext_ldst_us
function accounts for most of the execution time. Inlining it and the
corresponding element access functions avoids the function call overhead
and improves the helper performance.
Signed-off-by: Max Chou <max.chou@sifive.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
---
target/riscv/vector_helper.c | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 440c33c141b..cb7267c3217 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, int index,
typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr,
uint32_t idx, void *vd, uintptr_t retaddr);
-#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \
-static void NAME(CPURISCVState *env, abi_ptr addr, \
- uint32_t idx, void *vd, uintptr_t retaddr)\
-{ \
- ETYPE *cur = ((ETYPE *)vd + H(idx)); \
- *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \
-} \
+#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \
+static inline QEMU_ALWAYS_INLINE \
+void NAME(CPURISCVState *env, abi_ptr addr, \
+ uint32_t idx, void *vd, uintptr_t retaddr) \
+{ \
+ ETYPE *cur = ((ETYPE *)vd + H(idx)); \
+ *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \
+} \
GEN_VEXT_LD_ELEM(lde_b, int8_t, H1, ldsb)
GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw)
GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl)
GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq)
-#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \
-static void NAME(CPURISCVState *env, abi_ptr addr, \
- uint32_t idx, void *vd, uintptr_t retaddr)\
-{ \
- ETYPE data = *((ETYPE *)vd + H(idx)); \
- cpu_##STSUF##_data_ra(env, addr, data, retaddr); \
+#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \
+static inline QEMU_ALWAYS_INLINE \
+void NAME(CPURISCVState *env, abi_ptr addr, \
+ uint32_t idx, void *vd, uintptr_t retaddr) \
+{ \
+ ETYPE data = *((ETYPE *)vd + H(idx)); \
+ cpu_##STSUF##_data_ra(env, addr, data, retaddr); \
}
GEN_VEXT_ST_ELEM(ste_b, int8_t, H1, stb)
@@ -291,7 +293,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
*/
/* unmasked unit-stride load and store operation */
-static void
+static inline QEMU_ALWAYS_INLINE void
vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
uintptr_t ra)
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH v2 4/6] target/riscv: Add check_probe_[read|write] helper functions
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
` (2 preceding siblings ...)
2024-05-31 17:44 ` [RFC PATCH v2 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
@ 2024-05-31 17:44 ` Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 6/6] target/riscv: rvv: Optimize vl8re8.v/vs8r.v " Max Chou
5 siblings, 0 replies; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv
Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng,
Weiwei Li, Liu Zhiwei
The helper_check_probe_[read|write] functions wrap the probe_pages
function so that virtual address resolution can be performed once for
contiguous vector load/store instructions.
Signed-off-by: Max Chou <max.chou@sifive.com>
---
target/riscv/helper.h | 4 ++++
target/riscv/vector_helper.c | 12 ++++++++++++
2 files changed, 16 insertions(+)
diff --git a/target/riscv/helper.h b/target/riscv/helper.h
index aaf68eadfb7..f4bc907e85f 100644
--- a/target/riscv/helper.h
+++ b/target/riscv/helper.h
@@ -1,6 +1,10 @@
/* Exceptions */
DEF_HELPER_2(raise_exception, noreturn, env, i32)
+/* Probe page */
+DEF_HELPER_FLAGS_3(check_probe_read, TCG_CALL_NO_WG, void, env, tl, tl)
+DEF_HELPER_FLAGS_3(check_probe_write, TCG_CALL_NO_WG, void, env, tl, tl)
+
/* Floating Point - rounding mode */
DEF_HELPER_FLAGS_2(set_rounding_mode, TCG_CALL_NO_WG, void, env, i32)
DEF_HELPER_FLAGS_2(set_rounding_mode_chkfrm, TCG_CALL_NO_WG, void, env, i32)
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index cb7267c3217..9263ab26b19 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -136,6 +136,18 @@ static void probe_pages(CPURISCVState *env, target_ulong addr,
}
}
+void HELPER(check_probe_read)(CPURISCVState *env, target_ulong addr,
+ target_ulong len)
+{
+ probe_pages(env, addr, len, GETPC(), MMU_DATA_LOAD);
+}
+
+void HELPER(check_probe_write)(CPURISCVState *env, target_ulong addr,
+ target_ulong len)
+{
+ probe_pages(env, addr, len, GETPC(), MMU_DATA_STORE);
+}
+
static inline void vext_set_elem_mask(void *v0, int index,
uint8_t value)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
` (3 preceding siblings ...)
2024-05-31 17:44 ` [RFC PATCH v2 4/6] target/riscv: Add check_probe_[read|write] helper functions Max Chou
@ 2024-05-31 17:44 ` Max Chou
2024-06-02 17:45 ` Richard Henderson
2024-05-31 17:44 ` [RFC PATCH v2 6/6] target/riscv: rvv: Optimize vl8re8.v/vs8r.v " Max Chou
5 siblings, 1 reply; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv
Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng,
Weiwei Li, Liu Zhiwei, Richard Henderson
The vector unit-stride load/store instructions (e.g. vle8.v/vse8.v)
perform contiguous loads/stores. We can replace the corresponding helper
functions with TCG ops that copy more data at a time, under the following
assumptions:
* Virtual address resolution is performed once for the entire vector at the beginning
* No masking
* No tail agnostic
* Both host and target are little endian
Signed-off-by: Max Chou <max.chou@sifive.com>
---
target/riscv/insn_trans/trans_rvv.c.inc | 197 +++++++++++++++++++++++-
1 file changed, 195 insertions(+), 2 deletions(-)
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 1e4fa797a86..bbac73bb12b 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -714,7 +714,105 @@ static bool ld_us_check(DisasContext *s, arg_r2nfvm* a, uint8_t eew)
vext_check_load(s, a->rd, a->nf, a->vm, eew);
}
-GEN_VEXT_TRANS(vle8_v, MO_8, r2nfvm, ld_us_op, ld_us_check)
+static bool trans_vle8_v(DisasContext *s, arg_r2nfvm * a)
+{
+
+ if (ld_us_check(s, a, MO_8)) {
+ if (!HOST_BIG_ENDIAN && s->vstart_eq_zero && s->vta == 0 && a->vm) {
+ uint32_t vofs = vreg_ofs(s, a->rd);
+ uint32_t midx = s->mem_idx;
+
+ TCGv_i64 t0, t1;
+ TCGv_i128 t16;
+ TCGv_ptr tp;
+ TCGv_ptr i = tcg_temp_new_ptr();
+ TCGv len_remain = tcg_temp_new();
+ TCGv rs1 = get_gpr(s, a->rs1, EXT_NONE);
+ TCGv addr = tcg_temp_new();
+
+ TCGLabel *loop_128 = gen_new_label();
+ TCGLabel *remain_64 = gen_new_label();
+ TCGLabel *remain_32 = gen_new_label();
+ TCGLabel *remain_16 = gen_new_label();
+ TCGLabel *remain_8 = gen_new_label();
+ TCGLabel *over = gen_new_label();
+
+ tcg_gen_mov_tl(addr, rs1);
+ tcg_gen_mov_tl(len_remain, cpu_vl);
+ tcg_gen_muli_tl(len_remain, len_remain, a->nf);
+ tcg_gen_movi_ptr(i, 0);
+
+ tcg_gen_brcond_tl(TCG_COND_GEU, cpu_vstart, cpu_vl, over);
+ gen_helper_check_probe_read(tcg_env, addr, len_remain);
+
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 16, remain_64);
+
+ gen_set_label(loop_128);
+
+ t16 = tcg_temp_new_i128();
+ tcg_gen_qemu_ld_i128(t16, addr, midx,
+ MO_LE | MO_128 | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 16);
+
+ tp = tcg_temp_new_ptr();
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 16);
+
+ t0 = tcg_temp_new_i64();
+ t1 = tcg_temp_new_i64();
+ tcg_gen_extr_i128_i64(t0, t1, t16);
+
+ tcg_gen_st_i64(t0, tp, vofs);
+ tcg_gen_st_i64(t1, tp, vofs + 8);
+ tcg_gen_subi_tl(len_remain, len_remain, 16);
+
+ tcg_gen_brcondi_tl(TCG_COND_GEU, len_remain, 16, loop_128);
+
+ gen_set_label(remain_64);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 8, remain_32);
+ tcg_gen_qemu_ld_i64(t0, addr, midx, MO_LEUQ | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 8);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 8);
+ tcg_gen_st_i64(t0, tp, vofs);
+ tcg_gen_subi_tl(len_remain, len_remain, 8);
+
+ gen_set_label(remain_32);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 4, remain_16);
+ tcg_gen_qemu_ld_i64(t0, addr, midx, MO_LEUL | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 4);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 4);
+ tcg_gen_st32_i64(t0, tp, vofs);
+ tcg_gen_subi_tl(len_remain, len_remain, 4);
+
+ gen_set_label(remain_16);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 2, remain_8);
+ tcg_gen_qemu_ld_i64(t0, addr, midx, MO_LEUW | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 2);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 2);
+ tcg_gen_st16_i64(t0, tp, vofs);
+ tcg_gen_subi_tl(len_remain, len_remain, 2);
+
+ gen_set_label(remain_8);
+ tcg_gen_brcondi_tl(TCG_COND_EQ, len_remain, 0, over);
+ tcg_gen_qemu_ld_i64(t0, addr, midx,
+ MO_LE | MO_8 | MO_ATOM_NONE);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_st8_i64(t0, tp, vofs);
+
+ gen_set_label(over);
+
+ finalize_rvv_inst(s);
+ } else {
+ return ld_us_op(s, a, MO_8);
+ }
+ return true;
+ }
+ return false;
+}
+
GEN_VEXT_TRANS(vle16_v, MO_16, r2nfvm, ld_us_op, ld_us_check)
GEN_VEXT_TRANS(vle32_v, MO_32, r2nfvm, ld_us_op, ld_us_check)
GEN_VEXT_TRANS(vle64_v, MO_64, r2nfvm, ld_us_op, ld_us_check)
@@ -785,7 +883,102 @@ static bool st_us_check(DisasContext *s, arg_r2nfvm* a, uint8_t eew)
vext_check_store(s, a->rd, a->nf, eew);
}
-GEN_VEXT_TRANS(vse8_v, MO_8, r2nfvm, st_us_op, st_us_check)
+static bool trans_vse8_v(DisasContext *s, arg_r2nfvm * a)
+{
+ if (st_us_check(s, a, MO_8)) {
+ if (!HOST_BIG_ENDIAN && s->vstart_eq_zero && s->vta == 0 && a->vm) {
+ uint32_t vofs = vreg_ofs(s, a->rd);
+ uint32_t midx = s->mem_idx;
+
+ TCGv_i64 t0, t1;
+ TCGv_i128 t16;
+ TCGv_ptr tp;
+ TCGv_ptr i = tcg_temp_new_ptr();
+ TCGv len_remain = tcg_temp_new();
+ TCGv rs1 = get_gpr(s, a->rs1, EXT_NONE);
+ TCGv addr = tcg_temp_new();
+
+ TCGLabel *loop_128 = gen_new_label();
+ TCGLabel *remain_64 = gen_new_label();
+ TCGLabel *remain_32 = gen_new_label();
+ TCGLabel *remain_16 = gen_new_label();
+ TCGLabel *remain_8 = gen_new_label();
+ TCGLabel *over = gen_new_label();
+
+ tcg_gen_mov_tl(addr, rs1);
+ tcg_gen_mov_tl(len_remain, cpu_vl);
+ tcg_gen_muli_tl(len_remain, len_remain, a->nf);
+ tcg_gen_movi_ptr(i, 0);
+
+ tcg_gen_brcond_tl(TCG_COND_GEU, cpu_vstart, cpu_vl, over);
+ gen_helper_check_probe_write(tcg_env, addr, len_remain);
+
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 16, remain_64);
+
+ gen_set_label(loop_128);
+
+ t0 = tcg_temp_new_i64();
+ t1 = tcg_temp_new_i64();
+ tp = tcg_temp_new_ptr();
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_ld_i64(t1, tp, vofs + 8);
+ tcg_gen_addi_ptr(i, i, 16);
+
+ t16 = tcg_temp_new_i128();
+ tcg_gen_concat_i64_i128(t16, t0, t1);
+
+ tcg_gen_qemu_st_i128(t16, addr, midx,
+ MO_LE | MO_128 | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 16);
+ tcg_gen_subi_tl(len_remain, len_remain, 16);
+
+ tcg_gen_brcondi_tl(TCG_COND_GEU, len_remain, 16, loop_128);
+
+ gen_set_label(remain_64);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 8, remain_32);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_addi_ptr(i, i, 8);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LEUQ | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 8);
+ tcg_gen_subi_tl(len_remain, len_remain, 8);
+
+ gen_set_label(remain_32);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 4, remain_16);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_addi_ptr(i, i, 4);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LEUL | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 4);
+ tcg_gen_subi_tl(len_remain, len_remain, 4);
+
+ gen_set_label(remain_16);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 2, remain_8);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_addi_ptr(i, i, 2);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LEUW | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 2);
+ tcg_gen_subi_tl(len_remain, len_remain, 2);
+
+ gen_set_label(remain_8);
+ tcg_gen_brcondi_tl(TCG_COND_EQ, len_remain, 0, over);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LE | MO_8 | MO_ATOM_NONE);
+
+ gen_set_label(over);
+
+ finalize_rvv_inst(s);
+ } else {
+ return st_us_op(s, a, MO_8);
+ }
+ return true;
+ }
+ return false;
+}
+
GEN_VEXT_TRANS(vse16_v, MO_16, r2nfvm, st_us_op, st_us_check)
GEN_VEXT_TRANS(vse32_v, MO_32, r2nfvm, st_us_op, st_us_check)
GEN_VEXT_TRANS(vse64_v, MO_64, r2nfvm, st_us_op, st_us_check)
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* [RFC PATCH v2 6/6] target/riscv: rvv: Optimize vl8re8.v/vs8r.v with limitations
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
` (4 preceding siblings ...)
2024-05-31 17:44 ` [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations Max Chou
@ 2024-05-31 17:44 ` Max Chou
5 siblings, 0 replies; 10+ messages in thread
From: Max Chou @ 2024-05-31 17:44 UTC (permalink / raw)
To: qemu-devel, qemu-riscv
Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng,
Weiwei Li, Liu Zhiwei, Richard Henderson
The vector whole register load/store instructions (e.g. vl8re8.v/vs8r.v)
perform unmasked contiguous loads/stores. We can optimize these
instructions by replacing the corresponding helper functions with TCG ops
that copy more data at a time, under the following assumption:
* Host and target are little endian
Signed-off-by: Max Chou <max.chou@sifive.com>
---
target/riscv/insn_trans/trans_rvv.c.inc | 196 +++++++++++++++++++++++-
1 file changed, 194 insertions(+), 2 deletions(-)
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index bbac73bb12b..44763ccec06 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -1402,11 +1402,108 @@ GEN_LDST_WHOLE_TRANS(vl4re8_v, 4)
GEN_LDST_WHOLE_TRANS(vl4re16_v, 4)
GEN_LDST_WHOLE_TRANS(vl4re32_v, 4)
GEN_LDST_WHOLE_TRANS(vl4re64_v, 4)
-GEN_LDST_WHOLE_TRANS(vl8re8_v, 8)
GEN_LDST_WHOLE_TRANS(vl8re16_v, 8)
GEN_LDST_WHOLE_TRANS(vl8re32_v, 8)
GEN_LDST_WHOLE_TRANS(vl8re64_v, 8)
+static bool trans_vl8re8_v(DisasContext *s, arg_r2 * a)
+{
+ if (require_rvv(s) && QEMU_IS_ALIGNED(a->rd, 8)) {
+ if (!HOST_BIG_ENDIAN && s->vstart_eq_zero) {
+ uint32_t vofs = vreg_ofs(s, a->rd);
+ uint32_t midx = s->mem_idx;
+ uint32_t evl = s->cfg_ptr->vlenb << 3;
+
+ TCGv_i64 t0, t1;
+ TCGv_i128 t16;
+ TCGv_ptr tp;
+ TCGv_ptr i = tcg_temp_new_ptr();
+ TCGv len_remain = tcg_temp_new();
+ TCGv rs1 = get_gpr(s, a->rs1, EXT_NONE);
+ TCGv addr = tcg_temp_new();
+
+ TCGLabel *loop_128 = gen_new_label();
+ TCGLabel *remain_64 = gen_new_label();
+ TCGLabel *remain_32 = gen_new_label();
+ TCGLabel *remain_16 = gen_new_label();
+ TCGLabel *remain_8 = gen_new_label();
+ TCGLabel *over = gen_new_label();
+
+ tcg_gen_mov_tl(addr, rs1);
+ tcg_gen_movi_tl(len_remain, evl);
+ tcg_gen_movi_ptr(i, 0);
+
+ tcg_gen_brcondi_tl(TCG_COND_GEU, cpu_vstart, evl, over);
+ gen_helper_check_probe_read(tcg_env, addr, len_remain);
+
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 16, remain_64);
+
+ gen_set_label(loop_128);
+
+ t16 = tcg_temp_new_i128();
+ tcg_gen_qemu_ld_i128(t16, addr, midx,
+ MO_LE | MO_128 | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 16);
+
+ tp = tcg_temp_new_ptr();
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 16);
+
+ t0 = tcg_temp_new_i64();
+ t1 = tcg_temp_new_i64();
+ tcg_gen_extr_i128_i64(t0, t1, t16);
+
+ tcg_gen_st_i64(t0, tp, vofs);
+ tcg_gen_st_i64(t1, tp, vofs + 8);
+ tcg_gen_subi_tl(len_remain, len_remain, 16);
+
+ tcg_gen_brcondi_tl(TCG_COND_GEU, len_remain, 16, loop_128);
+
+ gen_set_label(remain_64);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 8, remain_32);
+ tcg_gen_qemu_ld_i64(t0, addr, midx, MO_LEUQ | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 8);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 8);
+ tcg_gen_st_i64(t0, tp, vofs);
+ tcg_gen_subi_tl(len_remain, len_remain, 8);
+
+ gen_set_label(remain_32);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 4, remain_16);
+ tcg_gen_qemu_ld_i64(t0, addr, midx, MO_LEUL | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 4);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 4);
+ tcg_gen_st32_i64(t0, tp, vofs);
+ tcg_gen_subi_tl(len_remain, len_remain, 4);
+
+ gen_set_label(remain_16);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 2, remain_8);
+ tcg_gen_qemu_ld_i64(t0, addr, midx, MO_LEUW | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 2);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_addi_ptr(i, i, 2);
+ tcg_gen_st16_i64(t0, tp, vofs);
+ tcg_gen_subi_tl(len_remain, len_remain, 2);
+
+ gen_set_label(remain_8);
+ tcg_gen_brcondi_tl(TCG_COND_EQ, len_remain, 0, over);
+ tcg_gen_qemu_ld_i64(t0, addr, midx,
+ MO_LE | MO_8 | MO_ATOM_NONE);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_st8_i64(t0, tp, vofs);
+
+ gen_set_label(over);
+
+ finalize_rvv_inst(s);
+ } else {
+ return ldst_whole_trans(a->rd, a->rs1, 8, gen_helper_vl8re8_v, s);
+ }
+ return true;
+ }
+ return false;
+}
+
/*
* The vector whole register store instructions are encoded similar to
* unmasked unit-stride store of elements with EEW=8.
@@ -1414,7 +1511,102 @@ GEN_LDST_WHOLE_TRANS(vl8re64_v, 8)
GEN_LDST_WHOLE_TRANS(vs1r_v, 1)
GEN_LDST_WHOLE_TRANS(vs2r_v, 2)
GEN_LDST_WHOLE_TRANS(vs4r_v, 4)
-GEN_LDST_WHOLE_TRANS(vs8r_v, 8)
+
+static bool trans_vs8r_v(DisasContext *s, arg_r2 * a)
+{
+ if (require_rvv(s) && QEMU_IS_ALIGNED(a->rd, 8)) {
+ if (!HOST_BIG_ENDIAN && s->vstart_eq_zero) {
+ uint32_t vofs = vreg_ofs(s, a->rd);
+ uint32_t midx = s->mem_idx;
+ uint32_t evl = s->cfg_ptr->vlenb << 3;
+
+ TCGv_i64 t0, t1;
+ TCGv_i128 t16;
+ TCGv_ptr tp;
+ TCGv_ptr i = tcg_temp_new_ptr();
+ TCGv len_remain = tcg_temp_new();
+ TCGv rs1 = get_gpr(s, a->rs1, EXT_NONE);
+ TCGv addr = tcg_temp_new();
+
+ TCGLabel *loop_128 = gen_new_label();
+ TCGLabel *remain_64 = gen_new_label();
+ TCGLabel *remain_32 = gen_new_label();
+ TCGLabel *remain_16 = gen_new_label();
+ TCGLabel *remain_8 = gen_new_label();
+ TCGLabel *over = gen_new_label();
+
+ tcg_gen_mov_tl(addr, rs1);
+ tcg_gen_movi_tl(len_remain, evl);
+ tcg_gen_movi_ptr(i, 0);
+
+ tcg_gen_brcondi_tl(TCG_COND_GEU, cpu_vstart, evl, over);
+ gen_helper_check_probe_write(tcg_env, addr, len_remain);
+
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 16, remain_64);
+
+ gen_set_label(loop_128);
+
+ t0 = tcg_temp_new_i64();
+ t1 = tcg_temp_new_i64();
+ tp = tcg_temp_new_ptr();
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_ld_i64(t1, tp, vofs + 8);
+ tcg_gen_addi_ptr(i, i, 16);
+
+ t16 = tcg_temp_new_i128();
+ tcg_gen_concat_i64_i128(t16, t0, t1);
+
+ tcg_gen_qemu_st_i128(t16, addr, midx,
+ MO_LE | MO_128 | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 16);
+ tcg_gen_subi_tl(len_remain, len_remain, 16);
+
+ tcg_gen_brcondi_tl(TCG_COND_GEU, len_remain, 16, loop_128);
+
+ gen_set_label(remain_64);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 8, remain_32);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_addi_ptr(i, i, 8);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LEUQ | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 8);
+ tcg_gen_subi_tl(len_remain, len_remain, 8);
+
+ gen_set_label(remain_32);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 4, remain_16);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_addi_ptr(i, i, 4);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LEUL | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 4);
+ tcg_gen_subi_tl(len_remain, len_remain, 4);
+
+ gen_set_label(remain_16);
+ tcg_gen_brcondi_tl(TCG_COND_LTU, len_remain, 2, remain_8);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_addi_ptr(i, i, 2);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LEUW | MO_ATOM_NONE);
+ tcg_gen_addi_tl(addr, addr, 2);
+ tcg_gen_subi_tl(len_remain, len_remain, 2);
+
+ gen_set_label(remain_8);
+ tcg_gen_brcondi_tl(TCG_COND_EQ, len_remain, 0, over);
+ tcg_gen_add_ptr(tp, tcg_env, i);
+ tcg_gen_ld_i64(t0, tp, vofs);
+ tcg_gen_qemu_st_i64(t0, addr, midx, MO_LE | MO_8 | MO_ATOM_NONE);
+
+ gen_set_label(over);
+
+ finalize_rvv_inst(s);
+ } else {
+ return ldst_whole_trans(a->rd, a->rs1, 8, gen_helper_vs8r_v, s);
+ }
+ return true;
+ }
+ return false;
+}
/*
*** Vector Integer Arithmetic Instructions
--
2.34.1
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations
2024-05-31 17:44 ` [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations Max Chou
@ 2024-06-02 17:45 ` Richard Henderson
2024-06-03 15:50 ` Max Chou
0 siblings, 1 reply; 10+ messages in thread
From: Richard Henderson @ 2024-06-02 17:45 UTC (permalink / raw)
To: Max Chou, qemu-devel, qemu-riscv
Cc: dbarboza, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
Liu Zhiwei
On 5/31/24 12:44, Max Chou wrote:
> The vector unit-stride load/store instructions (e.g. vle8.v/vse8.v)
> perform continuous load/store. We can replace the corresponding helper
> functions by TCG ops to copy more data at a time with following
> assumptions:
>
> * Perform virtual address resolution once for entire vector at beginning
> * Without mask
> * Without tail agnostic
> * Both host and target are little endian
>
> Signed-off-by: Max Chou <max.chou@sifive.com>
Why are you generating all of this inline? This expansion is very large. I would expect
you to get better performance with a helper function.
AGAIN, please see the Arm implementation.
r~
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations
2024-06-02 17:45 ` Richard Henderson
@ 2024-06-03 15:50 ` Max Chou
2024-06-04 0:58 ` Richard Henderson
0 siblings, 1 reply; 10+ messages in thread
From: Max Chou @ 2024-06-03 15:50 UTC (permalink / raw)
To: Richard Henderson, qemu-devel, qemu-riscv
Cc: dbarboza, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
Liu Zhiwei
Hi Richard,
Thank you for your feedback.
This version was created by referencing the gen_sve_ldr translation
function, with similar assumptions: no mask (predication), no tail
agnostic, and contiguous load/store.
You are right, the expansion is large in this version (more than the 20
TCG instructions suggested in the tcg-op documentation).
I will provide the next version with a helper function implementation like
sve_ldN_r in the Arm target.
Thank you,
Max
On 2024/6/3 1:45 AM, Richard Henderson wrote:
> On 5/31/24 12:44, Max Chou wrote:
>> The vector unit-stride load/store instructions (e.g. vle8.v/vse8.v)
>> perform continuous load/store. We can replace the corresponding helper
>> functions by TCG ops to copy more data at a time with following
>> assumptions:
>>
>> * Perform virtual address resolution once for entire vector at beginning
>> * Without mask
>> * Without tail agnostic
>> * Both host and target are little endian
>>
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>
> Why are you generating all of this inline? This expansion is very
> large. I would expect you to get better performance with a helper
> function.
>
> AGAIN, please see the Arm implementation.
>
>
> r~
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations
2024-06-03 15:50 ` Max Chou
@ 2024-06-04 0:58 ` Richard Henderson
0 siblings, 0 replies; 10+ messages in thread
From: Richard Henderson @ 2024-06-04 0:58 UTC (permalink / raw)
To: Max Chou, qemu-devel, qemu-riscv
Cc: dbarboza, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
Liu Zhiwei
On 6/3/24 10:50, Max Chou wrote:
> Hi Richart,
>
> Thank you for your feedback.
> This version is created by referencing the gen_sve_ldr translation function with the
> similar assumptions that no mask(predication)/no tail agnostic/continuous load & store.
Except that gen_sve_ldr has a compile-time constant for the vector length, which is always
a multiple of 16, and so has no extra special cases like you needed.
r~
^ permalink raw reply [flat|nested] 10+ messages in thread
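As background for the follow-up promised above, a helper-function fast path
along the lines suggested in the review (modelled loosely on Arm's sve_ldN_r:
resolve the host address once, then copy in bulk) might look roughly like the
sketch below. The function name vext_ldst_us_host, its signature, and the
single-page restriction are illustrative assumptions only and are not part of
this series.

    /*
     * Hypothetical sketch: bulk unit-stride access through a host pointer.
     * probe_access() only covers one page, so this assumes the access does
     * not cross a page boundary; a real implementation would split the
     * access per page (as Arm's sve_ldN_r does) and would still need to
     * handle vstart, masking and tail policy before taking this path.
     */
    static void vext_ldst_us_host(CPURISCVState *env, void *vd,
                                  target_ulong base, uint32_t size,
                                  uintptr_t ra, bool is_load)
    {
        int mmu_index = riscv_env_mmu_index(env, false);
        void *host = probe_access(env, base, size,
                                  is_load ? MMU_DATA_LOAD : MMU_DATA_STORE,
                                  mmu_index, ra);

        if (host) {
            /* Backed by host RAM: one memcpy instead of per-element helpers. */
            if (is_load) {
                memcpy(vd, host, size);
            } else {
                memcpy(host, vd, size);
            }
        } else {
            /* MMIO, watchpoints, etc.: fall back to the per-element loop. */
        }
    }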
end of thread
Thread overview: 10+ messages
2024-05-31 17:44 [RFC PATCH v2 0/6] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 1/6] target/riscv: Separate vector segment " Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 4/6] target/riscv: Add check_probe_[read|write] helper functions Max Chou
2024-05-31 17:44 ` [RFC PATCH v2 5/6] target/riscv: rvv: Optimize v[l|s]e8.v with limitations Max Chou
2024-06-02 17:45 ` Richard Henderson
2024-06-03 15:50 ` Max Chou
2024-06-04 0:58 ` Richard Henderson
2024-05-31 17:44 ` [RFC PATCH v2 6/6] target/riscv: rvv: Optimize vl8re8.v/vs8r.v " Max Chou