* [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions @ 2024-02-15 19:28 Max Chou 2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Seperate vector segment " Max Chou ` (6 more replies) 0 siblings, 7 replies; 19+ messages in thread From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw) To: qemu-devel, qemu-riscv; +Cc: dbarboza, Max Chou Hi all, When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x slower than the scalar equivalent on QEMU, which hurts developer productivity. From the performance analysis result, we can observe that the glibc memcpy spends most of its time in the vector unit-stride load/store helper functions. Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664 Children Self Command Shared Object Symbol + 28.46% 27.85% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us + 26.92% 0.00% qemu-riscv64 [unknown] [.] 0x00000000000000ff + 14.41% 14.41% qemu-riscv64 qemu-riscv64 [.] qemu_plugin_vcpu_mem_cb + 13.85% 13.85% qemu-riscv64 qemu-riscv64 [.] lde_b + 13.64% 13.64% qemu-riscv64 qemu-riscv64 [.] cpu_stb_mmu + 9.25% 9.19% qemu-riscv64 qemu-riscv64 [.] cpu_ldb_mmu + 7.81% 7.81% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup + 7.70% 7.70% qemu-riscv64 qemu-riscv64 [.] ste_b + 5.53% 0.00% qemu-riscv64 qemu-riscv64 [.] adjust_addr (inlined) So this patchset tries to improve the performance of the RVV version of glibc memcpy on QEMU by improving the quality of the corresponding helper functions. The overall performance improvement reaches the following numbers (depending on the copy size). Average: 2.86X / Smallest: 1.15X / Largest: 4.49X PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions; the next version or next series will cover the remaining vector ld/st instructions. Regards, Max. [1] https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau.hsu@sifive.com Max Chou (6): target/riscv: Seperate vector segment ld/st instructions accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb target/riscv: Inline vext_ldst_us and coressponding function for performance accel/tcg: Inline cpu_mmu_lookup function accel/tcg: Inline do_ld1_mmu function accel/tcg: Inline do_st1_mmu function accel/tcg/ldst_common.c.inc | 40 ++++++-- accel/tcg/user-exec.c | 17 ++-- target/riscv/helper.h | 4 + target/riscv/insn32.decode | 11 +- target/riscv/insn_trans/trans_rvv.c.inc | 61 +++++++++++ target/riscv/vector_helper.c | 130 +++++++++++++++++++----- 6 files changed, 221 insertions(+), 42 deletions(-) -- 2.34.1 ^ permalink raw reply [flat|nested] 19+ messages in thread
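For context on why these particular symbols dominate: with the current helpers, every vle8.v/vse8.v element goes through the generic per-element loop and one out-of-line byte access. A simplified sketch of that path, drawn from the code touched by the patches below (not a proposed change; for EEW=8 the ldst_elem callback is lde_b or ste_b):

    /* Simplified per-element hot path behind the profile above (vext_ldst_us). */
    for (i = env->vstart; i < evl; i++, env->vstart++) {
        target_ulong addr = base + (i << log2_esz);
        ldst_elem(env, adjust_addr(env, addr), i, vd, ra);   /* lde_b / ste_b */
    }
    /* lde_b -> cpu_ldsb_data_ra -> cpu_ldb_mmu -> cpu_mmu_lookup (+ plugin callback)
     * ste_b -> cpu_stb_data_ra  -> cpu_stb_mmu -> cpu_mmu_lookup (+ plugin callback) */

Several of these steps are separate calls per byte copied, which is the overhead the patches below attack.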
* [RFC PATCH 1/6] target/riscv: Seperate vector segment ld/st instructions 2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou @ 2024-02-15 19:28 ` Max Chou 2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou ` (5 subsequent siblings) 6 siblings, 0 replies; 19+ messages in thread From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw) To: qemu-devel, qemu-riscv Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei, Richard Henderson, Junqiang Wang This commit seperate the helper function implementations of vector segment load/store instructions from other vector load/store instructions. This can improve performance by avoiding unnecessary segment operation when NF = 1. Signed-off-by: Max Chou <max.chou@sifive.com> --- target/riscv/helper.h | 4 + target/riscv/insn32.decode | 11 ++- target/riscv/insn_trans/trans_rvv.c.inc | 61 +++++++++++++++ target/riscv/vector_helper.c | 100 +++++++++++++++++++++--- 4 files changed, 164 insertions(+), 12 deletions(-) diff --git a/target/riscv/helper.h b/target/riscv/helper.h index 8a635238514..8b6ddc4cb88 100644 --- a/target/riscv/helper.h +++ b/target/riscv/helper.h @@ -157,18 +157,22 @@ DEF_HELPER_FLAGS_3(hyp_hsv_d, TCG_CALL_NO_WG, void, env, tl, tl) /* Vector functions */ DEF_HELPER_3(vsetvl, tl, env, tl, tl) DEF_HELPER_5(vle8_v, void, ptr, ptr, tl, env, i32) +DEF_HELPER_5(vlsege8_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle16_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle32_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle64_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle8_v_mask, void, ptr, ptr, tl, env, i32) +DEF_HELPER_5(vlsege8_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle16_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle32_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vle64_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse8_v, void, ptr, ptr, tl, env, i32) +DEF_HELPER_5(vssege8_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse16_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse32_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse64_v, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse8_v_mask, void, ptr, ptr, tl, env, i32) +DEF_HELPER_5(vssege8_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse16_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse32_v_mask, void, ptr, ptr, tl, env, i32) DEF_HELPER_5(vse64_v_mask, void, ptr, ptr, tl, env, i32) diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode index f22df04cfd1..0712e9f6314 100644 --- a/target/riscv/insn32.decode +++ b/target/riscv/insn32.decode @@ -77,6 +77,7 @@ @r2 ....... ..... ..... ... ..... ....... &r2 %rs1 %rd @r2_vm_1 ...... . ..... ..... ... ..... ....... &rmr vm=1 %rs2 %rd @r2_nfvm ... ... vm:1 ..... ..... ... ..... ....... &r2nfvm %nf %rs1 %rd +@r2_nf_1_vm ... ... vm:1 ..... ..... ... ..... ....... &r2nfvm nf=1 %rs1 %rd @r2_vm ...... vm:1 ..... ..... ... ..... ....... &rmr %rs2 %rd @r1_vm ...... vm:1 ..... ..... ... ..... ....... %rd @r_nfvm ... ... vm:1 ..... ..... ... ..... ....... &rnfvm %nf %rs2 %rs1 %rd @@ -349,11 +350,17 @@ hsv_d 0110111 ..... ..... 100 00000 1110011 @r2_s # *** Vector loads and stores are encoded within LOADFP/STORE-FP *** # Vector unit-stride load/store insns. -vle8_v ... 000 . 00000 ..... 000 ..... 0000111 @r2_nfvm +{ + vle8_v 000 000 . 00000 ..... 000 ..... 0000111 @r2_nf_1_vm + vlsege8_v ... 000 . 00000 ..... 000 ..... 
0000111 @r2_nfvm +} vle16_v ... 000 . 00000 ..... 101 ..... 0000111 @r2_nfvm vle32_v ... 000 . 00000 ..... 110 ..... 0000111 @r2_nfvm vle64_v ... 000 . 00000 ..... 111 ..... 0000111 @r2_nfvm -vse8_v ... 000 . 00000 ..... 000 ..... 0100111 @r2_nfvm +{ + vse8_v 000 000 . 00000 ..... 000 ..... 0100111 @r2_nf_1_vm + vssege8_v ... 000 . 00000 ..... 000 ..... 0100111 @r2_nfvm +} vse16_v ... 000 . 00000 ..... 101 ..... 0100111 @r2_nfvm vse32_v ... 000 . 00000 ..... 110 ..... 0100111 @r2_nfvm vse64_v ... 000 . 00000 ..... 111 ..... 0100111 @r2_nfvm diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc index 9e101ab4343..04fc6329359 100644 --- a/target/riscv/insn_trans/trans_rvv.c.inc +++ b/target/riscv/insn_trans/trans_rvv.c.inc @@ -690,6 +690,40 @@ GEN_VEXT_TRANS(vle16_v, MO_16, r2nfvm, ld_us_op, ld_us_check) GEN_VEXT_TRANS(vle32_v, MO_32, r2nfvm, ld_us_op, ld_us_check) GEN_VEXT_TRANS(vle64_v, MO_64, r2nfvm, ld_us_op, ld_us_check) +static bool ld_us_seg_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew) +{ + uint32_t data = 0; + gen_helper_ldst_us *fn; + static gen_helper_ldst_us * const fns[2][4] = { + /* masked unit stride load */ + { gen_helper_vlsege8_v_mask, gen_helper_vle16_v_mask, + gen_helper_vle32_v_mask, gen_helper_vle64_v_mask }, + /* unmasked unit stride load */ + { gen_helper_vlsege8_v, gen_helper_vle16_v, + gen_helper_vle32_v, gen_helper_vle64_v } + }; + + fn = fns[a->vm][eew]; + if (fn == NULL) { + return false; + } + + /* + * Vector load/store instructions have the EEW encoded + * directly in the instructions. The maximum vector size is + * calculated with EMUL rather than LMUL. + */ + uint8_t emul = vext_get_emul(s, eew); + data = FIELD_DP32(data, VDATA, VM, a->vm); + data = FIELD_DP32(data, VDATA, LMUL, emul); + data = FIELD_DP32(data, VDATA, NF, a->nf); + data = FIELD_DP32(data, VDATA, VTA, s->vta); + data = FIELD_DP32(data, VDATA, VMA, s->vma); + return ldst_us_trans(a->rd, a->rs1, data, fn, s, false); +} + +GEN_VEXT_TRANS(vlsege8_v, MO_8, r2nfvm, ld_us_seg_op, ld_us_check) + static bool st_us_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew) { uint32_t data = 0; @@ -727,6 +761,33 @@ GEN_VEXT_TRANS(vse16_v, MO_16, r2nfvm, st_us_op, st_us_check) GEN_VEXT_TRANS(vse32_v, MO_32, r2nfvm, st_us_op, st_us_check) GEN_VEXT_TRANS(vse64_v, MO_64, r2nfvm, st_us_op, st_us_check) +static bool st_us_seg_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew) +{ + uint32_t data = 0; + gen_helper_ldst_us *fn; + static gen_helper_ldst_us * const fns[2][4] = { + /* masked unit stride store */ + { gen_helper_vssege8_v_mask, gen_helper_vse16_v_mask, + gen_helper_vse32_v_mask, gen_helper_vse64_v_mask }, + /* unmasked unit stride store */ + { gen_helper_vssege8_v, gen_helper_vse16_v, + gen_helper_vse32_v, gen_helper_vse64_v } + }; + + fn = fns[a->vm][eew]; + if (fn == NULL) { + return false; + } + + uint8_t emul = vext_get_emul(s, eew); + data = FIELD_DP32(data, VDATA, VM, a->vm); + data = FIELD_DP32(data, VDATA, LMUL, emul); + data = FIELD_DP32(data, VDATA, NF, a->nf); + return ldst_us_trans(a->rd, a->rs1, data, fn, s, true); +} + +GEN_VEXT_TRANS(vssege8_v, MO_8, r2nfvm, st_us_seg_op, st_us_check) + /* *** unit stride mask load and store */ diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c index 84cec73eb20..e8fbb921449 100644 --- a/target/riscv/vector_helper.c +++ b/target/riscv/vector_helper.c @@ -201,6 +201,32 @@ vext_ldst_stride(void *vd, void *v0, target_ulong base, uint32_t desc, uint32_t vm, vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, 
uintptr_t ra) +{ + uint32_t i; + uint32_t max_elems = vext_max_elems(desc, log2_esz); + uint32_t esz = 1 << log2_esz; + uint32_t vma = vext_vma(desc); + + for (i = env->vstart; i < env->vl; i++, env->vstart++) { + if (!vm && !vext_elem_mask(v0, i)) { + /* set masked-off elements to 1s */ + vext_set_elems_1s(vd, vma, i * esz, (i + 1) * esz); + continue; + } + target_ulong addr = base + stride * i; + ldst_elem(env, adjust_addr(env, addr), i, vd, ra); + } + env->vstart = 0; + + vext_set_tail_elems_1s(env->vl, vd, desc, 1, esz, max_elems); +} + +static void +vext_ldst_stride_segment(void *vd, void *v0, target_ulong base, + target_ulong stride, CPURISCVState *env, + uint32_t desc, uint32_t vm, + vext_ldst_elem_fn *ldst_elem, + uint32_t log2_esz, uintptr_t ra) { uint32_t i, k; uint32_t nf = vext_nf(desc); @@ -234,8 +260,8 @@ void HELPER(NAME)(void *vd, void * v0, target_ulong base, \ uint32_t desc) \ { \ uint32_t vm = vext_vm(desc); \ - vext_ldst_stride(vd, v0, base, stride, env, desc, vm, LOAD_FN, \ - ctzl(sizeof(ETYPE)), GETPC()); \ + vext_ldst_stride_segment(vd, v0, base, stride, env, desc, vm, \ + LOAD_FN, ctzl(sizeof(ETYPE)), GETPC()); \ } GEN_VEXT_LD_STRIDE(vlse8_v, int8_t, lde_b) @@ -249,8 +275,8 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base, \ uint32_t desc) \ { \ uint32_t vm = vext_vm(desc); \ - vext_ldst_stride(vd, v0, base, stride, env, desc, vm, STORE_FN, \ - ctzl(sizeof(ETYPE)), GETPC()); \ + vext_ldst_stride_segment(vd, v0, base, stride, env, desc, vm, \ + STORE_FN, ctzl(sizeof(ETYPE)), GETPC()); \ } GEN_VEXT_ST_STRIDE(vsse8_v, int8_t, ste_b) @@ -267,6 +293,26 @@ static void vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc, vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl, uintptr_t ra) +{ + uint32_t i; + uint32_t max_elems = vext_max_elems(desc, log2_esz); + uint32_t esz = 1 << log2_esz; + + /* load bytes from guest memory */ + for (i = env->vstart; i < evl; i++, env->vstart++) { + target_ulong addr = base + (i << log2_esz); + ldst_elem(env, adjust_addr(env, addr), i, vd, ra); + } + env->vstart = 0; + + vext_set_tail_elems_1s(evl, vd, desc, 1, esz, max_elems); +} + +/* unmasked unit-stride segment load and store operation */ +static void +vext_ldst_us_segment(void *vd, target_ulong base, CPURISCVState *env, + uint32_t desc, vext_ldst_elem_fn *ldst_elem, + uint32_t log2_esz, uint32_t evl, uintptr_t ra) { uint32_t i, k; uint32_t nf = vext_nf(desc); @@ -308,10 +354,27 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base, \ ctzl(sizeof(ETYPE)), env->vl, GETPC()); \ } +#define GEN_VEXT_LD_US_SEG(NAME, ETYPE, LOAD_FN) \ +void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base, \ + CPURISCVState *env, uint32_t desc) \ +{ \ + uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE)); \ + vext_ldst_stride_segment(vd, v0, base, stride, env, desc, false, \ + LOAD_FN, ctzl(sizeof(ETYPE)), GETPC()); \ +} \ + \ +void HELPER(NAME)(void *vd, void *v0, target_ulong base, \ + CPURISCVState *env, uint32_t desc) \ +{ \ + vext_ldst_us_segment(vd, base, env, desc, LOAD_FN, \ + ctzl(sizeof(ETYPE)), env->vl, GETPC()); \ +} + GEN_VEXT_LD_US(vle8_v, int8_t, lde_b) -GEN_VEXT_LD_US(vle16_v, int16_t, lde_h) -GEN_VEXT_LD_US(vle32_v, int32_t, lde_w) -GEN_VEXT_LD_US(vle64_v, int64_t, lde_d) +GEN_VEXT_LD_US_SEG(vlsege8_v, int8_t, lde_b) +GEN_VEXT_LD_US_SEG(vle16_v, int16_t, lde_h) +GEN_VEXT_LD_US_SEG(vle32_v, int32_t, lde_w) +GEN_VEXT_LD_US_SEG(vle64_v, int64_t, lde_d) #define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN) \ void HELPER(NAME##_mask)(void *vd, 
void *v0, target_ulong base, \ @@ -329,10 +392,27 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base, \ ctzl(sizeof(ETYPE)), env->vl, GETPC()); \ } +#define GEN_VEXT_ST_US_SEG(NAME, ETYPE, STORE_FN) \ +void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base, \ + CPURISCVState *env, uint32_t desc) \ +{ \ + uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE)); \ + vext_ldst_stride_segment(vd, v0, base, stride, env, desc, false, \ + STORE_FN, ctzl(sizeof(ETYPE)), GETPC()); \ +} \ + \ +void HELPER(NAME)(void *vd, void *v0, target_ulong base, \ + CPURISCVState *env, uint32_t desc) \ +{ \ + vext_ldst_us_segment(vd, base, env, desc, STORE_FN, \ + ctzl(sizeof(ETYPE)), env->vl, GETPC()); \ +} + GEN_VEXT_ST_US(vse8_v, int8_t, ste_b) -GEN_VEXT_ST_US(vse16_v, int16_t, ste_h) -GEN_VEXT_ST_US(vse32_v, int32_t, ste_w) -GEN_VEXT_ST_US(vse64_v, int64_t, ste_d) +GEN_VEXT_ST_US_SEG(vssege8_v, int8_t, ste_b) +GEN_VEXT_ST_US_SEG(vse16_v, int16_t, ste_h) +GEN_VEXT_ST_US_SEG(vse32_v, int32_t, ste_w) +GEN_VEXT_ST_US_SEG(vse64_v, int64_t, ste_d) /* * unit stride mask load and store, EEW = 1 -- 2.34.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
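A note on the insn32.decode change in this patch: the new { } braces form a decodetree overlap group, so the first pattern, which pins the nf field to 000 (i.e. NF = 1), is matched first and only larger NF values fall through to the segment pattern. A hypothetical hand-written equivalent of the generated dispatch for the load encoding (extract32() and the trans_* names are used purely for illustration):

    /* insn[31:29] is the nf field; 0 encodes NF = 1 (no segments). */
    if (extract32(insn, 29, 3) == 0) {
        return trans_vle8_v(ctx, &a);      /* plain unit-stride load, no nf loop */
    } else {
        return trans_vlsege8_v(ctx, &a);   /* segment load keeps the nf loop */
    }

The store side (vse8_v vs. vssege8_v) is split the same way.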
* [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb 2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou 2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Seperate vector segment " Max Chou @ 2024-02-15 19:28 ` Max Chou 2024-02-15 20:03 ` Richard Henderson 2024-02-15 20:21 ` Daniel Henrique Barboza 2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance Max Chou ` (4 subsequent siblings) 6 siblings, 2 replies; 19+ messages in thread From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw) To: qemu-devel, qemu-riscv Cc: dbarboza, Max Chou, Richard Henderson, Paolo Bonzini If there are not any QEMU plugin memory callback functions, checking before calling the qemu_plugin_vcpu_mem_cb function can reduce the function call overhead. Signed-off-by: Max Chou <max.chou@sifive.com> --- accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++---------- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc index c82048e377e..bf24986c562 100644 --- a/accel/tcg/ldst_common.c.inc +++ b/accel/tcg/ldst_common.c.inc @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra) tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB); ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); - plugin_load_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_load_cb(env, addr, oi); + } return ret; } @@ -145,7 +147,9 @@ uint16_t cpu_ldw_mmu(CPUArchState *env, abi_ptr addr, tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16); ret = do_ld2_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); - plugin_load_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_load_cb(env, addr, oi); + } return ret; } @@ -156,7 +160,9 @@ uint32_t cpu_ldl_mmu(CPUArchState *env, abi_ptr addr, tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32); ret = do_ld4_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); - plugin_load_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_load_cb(env, addr, oi); + } return ret; } @@ -167,7 +173,9 @@ uint64_t cpu_ldq_mmu(CPUArchState *env, abi_ptr addr, tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64); ret = do_ld8_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); - plugin_load_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_load_cb(env, addr, oi); + } return ret; } @@ -178,7 +186,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr, tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128); ret = do_ld16_mmu(env_cpu(env), addr, oi, ra); - plugin_load_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_load_cb(env, addr, oi); + } return ret; } @@ -195,7 +205,9 @@ void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val, MemOpIdx oi, uintptr_t retaddr) { helper_stb_mmu(env, addr, val, oi, retaddr); - plugin_store_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_store_cb(env, addr, oi); + } } void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val, @@ -203,7 +215,9 @@ void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val, { tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16); do_st2_mmu(env_cpu(env), addr, val, oi, retaddr); - plugin_store_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_store_cb(env, addr, oi); + } } void 
cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val, @@ -211,7 +225,9 @@ void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val, { tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32); do_st4_mmu(env_cpu(env), addr, val, oi, retaddr); - plugin_store_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_store_cb(env, addr, oi); + } } void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val, @@ -219,7 +235,9 @@ void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val, { tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64); do_st8_mmu(env_cpu(env), addr, val, oi, retaddr); - plugin_store_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_store_cb(env, addr, oi); + } } void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val, @@ -227,7 +245,9 @@ void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val, { tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128); do_st16_mmu(env_cpu(env), addr, val, oi, retaddr); - plugin_store_cb(env, addr, oi); + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { + plugin_store_cb(env, addr, oi); + } } /* -- 2.34.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb 2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou @ 2024-02-15 20:03 ` Richard Henderson 2024-02-17 9:08 ` Max Chou 2024-02-15 20:21 ` Daniel Henrique Barboza 1 sibling, 1 reply; 19+ messages in thread From: Richard Henderson @ 2024-02-15 20:03 UTC (permalink / raw) To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza, Paolo Bonzini On 2/15/24 09:28, Max Chou wrote: > If there are not any QEMU plugin memory callback functions, checking > before calling the qemu_plugin_vcpu_mem_cb function can reduce the > function call overhead. > > Signed-off-by: Max Chou <max.chou@sifive.com> > --- > accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++---------- > 1 file changed, 30 insertions(+), 10 deletions(-) > > diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc > index c82048e377e..bf24986c562 100644 > --- a/accel/tcg/ldst_common.c.inc > +++ b/accel/tcg/ldst_common.c.inc > @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra) > > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB); > ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); > - plugin_load_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_load_cb(env, addr, oi); > + } > return ret; > } Rather than repeating N times, modify plugin_load_cb and plugin_store_cb just above to avoid the call to cpu_plugin_mem_cbs_enabled(). I expect the compiler is inlining those functions already. r~ ^ permalink raw reply [flat|nested] 19+ messages in thread
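A minimal sketch of that suggestion, assuming plugin_load_cb keeps its current arguments in accel/tcg/ldst_common.c.inc (the store side would gain the same check with the write variant of the callback):

    /* Hoist the enablement check into the common wrapper so the
     * cpu_ld*_mmu/cpu_st*_mmu call sites stay unchanged. */
    static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
    {
        if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
            qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
        }
    }

Since the wrapper is expected to be inlined into its callers, the fast path reduces to a single flag test before the now-skipped callback.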
* Re: [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb 2024-02-15 20:03 ` Richard Henderson @ 2024-02-17 9:08 ` Max Chou 0 siblings, 0 replies; 19+ messages in thread From: Max Chou @ 2024-02-17 9:08 UTC (permalink / raw) To: Richard Henderson, qemu-devel, qemu-riscv; +Cc: dbarboza, Paolo Bonzini Hi Richard, Thank you for the suggestion. I'll do a v2 with this. Thanks, Max On 2024/2/16 4:03 AM, Richard Henderson wrote: > On 2/15/24 09:28, Max Chou wrote: >> If there are not any QEMU plugin memory callback functions, checking >> before calling the qemu_plugin_vcpu_mem_cb function can reduce the >> function call overhead. >> >> Signed-off-by: Max Chou <max.chou@sifive.com> >> --- >> accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++---------- >> 1 file changed, 30 insertions(+), 10 deletions(-) >> >> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc >> index c82048e377e..bf24986c562 100644 >> --- a/accel/tcg/ldst_common.c.inc >> +++ b/accel/tcg/ldst_common.c.inc >> @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr >> addr, MemOpIdx oi, uintptr_t ra) >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB); >> ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); >> - plugin_load_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_load_cb(env, addr, oi); >> + } >> return ret; >> } > > Rather than repeating N times, modify plugin_load_cb and > plugin_store_cb just above to avoid the call to > cpu_plugin_mem_cbs_enabled(). I expect the compiler is inlining those > functions already. > > > r~ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb 2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou 2024-02-15 20:03 ` Richard Henderson @ 2024-02-15 20:21 ` Daniel Henrique Barboza 2024-02-17 9:45 ` Max Chou 1 sibling, 1 reply; 19+ messages in thread From: Daniel Henrique Barboza @ 2024-02-15 20:21 UTC (permalink / raw) To: Max Chou, qemu-devel, qemu-riscv; +Cc: Richard Henderson, Paolo Bonzini On 2/15/24 16:28, Max Chou wrote: > If there are not any QEMU plugin memory callback functions, checking > before calling the qemu_plugin_vcpu_mem_cb function can reduce the > function call overhead. > > Signed-off-by: Max Chou <max.chou@sifive.com> > --- This was in my TODO list for some time. Thanks for looking it up. Can't we avoid all callbacks, not just qemu_plugin_vcpu_mem_cb, if there's no plugin loaded? The performance increase when building with --disable-plugins shouldn't be a thing - if the user isn't using plug-ins it should have a penalty to it. Thanks, Daniel > accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++---------- > 1 file changed, 30 insertions(+), 10 deletions(-) > > diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc > index c82048e377e..bf24986c562 100644 > --- a/accel/tcg/ldst_common.c.inc > +++ b/accel/tcg/ldst_common.c.inc > @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra) > > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB); > ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); > - plugin_load_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_load_cb(env, addr, oi); > + } > return ret; > } > > @@ -145,7 +147,9 @@ uint16_t cpu_ldw_mmu(CPUArchState *env, abi_ptr addr, > > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16); > ret = do_ld2_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); > - plugin_load_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_load_cb(env, addr, oi); > + } > return ret; > } > > @@ -156,7 +160,9 @@ uint32_t cpu_ldl_mmu(CPUArchState *env, abi_ptr addr, > > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32); > ret = do_ld4_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); > - plugin_load_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_load_cb(env, addr, oi); > + } > return ret; > } > > @@ -167,7 +173,9 @@ uint64_t cpu_ldq_mmu(CPUArchState *env, abi_ptr addr, > > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64); > ret = do_ld8_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); > - plugin_load_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_load_cb(env, addr, oi); > + } > return ret; > } > > @@ -178,7 +186,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr, > > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128); > ret = do_ld16_mmu(env_cpu(env), addr, oi, ra); > - plugin_load_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_load_cb(env, addr, oi); > + } > return ret; > } > > @@ -195,7 +205,9 @@ void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val, > MemOpIdx oi, uintptr_t retaddr) > { > helper_stb_mmu(env, addr, val, oi, retaddr); > - plugin_store_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_store_cb(env, addr, oi); > + } > } > > void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val, > @@ -203,7 +215,9 @@ void 
cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val, > { > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16); > do_st2_mmu(env_cpu(env), addr, val, oi, retaddr); > - plugin_store_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_store_cb(env, addr, oi); > + } > } > > void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val, > @@ -211,7 +225,9 @@ void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val, > { > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32); > do_st4_mmu(env_cpu(env), addr, val, oi, retaddr); > - plugin_store_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_store_cb(env, addr, oi); > + } > } > > void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val, > @@ -219,7 +235,9 @@ void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val, > { > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64); > do_st8_mmu(env_cpu(env), addr, val, oi, retaddr); > - plugin_store_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_store_cb(env, addr, oi); > + } > } > > void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val, > @@ -227,7 +245,9 @@ void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val, > { > tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128); > do_st16_mmu(env_cpu(env), addr, val, oi, retaddr); > - plugin_store_cb(env, addr, oi); > + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { > + plugin_store_cb(env, addr, oi); > + } > } > > /* ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb 2024-02-15 20:21 ` Daniel Henrique Barboza @ 2024-02-17 9:45 ` Max Chou 0 siblings, 0 replies; 19+ messages in thread From: Max Chou @ 2024-02-17 9:45 UTC (permalink / raw) To: Daniel Henrique Barboza, qemu-devel, qemu-riscv Cc: Richard Henderson, Paolo Bonzini Hi Daniel, I think the the overhead from the plugin callbacks depending on the type of plugin callbacks. Because there is only qemu_plugin_vcpu_mem_cb in the perf report of the glibc memcpy benchtest, so I just looked it up for the the memory callbacks in this RFC. I'll try to get more experiment results to check the status of other plugin callbacks. Thanks, Max On 2024/2/16 4:21 AM, Daniel Henrique Barboza wrote: > > > On 2/15/24 16:28, Max Chou wrote: >> If there are not any QEMU plugin memory callback functions, checking >> before calling the qemu_plugin_vcpu_mem_cb function can reduce the >> function call overhead. >> >> Signed-off-by: Max Chou <max.chou@sifive.com> >> --- > > This was in my TODO list for some time. Thanks for looking it up. > > Can't we avoid all callbacks, not just qemu_plugin_vcpu_mem_cb, if > there's no > plugin loaded? The performance increase when building with > --disable-plugins > shouldn't be a thing - if the user isn't using plug-ins it should have a > penalty to it. > > Thanks, > > Daniel > >> accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++---------- >> 1 file changed, 30 insertions(+), 10 deletions(-) >> >> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc >> index c82048e377e..bf24986c562 100644 >> --- a/accel/tcg/ldst_common.c.inc >> +++ b/accel/tcg/ldst_common.c.inc >> @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr >> addr, MemOpIdx oi, uintptr_t ra) >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB); >> ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); >> - plugin_load_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_load_cb(env, addr, oi); >> + } >> return ret; >> } >> @@ -145,7 +147,9 @@ uint16_t cpu_ldw_mmu(CPUArchState *env, abi_ptr >> addr, >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16); >> ret = do_ld2_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); >> - plugin_load_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_load_cb(env, addr, oi); >> + } >> return ret; >> } >> @@ -156,7 +160,9 @@ uint32_t cpu_ldl_mmu(CPUArchState *env, abi_ptr >> addr, >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32); >> ret = do_ld4_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); >> - plugin_load_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_load_cb(env, addr, oi); >> + } >> return ret; >> } >> @@ -167,7 +173,9 @@ uint64_t cpu_ldq_mmu(CPUArchState *env, abi_ptr >> addr, >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64); >> ret = do_ld8_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD); >> - plugin_load_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_load_cb(env, addr, oi); >> + } >> return ret; >> } >> @@ -178,7 +186,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr >> addr, >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128); >> ret = do_ld16_mmu(env_cpu(env), addr, oi, ra); >> - plugin_load_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_load_cb(env, addr, oi); >> + } >> return ret; >> } >> @@ -195,7 +205,9 @@ void cpu_stb_mmu(CPUArchState *env, abi_ptr 
>> addr, uint8_t val, >> MemOpIdx oi, uintptr_t retaddr) >> { >> helper_stb_mmu(env, addr, val, oi, retaddr); >> - plugin_store_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_store_cb(env, addr, oi); >> + } >> } >> void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val, >> @@ -203,7 +215,9 @@ void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, >> uint16_t val, >> { >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16); >> do_st2_mmu(env_cpu(env), addr, val, oi, retaddr); >> - plugin_store_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_store_cb(env, addr, oi); >> + } >> } >> void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val, >> @@ -211,7 +225,9 @@ void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, >> uint32_t val, >> { >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32); >> do_st4_mmu(env_cpu(env), addr, val, oi, retaddr); >> - plugin_store_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_store_cb(env, addr, oi); >> + } >> } >> void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val, >> @@ -219,7 +235,9 @@ void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, >> uint64_t val, >> { >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64); >> do_st8_mmu(env_cpu(env), addr, val, oi, retaddr); >> - plugin_store_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_store_cb(env, addr, oi); >> + } >> } >> void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val, >> @@ -227,7 +245,9 @@ void cpu_st16_mmu(CPUArchState *env, abi_ptr >> addr, Int128 val, >> { >> tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128); >> do_st16_mmu(env_cpu(env), addr, val, oi, retaddr); >> - plugin_store_cb(env, addr, oi); >> + if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) { >> + plugin_store_cb(env, addr, oi); >> + } >> } >> /* ^ permalink raw reply [flat|nested] 19+ messages in thread
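For reference on why the check is cheap, and on the --disable-plugins point raised above: cpu_plugin_mem_cbs_enabled() is a small inline predicate. The sketch below only paraphrases its shape rather than quoting the exact definition, and the field holding the per-vCPU memory-callback list is an assumption that depends on the QEMU version:

    /* Paraphrased shape of the predicate used by this patch: with
     * CONFIG_PLUGIN disabled it is constant false, so both the branch and
     * the callback vanish at compile time; with plugins enabled but no
     * memory callbacks registered it is a single pointer test. */
    static inline bool cpu_plugin_mem_cbs_enabled(const CPUState *cpu)
    {
    #ifdef CONFIG_PLUGIN
        return cpu->plugin_mem_cbs != NULL;   /* assumed field name */
    #else
        return false;
    #endif
    }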
* [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance 2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou 2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Seperate vector segment " Max Chou 2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou @ 2024-02-15 19:28 ` Max Chou 2024-02-15 20:09 ` Richard Henderson 2024-02-15 21:11 ` Daniel Henrique Barboza 2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou ` (3 subsequent siblings) 6 siblings, 2 replies; 19+ messages in thread From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw) To: qemu-devel, qemu-riscv Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei In the vector unit-stride load/store helper functions, the vext_ldst_us function accounts for most of the execution time. Inlining vext_ldst_us and the element load/store functions it calls avoids the function call overhead and improves the helper function performance. Signed-off-by: Max Chou <max.chou@sifive.com> --- target/riscv/vector_helper.c | 30 ++++++++++++++++-------------- 1 file changed, 16 insertions(+), 14 deletions(-) diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c index e8fbb921449..866f77d321d 100644 --- a/target/riscv/vector_helper.c +++ b/target/riscv/vector_helper.c @@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, int index, typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr, uint32_t idx, void *vd, uintptr_t retaddr); -#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \ -static void NAME(CPURISCVState *env, abi_ptr addr, \ - uint32_t idx, void *vd, uintptr_t retaddr)\ -{ \ - ETYPE *cur = ((ETYPE *)vd + H(idx)); \ - *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \ -} \ +#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \ +static inline QEMU_ALWAYS_INLINE \ +void NAME(CPURISCVState *env, abi_ptr addr, \ + uint32_t idx, void *vd, uintptr_t retaddr) \ +{ \ + ETYPE *cur = ((ETYPE *)vd + H(idx)); \ + *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \ +} \ GEN_VEXT_LD_ELEM(lde_b, int8_t, H1, ldsb) GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw) GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl) GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq) -#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \ -static void NAME(CPURISCVState *env, abi_ptr addr, \ - uint32_t idx, void *vd, uintptr_t retaddr)\ -{ \ - ETYPE data = *((ETYPE *)vd + H(idx)); \ - cpu_##STSUF##_data_ra(env, addr, data, retaddr); \ +#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \ +static inline QEMU_ALWAYS_INLINE \ +void NAME(CPURISCVState *env, abi_ptr addr, \ + uint32_t idx, void *vd, uintptr_t retaddr) \ +{ \ + ETYPE data = *((ETYPE *)vd + H(idx)); \ + cpu_##STSUF##_data_ra(env, addr, data, retaddr); \ } GEN_VEXT_ST_ELEM(ste_b, int8_t, H1, stb) @@ -289,7 +291,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d) */ /* unmasked unit-stride load and store operation */ -static void +static inline QEMU_ALWAYS_INLINE void vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc, vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl, uintptr_t ra) -- 2.34.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance 2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance Max Chou @ 2024-02-15 20:09 ` Richard Henderson 2024-02-15 21:11 ` Daniel Henrique Barboza 1 sibling, 0 replies; 19+ messages in thread From: Richard Henderson @ 2024-02-15 20:09 UTC (permalink / raw) To: Max Chou, qemu-devel, qemu-riscv Cc: dbarboza, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei On 2/15/24 09:28, Max Chou wrote: > /* unmasked unit-stride load and store operation */ > -static void > +static inline QEMU_ALWAYS_INLINE void > vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc, > vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl, > uintptr_t ra) Yes, this is important so that ldst_elem itself can always be inlined. Reviewed-by: Richard Henderson <richard.henderson@linaro.org> r~ ^ permalink raw reply [flat|nested] 19+ messages in thread
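To make the point concrete: once vext_ldst_us() is force-inlined into each generated helper, its ldst_elem function-pointer argument becomes a compile-time constant, so the per-element indirect call can be resolved and inlined as well. A hypothetical, simplified view of helper_vle8_v after such inlining (masking, tail-element and adjust_addr details elided):

    void helper_vle8_v(void *vd, void *v0, target_ulong base,
                       CPURISCVState *env, uint32_t desc)
    {
        /* vext_ldst_us() inlined here with ldst_elem == lde_b (a constant),
         * so lde_b and the cpu_ldsb_data_ra() it wraps can be inlined too. */
        for (uint32_t i = env->vstart; i < env->vl; i++, env->vstart++) {
            lde_b(env, adjust_addr(env, base + i), i, vd, GETPC());
        }
        env->vstart = 0;
    }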
* Re: [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance 2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance Max Chou 2024-02-15 20:09 ` Richard Henderson @ 2024-02-15 21:11 ` Daniel Henrique Barboza 2024-02-17 10:10 ` Max Chou 1 sibling, 1 reply; 19+ messages in thread From: Daniel Henrique Barboza @ 2024-02-15 21:11 UTC (permalink / raw) To: Max Chou, qemu-devel, qemu-riscv Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei On 2/15/24 16:28, Max Chou wrote: > In the vector unit-stride load/store helper functions. the vext_ldst_us > function corresponding most of the execution time. Inline the functions > can avoid the function call overhead to imperove the helper function > performance. > > Signed-off-by: Max Chou <max.chou@sifive.com> > --- The inline is a good idea but I think we can do better. I mentioned in a thread last year [1] about the time we're spending in single byte loads/stores, even for strided instructions. E.g. in vext_ldst_stride(): for (i = env->vstart; i < env->vl; i++, env->vstart++) { k = 0; while (k < nf) { if (!vm && !vext_elem_mask(v0, i)) { /* set masked-off elements to 1s */ vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz, (i + k * max_elems + 1) * esz); k++; continue; } target_ulong addr = base + stride * i + (k << log2_esz); ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra); k++; } } We're doing single byte load/stores in ldst_elem() when, in this case, we could do it in a whole block only once. ARM does something similar in SVE. I update the gitlab bug https://gitlab.com/qemu-project/qemu/-/issues/2137 with this additional info too. Thanks, Daniel [1] https://lore.kernel.org/qemu-riscv/0e54c6c1-2903-7942-eff2-2b8c5e21187e@ventanamicro.com/ > target/riscv/vector_helper.c | 30 ++++++++++++++++-------------- > 1 file changed, 16 insertions(+), 14 deletions(-) > > diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c > index e8fbb921449..866f77d321d 100644 > --- a/target/riscv/vector_helper.c > +++ b/target/riscv/vector_helper.c > @@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, int index, > typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr, > uint32_t idx, void *vd, uintptr_t retaddr); > > -#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \ > -static void NAME(CPURISCVState *env, abi_ptr addr, \ > - uint32_t idx, void *vd, uintptr_t retaddr)\ > -{ \ > - ETYPE *cur = ((ETYPE *)vd + H(idx)); \ > - *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \ > -} \ > +#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \ > +static inline QEMU_ALWAYS_INLINE \ > +void NAME(CPURISCVState *env, abi_ptr addr, \ > + uint32_t idx, void *vd, uintptr_t retaddr) \ > +{ \ > + ETYPE *cur = ((ETYPE *)vd + H(idx)); \ > + *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \ > +} \ > > GEN_VEXT_LD_ELEM(lde_b, int8_t, H1, ldsb) > GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw) > GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl) > GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq) > > -#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \ > -static void NAME(CPURISCVState *env, abi_ptr addr, \ > - uint32_t idx, void *vd, uintptr_t retaddr)\ > -{ \ > - ETYPE data = *((ETYPE *)vd + H(idx)); \ > - cpu_##STSUF##_data_ra(env, addr, data, retaddr); \ > +#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \ > +static inline QEMU_ALWAYS_INLINE \ > +void NAME(CPURISCVState *env, abi_ptr addr, \ > + uint32_t idx, void *vd, uintptr_t retaddr) \ 
> +{ \ > + ETYPE data = *((ETYPE *)vd + H(idx)); \ > + cpu_##STSUF##_data_ra(env, addr, data, retaddr); \ > } > > GEN_VEXT_ST_ELEM(ste_b, int8_t, H1, stb) > @@ -289,7 +291,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d) > */ > > /* unmasked unit-stride load and store operation */ > -static void > +static inline QEMU_ALWAYS_INLINE void > vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc, > vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl, > uintptr_t ra) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance 2024-02-15 21:11 ` Daniel Henrique Barboza @ 2024-02-17 10:10 ` Max Chou 0 siblings, 0 replies; 19+ messages in thread From: Max Chou @ 2024-02-17 10:10 UTC (permalink / raw) To: Daniel Henrique Barboza, qemu-devel, qemu-riscv Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei Hi Daniel, Thank you for the information and suggestion. Yes, we can do it better if we load/store more bytes at a time. I'll try to improve the RFC on this way. Thanks, Max On 2024/2/16 5:11 AM, Daniel Henrique Barboza wrote: > > > On 2/15/24 16:28, Max Chou wrote: >> In the vector unit-stride load/store helper functions. the vext_ldst_us >> function corresponding most of the execution time. Inline the functions >> can avoid the function call overhead to imperove the helper function >> performance. >> >> Signed-off-by: Max Chou <max.chou@sifive.com> >> --- > > The inline is a good idea but I think we can do better. I mentioned in > a thread > last year [1] about the time we're spending in single byte > loads/stores, even > for strided instructions. > > E.g. in vext_ldst_stride(): > > > for (i = env->vstart; i < env->vl; i++, env->vstart++) { > k = 0; > while (k < nf) { > if (!vm && !vext_elem_mask(v0, i)) { > /* set masked-off elements to 1s */ > vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz, > (i + k * max_elems + 1) * esz); > k++; > continue; > } > target_ulong addr = base + stride * i + (k << log2_esz); > ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, > vd, ra); > k++; > } > } > > We're doing single byte load/stores in ldst_elem() when, in this case, > we could do > it in a whole block only once. ARM does something similar in SVE. > > I update the gitlab bug > https://gitlab.com/qemu-project/qemu/-/issues/2137 with this > additional info too. 
> > > > Thanks, > > Daniel > > > [1] > https://lore.kernel.org/qemu-riscv/0e54c6c1-2903-7942-eff2-2b8c5e21187e@ventanamicro.com/ > > >> target/riscv/vector_helper.c | 30 ++++++++++++++++-------------- >> 1 file changed, 16 insertions(+), 14 deletions(-) >> >> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c >> index e8fbb921449..866f77d321d 100644 >> --- a/target/riscv/vector_helper.c >> +++ b/target/riscv/vector_helper.c >> @@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, >> int index, >> typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr, >> uint32_t idx, void *vd, uintptr_t >> retaddr); >> -#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \ >> -static void NAME(CPURISCVState *env, abi_ptr addr, \ >> - uint32_t idx, void *vd, uintptr_t retaddr)\ >> -{ \ >> - ETYPE *cur = ((ETYPE *)vd + H(idx)); \ >> - *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \ >> -} \ >> +#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF) \ >> +static inline QEMU_ALWAYS_INLINE \ >> +void NAME(CPURISCVState *env, abi_ptr addr, \ >> + uint32_t idx, void *vd, uintptr_t retaddr) \ >> +{ \ >> + ETYPE *cur = ((ETYPE *)vd + H(idx)); \ >> + *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr); \ >> +} \ >> GEN_VEXT_LD_ELEM(lde_b, int8_t, H1, ldsb) >> GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw) >> GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl) >> GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq) >> -#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \ >> -static void NAME(CPURISCVState *env, abi_ptr addr, \ >> - uint32_t idx, void *vd, uintptr_t retaddr)\ >> -{ \ >> - ETYPE data = *((ETYPE *)vd + H(idx)); \ >> - cpu_##STSUF##_data_ra(env, addr, data, retaddr); \ >> +#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF) \ >> +static inline QEMU_ALWAYS_INLINE \ >> +void NAME(CPURISCVState *env, abi_ptr addr, \ >> + uint32_t idx, void *vd, uintptr_t retaddr) \ >> +{ \ >> + ETYPE data = *((ETYPE *)vd + H(idx)); \ >> + cpu_##STSUF##_data_ra(env, addr, data, retaddr); \ >> } >> GEN_VEXT_ST_ELEM(ste_b, int8_t, H1, stb) >> @@ -289,7 +291,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d) >> */ >> /* unmasked unit-stride load and store operation */ >> -static void >> +static inline QEMU_ALWAYS_INLINE void >> vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, >> uint32_t desc, >> vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, >> uint32_t evl, >> uintptr_t ra) ^ permalink raw reply [flat|nested] 19+ messages in thread
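For reference, the whole-block approach discussed above (similar to how Arm SVE handles contiguous accesses) would look roughly like the following for the unmasked unit-stride case. This is only a sketch under simplifying assumptions: EEW = 8 so plain byte copies are fine, is_load and mmu_index are stand-in values obtained elsewhere, and page-crossing or non-RAM accesses still fall back to the per-element loop:

    /* Sketch: touch all contiguous bytes at once instead of issuing one
     * out-of-line access per element. */
    uint32_t size = evl << log2_esz;
    void *host = probe_access(env, adjust_addr(env, base), size,
                              is_load ? MMU_DATA_LOAD : MMU_DATA_STORE,
                              mmu_index, ra);
    if (host) {
        if (is_load) {
            memcpy(vd, host, size);      /* guest memory -> vector register */
        } else {
            memcpy(host, vd, size);      /* vector register -> guest memory */
        }
        env->vstart = 0;
        return;
    }
    /* else: fall back to the existing per-element path */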
* [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function 2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou ` (2 preceding siblings ...) 2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance Max Chou @ 2024-02-15 19:28 ` Max Chou 2024-02-15 20:10 ` Richard Henderson 2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou ` (2 subsequent siblings) 6 siblings, 1 reply; 19+ messages in thread From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw) To: qemu-devel, qemu-riscv Cc: dbarboza, Max Chou, Riku Voipio, Richard Henderson, Paolo Bonzini Signed-off-by: Max Chou <max.chou@sifive.com> --- accel/tcg/user-exec.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c index 68b252cb8e8..c5453810eee 100644 --- a/accel/tcg/user-exec.c +++ b/accel/tcg/user-exec.c @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, target_ulong last) { } /* The system-mode versions of these helpers are in cputlb.c. */ -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr, - MemOp mop, uintptr_t ra, MMUAccessType type) +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu, + vaddr addr, + MemOp mop, + uintptr_t ra, + MMUAccessType type) { int a_bits = get_alignment_bits(mop); void *ret; -- 2.34.1 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function 2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou @ 2024-02-15 20:10 ` Richard Henderson 2024-02-17 17:27 ` Max Chou 0 siblings, 1 reply; 19+ messages in thread From: Richard Henderson @ 2024-02-15 20:10 UTC (permalink / raw) To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza, Riku Voipio, Paolo Bonzini On 2/15/24 09:28, Max Chou wrote: > Signed-off-by: Max Chou <max.chou@sifive.com> > --- > accel/tcg/user-exec.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c > index 68b252cb8e8..c5453810eee 100644 > --- a/accel/tcg/user-exec.c > +++ b/accel/tcg/user-exec.c > @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, target_ulong last) { } > > /* The system-mode versions of these helpers are in cputlb.c. */ > > -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr, > - MemOp mop, uintptr_t ra, MMUAccessType type) > +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu, > + vaddr addr, > + MemOp mop, > + uintptr_t ra, > + MMUAccessType type) > { > int a_bits = get_alignment_bits(mop); > void *ret; This is a large function. Why does it need to be inlined? For this and the next two patches I require evidence, because I don't believe you are attacking the problem correctly. r~ ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function 2024-02-15 20:10 ` Richard Henderson @ 2024-02-17 17:27 ` Max Chou 0 siblings, 0 replies; 19+ messages in thread From: Max Chou @ 2024-02-17 17:27 UTC (permalink / raw) To: Richard Henderson, qemu-devel, qemu-riscv Cc: dbarboza, Riku Voipio, Paolo Bonzini Hi Richard, After I tried to reduce the overhead of unnecessary segment flow and memory plugin callbacks, I observed that there several functions consume most of runtime from perf report. So these three inline patches and previous patch were created to reduce the overhead without modifying the functions. The following are the experiment results. The benchmark target is the bench-memcpy executable generated from the glibc repository (release 2.38 with RVV support patch [1]). - Execution command `qemu-riscv64 -E TIMEOUTFACTOR=10 -R 1G -L {glibc build folder}/rootfs -cpu rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true bench-memcpy` - Total runtime 0. Original riscv-to-apply.next branch (commit ID: deb0ff0) - Total execution time: ~383 sec. 1. Cherry pick PATCH 4 to riscv-to-apply.next branch (commit ID: deb0ff0) - Total execution time: ~375 sec. 2. Cherry pick PATCH 4+5+6 to riscv-to-apply.next branch (commit ID: deb0ff0) - Total execution time: ~342 sec. - Perf report (cycles) 0. Original riscv-to-apply.next branch (commit ID: deb0ff0) # Samples: 385K of event 'cycles:u' # Event count (approx.): 1411574690539 # # Children Self Command Shared Object Symbol # ........ ........ ............ ....................... ............................................ # 47.54% 31.35% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us 25.60% 0.03% qemu-riscv64 qemu-riscv64 [.] helper_vse8_v 18.87% 13.29% qemu-riscv64 qemu-riscv64 [.] cpu_ldb_mmu 17.40% 0.04% qemu-riscv64 qemu-riscv64 [.] helper_vle8_v 17.39% 15.71% qemu-riscv64 qemu-riscv64 [.] cpu_stb_mmu 17.17% 17.17% qemu-riscv64 qemu-riscv64 [.] qemu_plugin_vcpu_mem_cb 12.25% 0.00% qemu-riscv64 qemu-riscv64 [.] helper_stb_mmu (inlined) 8.18% 4.08% qemu-riscv64 qemu-riscv64 [.] lde_b 8.17% 8.17% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup 7.45% 0.00% qemu-riscv64 qemu-riscv64 [.] plugin_load_cb (inlined) 7.32% 0.00% qemu-riscv64 qemu-riscv64 [.] do_st1_mmu (inlined) 6.79% 0.00% qemu-riscv64 [unknown] [.] 0x00000000000000ff 5.91% 0.00% qemu-riscv64 qemu-riscv64 [.] adjust_addr (inlined) 5.70% 0.00% qemu-riscv64 qemu-riscv64 [.] do_ld1_mmu (inlined) 5.58% 0.00% qemu-riscv64 [unknown] [.] 0x00000000000001fe 5.23% 4.25% qemu-riscv64 qemu-riscv64 [.] cpu_ldsb_data_ra 4.93% 0.00% qemu-riscv64 qemu-riscv64 [.] get_memop (inlined) 4.11% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_ldub_data_ra (inlined) 4.11% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_ldub_mmuidx_ra (inlined) 2.88% 2.88% qemu-riscv64 qemu-riscv64 [.] cpu_stb_data_ra 2.88% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_stb_mmuidx_ra (inlined) 2.75% 0.00% qemu-riscv64 qemu-riscv64 [.] stb_p (inlined) 2.66% 0.00% qemu-riscv64 qemu-riscv64 [.] get_alignment_bits (inlined) 1.79% 0.00% qemu-riscv64 [unknown] [.] 0x00000000e40203bf 1.68% 0.00% qemu-riscv64 qemu-riscv64 [.] plugin_store_cb (inlined) 1.60% 0.00% qemu-riscv64 [unknown] [.] 0x00000000e403733f 1.13% 0.00% qemu-riscv64 qemu-riscv64 [.] ldub_p (inlined) 0.73% 0.73% qemu-riscv64 qemu-riscv64 [.] ste_b 0.53% 0.21% qemu-riscv64 qemu-riscv64 [.] helper_lookup_tb_ptr 1. 
Cherry pick PATCH 4 to riscv-to-apply.next branch (commit ID: deb0ff0) # Samples: 378K of event 'cycles:u' # Event count (approx.): 1381912775966 # # Children Self Command Shared Object Symbol # ........ ........ ............ ....................... ....................................... # 63.30% 29.62% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us 30.77% 0.04% qemu-riscv64 qemu-riscv64 [.] helper_vle8_v 28.59% 0.02% qemu-riscv64 qemu-riscv64 [.] helper_vse8_v 22.78% 22.78% qemu-riscv64 qemu-riscv64 [.] qemu_plugin_vcpu_mem_cb 21.40% 5.26% qemu-riscv64 qemu-riscv64 [.] lde_b 20.69% 10.40% qemu-riscv64 qemu-riscv64 [.] cpu_ldb_mmu 20.06% 3.91% qemu-riscv64 qemu-riscv64 [.] cpu_ldsb_data_ra 19.16% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_ldub_data_ra (inlined) 19.16% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_ldub_mmuidx_ra (inlined) 12.65% 9.36% qemu-riscv64 qemu-riscv64 [.] cpu_stb_mmu 8.60% 0.00% qemu-riscv64 qemu-riscv64 [.] plugin_load_cb (inlined) 6.73% 6.73% qemu-riscv64 qemu-riscv64 [.] do_ld1_mmu.constprop.23 6.73% 0.00% qemu-riscv64 qemu-riscv64 [.] do_ld1_mmu (inlined) 6.31% 0.00% qemu-riscv64 qemu-riscv64 [.] helper_stb_mmu (inlined) 6.20% 6.20% qemu-riscv64 qemu-riscv64 [.] do_st1_mmu 6.20% 0.00% qemu-riscv64 qemu-riscv64 [.] adjust_addr (inlined) 5.49% 0.00% qemu-riscv64 [unknown] [.] 0x00000000000001fe 3.82% 0.00% qemu-riscv64 qemu-riscv64 [.] plugin_store_cb (inlined) 3.01% 0.00% qemu-riscv64 qemu-riscv64 [.] ldub_p (inlined) 2.94% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup (inlined) 2.94% 0.00% qemu-riscv64 qemu-riscv64 [.] get_alignment_bits (inlined) 2.91% 0.00% qemu-riscv64 qemu-riscv64 [.] clear_helper_retaddr (inlined) 2.91% 2.91% qemu-riscv64 qemu-riscv64 [.] ste_b 2.90% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup (inlined) 2.90% 0.00% qemu-riscv64 qemu-riscv64 [.] get_alignment_bits (inlined) 0.59% 0.24% qemu-riscv64 qemu-riscv64 [.] helper_lookup_tb_ptr 2. Cherry pick PATCH 4+5+6 to riscv-to-apply.next branch (commit ID: deb0ff0) # Samples: 343K of event 'cycles:u' # Event count (approx.): 1259748868940 # # Children Self Command Shared Object Symbol # ........ ........ ............ ....................... ....................................... # 64.16% 35.81% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us 30.96% 0.02% qemu-riscv64 qemu-riscv64 [.] helper_vse8_v 27.44% 0.05% qemu-riscv64 qemu-riscv64 [.] helper_vle8_v 18.37% 18.37% qemu-riscv64 qemu-riscv64 [.] qemu_plugin_vcpu_mem_cb 17.33% 7.45% qemu-riscv64 qemu-riscv64 [.] cpu_ldsb_data_ra 14.90% 5.02% qemu-riscv64 qemu-riscv64 [.] lde_b 14.83% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_ldub_data_ra (inlined) 14.83% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_ldub_mmuidx_ra (inlined) 14.15% 10.33% qemu-riscv64 qemu-riscv64 [.] cpu_stb_mmu 11.25% 9.62% qemu-riscv64 qemu-riscv64 [.] cpu_ldb_mmu 7.22% 0.00% qemu-riscv64 qemu-riscv64 [.] adjust_addr (inlined) 6.99% 6.99% qemu-riscv64 qemu-riscv64 [.] helper_stb_mmu 6.99% 0.00% qemu-riscv64 qemu-riscv64 [.] do_st1_mmu (inlined) 5.96% 0.00% qemu-riscv64 qemu-riscv64 [.] do_ld1_mmu (inlined) 5.95% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup (inlined) 5.30% 0.00% qemu-riscv64 [unknown] [.] 0x00000000000001fe 4.18% 0.00% qemu-riscv64 qemu-riscv64 [.] plugin_store_cb (inlined) 3.22% 0.00% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup (inlined) 3.22% 0.00% qemu-riscv64 qemu-riscv64 [.] get_alignment_bits (inlined) 3.22% 3.22% qemu-riscv64 qemu-riscv64 [.] ste_b 3.19% 0.00% qemu-riscv64 qemu-riscv64 [.] 
get_alignment_bits (inlined) 3.16% 0.00% qemu-riscv64 qemu-riscv64 [.] clear_helper_retaddr (inlined) 2.76% 0.00% qemu-riscv64 qemu-riscv64 [.] g2h (inlined) 2.76% 0.00% qemu-riscv64 qemu-riscv64 [.] g2h_untagged (inlined) 2.30% 0.00% qemu-riscv64 qemu-riscv64 [.] env_cpu (inlined) 2.21% 0.00% qemu-riscv64 qemu-riscv64 [.] plugin_load_cb (inlined) 0.99% 0.00% qemu-riscv64 qemu-riscv64 [.] env_cpu (inlined) I agree that these functions are large functions to inline and I didn't test these patches on all combinations (different guest architecture + different host architecture). So I think that we can drop these three patches until we can make sure that these patches can get benefit on all combinations without side effect. I'll focus on avoiding over-use of the full out-of-line load/store routines for the next version. Thanks for the suggestion and question, Max [1] https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau.hsu@sifive.com On 2024/2/16 4:10 AM, Richard Henderson wrote: > On 2/15/24 09:28, Max Chou wrote: >> Signed-off-by: Max Chou <max.chou@sifive.com> >> --- >> accel/tcg/user-exec.c | 7 +++++-- >> 1 file changed, 5 insertions(+), 2 deletions(-) >> >> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c >> index 68b252cb8e8..c5453810eee 100644 >> --- a/accel/tcg/user-exec.c >> +++ b/accel/tcg/user-exec.c >> @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, >> target_ulong last) { } >> /* The system-mode versions of these helpers are in cputlb.c. */ >> -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr, >> - MemOp mop, uintptr_t ra, MMUAccessType >> type) >> +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu, >> + vaddr addr, >> + MemOp mop, >> + uintptr_t ra, >> + MMUAccessType type) >> { >> int a_bits = get_alignment_bits(mop); >> void *ret; > > This is a large function. Why does it need to be inlined? > For this and the next two patches I require evidence, because I don't > believe you are attacking the problem correctly. > > > r~ ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (3 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:12   ` Richard Henderson
  2024-02-15 19:28 ` [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function Max Chou
  2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
  6 siblings, 1 reply; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Riku Voipio, Richard Henderson, Paolo Bonzini

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/user-exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index c5453810eee..803c271df11 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -963,8 +963,9 @@ static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,

 #include "ldst_atomicity.c.inc"

-static uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr, MemOpIdx oi,
-                          uintptr_t ra, MMUAccessType access_type)
+static inline QEMU_ALWAYS_INLINE uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr,
+                                                    MemOpIdx oi, uintptr_t ra,
+                                                    MMUAccessType access_type)
 {
     void *haddr;
     uint8_t ret;
--
2.34.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function
  2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
@ 2024-02-15 20:12   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:12 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza, Riku Voipio, Paolo Bonzini

On 2/15/24 09:28, Max Chou wrote:
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   accel/tcg/user-exec.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
> index c5453810eee..803c271df11 100644
> --- a/accel/tcg/user-exec.c
> +++ b/accel/tcg/user-exec.c
> @@ -963,8 +963,9 @@ static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
>
>   #include "ldst_atomicity.c.inc"
>
> -static uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr, MemOpIdx oi,
> -                          uintptr_t ra, MMUAccessType access_type)
> +static inline QEMU_ALWAYS_INLINE uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr,
> +                                                    MemOpIdx oi, uintptr_t ra,
> +                                                    MMUAccessType access_type)
>   {
>       void *haddr;
>       uint8_t ret;

I expect the small functions that use this to tail-call into this common
function.  I do not believe that inlining helps.

Same with do_st1_mmu.


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread
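[To make the objection above concrete, the pattern being described looks
roughly like the following. This is a minimal sketch with made-up type and
function names, not the real QEMU wrappers, which also pass MemOpIdx, the
return address, and so on: the public helper adds little beyond argument
marshalling, so the compiler can lower it to a tail call (a plain jump)
into the common routine, and forcing the common routine inline would only
duplicate its body without removing a call frame.]

    /* Minimal sketch of the wrapper / common-routine split (hypothetical types). */
    #include <stdint.h>

    typedef struct CPUStateSketch CPUStateSketch;

    /* The common routine does the real work: alignment check, address
     * lookup, then the actual byte access. */
    static uint8_t do_ld1_common(CPUStateSketch *cpu, uintptr_t addr, uintptr_t ra)
    {
        /* Placeholder body: treat addr as an already-resolved host pointer. */
        return *(const uint8_t *)addr;
    }

    /* The wrapper only forwards its arguments, so a compiler can emit it
     * as a single tail call into do_ld1_common. */
    uint8_t cpu_ldb_sketch(CPUStateSketch *cpu, uintptr_t addr, uintptr_t ra)
    {
        return do_ld1_common(cpu, addr, ra);
    }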
* [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (4 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
  6 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Riku Voipio, Richard Henderson, Paolo Bonzini

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/user-exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 803c271df11..9ef35a22279 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -1050,8 +1050,9 @@ static Int128 do_ld16_mmu(CPUState *cpu, abi_ptr addr,
     return ret;
 }

-static void do_st1_mmu(CPUState *cpu, vaddr addr, uint8_t val,
-                       MemOpIdx oi, uintptr_t ra)
+static inline QEMU_ALWAYS_INLINE void do_st1_mmu(CPUState *cpu, vaddr addr,
+                                                 uint8_t val, MemOpIdx oi,
+                                                 uintptr_t ra)
 {
     void *haddr;

--
2.34.1

^ permalink raw reply related	[flat|nested] 19+ messages in thread
* Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (5 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function Max Chou
@ 2024-02-15 20:24 ` Richard Henderson
  2024-02-17  9:52   ` Max Chou
  6 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:24 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza

On 2/15/24 09:28, Max Chou wrote:
> Hi all,
>
> When glibc with RVV support [1], the memcpy benchmark will run 2x to 60x
> slower than the scalar equivalent on QEMU and it hurts developer
> productivity.
>
> From the performance analysis result, we can observe that the glibc
> memcpy spends most of the time in the vector unit-stride load/store
> helper functions.
>
> Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
>   Children      Self  Command       Shared Object  Symbol
> +   28.46%    27.85%  qemu-riscv64  qemu-riscv64   [.] vext_ldst_us
> +   26.92%     0.00%  qemu-riscv64  [unknown]      [.] 0x00000000000000ff
> +   14.41%    14.41%  qemu-riscv64  qemu-riscv64   [.] qemu_plugin_vcpu_mem_cb
> +   13.85%    13.85%  qemu-riscv64  qemu-riscv64   [.] lde_b
> +   13.64%    13.64%  qemu-riscv64  qemu-riscv64   [.] cpu_stb_mmu
> +    9.25%     9.19%  qemu-riscv64  qemu-riscv64   [.] cpu_ldb_mmu
> +    7.81%     7.81%  qemu-riscv64  qemu-riscv64   [.] cpu_mmu_lookup
> +    7.70%     7.70%  qemu-riscv64  qemu-riscv64   [.] ste_b
> +    5.53%     0.00%  qemu-riscv64  qemu-riscv64   [.] adjust_addr (inlined)
>
>
> So this patchset tries to improve the performance of the RVV version of
> glibc memcpy on QEMU by improving the corresponding helper function
> quality.
>
> The overall performance improvement can achieve following numbers
> (depending on the size).
> Average: 2.86X / Smallest: 1.15X / Largest: 4.49X
>
> PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions,
> the next version or next serious will complete other vector ld/st part.

You are still not tackling the root problem, which is over-use of the full
out-of-line load/store routines.  The reason that cpu_mmu_lookup is in that
list is because you are performing the full virtual address resolution for
each and every byte.

The only way to make a real improvement is to perform virtual address
resolution *once* for the entire vector.  I refer to my previous advice:

  https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369


r~

^ permalink raw reply	[flat|nested] 19+ messages in thread
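[For context, the "resolve once, then run over host memory" direction
suggested above can be sketched as follows. This is only an illustration
of the idea for user-mode emulation, not the change that was eventually
written: the helper names and probe_range() below are stand-ins (in QEMU
the analogous primitive would be something like probe_access()), and
fault, watchpoint and endianness handling are deliberately omitted.]

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE_SKETCH 4096u   /* stand-in for TARGET_PAGE_SIZE */

    /* Stand-in for a QEMU-style probe: one guest->host translation plus
     * permission check for the whole [addr, addr + size) range, which is
     * guaranteed not to cross a guest page boundary. */
    void *probe_range(void *env, uint64_t guest_addr, size_t size, int is_store);

    /* Sketch: translate once per page, then copy the elements through host
     * memory, instead of one full load/store helper call per byte. */
    static void unit_stride_copy_sketch(void *env, uint8_t *vreg,
                                        uint64_t addr, size_t evl, int is_store)
    {
        size_t done = 0;

        while (done < evl) {
            /* Largest chunk that stays within the current guest page. */
            size_t in_page = PAGE_SIZE_SKETCH - (addr & (PAGE_SIZE_SKETCH - 1));
            size_t chunk = (evl - done < in_page) ? evl - done : in_page;

            /* One address resolution for the whole chunk... */
            uint8_t *host = probe_range(env, addr, chunk, is_store);

            /* ...then a plain host memcpy covers every element in it. */
            if (is_store) {
                memcpy(host, vreg + done, chunk);
            } else {
                memcpy(vreg + done, host, chunk);
            }

            done += chunk;
            addr += chunk;
        }
    }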
* Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
  2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
@ 2024-02-17  9:52   ` Max Chou
  0 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-17 9:52 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv; +Cc: dbarboza

Hi Richard,

Thank you for the suggestion and the reference.
I'm trying to follow the reference to implement it and I'll send another
version for this.

Thanks a lot,
Max

On 2024/2/16 4:24 AM, Richard Henderson wrote:
> On 2/15/24 09:28, Max Chou wrote:
>> Hi all,
>>
>> When glibc with RVV support [1], the memcpy benchmark will run 2x to 60x
>> slower than the scalar equivalent on QEMU and it hurts developer
>> productivity.
>>
>> From the performance analysis result, we can observe that the glibc
>> memcpy spends most of the time in the vector unit-stride load/store
>> helper functions.
>>
>> Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
>>   Children      Self  Command       Shared Object  Symbol
>> +   28.46%    27.85%  qemu-riscv64  qemu-riscv64   [.] vext_ldst_us
>> +   26.92%     0.00%  qemu-riscv64  [unknown]      [.] 0x00000000000000ff
>> +   14.41%    14.41%  qemu-riscv64  qemu-riscv64   [.] qemu_plugin_vcpu_mem_cb
>> +   13.85%    13.85%  qemu-riscv64  qemu-riscv64   [.] lde_b
>> +   13.64%    13.64%  qemu-riscv64  qemu-riscv64   [.] cpu_stb_mmu
>> +    9.25%     9.19%  qemu-riscv64  qemu-riscv64   [.] cpu_ldb_mmu
>> +    7.81%     7.81%  qemu-riscv64  qemu-riscv64   [.] cpu_mmu_lookup
>> +    7.70%     7.70%  qemu-riscv64  qemu-riscv64   [.] ste_b
>> +    5.53%     0.00%  qemu-riscv64  qemu-riscv64   [.] adjust_addr (inlined)
>>
>>
>> So this patchset tries to improve the performance of the RVV version of
>> glibc memcpy on QEMU by improving the corresponding helper function
>> quality.
>>
>> The overall performance improvement can achieve following numbers
>> (depending on the size).
>> Average: 2.86X / Smallest: 1.15X / Largest: 4.49X
>>
>> PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions,
>> the next version or next serious will complete other vector ld/st part.
>
> You are still not tackling the root problem, which is over-use of the
> full out-of-line load/store routines.  The reason that cpu_mmu_lookup
> is in that list is because you are performing the full virtual address
> resolution for each and every byte.
>
> The only way to make a real improvement is to perform virtual address
> resolution *once* for the entire vector.  I refer to my previous advice:
>
>   https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
>
>
> r~

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2024-02-17 17:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Seperate vector segment " Max Chou
2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
2024-02-15 20:03   ` Richard Henderson
2024-02-17  9:08     ` Max Chou
2024-02-15 20:21   ` Daniel Henrique Barboza
2024-02-17  9:45     ` Max Chou
2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance Max Chou
2024-02-15 20:09   ` Richard Henderson
2024-02-15 21:11   ` Daniel Henrique Barboza
2024-02-17 10:10     ` Max Chou
2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou
2024-02-15 20:10   ` Richard Henderson
2024-02-17 17:27     ` Max Chou
2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
2024-02-15 20:12   ` Richard Henderson
2024-02-15 19:28 ` [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function Max Chou
2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
2024-02-17  9:52   ` Max Chou