qemu-devel.nongnu.org archive mirror
* [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
@ 2024-02-15 19:28 Max Chou
  2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Separate vector segment " Max Chou
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv; +Cc: dbarboza, Max Chou

Hi all,

When glibc is built with RVV support [1], the memcpy benchmark runs 2x to 60x
slower than the scalar equivalent on QEMU, and that hurts developer
productivity.

From the performance analysis result, we can observe that the glibc
memcpy spends most of its time in the vector unit-stride load/store
helper functions.

Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
  Children      Self  Command       Shared Object            Symbol
+   28.46%    27.85%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
+   26.92%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
+   14.41%    14.41%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
+   13.85%    13.85%  qemu-riscv64  qemu-riscv64             [.] lde_b
+   13.64%    13.64%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
+    9.25%     9.19%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
+    7.81%     7.81%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
+    7.70%     7.70%  qemu-riscv64  qemu-riscv64             [.] ste_b
+    5.53%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)   


So this patchset tries to improve the performance of the RVV version of
glibc memcpy on QEMU by improving the quality of the corresponding helper
functions.

The overall performance improvement reaches the following numbers
(depending on the copy size).
Average: 2.86X / Smallest: 1.15X / Largest: 4.49X

PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions;
the next version or next series will cover the remaining vector ld/st instructions.

Regards,
Max.

[1] https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau.hsu@sifive.com

Max Chou (6):
  target/riscv: Separate vector segment ld/st instructions
  accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  target/riscv: Inline vext_ldst_us and corresponding function for
    performance
  accel/tcg: Inline cpu_mmu_lookup function
  accel/tcg: Inline do_ld1_mmu function
  accel/tcg: Inline do_st1_mmu function

 accel/tcg/ldst_common.c.inc             |  40 ++++++--
 accel/tcg/user-exec.c                   |  17 ++--
 target/riscv/helper.h                   |   4 +
 target/riscv/insn32.decode              |  11 +-
 target/riscv/insn_trans/trans_rvv.c.inc |  61 +++++++++++
 target/riscv/vector_helper.c            | 130 +++++++++++++++++++-----
 6 files changed, 221 insertions(+), 42 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/6] target/riscv: Separate vector segment ld/st instructions
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Liu Zhiwei, Richard Henderson, Junqiang Wang

This commit separates the helper function implementations of vector
segment load/store instructions from other vector load/store
instructions.
This can improve performance by avoiding unnecessary segment operations
when NF = 1.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/helper.h                   |   4 +
 target/riscv/insn32.decode              |  11 ++-
 target/riscv/insn_trans/trans_rvv.c.inc |  61 +++++++++++++++
 target/riscv/vector_helper.c            | 100 +++++++++++++++++++++---
 4 files changed, 164 insertions(+), 12 deletions(-)

diff --git a/target/riscv/helper.h b/target/riscv/helper.h
index 8a635238514..8b6ddc4cb88 100644
--- a/target/riscv/helper.h
+++ b/target/riscv/helper.h
@@ -157,18 +157,22 @@ DEF_HELPER_FLAGS_3(hyp_hsv_d, TCG_CALL_NO_WG, void, env, tl, tl)
 /* Vector functions */
 DEF_HELPER_3(vsetvl, tl, env, tl, tl)
 DEF_HELPER_5(vle8_v, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vlsege8_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle16_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle32_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle64_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle8_v_mask, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vlsege8_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle16_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle32_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vle64_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse8_v, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vssege8_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse16_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse32_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse64_v, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse8_v_mask, void, ptr, ptr, tl, env, i32)
+DEF_HELPER_5(vssege8_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse16_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse32_v_mask, void, ptr, ptr, tl, env, i32)
 DEF_HELPER_5(vse64_v_mask, void, ptr, ptr, tl, env, i32)
diff --git a/target/riscv/insn32.decode b/target/riscv/insn32.decode
index f22df04cfd1..0712e9f6314 100644
--- a/target/riscv/insn32.decode
+++ b/target/riscv/insn32.decode
@@ -77,6 +77,7 @@
 @r2      .......   ..... ..... ... ..... ....... &r2 %rs1 %rd
 @r2_vm_1 ...... . ..... ..... ... ..... ....... &rmr vm=1 %rs2 %rd
 @r2_nfvm ... ... vm:1 ..... ..... ... ..... ....... &r2nfvm %nf %rs1 %rd
+@r2_nf_1_vm ... ... vm:1 ..... ..... ... ..... ....... &r2nfvm nf=1 %rs1 %rd
 @r2_vm   ...... vm:1 ..... ..... ... ..... ....... &rmr %rs2 %rd
 @r1_vm   ...... vm:1 ..... ..... ... ..... ....... %rd
 @r_nfvm  ... ... vm:1 ..... ..... ... ..... ....... &rnfvm %nf %rs2 %rs1 %rd
@@ -349,11 +350,17 @@ hsv_d     0110111  .....   ..... 100 00000 1110011 @r2_s
 
 # *** Vector loads and stores are encoded within LOADFP/STORE-FP ***
 # Vector unit-stride load/store insns.
-vle8_v     ... 000 . 00000 ..... 000 ..... 0000111 @r2_nfvm
+{
+  vle8_v     000 000 . 00000 ..... 000 ..... 0000111 @r2_nf_1_vm
+  vlsege8_v  ... 000 . 00000 ..... 000 ..... 0000111 @r2_nfvm
+}
 vle16_v    ... 000 . 00000 ..... 101 ..... 0000111 @r2_nfvm
 vle32_v    ... 000 . 00000 ..... 110 ..... 0000111 @r2_nfvm
 vle64_v    ... 000 . 00000 ..... 111 ..... 0000111 @r2_nfvm
-vse8_v     ... 000 . 00000 ..... 000 ..... 0100111 @r2_nfvm
+{
+  vse8_v     000 000 . 00000 ..... 000 ..... 0100111 @r2_nf_1_vm
+  vssege8_v  ... 000 . 00000 ..... 000 ..... 0100111 @r2_nfvm
+}
 vse16_v    ... 000 . 00000 ..... 101 ..... 0100111 @r2_nfvm
 vse32_v    ... 000 . 00000 ..... 110 ..... 0100111 @r2_nfvm
 vse64_v    ... 000 . 00000 ..... 111 ..... 0100111 @r2_nfvm
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 9e101ab4343..04fc6329359 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -690,6 +690,40 @@ GEN_VEXT_TRANS(vle16_v, MO_16, r2nfvm, ld_us_op, ld_us_check)
 GEN_VEXT_TRANS(vle32_v, MO_32, r2nfvm, ld_us_op, ld_us_check)
 GEN_VEXT_TRANS(vle64_v, MO_64, r2nfvm, ld_us_op, ld_us_check)
 
+static bool ld_us_seg_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew)
+{
+    uint32_t data = 0;
+    gen_helper_ldst_us *fn;
+    static gen_helper_ldst_us * const fns[2][4] = {
+        /* masked unit stride load */
+        { gen_helper_vlsege8_v_mask, gen_helper_vle16_v_mask,
+          gen_helper_vle32_v_mask, gen_helper_vle64_v_mask },
+        /* unmasked unit stride load */
+        { gen_helper_vlsege8_v, gen_helper_vle16_v,
+          gen_helper_vle32_v, gen_helper_vle64_v }
+    };
+
+    fn =  fns[a->vm][eew];
+    if (fn == NULL) {
+        return false;
+    }
+
+    /*
+     * Vector load/store instructions have the EEW encoded
+     * directly in the instructions. The maximum vector size is
+     * calculated with EMUL rather than LMUL.
+     */
+    uint8_t emul = vext_get_emul(s, eew);
+    data = FIELD_DP32(data, VDATA, VM, a->vm);
+    data = FIELD_DP32(data, VDATA, LMUL, emul);
+    data = FIELD_DP32(data, VDATA, NF, a->nf);
+    data = FIELD_DP32(data, VDATA, VTA, s->vta);
+    data = FIELD_DP32(data, VDATA, VMA, s->vma);
+    return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
+}
+
+GEN_VEXT_TRANS(vlsege8_v,  MO_8,  r2nfvm, ld_us_seg_op, ld_us_check)
+
 static bool st_us_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew)
 {
     uint32_t data = 0;
@@ -727,6 +761,33 @@ GEN_VEXT_TRANS(vse16_v, MO_16, r2nfvm, st_us_op, st_us_check)
 GEN_VEXT_TRANS(vse32_v, MO_32, r2nfvm, st_us_op, st_us_check)
 GEN_VEXT_TRANS(vse64_v, MO_64, r2nfvm, st_us_op, st_us_check)
 
+static bool st_us_seg_op(DisasContext *s, arg_r2nfvm *a, uint8_t eew)
+{
+    uint32_t data = 0;
+    gen_helper_ldst_us *fn;
+    static gen_helper_ldst_us * const fns[2][4] = {
+        /* masked unit stride store */
+        { gen_helper_vssege8_v_mask, gen_helper_vse16_v_mask,
+          gen_helper_vse32_v_mask, gen_helper_vse64_v_mask },
+        /* unmasked unit stride store */
+        { gen_helper_vssege8_v, gen_helper_vse16_v,
+          gen_helper_vse32_v, gen_helper_vse64_v }
+    };
+
+    fn =  fns[a->vm][eew];
+    if (fn == NULL) {
+        return false;
+    }
+
+    uint8_t emul = vext_get_emul(s, eew);
+    data = FIELD_DP32(data, VDATA, VM, a->vm);
+    data = FIELD_DP32(data, VDATA, LMUL, emul);
+    data = FIELD_DP32(data, VDATA, NF, a->nf);
+    return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
+}
+
+GEN_VEXT_TRANS(vssege8_v,  MO_8,  r2nfvm, st_us_seg_op, st_us_check)
+
 /*
  *** unit stride mask load and store
  */
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 84cec73eb20..e8fbb921449 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -201,6 +201,32 @@ vext_ldst_stride(void *vd, void *v0, target_ulong base,
                  uint32_t desc, uint32_t vm,
                  vext_ldst_elem_fn *ldst_elem,
                  uint32_t log2_esz, uintptr_t ra)
+{
+    uint32_t i;
+    uint32_t max_elems = vext_max_elems(desc, log2_esz);
+    uint32_t esz = 1 << log2_esz;
+    uint32_t vma = vext_vma(desc);
+
+    for (i = env->vstart; i < env->vl; i++, env->vstart++) {
+        if (!vm && !vext_elem_mask(v0, i)) {
+            /* set masked-off elements to 1s */
+            vext_set_elems_1s(vd, vma, i * esz, (i + 1) * esz);
+            continue;
+        }
+        target_ulong addr = base + stride * i;
+        ldst_elem(env, adjust_addr(env, addr), i, vd, ra);
+    }
+    env->vstart = 0;
+
+    vext_set_tail_elems_1s(env->vl, vd, desc, 1, esz, max_elems);
+}
+
+static void
+vext_ldst_stride_segment(void *vd, void *v0, target_ulong base,
+                         target_ulong stride, CPURISCVState *env,
+                         uint32_t desc, uint32_t vm,
+                         vext_ldst_elem_fn *ldst_elem,
+                         uint32_t log2_esz, uintptr_t ra)
 {
     uint32_t i, k;
     uint32_t nf = vext_nf(desc);
@@ -234,8 +260,8 @@ void HELPER(NAME)(void *vd, void * v0, target_ulong base,               \
                   uint32_t desc)                                        \
 {                                                                       \
     uint32_t vm = vext_vm(desc);                                        \
-    vext_ldst_stride(vd, v0, base, stride, env, desc, vm, LOAD_FN,      \
-                     ctzl(sizeof(ETYPE)), GETPC());                     \
+    vext_ldst_stride_segment(vd, v0, base, stride, env, desc, vm,       \
+                             LOAD_FN, ctzl(sizeof(ETYPE)), GETPC());    \
 }
 
 GEN_VEXT_LD_STRIDE(vlse8_v,  int8_t,  lde_b)
@@ -249,8 +275,8 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
                   uint32_t desc)                                        \
 {                                                                       \
     uint32_t vm = vext_vm(desc);                                        \
-    vext_ldst_stride(vd, v0, base, stride, env, desc, vm, STORE_FN,     \
-                     ctzl(sizeof(ETYPE)), GETPC());                     \
+    vext_ldst_stride_segment(vd, v0, base, stride, env, desc, vm,       \
+                             STORE_FN, ctzl(sizeof(ETYPE)), GETPC());   \
 }
 
 GEN_VEXT_ST_STRIDE(vsse8_v,  int8_t,  ste_b)
@@ -267,6 +293,26 @@ static void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
              uintptr_t ra)
+{
+    uint32_t i;
+    uint32_t max_elems = vext_max_elems(desc, log2_esz);
+    uint32_t esz = 1 << log2_esz;
+
+    /* load bytes from guest memory */
+    for (i = env->vstart; i < evl; i++, env->vstart++) {
+        target_ulong addr = base + (i << log2_esz);
+        ldst_elem(env, adjust_addr(env, addr), i, vd, ra);
+    }
+    env->vstart = 0;
+
+    vext_set_tail_elems_1s(evl, vd, desc, 1, esz, max_elems);
+}
+
+/* unmasked unit-stride segment load and store operation */
+static void
+vext_ldst_us_segment(void *vd, target_ulong base, CPURISCVState *env,
+                     uint32_t desc, vext_ldst_elem_fn *ldst_elem,
+                     uint32_t log2_esz, uint32_t evl, uintptr_t ra)
 {
     uint32_t i, k;
     uint32_t nf = vext_nf(desc);
@@ -308,10 +354,27 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
                  ctzl(sizeof(ETYPE)), env->vl, GETPC());                \
 }
 
+#define GEN_VEXT_LD_US_SEG(NAME, ETYPE, LOAD_FN)                        \
+void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,         \
+                         CPURISCVState *env, uint32_t desc)             \
+{                                                                       \
+    uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));             \
+    vext_ldst_stride_segment(vd, v0, base, stride, env, desc, false,    \
+                             LOAD_FN, ctzl(sizeof(ETYPE)), GETPC());    \
+}                                                                       \
+                                                                        \
+void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
+                  CPURISCVState *env, uint32_t desc)                    \
+{                                                                       \
+    vext_ldst_us_segment(vd, base, env, desc, LOAD_FN,                  \
+                         ctzl(sizeof(ETYPE)), env->vl, GETPC());        \
+}
+
 GEN_VEXT_LD_US(vle8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_US(vle16_v, int16_t, lde_h)
-GEN_VEXT_LD_US(vle32_v, int32_t, lde_w)
-GEN_VEXT_LD_US(vle64_v, int64_t, lde_d)
+GEN_VEXT_LD_US_SEG(vlsege8_v, int8_t, lde_b)
+GEN_VEXT_LD_US_SEG(vle16_v, int16_t, lde_h)
+GEN_VEXT_LD_US_SEG(vle32_v, int32_t, lde_w)
+GEN_VEXT_LD_US_SEG(vle64_v, int64_t, lde_d)
 
 #define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN)                            \
 void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,          \
@@ -329,10 +392,27 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                 \
                  ctzl(sizeof(ETYPE)), env->vl, GETPC());                 \
 }
 
+#define GEN_VEXT_ST_US_SEG(NAME, ETYPE, STORE_FN)                        \
+void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,          \
+                         CPURISCVState *env, uint32_t desc)              \
+{                                                                        \
+    uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));              \
+    vext_ldst_stride_segment(vd, v0, base, stride, env, desc, false,     \
+                             STORE_FN, ctzl(sizeof(ETYPE)), GETPC());    \
+}                                                                        \
+                                                                         \
+void HELPER(NAME)(void *vd, void *v0, target_ulong base,                 \
+                  CPURISCVState *env, uint32_t desc)                     \
+{                                                                        \
+    vext_ldst_us_segment(vd, base, env, desc, STORE_FN,                  \
+                         ctzl(sizeof(ETYPE)), env->vl, GETPC());         \
+}
+
 GEN_VEXT_ST_US(vse8_v,  int8_t,  ste_b)
-GEN_VEXT_ST_US(vse16_v, int16_t, ste_h)
-GEN_VEXT_ST_US(vse32_v, int32_t, ste_w)
-GEN_VEXT_ST_US(vse64_v, int64_t, ste_d)
+GEN_VEXT_ST_US_SEG(vssege8_v, int8_t, ste_b)
+GEN_VEXT_ST_US_SEG(vse16_v, int16_t, ste_h)
+GEN_VEXT_ST_US_SEG(vse32_v, int32_t, ste_w)
+GEN_VEXT_ST_US_SEG(vse64_v, int64_t, ste_d)
 
 /*
  * unit stride mask load and store, EEW = 1
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
  2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Separate vector segment " Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:03   ` Richard Henderson
  2024-02-15 20:21   ` Daniel Henrique Barboza
  2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Richard Henderson, Paolo Bonzini

If there are no QEMU plugin memory callback functions registered, checking
before calling the qemu_plugin_vcpu_mem_cb function can reduce the
function call overhead.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
index c82048e377e..bf24986c562 100644
--- a/accel/tcg/ldst_common.c.inc
+++ b/accel/tcg/ldst_common.c.inc
@@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
 
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB);
     ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
-    plugin_load_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_load_cb(env, addr, oi);
+    }
     return ret;
 }
 
@@ -145,7 +147,9 @@ uint16_t cpu_ldw_mmu(CPUArchState *env, abi_ptr addr,
 
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
     ret = do_ld2_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
-    plugin_load_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_load_cb(env, addr, oi);
+    }
     return ret;
 }
 
@@ -156,7 +160,9 @@ uint32_t cpu_ldl_mmu(CPUArchState *env, abi_ptr addr,
 
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
     ret = do_ld4_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
-    plugin_load_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_load_cb(env, addr, oi);
+    }
     return ret;
 }
 
@@ -167,7 +173,9 @@ uint64_t cpu_ldq_mmu(CPUArchState *env, abi_ptr addr,
 
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
     ret = do_ld8_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
-    plugin_load_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_load_cb(env, addr, oi);
+    }
     return ret;
 }
 
@@ -178,7 +186,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr,
 
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
     ret = do_ld16_mmu(env_cpu(env), addr, oi, ra);
-    plugin_load_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_load_cb(env, addr, oi);
+    }
     return ret;
 }
 
@@ -195,7 +205,9 @@ void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
                  MemOpIdx oi, uintptr_t retaddr)
 {
     helper_stb_mmu(env, addr, val, oi, retaddr);
-    plugin_store_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_store_cb(env, addr, oi);
+    }
 }
 
 void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
@@ -203,7 +215,9 @@ void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
 {
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
     do_st2_mmu(env_cpu(env), addr, val, oi, retaddr);
-    plugin_store_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_store_cb(env, addr, oi);
+    }
 }
 
 void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
@@ -211,7 +225,9 @@ void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
 {
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
     do_st4_mmu(env_cpu(env), addr, val, oi, retaddr);
-    plugin_store_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_store_cb(env, addr, oi);
+    }
 }
 
 void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
@@ -219,7 +235,9 @@ void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
 {
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
     do_st8_mmu(env_cpu(env), addr, val, oi, retaddr);
-    plugin_store_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_store_cb(env, addr, oi);
+    }
 }
 
 void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
@@ -227,7 +245,9 @@ void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
 {
     tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
     do_st16_mmu(env_cpu(env), addr, val, oi, retaddr);
-    plugin_store_cb(env, addr, oi);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        plugin_store_cb(env, addr, oi);
+    }
 }
 
 /*
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
  2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Separate vector segment " Max Chou
  2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:09   ` Richard Henderson
  2024-02-15 21:11   ` Daniel Henrique Barboza
  2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Liu Zhiwei

In the vector unit-stride load/store helper functions, the vext_ldst_us
function accounts for most of the execution time. Inlining the functions
can avoid the function call overhead and improve the helper function
performance.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index e8fbb921449..866f77d321d 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, int index,
 typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr,
                                uint32_t idx, void *vd, uintptr_t retaddr);
 
-#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)            \
-static void NAME(CPURISCVState *env, abi_ptr addr,         \
-                 uint32_t idx, void *vd, uintptr_t retaddr)\
-{                                                          \
-    ETYPE *cur = ((ETYPE *)vd + H(idx));                   \
-    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);      \
-}                                                          \
+#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)         \
+static inline QEMU_ALWAYS_INLINE                        \
+void NAME(CPURISCVState *env, abi_ptr addr,             \
+          uint32_t idx, void *vd, uintptr_t retaddr)    \
+{                                                       \
+    ETYPE *cur = ((ETYPE *)vd + H(idx));                \
+    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);   \
+}                                                       \
 
 GEN_VEXT_LD_ELEM(lde_b, int8_t,  H1, ldsb)
 GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw)
 GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl)
 GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq)
 
-#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)            \
-static void NAME(CPURISCVState *env, abi_ptr addr,         \
-                 uint32_t idx, void *vd, uintptr_t retaddr)\
-{                                                          \
-    ETYPE data = *((ETYPE *)vd + H(idx));                  \
-    cpu_##STSUF##_data_ra(env, addr, data, retaddr);       \
+#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)         \
+static inline QEMU_ALWAYS_INLINE                        \
+void NAME(CPURISCVState *env, abi_ptr addr,             \
+          uint32_t idx, void *vd, uintptr_t retaddr)    \
+{                                                       \
+    ETYPE data = *((ETYPE *)vd + H(idx));               \
+    cpu_##STSUF##_data_ra(env, addr, data, retaddr);    \
 }
 
 GEN_VEXT_ST_ELEM(ste_b, int8_t,  H1, stb)
@@ -289,7 +291,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
  */
 
 /* unmasked unit-stride load and store operation */
-static void
+static inline QEMU_ALWAYS_INLINE void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
              uintptr_t ra)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (2 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:10   ` Richard Henderson
  2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Riku Voipio, Richard Henderson, Paolo Bonzini

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/user-exec.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 68b252cb8e8..c5453810eee 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, target_ulong last) { }
 
 /* The system-mode versions of these helpers are in cputlb.c.  */
 
-static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr,
-                            MemOp mop, uintptr_t ra, MMUAccessType type)
+static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
+                                                      vaddr addr,
+                                                      MemOp mop,
+                                                      uintptr_t ra,
+                                                      MMUAccessType type)
 {
     int a_bits = get_alignment_bits(mop);
     void *ret;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (3 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:12   ` Richard Henderson
  2024-02-15 19:28 ` [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function Max Chou
  2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
  6 siblings, 1 reply; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Riku Voipio, Richard Henderson, Paolo Bonzini

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/user-exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index c5453810eee..803c271df11 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -963,8 +963,9 @@ static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
 
 #include "ldst_atomicity.c.inc"
 
-static uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr, MemOpIdx oi,
-                          uintptr_t ra, MMUAccessType access_type)
+static inline QEMU_ALWAYS_INLINE uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr,
+                                                    MemOpIdx oi, uintptr_t ra,
+                                                    MMUAccessType access_type)
 {
     void *haddr;
     uint8_t ret;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (4 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
@ 2024-02-15 19:28 ` Max Chou
  2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
  6 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-15 19:28 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: dbarboza, Max Chou, Riku Voipio, Richard Henderson, Paolo Bonzini

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/user-exec.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 803c271df11..9ef35a22279 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -1050,8 +1050,9 @@ static Int128 do_ld16_mmu(CPUState *cpu, abi_ptr addr,
     return ret;
 }
 
-static void do_st1_mmu(CPUState *cpu, vaddr addr, uint8_t val,
-                       MemOpIdx oi, uintptr_t ra)
+static inline QEMU_ALWAYS_INLINE void do_st1_mmu(CPUState *cpu, vaddr addr,
+                                                 uint8_t val, MemOpIdx oi,
+                                                 uintptr_t ra)
 {
     void *haddr;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
@ 2024-02-15 20:03   ` Richard Henderson
  2024-02-17  9:08     ` Max Chou
  2024-02-15 20:21   ` Daniel Henrique Barboza
  1 sibling, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:03 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza, Paolo Bonzini

On 2/15/24 09:28, Max Chou wrote:
> If there are not any QEMU plugin memory callback functions, checking
> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
> function call overhead.
> 
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++----------
>   1 file changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
> index c82048e377e..bf24986c562 100644
> --- a/accel/tcg/ldst_common.c.inc
> +++ b/accel/tcg/ldst_common.c.inc
> @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
>   
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB);
>       ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
> -    plugin_load_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_load_cb(env, addr, oi);
> +    }
>       return ret;
>   }

Rather than repeating N times, modify plugin_load_cb and plugin_store_cb just above to 
avoid the call to cpu_plugin_mem_cbs_enabled().  I expect the compiler is inlining those 
functions already.
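
A minimal sketch of that shape, assuming the current plugin_load_cb /
plugin_store_cb helpers in ldst_common.c.inc (illustrative only, not taken
from this series):

    static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
    {
        /* Do the "any memory callbacks registered?" check once, here. */
        if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
            qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
        }
    }

    static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
    {
        if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
            qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
        }
    }

Every cpu_ld*_mmu / cpu_st*_mmu caller then keeps a single unconditional
call and the check is not duplicated at each call site.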


r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance
  2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
@ 2024-02-15 20:09   ` Richard Henderson
  2024-02-15 21:11   ` Daniel Henrique Barboza
  1 sibling, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:09 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: dbarboza, Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Liu Zhiwei

On 2/15/24 09:28, Max Chou wrote:
>   /* unmasked unit-stride load and store operation */
> -static void
> +static inline QEMU_ALWAYS_INLINE void
>   vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
>                vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
>                uintptr_t ra)

Yes, this is important so that ldst_elem itself can always be inlined.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function
  2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou
@ 2024-02-15 20:10   ` Richard Henderson
  2024-02-17 17:27     ` Max Chou
  0 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:10 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza, Riku Voipio, Paolo Bonzini

On 2/15/24 09:28, Max Chou wrote:
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   accel/tcg/user-exec.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
> index 68b252cb8e8..c5453810eee 100644
> --- a/accel/tcg/user-exec.c
> +++ b/accel/tcg/user-exec.c
> @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, target_ulong last) { }
>   
>   /* The system-mode versions of these helpers are in cputlb.c.  */
>   
> -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr,
> -                            MemOp mop, uintptr_t ra, MMUAccessType type)
> +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
> +                                                      vaddr addr,
> +                                                      MemOp mop,
> +                                                      uintptr_t ra,
> +                                                      MMUAccessType type)
>   {
>       int a_bits = get_alignment_bits(mop);
>       void *ret;

This is a large function.  Why does it need to be inlined?
For this and the next two patches I require evidence, because I don't believe you are 
attacking the problem correctly.


r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function
  2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
@ 2024-02-15 20:12   ` Richard Henderson
  0 siblings, 0 replies; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:12 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza, Riku Voipio, Paolo Bonzini

On 2/15/24 09:28, Max Chou wrote:
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   accel/tcg/user-exec.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
> index c5453810eee..803c271df11 100644
> --- a/accel/tcg/user-exec.c
> +++ b/accel/tcg/user-exec.c
> @@ -963,8 +963,9 @@ static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
>   
>   #include "ldst_atomicity.c.inc"
>   
> -static uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr, MemOpIdx oi,
> -                          uintptr_t ra, MMUAccessType access_type)
> +static inline QEMU_ALWAYS_INLINE uint8_t do_ld1_mmu(CPUState *cpu, vaddr addr,
> +                                                    MemOpIdx oi, uintptr_t ra,
> +                                                    MMUAccessType access_type)
>   {
>       void *haddr;
>       uint8_t ret;

I expect the small functions that use this to tail-call into this common function.  I do 
not believe that inlining helps.  Same with do_st1_mmu.


r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
  2024-02-15 20:03   ` Richard Henderson
@ 2024-02-15 20:21   ` Daniel Henrique Barboza
  2024-02-17  9:45     ` Max Chou
  1 sibling, 1 reply; 19+ messages in thread
From: Daniel Henrique Barboza @ 2024-02-15 20:21 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: Richard Henderson, Paolo Bonzini



On 2/15/24 16:28, Max Chou wrote:
> If there are not any QEMU plugin memory callback functions, checking
> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
> function call overhead.
> 
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---

This was in my TODO list for some time. Thanks for looking it up.

Can't we avoid all callbacks, not just qemu_plugin_vcpu_mem_cb, if there's no
plugin loaded? The performance increase when building with --disable-plugins
shouldn't be a thing - if the user isn't using plug-ins there shouldn't be a
penalty for it.

Thanks,

Daniel

>   accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++----------
>   1 file changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
> index c82048e377e..bf24986c562 100644
> --- a/accel/tcg/ldst_common.c.inc
> +++ b/accel/tcg/ldst_common.c.inc
> @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
>   
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB);
>       ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
> -    plugin_load_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_load_cb(env, addr, oi);
> +    }
>       return ret;
>   }
>   
> @@ -145,7 +147,9 @@ uint16_t cpu_ldw_mmu(CPUArchState *env, abi_ptr addr,
>   
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
>       ret = do_ld2_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
> -    plugin_load_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_load_cb(env, addr, oi);
> +    }
>       return ret;
>   }
>   
> @@ -156,7 +160,9 @@ uint32_t cpu_ldl_mmu(CPUArchState *env, abi_ptr addr,
>   
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
>       ret = do_ld4_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
> -    plugin_load_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_load_cb(env, addr, oi);
> +    }
>       return ret;
>   }
>   
> @@ -167,7 +173,9 @@ uint64_t cpu_ldq_mmu(CPUArchState *env, abi_ptr addr,
>   
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
>       ret = do_ld8_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
> -    plugin_load_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_load_cb(env, addr, oi);
> +    }
>       return ret;
>   }
>   
> @@ -178,7 +186,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr,
>   
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
>       ret = do_ld16_mmu(env_cpu(env), addr, oi, ra);
> -    plugin_load_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_load_cb(env, addr, oi);
> +    }
>       return ret;
>   }
>   
> @@ -195,7 +205,9 @@ void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
>                    MemOpIdx oi, uintptr_t retaddr)
>   {
>       helper_stb_mmu(env, addr, val, oi, retaddr);
> -    plugin_store_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_store_cb(env, addr, oi);
> +    }
>   }
>   
>   void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
> @@ -203,7 +215,9 @@ void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
>   {
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
>       do_st2_mmu(env_cpu(env), addr, val, oi, retaddr);
> -    plugin_store_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_store_cb(env, addr, oi);
> +    }
>   }
>   
>   void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
> @@ -211,7 +225,9 @@ void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
>   {
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
>       do_st4_mmu(env_cpu(env), addr, val, oi, retaddr);
> -    plugin_store_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_store_cb(env, addr, oi);
> +    }
>   }
>   
>   void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
> @@ -219,7 +235,9 @@ void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
>   {
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
>       do_st8_mmu(env_cpu(env), addr, val, oi, retaddr);
> -    plugin_store_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_store_cb(env, addr, oi);
> +    }
>   }
>   
>   void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
> @@ -227,7 +245,9 @@ void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
>   {
>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
>       do_st16_mmu(env_cpu(env), addr, val, oi, retaddr);
> -    plugin_store_cb(env, addr, oi);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        plugin_store_cb(env, addr, oi);
> +    }
>   }
>   
>   /*


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
  2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
                   ` (5 preceding siblings ...)
  2024-02-15 19:28 ` [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function Max Chou
@ 2024-02-15 20:24 ` Richard Henderson
  2024-02-17  9:52   ` Max Chou
  6 siblings, 1 reply; 19+ messages in thread
From: Richard Henderson @ 2024-02-15 20:24 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv; +Cc: dbarboza

On 2/15/24 09:28, Max Chou wrote:
> Hi all,
> 
> When glibc with RVV support [1], the memcpy benchmark will run 2x to 60x
> slower than the scalar equivalent on QEMU and it hurts developer
> productivity.
> 
>  From the performance analysis result, we can observe that the glibc
> memcpy spends most of the time in the vector unit-stride load/store
> helper functions.
> 
> Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
>    Children      Self  Command       Shared Object            Symbol
> +   28.46%    27.85%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
> +   26.92%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
> +   14.41%    14.41%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
> +   13.85%    13.85%  qemu-riscv64  qemu-riscv64             [.] lde_b
> +   13.64%    13.64%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
> +    9.25%     9.19%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
> +    7.81%     7.81%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
> +    7.70%     7.70%  qemu-riscv64  qemu-riscv64             [.] ste_b
> +    5.53%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
> 
> 
> So this patchset tries to improve the performance of the RVV version of
> glibc memcpy on QEMU by improving the corresponding helper function
> quality.
> 
> The overall performance improvement can achieve following numbers
> (depending on the size).
> Average: 2.86X / Smallest: 1.15X / Largest: 4.49X
> 
> PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions,
> the next version or next serious will complete other vector ld/st part.

You are still not tackling the root problem, which is over-use of the full out-of-line 
load/store routines.  The reason that cpu_mmu_lookup is in that list is because you are 
performing the full virtual address resolution for each and every byte.

The only way to make a real improvement is to perform virtual address resolution *once* 
for the entire vector.  I refer to my previous advice:

https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
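
A very rough sketch of that direction, for the unmasked vle8.v case only
(names such as vext_ldst_us_block are made up for illustration; the real
code must also handle masking, vstart, accesses that cross a page boundary,
and the store direction):

    static void vext_ldst_us_block(void *vd, target_ulong base, uint32_t evl,
                                   uint32_t log2_esz, CPURISCVState *env,
                                   uintptr_t ra)
    {
        uint32_t size = evl << log2_esz;
        int mmu_index = cpu_mmu_index(env, false);

        /* One address resolution for the whole access, not one per byte. */
        void *host = probe_access(env, adjust_addr(env, base), size,
                                  MMU_DATA_LOAD, mmu_index, ra);
        if (host) {
            /* Backed by host RAM: copy the contiguous bytes in one go. */
            memcpy(vd, host, size);
        } else {
            /* MMIO, watchpoints, ...: fall back to the per-element path. */
        }
    }

A store would mirror this with MMU_DATA_STORE and memcpy(host, vd, size).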


r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance
  2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and corresponding function for performance Max Chou
  2024-02-15 20:09   ` Richard Henderson
@ 2024-02-15 21:11   ` Daniel Henrique Barboza
  2024-02-17 10:10     ` Max Chou
  1 sibling, 1 reply; 19+ messages in thread
From: Daniel Henrique Barboza @ 2024-02-15 21:11 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei



On 2/15/24 16:28, Max Chou wrote:
> In the vector unit-stride load/store helper functions. the vext_ldst_us
> function corresponding most of the execution time. Inline the functions
> can avoid the function call overhead to imperove the helper function
> performance.
> 
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---

The inline is a good idea but I think we can do better. In a thread last
year [1] I mentioned the time we're spending in single byte loads/stores, even
for strided instructions.

E.g. in vext_ldst_stride():


     for (i = env->vstart; i < env->vl; i++, env->vstart++) {
         k = 0;
         while (k < nf) {
             if (!vm && !vext_elem_mask(v0, i)) {
                 /* set masked-off elements to 1s */
                 vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz,
                                   (i + k * max_elems + 1) * esz);
                 k++;
                 continue;
             }
             target_ulong addr = base + stride * i + (k << log2_esz);
             ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
             k++;
         }
     }

We're doing single byte load/stores in ldst_elem() when, in this case, we could do
it in a whole block only once. ARM does something similar in SVE.

I updated the gitlab bug https://gitlab.com/qemu-project/qemu/-/issues/2137 with this
additional info too.



Thanks,

Daniel


[1] https://lore.kernel.org/qemu-riscv/0e54c6c1-2903-7942-eff2-2b8c5e21187e@ventanamicro.com/


>   target/riscv/vector_helper.c | 30 ++++++++++++++++--------------
>   1 file changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index e8fbb921449..866f77d321d 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, int index,
>   typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr,
>                                  uint32_t idx, void *vd, uintptr_t retaddr);
>   
> -#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)            \
> -static void NAME(CPURISCVState *env, abi_ptr addr,         \
> -                 uint32_t idx, void *vd, uintptr_t retaddr)\
> -{                                                          \
> -    ETYPE *cur = ((ETYPE *)vd + H(idx));                   \
> -    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);      \
> -}                                                          \
> +#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)         \
> +static inline QEMU_ALWAYS_INLINE                        \
> +void NAME(CPURISCVState *env, abi_ptr addr,             \
> +          uint32_t idx, void *vd, uintptr_t retaddr)    \
> +{                                                       \
> +    ETYPE *cur = ((ETYPE *)vd + H(idx));                \
> +    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);   \
> +}                                                       \
>   
>   GEN_VEXT_LD_ELEM(lde_b, int8_t,  H1, ldsb)
>   GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw)
>   GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl)
>   GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq)
>   
> -#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)            \
> -static void NAME(CPURISCVState *env, abi_ptr addr,         \
> -                 uint32_t idx, void *vd, uintptr_t retaddr)\
> -{                                                          \
> -    ETYPE data = *((ETYPE *)vd + H(idx));                  \
> -    cpu_##STSUF##_data_ra(env, addr, data, retaddr);       \
> +#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)         \
> +static inline QEMU_ALWAYS_INLINE                        \
> +void NAME(CPURISCVState *env, abi_ptr addr,             \
> +          uint32_t idx, void *vd, uintptr_t retaddr)    \
> +{                                                       \
> +    ETYPE data = *((ETYPE *)vd + H(idx));               \
> +    cpu_##STSUF##_data_ra(env, addr, data, retaddr);    \
>   }
>   
>   GEN_VEXT_ST_ELEM(ste_b, int8_t,  H1, stb)
> @@ -289,7 +291,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
>    */
>   
>   /* unmasked unit-stride load and store operation */
> -static void
> +static inline QEMU_ALWAYS_INLINE void
>   vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
>                vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
>                uintptr_t ra)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-02-15 20:03   ` Richard Henderson
@ 2024-02-17  9:08     ` Max Chou
  0 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-17  9:08 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv; +Cc: dbarboza, Paolo Bonzini

Hi Richard,

Thank you for the suggestion. I'll do a v2 with this.

Thanks,
Max

On 2024/2/16 4:03 AM, Richard Henderson wrote:
> On 2/15/24 09:28, Max Chou wrote:
>> If there are not any QEMU plugin memory callback functions, checking
>> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
>> function call overhead.
>>
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>>   accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++----------
>>   1 file changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
>> index c82048e377e..bf24986c562 100644
>> --- a/accel/tcg/ldst_common.c.inc
>> +++ b/accel/tcg/ldst_common.c.inc
>> @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr 
>> addr, MemOpIdx oi, uintptr_t ra)
>>         tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB);
>>       ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
>> -    plugin_load_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_load_cb(env, addr, oi);
>> +    }
>>       return ret;
>>   }
>
> Rather than repeating N times, modify plugin_load_cb and 
> plugin_store_cb just above to avoid the call to 
> cpu_plugin_mem_cbs_enabled().  I expect the compiler is inlining those 
> functions already.
>
>
> r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/6] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-02-15 20:21   ` Daniel Henrique Barboza
@ 2024-02-17  9:45     ` Max Chou
  0 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-17  9:45 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini

Hi Daniel,

I think the overhead from the plugin callbacks depends on the type of
plugin callback.
Because only qemu_plugin_vcpu_mem_cb shows up in the perf report of the
glibc memcpy benchtest, I only looked at the memory callbacks in this RFC.
I'll try to get more experimental results to check the status of the other
plugin callbacks.

Thanks,
Max

On 2024/2/16 4:21 AM, Daniel Henrique Barboza wrote:
>
>
> On 2/15/24 16:28, Max Chou wrote:
>> If there are not any QEMU plugin memory callback functions, checking
>> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
>> function call overhead.
>>
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>
> This was in my TODO list for some time. Thanks for looking it up.
>
> Can't we avoid all callbacks, not just qemu_plugin_vcpu_mem_cb, if 
> there's no
> plugin loaded? The performance increase when building with 
> --disable-plugins
> shouldn't be a thing - if the user isn't using plug-ins it should have a
> penalty to it.
>
> Thanks,
>
> Daniel
>
>>   accel/tcg/ldst_common.c.inc | 40 +++++++++++++++++++++++++++----------
>>   1 file changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
>> index c82048e377e..bf24986c562 100644
>> --- a/accel/tcg/ldst_common.c.inc
>> +++ b/accel/tcg/ldst_common.c.inc
>> @@ -134,7 +134,9 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr 
>> addr, MemOpIdx oi, uintptr_t ra)
>>         tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB);
>>       ret = do_ld1_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
>> -    plugin_load_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_load_cb(env, addr, oi);
>> +    }
>>       return ret;
>>   }
>>   @@ -145,7 +147,9 @@ uint16_t cpu_ldw_mmu(CPUArchState *env, abi_ptr 
>> addr,
>>         tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
>>       ret = do_ld2_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
>> -    plugin_load_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_load_cb(env, addr, oi);
>> +    }
>>       return ret;
>>   }
>>   @@ -156,7 +160,9 @@ uint32_t cpu_ldl_mmu(CPUArchState *env, abi_ptr 
>> addr,
>>         tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
>>       ret = do_ld4_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
>> -    plugin_load_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_load_cb(env, addr, oi);
>> +    }
>>       return ret;
>>   }
>>   @@ -167,7 +173,9 @@ uint64_t cpu_ldq_mmu(CPUArchState *env, abi_ptr 
>> addr,
>>         tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
>>       ret = do_ld8_mmu(env_cpu(env), addr, oi, ra, MMU_DATA_LOAD);
>> -    plugin_load_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_load_cb(env, addr, oi);
>> +    }
>>       return ret;
>>   }
>>   @@ -178,7 +186,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr 
>> addr,
>>         tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
>>       ret = do_ld16_mmu(env_cpu(env), addr, oi, ra);
>> -    plugin_load_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_load_cb(env, addr, oi);
>> +    }
>>       return ret;
>>   }
>>   @@ -195,7 +205,9 @@ void cpu_stb_mmu(CPUArchState *env, abi_ptr 
>> addr, uint8_t val,
>>                    MemOpIdx oi, uintptr_t retaddr)
>>   {
>>       helper_stb_mmu(env, addr, val, oi, retaddr);
>> -    plugin_store_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_store_cb(env, addr, oi);
>> +    }
>>   }
>>     void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
>> @@ -203,7 +215,9 @@ void cpu_stw_mmu(CPUArchState *env, abi_ptr addr, 
>> uint16_t val,
>>   {
>>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
>>       do_st2_mmu(env_cpu(env), addr, val, oi, retaddr);
>> -    plugin_store_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_store_cb(env, addr, oi);
>> +    }
>>   }
>>     void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
>> @@ -211,7 +225,9 @@ void cpu_stl_mmu(CPUArchState *env, abi_ptr addr, 
>> uint32_t val,
>>   {
>>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
>>       do_st4_mmu(env_cpu(env), addr, val, oi, retaddr);
>> -    plugin_store_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_store_cb(env, addr, oi);
>> +    }
>>   }
>>     void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
>> @@ -219,7 +235,9 @@ void cpu_stq_mmu(CPUArchState *env, abi_ptr addr, 
>> uint64_t val,
>>   {
>>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
>>       do_st8_mmu(env_cpu(env), addr, val, oi, retaddr);
>> -    plugin_store_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_store_cb(env, addr, oi);
>> +    }
>>   }
>>     void cpu_st16_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
>> @@ -227,7 +245,9 @@ void cpu_st16_mmu(CPUArchState *env, abi_ptr 
>> addr, Int128 val,
>>   {
>>       tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
>>       do_st16_mmu(env_cpu(env), addr, val, oi, retaddr);
>> -    plugin_store_cb(env, addr, oi);
>> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
>> +        plugin_store_cb(env, addr, oi);
>> +    }
>>   }
>>     /*


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions
  2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
@ 2024-02-17  9:52   ` Max Chou
  0 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-17  9:52 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv; +Cc: dbarboza

Hi Richard,

Thank you for the suggestion and the reference.
I'm trying to implement it by following the reference, and I'll send 
another version of this series.
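
As a rough illustration of that direction (a sketch only, not the 
actual implementation: it covers just the vle8.v/vse8.v byte case, 
ignores masking and vstart bookkeeping, ignores the big-endian host 
element swizzle, and the helper name and fallback handling are made-up 
assumptions), the unit-stride helper could translate once per guest 
page with the existing probe_access() API and copy each in-page chunk 
with memcpy:

    /*
     * Sketch only: unit-stride vle8.v/vse8.v body.  mmu_idx is whatever
     * index the caller already uses; masking, vstart bookkeeping and the
     * big-endian host element swizzle are ignored for brevity.
     */
    static void vext_ldst_us_bytes(CPURISCVState *env, void *vd,
                                   target_ulong base, uint32_t total,
                                   int mmu_idx, MMUAccessType access_type,
                                   uintptr_t ra)
    {
        uint32_t done = 0;

        while (done < total) {
            target_ulong addr = base + done;
            /* Never let one translation cross a guest page boundary. */
            uint32_t in_page = -(addr | TARGET_PAGE_MASK);
            uint32_t len = MIN(total - done, in_page);
            void *host = probe_access(env, addr, len, access_type,
                                      mmu_idx, ra);

            if (host) {
                /* RAM: one host-side copy for the whole in-page chunk. */
                if (access_type == MMU_DATA_LOAD) {
                    memcpy((uint8_t *)vd + done, host, len);
                } else {
                    memcpy(host, (uint8_t *)vd + done, len);
                }
            } else {
                /* I/O, watchpoints, etc.: keep the per-byte slow path. */
                for (uint32_t i = 0; i < len; i++) {
                    if (access_type == MMU_DATA_LOAD) {
                        ((uint8_t *)vd)[done + i] =
                            cpu_ldub_data_ra(env, addr + i, ra);
                    } else {
                        cpu_stb_data_ra(env, addr + i,
                                        ((uint8_t *)vd)[done + i], ra);
                    }
                }
            }
            done += len;
        }
    }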

Thanks a lot,
Max

On 2024/2/16 4:24 AM, Richard Henderson wrote:
> On 2/15/24 09:28, Max Chou wrote:
>> Hi all,
>>
>> When glibc with RVV support [1], the memcpy benchmark will run 2x to 60x
>> slower than the scalar equivalent on QEMU and it hurts developer
>> productivity.
>>
>>  From the performance analysis result, we can observe that the glibc
>> memcpy spends most of the time in the vector unit-stride load/store
>> helper functions.
>>
>> Samples: 465K of event 'cycles:u', Event count (approx.): 1707645730664
>>    Children      Self  Command       Shared Object Symbol
>> +   28.46%    27.85%  qemu-riscv64  qemu-riscv64             [.] 
>> vext_ldst_us
>> +   26.92%     0.00%  qemu-riscv64  [unknown]                [.] 
>> 0x00000000000000ff
>> +   14.41%    14.41%  qemu-riscv64  qemu-riscv64             [.] 
>> qemu_plugin_vcpu_mem_cb
>> +   13.85%    13.85%  qemu-riscv64  qemu-riscv64             [.] lde_b
>> +   13.64%    13.64%  qemu-riscv64  qemu-riscv64             [.] 
>> cpu_stb_mmu
>> +    9.25%     9.19%  qemu-riscv64  qemu-riscv64             [.] 
>> cpu_ldb_mmu
>> +    7.81%     7.81%  qemu-riscv64  qemu-riscv64             [.] 
>> cpu_mmu_lookup
>> +    7.70%     7.70%  qemu-riscv64  qemu-riscv64             [.] ste_b
>> +    5.53%     0.00%  qemu-riscv64  qemu-riscv64             [.] 
>> adjust_addr (inlined)
>>
>>
>> So this patchset tries to improve the performance of the RVV version of
>> glibc memcpy on QEMU by improving the corresponding helper function
>> quality.
>>
>> The overall performance improvement can achieve following numbers
>> (depending on the size).
>> Average: 2.86X / Smallest: 1.15X / Largest: 4.49X
>>
>> PS: This RFC patchset only focuses on the vle8.v & vse8.v instructions,
>> the next version or next serious will complete other vector ld/st part.
>
> You are still not tackling the root problem, which is over-use of the 
> full out-of-line load/store routines.  The reason that cpu_mmu_lookup 
> is in that list is because you are performing the full virtual address 
> resolution for each and every byte.
>
> The only way to make a real improvement is to perform virtual address 
> resolution *once* for the entire vector.  I refer to my previous advice:
>
> https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
>
>
> r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance
  2024-02-15 21:11   ` Daniel Henrique Barboza
@ 2024-02-17 10:10     ` Max Chou
  0 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-17 10:10 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei

Hi Daniel,

Thank you for the information and suggestion.
Yes, we can do it better if we load/store more bytes at a time.
I'll try to improve the RFC in this way.
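
As a rough illustration (not the eventual change: a little-endian host 
is assumed so the H1() element swizzle can be ignored, and "is_load" 
stands in for the load/store distinction that the real code bakes into 
ldst_elem), the unmasked vle8.v/vse8.v body could already issue one 
out-of-line access per 8 bytes instead of one per byte:

    /* Sketch only: unmasked vle8.v/vse8.v body, little-endian host. */
    uint32_t i = env->vstart;
    uint8_t *buf = vd;

    /* Main body: one 64-bit guest access covers 8 byte elements. */
    for (; i + 8 <= evl; i += 8) {
        if (is_load) {
            stq_le_p(buf + i, cpu_ldq_data_ra(env, base + i, ra));
        } else {
            cpu_stq_data_ra(env, base + i, ldq_le_p(buf + i), ra);
        }
    }
    /* Tail: the remaining 0-7 elements still use the byte helpers. */
    for (; i < evl; i++) {
        if (is_load) {
            buf[i] = cpu_ldub_data_ra(env, base + i, ra);
        } else {
            cpu_stb_data_ra(env, base + i, buf[i], ra);
        }
    }
    env->vstart = 0;

This alone does not remove the per-access address translation that 
Richard points out in the cover letter thread, but it already divides 
the number of out-of-line calls by eight for the common case.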

Thanks,

Max

On 2024/2/16 5:11 AM, Daniel Henrique Barboza wrote:
>
>
> On 2/15/24 16:28, Max Chou wrote:
>> In the vector unit-stride load/store helper functions, the vext_ldst_us
>> function accounts for most of the execution time. Inlining the functions
>> can avoid the function call overhead and improve the helper function
>> performance.
>>
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>
> The inline is a good idea but I think we can do better. I mentioned in 
> a thread
> last year [1] about the time we're spending in single byte 
> loads/stores, even
> for strided instructions.
>
> E.g. in vext_ldst_stride():
>
>
>     for (i = env->vstart; i < env->vl; i++, env->vstart++) {
>         k = 0;
>         while (k < nf) {
>             if (!vm && !vext_elem_mask(v0, i)) {
>                 /* set masked-off elements to 1s */
>                 vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz,
>                                   (i + k * max_elems + 1) * esz);
>                 k++;
>                 continue;
>             }
>             target_ulong addr = base + stride * i + (k << log2_esz);
>             ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, 
> vd, ra);
>             k++;
>         }
>     }
>
> We're doing single byte load/stores in ldst_elem() when, in this case, 
> we could do
> it in a whole block only once. ARM does something similar in SVE.
>
> I updated the gitlab bug
> https://gitlab.com/qemu-project/qemu/-/issues/2137 with this
> additional info too.
>
>
>
> Thanks,
>
> Daniel
>
>
> [1] 
> https://lore.kernel.org/qemu-riscv/0e54c6c1-2903-7942-eff2-2b8c5e21187e@ventanamicro.com/
>
>
>>   target/riscv/vector_helper.c | 30 ++++++++++++++++--------------
>>   1 file changed, 16 insertions(+), 14 deletions(-)
>>
>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>> index e8fbb921449..866f77d321d 100644
>> --- a/target/riscv/vector_helper.c
>> +++ b/target/riscv/vector_helper.c
>> @@ -149,25 +149,27 @@ static inline void vext_set_elem_mask(void *v0, 
>> int index,
>>   typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr,
>>                                  uint32_t idx, void *vd, uintptr_t 
>> retaddr);
>>   -#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)            \
>> -static void NAME(CPURISCVState *env, abi_ptr addr,         \
>> -                 uint32_t idx, void *vd, uintptr_t retaddr)\
>> -{                                                          \
>> -    ETYPE *cur = ((ETYPE *)vd + H(idx));                   \
>> -    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);      \
>> -}                                                          \
>> +#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)         \
>> +static inline QEMU_ALWAYS_INLINE                        \
>> +void NAME(CPURISCVState *env, abi_ptr addr,             \
>> +          uint32_t idx, void *vd, uintptr_t retaddr)    \
>> +{                                                       \
>> +    ETYPE *cur = ((ETYPE *)vd + H(idx));                \
>> +    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);   \
>> +}                                                       \
>>     GEN_VEXT_LD_ELEM(lde_b, int8_t,  H1, ldsb)
>>   GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw)
>>   GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl)
>>   GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq)
>>   -#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)            \
>> -static void NAME(CPURISCVState *env, abi_ptr addr,         \
>> -                 uint32_t idx, void *vd, uintptr_t retaddr)\
>> -{                                                          \
>> -    ETYPE data = *((ETYPE *)vd + H(idx));                  \
>> -    cpu_##STSUF##_data_ra(env, addr, data, retaddr);       \
>> +#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)         \
>> +static inline QEMU_ALWAYS_INLINE                        \
>> +void NAME(CPURISCVState *env, abi_ptr addr,             \
>> +          uint32_t idx, void *vd, uintptr_t retaddr)    \
>> +{                                                       \
>> +    ETYPE data = *((ETYPE *)vd + H(idx));               \
>> +    cpu_##STSUF##_data_ra(env, addr, data, retaddr);    \
>>   }
>>     GEN_VEXT_ST_ELEM(ste_b, int8_t,  H1, stb)
>> @@ -289,7 +291,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
>>    */
>>     /* unmasked unit-stride load and store operation */
>> -static void
>> +static inline QEMU_ALWAYS_INLINE void
>>   vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, 
>> uint32_t desc,
>>                vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, 
>> uint32_t evl,
>>                uintptr_t ra)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function
  2024-02-15 20:10   ` Richard Henderson
@ 2024-02-17 17:27     ` Max Chou
  0 siblings, 0 replies; 19+ messages in thread
From: Max Chou @ 2024-02-17 17:27 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv
  Cc: dbarboza, Riku Voipio, Paolo Bonzini

Hi Richard,

After I tried to reduce the overhead of the unnecessary segment flow 
and the memory plugin callbacks, I observed in the perf report that 
several functions consume most of the runtime.
So these three inline patches and the previous patch were created to 
reduce that overhead without modifying the functions themselves.

The following are the experiment results.
The benchmark target is the bench-memcpy executable generated from the 
glibc repository (release 2.38 with RVV support patch [1]).

- Execution command

`qemu-riscv64 -E TIMEOUTFACTOR=10 -R 1G -L {glibc build folder}/rootfs 
-cpu 
rv64,zba=true,zbb=true,v=true,vlen=256,vext_spec=v1.0,rvv_ta_all_1s=true,rvv_ma_all_1s=true 
bench-memcpy`
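
(For reference: the mails don't show how the profiles below were 
collected. Reports with Children/Self columns like these are typically 
captured by wrapping the command above in `perf record -g -e cycles:u 
-- ...` and then viewing the result with `perf report --children`; the 
exact perf options used for these numbers are an assumption.)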


- Total runtime

0. Original riscv-to-apply.next branch (commit ID: deb0ff0)
     - Total execution time: ~383 sec.
1. Cherry pick PATCH 4 to riscv-to-apply.next branch (commit ID: deb0ff0)
     - Total execution time: ~375 sec.
2. Cherry pick PATCH 4+5+6 to riscv-to-apply.next branch (commit ID: deb0ff0)
     - Total execution time: ~342 sec.


- Perf report (cycles)

0. Original riscv-to-apply.next branch (commit ID: deb0ff0)

# Samples: 385K of event 'cycles:u'
# Event count (approx.): 1411574690539
#
# Children      Self  Command       Shared Object            Symbol
# ........  ........  ............  .......................  ............................................
#
    47.54%    31.35%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
    25.60%     0.03%  qemu-riscv64  qemu-riscv64             [.] helper_vse8_v
    18.87%    13.29%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
    17.40%     0.04%  qemu-riscv64  qemu-riscv64             [.] helper_vle8_v
    17.39%    15.71%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
    17.17%    17.17%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
    12.25%     0.00%  qemu-riscv64  qemu-riscv64             [.] helper_stb_mmu (inlined)
     8.18%     4.08%  qemu-riscv64  qemu-riscv64             [.] lde_b
     8.17%     8.17%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
     7.45%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_load_cb (inlined)
     7.32%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_st1_mmu (inlined)
     6.79%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000000ff
     5.91%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
     5.70%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu (inlined)
     5.58%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000001fe
     5.23%     4.25%  qemu-riscv64  qemu-riscv64             [.] cpu_ldsb_data_ra
     4.93%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_memop (inlined)
     4.11%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_data_ra (inlined)
     4.11%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_mmuidx_ra (inlined)
     2.88%     2.88%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_data_ra
     2.88%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmuidx_ra (inlined)
     2.75%     0.00%  qemu-riscv64  qemu-riscv64             [.] stb_p (inlined)
     2.66%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     1.79%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000e40203bf
     1.68%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_store_cb (inlined)
     1.60%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000e403733f
     1.13%     0.00%  qemu-riscv64  qemu-riscv64             [.] ldub_p (inlined)
     0.73%     0.73%  qemu-riscv64  qemu-riscv64             [.] ste_b
     0.53%     0.21%  qemu-riscv64  qemu-riscv64             [.] helper_lookup_tb_ptr

1. Cherry pick PATCH 4 to riscv-to-apply.next branch (commit ID: deb0ff0)

# Samples: 378K of event 'cycles:u'
# Event count (approx.): 1381912775966
#
# Children      Self  Command       Shared Object            Symbol
# ........  ........  ............  .......................  .......................................
#
    63.30%    29.62%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
    30.77%     0.04%  qemu-riscv64  qemu-riscv64             [.] helper_vle8_v
    28.59%     0.02%  qemu-riscv64  qemu-riscv64             [.] helper_vse8_v
    22.78%    22.78%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
    21.40%     5.26%  qemu-riscv64  qemu-riscv64             [.] lde_b
    20.69%    10.40%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
    20.06%     3.91%  qemu-riscv64  qemu-riscv64             [.] cpu_ldsb_data_ra
    19.16%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_data_ra (inlined)
    19.16%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_mmuidx_ra (inlined)
    12.65%     9.36%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
     8.60%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_load_cb (inlined)
     6.73%     6.73%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu.constprop.23
     6.73%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu (inlined)
     6.31%     0.00%  qemu-riscv64  qemu-riscv64             [.] helper_stb_mmu (inlined)
     6.20%     6.20%  qemu-riscv64  qemu-riscv64             [.] do_st1_mmu
     6.20%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
     5.49%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000001fe
     3.82%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_store_cb (inlined)
     3.01%     0.00%  qemu-riscv64  qemu-riscv64             [.] ldub_p (inlined)
     2.94%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     2.94%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     2.91%     0.00%  qemu-riscv64  qemu-riscv64             [.] clear_helper_retaddr (inlined)
     2.91%     2.91%  qemu-riscv64  qemu-riscv64             [.] ste_b
     2.90%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     2.90%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     0.59%     0.24%  qemu-riscv64  qemu-riscv64             [.] helper_lookup_tb_ptr

2. Cherry pick PATCH 4+5+6 to riscv-to-apply.next branch (commit ID: deb0ff0)

# Samples: 343K of event 'cycles:u'
# Event count (approx.): 1259748868940
#
# Children      Self  Command       Shared Object            Symbol
# ........  ........  ............  .......................  .......................................
#
    64.16%    35.81%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
    30.96%     0.02%  qemu-riscv64  qemu-riscv64             [.] helper_vse8_v
    27.44%     0.05%  qemu-riscv64  qemu-riscv64             [.] helper_vle8_v
    18.37%    18.37%  qemu-riscv64  qemu-riscv64             [.] qemu_plugin_vcpu_mem_cb
    17.33%     7.45%  qemu-riscv64  qemu-riscv64             [.] cpu_ldsb_data_ra
    14.90%     5.02%  qemu-riscv64  qemu-riscv64             [.] lde_b
    14.83%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_data_ra (inlined)
    14.83%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_ldub_mmuidx_ra (inlined)
    14.15%    10.33%  qemu-riscv64  qemu-riscv64             [.] cpu_stb_mmu
    11.25%     9.62%  qemu-riscv64  qemu-riscv64             [.] cpu_ldb_mmu
     7.22%     0.00%  qemu-riscv64  qemu-riscv64             [.] adjust_addr (inlined)
     6.99%     6.99%  qemu-riscv64  qemu-riscv64             [.] helper_stb_mmu
     6.99%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_st1_mmu (inlined)
     5.96%     0.00%  qemu-riscv64  qemu-riscv64             [.] do_ld1_mmu (inlined)
     5.95%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     5.30%     0.00%  qemu-riscv64  [unknown]                [.] 0x00000000000001fe
     4.18%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_store_cb (inlined)
     3.22%     0.00%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup (inlined)
     3.22%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     3.22%     3.22%  qemu-riscv64  qemu-riscv64             [.] ste_b
     3.19%     0.00%  qemu-riscv64  qemu-riscv64             [.] get_alignment_bits (inlined)
     3.16%     0.00%  qemu-riscv64  qemu-riscv64             [.] clear_helper_retaddr (inlined)
     2.76%     0.00%  qemu-riscv64  qemu-riscv64             [.] g2h (inlined)
     2.76%     0.00%  qemu-riscv64  qemu-riscv64             [.] g2h_untagged (inlined)
     2.30%     0.00%  qemu-riscv64  qemu-riscv64             [.] env_cpu (inlined)
     2.21%     0.00%  qemu-riscv64  qemu-riscv64             [.] plugin_load_cb (inlined)
     0.99%     0.00%  qemu-riscv64  qemu-riscv64             [.] env_cpu (inlined)


I agree that these are large functions to inline, and I didn't test 
these patches on all combinations (different guest architectures + 
different host architectures).
So I think we can drop these three patches until we can make sure they 
benefit all combinations without side effects.
I'll focus on avoiding over-use of the full out-of-line load/store 
routines in the next version.


Thanks for the suggestion and question,

Max

[1] 
https://inbox.sourceware.org/libc-alpha/20230504074851.38763-1-hau.hsu@sifive.com

On 2024/2/16 4:10 AM, Richard Henderson wrote:
> On 2/15/24 09:28, Max Chou wrote:
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>>   accel/tcg/user-exec.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
>> index 68b252cb8e8..c5453810eee 100644
>> --- a/accel/tcg/user-exec.c
>> +++ b/accel/tcg/user-exec.c
>> @@ -942,8 +942,11 @@ void page_reset_target_data(target_ulong start, 
>> target_ulong last) { }
>>     /* The system-mode versions of these helpers are in cputlb.c.  */
>>   -static void *cpu_mmu_lookup(CPUState *cpu, vaddr addr,
>> -                            MemOp mop, uintptr_t ra, MMUAccessType 
>> type)
>> +static inline QEMU_ALWAYS_INLINE void *cpu_mmu_lookup(CPUState *cpu,
>> +                                                      vaddr addr,
>> +                                                      MemOp mop,
>> +                                                      uintptr_t ra,
>> + MMUAccessType type)
>>   {
>>       int a_bits = get_alignment_bits(mop);
>>       void *ret;
>
> This is a large function.  Why does it need to be inlined?
> For this and the next two patches I require evidence, because I don't 
> believe you are attacking the problem correctly.
>
>
> r~


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-02-17 17:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-15 19:28 [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Max Chou
2024-02-15 19:28 ` [RFC PATCH 1/6] target/riscv: Seperate vector segment " Max Chou
2024-02-15 19:28 ` [RFC PATCH 2/6] accel/tcg: Avoid uncessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
2024-02-15 20:03   ` Richard Henderson
2024-02-17  9:08     ` Max Chou
2024-02-15 20:21   ` Daniel Henrique Barboza
2024-02-17  9:45     ` Max Chou
2024-02-15 19:28 ` [RFC PATCH 3/6] target/riscv: Inline vext_ldst_us and coressponding function for performance Max Chou
2024-02-15 20:09   ` Richard Henderson
2024-02-15 21:11   ` Daniel Henrique Barboza
2024-02-17 10:10     ` Max Chou
2024-02-15 19:28 ` [RFC PATCH 4/6] accel/tcg: Inline cpu_mmu_lookup function Max Chou
2024-02-15 20:10   ` Richard Henderson
2024-02-17 17:27     ` Max Chou
2024-02-15 19:28 ` [RFC PATCH 5/6] accel/tcg: Inline do_ld1_mmu function Max Chou
2024-02-15 20:12   ` Richard Henderson
2024-02-15 19:28 ` [RFC PATCH 6/6] accel/tcg: Inline do_st1_mmu function Max Chou
2024-02-15 20:24 ` [RFC PATCH 0/6] Improve the performance of RISC-V vector unit-stride ld/st instructions Richard Henderson
2024-02-17  9:52   ` Max Chou
