* [PATCH v5 1/2] Generate strided vector loads/stores with tcg nodes.
2025-08-19 13:23 [PATCH v5 0/2] target/riscv: Generate strided vector ld/st with tcg Chao Liu
@ 2025-08-19 13:23 ` Chao Liu
2025-08-19 13:23 ` [PATCH v5 2/2] tests/tcg/riscv64: Add test for vlsseg8e32 instruction Chao Liu
2025-09-03 2:21 ` [PATCH v5 0/2] target/riscv: Generate strided vector ld/st with tcg Nicholas Piggin
2 siblings, 0 replies; 5+ messages in thread
From: Chao Liu @ 2025-08-19 13:23 UTC
To: richard.henderson, paolo.savini, ebiggers, dbarboza, palmer,
alistair.francis, liwei1518, zhiwei_liu
Cc: qemu-riscv, qemu-devel, Chao Liu, Chao Liu
From: Chao Liu <chao.liu@yeah.net>
This commit improves the performance of QEMU when emulating strided vector
loads and stores by replacing the call to the helper function with the
generation of equivalent TCG operations.
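For reference, the per-element semantics the inline TCG loop has to reproduce
look roughly like this (a minimal scalar sketch, not QEMU API; masking and
tail handling omitted, all names illustrative):

    #include <stddef.h>
    #include <string.h>

    /* Strided segment load, vlsseg<nf>e<eew>.v style: field k of segment i
     * is read from base + stride * i + k * esz and lands in element i of
     * vector register group vd + k. */
    static void strided_segment_load(char *vd, const char *base,
                                     ptrdiff_t stride, size_t vstart,
                                     size_t vl, size_t nf, size_t esz,
                                     size_t max_elems)
    {
        for (size_t i = vstart; i < vl; i++) {
            for (size_t k = 0; k < nf; k++) {
                memcpy(vd + (k * max_elems + i) * esz,
                       base + stride * (ptrdiff_t)i + k * esz, esz);
            }
        }
    }

Previously every executed instruction called out to a C helper that runs this
loop; emitting equivalent TCG ops inline removes the helper-call overhead from
the hot path.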
Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
Signed-off-by: Chao Liu <chao.liu@zevorn.cn>
Tested-by: Eric Biggers <ebiggers@kernel.org>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
---
target/riscv/insn_trans/trans_rvv.c.inc | 317 ++++++++++++++++++++----
1 file changed, 267 insertions(+), 50 deletions(-)
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 71f98fb350..22d598aad6 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -864,32 +864,280 @@ GEN_VEXT_TRANS(vlm_v, MO_8, vlm_v, ld_us_mask_op, ld_us_mask_check)
GEN_VEXT_TRANS(vsm_v, MO_8, vsm_v, st_us_mask_op, st_us_mask_check)
/*
- *** stride load and store
+ * MAXSZ returns the maximum vector size, in bytes, that can be operated
+ * on. It is used by the GVEC IR when the vl_eq_vlmax flag is set to true,
+ * to accelerate vector operations.
+ */
+static inline uint32_t MAXSZ(DisasContext *s)
+{
+ int max_sz = s->cfg_ptr->vlenb << 3;
+ return max_sz >> (3 - s->lmul);
+}
+
+static inline uint32_t get_log2(uint32_t a)
+{
+ assert(is_power_of_2(a));
+ return ctz32(a);
+}
+
+typedef void gen_tl_ldst(TCGv, TCGv_ptr, tcg_target_long);
+
+/*
+ * Simulate the strided load/store main loop:
+ *
+ * for (i = env->vstart; i < env->vl; env->vstart = ++i) {
+ * k = 0;
+ * while (k < nf) {
+ * if (!vm && !vext_elem_mask(v0, i)) {
+ * vext_set_elems_1s(vd, vma, (i + k * max_elems) * esz,
+ * (i + k * max_elems + 1) * esz);
+ * k++;
+ * continue;
+ * }
+ * target_ulong addr = base + stride * i + (k << log2_esz);
+ * ldst(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+ * k++;
+ * }
+ * }
+ */
+static void gen_ldst_stride_main_loop(DisasContext *s, TCGv dest, uint32_t rs1,
+ uint32_t rs2, uint32_t vm, uint32_t nf,
+ gen_tl_ldst *ld_fn, gen_tl_ldst *st_fn,
+ bool is_load)
+{
+ TCGv addr = tcg_temp_new();
+ TCGv base = get_gpr(s, rs1, EXT_NONE);
+ TCGv stride = get_gpr(s, rs2, EXT_NONE);
+
+ TCGv i = tcg_temp_new();
+ TCGv i_esz = tcg_temp_new();
+ TCGv k = tcg_temp_new();
+ TCGv k_esz = tcg_temp_new();
+ TCGv k_max = tcg_temp_new();
+ TCGv mask = tcg_temp_new();
+ TCGv mask_offs = tcg_temp_new();
+ TCGv mask_offs_64 = tcg_temp_new();
+ TCGv mask_elem = tcg_temp_new();
+ TCGv mask_offs_rem = tcg_temp_new();
+ TCGv vreg = tcg_temp_new();
+ TCGv dest_offs = tcg_temp_new();
+ TCGv stride_offs = tcg_temp_new();
+
+ uint32_t max_elems = MAXSZ(s) >> s->sew;
+
+ TCGLabel *start = gen_new_label();
+ TCGLabel *end = gen_new_label();
+ TCGLabel *start_k = gen_new_label();
+ TCGLabel *inc_k = gen_new_label();
+ TCGLabel *end_k = gen_new_label();
+
+ MemOp atomicity = MO_ATOM_NONE;
+ if (s->sew == 0) {
+ atomicity = MO_ATOM_NONE;
+ } else {
+ atomicity = MO_ATOM_IFALIGN_PAIR;
+ }
+
+ tcg_gen_addi_tl(mask, (TCGv)tcg_env, vreg_ofs(s, 0));
+
+ /* Start of outer loop. */
+ tcg_gen_mov_tl(i, cpu_vstart);
+ gen_set_label(start);
+ tcg_gen_brcond_tl(TCG_COND_GE, i, cpu_vl, end);
+ tcg_gen_shli_tl(i_esz, i, s->sew);
+ /* Start of inner loop. */
+ tcg_gen_movi_tl(k, 0);
+ gen_set_label(start_k);
+ tcg_gen_brcond_tl(TCG_COND_GE, k, tcg_constant_tl(nf), end_k);
+ /*
+ * If we are in the mask agnostic regime and the operation is masked, we
+ * set the inactive elements to 1s.
+ */
+ if (!vm && s->vma) {
+ TCGLabel *active_element = gen_new_label();
+ /* (i + k * max_elems) * esz */
+ tcg_gen_shli_tl(mask_offs, k, get_log2(max_elems << s->sew));
+ tcg_gen_add_tl(mask_offs, mask_offs, i_esz);
+
+ /*
+ * Check whether the i bit of the mask is 0 or 1.
+ *
+ * static inline int vext_elem_mask(void *v0, int index)
+ * {
+ * int idx = index / 64;
+ * int pos = index % 64;
+ * return (((uint64_t *)v0)[idx] >> pos) & 1;
+ * }
+ */
+ tcg_gen_shri_tl(mask_offs_64, mask_offs, 3);
+ tcg_gen_add_tl(mask_offs_64, mask_offs_64, mask);
+ tcg_gen_ld_i64((TCGv_i64)mask_elem, (TCGv_ptr)mask_offs_64, 0);
+ tcg_gen_rem_tl(mask_offs_rem, mask_offs, tcg_constant_tl(8));
+ tcg_gen_shr_tl(mask_elem, mask_elem, mask_offs_rem);
+ tcg_gen_andi_tl(mask_elem, mask_elem, 1);
+ tcg_gen_brcond_tl(TCG_COND_NE, mask_elem, tcg_constant_tl(0),
+ active_element);
+ /*
+ * Set masked-off elements in the destination vector register to 1s.
+ * Store instructions simply skip this bit as memory ops access memory
+ * only for active elements.
+ */
+ if (is_load) {
+ tcg_gen_shli_tl(mask_offs, mask_offs, s->sew);
+ tcg_gen_add_tl(mask_offs, mask_offs, dest);
+ st_fn(tcg_constant_tl(-1), (TCGv_ptr)mask_offs, 0);
+ }
+ tcg_gen_br(inc_k);
+ gen_set_label(active_element);
+ }
+ /*
+ * The element is active, calculate the address with stride:
+ * target_ulong addr = base + stride * i + (k << log2_esz);
+ */
+ tcg_gen_mul_tl(stride_offs, stride, i);
+ tcg_gen_shli_tl(k_esz, k, s->sew);
+ tcg_gen_add_tl(stride_offs, stride_offs, k_esz);
+ tcg_gen_add_tl(addr, base, stride_offs);
+ /* Calculate the offset in the dst/src vector register. */
+ tcg_gen_shli_tl(k_max, k, get_log2(max_elems));
+ tcg_gen_add_tl(dest_offs, i, k_max);
+ tcg_gen_shli_tl(dest_offs, dest_offs, s->sew);
+ tcg_gen_add_tl(dest_offs, dest_offs, dest);
+ if (is_load) {
+ tcg_gen_qemu_ld_tl(vreg, addr, s->mem_idx, MO_LE | s->sew | atomicity);
+ st_fn((TCGv)vreg, (TCGv_ptr)dest_offs, 0);
+ } else {
+ ld_fn((TCGv)vreg, (TCGv_ptr)dest_offs, 0);
+ tcg_gen_qemu_st_tl(vreg, addr, s->mem_idx, MO_LE | s->sew | atomicity);
+ }
+ /*
+ * We don't execute the load/store above if the element was inactive.
+ * We jump instead directly to incrementing k and continuing the loop.
+ */
+ if (!vm && s->vma) {
+ gen_set_label(inc_k);
+ }
+ tcg_gen_addi_tl(k, k, 1);
+ tcg_gen_br(start_k);
+ /* End of the inner loop. */
+ gen_set_label(end_k);
+
+ tcg_gen_addi_tl(i, i, 1);
+ tcg_gen_mov_tl(cpu_vstart, i);
+ tcg_gen_br(start);
+
+ /* End of the outer loop. */
+ gen_set_label(end);
+
+ return;
+}
+
+
+/*
+ * Set the tail bytes of the strided loads/stores to 1:
+ *
+ * for (k = 0; k < nf; ++k) {
+ * cnt = (k * max_elems + vl) * esz;
+ * tot = (k * max_elems + max_elems) * esz;
+ * for (i = cnt; i < tot; i += esz) {
+ * store_1s(-1, vd[vl+i]);
+ * }
+ * }
*/
-typedef void gen_helper_ldst_stride(TCGv_ptr, TCGv_ptr, TCGv,
- TCGv, TCGv_env, TCGv_i32);
+static void gen_ldst_stride_tail_loop(DisasContext *s, TCGv dest, uint32_t nf,
+ gen_tl_ldst *st_fn)
+{
+ TCGv i = tcg_temp_new();
+ TCGv k = tcg_temp_new();
+ TCGv tail_cnt = tcg_temp_new();
+ TCGv tail_tot = tcg_temp_new();
+ TCGv tail_addr = tcg_temp_new();
+
+ TCGLabel *start = gen_new_label();
+ TCGLabel *end = gen_new_label();
+ TCGLabel *start_i = gen_new_label();
+ TCGLabel *end_i = gen_new_label();
+
+ uint32_t max_elems_b = MAXSZ(s);
+ uint32_t esz = 1 << s->sew;
+
+ /* Start of the outer loop. */
+ tcg_gen_movi_tl(k, 0);
+ tcg_gen_shli_tl(tail_cnt, cpu_vl, s->sew);
+ tcg_gen_movi_tl(tail_tot, max_elems_b);
+ tcg_gen_add_tl(tail_addr, dest, tail_cnt);
+ gen_set_label(start);
+ tcg_gen_brcond_tl(TCG_COND_GE, k, tcg_constant_tl(nf), end);
+ /* Start of the inner loop. */
+ tcg_gen_mov_tl(i, tail_cnt);
+ gen_set_label(start_i);
+ tcg_gen_brcond_tl(TCG_COND_GE, i, tail_tot, end_i);
+ /* store_1s(-1, vd[vl+i]); */
+ st_fn(tcg_constant_tl(-1), (TCGv_ptr)tail_addr, 0);
+ tcg_gen_addi_tl(tail_addr, tail_addr, esz);
+ tcg_gen_addi_tl(i, i, esz);
+ tcg_gen_br(start_i);
+ /* End of the inner loop. */
+ gen_set_label(end_i);
+ /* Update the counts */
+ tcg_gen_addi_tl(tail_cnt, tail_cnt, max_elems_b);
+ tcg_gen_addi_tl(tail_tot, tail_cnt, max_elems_b);
+ tcg_gen_addi_tl(k, k, 1);
+ tcg_gen_br(start);
+ /* End of the outer loop. */
+ gen_set_label(end);
+
+ return;
+}
static bool ldst_stride_trans(uint32_t vd, uint32_t rs1, uint32_t rs2,
- uint32_t data, gen_helper_ldst_stride *fn,
- DisasContext *s)
+ uint32_t data, DisasContext *s, bool is_load)
{
- TCGv_ptr dest, mask;
- TCGv base, stride;
- TCGv_i32 desc;
+ if (!s->vstart_eq_zero) {
+ return false;
+ }
- dest = tcg_temp_new_ptr();
- mask = tcg_temp_new_ptr();
- base = get_gpr(s, rs1, EXT_NONE);
- stride = get_gpr(s, rs2, EXT_NONE);
- desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
- s->cfg_ptr->vlenb, data));
+ TCGv dest = tcg_temp_new();
- tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
- tcg_gen_addi_ptr(mask, tcg_env, vreg_ofs(s, 0));
+ uint32_t nf = FIELD_EX32(data, VDATA, NF);
+ uint32_t vm = FIELD_EX32(data, VDATA, VM);
+
+ /* Pointer to the destination vector register. */
+ tcg_gen_addi_tl(dest, (TCGv)tcg_env, vreg_ofs(s, vd));
+
+ /*
+ * Select the appropriate load/store to retrieve data from the vector
+ * register given a specific sew.
+ */
+ static gen_tl_ldst * const ld_fns[4] = {
+ tcg_gen_ld8u_tl, tcg_gen_ld16u_tl,
+ tcg_gen_ld32u_tl, tcg_gen_ld_tl
+ };
+
+ static gen_tl_ldst * const st_fns[4] = {
+ tcg_gen_st8_tl, tcg_gen_st16_tl,
+ tcg_gen_st32_tl, tcg_gen_st_tl
+ };
+
+ gen_tl_ldst *ld_fn = ld_fns[s->sew];
+ gen_tl_ldst *st_fn = st_fns[s->sew];
+
+ if (ld_fn == NULL || st_fn == NULL) {
+ return false;
+ }
mark_vs_dirty(s);
- fn(dest, mask, base, stride, tcg_env, desc);
+ gen_ldst_stride_main_loop(s, dest, rs1, rs2, vm, nf, ld_fn, st_fn, is_load);
+
+ tcg_gen_movi_tl(cpu_vstart, 0);
+
+ /*
+ * Set the tail bytes to 1s if tail agnostic.
+ */
+ if (s->vta != 0 && is_load) {
+ gen_ldst_stride_tail_loop(s, dest, nf, st_fn);
+ }
finalize_rvv_inst(s);
return true;
@@ -898,16 +1146,6 @@ static bool ldst_stride_trans(uint32_t vd, uint32_t rs1, uint32_t rs2,
static bool ld_stride_op(DisasContext *s, arg_rnfvm *a, uint8_t eew)
{
uint32_t data = 0;
- gen_helper_ldst_stride *fn;
- static gen_helper_ldst_stride * const fns[4] = {
- gen_helper_vlse8_v, gen_helper_vlse16_v,
- gen_helper_vlse32_v, gen_helper_vlse64_v
- };
-
- fn = fns[eew];
- if (fn == NULL) {
- return false;
- }
uint8_t emul = vext_get_emul(s, eew);
data = FIELD_DP32(data, VDATA, VM, a->vm);
@@ -915,7 +1153,7 @@ static bool ld_stride_op(DisasContext *s, arg_rnfvm *a, uint8_t eew)
data = FIELD_DP32(data, VDATA, NF, a->nf);
data = FIELD_DP32(data, VDATA, VTA, s->vta);
data = FIELD_DP32(data, VDATA, VMA, s->vma);
- return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, fn, s);
+ return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, s, true);
}
static bool ld_stride_check(DisasContext *s, arg_rnfvm* a, uint8_t eew)
@@ -933,23 +1171,13 @@ GEN_VEXT_TRANS(vlse64_v, MO_64, rnfvm, ld_stride_op, ld_stride_check)
static bool st_stride_op(DisasContext *s, arg_rnfvm *a, uint8_t eew)
{
uint32_t data = 0;
- gen_helper_ldst_stride *fn;
- static gen_helper_ldst_stride * const fns[4] = {
- /* masked stride store */
- gen_helper_vsse8_v, gen_helper_vsse16_v,
- gen_helper_vsse32_v, gen_helper_vsse64_v
- };
uint8_t emul = vext_get_emul(s, eew);
data = FIELD_DP32(data, VDATA, VM, a->vm);
data = FIELD_DP32(data, VDATA, LMUL, emul);
data = FIELD_DP32(data, VDATA, NF, a->nf);
- fn = fns[eew];
- if (fn == NULL) {
- return false;
- }
- return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, fn, s);
+ return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, s, false);
}
static bool st_stride_check(DisasContext *s, arg_rnfvm* a, uint8_t eew)
@@ -1300,17 +1528,6 @@ GEN_LDST_WHOLE_TRANS(vs8r_v, int8_t, 8, false)
*** Vector Integer Arithmetic Instructions
*/
-/*
- * MAXSZ returns the maximum vector size can be operated in bytes,
- * which is used in GVEC IR when vl_eq_vlmax flag is set to true
- * to accelerate vector operation.
- */
-static inline uint32_t MAXSZ(DisasContext *s)
-{
- int max_sz = s->cfg_ptr->vlenb * 8;
- return max_sz >> (3 - s->lmul);
-}
-
static bool opivv_check(DisasContext *s, arg_rmrr *a)
{
return require_rvv(s) &&
--
2.50.1
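To make the MAXSZ()/max_elems arithmetic above concrete, a worked example
(assuming QEMU's default VLEN of 128 bits, i.e. vlenb = 16):

    /* max_sz = vlenb << 3 = 128.  MAXSZ = max_sz >> (3 - s->lmul):
     *   LMUL = 1   (s->lmul ==  0): MAXSZ = 128 >> 3 =  16 bytes
     *   LMUL = 8   (s->lmul ==  3): MAXSZ = 128 >> 0 = 128 bytes
     *   LMUL = 1/2 (s->lmul == -1): MAXSZ = 128 >> 4 =   8 bytes
     * With SEW = 32 (s->sew == 2): max_elems = MAXSZ >> 2, i.e.
     * 4 elements at m1 and 32 elements at m8. */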
* [PATCH v5 2/2] tests/tcg/riscv64: Add test for vlsseg8e32 instruction
2025-08-19 13:23 [PATCH v5 0/2] target/riscv: Generate strided vector ld/st with tcg Chao Liu
2025-08-19 13:23 ` [PATCH v5 1/2] Generate strided vector loads/stores with tcg nodes Chao Liu
@ 2025-08-19 13:23 ` Chao Liu
2025-09-03 2:21 ` [PATCH v5 0/2] target/riscv: Generate strided vector ld/st with tcg Nicholas Piggin
2 siblings, 0 replies; 5+ messages in thread
From: Chao Liu @ 2025-08-19 13:23 UTC
To: richard.henderson, paolo.savini, ebiggers, dbarboza, palmer,
alistair.francis, liwei1518, zhiwei_liu
Cc: qemu-riscv, qemu-devel, Chao Liu, Chao Liu
From: Chao Liu <chao.liu@yeah.net>
This test copies 64 bytes from a0 to a1 with vlsseg8e32 and verifies the
result.
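Concretely, with vl = 1, e32 and nf = 8, only segment 0 is active, so the
stride (t0 = 64) never applies and each vlsseg8e32.v is a straight 32-byte
copy spread across eight registers. An illustrative scalar model (not QEMU
code):

    #include <stdint.h>
    #include <string.h>

    /* "vlsseg8e32.v v0, (a0), t0" with vl = 1: */
    static void vlsseg8e32_vl1(uint32_t v_elem0[8], const uint8_t *a0)
    {
        for (int k = 0; k < 8; k++) {
            /* element 0 of register v0+k gets bytes a0[4*k .. 4*k+4) */
            memcpy(&v_elem0[k], a0 + 4 * k, 4);
        }
    }

The two loads plus the two vssseg8e32 stores therefore move 2 * 8 * 4 = 64
bytes.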
Signed-off-by: Chao Liu <chao.liu@zevorn.cn>
---
tests/tcg/riscv64/Makefile.softmmu-target | 8 +-
tests/tcg/riscv64/test-vlsseg8e32.S | 107 ++++++++++++++++++++++
2 files changed, 113 insertions(+), 2 deletions(-)
create mode 100644 tests/tcg/riscv64/test-vlsseg8e32.S
diff --git a/tests/tcg/riscv64/Makefile.softmmu-target b/tests/tcg/riscv64/Makefile.softmmu-target
index 3ca595335d..384c291554 100644
--- a/tests/tcg/riscv64/Makefile.softmmu-target
+++ b/tests/tcg/riscv64/Makefile.softmmu-target
@@ -7,14 +7,14 @@ VPATH += $(TEST_SRC)
LINK_SCRIPT = $(TEST_SRC)/semihost.ld
LDFLAGS = -T $(LINK_SCRIPT)
-CFLAGS += -g -Og
+CFLAGS += -march=rv64gcv -mabi=lp64d -g -Og
%.o: %.S
$(CC) $(CFLAGS) $< -Wa,--noexecstack -c -o $@
%: %.o $(LINK_SCRIPT)
$(LD) $(LDFLAGS) $< -o $@
-QEMU_OPTS += -M virt -display none -semihosting -device loader,file=
+QEMU_OPTS += -M virt -cpu rv64,v=true -display none -semihosting -device loader,file=
EXTRA_RUNS += run-issue1060
run-issue1060: issue1060
@@ -24,5 +24,9 @@ EXTRA_RUNS += run-test-mepc-masking
run-test-mepc-masking: test-mepc-masking
$(call run-test, $<, $(QEMU) $(QEMU_OPTS)$<)
+EXTRA_RUNS += run-vlsseg8e32
+run-vlsseg8e32: test-vlsseg8e32
+ $(call run-test, $<, $(QEMU) $(QEMU_OPTS)$<)
+
# We don't currently support the multiarch system tests
undefine MULTIARCH_TESTS
diff --git a/tests/tcg/riscv64/test-vlsseg8e32.S b/tests/tcg/riscv64/test-vlsseg8e32.S
new file mode 100644
index 0000000000..bbc79d5e8d
--- /dev/null
+++ b/tests/tcg/riscv64/test-vlsseg8e32.S
@@ -0,0 +1,107 @@
+#
+# QEMU RISC-V Vector Strided Load Instruction testcase
+#
+# Copyright (c) 2025 Chao Liu chao.liu@yeah.net
+#
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+ .option norvc
+
+ .section .data
+ .align 4
+source_data:
+ .asciz "Test the vlsseg8e32 insn: copy 64 bytes and verify correctness."
+ .equ source_len, 64
+
+ .text
+ .global _start
+_start:
+ lla t0, trap
+ csrw mtvec, t0
+
+enable_rvv:
+
+ li x15, 0x800000000024112d
+ csrw 0x301, x15
+ li x1, 0x2200
+ csrr x2, mstatus
+ or x2, x2, x1
+ csrw mstatus, x2
+
+rvv_test_func:
+ la a0, source_data
+ li a1, 0x80020000
+ vsetivli zero, 1, e32, m1, ta, ma
+ li t0, 64
+
+ vlsseg8e32.v v0, (a0), t0
+ addi a0, a0, 32
+ vlsseg8e32.v v8, (a0), t0
+
+ vssseg8e32.v v0, (a1), t0
+ addi a1, a1, 32
+ vssseg8e32.v v8, (a1), t0
+
+compare_start:
+ la a0, source_data
+ li a1, 0x80020000
+ li t0, 0
+ li t1, source_len
+
+compare_loop:
+ # when t0 >= len, compare end
+ bge t0, t1, compare_done
+
+ lb t2, 0(a0)
+ lb t3, 0(a1)
+ bne t2, t3, compare_fail
+
+ addi a0, a0, 1
+ addi a1, a1, 1
+ addi t0, t0, 1
+ j compare_loop
+
+compare_done:
+ # compare ok, return 0
+ li a0, 0
+ j _exit
+
+compare_fail:
+ # compare failed, return 2
+ li a0, 2
+ j _exit
+
+trap:
+ # When an instruction traps, compare it to the insn in memory.
+ csrr t0, mepc
+ csrr t1, mtval
+ lwu t2, 0(t0)
+ bne t1, t2, fail
+
+ # Skip the insn and continue.
+ addi t0, t0, 4
+ csrw mepc, t0
+ mret
+
+fail:
+ li a0, 1
+
+# Exit code in a0
+_exit:
+ lla a1, semiargs
+ li t0, 0x20026 # ADP_Stopped_ApplicationExit
+ sd t0, 0(a1)
+ sd a0, 8(a1)
+ li a0, 0x20 # TARGET_SYS_EXIT_EXTENDED
+
+ # Semihosting call sequence
+ .balign 16
+ slli zero, zero, 0x1f
+ ebreak
+ srai zero, zero, 0x7
+ j .
+
+ .data
+ .balign 16
+semiargs:
+ .space 16
--
2.50.1
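Assuming a build configured with the riscv64-softmmu target and a riscv64
bare-metal cross compiler, the new case should be reachable through the usual
check-tcg machinery (the per-target path below is an assumption):

    make check-tcg
    # or just this test, from the build tree:
    make -C tests/tcg/riscv64-softmmu run-vlsseg8e32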
* Re: [PATCH v5 0/2] target/riscv: Generate strided vector ld/st with tcg
2025-08-19 13:23 [PATCH v5 0/2] target/riscv: Generate strided vector ld/st with tcg Chao Liu
2025-08-19 13:23 ` [PATCH v5 1/2] Generate strided vector loads/stores with tcg nodes Chao Liu
2025-08-19 13:23 ` [PATCH v5 2/2] tests/tcg/riscv64: Add test for vlsseg8e32 instruction Chao Liu
@ 2025-09-03 2:21 ` Nicholas Piggin
2025-09-03 2:54 ` Zevorn(Chao Liu)
2 siblings, 1 reply; 5+ messages in thread
From: Nicholas Piggin @ 2025-09-03 2:21 UTC
To: Chao Liu
Cc: richard.henderson, paolo.savini, ebiggers, dbarboza, palmer,
alistair.francis, liwei1518, zhiwei_liu, qemu-riscv, qemu-devel
On Tue, Aug 19, 2025 at 09:23:38PM +0800, Chao Liu wrote:
> Hi all,
>
> In this patch (v5), I've removed the redundant call to mark_vs_dirty(s)
> within the gen_ldst_stride_main_loop() function.
>
> The reason for this change is that mark_vs_dirty(s) is already being called
> at a higher level, making the call inside gen_ldst_stride_main_loop()
> unnecessary.
Hey, nice patch. Do you have any performance numbers?
I hit a problem with this being unable to deal with restarts. You left
the existing helpers in there that can deal with vstart != 0, so I guess
you intended it to fall back, but it's not quite wired up right.
I tried adding that in and it seems to work. Also made a little
adjustment to your test case if you wouldn't mind changing that too.
I have a tcg test for interrupted vector memory operations that caught
this bug, I will submit it soon and cc you on it.
Thanks,
Nick
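In outline, the fix makes the translation-time dispatch look like this
(condensed from the patch below):

    if (!s->vstart_eq_zero) {
        /* Restart case (e.g. resuming after a page fault mid-access):
         * fall back to the out-of-line helper, which honours env->vstart
         * instead of raising an illegal instruction. */
        fn(dest, mask, base, stride, tcg_env, desc);
        return true;
    }
    /* Common case: vstart == 0, emit the inline TCG loops. */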
[PATCH] target/riscv: Fix "Generate strided vector ld/st with tcg"
If a strided vector memory access instruction has non-zero
vstart, fall back to the helper functions rather than causing
an illegal instruction trap. The vlse/vsse helpers were dead
code before this.
An implementation is permitted to raise an illegal instruction
exception if vstart is nonzero and set to a value that cannot be
produced implicitly by the implementation, but memory accesses
will generally always need to deal with page faults.
This also adjusts the tcg test Makefile change to specify the
CPU type on a per-test basis, because I have another test that
needs different CPU options, and the global QEMU_OPTS change
breaks it.
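(The mechanism is a plain make target-specific variable, as in the hunk
below:

    run-vlsseg8e32: QEMU_OPTS := -cpu rv64,v=true $(QEMU_OPTS)

so all other tests keep the generic QEMU_OPTS.)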
[ feel free to take changes or parts of the changelog and adjust
/ merge them into your patches ]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
target/riscv/insn_trans/trans_rvv.c.inc | 37 ++++++++++++++++++++---
tests/tcg/riscv64/Makefile.softmmu-target | 3 +-
2 files changed, 35 insertions(+), 5 deletions(-)
diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 5e200249ef..439ea0edcf 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -1090,11 +1090,30 @@ static void gen_ldst_stride_tail_loop(DisasContext *s, TCGv dest, uint32_t nf,
return;
}
+typedef void gen_helper_ldst_stride(TCGv_ptr, TCGv_ptr, TCGv,
+ TCGv, TCGv_env, TCGv_i32);
+
static bool ldst_stride_trans(uint32_t vd, uint32_t rs1, uint32_t rs2,
- uint32_t data, DisasContext *s, bool is_load)
+ uint32_t data, gen_helper_ldst_stride *fn,
+ DisasContext *s, bool is_load)
{
if (!s->vstart_eq_zero) {
- return false;
+ TCGv_ptr dest, mask;
+ TCGv base, stride;
+ TCGv_i32 desc;
+
+ dest = tcg_temp_new_ptr();
+ mask = tcg_temp_new_ptr();
+ base = get_gpr(s, rs1, EXT_NONE);
+ stride = get_gpr(s, rs2, EXT_NONE);
+ desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
+ s->cfg_ptr->vlenb, data));
+
+ tcg_gen_addi_ptr(dest, tcg_env, vreg_ofs(s, vd));
+ tcg_gen_addi_ptr(mask, tcg_env, vreg_ofs(s, 0));
+ mark_vs_dirty(s);
+ fn(dest, mask, base, stride, tcg_env, desc);
+ return true;
}
TCGv dest = tcg_temp_new();
@@ -1146,6 +1165,16 @@ static bool ldst_stride_trans(uint32_t vd, uint32_t rs1, uint32_t rs2,
static bool ld_stride_op(DisasContext *s, arg_rnfvm *a, uint8_t eew)
{
uint32_t data = 0;
+ gen_helper_ldst_stride *fn;
+ static gen_helper_ldst_stride *const fns[4] = {
+ gen_helper_vlse8_v, gen_helper_vlse16_v,
+ gen_helper_vlse32_v, gen_helper_vlse64_v
+ };
+
+ fn = fns[eew];
+ if (fn == NULL) {
+ return false;
+ }
uint8_t emul = vext_get_emul(s, eew);
data = FIELD_DP32(data, VDATA, VM, a->vm);
@@ -1153,7 +1182,7 @@ static bool ld_stride_op(DisasContext *s, arg_rnfvm *a, uint8_t eew)
data = FIELD_DP32(data, VDATA, NF, a->nf);
data = FIELD_DP32(data, VDATA, VTA, s->vta);
data = FIELD_DP32(data, VDATA, VMA, s->vma);
- return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, s, true);
+ return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, fn, s, true);
}
static bool ld_stride_check(DisasContext *s, arg_rnfvm* a, uint8_t eew)
@@ -1177,7 +1206,7 @@ static bool st_stride_op(DisasContext *s, arg_rnfvm *a, uint8_t eew)
data = FIELD_DP32(data, VDATA, LMUL, emul);
data = FIELD_DP32(data, VDATA, NF, a->nf);
- return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, s, false);
+ return ldst_stride_trans(a->rd, a->rs1, a->rs2, data, NULL, s, false);
}
static bool st_stride_check(DisasContext *s, arg_rnfvm* a, uint8_t eew)
diff --git a/tests/tcg/riscv64/Makefile.softmmu-target b/tests/tcg/riscv64/Makefile.softmmu-target
index f09f1a57c4..d9f067dbd4 100644
--- a/tests/tcg/riscv64/Makefile.softmmu-target
+++ b/tests/tcg/riscv64/Makefile.softmmu-target
@@ -14,7 +14,7 @@ CFLAGS += -march=rv64gcv -mabi=lp64d -g -Og
%: %.o $(LINK_SCRIPT)
$(LD) $(LDFLAGS) $< -o $@
-QEMU_OPTS += -M virt -cpu rv64,v=true -display none -semihosting -device loader,file=
+QEMU_OPTS += -M virt -display none -semihosting -device loader,file=
EXTRA_RUNS += run-issue1060
run-issue1060: issue1060
@@ -30,6 +30,7 @@ run-misa-ialign: misa-ialign
$(call run-test, $<, $(QEMU) $(QEMU_OPTS)$<)
EXTRA_RUNS += run-vlsseg8e32
+run-vlsseg8e32: QEMU_OPTS := -cpu rv64,v=true $(QEMU_OPTS)
run-vlsseg8e32: test-vlsseg8e32
$(call run-test, $<, $(QEMU) $(QEMU_OPTS)$<)
--
2.51.0