[RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions
@ 2024-06-13 17:51 Max Chou
  2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Max Chou @ 2024-06-13 17:51 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

Hi,

Sorry for the quick update the version, this version fixes the
cross-page probe checking bug that I forgot to apply to the v3 version.

This RFC patch set tries to fix the issue of
https://gitlab.com/qemu-project/qemu/-/issues/2137.

In this RFC, we added patches that
1. Provide a fast path to direct access the host ram for some vector
   load/store instructions (e.g. unmasked vector unit-stride load/store
   instructions) and perform virtual address resolution once for the
   entire vector at beginning of helper function. (Thanks for Richard
   Henderson's suggestion)
2. Replace the group elements load/store TCG ops by the group element
   load/store flow in helper functions with some assumption (e.g. no
   masking, continuous memory load/store, the endian of host and guest
   architecture are the same). (Thanks for Richard Henderson's
   suggestion)
3. Try inline the vector load/store related functions that corresponding
   most of the execution time.

This version can improve the performance of the test case provided in
https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
- QEMU user mode (vlen=512): from ~51.8 sec. to ~4.5 sec.
- QEMU system mode (vlen=512): from ~125.6 sec to ~6.6 sec.

This RFC is tested with SPEC CPU2006 with test input.
We will try to measure the performance on SPEC CPU2006 benchmarks.

Series based on riscv-to-apply.next branch (commit d82f37f).

Changes from v3:
- patch 2
    - Modify vext_cont_ldst_pages for corss-page checking
- patch 3
    - Modify vext_ldst_whole for vext_cont_ldst_pages

Previous version:
- v1: https://lore.kernel.org/all/20240215192823.729209-1-max.chou@sifive.com/
- v2: https://lore.kernel.org/all/20240531174504.281461-1-max.chou@sifive.com/
- v3: https://lore.kernel.org/all/20240613141906.1276105-1-max.chou@sifive.com/

Max Chou (5):
  accel/tcg: Avoid unnecessary call overhead from
    qemu_plugin_vcpu_mem_cb
  target/riscv: rvv: Provide a fast path using direct access to host ram
    for unmasked unit-stride load/store
  target/riscv: rvv: Provide a fast path using direct access to host ram
    for unit-stride whole register load/store
  target/riscv: rvv: Provide group continuous ld/st flow for unit-stride
    ld/st instructions
  target/riscv: Inline unit-stride ld/st and corresponding functions for
    performance

 accel/tcg/ldst_common.c.inc             |   8 +-
 target/riscv/insn_trans/trans_rvv.c.inc |   3 +
 target/riscv/vector_helper.c            | 854 +++++++++++++++++++-----
 target/riscv/vector_internals.h         |  48 ++
 4 files changed, 745 insertions(+), 168 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-06-13 17:51 [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
@ 2024-06-13 17:51 ` Max Chou
  2024-06-20  2:48   ` Richard Henderson
                     ` (2 more replies)
  2024-06-13 17:51 ` [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
                   ` (3 subsequent siblings)
  4 siblings, 3 replies; 15+ messages in thread
From: Max Chou @ 2024-06-13 17:51 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

If there are not any QEMU plugin memory callback functions, checking
before calling the qemu_plugin_vcpu_mem_cb function can reduce the
function call overhead.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/ldst_common.c.inc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
index c82048e377e..87ceb954873 100644
--- a/accel/tcg/ldst_common.c.inc
+++ b/accel/tcg/ldst_common.c.inc
@@ -125,7 +125,9 @@ void helper_st_i128(CPUArchState *env, uint64_t addr, Int128 val, MemOpIdx oi)
 
 static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 {
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    }
 }
 
 uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
@@ -188,7 +190,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr,
 
 static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 {
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+    }
 }
 
 void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store
  2024-06-13 17:51 [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
  2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
@ 2024-06-13 17:51 ` Max Chou
  2024-06-20  4:29   ` Richard Henderson
  2024-06-20  4:41   ` Richard Henderson
  2024-06-13 17:51 ` [RFC PATCH v4 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store Max Chou
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Max Chou @ 2024-06-13 17:51 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

This commit references the sve_ldN_r/sve_stN_r helper functions in ARM
target to optimize the vector unmasked unit-stride load/store
instructions by following items:

* Get the loose bound of activate elements
* Probing pages/resolving host memory address/handling watchpoint at beginning
* Provide new interface to direct access host memory

The original element load/store interface is replaced by the new element
load/store functions with _tlb & _host postfix that means doing the
element load/store through the original softmmu flow and the direct
access host memory flow.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/insn_trans/trans_rvv.c.inc |   3 +
 target/riscv/vector_helper.c            | 637 +++++++++++++++++++-----
 target/riscv/vector_internals.h         |  48 ++
 3 files changed, 551 insertions(+), 137 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 3a3896ba06c..14e10568bd7 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -770,6 +770,7 @@ static bool ld_us_mask_op(DisasContext *s, arg_vlm_v *a, uint8_t eew)
     /* Mask destination register are always tail-agnostic */
     data = FIELD_DP32(data, VDATA, VTA, s->cfg_vta_all_1s);
     data = FIELD_DP32(data, VDATA, VMA, s->vma);
+    data = FIELD_DP32(data, VDATA, VM, 1);
     return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
 }
 
@@ -787,6 +788,7 @@ static bool st_us_mask_op(DisasContext *s, arg_vsm_v *a, uint8_t eew)
     /* EMUL = 1, NFIELDS = 1 */
     data = FIELD_DP32(data, VDATA, LMUL, 0);
     data = FIELD_DP32(data, VDATA, NF, 1);
+    data = FIELD_DP32(data, VDATA, VM, 1);
     return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
 }
 
@@ -1106,6 +1108,7 @@ static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
     TCGv_i32 desc;
 
     uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
+    data = FIELD_DP32(data, VDATA, VM, 1);
     dest = tcg_temp_new_ptr();
     desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
                                       s->cfg_ptr->vlenb, data));
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 1b4d5a8e378..3d284138fb3 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -29,6 +29,7 @@
 #include "tcg/tcg-gvec-desc.h"
 #include "internals.h"
 #include "vector_internals.h"
+#include "hw/core/tcg-cpu-ops.h"
 #include <math.h>
 
 target_ulong HELPER(vsetvl)(CPURISCVState *env, target_ulong s1,
@@ -136,6 +137,270 @@ static void probe_pages(CPURISCVState *env, target_ulong addr,
     }
 }
 
+/*
+ * Find first active element on each page, and a loose bound for the
+ * final element on each page.  Identify any single element that spans
+ * the page boundary. Return true if there are any active elements.
+ */
+static bool vext_cont_ldst_elements(RVVContLdSt *info, target_ulong addr,
+                                    void *v0, uint32_t vstart, uint32_t evl,
+                                    uint32_t desc, uint32_t log2_esz,
+                                    bool is_us_whole)
+{
+    uint32_t vm = vext_vm(desc);
+    uint32_t nf = vext_nf(desc);
+    uint32_t max_elems = vext_max_elems(desc, log2_esz);
+    uint32_t esz = 1 << log2_esz;
+    uint32_t msize = is_us_whole ? esz : nf * esz;
+    int32_t reg_idx_first = -1, reg_idx_last = -1, reg_idx_split;
+    int32_t mem_off_last, mem_off_split;
+    int32_t page_split, elt_split;
+    int32_t i;
+
+    /* Set all of the element indices to -1, and the TLB data to 0. */
+    memset(info, -1, offsetof(RVVContLdSt, page));
+    memset(info->page, 0, sizeof(info->page));
+
+    /* Gross scan over the mask register v0 to find bounds. */
+    if (vm == 0) {
+        for (i = vstart; i < evl; ++i) {
+            if (vext_elem_mask(v0, i)) {
+                reg_idx_last = i;
+                if (reg_idx_first < 0) {
+                    reg_idx_first = i;
+                }
+            }
+        }
+    } else {
+        reg_idx_first = vstart;
+        reg_idx_last = evl - 1;
+    }
+
+    if (unlikely(reg_idx_first < 0)) {
+        /* No active elements, no pages touched. */
+        return false;
+    }
+    tcg_debug_assert(reg_idx_last >= 0 && reg_idx_last < max_elems);
+
+    info->reg_idx_first[0] = reg_idx_first;
+    info->mem_off_first[0] = reg_idx_first * msize;
+    mem_off_last = reg_idx_last * msize;
+
+    page_split = -(addr | TARGET_PAGE_MASK);
+    if (likely(mem_off_last + msize <= page_split)) {
+        /* The entire operation fits within a single page. */
+        info->reg_idx_last[0] = reg_idx_last;
+        return true;
+    }
+
+    info->page_split = page_split;
+    elt_split = page_split / msize;
+    reg_idx_split = elt_split;
+    mem_off_split = elt_split * msize;
+
+    /*
+     * This is the last full element on the first page, but it is not
+     * necessarily active.  If there is no full element, i.e. the first
+     * active element is the one that's split, this value remains -1.
+     * It is useful as iteration bounds.
+     */
+    if (elt_split != 0) {
+        info->reg_idx_last[0] = reg_idx_split - 1;
+    }
+
+    /* Determine if an unaligned element spans the pages.  */
+    if (page_split % msize != 0) {
+        /* It is helpful to know if the split element is active. */
+        if (vm == 1 || (vm == 0 && vext_elem_mask(v0, reg_idx_split))) {
+            info->reg_idx_split = reg_idx_split;
+            info->mem_off_split = mem_off_split;
+
+            if (reg_idx_split == reg_idx_last) {
+                /* The page crossing element is last. */
+                return true;
+            }
+        }
+        reg_idx_split++;
+        mem_off_split += msize;
+    }
+
+    /*
+     * We do want the first active element on the second page, because
+     * this may affect the address reported in an exception.
+     */
+    if (vm == 0) {
+        for (; reg_idx_split < evl; ++reg_idx_split) {
+            if (vext_elem_mask(v0, reg_idx_split)) {
+                break;
+            }
+        }
+    }
+    tcg_debug_assert(reg_idx_split <= reg_idx_last);
+    info->reg_idx_first[1] = reg_idx_split;
+    info->mem_off_first[1] = reg_idx_split * msize;
+    info->reg_idx_last[1] = reg_idx_last;
+    return true;
+}
+
+/*
+ * Resolve the guest virtual address to info->host and info->flags.
+ * If @nofault, return false if the page is invalid, otherwise
+ * exit via page fault exception.
+ */
+static bool vext_probe_page(CPURISCVState *env, RVVHostPage *info,
+                            bool nofault, target_ulong addr, int mem_off,
+                            int size, MMUAccessType access_type, int mmu_idx,
+                            uintptr_t ra)
+{
+    int flags;
+
+    addr += mem_off;
+
+#ifdef CONFIG_USER_ONLY
+    flags = probe_access_flags(env, adjust_addr(env, addr), size, access_type,
+                               mmu_idx, nofault, &info->host, ra);
+#else
+    CPUTLBEntryFull *full;
+    flags = probe_access_full(env, adjust_addr(env, addr), size, access_type,
+                              mmu_idx, nofault, &info->host, &full, ra);
+#endif
+    info->flags = flags;
+
+    if (flags & TLB_INVALID_MASK) {
+        g_assert(nofault);
+        return false;
+    }
+
+#ifdef CONFIG_USER_ONLY
+    memset(&info->attrs, 0, sizeof(info->attrs));
+#else
+    info->attrs = full->attrs;
+#endif
+
+    /* Ensure that info->host[] is relative to addr, not addr + mem_off. */
+    info->host -= mem_off;
+    return true;
+}
+
+/*
+ * Resolve the guest virtual addresses to info->page[].
+ * Control the generation of page faults with @fault.  Return false if
+ * there is no work to do, which can only happen with @fault == FAULT_NO.
+ */
+static bool vext_cont_ldst_pages(CPURISCVState *env, RVVContLdSt *info,
+                                 target_ulong addr, bool is_load,
+                                 uint32_t desc, uint32_t esz, uintptr_t ra,
+                                 bool is_us_whole)
+{
+    uint32_t vm = vext_vm(desc);
+    uint32_t nf = vext_nf(desc);
+    bool nofault = (vm == 1 ? false : true);
+    int mmu_index = riscv_env_mmu_index(env, false);
+    int mem_off = info->mem_off_first[0];
+    int elem_size = is_us_whole ? esz : (nf * esz);
+    int size, last_idx;
+    MMUAccessType access_type = is_load ? MMU_DATA_LOAD : MMU_DATA_STORE;
+    bool have_work;
+
+    size = (info->reg_idx_last[0] - info->reg_idx_first[0] + 1) * elem_size;
+
+    have_work = vext_probe_page(env, &info->page[0], nofault, addr, mem_off,
+                                size, access_type, mmu_index, ra);
+    if (!have_work) {
+        /* No work to be done. */
+        return false;
+    }
+
+    if (likely(info->page_split < 0)) {
+        /* The entire operation was on the one page. */
+        return true;
+    }
+
+    /*
+     * If the second page is invalid, then we want the fault address to be
+     * the first byte on that page which is accessed.
+     */
+    if (info->mem_off_split >= 0) {
+        /*
+         * There is an element split across the pages.  The fault address
+         * should be the first byte of the second page.
+         */
+        mem_off = info->page_split;
+        last_idx = info->reg_idx_split + 1;
+    } else {
+        /*
+         * There is no element split across the pages.  The fault address
+         * should be the first active element on the second page.
+         */
+        mem_off = info->mem_off_first[1];
+        last_idx = info->reg_idx_last[1];
+    }
+    size = last_idx * elem_size - mem_off + esz;
+    have_work |= vext_probe_page(env, &info->page[1], nofault, addr, mem_off,
+                                 size, access_type, mmu_index, ra);
+    return have_work;
+}
+
+#ifndef CONFIG_USER_ONLY
+void vext_cont_ldst_watchpoints(CPURISCVState *env, RVVContLdSt *info,
+                                uint64_t *v0, target_ulong addr,
+                                uint32_t esz, bool is_load, uintptr_t ra,
+                                uint32_t desc)
+{
+    int32_t i;
+    intptr_t mem_off, reg_off, reg_last;
+    uint32_t vm = vext_vm(desc);
+    int wp_access = is_load == true ? BP_MEM_READ : BP_MEM_WRITE;
+    int flags0 = info->page[0].flags;
+    int flags1 = info->page[1].flags;
+
+    if (likely(!((flags0 | flags1) & TLB_WATCHPOINT))) {
+        return;
+    }
+
+    /* Indicate that watchpoints are handled. */
+    info->page[0].flags = flags0 & ~TLB_WATCHPOINT;
+    info->page[1].flags = flags1 & ~TLB_WATCHPOINT;
+
+    if (flags0 & TLB_WATCHPOINT) {
+        mem_off = info->mem_off_first[0];
+        reg_off = info->reg_idx_first[0];
+        reg_last = info->reg_idx_last[0];
+
+        for (i = reg_off; i < reg_last; ++i, mem_off += esz) {
+            if (vm == 1 || (vm == 0 && vext_elem_mask(v0, i))) {
+                cpu_check_watchpoint(env_cpu(env),
+                                     adjust_addr(env, addr + mem_off), esz,
+                                     info->page[0].attrs, wp_access, ra);
+            }
+        }
+    }
+
+    mem_off = info->mem_off_split;
+    if (mem_off >= 0) {
+        if (vm == 1 || (vm == 0 && vext_elem_mask(v0, mem_off / esz))) {
+            cpu_check_watchpoint(env_cpu(env),
+                                 adjust_addr(env, addr + mem_off), esz,
+                                 info->page[0].attrs, wp_access, ra);
+        }
+    }
+
+    mem_off = info->mem_off_first[1];
+    if ((flags1 & TLB_WATCHPOINT) && mem_off >= 0) {
+        reg_off = info->reg_idx_first[1];
+        reg_last = info->reg_idx_last[1];
+
+        for (i = reg_off; i < reg_last; ++i, mem_off += esz) {
+            if (vm == 1 || (vm == 0 && vext_elem_mask(v0, i))) {
+                cpu_check_watchpoint(env_cpu(env),
+                                     adjust_addr(env, addr + mem_off), esz,
+                                     info->page[1].attrs, wp_access, ra);
+            }
+        }
+    }
+}
+#endif
+
 static inline void vext_set_elem_mask(void *v0, int index,
                                       uint8_t value)
 {
@@ -146,34 +411,51 @@ static inline void vext_set_elem_mask(void *v0, int index,
 }
 
 /* elements operations for load and store */
-typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr,
-                               uint32_t idx, void *vd, uintptr_t retaddr);
+typedef void vext_ldst_elem_fn_tlb(CPURISCVState *env, abi_ptr addr,
+                                   uint32_t idx, void *vd, uintptr_t retaddr);
+typedef void vext_ldst_elem_fn_host(void *vd, uint32_t idx, void *host);
 
-#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)            \
-static void NAME(CPURISCVState *env, abi_ptr addr,         \
-                 uint32_t idx, void *vd, uintptr_t retaddr)\
-{                                                          \
-    ETYPE *cur = ((ETYPE *)vd + H(idx));                   \
-    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);      \
-}                                                          \
-
-GEN_VEXT_LD_ELEM(lde_b, int8_t,  H1, ldsb)
-GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw)
-GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl)
-GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq)
-
-#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)            \
-static void NAME(CPURISCVState *env, abi_ptr addr,         \
-                 uint32_t idx, void *vd, uintptr_t retaddr)\
-{                                                          \
-    ETYPE data = *((ETYPE *)vd + H(idx));                  \
-    cpu_##STSUF##_data_ra(env, addr, data, retaddr);       \
+#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                         \
+static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
+                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
+{                                                                       \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
+    ETYPE *cur = ((ETYPE *)reg);                                        \
+    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);                   \
+}                                                                       \
+                                                                        \
+static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
+{                                                                       \
+    ETYPE val = LDSUF##_p(host);                                        \
+    uint8_t *reg = (uint8_t *)(vd + byte_off);                          \
+    *(ETYPE *)(reg) = val;                                              \
+}
+
+GEN_VEXT_LD_ELEM(lde_b, uint8_t,  H1, ldub)
+GEN_VEXT_LD_ELEM(lde_h, uint16_t, H2, lduw)
+GEN_VEXT_LD_ELEM(lde_w, uint32_t, H4, ldl)
+GEN_VEXT_LD_ELEM(lde_d, uint64_t, H8, ldq)
+
+#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)                         \
+static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
+                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
+{                                                                       \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
+    ETYPE data = *((ETYPE *)reg);                                       \
+    cpu_##STSUF##_data_ra(env, addr, data, retaddr);                    \
+}                                                                       \
+                                                                        \
+static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
+{                                                                       \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
+    ETYPE val = *(ETYPE *)(reg);                                        \
+    STSUF##_p(host, val);                                               \
 }
 
-GEN_VEXT_ST_ELEM(ste_b, int8_t,  H1, stb)
-GEN_VEXT_ST_ELEM(ste_h, int16_t, H2, stw)
-GEN_VEXT_ST_ELEM(ste_w, int32_t, H4, stl)
-GEN_VEXT_ST_ELEM(ste_d, int64_t, H8, stq)
+GEN_VEXT_ST_ELEM(ste_b, uint8_t,  H1, stb)
+GEN_VEXT_ST_ELEM(ste_h, uint16_t, H2, stw)
+GEN_VEXT_ST_ELEM(ste_w, uint32_t, H4, stl)
+GEN_VEXT_ST_ELEM(ste_d, uint64_t, H8, stq)
 
 static void vext_set_tail_elems_1s(target_ulong vl, void *vd,
                                    uint32_t desc, uint32_t nf,
@@ -199,7 +481,7 @@ static void
 vext_ldst_stride(void *vd, void *v0, target_ulong base,
                  target_ulong stride, CPURISCVState *env,
                  uint32_t desc, uint32_t vm,
-                 vext_ldst_elem_fn *ldst_elem,
+                 vext_ldst_elem_fn_tlb *ldst_elem,
                  uint32_t log2_esz, uintptr_t ra)
 {
     uint32_t i, k;
@@ -221,7 +503,8 @@ vext_ldst_stride(void *vd, void *v0, target_ulong base,
                 continue;
             }
             target_ulong addr = base + stride * i + (k << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
             k++;
         }
     }
@@ -240,10 +523,10 @@ void HELPER(NAME)(void *vd, void * v0, target_ulong base,               \
                      ctzl(sizeof(ETYPE)), GETPC());                     \
 }
 
-GEN_VEXT_LD_STRIDE(vlse8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_STRIDE(vlse16_v, int16_t, lde_h)
-GEN_VEXT_LD_STRIDE(vlse32_v, int32_t, lde_w)
-GEN_VEXT_LD_STRIDE(vlse64_v, int64_t, lde_d)
+GEN_VEXT_LD_STRIDE(vlse8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_STRIDE(vlse16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_STRIDE(vlse32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_STRIDE(vlse64_v, int64_t, lde_d_tlb)
 
 #define GEN_VEXT_ST_STRIDE(NAME, ETYPE, STORE_FN)                       \
 void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
@@ -255,10 +538,10 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
                      ctzl(sizeof(ETYPE)), GETPC());                     \
 }
 
-GEN_VEXT_ST_STRIDE(vsse8_v,  int8_t,  ste_b)
-GEN_VEXT_ST_STRIDE(vsse16_v, int16_t, ste_h)
-GEN_VEXT_ST_STRIDE(vsse32_v, int32_t, ste_w)
-GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
+GEN_VEXT_ST_STRIDE(vsse8_v,  int8_t,  ste_b_tlb)
+GEN_VEXT_ST_STRIDE(vsse16_v, int16_t, ste_h_tlb)
+GEN_VEXT_ST_STRIDE(vsse32_v, int32_t, ste_w_tlb)
+GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d_tlb)
 
 /*
  * unit-stride: access elements stored contiguously in memory
@@ -267,9 +550,14 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
 /* unmasked unit-stride load and store operation */
 static void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
-             vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
-             uintptr_t ra)
+             vext_ldst_elem_fn_tlb *ldst_tlb,
+             vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
+             uint32_t evl, uintptr_t ra, bool is_load)
 {
+    RVVContLdSt info;
+    void *host;
+    int flags;
+    intptr_t reg_start, reg_last;
     uint32_t i, k;
     uint32_t nf = vext_nf(desc);
     uint32_t max_elems = vext_max_elems(desc, log2_esz);
@@ -277,17 +565,88 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
 
     VSTART_CHECK_EARLY_EXIT(env);
 
-    /* load bytes from guest memory */
-    for (i = env->vstart; i < evl; env->vstart = ++i) {
+    vext_cont_ldst_elements(&info, base, env->vreg, env->vstart, evl, desc,
+                            log2_esz, false);
+    /* Probe the page(s).  Exit with exception for any invalid page. */
+    vext_cont_ldst_pages(env, &info, base, is_load, desc, esz, ra, false);
+    /* Handle watchpoints for all active elements. */
+    vext_cont_ldst_watchpoints(env, &info, env->vreg, base, esz, is_load, ra,
+                               desc);
+
+    /* Load bytes from guest memory */
+    flags = info.page[0].flags | info.page[1].flags;
+    if (unlikely(flags != 0)) {
+        /* At least one page includes MMIO. */
+        reg_start = info.reg_idx_first[0];
+        reg_last = info.reg_idx_last[1];
+        if (reg_last < 0) {
+            reg_last = info.reg_idx_split;
+            if (reg_last < 0) {
+                reg_last = info.reg_idx_last[0];
+            }
+        }
+        reg_last += 1;
+
+        for (i = reg_start; i < reg_last; ++i) {
+            k = 0;
+            while (k < nf) {
+                target_ulong addr = base + ((i * nf + k) << log2_esz);
+                ldst_tlb(env, adjust_addr(env, addr),
+                         (i + k * max_elems) << log2_esz, vd, ra);
+                k++;
+            }
+        }
+
+        env->vstart = 0;
+        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
+        return;
+    }
+
+    /* The entire operation is in RAM, on valid pages. */
+    reg_start = info.reg_idx_first[0];
+    reg_last = info.reg_idx_last[0] + 1;
+    host = info.page[0].host;
+
+    for (i = reg_start; i < reg_last; ++i) {
         k = 0;
         while (k < nf) {
-            target_ulong addr = base + ((i * nf + k) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_host(vd, (i + k * max_elems) << log2_esz,
+                      host + ((i * nf + k) << log2_esz));
             k++;
         }
     }
-    env->vstart = 0;
 
+    /*
+     * Use the slow path to manage the cross-page misalignment.
+     * But we know this is RAM and cannot trap.
+     */
+    if (unlikely(info.mem_off_split >= 0)) {
+        reg_start = info.reg_idx_split;
+        k = 0;
+        while (k < nf) {
+            target_ulong addr = base + ((reg_start * nf + k) << log2_esz);
+            ldst_tlb(env, adjust_addr(env, addr),
+                     (reg_start + k * max_elems) << log2_esz, vd, ra);
+            k++;
+        }
+    }
+
+    if (unlikely(info.mem_off_first[1] >= 0)) {
+        reg_start = info.reg_idx_first[1];
+        reg_last = info.reg_idx_last[1] + 1;
+        host = info.page[1].host;
+
+        for (i = reg_start; i < reg_last; ++i) {
+            k = 0;
+            while (k < nf) {
+                ldst_host(vd, (i + k * max_elems) << log2_esz,
+                          host + ((i * nf + k) << log2_esz));
+                k++;
+            }
+        }
+    }
+
+    env->vstart = 0;
     vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
 }
 
@@ -296,47 +655,47 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
  * stride, stride = NF * sizeof (ETYPE)
  */
 
-#define GEN_VEXT_LD_US(NAME, ETYPE, LOAD_FN)                            \
-void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,         \
-                         CPURISCVState *env, uint32_t desc)             \
-{                                                                       \
-    uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));             \
-    vext_ldst_stride(vd, v0, base, stride, env, desc, false, LOAD_FN,   \
-                     ctzl(sizeof(ETYPE)), GETPC());                     \
-}                                                                       \
-                                                                        \
-void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
-                  CPURISCVState *env, uint32_t desc)                    \
-{                                                                       \
-    vext_ldst_us(vd, base, env, desc, LOAD_FN,                          \
-                 ctzl(sizeof(ETYPE)), env->vl, GETPC());                \
+#define GEN_VEXT_LD_US(NAME, ETYPE, LOAD_FN_TLB, LOAD_FN_HOST)      \
+void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,     \
+                         CPURISCVState *env, uint32_t desc)         \
+{                                                                   \
+    uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));         \
+    vext_ldst_stride(vd, v0, base, stride, env, desc, false,        \
+                     LOAD_FN_TLB, ctzl(sizeof(ETYPE)), GETPC());    \
+}                                                                   \
+                                                                    \
+void HELPER(NAME)(void *vd, void *v0, target_ulong base,            \
+                  CPURISCVState *env, uint32_t desc)                \
+{                                                                   \
+    vext_ldst_us(vd, base, env, desc, LOAD_FN_TLB, LOAD_FN_HOST,    \
+                 ctzl(sizeof(ETYPE)), env->vl, GETPC(), true);      \
 }
 
-GEN_VEXT_LD_US(vle8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_US(vle16_v, int16_t, lde_h)
-GEN_VEXT_LD_US(vle32_v, int32_t, lde_w)
-GEN_VEXT_LD_US(vle64_v, int64_t, lde_d)
+GEN_VEXT_LD_US(vle8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_US(vle16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_US(vle32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_US(vle64_v, int64_t, lde_d_tlb, lde_d_host)
 
-#define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN)                            \
+#define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN_TLB, STORE_FN_HOST)         \
 void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,          \
                          CPURISCVState *env, uint32_t desc)              \
 {                                                                        \
     uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));              \
-    vext_ldst_stride(vd, v0, base, stride, env, desc, false, STORE_FN,   \
-                     ctzl(sizeof(ETYPE)), GETPC());                      \
+    vext_ldst_stride(vd, v0, base, stride, env, desc, false,             \
+                     STORE_FN_TLB, ctzl(sizeof(ETYPE)), GETPC());        \
 }                                                                        \
                                                                          \
 void HELPER(NAME)(void *vd, void *v0, target_ulong base,                 \
                   CPURISCVState *env, uint32_t desc)                     \
 {                                                                        \
-    vext_ldst_us(vd, base, env, desc, STORE_FN,                          \
-                 ctzl(sizeof(ETYPE)), env->vl, GETPC());                 \
+    vext_ldst_us(vd, base, env, desc, STORE_FN_TLB, STORE_FN_HOST,       \
+                 ctzl(sizeof(ETYPE)), env->vl, GETPC(), false);          \
 }
 
-GEN_VEXT_ST_US(vse8_v,  int8_t,  ste_b)
-GEN_VEXT_ST_US(vse16_v, int16_t, ste_h)
-GEN_VEXT_ST_US(vse32_v, int32_t, ste_w)
-GEN_VEXT_ST_US(vse64_v, int64_t, ste_d)
+GEN_VEXT_ST_US(vse8_v,  int8_t,  ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_US(vse16_v, int16_t, ste_h_tlb, ste_h_host)
+GEN_VEXT_ST_US(vse32_v, int32_t, ste_w_tlb, ste_w_host)
+GEN_VEXT_ST_US(vse64_v, int64_t, ste_d_tlb, ste_d_host)
 
 /*
  * unit stride mask load and store, EEW = 1
@@ -346,8 +705,8 @@ void HELPER(vlm_v)(void *vd, void *v0, target_ulong base,
 {
     /* evl = ceil(vl/8) */
     uint8_t evl = (env->vl + 7) >> 3;
-    vext_ldst_us(vd, base, env, desc, lde_b,
-                 0, evl, GETPC());
+    vext_ldst_us(vd, base, env, desc, lde_b_tlb, lde_b_host,
+                 0, evl, GETPC(), true);
 }
 
 void HELPER(vsm_v)(void *vd, void *v0, target_ulong base,
@@ -355,8 +714,8 @@ void HELPER(vsm_v)(void *vd, void *v0, target_ulong base,
 {
     /* evl = ceil(vl/8) */
     uint8_t evl = (env->vl + 7) >> 3;
-    vext_ldst_us(vd, base, env, desc, ste_b,
-                 0, evl, GETPC());
+    vext_ldst_us(vd, base, env, desc, ste_b_tlb, ste_b_host,
+                 0, evl, GETPC(), false);
 }
 
 /*
@@ -381,7 +740,7 @@ static inline void
 vext_ldst_index(void *vd, void *v0, target_ulong base,
                 void *vs2, CPURISCVState *env, uint32_t desc,
                 vext_get_index_addr get_index_addr,
-                vext_ldst_elem_fn *ldst_elem,
+                vext_ldst_elem_fn_tlb *ldst_elem,
                 uint32_t log2_esz, uintptr_t ra)
 {
     uint32_t i, k;
@@ -405,7 +764,8 @@ vext_ldst_index(void *vd, void *v0, target_ulong base,
                 continue;
             }
             abi_ptr addr = get_index_addr(base, i, vs2) + (k << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
             k++;
         }
     }
@@ -422,22 +782,22 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                   \
                     LOAD_FN, ctzl(sizeof(ETYPE)), GETPC());                \
 }
 
-GEN_VEXT_LD_INDEX(vlxei8_8_v,   int8_t,  idx_b, lde_b)
-GEN_VEXT_LD_INDEX(vlxei8_16_v,  int16_t, idx_b, lde_h)
-GEN_VEXT_LD_INDEX(vlxei8_32_v,  int32_t, idx_b, lde_w)
-GEN_VEXT_LD_INDEX(vlxei8_64_v,  int64_t, idx_b, lde_d)
-GEN_VEXT_LD_INDEX(vlxei16_8_v,  int8_t,  idx_h, lde_b)
-GEN_VEXT_LD_INDEX(vlxei16_16_v, int16_t, idx_h, lde_h)
-GEN_VEXT_LD_INDEX(vlxei16_32_v, int32_t, idx_h, lde_w)
-GEN_VEXT_LD_INDEX(vlxei16_64_v, int64_t, idx_h, lde_d)
-GEN_VEXT_LD_INDEX(vlxei32_8_v,  int8_t,  idx_w, lde_b)
-GEN_VEXT_LD_INDEX(vlxei32_16_v, int16_t, idx_w, lde_h)
-GEN_VEXT_LD_INDEX(vlxei32_32_v, int32_t, idx_w, lde_w)
-GEN_VEXT_LD_INDEX(vlxei32_64_v, int64_t, idx_w, lde_d)
-GEN_VEXT_LD_INDEX(vlxei64_8_v,  int8_t,  idx_d, lde_b)
-GEN_VEXT_LD_INDEX(vlxei64_16_v, int16_t, idx_d, lde_h)
-GEN_VEXT_LD_INDEX(vlxei64_32_v, int32_t, idx_d, lde_w)
-GEN_VEXT_LD_INDEX(vlxei64_64_v, int64_t, idx_d, lde_d)
+GEN_VEXT_LD_INDEX(vlxei8_8_v,   int8_t,  idx_b, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei8_16_v,  int16_t, idx_b, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei8_32_v,  int32_t, idx_b, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei8_64_v,  int64_t, idx_b, lde_d_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_8_v,  int8_t,  idx_h, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_16_v, int16_t, idx_h, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_32_v, int32_t, idx_h, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_64_v, int64_t, idx_h, lde_d_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_8_v,  int8_t,  idx_w, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_16_v, int16_t, idx_w, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_32_v, int32_t, idx_w, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_64_v, int64_t, idx_w, lde_d_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_8_v,  int8_t,  idx_d, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_16_v, int16_t, idx_d, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_32_v, int32_t, idx_d, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_64_v, int64_t, idx_d, lde_d_tlb)
 
 #define GEN_VEXT_ST_INDEX(NAME, ETYPE, INDEX_FN, STORE_FN)       \
 void HELPER(NAME)(void *vd, void *v0, target_ulong base,         \
@@ -448,22 +808,22 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,         \
                     GETPC());                                    \
 }
 
-GEN_VEXT_ST_INDEX(vsxei8_8_v,   int8_t,  idx_b, ste_b)
-GEN_VEXT_ST_INDEX(vsxei8_16_v,  int16_t, idx_b, ste_h)
-GEN_VEXT_ST_INDEX(vsxei8_32_v,  int32_t, idx_b, ste_w)
-GEN_VEXT_ST_INDEX(vsxei8_64_v,  int64_t, idx_b, ste_d)
-GEN_VEXT_ST_INDEX(vsxei16_8_v,  int8_t,  idx_h, ste_b)
-GEN_VEXT_ST_INDEX(vsxei16_16_v, int16_t, idx_h, ste_h)
-GEN_VEXT_ST_INDEX(vsxei16_32_v, int32_t, idx_h, ste_w)
-GEN_VEXT_ST_INDEX(vsxei16_64_v, int64_t, idx_h, ste_d)
-GEN_VEXT_ST_INDEX(vsxei32_8_v,  int8_t,  idx_w, ste_b)
-GEN_VEXT_ST_INDEX(vsxei32_16_v, int16_t, idx_w, ste_h)
-GEN_VEXT_ST_INDEX(vsxei32_32_v, int32_t, idx_w, ste_w)
-GEN_VEXT_ST_INDEX(vsxei32_64_v, int64_t, idx_w, ste_d)
-GEN_VEXT_ST_INDEX(vsxei64_8_v,  int8_t,  idx_d, ste_b)
-GEN_VEXT_ST_INDEX(vsxei64_16_v, int16_t, idx_d, ste_h)
-GEN_VEXT_ST_INDEX(vsxei64_32_v, int32_t, idx_d, ste_w)
-GEN_VEXT_ST_INDEX(vsxei64_64_v, int64_t, idx_d, ste_d)
+GEN_VEXT_ST_INDEX(vsxei8_8_v,   int8_t,  idx_b, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei8_16_v,  int16_t, idx_b, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei8_32_v,  int32_t, idx_b, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei8_64_v,  int64_t, idx_b, ste_d_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_8_v,  int8_t,  idx_h, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_16_v, int16_t, idx_h, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_32_v, int32_t, idx_h, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_64_v, int64_t, idx_h, ste_d_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_8_v,  int8_t,  idx_w, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_16_v, int16_t, idx_w, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_32_v, int32_t, idx_w, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_64_v, int64_t, idx_w, ste_d_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_8_v,  int8_t,  idx_d, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_16_v, int16_t, idx_d, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_32_v, int32_t, idx_d, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_64_v, int64_t, idx_d, ste_d_tlb)
 
 /*
  * unit-stride fault-only-fisrt load instructions
@@ -471,7 +831,7 @@ GEN_VEXT_ST_INDEX(vsxei64_64_v, int64_t, idx_d, ste_d)
 static inline void
 vext_ldff(void *vd, void *v0, target_ulong base,
           CPURISCVState *env, uint32_t desc,
-          vext_ldst_elem_fn *ldst_elem,
+          vext_ldst_elem_fn_tlb *ldst_elem,
           uint32_t log2_esz, uintptr_t ra)
 {
     void *host;
@@ -537,7 +897,8 @@ ProbeSuccess:
                 continue;
             }
             addr = base + ((i * nf + k) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
             k++;
         }
     }
@@ -554,10 +915,10 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,  \
               ctzl(sizeof(ETYPE)), GETPC());              \
 }
 
-GEN_VEXT_LDFF(vle8ff_v,  int8_t,  lde_b)
-GEN_VEXT_LDFF(vle16ff_v, int16_t, lde_h)
-GEN_VEXT_LDFF(vle32ff_v, int32_t, lde_w)
-GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d)
+GEN_VEXT_LDFF(vle8ff_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LDFF(vle16ff_v, int16_t, lde_h_tlb)
+GEN_VEXT_LDFF(vle32ff_v, int32_t, lde_w_tlb)
+GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d_tlb)
 
 #define DO_SWAP(N, M) (M)
 #define DO_AND(N, M)  (N & M)
@@ -574,7 +935,8 @@ GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d)
  */
 static void
 vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
-                vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uintptr_t ra)
+                vext_ldst_elem_fn_tlb *ldst_elem, uint32_t log2_esz,
+                uintptr_t ra)
 {
     uint32_t i, k, off, pos;
     uint32_t nf = vext_nf(desc);
@@ -593,8 +955,8 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         /* load/store rest of elements of current segment pointed by vstart */
         for (pos = off; pos < max_elems; pos++, env->vstart++) {
             target_ulong addr = base + ((pos + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), pos + k * max_elems, vd,
-                      ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (pos + k * max_elems) << log2_esz, vd, ra);
         }
         k++;
     }
@@ -603,7 +965,8 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
     for (; k < nf; k++) {
         for (i = 0; i < max_elems; i++, env->vstart++) {
             target_ulong addr = base + ((i + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
         }
     }
 
@@ -618,22 +981,22 @@ void HELPER(NAME)(void *vd, target_ulong base,       \
                     ctzl(sizeof(ETYPE)), GETPC());   \
 }
 
-GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d)
-GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d)
-GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d)
-GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d)
+GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d_tlb)
+GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d_tlb)
+GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d_tlb)
+GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d_tlb)
 
 #define GEN_VEXT_ST_WHOLE(NAME, ETYPE, STORE_FN)     \
 void HELPER(NAME)(void *vd, target_ulong base,       \
@@ -643,10 +1006,10 @@ void HELPER(NAME)(void *vd, target_ulong base,       \
                     ctzl(sizeof(ETYPE)), GETPC());   \
 }
 
-GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b)
-GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b)
-GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b)
-GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b)
+GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b_tlb)
+GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b_tlb)
+GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b_tlb)
+GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b_tlb)
 
 /*
  * Vector Integer Arithmetic Instructions
diff --git a/target/riscv/vector_internals.h b/target/riscv/vector_internals.h
index 9e1e15b5750..f59d7d5c19f 100644
--- a/target/riscv/vector_internals.h
+++ b/target/riscv/vector_internals.h
@@ -233,4 +233,52 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong s1,    \
 #define WOP_UUU_H uint32_t, uint16_t, uint16_t, uint32_t, uint32_t
 #define WOP_UUU_W uint64_t, uint32_t, uint32_t, uint64_t, uint64_t
 
+typedef struct {
+    void *host;
+    int flags;
+    MemTxAttrs attrs;
+} RVVHostPage;
+
+typedef struct {
+    /*
+     * First and last element wholly contained within the two pages.
+     * mem_off_first[0] and reg_idx_first[0] are always set >= 0.
+     * reg_idx_last[0] may be < 0 if the first element crosses pages.
+     * All of mem_off_first[1], reg_idx_first[1] and reg_idx_last[1]
+     * are set >= 0 only if there are complete elements on a second page.
+     */
+    int16_t mem_off_first[2];
+    int16_t reg_idx_first[2];
+    int16_t reg_idx_last[2];
+
+    /*
+     * One element that is misaligned and spans both pages,
+     * or -1 if there is no such active element.
+     */
+    int16_t mem_off_split;
+    int16_t reg_idx_split;
+
+    /*
+     * The byte offset at which the entire operation crosses a page boundary.
+     * Set >= 0 if and only if the entire operation spans two pages.
+     */
+    int16_t page_split;
+
+    /* TLB data for the two pages. */
+    RVVHostPage page[2];
+} RVVContLdSt;
+
+#ifdef CONFIG_USER_ONLY
+static inline void
+vext_cont_ldst_watchpoints(CPURISCVState *env, RVVContLdSt *info, uint64_t *v0,
+                           target_ulong addr, uint32_t log2_esz, bool is_load,
+                           uintptr_t ra, uint32_t desc)
+{}
+#else
+void vext_cont_ldst_watchpoints(CPURISCVState *env, RVVContLdSt *info,
+                                uint64_t *v0, target_ulong addr,
+                                uint32_t log2_esz, bool is_load, uintptr_t ra,
+                                uint32_t desc);
+#endif
+
 #endif /* TARGET_RISCV_VECTOR_INTERNALS_H */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v4 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store
  2024-06-13 17:51 [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
  2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
  2024-06-13 17:51 ` [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
@ 2024-06-13 17:51 ` Max Chou
  2024-06-13 17:51 ` [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
  2024-06-13 17:51 ` [RFC PATCH v4 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
  4 siblings, 0 replies; 15+ messages in thread
From: Max Chou @ 2024-06-13 17:51 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

The vector unit-stride whole register load/store instructions are
similar to unmasked unit-stride load/store instructions that is suitable
to be optimized by using a direct access to host ram fast path.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 185 +++++++++++++++++++++++++----------
 1 file changed, 133 insertions(+), 52 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 3d284138fb3..793337a6f96 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -935,81 +935,162 @@ GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d_tlb)
  */
 static void
 vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
-                vext_ldst_elem_fn_tlb *ldst_elem, uint32_t log2_esz,
-                uintptr_t ra)
+                vext_ldst_elem_fn_tlb *ldst_tlb,
+                vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
+                uintptr_t ra, bool is_load)
 {
-    uint32_t i, k, off, pos;
+    RVVContLdSt info;
+    target_ulong addr;
+    void *host;
+    int flags;
+    intptr_t reg_start, reg_last;
+    uint32_t idx_nf, off, evl;
     uint32_t nf = vext_nf(desc);
     uint32_t vlenb = riscv_cpu_cfg(env)->vlenb;
     uint32_t max_elems = vlenb >> log2_esz;
+    uint32_t esz = 1 << log2_esz;
 
     if (env->vstart >= ((vlenb * nf) >> log2_esz)) {
         env->vstart = 0;
         return;
     }
 
-    k = env->vstart / max_elems;
-    off = env->vstart % max_elems;
+    vext_cont_ldst_elements(&info, base, env->vreg, env->vstart,
+                            nf * max_elems, desc, log2_esz, true);
+    vext_cont_ldst_pages(env, &info, base, is_load, desc, esz, ra, true);
+    vext_cont_ldst_watchpoints(env, &info, env->vreg, base, esz, is_load, ra,
+                               desc);
+
+    flags = info.page[0].flags | info.page[1].flags;
+    if (unlikely(flags != 0)) {
+        /* At least one page includes MMIO. */
+        reg_start = info.reg_idx_first[0];
+        idx_nf = reg_start / max_elems;
+        off = reg_start % max_elems;
+        evl = (idx_nf + 1) * max_elems;
+
+        if (off) {
+            /*
+             * load/store rest of elements of current segment pointed by vstart
+             */
+            addr = base + (reg_start << log2_esz);
+            for (; reg_start < evl; reg_start++, addr += esz) {
+                ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz,
+                         vd, ra);
+            }
+            idx_nf++;
+        }
+
+        /* load/store elements for rest of segments */
+        evl = nf * max_elems;
+        addr = base + (reg_start << log2_esz);
+        for (; reg_start < evl; reg_start++, addr += esz) {
+            ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz, vd,
+                     ra);
+        }
+
+        env->vstart = 0;
+        return;
+    }
+
+    /* The entire operation is in RAM, on valid pages. */
+    reg_start = info.reg_idx_first[0];
+    reg_last = info.reg_idx_last[0] + 1;
+    host = info.page[0].host;
+    idx_nf = reg_start / max_elems;
+    off = reg_start % max_elems;
+    evl = (idx_nf + 1) * max_elems;
 
     if (off) {
         /* load/store rest of elements of current segment pointed by vstart */
-        for (pos = off; pos < max_elems; pos++, env->vstart++) {
-            target_ulong addr = base + ((pos + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr),
-                      (pos + k * max_elems) << log2_esz, vd, ra);
+        for (; reg_start < evl; reg_start++) {
+            ldst_host(vd, reg_start << log2_esz,
+                      host + (reg_start << log2_esz));
         }
-        k++;
+        idx_nf++;
     }
 
     /* load/store elements for rest of segments */
-    for (; k < nf; k++) {
-        for (i = 0; i < max_elems; i++, env->vstart++) {
-            target_ulong addr = base + ((i + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr),
-                      (i + k * max_elems) << log2_esz, vd, ra);
+    for (; reg_start < reg_last; reg_start++) {
+        ldst_host(vd, reg_start << log2_esz, host + (reg_start << log2_esz));
+    }
+
+    /*
+     * Use the slow path to manage the cross-page misalignment.
+     * But we know this is RAM and cannot trap.
+     */
+    if (unlikely(info.mem_off_split >= 0)) {
+        reg_start = info.reg_idx_split;
+        addr = base + (reg_start << log2_esz);
+        ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz, vd, ra);
+    }
+
+    if (unlikely(info.mem_off_first[1] >= 0)) {
+        reg_start = info.reg_idx_first[1];
+        reg_last = info.reg_idx_last[1] + 1;
+        host = info.page[1].host;
+        idx_nf = reg_start / max_elems;
+        off = reg_start % max_elems;
+        evl = (idx_nf + 1) * max_elems;
+
+        if (off) {
+            /*
+             * load/store rest of elements of current segment pointed by vstart
+             */
+            for (; reg_start < evl; reg_start++) {
+                ldst_host(vd, reg_start << log2_esz,
+                          host + (reg_start << log2_esz));
+            }
+            idx_nf++;
+        }
+
+        /* load/store elements for rest of segments */
+        for (; reg_start < reg_last; reg_start++) {
+            ldst_host(vd, reg_start << log2_esz,
+                      host + (reg_start << log2_esz));
         }
     }
 
     env->vstart = 0;
 }
 
-#define GEN_VEXT_LD_WHOLE(NAME, ETYPE, LOAD_FN)      \
-void HELPER(NAME)(void *vd, target_ulong base,       \
-                  CPURISCVState *env, uint32_t desc) \
-{                                                    \
-    vext_ldst_whole(vd, base, env, desc, LOAD_FN,    \
-                    ctzl(sizeof(ETYPE)), GETPC());   \
-}
-
-GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d_tlb)
-GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d_tlb)
-GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d_tlb)
-GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d_tlb)
-
-#define GEN_VEXT_ST_WHOLE(NAME, ETYPE, STORE_FN)     \
-void HELPER(NAME)(void *vd, target_ulong base,       \
-                  CPURISCVState *env, uint32_t desc) \
-{                                                    \
-    vext_ldst_whole(vd, base, env, desc, STORE_FN,   \
-                    ctzl(sizeof(ETYPE)), GETPC());   \
-}
-
-GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b_tlb)
-GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b_tlb)
-GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b_tlb)
-GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b_tlb)
+#define GEN_VEXT_LD_WHOLE(NAME, ETYPE, LOAD_FN_TLB, LOAD_FN_HOST)   \
+void HELPER(NAME)(void *vd, target_ulong base, CPURISCVState *env,  \
+                  uint32_t desc)                                    \
+{                                                                   \
+    vext_ldst_whole(vd, base, env, desc, LOAD_FN_TLB, LOAD_FN_HOST, \
+                    ctzl(sizeof(ETYPE)), GETPC(), true);            \
+}
+
+GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d_tlb, lde_d_host)
+GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d_tlb, lde_d_host)
+GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d_tlb, lde_d_host)
+GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d_tlb, lde_d_host)
+
+#define GEN_VEXT_ST_WHOLE(NAME, ETYPE, STORE_FN_TLB, STORE_FN_HOST)     \
+void HELPER(NAME)(void *vd, target_ulong base, CPURISCVState *env,      \
+                  uint32_t desc)                                        \
+{                                                                       \
+    vext_ldst_whole(vd, base, env, desc, STORE_FN_TLB, STORE_FN_HOST,   \
+                    ctzl(sizeof(ETYPE)), GETPC(), false);               \
+}
+
+GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b_tlb, ste_b_host)
 
 /*
  * Vector Integer Arithmetic Instructions
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions
  2024-06-13 17:51 [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
                   ` (2 preceding siblings ...)
  2024-06-13 17:51 ` [RFC PATCH v4 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store Max Chou
@ 2024-06-13 17:51 ` Max Chou
  2024-06-20  4:38   ` Richard Henderson
  2024-06-13 17:51 ` [RFC PATCH v4 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
  4 siblings, 1 reply; 15+ messages in thread
From: Max Chou @ 2024-06-13 17:51 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

The vector unmasked unit-stride and whole register load/store
instructions will load/store continuous memory. If the endian of both
the host and guest architecture are the same, then we can group the
element load/store to load/store more data at a time.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 160 +++++++++++++++++++++++++----------
 1 file changed, 117 insertions(+), 43 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 793337a6f96..cba46ef16a5 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -457,6 +457,69 @@ GEN_VEXT_ST_ELEM(ste_h, uint16_t, H2, stw)
 GEN_VEXT_ST_ELEM(ste_w, uint32_t, H4, stl)
 GEN_VEXT_ST_ELEM(ste_d, uint64_t, H8, stq)
 
+static inline uint32_t
+vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
+                     uint32_t byte_offset, void *host, uint32_t esz,
+                     bool is_load)
+{
+    uint32_t group_size;
+    static vext_ldst_elem_fn_host * const fns[2][4] = {
+        /* Store */
+        { ste_b_host, ste_h_host, ste_w_host, ste_d_host },
+        /* Load */
+        { lde_b_host, lde_h_host, lde_w_host, lde_d_host }
+    };
+    vext_ldst_elem_fn_host *fn;
+
+    if (byte_offset + 8 < byte_end) {
+        group_size = MO_64;
+    } else if (byte_offset + 4 < byte_end) {
+        group_size = MO_32;
+    } else if (byte_offset + 2 < byte_end) {
+        group_size = MO_16;
+    } else {
+        group_size = MO_8;
+    }
+
+    fn = fns[is_load][group_size];
+    fn(vd, byte_offset, host + byte_offset);
+
+    return 1 << group_size;
+}
+
+static inline void
+vext_continus_ldst_tlb(CPURISCVState *env, vext_ldst_elem_fn_tlb *ldst_tlb,
+                       void *vd, uint32_t evl, target_ulong addr,
+                       uint32_t reg_start, uintptr_t ra, uint32_t esz,
+                       bool is_load)
+{
+    for (; reg_start < evl; reg_start++, addr += esz) {
+        ldst_tlb(env, adjust_addr(env, addr), reg_start * esz, vd, ra);
+    }
+}
+
+static inline void
+vext_continus_ldst_host(CPURISCVState *env, vext_ldst_elem_fn_host *ldst_host,
+                        void *vd, uint32_t evl, uint32_t reg_start, void *host,
+                        uint32_t esz, bool is_load)
+{
+#if TARGET_BIG_ENDIAN != HOST_BIG_ENDIAN
+    for (; reg_start < evl; reg_start++) {
+        uint32_t byte_off = reg_start * esz;
+        ldst_host(vd, byte_off, host + byte_off);
+    }
+#else
+    uint32_t group_byte;
+    uint32_t byte_start = reg_start * esz;
+    uint32_t byte_end = evl * esz;
+    while (byte_start < byte_end) {
+        group_byte = vext_group_ldst_host(env, vd, byte_end, byte_start, host,
+                                          esz, is_load);
+        byte_start += group_byte;
+    }
+#endif
+}
+
 static void vext_set_tail_elems_1s(target_ulong vl, void *vd,
                                    uint32_t desc, uint32_t nf,
                                    uint32_t esz, uint32_t max_elems)
@@ -555,6 +618,7 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              uint32_t evl, uintptr_t ra, bool is_load)
 {
     RVVContLdSt info;
+    target_ulong addr;
     void *host;
     int flags;
     intptr_t reg_start, reg_last;
@@ -587,13 +651,19 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         }
         reg_last += 1;
 
-        for (i = reg_start; i < reg_last; ++i) {
-            k = 0;
-            while (k < nf) {
-                target_ulong addr = base + ((i * nf + k) << log2_esz);
-                ldst_tlb(env, adjust_addr(env, addr),
-                         (i + k * max_elems) << log2_esz, vd, ra);
-                k++;
+        if (nf == 1) {
+            addr = base + reg_start * esz;
+            vext_continus_ldst_tlb(env, ldst_tlb, vd, reg_last, addr,
+                                   reg_start, ra, esz, is_load);
+        } else {
+            for (i = reg_start; i < reg_last; ++i) {
+                k = 0;
+                while (k < nf) {
+                    addr = base + ((i * nf + k) * esz);
+                    ldst_tlb(env, adjust_addr(env, addr),
+                             (i + k * max_elems) << log2_esz, vd, ra);
+                    k++;
+                }
             }
         }
 
@@ -607,12 +677,17 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
     reg_last = info.reg_idx_last[0] + 1;
     host = info.page[0].host;
 
-    for (i = reg_start; i < reg_last; ++i) {
-        k = 0;
-        while (k < nf) {
-            ldst_host(vd, (i + k * max_elems) << log2_esz,
-                      host + ((i * nf + k) << log2_esz));
-            k++;
+    if (nf == 1) {
+        vext_continus_ldst_host(env, ldst_host, vd, reg_last, reg_start, host,
+                                esz, is_load);
+    } else {
+        for (i = reg_start; i < reg_last; ++i) {
+            k = 0;
+            while (k < nf) {
+                ldst_host(vd, (i + k * max_elems) << log2_esz,
+                          host + ((i * nf + k) * esz));
+                k++;
+            }
         }
     }
 
@@ -624,7 +699,7 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         reg_start = info.reg_idx_split;
         k = 0;
         while (k < nf) {
-            target_ulong addr = base + ((reg_start * nf + k) << log2_esz);
+            addr = base + ((reg_start * nf + k) << log2_esz);
             ldst_tlb(env, adjust_addr(env, addr),
                      (reg_start + k * max_elems) << log2_esz, vd, ra);
             k++;
@@ -636,12 +711,17 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         reg_last = info.reg_idx_last[1] + 1;
         host = info.page[1].host;
 
-        for (i = reg_start; i < reg_last; ++i) {
-            k = 0;
-            while (k < nf) {
-                ldst_host(vd, (i + k * max_elems) << log2_esz,
-                          host + ((i * nf + k) << log2_esz));
-                k++;
+        if (nf == 1) {
+            vext_continus_ldst_host(env, ldst_host, vd, reg_last, reg_start,
+                                    host, esz, is_load);
+        } else {
+            for (i = reg_start; i < reg_last; ++i) {
+                k = 0;
+                while (k < nf) {
+                    ldst_host(vd, (i + k * max_elems) << log2_esz,
+                              host + ((i * nf + k) << log2_esz));
+                    k++;
+                }
             }
         }
     }
@@ -974,20 +1054,17 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              * load/store rest of elements of current segment pointed by vstart
              */
             addr = base + (reg_start << log2_esz);
-            for (; reg_start < evl; reg_start++, addr += esz) {
-                ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz,
-                         vd, ra);
-            }
+            vext_continus_ldst_tlb(env, ldst_tlb, vd, evl, addr, reg_start, ra,
+                                   esz, is_load);
             idx_nf++;
         }
 
         /* load/store elements for rest of segments */
         evl = nf * max_elems;
         addr = base + (reg_start << log2_esz);
-        for (; reg_start < evl; reg_start++, addr += esz) {
-            ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz, vd,
-                     ra);
-        }
+        reg_start = idx_nf * max_elems;
+        vext_continus_ldst_tlb(env, ldst_tlb, vd, evl, addr, reg_start, ra,
+                               esz, is_load);
 
         env->vstart = 0;
         return;
@@ -1003,17 +1080,16 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
 
     if (off) {
         /* load/store rest of elements of current segment pointed by vstart */
-        for (; reg_start < evl; reg_start++) {
-            ldst_host(vd, reg_start << log2_esz,
-                      host + (reg_start << log2_esz));
-        }
+        vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host, esz,
+                                is_load);
         idx_nf++;
     }
 
     /* load/store elements for rest of segments */
-    for (; reg_start < reg_last; reg_start++) {
-        ldst_host(vd, reg_start << log2_esz, host + (reg_start << log2_esz));
-    }
+    evl = reg_last;
+    reg_start = idx_nf * max_elems;
+    vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host, esz,
+                            is_load);
 
     /*
      * Use the slow path to manage the cross-page misalignment.
@@ -1037,18 +1113,16 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
             /*
              * load/store rest of elements of current segment pointed by vstart
              */
-            for (; reg_start < evl; reg_start++) {
-                ldst_host(vd, reg_start << log2_esz,
-                          host + (reg_start << log2_esz));
-            }
+            vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host,
+                                    esz, is_load);
             idx_nf++;
         }
 
         /* load/store elements for rest of segments */
-        for (; reg_start < reg_last; reg_start++) {
-            ldst_host(vd, reg_start << log2_esz,
-                      host + (reg_start << log2_esz));
-        }
+        evl = reg_last;
+        reg_start = idx_nf * max_elems;
+        vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host, esz,
+                                is_load);
     }
 
     env->vstart = 0;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC PATCH v4 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance
  2024-06-13 17:51 [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
                   ` (3 preceding siblings ...)
  2024-06-13 17:51 ` [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
@ 2024-06-13 17:51 ` Max Chou
  2024-06-20  4:44   ` Richard Henderson
  4 siblings, 1 reply; 15+ messages in thread
From: Max Chou @ 2024-06-13 17:51 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

In the vector unit-stride load/store helper functions. the vext_ldst_us
& vext_ldst_whole functions corresponding most of the execution time.
Inline the functions can avoid the function call overhead to improve the
helper function performance.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 64 +++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 30 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index cba46ef16a5..29849a8b66f 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -415,20 +415,22 @@ typedef void vext_ldst_elem_fn_tlb(CPURISCVState *env, abi_ptr addr,
                                    uint32_t idx, void *vd, uintptr_t retaddr);
 typedef void vext_ldst_elem_fn_host(void *vd, uint32_t idx, void *host);
 
-#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                         \
-static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
-                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
-{                                                                       \
-    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
-    ETYPE *cur = ((ETYPE *)reg);                                        \
-    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);                   \
-}                                                                       \
-                                                                        \
-static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
-{                                                                       \
-    ETYPE val = LDSUF##_p(host);                                        \
-    uint8_t *reg = (uint8_t *)(vd + byte_off);                          \
-    *(ETYPE *)(reg) = val;                                              \
+#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                 \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_tlb(CPURISCVState *env, abi_ptr addr,               \
+                uint32_t byte_off, void *vd, uintptr_t retaddr) \
+{                                                               \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                  \
+    ETYPE *cur = ((ETYPE *)reg);                                \
+    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);           \
+}                                                               \
+                                                                \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_host(void *vd, uint32_t byte_off, void *host)       \
+{                                                               \
+    ETYPE val = LDSUF##_p(host);                                \
+    uint8_t *reg = (uint8_t *)(vd + byte_off);                  \
+    *(ETYPE *)(reg) = val;                                      \
 }
 
 GEN_VEXT_LD_ELEM(lde_b, uint8_t,  H1, ldub)
@@ -436,20 +438,22 @@ GEN_VEXT_LD_ELEM(lde_h, uint16_t, H2, lduw)
 GEN_VEXT_LD_ELEM(lde_w, uint32_t, H4, ldl)
 GEN_VEXT_LD_ELEM(lde_d, uint64_t, H8, ldq)
 
-#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)                         \
-static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
-                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
-{                                                                       \
-    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
-    ETYPE data = *((ETYPE *)reg);                                       \
-    cpu_##STSUF##_data_ra(env, addr, data, retaddr);                    \
-}                                                                       \
-                                                                        \
-static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
-{                                                                       \
-    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
-    ETYPE val = *(ETYPE *)(reg);                                        \
-    STSUF##_p(host, val);                                               \
+#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)                 \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_tlb(CPURISCVState *env, abi_ptr addr,               \
+                uint32_t byte_off, void *vd, uintptr_t retaddr) \
+{                                                               \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                  \
+    ETYPE data = *((ETYPE *)reg);                               \
+    cpu_##STSUF##_data_ra(env, addr, data, retaddr);            \
+}                                                               \
+                                                                \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_host(void *vd, uint32_t byte_off, void *host)       \
+{                                                               \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                  \
+    ETYPE val = *(ETYPE *)(reg);                                \
+    STSUF##_p(host, val);                                       \
 }
 
 GEN_VEXT_ST_ELEM(ste_b, uint8_t,  H1, stb)
@@ -611,7 +615,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d_tlb)
  */
 
 /* unmasked unit-stride load and store operation */
-static void
+static inline QEMU_ALWAYS_INLINE void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              vext_ldst_elem_fn_tlb *ldst_tlb,
              vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
@@ -1013,7 +1017,7 @@ GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d_tlb)
 /*
  * load and store whole register instructions
  */
-static void
+static inline QEMU_ALWAYS_INLINE void
 vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
                 vext_ldst_elem_fn_tlb *ldst_tlb,
                 vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
@ 2024-06-20  2:48   ` Richard Henderson
  2024-06-20  7:50   ` Frank Chang
  2024-06-20 14:27   ` Alex Bennée
  2 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2024-06-20  2:48 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 6/13/24 10:51, Max Chou wrote:
> If there are not any QEMU plugin memory callback functions, checking
> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
> function call overhead.
> 
> Signed-off-by: Max Chou<max.chou@sifive.com>
> ---
>   accel/tcg/ldst_common.c.inc | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>

r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store
  2024-06-13 17:51 ` [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
@ 2024-06-20  4:29   ` Richard Henderson
  2024-06-25 15:14     ` Max Chou
  2024-06-20  4:41   ` Richard Henderson
  1 sibling, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2024-06-20  4:29 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 6/13/24 10:51, Max Chou wrote:
> This commit references the sve_ldN_r/sve_stN_r helper functions in ARM
> target to optimize the vector unmasked unit-stride load/store
> instructions by following items:
> 
> * Get the loose bound of activate elements
> * Probing pages/resolving host memory address/handling watchpoint at beginning
> * Provide new interface to direct access host memory
> 
> The original element load/store interface is replaced by the new element
> load/store functions with _tlb & _host postfix that means doing the
> element load/store through the original softmmu flow and the direct
> access host memory flow.
> 
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   target/riscv/insn_trans/trans_rvv.c.inc |   3 +
>   target/riscv/vector_helper.c            | 637 +++++++++++++++++++-----
>   target/riscv/vector_internals.h         |  48 ++
>   3 files changed, 551 insertions(+), 137 deletions(-)
> 
> diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
> index 3a3896ba06c..14e10568bd7 100644
> --- a/target/riscv/insn_trans/trans_rvv.c.inc
> +++ b/target/riscv/insn_trans/trans_rvv.c.inc
> @@ -770,6 +770,7 @@ static bool ld_us_mask_op(DisasContext *s, arg_vlm_v *a, uint8_t eew)
>       /* Mask destination register are always tail-agnostic */
>       data = FIELD_DP32(data, VDATA, VTA, s->cfg_vta_all_1s);
>       data = FIELD_DP32(data, VDATA, VMA, s->vma);
> +    data = FIELD_DP32(data, VDATA, VM, 1);
>       return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
>   }
>   
> @@ -787,6 +788,7 @@ static bool st_us_mask_op(DisasContext *s, arg_vsm_v *a, uint8_t eew)
>       /* EMUL = 1, NFIELDS = 1 */
>       data = FIELD_DP32(data, VDATA, LMUL, 0);
>       data = FIELD_DP32(data, VDATA, NF, 1);
> +    data = FIELD_DP32(data, VDATA, VM, 1);
>       return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
>   }
>   
> @@ -1106,6 +1108,7 @@ static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
>       TCGv_i32 desc;
>   
>       uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
> +    data = FIELD_DP32(data, VDATA, VM, 1);
>       dest = tcg_temp_new_ptr();
>       desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
>                                         s->cfg_ptr->vlenb, data));

This is ok, and would warrant a separate patch.


> +    if (vm == 0) {
> +        for (i = vstart; i < evl; ++i) {
> +            if (vext_elem_mask(v0, i)) {
> +                reg_idx_last = i;
> +                if (reg_idx_first < 0) {
> +                    reg_idx_first = i;
> +                }
> +            }
> +        }

This isn't great, and isn't used for now, since only unmasked unit-stride is handled so 
far.  I think this first patch should be simpler and *assume* VM is set.

> +/*
> + * Resolve the guest virtual addresses to info->page[].
> + * Control the generation of page faults with @fault.  Return false if
> + * there is no work to do, which can only happen with @fault == FAULT_NO.
> + */
> +static bool vext_cont_ldst_pages(CPURISCVState *env, RVVContLdSt *info,
> +                                 target_ulong addr, bool is_load,
> +                                 uint32_t desc, uint32_t esz, uintptr_t ra,
> +                                 bool is_us_whole)
> +{
> +    uint32_t vm = vext_vm(desc);
> +    uint32_t nf = vext_nf(desc);
> +    bool nofault = (vm == 1 ? false : true);

Why is nofault == "!vm"?

Also, it's silly to use ?: with true/false -- use the proper boolean expression in the 
first place.

That said... faults with RVV must interact with vstart.

I'm not sure what the best code organization is.

Perhaps a subroutine, passed the first and last elements for a single page.

   Update vstart, resolve the page, allowing the exception.
   If watchpoints, one call to cpu_check_watchpoint for the entire memory range.
   If ram, iterate through the rest of the page using host accesses; otherwise,
   iterate through the rest of the page using tlb accesses, making sure vstart
   is always up-to-date.

The main routine looks for the page_split, invokes the subroutine for the first (and 
likely only) page.  Special case any split-page element.  Invoke the subroutine for the 
second page.


r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions
  2024-06-13 17:51 ` [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
@ 2024-06-20  4:38   ` Richard Henderson
  2024-06-24  6:50     ` Max Chou
  0 siblings, 1 reply; 15+ messages in thread
From: Richard Henderson @ 2024-06-20  4:38 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 6/13/24 10:51, Max Chou wrote:
> The vector unmasked unit-stride and whole register load/store
> instructions will load/store continuous memory. If the endian of both
> the host and guest architecture are the same, then we can group the
> element load/store to load/store more data at a time.
> 
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   target/riscv/vector_helper.c | 160 +++++++++++++++++++++++++----------
>   1 file changed, 117 insertions(+), 43 deletions(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 793337a6f96..cba46ef16a5 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -457,6 +457,69 @@ GEN_VEXT_ST_ELEM(ste_h, uint16_t, H2, stw)
>   GEN_VEXT_ST_ELEM(ste_w, uint32_t, H4, stl)
>   GEN_VEXT_ST_ELEM(ste_d, uint64_t, H8, stq)
>   
> +static inline uint32_t
> +vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
> +                     uint32_t byte_offset, void *host, uint32_t esz,
> +                     bool is_load)
> +{
> +    uint32_t group_size;
> +    static vext_ldst_elem_fn_host * const fns[2][4] = {
> +        /* Store */
> +        { ste_b_host, ste_h_host, ste_w_host, ste_d_host },
> +        /* Load */
> +        { lde_b_host, lde_h_host, lde_w_host, lde_d_host }
> +    };
> +    vext_ldst_elem_fn_host *fn;
> +
> +    if (byte_offset + 8 < byte_end) {
> +        group_size = MO_64;
> +    } else if (byte_offset + 4 < byte_end) {
> +        group_size = MO_32;
> +    } else if (byte_offset + 2 < byte_end) {
> +        group_size = MO_16;
> +    } else {
> +        group_size = MO_8;
> +    }
> +
> +    fn = fns[is_load][group_size];
> +    fn(vd, byte_offset, host + byte_offset);

This is a really bad idea.  The table and indirect call means that none of these will be 
properly inlined.  Anyway...

> +
> +    return 1 << group_size;
> +}
> +
> +static inline void
> +vext_continus_ldst_tlb(CPURISCVState *env, vext_ldst_elem_fn_tlb *ldst_tlb,
> +                       void *vd, uint32_t evl, target_ulong addr,
> +                       uint32_t reg_start, uintptr_t ra, uint32_t esz,
> +                       bool is_load)
> +{
> +    for (; reg_start < evl; reg_start++, addr += esz) {
> +        ldst_tlb(env, adjust_addr(env, addr), reg_start * esz, vd, ra);
> +    }
> +}
> +
> +static inline void
> +vext_continus_ldst_host(CPURISCVState *env, vext_ldst_elem_fn_host *ldst_host,
> +                        void *vd, uint32_t evl, uint32_t reg_start, void *host,
> +                        uint32_t esz, bool is_load)
> +{
> +#if TARGET_BIG_ENDIAN != HOST_BIG_ENDIAN
> +    for (; reg_start < evl; reg_start++) {
> +        uint32_t byte_off = reg_start * esz;
> +        ldst_host(vd, byte_off, host + byte_off);
> +    }
> +#else
> +    uint32_t group_byte;
> +    uint32_t byte_start = reg_start * esz;
> +    uint32_t byte_end = evl * esz;
> +    while (byte_start < byte_end) {
> +        group_byte = vext_group_ldst_host(env, vd, byte_end, byte_start, host,
> +                                          esz, is_load);
> +        byte_start += group_byte;
> +    }

... this is much better handled with memcpy, given that you know endianness matches.


r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store
  2024-06-13 17:51 ` [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
  2024-06-20  4:29   ` Richard Henderson
@ 2024-06-20  4:41   ` Richard Henderson
  1 sibling, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2024-06-20  4:41 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 6/13/24 10:51, Max Chou wrote:
> +#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                         \
> +static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
> +                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
> +{                                                                       \
> +    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
> +    ETYPE *cur = ((ETYPE *)reg);                                        \
> +    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);                   \
> +}                                                                       \
> +                                                                        \
> +static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
> +{                                                                       \
> +    ETYPE val = LDSUF##_p(host);                                        \
> +    uint8_t *reg = (uint8_t *)(vd + byte_off);                          \
> +    *(ETYPE *)(reg) = val;                                              \
> +}

Why are you casting to and from uint8_t* ?

Surely this is cleaner as

     ETYPE *cur = vd + byte_off;


r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance
  2024-06-13 17:51 ` [RFC PATCH v4 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
@ 2024-06-20  4:44   ` Richard Henderson
  0 siblings, 0 replies; 15+ messages in thread
From: Richard Henderson @ 2024-06-20  4:44 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 6/13/24 10:51, Max Chou wrote:
> In the vector unit-stride load/store helper functions. the vext_ldst_us
> & vext_ldst_whole functions corresponding most of the execution time.
> Inline the functions can avoid the function call overhead to improve the
> helper function performance.
> 
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>   target/riscv/vector_helper.c | 64 +++++++++++++++++++-----------------
>   1 file changed, 34 insertions(+), 30 deletions(-)


Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
  2024-06-20  2:48   ` Richard Henderson
@ 2024-06-20  7:50   ` Frank Chang
  2024-06-20 14:27   ` Alex Bennée
  2 siblings, 0 replies; 15+ messages in thread
From: Frank Chang @ 2024-06-20  7:50 UTC (permalink / raw)
  To: Max Chou
  Cc: qemu-devel, qemu-riscv, Richard Henderson, Paolo Bonzini,
	Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei

Reviewed-by: Frank Chang <frank.chang@sifive.com>

Max Chou <max.chou@sifive.com> 於 2024年6月14日 週五 上午1:52寫道：
>
> If there are not any QEMU plugin memory callback functions, checking
> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
> function call overhead.
>
> Signed-off-by: Max Chou <max.chou@sifive.com>
> ---
>  accel/tcg/ldst_common.c.inc | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
> index c82048e377e..87ceb954873 100644
> --- a/accel/tcg/ldst_common.c.inc
> +++ b/accel/tcg/ldst_common.c.inc
> @@ -125,7 +125,9 @@ void helper_st_i128(CPUArchState *env, uint64_t addr, Int128 val, MemOpIdx oi)
>
>  static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
>  {
> -    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
> +    }
>  }
>
>  uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
> @@ -188,7 +190,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr,
>
>  static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
>  {
> -    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
> +    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
> +        qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
> +    }
>  }
>
>  void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
> --
> 2.34.1
>
>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
  2024-06-20  2:48   ` Richard Henderson
  2024-06-20  7:50   ` Frank Chang
@ 2024-06-20 14:27   ` Alex Bennée
  2 siblings, 0 replies; 15+ messages in thread
From: Alex Bennée @ 2024-06-20 14:27 UTC (permalink / raw)
  To: Max Chou
  Cc: qemu-devel, qemu-riscv, Richard Henderson, Paolo Bonzini,
	Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei

Max Chou <max.chou@sifive.com> writes:

> If there are not any QEMU plugin memory callback functions, checking
> before calling the qemu_plugin_vcpu_mem_cb function can reduce the
> function call overhead.
>
> Signed-off-by: Max Chou <max.chou@sifive.com>

Queued this patch to maintainer/june-2024-omnibus, thanks.

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions
  2024-06-20  4:38   ` Richard Henderson
@ 2024-06-24  6:50     ` Max Chou
  0 siblings, 0 replies; 15+ messages in thread
From: Max Chou @ 2024-06-24  6:50 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 2024/6/20 12:38 PM, Richard Henderson wrote:

> On 6/13/24 10:51, Max Chou wrote:
>> The vector unmasked unit-stride and whole register load/store
>> instructions will load/store continuous memory. If the endian of both
>> the host and guest architecture are the same, then we can group the
>> element load/store to load/store more data at a time.
>>
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>>   target/riscv/vector_helper.c | 160 +++++++++++++++++++++++++----------
>>   1 file changed, 117 insertions(+), 43 deletions(-)
>>
>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>> index 793337a6f96..cba46ef16a5 100644
>> --- a/target/riscv/vector_helper.c
>> +++ b/target/riscv/vector_helper.c
>> @@ -457,6 +457,69 @@ GEN_VEXT_ST_ELEM(ste_h, uint16_t, H2, stw)
>>   GEN_VEXT_ST_ELEM(ste_w, uint32_t, H4, stl)
>>   GEN_VEXT_ST_ELEM(ste_d, uint64_t, H8, stq)
>>   +static inline uint32_t
>> +vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>> +                     uint32_t byte_offset, void *host, uint32_t esz,
>> +                     bool is_load)
>> +{
>> +    uint32_t group_size;
>> +    static vext_ldst_elem_fn_host * const fns[2][4] = {
>> +        /* Store */
>> +        { ste_b_host, ste_h_host, ste_w_host, ste_d_host },
>> +        /* Load */
>> +        { lde_b_host, lde_h_host, lde_w_host, lde_d_host }
>> +    };
>> +    vext_ldst_elem_fn_host *fn;
>> +
>> +    if (byte_offset + 8 < byte_end) {
>> +        group_size = MO_64;
>> +    } else if (byte_offset + 4 < byte_end) {
>> +        group_size = MO_32;
>> +    } else if (byte_offset + 2 < byte_end) {
>> +        group_size = MO_16;
>> +    } else {
>> +        group_size = MO_8;
>> +    }
>> +
>> +    fn = fns[is_load][group_size];
>> +    fn(vd, byte_offset, host + byte_offset);
>
> This is a really bad idea.  The table and indirect call means that 
> none of these will be properly inlined.  Anyway...
>
>> +
>> +    return 1 << group_size;
>> +}
>> +
>> +static inline void
>> +vext_continus_ldst_tlb(CPURISCVState *env, vext_ldst_elem_fn_tlb 
>> *ldst_tlb,
>> +                       void *vd, uint32_t evl, target_ulong addr,
>> +                       uint32_t reg_start, uintptr_t ra, uint32_t esz,
>> +                       bool is_load)
>> +{
>> +    for (; reg_start < evl; reg_start++, addr += esz) {
>> +        ldst_tlb(env, adjust_addr(env, addr), reg_start * esz, vd, ra);
>> +    }
>> +}
>> +
>> +static inline void
>> +vext_continus_ldst_host(CPURISCVState *env, vext_ldst_elem_fn_host 
>> *ldst_host,
>> +                        void *vd, uint32_t evl, uint32_t reg_start, 
>> void *host,
>> +                        uint32_t esz, bool is_load)
>> +{
>> +#if TARGET_BIG_ENDIAN != HOST_BIG_ENDIAN
>> +    for (; reg_start < evl; reg_start++) {
>> +        uint32_t byte_off = reg_start * esz;
>> +        ldst_host(vd, byte_off, host + byte_off);
>> +    }
>> +#else
>> +    uint32_t group_byte;
>> +    uint32_t byte_start = reg_start * esz;
>> +    uint32_t byte_end = evl * esz;
>> +    while (byte_start < byte_end) {
>> +        group_byte = vext_group_ldst_host(env, vd, byte_end, 
>> byte_start, host,
>> +                                          esz, is_load);
>> +        byte_start += group_byte;
>> +    }
>
> ... this is much better handled with memcpy, given that you know 
> endianness matches.
Thanks for the suggestion.
I'll try to replace the original implementation of the table and 
indirect calls by handled with memcpy at the next version.

Max.
>
>
> r~


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store
  2024-06-20  4:29   ` Richard Henderson
@ 2024-06-25 15:14     ` Max Chou
  0 siblings, 0 replies; 15+ messages in thread
From: Max Chou @ 2024-06-25 15:14 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv
  Cc: Paolo Bonzini, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei

On 2024/6/20 12:29 PM, Richard Henderson wrote:

> On 6/13/24 10:51, Max Chou wrote:
>> This commit references the sve_ldN_r/sve_stN_r helper functions in ARM
>> target to optimize the vector unmasked unit-stride load/store
>> instructions by following items:
>>
>> * Get the loose bound of activate elements
>> * Probing pages/resolving host memory address/handling watchpoint at 
>> beginning
>> * Provide new interface to direct access host memory
>>
>> The original element load/store interface is replaced by the new element
>> load/store functions with _tlb & _host postfix that means doing the
>> element load/store through the original softmmu flow and the direct
>> access host memory flow.
>>
>> Signed-off-by: Max Chou <max.chou@sifive.com>
>> ---
>>   target/riscv/insn_trans/trans_rvv.c.inc |   3 +
>>   target/riscv/vector_helper.c            | 637 +++++++++++++++++++-----
>>   target/riscv/vector_internals.h         |  48 ++
>>   3 files changed, 551 insertions(+), 137 deletions(-)
>>
>> diff --git a/target/riscv/insn_trans/trans_rvv.c.inc 
>> b/target/riscv/insn_trans/trans_rvv.c.inc
>> index 3a3896ba06c..14e10568bd7 100644
>> --- a/target/riscv/insn_trans/trans_rvv.c.inc
>> +++ b/target/riscv/insn_trans/trans_rvv.c.inc
>> @@ -770,6 +770,7 @@ static bool ld_us_mask_op(DisasContext *s, 
>> arg_vlm_v *a, uint8_t eew)
>>       /* Mask destination register are always tail-agnostic */
>>       data = FIELD_DP32(data, VDATA, VTA, s->cfg_vta_all_1s);
>>       data = FIELD_DP32(data, VDATA, VMA, s->vma);
>> +    data = FIELD_DP32(data, VDATA, VM, 1);
>>       return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
>>   }
>>   @@ -787,6 +788,7 @@ static bool st_us_mask_op(DisasContext *s, 
>> arg_vsm_v *a, uint8_t eew)
>>       /* EMUL = 1, NFIELDS = 1 */
>>       data = FIELD_DP32(data, VDATA, LMUL, 0);
>>       data = FIELD_DP32(data, VDATA, NF, 1);
>> +    data = FIELD_DP32(data, VDATA, VM, 1);
>>       return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
>>   }
>>   @@ -1106,6 +1108,7 @@ static bool ldst_whole_trans(uint32_t vd, 
>> uint32_t rs1, uint32_t nf,
>>       TCGv_i32 desc;
>>         uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
>> +    data = FIELD_DP32(data, VDATA, VM, 1);
>>       dest = tcg_temp_new_ptr();
>>       desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
>>                                         s->cfg_ptr->vlenb, data));
>
> This is ok, and would warrant a separate patch.
Ok, I'll split this part to a separate patch at next version.
Thanks.

>
>
>> +    if (vm == 0) {
>> +        for (i = vstart; i < evl; ++i) {
>> +            if (vext_elem_mask(v0, i)) {
>> +                reg_idx_last = i;
>> +                if (reg_idx_first < 0) {
>> +                    reg_idx_first = i;
>> +                }
>> +            }
>> +        }
>
> This isn't great, and isn't used for now, since only unmasked 
> unit-stride is handled so far.  I think this first patch should be 
> simpler and *assume* VM is set.
Indeed. I'll remove this part at next version.
I agree that this first patch should be simpler and assume VM is set.
Thank you for the suggestion.

>
>> +/*
>> + * Resolve the guest virtual addresses to info->page[].
>> + * Control the generation of page faults with @fault.  Return false if
>> + * there is no work to do, which can only happen with @fault == 
>> FAULT_NO.
>> + */
>> +static bool vext_cont_ldst_pages(CPURISCVState *env, RVVContLdSt *info,
>> +                                 target_ulong addr, bool is_load,
>> +                                 uint32_t desc, uint32_t esz, 
>> uintptr_t ra,
>> +                                 bool is_us_whole)
>> +{
>> +    uint32_t vm = vext_vm(desc);
>> +    uint32_t nf = vext_nf(desc);
>> +    bool nofault = (vm == 1 ? false : true);
>
> Why is nofault == "!vm"?
>
> Also, it's silly to use ?: with true/false -- use the proper boolean 
> expression in the first place.
>
> That said... faults with RVV must interact with vstart.
>
> I'm not sure what the best code organization is.
>
> Perhaps a subroutine, passed the first and last elements for a single 
> page.
>
>   Update vstart, resolve the page, allowing the exception.
>   If watchpoints, one call to cpu_check_watchpoint for the entire 
> memory range.
>   If ram, iterate through the rest of the page using host accesses; 
> otherwise,
>   iterate through the rest of the page using tlb accesses, making sure 
> vstart
>   is always up-to-date.
>
> The main routine looks for the page_split, invokes the subroutine for 
> the first (and likely only) page.  Special case any split-page 
> element.  Invoke the subroutine for the second page.
>
>
> r~
According to the v spec, the vector unit-stride and constant-stride 
memory accesses do not guarantee ordering between individual element 
accesses.
So it may affects how we handle the precise vector traps here.
I'll replace the nofault part and try to update the subroutine flow with 
the suggestion and think about how to update vstart here.
Thanks you for the suggestions.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-06-25 15:14 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-13 17:51 [RFC PATCH v4 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
2024-06-13 17:51 ` [RFC PATCH v4 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
2024-06-20  2:48   ` Richard Henderson
2024-06-20  7:50   ` Frank Chang
2024-06-20 14:27   ` Alex Bennée
2024-06-13 17:51 ` [RFC PATCH v4 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
2024-06-20  4:29   ` Richard Henderson
2024-06-25 15:14     ` Max Chou
2024-06-20  4:41   ` Richard Henderson
2024-06-13 17:51 ` [RFC PATCH v4 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store Max Chou
2024-06-13 17:51 ` [RFC PATCH v4 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
2024-06-20  4:38   ` Richard Henderson
2024-06-24  6:50     ` Max Chou
2024-06-13 17:51 ` [RFC PATCH v4 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
2024-06-20  4:44   ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).