qemu-devel.nongnu.org archive mirror
* [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions
@ 2024-06-13 14:19 Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Max Chou @ 2024-06-13 14:19 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

Hi,

This RFC patch set aims to fix the performance issue reported in
https://gitlab.com/qemu-project/qemu/-/issues/2137.

In this new version, we added patches that
1. Provide a fast path that directly accesses host RAM for some vector
   load/store instructions (e.g. unmasked vector unit-stride load/store
   instructions) and performs the virtual address resolution once for
   the entire vector at the beginning of the helper function; see the
   sketch after this list. (Thanks to Richard Henderson's suggestion)
2. Replace the grouped element load/store TCG ops with a grouped
   element load/store flow in the helper functions, under some
   assumptions (e.g. no masking, contiguous memory load/store, and the
   same endianness for the host and guest architectures). (Thanks to
   Richard Henderson's suggestion)
3. Try to inline the vector load/store related functions that account
   for most of the execution time.
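
A rough sketch of the idea in item 1, using simplified, hypothetical
names (the actual patches resolve the page once via
probe_access_flags()/probe_access_full() and track the element bounds in
the RVVContLdSt structure introduced in patch 2). Once the host address
of the page is known, an unmasked unit-stride load reduces to
per-element copies with no further guest address translation:

    #include <stdint.h>
    #include <string.h>

    /*
     * Hypothetical illustration: 'host' already points at the first
     * element of the vector in host memory, so each element is copied
     * without any further guest address translation.
     */
    static void unit_stride_load_host(uint8_t *vd, const uint8_t *host,
                                      uint32_t vstart, uint32_t evl,
                                      uint32_t esz)
    {
        for (uint32_t i = vstart; i < evl; i++) {
            memcpy(vd + i * esz, host + i * esz, esz);
        }
    }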

This version can improve the performance of the test case provided in
https://gitlab.com/qemu-project/qemu/-/issues/2137#note_1757501369
- QEMU user mode (vlen=512): from ~51.8 sec. to ~4.5 sec.
- QEMU system mode (vlen=512): from ~125.6 sec. to ~6.6 sec.

The series is based on the riscv-to-apply.next branch (commit d82f37f).

Changes from v2:
- Drop v2 patches 1/4/5/6
- patch 2
    - Provide the direct host RAM access flow for vector unit-stride ld/st
- patch 3
    - Provide the direct host RAM access flow for vector whole register ld/st
- patch 4
    - Provide the grouped element load/store flow for contiguous vector ld/st
- patch 5
    - Extend v2 patch 3 to more vector ld/st functions

Previous version:
- v1: https://lore.kernel.org/all/20240215192823.729209-1-max.chou@sifive.com/
- v2: https://lore.kernel.org/all/20240531174504.281461-1-max.chou@sifive.com/

Max Chou (5):
  accel/tcg: Avoid unnecessary call overhead from
    qemu_plugin_vcpu_mem_cb
  target/riscv: rvv: Provide a fast path using direct access to host ram
    for unmasked unit-stride load/store
  target/riscv: rvv: Provide a fast path using direct access to host ram
    for unit-stride whole register load/store
  target/riscv: rvv: Provide group continuous ld/st flow for unit-stride
    ld/st instructions
  target/riscv: Inline unit-stride ld/st and corresponding functions for
    performance

 accel/tcg/ldst_common.c.inc             |   8 +-
 target/riscv/insn_trans/trans_rvv.c.inc |   3 +
 target/riscv/vector_helper.c            | 847 +++++++++++++++++++-----
 target/riscv/vector_internals.h         |  48 ++
 4 files changed, 738 insertions(+), 168 deletions(-)

-- 
2.34.1



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC PATCH v3 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb
  2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
@ 2024-06-13 14:19 ` Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Max Chou @ 2024-06-13 14:19 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

If no QEMU plugin memory callback is registered, checking for this
before calling qemu_plugin_vcpu_mem_cb() avoids the function call
overhead.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 accel/tcg/ldst_common.c.inc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/accel/tcg/ldst_common.c.inc b/accel/tcg/ldst_common.c.inc
index c82048e377e..87ceb954873 100644
--- a/accel/tcg/ldst_common.c.inc
+++ b/accel/tcg/ldst_common.c.inc
@@ -125,7 +125,9 @@ void helper_st_i128(CPUArchState *env, uint64_t addr, Int128 val, MemOpIdx oi)
 
 static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 {
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    }
 }
 
 uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
@@ -188,7 +190,9 @@ Int128 cpu_ld16_mmu(CPUArchState *env, abi_ptr addr,
 
 static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 {
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+    if (cpu_plugin_mem_cbs_enabled(env_cpu(env))) {
+        qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+    }
 }
 
 void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v3 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store
  2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
@ 2024-06-13 14:19 ` Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store Max Chou
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Max Chou @ 2024-06-13 14:19 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

This commit takes the sve_ldN_r/sve_stN_r helper functions of the Arm
target as a reference to optimize the unmasked vector unit-stride
load/store instructions with the following changes:

* Get a loose bound of the active elements
* Probe the pages, resolve the host memory addresses, and handle
  watchpoints at the beginning
* Provide a new interface for direct host memory access

The original element load/store interface is replaced by new element
load/store functions with _tlb and _host suffixes, which perform the
element load/store through the original softmmu flow and through direct
host memory access respectively.
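
For readers skimming the diff, the new GEN_VEXT_LD_ELEM macro roughly
expands (taking the 32-bit variant as an example) to the following pair
of accessors, one going through the existing softmmu path and one
reading from an already-resolved host pointer:

    static void lde_w_tlb(CPURISCVState *env, abi_ptr addr,
                          uint32_t byte_off, void *vd, uintptr_t retaddr)
    {
        uint8_t *reg = ((uint8_t *)vd + byte_off);
        uint32_t *cur = ((uint32_t *)reg);
        *cur = cpu_ldl_data_ra(env, addr, retaddr);
    }

    static void lde_w_host(void *vd, uint32_t byte_off, void *host)
    {
        uint32_t val = ldl_p(host);
        uint8_t *reg = (uint8_t *)(vd + byte_off);
        *(uint32_t *)(reg) = val;
    }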

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/insn_trans/trans_rvv.c.inc |   3 +
 target/riscv/vector_helper.c            | 630 ++++++++++++++++++------
 target/riscv/vector_internals.h         |  48 ++
 3 files changed, 544 insertions(+), 137 deletions(-)

diff --git a/target/riscv/insn_trans/trans_rvv.c.inc b/target/riscv/insn_trans/trans_rvv.c.inc
index 3a3896ba06c..14e10568bd7 100644
--- a/target/riscv/insn_trans/trans_rvv.c.inc
+++ b/target/riscv/insn_trans/trans_rvv.c.inc
@@ -770,6 +770,7 @@ static bool ld_us_mask_op(DisasContext *s, arg_vlm_v *a, uint8_t eew)
     /* Mask destination register are always tail-agnostic */
     data = FIELD_DP32(data, VDATA, VTA, s->cfg_vta_all_1s);
     data = FIELD_DP32(data, VDATA, VMA, s->vma);
+    data = FIELD_DP32(data, VDATA, VM, 1);
     return ldst_us_trans(a->rd, a->rs1, data, fn, s, false);
 }
 
@@ -787,6 +788,7 @@ static bool st_us_mask_op(DisasContext *s, arg_vsm_v *a, uint8_t eew)
     /* EMUL = 1, NFIELDS = 1 */
     data = FIELD_DP32(data, VDATA, LMUL, 0);
     data = FIELD_DP32(data, VDATA, NF, 1);
+    data = FIELD_DP32(data, VDATA, VM, 1);
     return ldst_us_trans(a->rd, a->rs1, data, fn, s, true);
 }
 
@@ -1106,6 +1108,7 @@ static bool ldst_whole_trans(uint32_t vd, uint32_t rs1, uint32_t nf,
     TCGv_i32 desc;
 
     uint32_t data = FIELD_DP32(0, VDATA, NF, nf);
+    data = FIELD_DP32(data, VDATA, VM, 1);
     dest = tcg_temp_new_ptr();
     desc = tcg_constant_i32(simd_desc(s->cfg_ptr->vlenb,
                                       s->cfg_ptr->vlenb, data));
diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 1b4d5a8e378..d33ba5aeca1 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -29,6 +29,7 @@
 #include "tcg/tcg-gvec-desc.h"
 #include "internals.h"
 #include "vector_internals.h"
+#include "hw/core/tcg-cpu-ops.h"
 #include <math.h>
 
 target_ulong HELPER(vsetvl)(CPURISCVState *env, target_ulong s1,
@@ -136,6 +137,263 @@ static void probe_pages(CPURISCVState *env, target_ulong addr,
     }
 }
 
+/*
+ * Find first active element on each page, and a loose bound for the
+ * final element on each page.  Identify any single element that spans
+ * the page boundary. Return true if there are any active elements.
+ */
+static bool vext_cont_ldst_elements(RVVContLdSt *info, target_ulong addr,
+                                    void *v0, uint32_t vstart, uint32_t evl,
+                                    uint32_t desc, uint32_t log2_esz,
+                                    bool is_us_whole)
+{
+    uint32_t vm = vext_vm(desc);
+    uint32_t nf = vext_nf(desc);
+    uint32_t max_elems = vext_max_elems(desc, log2_esz);
+    uint32_t esz = 1 << log2_esz;
+    uint32_t msize = is_us_whole ? esz : nf * esz;
+    int32_t reg_idx_first = -1, reg_idx_last = -1, reg_idx_split;
+    int32_t mem_off_last, mem_off_split;
+    int32_t page_split, elt_split;
+    int32_t i;
+
+    /* Set all of the element indices to -1, and the TLB data to 0. */
+    memset(info, -1, offsetof(RVVContLdSt, page));
+    memset(info->page, 0, sizeof(info->page));
+
+    /* Gross scan over the mask register v0 to find bounds. */
+    if (vm == 0) {
+        for (i = vstart; i < evl; ++i) {
+            if (vext_elem_mask(v0, i)) {
+                reg_idx_last = i;
+                if (reg_idx_first < 0) {
+                    reg_idx_first = i;
+                }
+            }
+        }
+    } else {
+        reg_idx_first = vstart;
+        reg_idx_last = evl - 1;
+    }
+
+    if (unlikely(reg_idx_first < 0)) {
+        /* No active elements, no pages touched. */
+        return false;
+    }
+    tcg_debug_assert(reg_idx_last >= 0 && reg_idx_last < max_elems);
+
+    info->reg_idx_first[0] = reg_idx_first;
+    info->mem_off_first[0] = reg_idx_first * msize;
+    mem_off_last = reg_idx_last * msize;
+
+    page_split = -(addr | TARGET_PAGE_MASK);
+    if (likely(mem_off_last + msize <= page_split)) {
+        /* The entire operation fits within a single page. */
+        info->reg_idx_last[0] = reg_idx_last;
+        return true;
+    }
+
+    info->page_split = page_split;
+    elt_split = page_split / msize;
+    reg_idx_split = elt_split;
+    mem_off_split = elt_split * msize;
+
+    /*
+     * This is the last full element on the first page, but it is not
+     * necessarily active.  If there is no full element, i.e. the first
+     * active element is the one that's split, this value remains -1.
+     * It is useful as iteration bounds.
+     */
+    if (elt_split != 0) {
+        info->reg_idx_last[0] = reg_idx_split - 1;
+    }
+
+    /* Determine if an unaligned element spans the pages.  */
+    if (page_split % msize != 0) {
+        /* It is helpful to know if the split element is active. */
+        if (vm == 1 || (vm == 0 && vext_elem_mask(v0, reg_idx_split))) {
+            info->reg_idx_split = reg_idx_split;
+            info->mem_off_split = mem_off_split;
+
+            if (reg_idx_split == reg_idx_last) {
+                /* The page crossing element is last. */
+                return true;
+            }
+        }
+        reg_idx_split++;
+        mem_off_split += msize;
+    }
+
+    /*
+     * We do want the first active element on the second page, because
+     * this may affect the address reported in an exception.
+     */
+    if (vm == 0) {
+        for (; reg_idx_split < evl; ++reg_idx_split) {
+            if (vext_elem_mask(v0, reg_idx_split)) {
+                break;
+            }
+        }
+    }
+    tcg_debug_assert(reg_idx_split <= reg_idx_last);
+    info->reg_idx_first[1] = reg_idx_split;
+    info->mem_off_first[1] = reg_idx_split * msize;
+    info->reg_idx_last[1] = reg_idx_last;
+    return true;
+}
+
+/*
+ * Resolve the guest virtual address to info->host and info->flags.
+ * If @nofault, return false if the page is invalid, otherwise
+ * exit via page fault exception.
+ */
+static bool vext_probe_page(CPURISCVState *env, RVVHostPage *info,
+                            bool nofault, target_ulong addr, int mem_off,
+                            int size, MMUAccessType access_type, int mmu_idx,
+                            uintptr_t ra)
+{
+    int flags;
+
+    addr += mem_off;
+
+#ifdef CONFIG_USER_ONLY
+    flags = probe_access_flags(env, adjust_addr(env, addr), size, access_type,
+                               mmu_idx, nofault, &info->host, ra);
+#else
+    CPUTLBEntryFull *full;
+    flags = probe_access_full(env, adjust_addr(env, addr), size, access_type,
+                              mmu_idx, nofault, &info->host, &full, ra);
+#endif
+    info->flags = flags;
+
+    if (flags & TLB_INVALID_MASK) {
+        g_assert(nofault);
+        return false;
+    }
+
+#ifdef CONFIG_USER_ONLY
+    memset(&info->attrs, 0, sizeof(info->attrs));
+#else
+    info->attrs = full->attrs;
+#endif
+
+    /* Ensure that info->host[] is relative to addr, not addr + mem_off. */
+    info->host -= mem_off;
+    return true;
+}
+
+/*
+ * Resolve the guest virtual addresses to info->page[].
+ * Control the generation of page faults with @fault.  Return false if
+ * there is no work to do, which can only happen with @fault == FAULT_NO.
+ */
+static bool vext_cont_ldst_pages(CPURISCVState *env, RVVContLdSt *info,
+                                 target_ulong addr, bool is_load,
+                                 uint32_t desc, uint32_t esz, uintptr_t ra)
+{
+    uint32_t vm = vext_vm(desc);
+    bool nofault = (vm == 1 ? false : true);
+    int mmu_index = riscv_env_mmu_index(env, false);
+    int mem_off = info->mem_off_first[0];
+    int size = (info->reg_idx_last[0] - info->reg_idx_first[0] + 1) * esz;
+    MMUAccessType access_type = is_load ? MMU_DATA_LOAD : MMU_DATA_STORE;
+    bool have_work;
+
+    have_work = vext_probe_page(env, &info->page[0], nofault, addr, mem_off,
+                                size, access_type, mmu_index, ra);
+    if (!have_work) {
+        /* No work to be done. */
+        return false;
+    }
+
+    if (likely(info->page_split < 0)) {
+        /* The entire operation was on the one page. */
+        return true;
+    }
+
+    /*
+     * If the second page is invalid, then we want the fault address to be
+     * the first byte on that page which is accessed.
+     */
+    if (info->mem_off_split >= 0) {
+        /*
+         * There is an element split across the pages.  The fault address
+         * should be the first byte of the second page.
+         */
+        mem_off = info->page_split;
+    } else {
+        /*
+         * There is no element split across the pages.  The fault address
+         * should be the first active element on the second page.
+         */
+        mem_off = info->mem_off_first[1];
+    }
+    size = info->reg_idx_last[1] * esz - mem_off + esz;
+    have_work |= vext_probe_page(env, &info->page[1], nofault, addr, mem_off,
+                                 size, access_type, mmu_index, ra);
+    return have_work;
+}
+
+#ifndef CONFIG_USER_ONLY
+void vext_cont_ldst_watchpoints(CPURISCVState *env, RVVContLdSt *info,
+                                uint64_t *v0, target_ulong addr,
+                                uint32_t esz, bool is_load, uintptr_t ra,
+                                uint32_t desc)
+{
+    int32_t i;
+    intptr_t mem_off, reg_off, reg_last;
+    uint32_t vm = vext_vm(desc);
+    int wp_access = is_load == true ? BP_MEM_READ : BP_MEM_WRITE;
+    int flags0 = info->page[0].flags;
+    int flags1 = info->page[1].flags;
+
+    if (likely(!((flags0 | flags1) & TLB_WATCHPOINT))) {
+        return;
+    }
+
+    /* Indicate that watchpoints are handled. */
+    info->page[0].flags = flags0 & ~TLB_WATCHPOINT;
+    info->page[1].flags = flags1 & ~TLB_WATCHPOINT;
+
+    if (flags0 & TLB_WATCHPOINT) {
+        mem_off = info->mem_off_first[0];
+        reg_off = info->reg_idx_first[0];
+        reg_last = info->reg_idx_last[0];
+
+        for (i = reg_off; i < reg_last; ++i, mem_off += esz) {
+            if (vm == 1 || (vm == 0 && vext_elem_mask(v0, i))) {
+                cpu_check_watchpoint(env_cpu(env),
+                                     adjust_addr(env, addr + mem_off), esz,
+                                     info->page[0].attrs, wp_access, ra);
+            }
+        }
+    }
+
+    mem_off = info->mem_off_split;
+    if (mem_off >= 0) {
+        if (vm == 1 || (vm == 0 && vext_elem_mask(v0, mem_off / esz))) {
+            cpu_check_watchpoint(env_cpu(env),
+                                 adjust_addr(env, addr + mem_off), esz,
+                                 info->page[0].attrs, wp_access, ra);
+        }
+    }
+
+    mem_off = info->mem_off_first[1];
+    if ((flags1 & TLB_WATCHPOINT) && mem_off >= 0) {
+        reg_off = info->reg_idx_first[1];
+        reg_last = info->reg_idx_last[1];
+
+        for (i = reg_off; i < reg_last; ++i, mem_off += esz) {
+            if (vm == 1 || (vm == 0 && vext_elem_mask(v0, i))) {
+                cpu_check_watchpoint(env_cpu(env),
+                                     adjust_addr(env, addr + mem_off), esz,
+                                     info->page[1].attrs, wp_access, ra);
+            }
+        }
+    }
+}
+#endif
+
 static inline void vext_set_elem_mask(void *v0, int index,
                                       uint8_t value)
 {
@@ -146,34 +404,51 @@ static inline void vext_set_elem_mask(void *v0, int index,
 }
 
 /* elements operations for load and store */
-typedef void vext_ldst_elem_fn(CPURISCVState *env, abi_ptr addr,
-                               uint32_t idx, void *vd, uintptr_t retaddr);
+typedef void vext_ldst_elem_fn_tlb(CPURISCVState *env, abi_ptr addr,
+                                   uint32_t idx, void *vd, uintptr_t retaddr);
+typedef void vext_ldst_elem_fn_host(void *vd, uint32_t idx, void *host);
 
-#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)            \
-static void NAME(CPURISCVState *env, abi_ptr addr,         \
-                 uint32_t idx, void *vd, uintptr_t retaddr)\
-{                                                          \
-    ETYPE *cur = ((ETYPE *)vd + H(idx));                   \
-    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);      \
-}                                                          \
-
-GEN_VEXT_LD_ELEM(lde_b, int8_t,  H1, ldsb)
-GEN_VEXT_LD_ELEM(lde_h, int16_t, H2, ldsw)
-GEN_VEXT_LD_ELEM(lde_w, int32_t, H4, ldl)
-GEN_VEXT_LD_ELEM(lde_d, int64_t, H8, ldq)
-
-#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)            \
-static void NAME(CPURISCVState *env, abi_ptr addr,         \
-                 uint32_t idx, void *vd, uintptr_t retaddr)\
-{                                                          \
-    ETYPE data = *((ETYPE *)vd + H(idx));                  \
-    cpu_##STSUF##_data_ra(env, addr, data, retaddr);       \
+#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                         \
+static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
+                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
+{                                                                       \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
+    ETYPE *cur = ((ETYPE *)reg);                                        \
+    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);                   \
+}                                                                       \
+                                                                        \
+static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
+{                                                                       \
+    ETYPE val = LDSUF##_p(host);                                        \
+    uint8_t *reg = (uint8_t *)(vd + byte_off);                          \
+    *(ETYPE *)(reg) = val;                                              \
+}
+
+GEN_VEXT_LD_ELEM(lde_b, uint8_t,  H1, ldub)
+GEN_VEXT_LD_ELEM(lde_h, uint16_t, H2, lduw)
+GEN_VEXT_LD_ELEM(lde_w, uint32_t, H4, ldl)
+GEN_VEXT_LD_ELEM(lde_d, uint64_t, H8, ldq)
+
+#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)                         \
+static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
+                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
+{                                                                       \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
+    ETYPE data = *((ETYPE *)reg);                                       \
+    cpu_##STSUF##_data_ra(env, addr, data, retaddr);                    \
+}                                                                       \
+                                                                        \
+static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
+{                                                                       \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
+    ETYPE val = *(ETYPE *)(reg);                                        \
+    STSUF##_p(host, val);                                               \
 }
 
-GEN_VEXT_ST_ELEM(ste_b, int8_t,  H1, stb)
-GEN_VEXT_ST_ELEM(ste_h, int16_t, H2, stw)
-GEN_VEXT_ST_ELEM(ste_w, int32_t, H4, stl)
-GEN_VEXT_ST_ELEM(ste_d, int64_t, H8, stq)
+GEN_VEXT_ST_ELEM(ste_b, uint8_t,  H1, stb)
+GEN_VEXT_ST_ELEM(ste_h, uint16_t, H2, stw)
+GEN_VEXT_ST_ELEM(ste_w, uint32_t, H4, stl)
+GEN_VEXT_ST_ELEM(ste_d, uint64_t, H8, stq)
 
 static void vext_set_tail_elems_1s(target_ulong vl, void *vd,
                                    uint32_t desc, uint32_t nf,
@@ -199,7 +474,7 @@ static void
 vext_ldst_stride(void *vd, void *v0, target_ulong base,
                  target_ulong stride, CPURISCVState *env,
                  uint32_t desc, uint32_t vm,
-                 vext_ldst_elem_fn *ldst_elem,
+                 vext_ldst_elem_fn_tlb *ldst_elem,
                  uint32_t log2_esz, uintptr_t ra)
 {
     uint32_t i, k;
@@ -221,7 +496,8 @@ vext_ldst_stride(void *vd, void *v0, target_ulong base,
                 continue;
             }
             target_ulong addr = base + stride * i + (k << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
             k++;
         }
     }
@@ -240,10 +516,10 @@ void HELPER(NAME)(void *vd, void * v0, target_ulong base,               \
                      ctzl(sizeof(ETYPE)), GETPC());                     \
 }
 
-GEN_VEXT_LD_STRIDE(vlse8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_STRIDE(vlse16_v, int16_t, lde_h)
-GEN_VEXT_LD_STRIDE(vlse32_v, int32_t, lde_w)
-GEN_VEXT_LD_STRIDE(vlse64_v, int64_t, lde_d)
+GEN_VEXT_LD_STRIDE(vlse8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_STRIDE(vlse16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_STRIDE(vlse32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_STRIDE(vlse64_v, int64_t, lde_d_tlb)
 
 #define GEN_VEXT_ST_STRIDE(NAME, ETYPE, STORE_FN)                       \
 void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
@@ -255,10 +531,10 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
                      ctzl(sizeof(ETYPE)), GETPC());                     \
 }
 
-GEN_VEXT_ST_STRIDE(vsse8_v,  int8_t,  ste_b)
-GEN_VEXT_ST_STRIDE(vsse16_v, int16_t, ste_h)
-GEN_VEXT_ST_STRIDE(vsse32_v, int32_t, ste_w)
-GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
+GEN_VEXT_ST_STRIDE(vsse8_v,  int8_t,  ste_b_tlb)
+GEN_VEXT_ST_STRIDE(vsse16_v, int16_t, ste_h_tlb)
+GEN_VEXT_ST_STRIDE(vsse32_v, int32_t, ste_w_tlb)
+GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d_tlb)
 
 /*
  * unit-stride: access elements stored contiguously in memory
@@ -267,9 +543,14 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d)
 /* unmasked unit-stride load and store operation */
 static void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
-             vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uint32_t evl,
-             uintptr_t ra)
+             vext_ldst_elem_fn_tlb *ldst_tlb,
+             vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
+             uint32_t evl, uintptr_t ra, bool is_load)
 {
+    RVVContLdSt info;
+    void *host;
+    int flags;
+    intptr_t reg_start, reg_last;
     uint32_t i, k;
     uint32_t nf = vext_nf(desc);
     uint32_t max_elems = vext_max_elems(desc, log2_esz);
@@ -277,17 +558,88 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
 
     VSTART_CHECK_EARLY_EXIT(env);
 
-    /* load bytes from guest memory */
-    for (i = env->vstart; i < evl; env->vstart = ++i) {
+    vext_cont_ldst_elements(&info, base, env->vreg, env->vstart, evl, desc,
+                            log2_esz, false);
+    /* Probe the page(s).  Exit with exception for any invalid page. */
+    vext_cont_ldst_pages(env, &info, base, is_load, desc, esz, ra);
+    /* Handle watchpoints for all active elements. */
+    vext_cont_ldst_watchpoints(env, &info, env->vreg, base, esz, is_load, ra,
+                               desc);
+
+    /* Load bytes from guest memory */
+    flags = info.page[0].flags | info.page[1].flags;
+    if (unlikely(flags != 0)) {
+        /* At least one page includes MMIO. */
+        reg_start = info.reg_idx_first[0];
+        reg_last = info.reg_idx_last[1];
+        if (reg_last < 0) {
+            reg_last = info.reg_idx_split;
+            if (reg_last < 0) {
+                reg_last = info.reg_idx_last[0];
+            }
+        }
+        reg_last += 1;
+
+        for (i = reg_start; i < reg_last; ++i) {
+            k = 0;
+            while (k < nf) {
+                target_ulong addr = base + ((i * nf + k) << log2_esz);
+                ldst_tlb(env, adjust_addr(env, addr),
+                         (i + k * max_elems) << log2_esz, vd, ra);
+                k++;
+            }
+        }
+
+        env->vstart = 0;
+        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
+        return;
+    }
+
+    /* The entire operation is in RAM, on valid pages. */
+    reg_start = info.reg_idx_first[0];
+    reg_last = info.reg_idx_last[0] + 1;
+    host = info.page[0].host;
+
+    for (i = reg_start; i < reg_last; ++i) {
         k = 0;
         while (k < nf) {
-            target_ulong addr = base + ((i * nf + k) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_host(vd, (i + k * max_elems) << log2_esz,
+                      host + ((i * nf + k) << log2_esz));
             k++;
         }
     }
-    env->vstart = 0;
 
+    /*
+     * Use the slow path to manage the cross-page misalignment.
+     * But we know this is RAM and cannot trap.
+     */
+    if (unlikely(info.mem_off_split >= 0)) {
+        reg_start = info.reg_idx_split;
+        k = 0;
+        while (k < nf) {
+            target_ulong addr = base + ((reg_start * nf + k) << log2_esz);
+            ldst_tlb(env, adjust_addr(env, addr),
+                     (reg_start + k * max_elems) << log2_esz, vd, ra);
+            k++;
+        }
+    }
+
+    if (unlikely(info.mem_off_first[1] >= 0)) {
+        reg_start = info.reg_idx_first[1];
+        reg_last = info.reg_idx_last[1] + 1;
+        host = info.page[1].host;
+
+        for (i = reg_start; i < reg_last; ++i) {
+            k = 0;
+            while (k < nf) {
+                ldst_host(vd, (i + k * max_elems) << log2_esz,
+                          host + ((i * nf + k) << log2_esz));
+                k++;
+            }
+        }
+    }
+
+    env->vstart = 0;
     vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
 }
 
@@ -296,47 +648,47 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
  * stride, stride = NF * sizeof (ETYPE)
  */
 
-#define GEN_VEXT_LD_US(NAME, ETYPE, LOAD_FN)                            \
-void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,         \
-                         CPURISCVState *env, uint32_t desc)             \
-{                                                                       \
-    uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));             \
-    vext_ldst_stride(vd, v0, base, stride, env, desc, false, LOAD_FN,   \
-                     ctzl(sizeof(ETYPE)), GETPC());                     \
-}                                                                       \
-                                                                        \
-void HELPER(NAME)(void *vd, void *v0, target_ulong base,                \
-                  CPURISCVState *env, uint32_t desc)                    \
-{                                                                       \
-    vext_ldst_us(vd, base, env, desc, LOAD_FN,                          \
-                 ctzl(sizeof(ETYPE)), env->vl, GETPC());                \
+#define GEN_VEXT_LD_US(NAME, ETYPE, LOAD_FN_TLB, LOAD_FN_HOST)      \
+void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,     \
+                         CPURISCVState *env, uint32_t desc)         \
+{                                                                   \
+    uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));         \
+    vext_ldst_stride(vd, v0, base, stride, env, desc, false,        \
+                     LOAD_FN_TLB, ctzl(sizeof(ETYPE)), GETPC());    \
+}                                                                   \
+                                                                    \
+void HELPER(NAME)(void *vd, void *v0, target_ulong base,            \
+                  CPURISCVState *env, uint32_t desc)                \
+{                                                                   \
+    vext_ldst_us(vd, base, env, desc, LOAD_FN_TLB, LOAD_FN_HOST,    \
+                 ctzl(sizeof(ETYPE)), env->vl, GETPC(), true);      \
 }
 
-GEN_VEXT_LD_US(vle8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_US(vle16_v, int16_t, lde_h)
-GEN_VEXT_LD_US(vle32_v, int32_t, lde_w)
-GEN_VEXT_LD_US(vle64_v, int64_t, lde_d)
+GEN_VEXT_LD_US(vle8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_US(vle16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_US(vle32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_US(vle64_v, int64_t, lde_d_tlb, lde_d_host)
 
-#define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN)                            \
+#define GEN_VEXT_ST_US(NAME, ETYPE, STORE_FN_TLB, STORE_FN_HOST)         \
 void HELPER(NAME##_mask)(void *vd, void *v0, target_ulong base,          \
                          CPURISCVState *env, uint32_t desc)              \
 {                                                                        \
     uint32_t stride = vext_nf(desc) << ctzl(sizeof(ETYPE));              \
-    vext_ldst_stride(vd, v0, base, stride, env, desc, false, STORE_FN,   \
-                     ctzl(sizeof(ETYPE)), GETPC());                      \
+    vext_ldst_stride(vd, v0, base, stride, env, desc, false,             \
+                     STORE_FN_TLB, ctzl(sizeof(ETYPE)), GETPC());        \
 }                                                                        \
                                                                          \
 void HELPER(NAME)(void *vd, void *v0, target_ulong base,                 \
                   CPURISCVState *env, uint32_t desc)                     \
 {                                                                        \
-    vext_ldst_us(vd, base, env, desc, STORE_FN,                          \
-                 ctzl(sizeof(ETYPE)), env->vl, GETPC());                 \
+    vext_ldst_us(vd, base, env, desc, STORE_FN_TLB, STORE_FN_HOST,       \
+                 ctzl(sizeof(ETYPE)), env->vl, GETPC(), false);          \
 }
 
-GEN_VEXT_ST_US(vse8_v,  int8_t,  ste_b)
-GEN_VEXT_ST_US(vse16_v, int16_t, ste_h)
-GEN_VEXT_ST_US(vse32_v, int32_t, ste_w)
-GEN_VEXT_ST_US(vse64_v, int64_t, ste_d)
+GEN_VEXT_ST_US(vse8_v,  int8_t,  ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_US(vse16_v, int16_t, ste_h_tlb, ste_h_host)
+GEN_VEXT_ST_US(vse32_v, int32_t, ste_w_tlb, ste_w_host)
+GEN_VEXT_ST_US(vse64_v, int64_t, ste_d_tlb, ste_d_host)
 
 /*
  * unit stride mask load and store, EEW = 1
@@ -346,8 +698,8 @@ void HELPER(vlm_v)(void *vd, void *v0, target_ulong base,
 {
     /* evl = ceil(vl/8) */
     uint8_t evl = (env->vl + 7) >> 3;
-    vext_ldst_us(vd, base, env, desc, lde_b,
-                 0, evl, GETPC());
+    vext_ldst_us(vd, base, env, desc, lde_b_tlb, lde_b_host,
+                 0, evl, GETPC(), true);
 }
 
 void HELPER(vsm_v)(void *vd, void *v0, target_ulong base,
@@ -355,8 +707,8 @@ void HELPER(vsm_v)(void *vd, void *v0, target_ulong base,
 {
     /* evl = ceil(vl/8) */
     uint8_t evl = (env->vl + 7) >> 3;
-    vext_ldst_us(vd, base, env, desc, ste_b,
-                 0, evl, GETPC());
+    vext_ldst_us(vd, base, env, desc, ste_b_tlb, ste_b_host,
+                 0, evl, GETPC(), false);
 }
 
 /*
@@ -381,7 +733,7 @@ static inline void
 vext_ldst_index(void *vd, void *v0, target_ulong base,
                 void *vs2, CPURISCVState *env, uint32_t desc,
                 vext_get_index_addr get_index_addr,
-                vext_ldst_elem_fn *ldst_elem,
+                vext_ldst_elem_fn_tlb *ldst_elem,
                 uint32_t log2_esz, uintptr_t ra)
 {
     uint32_t i, k;
@@ -405,7 +757,8 @@ vext_ldst_index(void *vd, void *v0, target_ulong base,
                 continue;
             }
             abi_ptr addr = get_index_addr(base, i, vs2) + (k << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
             k++;
         }
     }
@@ -422,22 +775,22 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,                   \
                     LOAD_FN, ctzl(sizeof(ETYPE)), GETPC());                \
 }
 
-GEN_VEXT_LD_INDEX(vlxei8_8_v,   int8_t,  idx_b, lde_b)
-GEN_VEXT_LD_INDEX(vlxei8_16_v,  int16_t, idx_b, lde_h)
-GEN_VEXT_LD_INDEX(vlxei8_32_v,  int32_t, idx_b, lde_w)
-GEN_VEXT_LD_INDEX(vlxei8_64_v,  int64_t, idx_b, lde_d)
-GEN_VEXT_LD_INDEX(vlxei16_8_v,  int8_t,  idx_h, lde_b)
-GEN_VEXT_LD_INDEX(vlxei16_16_v, int16_t, idx_h, lde_h)
-GEN_VEXT_LD_INDEX(vlxei16_32_v, int32_t, idx_h, lde_w)
-GEN_VEXT_LD_INDEX(vlxei16_64_v, int64_t, idx_h, lde_d)
-GEN_VEXT_LD_INDEX(vlxei32_8_v,  int8_t,  idx_w, lde_b)
-GEN_VEXT_LD_INDEX(vlxei32_16_v, int16_t, idx_w, lde_h)
-GEN_VEXT_LD_INDEX(vlxei32_32_v, int32_t, idx_w, lde_w)
-GEN_VEXT_LD_INDEX(vlxei32_64_v, int64_t, idx_w, lde_d)
-GEN_VEXT_LD_INDEX(vlxei64_8_v,  int8_t,  idx_d, lde_b)
-GEN_VEXT_LD_INDEX(vlxei64_16_v, int16_t, idx_d, lde_h)
-GEN_VEXT_LD_INDEX(vlxei64_32_v, int32_t, idx_d, lde_w)
-GEN_VEXT_LD_INDEX(vlxei64_64_v, int64_t, idx_d, lde_d)
+GEN_VEXT_LD_INDEX(vlxei8_8_v,   int8_t,  idx_b, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei8_16_v,  int16_t, idx_b, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei8_32_v,  int32_t, idx_b, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei8_64_v,  int64_t, idx_b, lde_d_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_8_v,  int8_t,  idx_h, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_16_v, int16_t, idx_h, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_32_v, int32_t, idx_h, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei16_64_v, int64_t, idx_h, lde_d_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_8_v,  int8_t,  idx_w, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_16_v, int16_t, idx_w, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_32_v, int32_t, idx_w, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei32_64_v, int64_t, idx_w, lde_d_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_8_v,  int8_t,  idx_d, lde_b_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_16_v, int16_t, idx_d, lde_h_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_32_v, int32_t, idx_d, lde_w_tlb)
+GEN_VEXT_LD_INDEX(vlxei64_64_v, int64_t, idx_d, lde_d_tlb)
 
 #define GEN_VEXT_ST_INDEX(NAME, ETYPE, INDEX_FN, STORE_FN)       \
 void HELPER(NAME)(void *vd, void *v0, target_ulong base,         \
@@ -448,22 +801,22 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,         \
                     GETPC());                                    \
 }
 
-GEN_VEXT_ST_INDEX(vsxei8_8_v,   int8_t,  idx_b, ste_b)
-GEN_VEXT_ST_INDEX(vsxei8_16_v,  int16_t, idx_b, ste_h)
-GEN_VEXT_ST_INDEX(vsxei8_32_v,  int32_t, idx_b, ste_w)
-GEN_VEXT_ST_INDEX(vsxei8_64_v,  int64_t, idx_b, ste_d)
-GEN_VEXT_ST_INDEX(vsxei16_8_v,  int8_t,  idx_h, ste_b)
-GEN_VEXT_ST_INDEX(vsxei16_16_v, int16_t, idx_h, ste_h)
-GEN_VEXT_ST_INDEX(vsxei16_32_v, int32_t, idx_h, ste_w)
-GEN_VEXT_ST_INDEX(vsxei16_64_v, int64_t, idx_h, ste_d)
-GEN_VEXT_ST_INDEX(vsxei32_8_v,  int8_t,  idx_w, ste_b)
-GEN_VEXT_ST_INDEX(vsxei32_16_v, int16_t, idx_w, ste_h)
-GEN_VEXT_ST_INDEX(vsxei32_32_v, int32_t, idx_w, ste_w)
-GEN_VEXT_ST_INDEX(vsxei32_64_v, int64_t, idx_w, ste_d)
-GEN_VEXT_ST_INDEX(vsxei64_8_v,  int8_t,  idx_d, ste_b)
-GEN_VEXT_ST_INDEX(vsxei64_16_v, int16_t, idx_d, ste_h)
-GEN_VEXT_ST_INDEX(vsxei64_32_v, int32_t, idx_d, ste_w)
-GEN_VEXT_ST_INDEX(vsxei64_64_v, int64_t, idx_d, ste_d)
+GEN_VEXT_ST_INDEX(vsxei8_8_v,   int8_t,  idx_b, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei8_16_v,  int16_t, idx_b, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei8_32_v,  int32_t, idx_b, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei8_64_v,  int64_t, idx_b, ste_d_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_8_v,  int8_t,  idx_h, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_16_v, int16_t, idx_h, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_32_v, int32_t, idx_h, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei16_64_v, int64_t, idx_h, ste_d_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_8_v,  int8_t,  idx_w, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_16_v, int16_t, idx_w, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_32_v, int32_t, idx_w, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei32_64_v, int64_t, idx_w, ste_d_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_8_v,  int8_t,  idx_d, ste_b_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_16_v, int16_t, idx_d, ste_h_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_32_v, int32_t, idx_d, ste_w_tlb)
+GEN_VEXT_ST_INDEX(vsxei64_64_v, int64_t, idx_d, ste_d_tlb)
 
 /*
  * unit-stride fault-only-fisrt load instructions
@@ -471,7 +824,7 @@ GEN_VEXT_ST_INDEX(vsxei64_64_v, int64_t, idx_d, ste_d)
 static inline void
 vext_ldff(void *vd, void *v0, target_ulong base,
           CPURISCVState *env, uint32_t desc,
-          vext_ldst_elem_fn *ldst_elem,
+          vext_ldst_elem_fn_tlb *ldst_elem,
           uint32_t log2_esz, uintptr_t ra)
 {
     void *host;
@@ -537,7 +890,8 @@ ProbeSuccess:
                 continue;
             }
             addr = base + ((i * nf + k) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
             k++;
         }
     }
@@ -554,10 +908,10 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong base,  \
               ctzl(sizeof(ETYPE)), GETPC());              \
 }
 
-GEN_VEXT_LDFF(vle8ff_v,  int8_t,  lde_b)
-GEN_VEXT_LDFF(vle16ff_v, int16_t, lde_h)
-GEN_VEXT_LDFF(vle32ff_v, int32_t, lde_w)
-GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d)
+GEN_VEXT_LDFF(vle8ff_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LDFF(vle16ff_v, int16_t, lde_h_tlb)
+GEN_VEXT_LDFF(vle32ff_v, int32_t, lde_w_tlb)
+GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d_tlb)
 
 #define DO_SWAP(N, M) (M)
 #define DO_AND(N, M)  (N & M)
@@ -574,7 +928,8 @@ GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d)
  */
 static void
 vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
-                vext_ldst_elem_fn *ldst_elem, uint32_t log2_esz, uintptr_t ra)
+                vext_ldst_elem_fn_tlb *ldst_elem, uint32_t log2_esz,
+                uintptr_t ra)
 {
     uint32_t i, k, off, pos;
     uint32_t nf = vext_nf(desc);
@@ -593,8 +948,8 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         /* load/store rest of elements of current segment pointed by vstart */
         for (pos = off; pos < max_elems; pos++, env->vstart++) {
             target_ulong addr = base + ((pos + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), pos + k * max_elems, vd,
-                      ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (pos + k * max_elems) << log2_esz, vd, ra);
         }
         k++;
     }
@@ -603,7 +958,8 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
     for (; k < nf; k++) {
         for (i = 0; i < max_elems; i++, env->vstart++) {
             target_ulong addr = base + ((i + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
+            ldst_elem(env, adjust_addr(env, addr),
+                      (i + k * max_elems) << log2_esz, vd, ra);
         }
     }
 
@@ -618,22 +974,22 @@ void HELPER(NAME)(void *vd, target_ulong base,       \
                     ctzl(sizeof(ETYPE)), GETPC());   \
 }
 
-GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d)
-GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d)
-GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d)
-GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b)
-GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h)
-GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w)
-GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d)
+GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d_tlb)
+GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d_tlb)
+GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d_tlb)
+GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b_tlb)
+GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h_tlb)
+GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w_tlb)
+GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d_tlb)
 
 #define GEN_VEXT_ST_WHOLE(NAME, ETYPE, STORE_FN)     \
 void HELPER(NAME)(void *vd, target_ulong base,       \
@@ -643,10 +999,10 @@ void HELPER(NAME)(void *vd, target_ulong base,       \
                     ctzl(sizeof(ETYPE)), GETPC());   \
 }
 
-GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b)
-GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b)
-GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b)
-GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b)
+GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b_tlb)
+GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b_tlb)
+GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b_tlb)
+GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b_tlb)
 
 /*
  * Vector Integer Arithmetic Instructions
diff --git a/target/riscv/vector_internals.h b/target/riscv/vector_internals.h
index 9e1e15b5750..f59d7d5c19f 100644
--- a/target/riscv/vector_internals.h
+++ b/target/riscv/vector_internals.h
@@ -233,4 +233,52 @@ void HELPER(NAME)(void *vd, void *v0, target_ulong s1,    \
 #define WOP_UUU_H uint32_t, uint16_t, uint16_t, uint32_t, uint32_t
 #define WOP_UUU_W uint64_t, uint32_t, uint32_t, uint64_t, uint64_t
 
+typedef struct {
+    void *host;
+    int flags;
+    MemTxAttrs attrs;
+} RVVHostPage;
+
+typedef struct {
+    /*
+     * First and last element wholly contained within the two pages.
+     * mem_off_first[0] and reg_idx_first[0] are always set >= 0.
+     * reg_idx_last[0] may be < 0 if the first element crosses pages.
+     * All of mem_off_first[1], reg_idx_first[1] and reg_idx_last[1]
+     * are set >= 0 only if there are complete elements on a second page.
+     */
+    int16_t mem_off_first[2];
+    int16_t reg_idx_first[2];
+    int16_t reg_idx_last[2];
+
+    /*
+     * One element that is misaligned and spans both pages,
+     * or -1 if there is no such active element.
+     */
+    int16_t mem_off_split;
+    int16_t reg_idx_split;
+
+    /*
+     * The byte offset at which the entire operation crosses a page boundary.
+     * Set >= 0 if and only if the entire operation spans two pages.
+     */
+    int16_t page_split;
+
+    /* TLB data for the two pages. */
+    RVVHostPage page[2];
+} RVVContLdSt;
+
+#ifdef CONFIG_USER_ONLY
+static inline void
+vext_cont_ldst_watchpoints(CPURISCVState *env, RVVContLdSt *info, uint64_t *v0,
+                           target_ulong addr, uint32_t log2_esz, bool is_load,
+                           uintptr_t ra, uint32_t desc)
+{}
+#else
+void vext_cont_ldst_watchpoints(CPURISCVState *env, RVVContLdSt *info,
+                                uint64_t *v0, target_ulong addr,
+                                uint32_t log2_esz, bool is_load, uintptr_t ra,
+                                uint32_t desc);
+#endif
+
 #endif /* TARGET_RISCV_VECTOR_INTERNALS_H */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v3 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store
  2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
@ 2024-06-13 14:19 ` Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Max Chou @ 2024-06-13 14:19 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

The vector unit-stride whole register load/store instructions are
similar to the unmasked unit-stride load/store instructions, so they
are likewise suitable for optimization with the direct host RAM access
fast path.

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 185 +++++++++++++++++++++++++----------
 1 file changed, 133 insertions(+), 52 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index d33ba5aeca1..b34d10b1b5d 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -928,81 +928,162 @@ GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d_tlb)
  */
 static void
 vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
-                vext_ldst_elem_fn_tlb *ldst_elem, uint32_t log2_esz,
-                uintptr_t ra)
+                vext_ldst_elem_fn_tlb *ldst_tlb,
+                vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
+                uintptr_t ra, bool is_load)
 {
-    uint32_t i, k, off, pos;
+    RVVContLdSt info;
+    target_ulong addr;
+    void *host;
+    int flags;
+    intptr_t reg_start, reg_last;
+    uint32_t idx_nf, off, evl;
     uint32_t nf = vext_nf(desc);
     uint32_t vlenb = riscv_cpu_cfg(env)->vlenb;
     uint32_t max_elems = vlenb >> log2_esz;
+    uint32_t esz = 1 << log2_esz;
 
     if (env->vstart >= ((vlenb * nf) >> log2_esz)) {
         env->vstart = 0;
         return;
     }
 
-    k = env->vstart / max_elems;
-    off = env->vstart % max_elems;
+    vext_cont_ldst_elements(&info, base, env->vreg, env->vstart,
+                            nf * max_elems, desc, log2_esz, true);
+    vext_cont_ldst_pages(env, &info, base, is_load, desc, esz, ra);
+    vext_cont_ldst_watchpoints(env, &info, env->vreg, base, esz, is_load, ra,
+                               desc);
+
+    flags = info.page[0].flags | info.page[1].flags;
+    if (unlikely(flags != 0)) {
+        /* At least one page includes MMIO. */
+        reg_start = info.reg_idx_first[0];
+        idx_nf = reg_start / max_elems;
+        off = reg_start % max_elems;
+        evl = (idx_nf + 1) * max_elems;
+
+        if (off) {
+            /*
+             * load/store rest of elements of current segment pointed by vstart
+             */
+            addr = base + (reg_start << log2_esz);
+            for (; reg_start < evl; reg_start++, addr += esz) {
+                ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz,
+                         vd, ra);
+            }
+            idx_nf++;
+        }
+
+        /* load/store elements for rest of segments */
+        evl = nf * max_elems;
+        addr = base + (reg_start << log2_esz);
+        for (; reg_start < evl; reg_start++, addr += esz) {
+            ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz, vd,
+                     ra);
+        }
+
+        env->vstart = 0;
+        return;
+    }
+
+    /* The entire operation is in RAM, on valid pages. */
+    reg_start = info.reg_idx_first[0];
+    reg_last = info.reg_idx_last[0] + 1;
+    host = info.page[0].host;
+    idx_nf = reg_start / max_elems;
+    off = reg_start % max_elems;
+    evl = (idx_nf + 1) * max_elems;
 
     if (off) {
         /* load/store rest of elements of current segment pointed by vstart */
-        for (pos = off; pos < max_elems; pos++, env->vstart++) {
-            target_ulong addr = base + ((pos + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr),
-                      (pos + k * max_elems) << log2_esz, vd, ra);
+        for (; reg_start < evl; reg_start++) {
+            ldst_host(vd, reg_start << log2_esz,
+                      host + (reg_start << log2_esz));
         }
-        k++;
+        idx_nf++;
     }
 
     /* load/store elements for rest of segments */
-    for (; k < nf; k++) {
-        for (i = 0; i < max_elems; i++, env->vstart++) {
-            target_ulong addr = base + ((i + k * max_elems) << log2_esz);
-            ldst_elem(env, adjust_addr(env, addr),
-                      (i + k * max_elems) << log2_esz, vd, ra);
+    for (; reg_start < reg_last; reg_start++) {
+        ldst_host(vd, reg_start << log2_esz, host + (reg_start << log2_esz));
+    }
+
+    /*
+     * Use the slow path to manage the cross-page misalignment.
+     * But we know this is RAM and cannot trap.
+     */
+    if (unlikely(info.mem_off_split >= 0)) {
+        reg_start = info.reg_idx_split;
+        addr = base + (reg_start << log2_esz);
+        ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz, vd, ra);
+    }
+
+    if (unlikely(info.mem_off_first[1] >= 0)) {
+        reg_start = info.reg_idx_first[1];
+        reg_last = info.reg_idx_last[1] + 1;
+        host = info.page[1].host;
+        idx_nf = reg_start / max_elems;
+        off = reg_start % max_elems;
+        evl = (idx_nf + 1) * max_elems;
+
+        if (off) {
+            /*
+             * load/store rest of elements of current segment pointed by vstart
+             */
+            for (; reg_start < evl; reg_start++) {
+                ldst_host(vd, reg_start << log2_esz,
+                          host + (reg_start << log2_esz));
+            }
+            idx_nf++;
+        }
+
+        /* load/store elements for rest of segments */
+        for (; reg_start < reg_last; reg_start++) {
+            ldst_host(vd, reg_start << log2_esz,
+                      host + (reg_start << log2_esz));
         }
     }
 
     env->vstart = 0;
 }
 
-#define GEN_VEXT_LD_WHOLE(NAME, ETYPE, LOAD_FN)      \
-void HELPER(NAME)(void *vd, target_ulong base,       \
-                  CPURISCVState *env, uint32_t desc) \
-{                                                    \
-    vext_ldst_whole(vd, base, env, desc, LOAD_FN,    \
-                    ctzl(sizeof(ETYPE)), GETPC());   \
-}
-
-GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d_tlb)
-GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d_tlb)
-GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d_tlb)
-GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b_tlb)
-GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h_tlb)
-GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w_tlb)
-GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d_tlb)
-
-#define GEN_VEXT_ST_WHOLE(NAME, ETYPE, STORE_FN)     \
-void HELPER(NAME)(void *vd, target_ulong base,       \
-                  CPURISCVState *env, uint32_t desc) \
-{                                                    \
-    vext_ldst_whole(vd, base, env, desc, STORE_FN,   \
-                    ctzl(sizeof(ETYPE)), GETPC());   \
-}
-
-GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b_tlb)
-GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b_tlb)
-GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b_tlb)
-GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b_tlb)
+#define GEN_VEXT_LD_WHOLE(NAME, ETYPE, LOAD_FN_TLB, LOAD_FN_HOST)   \
+void HELPER(NAME)(void *vd, target_ulong base, CPURISCVState *env,  \
+                  uint32_t desc)                                    \
+{                                                                   \
+    vext_ldst_whole(vd, base, env, desc, LOAD_FN_TLB, LOAD_FN_HOST, \
+                    ctzl(sizeof(ETYPE)), GETPC(), true);            \
+}
+
+GEN_VEXT_LD_WHOLE(vl1re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl1re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl1re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl1re64_v, int64_t, lde_d_tlb, lde_d_host)
+GEN_VEXT_LD_WHOLE(vl2re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl2re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl2re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl2re64_v, int64_t, lde_d_tlb, lde_d_host)
+GEN_VEXT_LD_WHOLE(vl4re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl4re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl4re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl4re64_v, int64_t, lde_d_tlb, lde_d_host)
+GEN_VEXT_LD_WHOLE(vl8re8_v,  int8_t,  lde_b_tlb, lde_b_host)
+GEN_VEXT_LD_WHOLE(vl8re16_v, int16_t, lde_h_tlb, lde_h_host)
+GEN_VEXT_LD_WHOLE(vl8re32_v, int32_t, lde_w_tlb, lde_w_host)
+GEN_VEXT_LD_WHOLE(vl8re64_v, int64_t, lde_d_tlb, lde_d_host)
+
+#define GEN_VEXT_ST_WHOLE(NAME, ETYPE, STORE_FN_TLB, STORE_FN_HOST)     \
+void HELPER(NAME)(void *vd, target_ulong base, CPURISCVState *env,      \
+                  uint32_t desc)                                        \
+{                                                                       \
+    vext_ldst_whole(vd, base, env, desc, STORE_FN_TLB, STORE_FN_HOST,   \
+                    ctzl(sizeof(ETYPE)), GETPC(), false);               \
+}
+
+GEN_VEXT_ST_WHOLE(vs1r_v, int8_t, ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_WHOLE(vs2r_v, int8_t, ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_WHOLE(vs4r_v, int8_t, ste_b_tlb, ste_b_host)
+GEN_VEXT_ST_WHOLE(vs8r_v, int8_t, ste_b_tlb, ste_b_host)
 
 /*
  * Vector Integer Arithmetic Instructions
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v3 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions
  2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
                   ` (2 preceding siblings ...)
  2024-06-13 14:19 ` [RFC PATCH v3 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store Max Chou
@ 2024-06-13 14:19 ` Max Chou
  2024-06-13 14:19 ` [RFC PATCH v3 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
  2024-06-13 15:42 ` [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Daniel Henrique Barboza
  5 siblings, 0 replies; 7+ messages in thread
From: Max Chou @ 2024-06-13 14:19 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

The vector unmasked unit-stride and whole register load/store
instructions access contiguous memory. If the host and guest
architectures share the same endianness, the per-element loads/stores
can be grouped so that more data is transferred per memory access.
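
A minimal standalone sketch of that grouping idea (illustration only, not
part of the patch; the helper name and parameters below are hypothetical):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy one group of contiguous element bytes between the guest register
     * file and host RAM, using the widest width that still fits in the
     * remaining length. Returns the number of bytes handled so the caller
     * can advance its offsets. */
    static size_t copy_group(uint8_t *reg, uint8_t *host, size_t remaining,
                             bool is_load)
    {
        size_t width = remaining >= 8 ? 8 :
                       remaining >= 4 ? 4 :
                       remaining >= 2 ? 2 : 1;

        if (is_load) {
            memcpy(reg, host, width);   /* guest register file <- host RAM */
        } else {
            memcpy(host, reg, width);   /* host RAM <- guest register file */
        }
        return width;
    }

This shortcut only holds when host and guest endianness match; otherwise the
elements still have to be loaded/stored one at a time with byte swapping.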

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 160 +++++++++++++++++++++++++----------
 1 file changed, 117 insertions(+), 43 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index b34d10b1b5d..09c9b231c3f 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -450,6 +450,69 @@ GEN_VEXT_ST_ELEM(ste_h, uint16_t, H2, stw)
 GEN_VEXT_ST_ELEM(ste_w, uint32_t, H4, stl)
 GEN_VEXT_ST_ELEM(ste_d, uint64_t, H8, stq)
 
+static inline uint32_t
+vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
+                     uint32_t byte_offset, void *host, uint32_t esz,
+                     bool is_load)
+{
+    uint32_t group_size;
+    static vext_ldst_elem_fn_host * const fns[2][4] = {
+        /* Store */
+        { ste_b_host, ste_h_host, ste_w_host, ste_d_host },
+        /* Load */
+        { lde_b_host, lde_h_host, lde_w_host, lde_d_host }
+    };
+    vext_ldst_elem_fn_host *fn;
+
+    if (byte_offset + 8 < byte_end) {
+        group_size = MO_64;
+    } else if (byte_offset + 4 < byte_end) {
+        group_size = MO_32;
+    } else if (byte_offset + 2 < byte_end) {
+        group_size = MO_16;
+    } else {
+        group_size = MO_8;
+    }
+
+    fn = fns[is_load][group_size];
+    fn(vd, byte_offset, host + byte_offset);
+
+    return 1 << group_size;
+}
+
+static inline void
+vext_continus_ldst_tlb(CPURISCVState *env, vext_ldst_elem_fn_tlb *ldst_tlb,
+                       void *vd, uint32_t evl, target_ulong addr,
+                       uint32_t reg_start, uintptr_t ra, uint32_t esz,
+                       bool is_load)
+{
+    for (; reg_start < evl; reg_start++, addr += esz) {
+        ldst_tlb(env, adjust_addr(env, addr), reg_start * esz, vd, ra);
+    }
+}
+
+static inline void
+vext_continus_ldst_host(CPURISCVState *env, vext_ldst_elem_fn_host *ldst_host,
+                        void *vd, uint32_t evl, uint32_t reg_start, void *host,
+                        uint32_t esz, bool is_load)
+{
+#if TARGET_BIG_ENDIAN != HOST_BIG_ENDIAN
+    for (; reg_start < evl; reg_start++) {
+        uint32_t byte_off = reg_start * esz;
+        ldst_host(vd, byte_off, host + byte_off);
+    }
+#else
+    uint32_t group_byte;
+    uint32_t byte_start = reg_start * esz;
+    uint32_t byte_end = evl * esz;
+    while (byte_start < byte_end) {
+        group_byte = vext_group_ldst_host(env, vd, byte_end, byte_start, host,
+                                          esz, is_load);
+        byte_start += group_byte;
+    }
+#endif
+}
+
 static void vext_set_tail_elems_1s(target_ulong vl, void *vd,
                                    uint32_t desc, uint32_t nf,
                                    uint32_t esz, uint32_t max_elems)
@@ -548,6 +611,7 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              uint32_t evl, uintptr_t ra, bool is_load)
 {
     RVVContLdSt info;
+    target_ulong addr;
     void *host;
     int flags;
     intptr_t reg_start, reg_last;
@@ -580,13 +644,19 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         }
         reg_last += 1;
 
-        for (i = reg_start; i < reg_last; ++i) {
-            k = 0;
-            while (k < nf) {
-                target_ulong addr = base + ((i * nf + k) << log2_esz);
-                ldst_tlb(env, adjust_addr(env, addr),
-                         (i + k * max_elems) << log2_esz, vd, ra);
-                k++;
+        if (nf == 1) {
+            addr = base + reg_start * esz;
+            vext_continus_ldst_tlb(env, ldst_tlb, vd, reg_last, addr,
+                                   reg_start, ra, esz, is_load);
+        } else {
+            for (i = reg_start; i < reg_last; ++i) {
+                k = 0;
+                while (k < nf) {
+                    addr = base + ((i * nf + k) * esz);
+                    ldst_tlb(env, adjust_addr(env, addr),
+                             (i + k * max_elems) << log2_esz, vd, ra);
+                    k++;
+                }
             }
         }
 
@@ -600,12 +670,17 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
     reg_last = info.reg_idx_last[0] + 1;
     host = info.page[0].host;
 
-    for (i = reg_start; i < reg_last; ++i) {
-        k = 0;
-        while (k < nf) {
-            ldst_host(vd, (i + k * max_elems) << log2_esz,
-                      host + ((i * nf + k) << log2_esz));
-            k++;
+    if (nf == 1) {
+        vext_continus_ldst_host(env, ldst_host, vd, reg_last, reg_start, host,
+                                esz, is_load);
+    } else {
+        for (i = reg_start; i < reg_last; ++i) {
+            k = 0;
+            while (k < nf) {
+                ldst_host(vd, (i + k * max_elems) << log2_esz,
+                          host + ((i * nf + k) * esz));
+                k++;
+            }
         }
     }
 
@@ -617,7 +692,7 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         reg_start = info.reg_idx_split;
         k = 0;
         while (k < nf) {
-            target_ulong addr = base + ((reg_start * nf + k) << log2_esz);
+            addr = base + ((reg_start * nf + k) << log2_esz);
             ldst_tlb(env, adjust_addr(env, addr),
                      (reg_start + k * max_elems) << log2_esz, vd, ra);
             k++;
@@ -629,12 +704,17 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
         reg_last = info.reg_idx_last[1] + 1;
         host = info.page[1].host;
 
-        for (i = reg_start; i < reg_last; ++i) {
-            k = 0;
-            while (k < nf) {
-                ldst_host(vd, (i + k * max_elems) << log2_esz,
-                          host + ((i * nf + k) << log2_esz));
-                k++;
+        if (nf == 1) {
+            vext_continus_ldst_host(env, ldst_host, vd, reg_last, reg_start,
+                                    host, esz, is_load);
+        } else {
+            for (i = reg_start; i < reg_last; ++i) {
+                k = 0;
+                while (k < nf) {
+                    ldst_host(vd, (i + k * max_elems) << log2_esz,
+                              host + ((i * nf + k) << log2_esz));
+                    k++;
+                }
             }
         }
     }
@@ -967,20 +1047,17 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              * load/store rest of elements of current segment pointed by vstart
              */
             addr = base + (reg_start << log2_esz);
-            for (; reg_start < evl; reg_start++, addr += esz) {
-                ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz,
-                         vd, ra);
-            }
+            vext_continus_ldst_tlb(env, ldst_tlb, vd, evl, addr, reg_start, ra,
+                                   esz, is_load);
             idx_nf++;
         }
 
         /* load/store elements for rest of segments */
         evl = nf * max_elems;
         addr = base + (reg_start << log2_esz);
-        for (; reg_start < evl; reg_start++, addr += esz) {
-            ldst_tlb(env, adjust_addr(env, addr), reg_start << log2_esz, vd,
-                     ra);
-        }
+        reg_start = idx_nf * max_elems;
+        vext_continus_ldst_tlb(env, ldst_tlb, vd, evl, addr, reg_start, ra,
+                               esz, is_load);
 
         env->vstart = 0;
         return;
@@ -996,17 +1073,16 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
 
     if (off) {
         /* load/store rest of elements of current segment pointed by vstart */
-        for (; reg_start < evl; reg_start++) {
-            ldst_host(vd, reg_start << log2_esz,
-                      host + (reg_start << log2_esz));
-        }
+        vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host, esz,
+                                is_load);
         idx_nf++;
     }
 
     /* load/store elements for rest of segments */
-    for (; reg_start < reg_last; reg_start++) {
-        ldst_host(vd, reg_start << log2_esz, host + (reg_start << log2_esz));
-    }
+    evl = reg_last;
+    reg_start = idx_nf * max_elems;
+    vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host, esz,
+                            is_load);
 
     /*
      * Use the slow path to manage the cross-page misalignment.
@@ -1030,18 +1106,16 @@ vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
             /*
              * load/store rest of elements of current segment pointed by vstart
              */
-            for (; reg_start < evl; reg_start++) {
-                ldst_host(vd, reg_start << log2_esz,
-                          host + (reg_start << log2_esz));
-            }
+            vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host,
+                                    esz, is_load);
             idx_nf++;
         }
 
         /* load/store elements for rest of segments */
-        for (; reg_start < reg_last; reg_start++) {
-            ldst_host(vd, reg_start << log2_esz,
-                      host + (reg_start << log2_esz));
-        }
+        evl = reg_last;
+        reg_start = idx_nf * max_elems;
+        vext_continus_ldst_host(env, ldst_host, vd, evl, reg_start, host, esz,
+                                is_load);
     }
 
     env->vstart = 0;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [RFC PATCH v3 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance
  2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
                   ` (3 preceding siblings ...)
  2024-06-13 14:19 ` [RFC PATCH v3 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
@ 2024-06-13 14:19 ` Max Chou
  2024-06-13 15:42 ` [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Daniel Henrique Barboza
  5 siblings, 0 replies; 7+ messages in thread
From: Max Chou @ 2024-06-13 14:19 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Daniel Henrique Barboza,
	Liu Zhiwei, Max Chou

In the vector unit-stride load/store helpers, the vext_ldst_us and
vext_ldst_whole functions account for most of the execution time.
Inlining them (and the per-element functions they call) avoids the
function call overhead and improves the helper performance.
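
A minimal sketch of the idea (illustration only, not the patch itself; the
helper name is hypothetical, and QEMU_ALWAYS_INLINE is assumed to expand to
the always_inline function attribute):

    #include <stdint.h>
    #include <string.h>

    /* A tiny, hot per-element helper: forcing it inline removes one
     * call/return per element from the load/store loop. */
    static inline __attribute__((always_inline))
    void load_elem_w(uint8_t *reg, uint32_t byte_off, const void *host)
    {
        memcpy(reg + byte_off, host, sizeof(uint32_t));
    }

Because such helpers run once per element inside tight loops, the call
overhead is paid for every element; inlining folds the body into the loop and
lets the compiler optimize across it.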

Signed-off-by: Max Chou <max.chou@sifive.com>
---
 target/riscv/vector_helper.c | 64 +++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 30 deletions(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 09c9b231c3f..4a21064a366 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -408,20 +408,22 @@ typedef void vext_ldst_elem_fn_tlb(CPURISCVState *env, abi_ptr addr,
                                    uint32_t idx, void *vd, uintptr_t retaddr);
 typedef void vext_ldst_elem_fn_host(void *vd, uint32_t idx, void *host);
 
-#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                         \
-static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
-                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
-{                                                                       \
-    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
-    ETYPE *cur = ((ETYPE *)reg);                                        \
-    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);                   \
-}                                                                       \
-                                                                        \
-static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
-{                                                                       \
-    ETYPE val = LDSUF##_p(host);                                        \
-    uint8_t *reg = (uint8_t *)(vd + byte_off);                          \
-    *(ETYPE *)(reg) = val;                                              \
+#define GEN_VEXT_LD_ELEM(NAME, ETYPE, H, LDSUF)                 \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_tlb(CPURISCVState *env, abi_ptr addr,               \
+                uint32_t byte_off, void *vd, uintptr_t retaddr) \
+{                                                               \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                  \
+    ETYPE *cur = ((ETYPE *)reg);                                \
+    *cur = cpu_##LDSUF##_data_ra(env, addr, retaddr);           \
+}                                                               \
+                                                                \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_host(void *vd, uint32_t byte_off, void *host)       \
+{                                                               \
+    ETYPE val = LDSUF##_p(host);                                \
+    uint8_t *reg = (uint8_t *)(vd + byte_off);                  \
+    *(ETYPE *)(reg) = val;                                      \
 }
 
 GEN_VEXT_LD_ELEM(lde_b, uint8_t,  H1, ldub)
@@ -429,20 +431,22 @@ GEN_VEXT_LD_ELEM(lde_h, uint16_t, H2, lduw)
 GEN_VEXT_LD_ELEM(lde_w, uint32_t, H4, ldl)
 GEN_VEXT_LD_ELEM(lde_d, uint64_t, H8, ldq)
 
-#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)                         \
-static void NAME##_tlb(CPURISCVState *env, abi_ptr addr,                \
-                       uint32_t byte_off, void *vd, uintptr_t retaddr)  \
-{                                                                       \
-    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
-    ETYPE data = *((ETYPE *)reg);                                       \
-    cpu_##STSUF##_data_ra(env, addr, data, retaddr);                    \
-}                                                                       \
-                                                                        \
-static void NAME##_host(void *vd, uint32_t byte_off, void *host)        \
-{                                                                       \
-    uint8_t *reg = ((uint8_t *)vd + byte_off);                          \
-    ETYPE val = *(ETYPE *)(reg);                                        \
-    STSUF##_p(host, val);                                               \
+#define GEN_VEXT_ST_ELEM(NAME, ETYPE, H, STSUF)                 \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_tlb(CPURISCVState *env, abi_ptr addr,               \
+                uint32_t byte_off, void *vd, uintptr_t retaddr) \
+{                                                               \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                  \
+    ETYPE data = *((ETYPE *)reg);                               \
+    cpu_##STSUF##_data_ra(env, addr, data, retaddr);            \
+}                                                               \
+                                                                \
+static inline QEMU_ALWAYS_INLINE                                \
+void NAME##_host(void *vd, uint32_t byte_off, void *host)       \
+{                                                               \
+    uint8_t *reg = ((uint8_t *)vd + byte_off);                  \
+    ETYPE val = *(ETYPE *)(reg);                                \
+    STSUF##_p(host, val);                                       \
 }
 
 GEN_VEXT_ST_ELEM(ste_b, uint8_t,  H1, stb)
@@ -604,7 +608,7 @@ GEN_VEXT_ST_STRIDE(vsse64_v, int64_t, ste_d_tlb)
  */
 
 /* unmasked unit-stride load and store operation */
-static void
+static inline QEMU_ALWAYS_INLINE void
 vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
              vext_ldst_elem_fn_tlb *ldst_tlb,
              vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
@@ -1006,7 +1010,7 @@ GEN_VEXT_LDFF(vle64ff_v, int64_t, lde_d_tlb)
 /*
  * load and store whole register instructions
  */
-static void
+static inline QEMU_ALWAYS_INLINE void
 vext_ldst_whole(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
                 vext_ldst_elem_fn_tlb *ldst_tlb,
                 vext_ldst_elem_fn_host *ldst_host, uint32_t log2_esz,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions
  2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
                   ` (4 preceding siblings ...)
  2024-06-13 14:19 ` [RFC PATCH v3 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
@ 2024-06-13 15:42 ` Daniel Henrique Barboza
  5 siblings, 0 replies; 7+ messages in thread
From: Daniel Henrique Barboza @ 2024-06-13 15:42 UTC (permalink / raw)
  To: Max Chou, qemu-devel, qemu-riscv
  Cc: Richard Henderson, Paolo Bonzini, Palmer Dabbelt,
	Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei

Richard,

On 6/13/24 11:19 AM, Max Chou wrote:
> Hi,
> 
> This RFC patch set tries to fix the issue of
> https://gitlab.com/qemu-project/qemu/-/issues/2137.

To avoid confusion about what we're doing here: this series is another
optimization effort that Max is doing for the RISC-V vector code. We're also
working on optimizing the front-end load/store ops, as discussed in that bug a
few months ago.

This series is not meant to be a replacement for or an alternative to the
approach we've discussed. We're aiming for both.


Thanks,

Daniel


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-06-13 15:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-13 14:19 [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Max Chou
2024-06-13 14:19 ` [RFC PATCH v3 1/5] accel/tcg: Avoid unnecessary call overhead from qemu_plugin_vcpu_mem_cb Max Chou
2024-06-13 14:19 ` [RFC PATCH v3 2/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unmasked unit-stride load/store Max Chou
2024-06-13 14:19 ` [RFC PATCH v3 3/5] target/riscv: rvv: Provide a fast path using direct access to host ram for unit-stride whole register load/store Max Chou
2024-06-13 14:19 ` [RFC PATCH v3 4/5] target/riscv: rvv: Provide group continuous ld/st flow for unit-stride ld/st instructions Max Chou
2024-06-13 14:19 ` [RFC PATCH v3 5/5] target/riscv: Inline unit-stride ld/st and corresponding functions for performance Max Chou
2024-06-13 15:42 ` [RFC PATCH v3 0/5] Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions Daniel Henrique Barboza
