qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [RFC v3 0/2] target/riscv: add endianness checks and atomicity guarantees.
@ 2024-10-14 22:01 Paolo Savini
  2024-10-14 22:01 ` [RFC v3 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
  2024-10-14 22:01 ` [RFC v3 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
  0 siblings, 2 replies; 4+ messages in thread
From: Paolo Savini @ 2024-10-14 22:01 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Paolo Savini, Richard Handerson, Palmer Dabbelt, Alistair Francis,
	Bin Meng, Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei,
	Helene Chelin, Nathan Egge, Max Chou

This version 3 of the patch adds endianness safety to both the optimizations
brought by the patch set.
It also adds some conditions that allow the __builtin_memcpy to be executed
on chunks of 16 bytes with guarantee of atomicity.

Changes from V2:
- patch 1:
  - add condition for the host not to be big endian.
- patch 2:
  - add condition for the host not to be big endian.
  - add condition for the host to support 16-byte atomic memory operations.
  - limit the large loads and stores to 16 byte chunks in order to guarantee
    atomicity on a larger range of processors.

Cc: Richard Handerson <richard.henderson@linaro.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Bin Meng <bmeng.cn@gmail.com>
Cc: Weiwei Li <liwei1518@gmail.com>
Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Cc: Helene Chelin <helene.chelin@embecosm.com>
Cc: Nathan Egge <negge@google.com>
Cc: Max Chou <max.chou@sifive.com>

Helene CHELIN (1):
  target/riscv: rvv: reduce the overhead for simple RISC-V vector
    unit-stride loads and stores

Paolo Savini (1):
  target/riscv: rvv: improve performance of RISC-V vector loads and
    stores on large amounts of data.

 target/riscv/vector_helper.c | 61 +++++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 4+ messages in thread

* [RFC v3 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores
  2024-10-14 22:01 [RFC v3 0/2] target/riscv: add endianness checks and atomicity guarantees Paolo Savini
@ 2024-10-14 22:01 ` Paolo Savini
  2024-10-14 22:01 ` [RFC v3 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
  1 sibling, 0 replies; 4+ messages in thread
From: Paolo Savini @ 2024-10-14 22:01 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Paolo Savini, Richard Handerson, Palmer Dabbelt, Alistair Francis,
	Bin Meng, Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei,
	Helene Chelin, Nathan Egge, Max Chou

From: Helene CHELIN <helene.chelin@embecosm.com>

This patch improves the performance of the emulation of the RVV unit-stride
loads and stores in the following cases:

- when the data being loaded/stored per iteration amounts to 8 bytes or less.
- when the vector length is 16 bytes (VLEN=128) and there's no grouping of the
  vector registers (LMUL=1).

The optimization consists of avoiding the overhead of probing the RAM of the
host machine and doing a loop load/store on the input data grouped in chunks
of as many bytes as possible (8,4,2,1 bytes).

Co-authored-by: Helene CHELIN <helene.chelin@embecosm.com>
Co-authored-by: Paolo Savini <paolo.savini@embecosm.com>

Signed-off-by: Helene CHELIN <helene.chelin@embecosm.com>
---
 target/riscv/vector_helper.c | 47 ++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 4479726acf..75c24653f0 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -635,6 +635,53 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
 
     VSTART_CHECK_EARLY_EXIT(env);
 
+#if defined(CONFIG_USER_ONLY) && !HOST_BIG_ENDIAN
+    /* For data sizes <= 64 bits and for LMUL=1 with VLEN=128 bits we get a
+     * better performance by doing a simple simulation of the load/store
+     * without the overhead of prodding the host RAM */
+    if ((nf == 1) && ((evl << log2_esz) <= 8 ||
+	((vext_lmul(desc) == 0) && (simd_maxsz(desc) == 16)))) {
+
+	uint32_t evl_b = evl << log2_esz;
+
+        for (uint32_t j = env->vstart; j < evl_b;) {
+	    addr = base + j;
+            if ((evl_b - j) >= 8) {
+                if (is_load)
+                    lde_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 8;
+            }
+            else if ((evl_b - j) >= 4) {
+                if (is_load)
+                    lde_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 4;
+            }
+            else if ((evl_b - j) >= 2) {
+                if (is_load)
+                    lde_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 2;
+            }
+            else {
+                if (is_load)
+                    lde_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 1;
+            }
+        }
+
+        env->vstart = 0;
+        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
+        return;
+    }
+#endif
+
     vext_cont_ldst_elements(&info, base, env->vreg, env->vstart, evl, desc,
                             log2_esz, false);
     /* Probe the page(s).  Exit with exception for any invalid page. */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [RFC v3 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-10-14 22:01 [RFC v3 0/2] target/riscv: add endianness checks and atomicity guarantees Paolo Savini
  2024-10-14 22:01 ` [RFC v3 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
@ 2024-10-14 22:01 ` Paolo Savini
  2024-10-14 23:11   ` Richard Henderson
  1 sibling, 1 reply; 4+ messages in thread
From: Paolo Savini @ 2024-10-14 22:01 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Paolo Savini, Richard Handerson, Palmer Dabbelt, Alistair Francis,
	Bin Meng, Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei,
	Helene Chelin, Nathan Egge, Max Chou

This patch optimizes the emulation of unit-stride load/store RVV instructions
when the data being loaded/stored per iteration amounts to 64 bytes or more.
The optimization consists of calling __builtin_memcpy on chunks of data of 128
bytes between the memory address of the simulated vector register and the
destination memory address and vice versa.
This is done only if we have direct access to the RAM of the host machine,
if the host is little endiand and if it supports atomic 128 bit memory
operations.

Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
---
 target/riscv/vector_helper.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 75c24653f0..b3d0be8e39 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -488,7 +488,19 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
     }
 
     fn = fns[is_load][group_size];
-    fn(vd, byte_offset, host + byte_offset);
+
+    /* x86 and AMD processors provide strong guarantees of atomicity for
+     * 16-byte memory operations if the memory operands are 16-byte aligned */
+    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) && ((byte_offset % 16) == 0) &&
+        ((cpuinfo & (CPUINFO_ATOMIC_VMOVDQA | CPUINFO_ATOMIC_VMOVDQU)) != 0)) {
+      group_size = MO_128;
+      if (is_load)
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
+      else
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
+    } else {
+      fn(vd, byte_offset, host + byte_offset);
+    }
 
     return 1 << group_size;
 }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC v3 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-10-14 22:01 ` [RFC v3 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
@ 2024-10-14 23:11   ` Richard Henderson
  0 siblings, 0 replies; 4+ messages in thread
From: Richard Henderson @ 2024-10-14 23:11 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei, Helene Chelin, Nathan Egge,
	Max Chou

On 10/14/24 15:01, Paolo Savini wrote:
> This patch optimizes the emulation of unit-stride load/store RVV instructions
> when the data being loaded/stored per iteration amounts to 64 bytes or more.
> The optimization consists of calling __builtin_memcpy on chunks of data of 128
> bytes between the memory address of the simulated vector register and the
> destination memory address and vice versa.
> This is done only if we have direct access to the RAM of the host machine,
> if the host is little endiand and if it supports atomic 128 bit memory
> operations.
> 
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 75c24653f0..b3d0be8e39 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -488,7 +488,19 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>       }
>   
>       fn = fns[is_load][group_size];
> -    fn(vd, byte_offset, host + byte_offset);
> +
> +    /* x86 and AMD processors provide strong guarantees of atomicity for
> +     * 16-byte memory operations if the memory operands are 16-byte aligned */
> +    if (!HOST_BIG_ENDIAN && (byte_offset + 16 < byte_end) && ((byte_offset % 16) == 0) &&
> +        ((cpuinfo & (CPUINFO_ATOMIC_VMOVDQA | CPUINFO_ATOMIC_VMOVDQU)) != 0)) {
> +      group_size = MO_128;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
> +    } else {

This will not compile on anything other than x86.
Moreover, your comment about vmovdqa bears no relation to __builtin_memcpy.


r~


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-10-14 23:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-14 22:01 [RFC v3 0/2] target/riscv: add endianness checks and atomicity guarantees Paolo Savini
2024-10-14 22:01 ` [RFC v3 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
2024-10-14 22:01 ` [RFC v3 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
2024-10-14 23:11   ` Richard Henderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).