[RFC 0/2] Improve the performance of unit-stride RVV ld/st on

qemu-devel.nongnu.org archive mirror

* [RFC 0/2] Improve the performance of unit-stride RVV ld/st on
@ 2024-07-17 15:30 Paolo Savini
  2024-07-17 15:30 ` [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Paolo Savini @ 2024-07-17 15:30 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Paolo Savini, Richard Henderson, Palmer Dabbelt, Alistair Francis,
	Bin Meng, Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei,
	Helene Chelin, Max Chou

This series of patches builds on top of Max Chou's patches:

https://lore.kernel.org/all/20240613175122.1299212-1-max.chou@sifive.com/

The aim of these patches is to improve the performance of QEMU emulation
of RVV unit-stride load and store instructions in the following cases:

1. when the data being loaded/stored per iteration amounts to 8 bytes or less.
2. when the vector length is 16 bytes (VLEN=128) and there is no grouping of the
   vector registers (LMUL=1).
3. when the data being loaded/stored per iteration is more than 64 bytes.

In the first two cases the optimization consists of avoiding the
overhead of probing the RAM of the host machine and performing a simple loop
load/store on the data grouped in chunks of as many bytes as possible (8, 4, 2 or 1).
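
The chunking described above can be sketched in isolation as follows. This
is only an illustration, not the patch itself: next_chunk_size() and
split_into_chunks() are hypothetical names, and the real loop in
vext_ldst_us calls the lde_*_tlb/ste_*_tlb helpers instead of recording
sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): size in bytes of the next
 * access when `remaining` bytes are left, preferring the largest of
 * 8, 4, 2 or 1 that still fits.
 */
static uint32_t next_chunk_size(uint32_t remaining)
{
    assert(remaining > 0);
    if (remaining >= 8) {
        return 8;
    } else if (remaining >= 4) {
        return 4;
    } else if (remaining >= 2) {
        return 2;
    }
    return 1;
}

/*
 * Walk a transfer of `total` bytes starting at byte offset `start` and
 * record each chunk size in `out` (at most `out_len` entries are written).
 * Returns the number of chunks the walk needs.
 */
static size_t split_into_chunks(uint32_t start, uint32_t total,
                                uint32_t *out, size_t out_len)
{
    size_t n = 0;

    for (uint32_t j = start; j < total;) {
        uint32_t sz = next_chunk_size(total - j);

        if (n < out_len) {
            out[n] = sz;
        }
        n++;
        j += sz;
    }
    return n;
}
```

For a 15-byte transfer starting at offset 0 the walk yields the chunk sizes
8, 4, 2, 1; for a full 16-byte register it yields 8, 8.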

The third case is optimized by calling the __builtin_memcpy function on
data chunks of 128 bits and 256 bits (16 and 32 bytes) at a time.

These patches on top of Max Chou's patches have been tested with SPEC
CPU 2017 and achieve an average reduction of 13% of the time needed by
QEMU for running the benchmarks compared with the master branch of QEMU.

You can find the source code being developed here: https://github.com/embecosm/rise-rvv-tcg-qemu
and regular updates and more statistics about the patch here: https://github.com/embecosm/rise-rvv-tcg-qemu-reports

Changes:
- patch 1:
  - Modify vext_ldst_us to run the simple loop load/store if we
    are in one of the two cases above.
- patch 2:
  - Modify vext_group_ldst_host to use __builtin_memcpy for data sizes
    of 128 bits and above.

Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Alistair Francis <alistair.francis@wdc.com>
Cc: Bin Meng <bmeng.cn@gmail.com>
Cc: Weiwei Li <liwei1518@gmail.com>
Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
Cc: Helene Chelin <helene.chelin@embecosm.com>
Cc: Max Chou <max.chou@sifive.com>

Helene CHELIN (1):
  target/riscv: rvv: reduce the overhead for simple RISC-V vector
    unit-stride loads and stores

Paolo Savini (1):
  target/riscv: rvv: improve performance of RISC-V vector loads and
    stores on large amounts of data.

 target/riscv/vector_helper.c | 63 +++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

-- 
2.17.1




* [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores
  2024-07-17 15:30 [RFC 0/2] Improve the performance of unit-stride RVV ld/st on Paolo Savini
@ 2024-07-17 15:30 ` Paolo Savini
  2024-07-26 12:22   ` Daniel Henrique Barboza
  2024-07-27  7:13   ` Richard Henderson
  2024-07-17 15:30 ` [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
  2024-07-26 12:31 ` [RFC 0/2] Improve the performance of unit-stride RVV ld/st on Daniel Henrique Barboza
  2 siblings, 2 replies; 11+ messages in thread
From: Paolo Savini @ 2024-07-17 15:30 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Paolo Savini, Richard Henderson, Palmer Dabbelt, Alistair Francis,
	Bin Meng, Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei,
	Helene Chelin, Max Chou

From: Helene CHELIN <helene.chelin@embecosm.com>

This patch improves the performance of the emulation of the RVV unit-stride
loads and stores in the following cases:

- when the data being loaded/stored per iteration amounts to 8 bytes or less.
- when the vector length is 16 bytes (VLEN=128) and there's no grouping of the
  vector registers (LMUL=1).

The optimization consists of avoiding the overhead of probing the RAM of the
host machine and doing a loop load/store on the input data grouped in chunks
of as many bytes as possible (8,4,2,1 bytes).

Co-authored-by: Helene CHELIN <helene.chelin@embecosm.com>
Co-authored-by: Paolo Savini <paolo.savini@embecosm.com>

Signed-off-by: Helene CHELIN <helene.chelin@embecosm.com>
---
 target/riscv/vector_helper.c | 46 ++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 29849a8b66..4b444c6bc5 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -633,6 +633,52 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
 
     VSTART_CHECK_EARLY_EXIT(env);
 
+    /* For data sizes <= 64 bits and for LMUL=1 with VLEN=128 bits we get a
+     * better performance by doing a simple simulation of the load/store
+     * without the overhead of prodding the host RAM */
+    if ((nf == 1) && ((evl << log2_esz) <= 8 ||
+	((vext_lmul(desc) == 0) && (simd_maxsz(desc) == 16)))) {
+
+	uint32_t evl_b = evl << log2_esz;
+
+        for (uint32_t j = env->vstart; j < evl_b;) {
+	    addr = base + j;
+            if ((evl_b - j) >= 8) {
+                if (is_load)
+                    lde_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 8;
+            }
+            else if ((evl_b - j) >= 4) {
+                if (is_load)
+                    lde_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 4;
+            }
+            else if ((evl_b - j) >= 2) {
+                if (is_load)
+                    lde_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 2;
+            }
+            else {
+                if (is_load)
+                    lde_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                else
+                    ste_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
+                j += 1;
+            }
+        }
+
+        env->vstart = 0;
+        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
+        return;
+    }
+
+
     vext_cont_ldst_elements(&info, base, env->vreg, env->vstart, evl, desc,
                             log2_esz, false);
     /* Probe the page(s).  Exit with exception for any invalid page. */
-- 
2.17.1




* Re: [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores
  2024-07-17 15:30 ` [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
@ 2024-07-26 12:22   ` Daniel Henrique Barboza
  2024-07-27  7:13   ` Richard Henderson
  1 sibling, 0 replies; 11+ messages in thread
From: Daniel Henrique Barboza @ 2024-07-26 12:22 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Richard Henderson, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Liu Zhiwei, Helene Chelin, Max Chou



On 7/17/24 12:30 PM, Paolo Savini wrote:
> From: Helene CHELIN <helene.chelin@embecosm.com>
> 
> This patch improves the performance of the emulation of the RVV unit-stride
> loads and stores in the following cases:
> 
> - when the data being loaded/stored per iteration amounts to 8 bytes or less.
> - when the vector length is 16 bytes (VLEN=128) and there's no grouping of the
>    vector registers (LMUL=1).
> 
> The optimization consists of avoiding the overhead of probing the RAM of the
> host machine and doing a loop load/store on the input data grouped in chunks
> of as many bytes as possible (8,4,2,1 bytes).
> 
> Co-authored-by: Helene CHELIN <helene.chelin@embecosm.com>
> Co-authored-by: Paolo Savini <paolo.savini@embecosm.com>
> 
> Signed-off-by: Helene CHELIN <helene.chelin@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 46 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 46 insertions(+)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 29849a8b66..4b444c6bc5 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -633,6 +633,52 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
>   
>       VSTART_CHECK_EARLY_EXIT(env);
>   
> +    /* For data sizes <= 64 bits and for LMUL=1 with VLEN=128 bits we get a
> +     * better performance by doing a simple simulation of the load/store
> +     * without the overhead of prodding the host RAM */
> +    if ((nf == 1) && ((evl << log2_esz) <= 8 ||
> +	((vext_lmul(desc) == 0) && (simd_maxsz(desc) == 16)))) {
> +
> +	uint32_t evl_b = evl << log2_esz;
> +
> +        for (uint32_t j = env->vstart; j < evl_b;) {
> +	    addr = base + j;
> +            if ((evl_b - j) >= 8) {
> +                if (is_load)
> +                    lde_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 8;
> +            }
> +            else if ((evl_b - j) >= 4) {
> +                if (is_load)
> +                    lde_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 4;
> +            }
> +            else if ((evl_b - j) >= 2) {
> +                if (is_load)
> +                    lde_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 2;
> +            }
> +            else {
> +                if (is_load)
> +                    lde_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 1;
> +            }
> +        }
> +
> +        env->vstart = 0;
> +        vext_set_tail_elems_1s(evl, vd, desc, nf, esz, max_elems);
> +        return;
> +    }
> +

Aside from the code style remarks that ./scripts/checkpatch.pl will make here (we always
use curly braces in all ifs and elses, regardless of being a single statement or not),
LGTM.


Thanks,


Daniel

> +
>       vext_cont_ldst_elements(&info, base, env->vreg, env->vstart, evl, desc,
>                               log2_esz, false);
>       /* Probe the page(s).  Exit with exception for any invalid page. */



* Re: [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores
  2024-07-17 15:30 ` [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
  2024-07-26 12:22   ` Daniel Henrique Barboza
@ 2024-07-27  7:13   ` Richard Henderson
  2024-07-31 12:38     ` Daniel Henrique Barboza
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Henderson @ 2024-07-27  7:13 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei, Helene Chelin, Max Chou

On 7/18/24 01:30, Paolo Savini wrote:
> From: Helene CHELIN <helene.chelin@embecosm.com>
> 
> This patch improves the performance of the emulation of the RVV unit-stride
> loads and stores in the following cases:
> 
> - when the data being loaded/stored per iteration amounts to 8 bytes or less.
> - when the vector length is 16 bytes (VLEN=128) and there's no grouping of the
>    vector registers (LMUL=1).
> 
> The optimization consists of avoiding the overhead of probing the RAM of the
> host machine and doing a loop load/store on the input data grouped in chunks
> of as many bytes as possible (8,4,2,1 bytes).
> 
> Co-authored-by: Helene CHELIN <helene.chelin@embecosm.com>
> Co-authored-by: Paolo Savini <paolo.savini@embecosm.com>
> 
> Signed-off-by: Helene CHELIN <helene.chelin@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 46 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 46 insertions(+)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 29849a8b66..4b444c6bc5 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -633,6 +633,52 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
>   
>       VSTART_CHECK_EARLY_EXIT(env);
>   
> +    /* For data sizes <= 64 bits and for LMUL=1 with VLEN=128 bits we get a
> +     * better performance by doing a simple simulation of the load/store
> +     * without the overhead of prodding the host RAM */
> +    if ((nf == 1) && ((evl << log2_esz) <= 8 ||
> +	((vext_lmul(desc) == 0) && (simd_maxsz(desc) == 16)))) {
> +
> +	uint32_t evl_b = evl << log2_esz;
> +
> +        for (uint32_t j = env->vstart; j < evl_b;) {
> +	    addr = base + j;
> +            if ((evl_b - j) >= 8) {
> +                if (is_load)
> +                    lde_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 8;
> +            }
> +            else if ((evl_b - j) >= 4) {
> +                if (is_load)
> +                    lde_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 4;
> +            }
> +            else if ((evl_b - j) >= 2) {
> +                if (is_load)
> +                    lde_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 2;
> +            }
> +            else {
> +                if (is_load)
> +                    lde_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                else
> +                    ste_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
> +                j += 1;
> +            }
> +        }

For system mode, this performs the tlb lookup N times, and so will not be an improvement.
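
A rough cost model for the point above, as an illustration only and not QEMU
code: if every chunked access performs its own lookup, the number of lookups
equals the number of chunks, whereas probing works per page touched. Both
function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8/4/2/1-byte accesses needed to cover `len` bytes. */
static uint32_t chunk_accesses(uint32_t len)
{
    /* Full 8-byte chunks, then at most one each of 4, 2 and 1 bytes. */
    return len / 8 + ((len >> 2) & 1) + ((len >> 1) & 1) + (len & 1);
}

/* Number of pages touched by [addr, addr + len) for a given page size. */
static uint64_t pages_touched(uint64_t addr, uint32_t len, uint64_t page_size)
{
    assert(len > 0 && page_size > 0);
    return (addr + len - 1) / page_size - addr / page_size + 1;
}
```

Under this model a 16-byte access inside one 4 KiB page costs two lookups
with the chunked loop against one page probe, and the gap widens for
lengths that are not a multiple of 8.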

This will not work on a big-endian host.


r~



* Re: [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores
  2024-07-27  7:13   ` Richard Henderson
@ 2024-07-31 12:38     ` Daniel Henrique Barboza
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel Henrique Barboza @ 2024-07-31 12:38 UTC (permalink / raw)
  To: Richard Henderson, Paolo Savini, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li, Liu Zhiwei,
	Helene Chelin, Max Chou



On 7/27/24 4:13 AM, Richard Henderson wrote:
> On 7/18/24 01:30, Paolo Savini wrote:
>> From: Helene CHELIN <helene.chelin@embecosm.com>
>>
>> This patch improves the performance of the emulation of the RVV unit-stride
>> loads and stores in the following cases:
>>
>> - when the data being loaded/stored per iteration amounts to 8 bytes or less.
>> - when the vector length is 16 bytes (VLEN=128) and there's no grouping of the
>>    vector registers (LMUL=1).
>>
>> The optimization consists of avoiding the overhead of probing the RAM of the
>> host machine and doing a loop load/store on the input data grouped in chunks
>> of as many bytes as possible (8,4,2,1 bytes).
>>
>> Co-authored-by: Helene CHELIN <helene.chelin@embecosm.com>
>> Co-authored-by: Paolo Savini <paolo.savini@embecosm.com>
>>
>> Signed-off-by: Helene CHELIN <helene.chelin@embecosm.com>
>> ---
>>   target/riscv/vector_helper.c | 46 ++++++++++++++++++++++++++++++++++++
>>   1 file changed, 46 insertions(+)
>>
>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>> index 29849a8b66..4b444c6bc5 100644
>> --- a/target/riscv/vector_helper.c
>> +++ b/target/riscv/vector_helper.c
>> @@ -633,6 +633,52 @@ vext_ldst_us(void *vd, target_ulong base, CPURISCVState *env, uint32_t desc,
>>       VSTART_CHECK_EARLY_EXIT(env);
>> +    /* For data sizes <= 64 bits and for LMUL=1 with VLEN=128 bits we get a
>> +     * better performance by doing a simple simulation of the load/store
>> +     * without the overhead of prodding the host RAM */
>> +    if ((nf == 1) && ((evl << log2_esz) <= 8 ||
>> +    ((vext_lmul(desc) == 0) && (simd_maxsz(desc) == 16)))) {
>> +
>> +    uint32_t evl_b = evl << log2_esz;
>> +
>> +        for (uint32_t j = env->vstart; j < evl_b;) {
>> +        addr = base + j;
>> +            if ((evl_b - j) >= 8) {
>> +                if (is_load)
>> +                    lde_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                else
>> +                    ste_d_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                j += 8;
>> +            }
>> +            else if ((evl_b - j) >= 4) {
>> +                if (is_load)
>> +                    lde_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                else
>> +                    ste_w_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                j += 4;
>> +            }
>> +            else if ((evl_b - j) >= 2) {
>> +                if (is_load)
>> +                    lde_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                else
>> +                    ste_h_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                j += 2;
>> +            }
>> +            else {
>> +                if (is_load)
>> +                    lde_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                else
>> +                    ste_b_tlb(env, adjust_addr(env, addr), j, vd, ra);
>> +                j += 1;
>> +            }
>> +        }
> 
> For system mode, this performs the tlb lookup N times, and so will not be an improvement.

I believe we can wrap this up in an "#ifdef CONFIG_USER_ONLY" block to allow
linux-user mode to benefit from it. We would still need to take care of the
host endianness though.
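
A minimal sketch of what such a guard could look like, assuming a
hypothetical predicate use_simple_loop() that decides whether the fast path
is taken. Only CONFIG_USER_ONLY comes from this thread; the function name
and its parameters are illustrative, and the LMUL=1/VLEN=128 condition of
the patch is left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: may the simple-loop fast path be used?
 * Only the "8 bytes or less" condition of the patch is modelled here.
 */
static bool use_simple_loop(uint32_t nf, uint32_t total_bytes)
{
    assert(nf >= 1);
#ifdef CONFIG_USER_ONLY
    /* linux-user build: allow the short loop for single-field accesses. */
    return nf == 1 && total_bytes <= 8;
#else
    /* System mode: per the review, each access repeats the TLB lookup. */
    (void)total_bytes;
    return false;
#endif
}
```

With this shape the system-mode build always falls through to the existing
probe-based path, so its behaviour is unchanged.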


Thanks,

Daniel

> 
> This will not work on a big-endian host.
> 
> 
> r~



* [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-07-17 15:30 [RFC 0/2] Improve the performance of unit-stride RVV ld/st on Paolo Savini
  2024-07-17 15:30 ` [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
@ 2024-07-17 15:30 ` Paolo Savini
  2024-07-26 12:27   ` Daniel Henrique Barboza
  2024-07-27  7:15   ` Richard Henderson
  2024-07-26 12:31 ` [RFC 0/2] Improve the performance of unit-stride RVV ld/st on Daniel Henrique Barboza
  2 siblings, 2 replies; 11+ messages in thread
From: Paolo Savini @ 2024-07-17 15:30 UTC (permalink / raw)
  To: qemu-devel, qemu-riscv
  Cc: Paolo Savini, Richard Henderson, Palmer Dabbelt, Alistair Francis,
	Bin Meng, Weiwei Li, Daniel Henrique Barboza, Liu Zhiwei,
	Helene Chelin, Max Chou

This patch optimizes the emulation of unit-stride load/store RVV instructions
when the data being loaded/stored per iteration amounts to 64 bytes or more.
The optimization consists of calling __builtin_memcpy on chunks of data of 128
and 256 bits between the memory address of the simulated vector register and
the destination memory address and vice versa.
This is done only if we have direct access to the RAM of the host machine.

Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
---
 target/riscv/vector_helper.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
index 4b444c6bc5..7674972784 100644
--- a/target/riscv/vector_helper.c
+++ b/target/riscv/vector_helper.c
@@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
     }
 
     fn = fns[is_load][group_size];
-    fn(vd, byte_offset, host + byte_offset);
+
+    if (byte_offset + 32 < byte_end) {
+      group_size = MO_256;
+      if (is_load)
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 32);
+      else
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 32);
+    } else if (byte_offset + 16 < byte_end) {
+      group_size = MO_128;
+      if (is_load)
+        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
+      else
+        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
+    } else {
+      fn(vd, byte_offset, host + byte_offset);
+    }
 
     return 1 << group_size;
 }
-- 
2.17.1




* Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-07-17 15:30 ` [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
@ 2024-07-26 12:27   ` Daniel Henrique Barboza
  2024-07-27  7:15   ` Richard Henderson
  1 sibling, 0 replies; 11+ messages in thread
From: Daniel Henrique Barboza @ 2024-07-26 12:27 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Richard Henderson, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Liu Zhiwei, Helene Chelin, Max Chou



On 7/17/24 12:30 PM, Paolo Savini wrote:
> This patch optimizes the emulation of unit-stride load/store RVV instructions
> when the data being loaded/stored per iteration amounts to 64 bytes or more.
> The optimization consists of calling __builtin_memcpy on chunks of data of 128
> and 256 bytes between the memory address of the simulated vector register and
> the destination memory address and vice versa.
> This is done only if we have direct access to the RAM of the host machine.
> 
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 4b444c6bc5..7674972784 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>       }
>   
>       fn = fns[is_load][group_size];
> -    fn(vd, byte_offset, host + byte_offset);
> +
> +    if (byte_offset + 32 < byte_end) {
> +      group_size = MO_256;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 32);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 32);
> +    } else if (byte_offset + 16 < byte_end) {
> +      group_size = MO_128;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
> +    } else {
> +      fn(vd, byte_offset, host + byte_offset);
> +    }
>  

I see that we don't have any precedent for this particular built-in in the TCG code. We do have
some instances in other parts of QEMU though (e.g. util/guest-random.c).

If we're ok with adding these builtin calls in the execution helpers in TCG, and aside from the
style warnings that ./scripts/checkpatch.pl will give, LGTM.


Thanks,

Daniel

>       return 1 << group_size;
>   }



* Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-07-17 15:30 ` [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
  2024-07-26 12:27   ` Daniel Henrique Barboza
@ 2024-07-27  7:15   ` Richard Henderson
  2024-09-10 11:20     ` Paolo Savini
  1 sibling, 1 reply; 11+ messages in thread
From: Richard Henderson @ 2024-07-27  7:15 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei, Helene Chelin, Max Chou

On 7/18/24 01:30, Paolo Savini wrote:
> This patch optimizes the emulation of unit-stride load/store RVV instructions
> when the data being loaded/stored per iteration amounts to 64 bytes or more.
> The optimization consists of calling __builtin_memcpy on chunks of data of 128
> and 256 bytes between the memory address of the simulated vector register and
> the destination memory address and vice versa.
> This is done only if we have direct access to the RAM of the host machine.
> 
> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
> ---
>   target/riscv/vector_helper.c | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
> index 4b444c6bc5..7674972784 100644
> --- a/target/riscv/vector_helper.c
> +++ b/target/riscv/vector_helper.c
> @@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void *vd, uint32_t byte_end,
>       }
>   
>       fn = fns[is_load][group_size];
> -    fn(vd, byte_offset, host + byte_offset);
> +
> +    if (byte_offset + 32 < byte_end) {
> +      group_size = MO_256;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 32);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 32);
> +    } else if (byte_offset + 16 < byte_end) {
> +      group_size = MO_128;
> +      if (is_load)
> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t *)(host + byte_offset), 16);
> +      else
> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t *)(vd + byte_offset), 16);
> +    } else {
> +      fn(vd, byte_offset, host + byte_offset);
> +    }
>   

This will not work for big-endian hosts.
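
One general way to see the byte-order concern, as a standalone sketch that
does not use any QEMU API (all three function names are hypothetical): a raw
memcpy keeps the bytes in memory order, so the integer value it produces
matches a little-endian load only when the host is little-endian.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble a 32-bit value from 4 bytes stored in little-endian order. */
static uint32_t load_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Reinterpret 4 bytes in host byte order, as a raw memcpy would. */
static uint32_t load_host32(const uint8_t *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Return 1 on a little-endian host, 0 otherwise. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 1;
}
```

For the bytes 0x78 0x56 0x34 0x12, load_le32() gives 0x12345678 on any
host, while load_host32() gives the same value only on a little-endian one.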

This may have atomicity issues, depending on the spec, the compiler options, and the host 
capabilities.


r~




* Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-07-27  7:15   ` Richard Henderson
@ 2024-09-10 11:20     ` Paolo Savini
  2024-09-10 18:18       ` Richard Henderson
  0 siblings, 1 reply; 11+ messages in thread
From: Paolo Savini @ 2024-09-10 11:20 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei, Helene Chelin, Max Chou

Thanks for the feedback Richard, I'm working on the endianness. Could 
you please give me more details about the atomicity issues you are 
referring to?

Best wishes

Paolo

On 7/27/24 08:15, Richard Henderson wrote:
> On 7/18/24 01:30, Paolo Savini wrote:
>> This patch optimizes the emulation of unit-stride load/store RVV 
>> instructions
>> when the data being loaded/stored per iteration amounts to 64 bytes 
>> or more.
>> The optimization consists of calling __builtin_memcpy on chunks of 
>> data of 128
>> and 256 bytes between the memory address of the simulated vector 
>> register and
>> the destination memory address and vice versa.
>> This is done only if we have direct access to the RAM of the host 
>> machine.
>>
>> Signed-off-by: Paolo Savini <paolo.savini@embecosm.com>
>> ---
>>   target/riscv/vector_helper.c | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/target/riscv/vector_helper.c b/target/riscv/vector_helper.c
>> index 4b444c6bc5..7674972784 100644
>> --- a/target/riscv/vector_helper.c
>> +++ b/target/riscv/vector_helper.c
>> @@ -486,7 +486,22 @@ vext_group_ldst_host(CPURISCVState *env, void 
>> *vd, uint32_t byte_end,
>>       }
>>         fn = fns[is_load][group_size];
>> -    fn(vd, byte_offset, host + byte_offset);
>> +
>> +    if (byte_offset + 32 < byte_end) {
>> +      group_size = MO_256;
>> +      if (is_load)
>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t 
>> *)(host + byte_offset), 32);
>> +      else
>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t 
>> *)(vd + byte_offset), 32);
>> +    } else if (byte_offset + 16 < byte_end) {
>> +      group_size = MO_128;
>> +      if (is_load)
>> +        __builtin_memcpy((uint8_t *)(vd + byte_offset), (uint8_t 
>> *)(host + byte_offset), 16);
>> +      else
>> +        __builtin_memcpy((uint8_t *)(host + byte_offset), (uint8_t 
>> *)(vd + byte_offset), 16);
>> +    } else {
>> +      fn(vd, byte_offset, host + byte_offset);
>> +    }
>
> This will not work for big-endian hosts.
>
> This may have atomicity issues, depending on the spec, the compiler 
> options, and the host capabilities.
>
>
> r~
>



* Re: [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data.
  2024-09-10 11:20     ` Paolo Savini
@ 2024-09-10 18:18       ` Richard Henderson
  0 siblings, 0 replies; 11+ messages in thread
From: Richard Henderson @ 2024-09-10 18:18 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Palmer Dabbelt, Alistair Francis, Bin Meng, Weiwei Li,
	Daniel Henrique Barboza, Liu Zhiwei, Helene Chelin, Max Chou

On 9/10/24 04:20, Paolo Savini wrote:
> Thanks for the feedback Richard, I'm working on the endianness. Could you please give me 
> more details about the atomicity issues you are referring to?

For instance a 32-bit atomic memory operation in the guest must be implemented with a >= 
32-bit atomic memory operation in the host.

The main thing to remember is that memcpy() has no atomicity guarantee.  It could be 
implemented as a byte loop.  Thus you may only use memcpy with guest byte vectors.
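
A minimal sketch of that rule, assuming a hypothetical helper
copy_elements() and naturally aligned buffers. It uses the GCC/Clang
__atomic builtins rather than any QEMU wrapper: byte elements go through
memcpy, while 16-bit and 32-bit elements are moved with one host access of
the same width each. 64-bit elements are left out because they depend on
host support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy `n` elements of (1 << log2_esz) bytes each.
 * Both pointers must be aligned to the element size.
 */
static void copy_elements(void *dst, const void *src, size_t n,
                          unsigned log2_esz)
{
    assert(log2_esz <= 2);
    switch (log2_esz) {
    case 0:
        /* Byte vectors have no wider atomicity to preserve. */
        memcpy(dst, src, n);
        break;
    case 1:
        for (size_t i = 0; i < n; i++) {
            uint16_t v = __atomic_load_n((const uint16_t *)src + i,
                                         __ATOMIC_RELAXED);
            __atomic_store_n((uint16_t *)dst + i, v, __ATOMIC_RELAXED);
        }
        break;
    default:
        for (size_t i = 0; i < n; i++) {
            uint32_t v = __atomic_load_n((const uint32_t *)src + i,
                                         __ATOMIC_RELAXED);
            __atomic_store_n((uint32_t *)dst + i, v, __ATOMIC_RELAXED);
        }
        break;
    }
}
```

The relaxed ordering only keeps each element access indivisible; it makes no
claim about ordering between elements.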



r~



* Re: [RFC 0/2] Improve the performance of unit-stride RVV ld/st on
  2024-07-17 15:30 [RFC 0/2] Improve the performance of unit-stride RVV ld/st on Paolo Savini
  2024-07-17 15:30 ` [RFC 1/2] target/riscv: rvv: reduce the overhead for simple RISC-V vector unit-stride loads and stores Paolo Savini
  2024-07-17 15:30 ` [RFC 2/2] target/riscv: rvv: improve performance of RISC-V vector loads and stores on large amounts of data Paolo Savini
@ 2024-07-26 12:31 ` Daniel Henrique Barboza
  2 siblings, 0 replies; 11+ messages in thread
From: Daniel Henrique Barboza @ 2024-07-26 12:31 UTC (permalink / raw)
  To: Paolo Savini, qemu-devel, qemu-riscv
  Cc: Richard Henderson, Palmer Dabbelt, Alistair Francis, Bin Meng,
	Weiwei Li, Liu Zhiwei, Helene Chelin, Max Chou

Hi Paolo,


I suggest adding a "riscv:" at the start of the cover letter subject for the next
version. This will make it easier for everyone else to quickly identify which arch
the patches are changing.

Other than that, and checkpatch.pl style changes, looks good to me.


Thanks,


Daniel


On 7/17/24 12:30 PM, Paolo Savini wrote:
> This series of patches builds on top of Max Chou's patches:
> 
> https://lore.kernel.org/all/20240613175122.1299212-1-max.chou@sifive.com/
> 
> The aim of these patches is to improve the performance of QEMU emulation
> of RVV unit-stride load and store instructions in the following cases
> 
> 1. when the data being loaded/stored per iteration amounts to 8 bytes or less.
> 2. when the vector length is 16 bytes (VLEN=128) and there is no grouping of the
>     vector registers (LMUL=1).
> 3. when the data being loaded/stored per iteration is more than 64 bytes.
> 
> In the first two cases the optimization consists of avoiding the
> overhead of probing the RAM of the host machine and perform a simple loop
> load/store on the data grouped in chunks of as many bytes as possible (8,4,2 or 1).
> 
> The third case is optimized by calling the __builtin_memcpy function on
> data chuncks of 128 bytes and 256 bytes per time.
> 
> These patches on top of Max Chou's patches have been tested with SPEC
> CPU 2017 and achieve an average reduction of 13% of the time needed by
> QEMU for running the benchmarks compared with the master branch of QEMU.
> 
> You can find the source code being developed here: https://github.com/embecosm/rise-rvv-tcg-qemu
> and regular updates and more statistics about the patch here: https://github.com/embecosm/rise-rvv-tcg-qemu-reports
> 
> Changes:
> - patch 1:
>    - Modify vext_ldst_us to run the simple loop load/store if we
>      are in one of the two cases above.
> - patch 2:
>    - Modify vext_group_ldst_host to use __builtin_memcpy for data sizes
>      of 128 bits and above.
> 
> Cc: Richard Handerson <richard.henderson@linaro.org>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Alistair Francis <alistair.francis@wdc.com>
> Cc: Bin Meng <bmeng.cn@gmail.com>
> Cc: Weiwei Li <liwei1518@gmail.com>
> Cc: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
> Cc: Liu Zhiwei <zhiwei_liu@linux.alibaba.com>
> Cc: Helene Chelin <helene.chelin@embecosm.com>
> Cc: Max Chou <max.chou@sifive.com>
> 
> Helene CHELIN (1):
>    target/riscv: rvv: reduce the overhead for simple RISC-V vector
>      unit-stride loads and stores
> 
> Paolo Savini (1):
>    target/riscv: rvv: improve performance of RISC-V vector loads and
>      stores on large amounts of data.
> 
>   target/riscv/vector_helper.c | 63 +++++++++++++++++++++++++++++++++++-
>   1 file changed, 62 insertions(+), 1 deletion(-)
> 

