From: Richard Henderson <richard.henderson@linaro.org>
To: Daniel Henrique Barboza <dbarboza@ventanamicro.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"open list:RISC-V" <qemu-riscv@nongnu.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>,
	Jeff Law <jlaw@ventanamicro.com>,
	Alistair Francis <alistair.francis@wdc.com>
Subject: Re: [RFC] risc-v vector (RVV) emulation performance issues
Date: Tue, 25 Jul 2023 11:53:46 -0700
Message-ID: <9fc36ebe-6ec4-23dd-bbb6-5333905f7d2f@linaro.org>
In-Reply-To: <0e54c6c1-2903-7942-eff2-2b8c5e21187e@ventanamicro.com>

On 7/24/23 06:40, Daniel Henrique Barboza wrote:
> Hi,
> 
> As some of you are already aware, the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.
> 
> Running a simple program like this:
> 
> -------
> 
> #include <stdlib.h>
> 
> #define SZ 10000000
> 
> int main ()
> {
>    int *a = malloc (SZ * sizeof (int));
>    int *b = malloc (SZ * sizeof (int));
>    int *c = malloc (SZ * sizeof (int));
> 
>    for (int i = 0; i < SZ; i++)
>      c[i] = a[i] + b[i];
>    return c[SZ - 1];
> }
> 
> -------
> 
> Compiling it without RVV support, it runs in about 50 ms:
> 
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out
> 
> real    0m0.043s
> user    0m0.025s
> sys    0m0.018s
> 
> Building the same program with RVV support slows it down by 4-5x:
> 
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
> 
> real    0m0.196s
> user    0m0.177s
> sys    0m0.018s
> 
> Using the lowest 'vlen' value allowed (128) slows things down even further,
> taking it to ~0.260s.
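
(The compiler invocations aren't shown above. Assuming a GCC cross toolchain
recent enough to auto-vectorize for RVV, the two binaries could be produced
roughly like this -- compiler name, -static and the exact -march strings are
illustrative, not taken from the original report:)

  $ riscv64-linux-gnu-gcc -O3 -march=rv64gc  -static -o foo-novect.out foo.c
  $ riscv64-linux-gnu-gcc -O3 -march=rv64gcv -static -o foo.out        foo.c
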
> 
> 
> 'perf record' shows the following profile on the aforementioned binary:
> 
>    23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
>    21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
>    14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
>    11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
>     8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
>     8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
>     2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
>     1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
>     1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
>     0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
>     0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
>     0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
>     0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert
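
(The perf invocation isn't shown either; presumably the profile above was
captured by profiling qemu-riscv64 itself while it runs the guest binary,
along these lines:)

  $ perf record -o perf.data -- ~/work/qemu/build/qemu-riscv64 -cpu rv64,... ./foo.out
  $ perf report --stdio
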
> 
> 
> The first thing that caught my attention is vext_ldst_us() from target/riscv/vector_helper.c:
> 
>      /* load bytes from guest memory */
>      for (i = env->vstart; i < evl; i++, env->vstart++) {
>          k = 0;
>          while (k < nf) {
>              target_ulong addr = base + ((i * nf + k) << log2_esz);
>              ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
>              k++;
>          }
>      }
>      env->vstart = 0;
> 
> Given that this is a unit-stride load that accesses contiguous elements in memory,
> it seems that this loop could be optimized/removed, since it's loading/storing
> elements one by one. I didn't find any TCG op to do that, though. I assume that
> Arm SVE might have something of the sort. Richard, care to comment?

Yes, SVE optimizes this case -- see

https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/sve_helper.c?ref_type=heads#L5651

It's not possible to do this generically, due to the predication. There's quite a lot of 
machinery that goes into expanding this such that each helper uses the correct host 
load/store insn in the fast case.
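
As a rough illustration of that fast case for the unit-stride helper -- a
sketch only, neither the SVE machinery nor tested code: the function name is
made up, it assumes an unmasked access with nf == 1 and vstart == 0 and a
little-endian host, and it elides the per-element fallback.  probe_access()
either returns a directly usable host pointer, returns NULL for slow-path
cases, or raises the guest fault via ra, which matches what the per-element
loop would do:

/* Sketch only -- illustrative, not the in-tree implementation.  Assumes the
 * usual includes of target/riscv/vector_helper.c plus "exec/exec-all.h".
 */
static void vext_ldst_us_contiguous_sketch(void *vd, target_ulong base,
                                           uint32_t evl, uint32_t log2_esz,
                                           CPURISCVState *env, uintptr_t ra)
{
    uint32_t total = evl << log2_esz;               /* total bytes to load */
    uint32_t done = 0;
    int mmu_idx = cpu_mmu_index(env, false);

    while (done < total) {
        target_ulong addr = base + done;
        /* probe_access() must not cross a page: clamp to end of page. */
        uint32_t in_page = MIN(total - done,
                               (uint32_t)-(addr | TARGET_PAGE_MASK));
        void *host = probe_access(env, addr, in_page, MMU_DATA_LOAD,
                                  mmu_idx, ra);

        if (host) {
            /* RAM-backed and present in the TLB: one bulk copy per page
             * (guest and host both little-endian assumed). */
            memcpy((uint8_t *)vd + done, host, in_page);
            done += in_page;
        } else {
            /* MMIO, watchpoints, ...: a real implementation would fall
             * back to the per-element loop quoted above (elided here). */
            g_assert_not_reached();
        }
    }
}

The SVE helpers linked above do roughly the equivalent, but also cope with
predication, watchpoints and faults in the middle of the vector, which is
where most of the machinery goes.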


r~

