[RFC PATCH] target/ppc: Inline most of dcbz helper


* [RFC PATCH] target/ppc: Inline most of dcbz helper
@ 2024-07-01  0:59 BALATON Zoltan
  2025-04-24 12:45 ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2024-07-01  0:59 UTC
  To: qemu-devel, qemu-ppc
  Cc: Nicholas Piggin, Daniel Henrique Barboza, Richard Henderson

This is an RFC patch, not finished, just to show the idea and test
this approach. I'm not sure it's correct but I'm sure it can be
improved so comments are requested.

The test case I've used came out of a discussion about very slow
access to the VRAM of a graphics card passed through with vfio. The
reason for that is still not clear, but it was already known that dcbz
is often used by MacOS and AmigaOS for clearing memory and to avoid
reading values that are about to be overwritten. This is faster on a
real CPU but was found to be slower on QEMU. The optimised copy
routines were posted here:
https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
and the rest of the code, which I've written to turn it into a test
case, is here:
http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
Replace the body of has_altivec() with just "return false". Sorry for
only giving pieces, but the code posted above has a copyright that
does not allow me to include it in the test. This does not measure
VRAM access for now, just memory copy, but it shows the effect of
dcbz. I've got these results with this patch:

Linux user master:                  Linux user patch:
byte loop: 2.2 sec                  byte loop: 2.2 sec
memcpy: 2.19 sec                    memcpy: 2.19 sec
copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec

Linux system master:                Linux system patch:
byte loop: 5.86 sec                 byte loop: 5.9 sec
memcpy: 5.45 sec                    memcpy: 5.47 sec
copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec

It could probably be optimised further by using vector instructions
(dcbz_size is between 32 and 128) or by eliminating the check left in
the helper for 970, but I don't know how to do those. (The series
that converts AltiVec to use 128 bit access may also help, but I
haven't tested that; I'm only trying to optimise dcbz here.)

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
 target/ppc/helper.h     |  1 +
 target/ppc/mem_helper.c | 14 ++++++++++++++
 target/ppc/translate.c  | 34 ++++++++++++++++++++++++++++------
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 76b8f25c77..e49681c25b 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -46,6 +46,7 @@ DEF_HELPER_FLAGS_3(stmw, TCG_CALL_NO_WG, void, env, tl, i32)
 DEF_HELPER_4(lsw, void, env, tl, i32, i32)
 DEF_HELPER_5(lswx, void, env, tl, i32, i32, i32)
 DEF_HELPER_FLAGS_4(stsw, TCG_CALL_NO_WG, void, env, tl, i32, i32)
+DEF_HELPER_FLAGS_2(dcbz_size, TCG_CALL_NO_WG_SE, tl, env, i32)
 DEF_HELPER_FLAGS_3(dcbz, TCG_CALL_NO_WG, void, env, tl, i32)
 DEF_HELPER_FLAGS_3(dcbzep, TCG_CALL_NO_WG, void, env, tl, i32)
 DEF_HELPER_FLAGS_2(icbi, TCG_CALL_NO_WG, void, env, tl)
diff --git a/target/ppc/mem_helper.c b/target/ppc/mem_helper.c
index f88155ad45..b06cb2d00e 100644
--- a/target/ppc/mem_helper.c
+++ b/target/ppc/mem_helper.c
@@ -270,6 +270,20 @@ void helper_stsw(CPUPPCState *env, target_ulong addr, uint32_t nb,
     }
 }
 
+target_ulong helper_dcbz_size(CPUPPCState *env, uint32_t opcode)
+{
+    target_ulong dcbz_size = env->dcache_line_size;
+
+#if defined(TARGET_PPC64)
+    /* Check for dcbz vs dcbzl on 970 */
+    if (env->excp_model == POWERPC_EXCP_970 &&
+        !(opcode & 0x00200000) && ((env->spr[SPR_970_HID5] >> 7) & 0x3) == 1) {
+        dcbz_size = 32;
+    }
+#endif
+    return dcbz_size;
+}
+
 static void dcbz_common(CPUPPCState *env, target_ulong addr,
                         uint32_t opcode, bool epid, uintptr_t retaddr)
 {
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 0bc16d7251..49221b8303 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -4445,14 +4445,36 @@ static void gen_dcblc(DisasContext *ctx)
 /* dcbz */
 static void gen_dcbz(DisasContext *ctx)
 {
-    TCGv tcgv_addr;
-    TCGv_i32 tcgv_op;
+    TCGv addr, mask, dcbz_size, t0;
+    TCGv_i32 op = tcg_constant_i32(ctx->opcode & 0x03FF000);
+    TCGv_i64 z64 = tcg_constant_i64(0);
+    TCGv_i128 z128 = tcg_temp_new_i128();
+    TCGLabel *l;
+
+    addr = tcg_temp_new();
+    mask = tcg_temp_new();
+    dcbz_size = tcg_temp_new();
+    t0 = tcg_temp_new();
+    l = gen_new_label();
 
     gen_set_access_type(ctx, ACCESS_CACHE);
-    tcgv_addr = tcg_temp_new();
-    tcgv_op = tcg_constant_i32(ctx->opcode & 0x03FF000);
-    gen_addr_reg_index(ctx, tcgv_addr);
-    gen_helper_dcbz(tcg_env, tcgv_addr, tcgv_op);
+    gen_helper_dcbz_size(dcbz_size, tcg_env, op);
+    tcg_gen_mov_tl(mask, dcbz_size);
+    tcg_gen_subi_tl(mask, mask, 1);
+    tcg_gen_not_tl(mask, mask);
+    gen_addr_reg_index(ctx, addr);
+    tcg_gen_and_tl(addr, addr, mask);
+    tcg_gen_mov_tl(t0, cpu_reserve);
+    tcg_gen_and_tl(t0, t0, mask);
+    tcg_gen_movcond_tl(TCG_COND_EQ, cpu_reserve, addr, t0,
+                       tcg_constant_tl(-1), cpu_reserve);
+
+    tcg_gen_concat_i64_i128(z128, z64, z64);
+    gen_set_label(l);
+    tcg_gen_qemu_st_i128(z128, addr, ctx->mem_idx, DEF_MEMOP(MO_128));
+    tcg_gen_addi_tl(addr, addr, 16);
+    tcg_gen_subi_tl(dcbz_size, dcbz_size, 16);
+    tcg_gen_brcondi_tl(TCG_COND_GT, dcbz_size, 0, l);
 }
 
 /* dcbzep */
-- 
2.30.9
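
For readers less familiar with TCG, the ops emitted by the new gen_dcbz()
can be modelled in plain C roughly as below. This is only a sketch under
assumptions, not code from the patch: GuestModel, model_dcbz_size() and
model_dcbz() are hypothetical names, guest memory is a flat array, the 970
HID5 check is reduced to a boolean, and a 16 byte memset stands in for each
128 bit store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the CPU state the patch touches. */
typedef struct {
    uint8_t *ram;              /* flat guest memory (assumption) */
    uint32_t reserve;          /* models cpu_reserve, UINT32_MAX = none */
    uint32_t dcache_line_size; /* 32..128, power of two */
    bool is_970_dcbz32;        /* models the 970 HID5 "dcbz clears 32" case */
} GuestModel;

/* Mirrors helper_dcbz_size(): the only part left in a helper. */
static uint32_t model_dcbz_size(const GuestModel *g, bool is_dcbzl)
{
    if (g->is_970_dcbz32 && !is_dcbzl) {
        return 32;
    }
    return g->dcache_line_size;
}

/* Mirrors the inlined sequence: align, drop reservation, store loop. */
static void model_dcbz(GuestModel *g, uint32_t ea, bool is_dcbzl)
{
    uint32_t size = model_dcbz_size(g, is_dcbzl);
    uint32_t mask, addr;

    assert(size >= 16 && (size & (size - 1)) == 0);
    mask = ~(size - 1);
    addr = ea & mask;

    /* movcond: invalidate the reservation if it lies in the cleared line */
    if ((g->reserve & mask) == addr) {
        g->reserve = UINT32_MAX;
    }

    /* brcond loop: one 16 byte store per iteration, executed at least once */
    do {
        memset(g->ram + addr, 0, 16);
        addr += 16;
        size -= 16;
    } while ((int32_t)size > 0);
}
```

With a 32 byte line this performs two 16 byte stores per dcbz, and with a
128 byte line eight, which is why the loop count depends on the value
returned by the size helper.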




* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2024-07-01  0:59 [RFC PATCH] target/ppc: Inline most of dcbz helper BALATON Zoltan
@ 2025-04-24 12:45 ` BALATON Zoltan
  2025-04-28  0:12   ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-24 12:45 UTC
  To: qemu-devel, qemu-ppc; +Cc: Nicholas Piggin, Richard Henderson

Hello,

On Mon, 1 Jul 2024, BALATON Zoltan wrote:
> This is an RFC patch, not finished, just to show the idea and test
> this approach. I'm not sure it's correct but I'm sure it can be
> improved so comments are requested.

Last time I did not get any replies to this so I'm trying again. Some
people recently tried using passed through GPUs with qemu-system-ppc
running AmigaOS again, and while it works, it was slower than expected.
This was previously found to be possibly related to dcbz and to accessing
passed through PCI memory with 32 bit instead of 128 bit accesses. The
dcbz opcode is also commonly used for clearing memory in MacOS, so
optimising it would make these run faster even without pass through.
(This was explored with KVM here:
https://www.talospace.com/2018/08/making-your-talos-ii-into-power-mac_29.html)

dcbz was improved with user emulation after my last try but system
emulation is still affected. Results from the test case below, now on QEMU
master on the same host machine (accessing RAM, not passed through VRAM,
same as the previous tests below, which were also run against RAM):

qemu-ppc -cpu 7457:
byte loop: 2.22 sec
memcpy: 2.21 sec
copyToVRAMNoAltivec: 1.69 sec
copyToVRAMAltivec: 2.12 sec
copyFromVRAMNoAltivec: 2.24 sec
copyFromVRAMAltivec: 2.82 sec

qemu-system-ppc -machine pegasos2:
byte loop: 5.28 sec
memcpy: 5.06 sec
copyToVRAMNoAltivec: 2.52 sec
copyToVRAMAltivec: 2.66 sec
copyFromVRAMNoAltivec: 6.37 sec
copyFromVRAMAltivec: 6.84 sec

The qemu-system-ppc case is still far from optimal. Is there anybody
who wants to give this a try, or are there any recommendations on what to
do? I think ideally we should try to implement dcbz with TCG ops to avoid
the helper, and try to use vector ops so that these may be translated to
wider 128 bit access ops on the host. Wide accesses are believed to be
needed for accessing VRAM to avoid the overhead of the transfer, which is
why these routines used AltiVec on PPC. See the original message below for
more details.
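
As a rough illustration of why the access width matters, here is the
arithmetic for how many stores it takes to clear one cache line. It is a
sketch only: stores_per_line() is a hypothetical helper, and it assumes
each guest store to passed through memory turns into one separate transfer,
which I have not verified.

```c
#include <assert.h>

/* Number of stores needed to cover one cache line at a given access width. */
static unsigned stores_per_line(unsigned line_size, unsigned access_bytes)
{
    assert(access_bytes != 0 && line_size % access_bytes == 0);
    return line_size / access_bytes;
}
```

For a 32 byte line this is 8 stores at 32 bit width but only 2 at 128 bit
width, so under the above assumption wider ops cut the per-transfer
overhead by a factor of four.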

Thank you,
BALATON Zoltan

> The test case I've used came out of a discussion about very slow
> access to VRAM of a graphics card passed through with vfio the reason
> for which is still not clear but it was already known that dcbz is
> often used by MacOS and AmigaOS for clearing memory and to avoid
> reading values about to be overwritten which is faster on real CPU but
> was found to be slower on QEMU. The optimised copy routines were
> posted here:
> https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
> and the rest of it I've written to make it a test case is here:
> http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
> Replace the body of has_altivec() with just "return false". Sorry for
> only giving pieces but the code posted above has a copyright that does
> not allow me to include it in the test. This is not measuring VRAM
> access now just memory copy but shows the effect of dcbz. I've got
> these results with this patch:
>
> Linux user master:                  Linux user patch:
> byte loop: 2.2 sec                  byte loop: 2.2 sec
> memcpy: 2.19 sec                    memcpy: 2.19 sec
> copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
> copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
> copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
> copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec
>
> Linux system master:                Linux system patch:
> byte loop: 5.86 sec                 byte loop: 5.9 sec
> memcpy: 5.45 sec                    memcpy: 5.47 sec
> copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
> copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
> copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
> copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec
>
> It could probably be further optimised with using vector instructions
> (dcbz_size is between 32 and 128) or by eliminating the check left in
> the helper for 970 but I don't know how to do those. (Also the series
> that converts AltiVec to use 128 bit access may help but I haven't
> tested that, only trying to optimise dcbz here.)
>
> Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
> ---
> target/ppc/helper.h     |  1 +
> target/ppc/mem_helper.c | 14 ++++++++++++++
> target/ppc/translate.c  | 34 ++++++++++++++++++++++++++++------
> 3 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/target/ppc/helper.h b/target/ppc/helper.h
> index 76b8f25c77..e49681c25b 100644
> --- a/target/ppc/helper.h
> +++ b/target/ppc/helper.h
> @@ -46,6 +46,7 @@ DEF_HELPER_FLAGS_3(stmw, TCG_CALL_NO_WG, void, env, tl, i32)
> DEF_HELPER_4(lsw, void, env, tl, i32, i32)
> DEF_HELPER_5(lswx, void, env, tl, i32, i32, i32)
> DEF_HELPER_FLAGS_4(stsw, TCG_CALL_NO_WG, void, env, tl, i32, i32)
> +DEF_HELPER_FLAGS_2(dcbz_size, TCG_CALL_NO_WG_SE, tl, env, i32)
> DEF_HELPER_FLAGS_3(dcbz, TCG_CALL_NO_WG, void, env, tl, i32)
> DEF_HELPER_FLAGS_3(dcbzep, TCG_CALL_NO_WG, void, env, tl, i32)
> DEF_HELPER_FLAGS_2(icbi, TCG_CALL_NO_WG, void, env, tl)
> diff --git a/target/ppc/mem_helper.c b/target/ppc/mem_helper.c
> index f88155ad45..b06cb2d00e 100644
> --- a/target/ppc/mem_helper.c
> +++ b/target/ppc/mem_helper.c
> @@ -270,6 +270,20 @@ void helper_stsw(CPUPPCState *env, target_ulong addr, uint32_t nb,
>     }
> }
>
> +target_ulong helper_dcbz_size(CPUPPCState *env, uint32_t opcode)
> +{
> +    target_ulong dcbz_size = env->dcache_line_size;
> +
> +#if defined(TARGET_PPC64)
> +    /* Check for dcbz vs dcbzl on 970 */
> +    if (env->excp_model == POWERPC_EXCP_970 &&
> +        !(opcode & 0x00200000) && ((env->spr[SPR_970_HID5] >> 7) & 0x3) == 1) {
> +        dcbz_size = 32;
> +    }
> +#endif
> +    return dcbz_size;
> +}
> +
> static void dcbz_common(CPUPPCState *env, target_ulong addr,
>                         uint32_t opcode, bool epid, uintptr_t retaddr)
> {
> diff --git a/target/ppc/translate.c b/target/ppc/translate.c
> index 0bc16d7251..49221b8303 100644
> --- a/target/ppc/translate.c
> +++ b/target/ppc/translate.c
> @@ -4445,14 +4445,36 @@ static void gen_dcblc(DisasContext *ctx)
> /* dcbz */
> static void gen_dcbz(DisasContext *ctx)
> {
> -    TCGv tcgv_addr;
> -    TCGv_i32 tcgv_op;
> +    TCGv addr, mask, dcbz_size, t0;
> +    TCGv_i32 op = tcg_constant_i32(ctx->opcode & 0x03FF000);
> +    TCGv_i64 z64 = tcg_constant_i64(0);
> +    TCGv_i128 z128 = tcg_temp_new_i128();
> +    TCGLabel *l;
> +
> +    addr = tcg_temp_new();
> +    mask = tcg_temp_new();
> +    dcbz_size = tcg_temp_new();
> +    t0 = tcg_temp_new();
> +    l = gen_new_label();
>
>     gen_set_access_type(ctx, ACCESS_CACHE);
> -    tcgv_addr = tcg_temp_new();
> -    tcgv_op = tcg_constant_i32(ctx->opcode & 0x03FF000);
> -    gen_addr_reg_index(ctx, tcgv_addr);
> -    gen_helper_dcbz(tcg_env, tcgv_addr, tcgv_op);
> +    gen_helper_dcbz_size(dcbz_size, tcg_env, op);
> +    tcg_gen_mov_tl(mask, dcbz_size);
> +    tcg_gen_subi_tl(mask, mask, 1);
> +    tcg_gen_not_tl(mask, mask);
> +    gen_addr_reg_index(ctx, addr);
> +    tcg_gen_and_tl(addr, addr, mask);
> +    tcg_gen_mov_tl(t0, cpu_reserve);
> +    tcg_gen_and_tl(t0, t0, mask);
> +    tcg_gen_movcond_tl(TCG_COND_EQ, cpu_reserve, addr, t0,
> +                       tcg_constant_tl(-1), cpu_reserve);
> +
> +    tcg_gen_concat_i64_i128(z128, z64, z64);
> +    gen_set_label(l);
> +    tcg_gen_qemu_st_i128(z128, addr, ctx->mem_idx, DEF_MEMOP(MO_128));
> +    tcg_gen_addi_tl(addr, addr, 16);
> +    tcg_gen_subi_tl(dcbz_size, dcbz_size, 16);
> +    tcg_gen_brcondi_tl(TCG_COND_GT, dcbz_size, 0, l);
> }
>
> /* dcbzep */
>



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-24 12:45 ` BALATON Zoltan
@ 2025-04-28  0:12   ` BALATON Zoltan
  2025-04-28 10:44     ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-28  0:12 UTC
  To: qemu-devel, qemu-ppc; +Cc: Nicholas Piggin, Richard Henderson

On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>> The test case I've used came out of a discussion about very slow
>> access to VRAM of a graphics card passed through with vfio the reason
>> for which is still not clear but it was already known that dcbz is
>> often used by MacOS and AmigaOS for clearing memory and to avoid
>> reading values about to be overwritten which is faster on real CPU but
>> was found to be slower on QEMU. The optimised copy routines were
>> posted here:
>> https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
>> and the rest of it I've written to make it a test case is here:
>> http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
>> Replace the body of has_altivec() with just "return false". Sorry for
>> only giving pieces but the code posted above has a copyright that does
>> not allow me to include it in the test. This is not measuring VRAM
>> access now just memory copy but shows the effect of dcbz. I've got
>> these results with this patch:
>> 
>> Linux user master:                  Linux user patch:
>> byte loop: 2.2 sec                  byte loop: 2.2 sec
>> memcpy: 2.19 sec                    memcpy: 2.19 sec
>> copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
>> copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
>> copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
>> copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec
>> 
>> Linux system master:                Linux system patch:
>> byte loop: 5.86 sec                 byte loop: 5.9 sec
>> memcpy: 5.45 sec                    memcpy: 5.47 sec
>> copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
>> copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
>> copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
>> copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec

I did some more benchmarking to identify what slows it down. I noticed 
that memset uses dcbz too so I added a test for that. I've also added a 
parameter to allow testing actual VRAM and now that I have a card working 
with vfio-pci passthrough I could also test that. The updated 
vramcopy.tar.xz is at the same URL as above. These tests were run with the 
amigaone machine under Linux booted as described here:
https://www.qemu.org/docs/master/system/ppc/amigang.html

I compiled the benchmark twice, once as in the tar and once replacing dcbz
in the copyFromVRAM* routines with dcba (which is a no-op on QEMU). The
first two result sets are with both src and dst in RAM; the second two are
with dst in VRAM (mapped from phys address 0x80800000 where the card's
framebuffer is mapped). The left column shows results with emulated ati-vga
as in the amigang.html docs. The right column is with a real ATI X550 card
(old and slow but works with this old PPC Linux) passed through with vfio-pci.
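
The measurement itself is simple and can be sketched as below. This is not
the code from vramcopy.tar.xz: copy_fn, copy_bytes(), copy_memcpy() and
time_copy() are hypothetical names used only to show the idea of timing a
copy routine over a number of iterations.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Common signature for the copy routines being compared. */
typedef void (*copy_fn)(void *dst, const void *src, size_t len);

/* "byte loop": copy one byte at a time. */
static void copy_bytes(void *dst, const void *src, size_t len)
{
    volatile unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t i = 0; i < len; i++) {
        d[i] = s[i];
    }
}

/* "memcpy": whatever the C library does (may use dcbz on PPC). */
static void copy_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Run fn 'iterations' times and return the elapsed wall time in seconds. */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t len, unsigned iterations)
{
    struct timespec t0, t1;

    assert(iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < iterations; i++) {
        fn(dst, src, len);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

To test VRAM instead of RAM, dst would point into a mapping of the
framebuffer instead of a malloc'ed buffer; the timing loop stays the same.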

All times are in seconds.

Both src and dst in RAM, using dcbz:
                          with ati-vga    with vfio-pci
src                       0xb79c8008      0xb7c92008
dst                       0xb78c7008      0xb7b91008
byte loop                 21.16           21.16
memset                    3.85            3.87
memcpy                    5.07            5.07
copyToVRAMNoAltivec       2.52            2.53
copyToVRAMAltivec         2.42            2.37
copyFromVRAMNoAltivec     6.39            6.38
copyFromVRAMAltivec       7.02            7

Both src and dst in RAM, using dcba instead of dcbz:
                          with ati-vga    with vfio-pci
src                       0xb7b69008      0xb7c44008
dst                       0xb7a68008      0xb7b43008
byte loop                 21.14           21.14
memset                    3.85            3.88
memcpy                    5.06            5.07
copyToVRAMNoAltivec       2.53            2.52
copyToVRAMAltivec         2.3             2.3
copyFromVRAMNoAltivec     2.59            2.59
copyFromVRAMAltivec       2.95            2.95

dst in VRAM (mapping 0x80800000), using dcbz:
                          emulated        real card
                          ati-vga         vfio vram
src                       0xb78e0008      0xb7ec5008
dst                       0xb77de000      0xb7dc3000
byte loop                 21.2            563.98
memset                    3.89            39.25
memcpy                    5.07            140.49
copyToVRAMNoAltivec       2.53            72.03
copyToVRAMAltivec         12.22           78.12
copyFromVRAMNoAltivec     6.43            728.52
copyFromVRAMAltivec       35.33           754.95

dst in VRAM (mapping 0x80800000), using dcba instead of dcbz:
                          emulated        real card
                          ati-vga         vfio vram
src                       0xb7ba7008      0xb77f4008
dst                       0xb7aa5000      0xb76f2000
byte loop                 21.15           577.42
memset                    3.85            39.52
memcpy                    5.06            142.8
copyToVRAMNoAltivec       2.53            71.71
copyToVRAMAltivec         12.2            78.09
copyFromVRAMNoAltivec     2.6             727.23
copyFromVRAMAltivec       35.03           753.15

The results show that dcbz has some effect, but an even bigger slowdown is
caused by using AltiVec, which is supposed to do wider accesses to reduce
the overhead; maybe it's not translated to host vector instructions
correctly. The host in the above test was an Intel i7-9700K. So to solve
this, maybe AltiVec should be improved more than dcbz, but I don't know
what and how.
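
To put numbers on that, the ratios can be computed from the timings in the
tables above. This is just arithmetic on the measured values; slowdown() and
print_slowdowns() are hypothetical helper names, and which pairs are worth
comparing is my own choice.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two timings in seconds; > 1 means the first one is slower. */
static double slowdown(double slow_sec, double fast_sec)
{
    assert(slow_sec >= 0.0 && fast_sec > 0.0);
    return slow_sec / fast_sec;
}

/* Print a few ratios using timings copied from the tables above. */
static void print_slowdowns(void)
{
    /* dst in RAM, ati-vga column: dcbz build vs dcba build */
    printf("dcbz vs dcba, copyFromVRAMNoAltivec: %.2fx\n",
           slowdown(6.39, 2.59));
    /* dst in emulated ati-vga VRAM: AltiVec vs non-AltiVec routine */
    printf("AltiVec vs NoAltivec, copyToVRAM: %.2fx\n",
           slowdown(12.22, 2.53));
    /* same, copyFromVRAM in the dcba build so dcbz is out of the picture */
    printf("AltiVec vs NoAltivec, copyFromVRAM (dcba): %.2fx\n",
           slowdown(35.03, 2.6));
    /* real card through vfio vs emulated ati-vga VRAM */
    printf("vfio vs ati-vga, copyToVRAMNoAltivec: %.2fx\n",
           slowdown(72.03, 2.53));
}
```

The first ratio isolates the cost of dcbz, the next two the cost of the
AltiVec routines when writing to emulated VRAM, and the last one the extra
cost of going through vfio.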

Regards,
BALATON Zoltan



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-28  0:12   ` BALATON Zoltan
@ 2025-04-28 10:44     ` BALATON Zoltan
  2025-04-28 13:26       ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-28 10:44 UTC
  To: qemu-devel, qemu-ppc; +Cc: Nicholas Piggin, Richard Henderson

On Mon, 28 Apr 2025, BALATON Zoltan wrote:
> On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>>> The test case I've used came out of a discussion about very slow
>>> access to VRAM of a graphics card passed through with vfio the reason
>>> for which is still not clear but it was already known that dcbz is
>>> often used by MacOS and AmigaOS for clearing memory and to avoid
>>> reading values about to be overwritten which is faster on real CPU but
>>> was found to be slower on QEMU. The optimised copy routines were
>>> posted here:
>>> https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
>>> and the rest of it I've written to make it a test case is here:
>>> http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
>>> Replace the body of has_altivec() with just "return false". Sorry for
>>> only giving pieces but the code posted above has a copyright that does
>>> not allow me to include it in the test. This is not measuring VRAM
>>> access now just memory copy but shows the effect of dcbz. I've got
>>> these results with this patch:
>>> 
>>> Linux user master:                  Linux user patch:
>>> byte loop: 2.2 sec                  byte loop: 2.2 sec
>>> memcpy: 2.19 sec                    memcpy: 2.19 sec
>>> copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
>>> copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
>>> copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
>>> copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec
>>> 
>>> Linux system master:                Linux system patch:
>>> byte loop: 5.86 sec                 byte loop: 5.9 sec
>>> memcpy: 5.45 sec                    memcpy: 5.47 sec
>>> copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
>>> copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
>>> copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
>>> copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec
>
> I did some more benchmarking to identify what slows it down. I noticed that 
> memset uses dcbz too so I added a test for that. I've also added a parameter 
> to allow testing actual VRAM and now that I have a card working with vfio-pci 
> passthrough I could also test that. The updated vramcopy.tar.xz is at the 
> same URL as above. These tests were run with the amigaone machine under Linux 
> booted as described here:
> https://www.qemu.org/docs/master/system/ppc/amigang.html
>
> I compiled the benchmark twice, once as in the tar and once replacing dcbz in 
> the copyFromVRAM* routines with dcba (which is noop on QEMU). First two 
> results are with both src and dst in RAM, second two tests are with dst in 
> VRAM (mapped from phys address 0x80800000 where the card's framebuffer is 
> mapped). The left column shows results with emulated ati-vga as in the 
> amigang.html docs. The right column is with real ATI X550 card (old and slow 
> but works with this old PPC Linux) passed through with vfio-pci.
>
> with ati-vga                            with vfio-pci
>
> src 0xb79c8008 dst 0xb78c7008	      |	src 0xb7c92008 dst 0xb7b91008
> byte loop: 21.16 sec			byte loop: 21.16 sec
> memset: 3.85 sec		      |	memset: 3.87 sec
> memcpy: 5.07 sec			memcpy: 5.07 sec
> copyToVRAMNoAltivec: 2.52 sec	      |	copyToVRAMNoAltivec: 2.53 sec
> copyToVRAMAltivec: 2.42 sec	      |	copyToVRAMAltivec: 2.37 sec
> copyFromVRAMNoAltivec: 6.39 sec	      |	copyFromVRAMNoAltivec: 6.38 sec
> copyFromVRAMAltivec: 7.02 sec	      |	copyFromVRAMAltivec: 7 sec
>
> using dcba instead of dcbz	      |	using dcba instead of dcbz
> src 0xb7b69008 dst 0xb7a68008	      |	src 0xb7c44008 dst 0xb7b43008
> byte loop: 21.14 sec			byte loop: 21.14 sec
> memset: 3.85 sec		      |	memset: 3.88 sec
> memcpy: 5.06 sec		      |	memcpy: 5.07 sec
> copyToVRAMNoAltivec: 2.53 sec	      |	copyToVRAMNoAltivec: 2.52 sec
> copyToVRAMAltivec: 2.3 sec		copyToVRAMAltivec: 2.3 sec
> copyFromVRAMNoAltivec: 2.59 sec		copyFromVRAMNoAltivec: 2.59 sec
> copyFromVRAMAltivec: 2.95 sec		copyFromVRAMAltivec: 2.95 sec
>
> dst in emulated ati-vga		      |	dst in real card vfio vram
> mapping 0x80800000			mapping 0x80800000
> src 0xb78e0008 dst 0xb77de000	      |	src 0xb7ec5008 dst 0xb7dc3000
> byte loop: 21.2 sec		      |	byte loop: 563.98 sec
> memset: 3.89 sec		      |	memset: 39.25 sec
> memcpy: 5.07 sec		      |	memcpy: 140.49 sec
> copyToVRAMNoAltivec: 2.53 sec	      |	copyToVRAMNoAltivec: 72.03 sec
> copyToVRAMAltivec: 12.22 sec	      |	copyToVRAMAltivec: 78.12 sec
> copyFromVRAMNoAltivec: 6.43 sec	      |	copyFromVRAMNoAltivec: 728.52 sec
> copyFromVRAMAltivec: 35.33 sec	      |	copyFromVRAMAltivec: 754.95 sec
>
> dst in emulated ati-vga using dcba    |	dst in real card vfio vram using dcba
> mapping 0x80800000			mapping 0x80800000
> src 0xb7ba7008 dst 0xb7aa5000	      |	src 0xb77f4008 dst 0xb76f2000
> byte loop: 21.15 sec		      |	byte loop: 577.42 sec
> memset: 3.85 sec		      |	memset: 39.52 sec
> memcpy: 5.06 sec		      |	memcpy: 142.8 sec
> copyToVRAMNoAltivec: 2.53 sec	      |	copyToVRAMNoAltivec: 71.71 sec
> copyToVRAMAltivec: 12.2 sec	      |	copyToVRAMAltivec: 78.09 sec
> copyFromVRAMNoAltivec: 2.6 sec	      |	copyFromVRAMNoAltivec: 727.23 sec
> copyFromVRAMAltivec: 35.03 sec	      |	copyFromVRAMAltivec: 753.15 sec
>
> The results show that dcbz has some effect but an even bigger slow down is 
> caused by using AltiVec which is supposed to do wider access to reduce the 
> overhead but maybe it's not translated to host vector instructions correctly. 
> The host in the above test was Intel i7-9700K. So to solve this maybe AltiVec 
> should be improved more than dcbz but I don't know what and how.

Looking at what AltiVec ops are used, there aren't many. lvx and stvx
should translate to 128 bit ops so those are probably OK, there are some
lvsl and lvsr ops which may be OK too, and the only other one left is vperm,
which seems very much unoptimised. So my guess is that vperm causes
the slowdown here (I could try profiling to confirm if needed). Is there
a way to improve that? I don't know vector support on different archs.
Maybe other archs have less general permutation ops and that's why ppc has
an unoptimised implementation, or is it possible that it just wasn't
addressed yet?
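
For reference, this is what these ops compute, written as a scalar C model
following the AltiVec definitions as I understand them. It is a sketch, not
QEMU's implementation: Vec128, model_lvsl(), model_vperm() and
model_unaligned_load() are hypothetical names, bytes are numbered in
big-endian element order, and the unaligned load idiom is what I presume
the copy routines use lvsl and vperm for.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 128 bit vector; b[0] is element 0 in big-endian numbering. */
typedef struct {
    uint8_t b[16];
} Vec128;

/* lvsl: control vector sh, sh+1, ..., sh+15 where sh = EA & 0xf. */
static Vec128 model_lvsl(uintptr_t ea)
{
    Vec128 r;
    unsigned sh = ea & 0xf;

    for (unsigned i = 0; i < 16; i++) {
        r.b[i] = (uint8_t)(sh + i);
    }
    return r;
}

/* vperm: each result byte is picked from the 32 byte concatenation a:b
 * by the low 5 bits of the matching control byte in c. */
static Vec128 model_vperm(Vec128 a, Vec128 b, Vec128 c)
{
    Vec128 r;

    for (unsigned i = 0; i < 16; i++) {
        unsigned idx = c.b[i] & 0x1f;
        r.b[i] = idx < 16 ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Unaligned 16 byte load built from two aligned loads plus a permute. */
static Vec128 model_unaligned_load(const uint8_t *mem, size_t offset)
{
    size_t base = offset & ~(size_t)0xf;
    Vec128 lo, hi;

    assert(mem != NULL);
    memcpy(lo.b, mem + base, 16);      /* lvx from the aligned address */
    memcpy(hi.b, mem + base + 16, 16); /* lvx from the next 16 bytes */
    return model_vperm(lo, hi, model_lvsl(offset));
}
```

The byte-granular, data-dependent selection in vperm is what makes it
harder to map to a single host instruction than a plain 128 bit load or
store.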

Regards,
BALATON Zoltan



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-28 10:44     ` BALATON Zoltan
@ 2025-04-28 13:26       ` BALATON Zoltan
  2025-04-28 13:47         ` Richard Henderson
  2025-04-29 15:27         ` Alex Bennée
  0 siblings, 2 replies; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-28 13:26 UTC
  To: qemu-devel, qemu-ppc; +Cc: Nicholas Piggin, Richard Henderson

On Mon, 28 Apr 2025, BALATON Zoltan wrote:
> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>> On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>>>> The test case I've used came out of a discussion about very slow
>>>> access to VRAM of a graphics card passed through with vfio the reason
>>>> for which is still not clear but it was already known that dcbz is
>>>> often used by MacOS and AmigaOS for clearing memory and to avoid
>>>> reading values about to be overwritten which is faster on real CPU but
>>>> was found to be slower on QEMU. The optimised copy routines were
>>>> posted here:
>>>> https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
>>>> and the rest of it I've written to make it a test case is here:
>>>> http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
>>>> Replace the body of has_altivec() with just "return false". Sorry for
>>>> only giving pieces but the code posted above has a copyright that does
>>>> not allow me to include it in the test. This is not measuring VRAM
>>>> access now just memory copy but shows the effect of dcbz. I've got
>>>> these results with this patch:
>>>> 
>>>> Linux user master:                  Linux user patch:
>>>> byte loop: 2.2 sec                  byte loop: 2.2 sec
>>>> memcpy: 2.19 sec                    memcpy: 2.19 sec
>>>> copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
>>>> copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
>>>> copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
>>>> copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec
>>>> 
>>>> Linux system master:                Linux system patch:
>>>> byte loop: 5.86 sec                 byte loop: 5.9 sec
>>>> memcpy: 5.45 sec                    memcpy: 5.47 sec
>>>> copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
>>>> copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
>>>> copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
>>>> copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec
>> 
>> I did some more benchmarking to identify what slows it down. I noticed that 
>> memset uses dcbz too so I added a test for that. I've also added a 
>> parameter to allow testing actual VRAM and now that I have a card working 
>> with vfio-pci passthrough I could also test that. The updated 
>> vramcopy.tar.xz is at the same URL as above. These tests were run with the 
>> amigaone machine under Linux booted as described here:
>> https://www.qemu.org/docs/master/system/ppc/amigang.html
>> 
>> I compiled the benchmark twice, once as in the tar and once replacing dcbz 
>> in the copyFromVRAM* routines with dcba (which is noop on QEMU). First two 
>> results are with both src and dst in RAM, second two tests are with dst in 
>> VRAM (mapped from phys address 0x80800000 where the card's framebuffer is 
>> mapped). The left column shows results with emulated ati-vga as in the 
>> amigang.html docs. The right column is with real ATI X550 card (old and 
>> slow but works with this old PPC Linux) passed through with vfio-pci.
>> 
>> with ati-vga                            with vfio-pci
>> 
>> src 0xb79c8008 dst 0xb78c7008	      |	src 0xb7c92008 dst 0xb7b91008
>> byte loop: 21.16 sec			byte loop: 21.16 sec
>> memset: 3.85 sec		      |	memset: 3.87 sec
>> memcpy: 5.07 sec			memcpy: 5.07 sec
>> copyToVRAMNoAltivec: 2.52 sec	      |	copyToVRAMNoAltivec: 2.53 sec
>> copyToVRAMAltivec: 2.42 sec	      |	copyToVRAMAltivec: 2.37 sec
>> copyFromVRAMNoAltivec: 6.39 sec	      |	copyFromVRAMNoAltivec: 6.38 sec
>> copyFromVRAMAltivec: 7.02 sec	      |	copyFromVRAMAltivec: 7 sec
>> 
>> using dcba instead of dcbz	      |	using dcba instead of dcbz
>> src 0xb7b69008 dst 0xb7a68008	      |	src 0xb7c44008 dst 0xb7b43008
>> byte loop: 21.14 sec			byte loop: 21.14 sec
>> memset: 3.85 sec		      |	memset: 3.88 sec
>> memcpy: 5.06 sec		      |	memcpy: 5.07 sec
>> copyToVRAMNoAltivec: 2.53 sec	      |	copyToVRAMNoAltivec: 2.52 sec
>> copyToVRAMAltivec: 2.3 sec		copyToVRAMAltivec: 2.3 sec
>> copyFromVRAMNoAltivec: 2.59 sec		copyFromVRAMNoAltivec: 2.59 sec
>> copyFromVRAMAltivec: 2.95 sec		copyFromVRAMAltivec: 2.95 sec
>> 
>> dst in emulated ati-vga		      |	dst in real card vfio vram
>> mapping 0x80800000			mapping 0x80800000
>> src 0xb78e0008 dst 0xb77de000	      |	src 0xb7ec5008 dst 0xb7dc3000
>> byte loop: 21.2 sec		      |	byte loop: 563.98 sec
>> memset: 3.89 sec		      |	memset: 39.25 sec
>> memcpy: 5.07 sec		      |	memcpy: 140.49 sec
>> copyToVRAMNoAltivec: 2.53 sec	      |	copyToVRAMNoAltivec: 72.03 sec
>> copyToVRAMAltivec: 12.22 sec	      |	copyToVRAMAltivec: 78.12 sec
>> copyFromVRAMNoAltivec: 6.43 sec	      |	copyFromVRAMNoAltivec: 728.52 sec
>> copyFromVRAMAltivec: 35.33 sec	      |	copyFromVRAMAltivec: 754.95 sec
>> 
>> dst in emulated ati-vga using dcba    |	dst in real card vfio vram using dcba
>> mapping 0x80800000			mapping 0x80800000
>> src 0xb7ba7008 dst 0xb7aa5000	      |	src 0xb77f4008 dst 0xb76f2000
>> byte loop: 21.15 sec		      |	byte loop: 577.42 sec
>> memset: 3.85 sec		      |	memset: 39.52 sec
>> memcpy: 5.06 sec		      |	memcpy: 142.8 sec
>> copyToVRAMNoAltivec: 2.53 sec	      |	copyToVRAMNoAltivec: 71.71 sec
>> copyToVRAMAltivec: 12.2 sec	      |	copyToVRAMAltivec: 78.09 sec
>> copyFromVRAMNoAltivec: 2.6 sec	      |	copyFromVRAMNoAltivec: 727.23 sec
>> copyFromVRAMAltivec: 35.03 sec	      |	copyFromVRAMAltivec: 753.15 sec
>> 
>> The results show that dcbz has some effect but an even bigger slow down is 
>> caused by using AltiVec which is supposed to do wider access to reduce the 
>> overhead but maybe it's not translated to host vector instructions 
>> correctly. The host in the above test was Intel i7-9700K. So to solve this 
>> maybe AltiVec should be improved more than dcbz but I don't know what and 
>> how.
>
> Looking at what AltiVec ops are used there aren't many. lvx and stvx should 
> translate to 128 bit ops so those are probably ok, there are some lvsl lvsr 
> ops which may be ok too and the only other one left is vperm which seems very 
> much unoptimised, so my guess is likely that vperm causes the slow down here 
> (I could try profiling to confirm if needed). Is there a way to improve that?

I have tried profiling the dst in real card vfio vram with dcbz case (with 
100 iterations instead of 10000 in above tests) but I'm not sure I 
understand the results. vperm and dcbz show up but not too high. Can 
somebody explain what is happening here and where the overhead likely 
comes from? Here is the profile result I got:

Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
   Children      Self  Command          Shared Object            Symbol
-   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
    - 98.49% cpu_exec_loop
       - 98.48% cpu_tb_exec
          - 90.95% 0x7f4e705d8f15
               helper_ldub_mmu
               do_ld_mmio_beN
             - cpu_io_recompile
                - 45.79% cpu_loop_exit_noexc
                   - cpu_loop_exit
                     __longjmp_chk
                     cpu_exec_setjmp
                   - cpu_exec_loop
                      - 45.78% cpu_tb_exec
                           42.35% 0x7f4e6f3f0000
                         - 0.72% 0x7f4e99f37037
                              helper_VPERM
                         - 0.68% 0x7f4e99f3716d
                              helper_VPERM
                - 45.16% rr_cpu_thread_fn
                   - 45.16% tcg_cpu_exec
                      - 45.15% cpu_exec
                         - 45.15% cpu_exec_setjmp
                            - cpu_exec_loop
                               - 45.14% cpu_tb_exec
                                    42.08% 0x7f4e6f3f0000
                                  - 0.72% 0x7f4e99f37037
                                       helper_VPERM
                                  - 0.67% 0x7f4e99f3716d
                                       helper_VPERM
          + 2.40% 0x7f4e74e85bae
          + 2.15% 0x7f4e7060a2dc
          + 0.99% 0x7f4e73d93781
+   99.32%     0.37%  qemu-system-ppc  qemu-system-ppc          [.] cpu_tb_exec
+   98.73%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_setjmp
-   94.11%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] cpu_io_recompile
    - 94.11% cpu_io_recompile
       - 89.79% rr_cpu_thread_fn
          - 89.78% tcg_cpu_exec
             - 89.78% cpu_exec
                  cpu_exec_setjmp
                - cpu_exec_loop
                   - 89.78% cpu_tb_exec
                      - 88.40% 0x7f4e705d8f15
                           helper_ldub_mmu
                           do_ld_mmio_beN
                         - cpu_io_recompile
                            - 44.47% cpu_loop_exit_noexc
                               - cpu_loop_exit
                                 __longjmp_chk
                                 cpu_exec_setjmp
                               - cpu_exec_loop
                                  - 44.46% cpu_tb_exec
                                       41.22% 0x7f4e6f3f0000
                                     - 0.70% 0x7f4e99f37037
                                          helper_VPERM
                                     - 0.67% 0x7f4e99f3716d
                                          helper_VPERM
                            - 43.94% rr_cpu_thread_fn
                               - 43.93% tcg_cpu_exec
                                  - cpu_exec
                                     - 43.93% cpu_exec_setjmp
                                        - cpu_exec_loop
                                           - 43.90% cpu_tb_exec
                                                40.95% 0x7f4e6f3f0000
                                              - 0.71% 0x7f4e99f37037
                                                   helper_VPERM
                                              - 0.66% 0x7f4e99f3716d
                                                   helper_VPERM
                        1.23% 0x7f4e6f3f0000
       + 4.32% cpu_loop_exit_noexc
+   91.90%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec
+   91.90%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] tcg_cpu_exec
+   91.88%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] rr_cpu_thread_fn
+   91.12%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] helper_ldub_mmu
+   91.12%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] do_ld_mmio_beN
-   91.10%     0.00%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e705d8f15
      0x7f4e705d8f15
      helper_ldub_mmu
      do_ld_mmio_beN
    - cpu_io_recompile
       - 45.93% cpu_loop_exit_noexc
          - cpu_loop_exit
            __longjmp_chk
            cpu_exec_setjmp
          - cpu_exec_loop
             - 45.92% cpu_tb_exec
                  42.35% 0x7f4e6f3f0000
                - 0.72% 0x7f4e99f37037
                     helper_VPERM
                - 0.68% 0x7f4e99f3716d
                     helper_VPERM
       - 45.18% rr_cpu_thread_fn
          - 45.17% tcg_cpu_exec
             - 45.17% cpu_exec
                - 45.17% cpu_exec_setjmp
                   - cpu_exec_loop
                      - 45.14% cpu_tb_exec
                           42.08% 0x7f4e6f3f0000
                         - 0.72% 0x7f4e99f37037
                              helper_VPERM
                         - 0.67% 0x7f4e99f3716d
                              helper_VPERM
+   88.80%     0.00%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e6f3f0000
+   53.56%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] cpu_loop_exit
+   53.56%     0.00%  qemu-system-ppc  libc.so.6                [.] __longjmp_chk
+   48.82%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] cpu_loop_exit_noexc
+    7.41%     7.41%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ef5c69
+    6.89%     6.89%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99f3c0a2
+    6.37%     6.37%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ef5d47
+    6.33%     6.33%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ef5b9b
+    6.21%     6.21%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ef5cdc
+    5.78%     5.78%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99f3c0a8
+    5.60%     5.60%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99f3bdd1
+    5.55%     5.55%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99f3bdd7
+    5.43%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] cpu_loop_exit_restore
+    5.32%     5.32%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ea5beb
+    5.30%     5.30%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ea5be5
+    4.82%     4.82%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ea5bd5
+    4.78%     4.78%  qemu-system-ppc  [JIT] tid 4074           [.] 0x00007f4e99ea5bdd
+    4.68%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] helper_raise_exception_err
-    3.99%     3.97%  qemu-system-ppc  qemu-system-ppc          [.] helper_VPERM
      3.72% do_ld_mmio_beN
         cpu_io_recompile
         rr_cpu_thread_fn
         tcg_cpu_exec
         cpu_exec
         cpu_exec_setjmp
         cpu_exec_loop
         cpu_tb_exec
         0x7f4e705d8f15
         helper_ldub_mmu
         do_ld_mmio_beN
       - cpu_io_recompile
          - 1.90% cpu_loop_exit_noexc
               cpu_loop_exit
               __longjmp_chk
               cpu_exec_setjmp
               cpu_exec_loop
             - cpu_tb_exec
                - 0.69% 0x7f4e99f37037
                     helper_VPERM
                - 0.66% 0x7f4e99f3716d
                     helper_VPERM
          - 1.82% rr_cpu_thread_fn
               tcg_cpu_exec
               cpu_exec
               cpu_exec_setjmp
               cpu_exec_loop
             - cpu_tb_exec
                - 0.70% 0x7f4e99f37037
                     helper_VPERM
                - 0.65% 0x7f4e99f3716d
                     helper_VPERM
+    3.65%     0.00%  qemu-system-ppc  qemu-system-ppc          [.] helper_raise_exception
+    3.51%     0.82%  qemu-system-ppc  qemu-system-ppc          [.] helper_lookup_tb_ptr
[...]
-    1.71%     1.52%  qemu-system-ppc  qemu-system-ppc          [.] probe_access
      1.30% do_ld_mmio_beN
         cpu_io_recompile
         rr_cpu_thread_fn
         tcg_cpu_exec
         cpu_exec
         cpu_exec_setjmp
         cpu_exec_loop
         cpu_tb_exec
         0x7f4e705d8f15
         helper_ldub_mmu
         do_ld_mmio_beN
       - cpu_io_recompile
          - 0.66% cpu_loop_exit_noexc
               cpu_loop_exit
               __longjmp_chk
               cpu_exec_setjmp
               cpu_exec_loop
               cpu_tb_exec
          - 0.64% rr_cpu_thread_fn
               tcg_cpu_exec
               cpu_exec
               cpu_exec_setjmp
               cpu_exec_loop
               cpu_tb_exec
-    1.64%     0.05%  qemu-system-ppc  qemu-system-ppc          [.] helper_dcbz
    - 1.58% helper_dcbz
         probe_access

Regards,
BALATON Zoltan



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-28 13:26       ` BALATON Zoltan
@ 2025-04-28 13:47         ` Richard Henderson
  2025-04-29 14:40           ` BALATON Zoltan
  2025-04-29 15:27         ` Alex Bennée
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2025-04-28 13:47 UTC (permalink / raw)
  To: BALATON Zoltan, qemu-devel, qemu-ppc; +Cc: Nicholas Piggin

On 4/28/25 06:26, BALATON Zoltan wrote:
> I have tried profiling the dst in real card vfio vram with dcbz case (with 100 iterations 
> instead of 10000 in above tests) but I'm not sure I understand the results. vperm and dcbz 
> show up but not too high. Can somebody explain what is happening here and where the 
> overhead likely comes from? Here is the profile result I got:
> 
> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>    Children      Self  Command          Shared Object            Symbol
> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>     - 98.49% cpu_exec_loop
>        - 98.48% cpu_tb_exec
>           - 90.95% 0x7f4e705d8f15
>                helper_ldub_mmu
>                do_ld_mmio_beN
>              - cpu_io_recompile
>                 - 45.79% cpu_loop_exit_noexc

I think the real problem is the number of loop exits due to i/o.  If I'm reading this 
rightly, 45% of execution is in cpu_io_recompile.

I/O can only happen as the last insn of a translation block.  When we detect that it has 
happened in the middle of a translation block, we abort the block, compile a new one, and 
restart execution.

Where this becomes a bottleneck is when this same translation block is in a loop.  Exactly 
this case of memset/memcpy of VRAM.  This could be addressed by invalidating the previous 
translation block and creating a new one which always ends with the i/o.
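
A toy model of the loop bottleneck described above, as a sketch only:
ToyBlock, io_allowed and count_restarts are made-up names and none of this
is QEMU code. It assumes that an I/O insn in the middle of a block forces
one abort-and-restart per pass, and compares keeping the block as it is
with replacing it by one that ends at the I/O.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names are QEMU APIs. */
typedef struct {
    int n_insns;    /* number of instructions in the block */
    int io_index;   /* index of the insn that touches device memory */
} ToyBlock;

/* An I/O insn may complete only if it is the last insn of its block. */
static bool io_allowed(const ToyBlock *tb)
{
    return tb->io_index == tb->n_insns - 1;
}

/*
 * Run a guest loop of `iterations` passes over `tb` and return how many
 * times execution had to be aborted and restarted. If `replace_block` is
 * true, the first abort shortens the block so that it ends at the I/O insn.
 */
static long count_restarts(ToyBlock tb, long iterations, bool replace_block)
{
    long restarts = 0;

    assert(tb.io_index >= 0 && tb.io_index < tb.n_insns);
    for (long i = 0; i < iterations; i++) {
        if (!io_allowed(&tb)) {
            restarts++;
            if (replace_block) {
                tb.n_insns = tb.io_index + 1;   /* block now ends at the I/O */
            }
        }
    }
    return restarts;
}
```

With an 8-insn block whose I/O is at index 3 and 10000 passes, the first
policy restarts on every pass and the second restarts only once.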


r~



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-28 13:47         ` Richard Henderson
@ 2025-04-29 14:40           ` BALATON Zoltan
  2025-04-29 16:04             ` Alex Bennée
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-29 14:40 UTC (permalink / raw)
  To: Richard Henderson; +Cc: qemu-devel, qemu-ppc, Nicholas Piggin

On Mon, 28 Apr 2025, Richard Henderson wrote:
> On 4/28/25 06:26, BALATON Zoltan wrote:
>> I have tried profiling the dst in real card vfio vram with dcbz case (with 
>> 100 iterations instead of 10000 in above tests) but I'm not sure I 
>> understand the results. vperm and dcbz show up but not too high. Can 
>> somebody explain what is happening here and where the overhead likely comes 
>> from? Here is the profile result I got:
>> 
>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>    Children      Self  Command          Shared Object            Symbol
>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>     - 98.49% cpu_exec_loop
>>        - 98.48% cpu_tb_exec
>>           - 90.95% 0x7f4e705d8f15
>>                helper_ldub_mmu
>>                do_ld_mmio_beN
>>              - cpu_io_recompile
>>                 - 45.79% cpu_loop_exit_noexc
>
> I think the real problem is the number of loop exits due to i/o.  If I'm 
> reading this rightly, 45% of execution is in cpu_io_recompile.
>
> I/O can only happen as the last insn of a translation block.

I'm not sure I understand this. A comment above cpu_io_recompile says "In 
deterministic execution mode, instructions doing device I/Os must be at 
the end of the TB." Is that wrong? Otherwise shouldn't this only apply if 
running with icount or something like that?

> When we detect 
> that it has happened in the middle of a translation block, we abort the 
> block, compile a new one, and restart execution.

Where does that happen? The calls of cpu_io_recompile in this case seem to 
come from io_prepare which is called from do_ld16_mmio_beN if 
(!cpu->neg.can_do_io) but I don't see how can_do_io is set.

> Where this becomes a bottleneck is when this same translation block is in a 
> loop.  Exactly this case of memset/memcpy of VRAM.  This could be addressed 
> by invalidating the previous translation block and creating a new one which 
> always ends with the i/o.

And where to do that? cpu_io_recompile just exits the TB but what 
generates the new TB? I need some more clues to understand how to do 
this.

Regards,
BALATON Zoltan


* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-28 13:26       ` BALATON Zoltan
  2025-04-28 13:47         ` Richard Henderson
@ 2025-04-29 15:27         ` Alex Bennée
  2025-04-29 17:11           ` BALATON Zoltan
  2025-04-29 17:30           ` Richard Henderson
  1 sibling, 2 replies; 20+ messages in thread
From: Alex Bennée @ 2025-04-29 15:27 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: qemu-devel, qemu-ppc, Nicholas Piggin, Richard Henderson

BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>>> On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>>>>> The test case I've used came out of a discussion about very slow
>>>>> access to VRAM of a graphics card passed through with vfio the reason
>>>>> for which is still not clear but it was already known that dcbz is
>>>>> often used by MacOS and AmigaOS for clearing memory and to avoid
>>>>> reading values about to be overwritten which is faster on real CPU but
>>>>> was found to be slower on QEMU. The optimised copy routines were
>>>>> posted here:
<snip>
>
> I have tried profiling the dst in real card vfio vram with dcbz case
> (with 100 iterations instead of 10000 in above tests) but I'm not sure
> I understand the results. vperm and dcbz show up but not too high. Can
> somebody explain what is happening here and where the overhead likely
> comes from? Here is the profile result I got:
>
> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>   Children      Self  Command          Shared Object            Symbol
> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>    - 98.49% cpu_exec_loop
>       - 98.48% cpu_tb_exec
>          - 90.95% 0x7f4e705d8f15
>               helper_ldub_mmu
>               do_ld_mmio_beN
>             - cpu_io_recompile

This looks like the dcbz instructions are being used to clear device
memory and tripping over the can_do_io check (normally the translator
tries to ensure all device access is at the end of a block).

You could try ending the block on dcbz instructions and seeing if that
helps. Normally I would expect the helper to be more efficient as it can
probe the whole address range once and then use host insns to blat the
memory.
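
To illustrate what "probe once, then blat" could look like, here is a
minimal sketch. It is not the QEMU helper: toy_zero_line and TOY_LINE_SIZE
are hypothetical names, guest memory is just a host byte array, and the
32 byte line size is an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_LINE_SIZE 32u   /* assumed cache line size, a power of two */

/*
 * Toy stand-in for a dcbz-style helper: check the whole line once, then
 * clear it with a single host memset.
 * Returns 0 on success, -1 if the line does not fit inside the buffer.
 */
static int toy_zero_line(uint8_t *ram, size_t ram_size, size_t addr)
{
    /* Round the address down to the start of its cache line. */
    size_t line = addr & ~(size_t)(TOY_LINE_SIZE - 1);

    assert(ram != NULL);
    /* One range check stands in for a single probe of the whole line. */
    if (line >= ram_size || ram_size - line < TOY_LINE_SIZE) {
        return -1;
    }
    memset(ram + line, 0, TOY_LINE_SIZE);
    return 0;
}
```

Any address inside a line clears that whole line, so address 40 in a
96 byte buffer zeroes bytes 32 to 63 and leaves the rest untouched.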

>                - 45.79% cpu_loop_exit_noexc
>                   - cpu_loop_exit
>                     __longjmp_chk
>                     cpu_exec_setjmp
>                   - cpu_exec_loop
>                      - 45.78% cpu_tb_exec
>                           42.35% 0x7f4e6f3f0000
>                         - 0.72% 0x7f4e99f37037
>                              helper_VPERM
>                         - 0.68% 0x7f4e99f3716d
>                              helper_VPERM
>                - 45.16% rr_cpu_thread_fn

Hmm you seem to be running in icount mode here for some reason.

>                   - 45.16% tcg_cpu_exec
>                      - 45.15% cpu_exec
>                         - 45.15% cpu_exec_setjmp
>                            - cpu_exec_loop
>                               - 45.14% cpu_tb_exec
>                                    42.08% 0x7f4e6f3f0000
>                                  - 0.72% 0x7f4e99f37037
>                                       helper_VPERM
>                                  - 0.67% 0x7f4e99f3716d
>                                       helper_VPERM
<snip>

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 14:40           ` BALATON Zoltan
@ 2025-04-29 16:04             ` Alex Bennée
  2025-04-29 17:14               ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Bennée @ 2025-04-29 16:04 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Richard Henderson, qemu-devel, qemu-ppc, Nicholas Piggin

BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Mon, 28 Apr 2025, Richard Henderson wrote:
>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>> I have tried profiling the dst in real card vfio vram with dcbz
>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>> not sure I understand the results. vperm and dcbz show up but not
>>> too high. Can somebody explain what is happening here and where the
>>> overhead likely comes from? Here is the profile result I got:
>>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>>    Children      Self  Command          Shared Object            Symbol
>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>>     - 98.49% cpu_exec_loop
>>>        - 98.48% cpu_tb_exec
>>>           - 90.95% 0x7f4e705d8f15
>>>                helper_ldub_mmu
>>>                do_ld_mmio_beN
>>>              - cpu_io_recompile
>>>                 - 45.79% cpu_loop_exit_noexc
>>
>> I think the real problem is the number of loop exits due to i/o.  If
>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>
>> I/O can only happen as the last insn of a translation block.
>
> I'm not sure I understand this. A comment above cpu_io_recompile says
> "In deterministic execution mode, instructions doing device I/Os must
> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
> apply if running with icount or something like that?

That comment should be fixed. It used to only be the case for icount
mode but there was another race bug that meant we need to honour device
access as the last insn for both modes.

>
>> When we detect that it has happened in the middle of a translation
>> block, we abort the block, compile a new one, and restart execution.
>
> Where does that happen? The calls of cpu_io_recompile in this case
> seem to come from io_prepare which is called from do_ld16_mmio_beN if
> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.

Inline by set_can_do_io()

>> Where this becomes a bottleneck is when this same translation block
>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>> could be addressed by invalidating the previous translation block
>> and creating a new one which always ends with the i/o.
>
> And where to do that? cpu_io_recompile just exits the TB but what
> generates the new TB? I need some more clues to understand how to do
> this.

  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;

sets the cflags for the next TB, which the lookup will typically fail to
find, so the TB is then regenerated. Normally cflags_next_tb is empty.
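
As a sketch of why changing the cflags leads to a lookup miss, assume a
cache that is keyed on both the guest pc and the cflags. ToyTB, ToyCache,
toy_lookup and toy_find_or_generate are hypothetical names for this
illustration and do not describe QEMU's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SLOTS 16

/* Toy translation block: identified by guest pc and compile flags. */
typedef struct {
    uint32_t pc;
    uint32_t cflags;
} ToyTB;

typedef struct {
    ToyTB slots[TOY_CACHE_SLOTS];
    int used;        /* number of filled slots */
    int generated;   /* how many blocks had to be generated */
} ToyCache;

/* Return the cached block matching both pc and cflags, or NULL. */
static ToyTB *toy_lookup(ToyCache *c, uint32_t pc, uint32_t cflags)
{
    for (int i = 0; i < c->used; i++) {
        if (c->slots[i].pc == pc && c->slots[i].cflags == cflags) {
            return &c->slots[i];
        }
    }
    return NULL;
}

/* Look the block up; on a miss, "generate" it and add it to the cache. */
static ToyTB *toy_find_or_generate(ToyCache *c, uint32_t pc, uint32_t cflags)
{
    ToyTB *tb = toy_lookup(c, pc, cflags);

    if (tb == NULL) {
        assert(c->used < TOY_CACHE_SLOTS);
        tb = &c->slots[c->used++];
        tb->pc = pc;
        tb->cflags = cflags;
        c->generated++;
    }
    return tb;
}
```

In this model a block that already exists for a pc is not found when the
same pc is requested with different cflags, so a second block is generated
for it.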

>
> Regards,
> BALATON Zoltan

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 15:27         ` Alex Bennée
@ 2025-04-29 17:11           ` BALATON Zoltan
  2025-04-29 17:30           ` Richard Henderson
  1 sibling, 0 replies; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-29 17:11 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, qemu-ppc, Nicholas Piggin, Richard Henderson

On Tue, 29 Apr 2025, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>>> On Mon, 28 Apr 2025, BALATON Zoltan wrote:
>>>> On Thu, 24 Apr 2025, BALATON Zoltan wrote:
>>>>>> The test case I've used came out of a discussion about very slow
>>>>>> access to VRAM of a graphics card passed through with vfio the reason
>>>>>> for which is still not clear but it was already known that dcbz is
>>>>>> often used by MacOS and AmigaOS for clearing memory and to avoid
>>>>>> reading values about to be overwritten which is faster on real CPU but
>>>>>> was found to be slower on QEMU. The optimised copy routines were
>>>>>> posted here:
> <snip>
>>
>> I have tried profiling the dst in real card vfio vram with dcbz case
>> (with 100 iterations instead of 10000 in above tests) but I'm not sure
>> I understand the results. vperm and dcbz show up but not too high. Can
>> somebody explain what is happening here and where the overhead likely
>> comes from? Here is the profile result I got:
>>
>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>   Children      Self  Command          Shared Object            Symbol
>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>    - 98.49% cpu_exec_loop
>>       - 98.48% cpu_tb_exec
>>          - 90.95% 0x7f4e705d8f15
>>               helper_ldub_mmu
>>               do_ld_mmio_beN
>>             - cpu_io_recompile
>
> This looks like the dcbz instructions are being used to clear device
> memory and tripping over the can_do_io check (normally the translator
> tries to ensure all device access is at the end of a block).

If you look at the benchmark results I posted earlier in this thread in 
https://lists.nongnu.org/archive/html/qemu-ppc/2025-04/msg00326.html
I also tried using dcba instead of dcbz in the CopyFromVRAM* functions but 
that helped only a little, so I'm not sure it's because of dcbz. Then I 
thought it might be VPERM but the NoAltivec variants are also only a 
little faster. It could be that using 64 bit access instead of 128 bit 
(the NoAltivec functions use FPU regs) makes it slower while avoiding 
VPERM makes it faster, and the two cancel each other out. But the profile 
also shows VPERM not high, and somebody else tested this with -cpu g3 and 
got only a 1% faster result, so maybe it's not primarily because of VPERM 
either and there's a bigger overhead before these.

> You could try ending the block on dcbz instructions and seeing if that
> helps. Normally I would expect the helper to be more efficient as it can
> probe the whole address range once and then use host insns to blat the
> memory.

Maybe I could try that if I can do that the same way as done in 
io_prepare.

>>                - 45.79% cpu_loop_exit_noexc
>>                   - cpu_loop_exit
>>                     __longjmp_chk
>>                     cpu_exec_setjmp
>>                   - cpu_exec_loop
>>                      - 45.78% cpu_tb_exec
>>                           42.35% 0x7f4e6f3f0000
>>                         - 0.72% 0x7f4e99f37037
>>                              helper_VPERM
>>                         - 0.68% 0x7f4e99f3716d
>>                              helper_VPERM
>>                - 45.16% rr_cpu_thread_fn
>
> Hmm you seem to be running in icount mode here for some reason.

No idea why. I had no such options and compiled without --enable-debug, 
with nothing special on the QEMU command line, just default options. How 
can I check if icount is enabled? Can profiling with the perf tool 
interfere? I thought that it only reads CPU performance counters and does 
not attach to the process otherwise.

Regards,
BALATON Zoltan

>>                   - 45.16% tcg_cpu_exec
>>                      - 45.15% cpu_exec
>>                         - 45.15% cpu_exec_setjmp
>>                            - cpu_exec_loop
>>                               - 45.14% cpu_tb_exec
>>                                    42.08% 0x7f4e6f3f0000
>>                                  - 0.72% 0x7f4e99f37037
>>                                       helper_VPERM
>>                                  - 0.67% 0x7f4e99f3716d
>>                                       helper_VPERM
> <snip>
>
>


* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 16:04             ` Alex Bennée
@ 2025-04-29 17:14               ` BALATON Zoltan
  2025-04-29 17:58                 ` Alex Bennée
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-29 17:14 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Richard Henderson, qemu-devel, qemu-ppc, Nicholas Piggin

On Tue, 29 Apr 2025, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>> I have tried profiling the dst in real card vfio vram with dcbz
>>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>>> not sure I understand the results. vperm and dcbz show up but not
>>>> too high. Can somebody explain what is happening here and where the
>>>> overhead likely comes from? Here is the profile result I got:
>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>>>    Children      Self  Command          Shared Object            Symbol
>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>>>     - 98.49% cpu_exec_loop
>>>>        - 98.48% cpu_tb_exec
>>>>           - 90.95% 0x7f4e705d8f15
>>>>                helper_ldub_mmu
>>>>                do_ld_mmio_beN
>>>>              - cpu_io_recompile
>>>>                 - 45.79% cpu_loop_exit_noexc
>>>
>>> I think the real problem is the number of loop exits due to i/o.  If
>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>
>>> I/O can only happen as the last insn of a translation block.
>>
>> I'm not sure I understand this. A comment above cpu_io_recompile says
>> "In deterministic execution mode, instructions doing device I/Os must
>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>> apply if running with icount or something like that?
>
> That comment should be fixed. It used to only be the case for icount
> mode but there was another race bug that meant we need to honour device
> access as the last insn for both modes.
>
>>
>>> When we detect that it has happened in the middle of a translation
>>> block, we abort the block, compile a new one, and restart execution.
>>
>> Where does that happen? The calls of cpu_io_recompile in this case
>> seem to come from io_prepare which is called from do_ld16_mmio_beN if
>> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.
>
> Inline by set_can_do_io()

That one I've found but I don't know where the cpu_loop_exit at the end 
of cpu_io_recompile returns to.

>>> Where this becomes a bottleneck is when this same translation block
>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>> could be addressed by invalidating the previous translation block
>>> and creating a new one which always ends with the i/o.
>>
>> And where to do that? cpu_io_recompile just exits the TB but what
>> generates the new TB? I need some more clues to understand how to do
>> this.
>
>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>
> sets the cflags for the next TB, which the lookup will typically fail to
> find, so the TB is then regenerated. Normally cflags_next_tb is empty.

Shouldn't this only regenerate the next TB on the first loop iteration and 
not afterwards?

Regards,
BALATON Zoltan


* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 15:27         ` Alex Bennée
  2025-04-29 17:11           ` BALATON Zoltan
@ 2025-04-29 17:30           ` Richard Henderson
  2025-04-29 18:00             ` Alex Bennée
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2025-04-29 17:30 UTC (permalink / raw)
  To: Alex Bennée, BALATON Zoltan; +Cc: qemu-devel, qemu-ppc, Nicholas Piggin

On 4/29/25 08:27, Alex Bennée wrote:
>>                 - 45.16% rr_cpu_thread_fn
> 
> Hmm you seem to be running in icount mode here for some reason.

For some reason ppc32 does not enable mttcg.
I'm not sure what's missing to enable it properly.


r~



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 17:14               ` BALATON Zoltan
@ 2025-04-29 17:58                 ` Alex Bennée
  2025-04-29 21:09                   ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Bennée @ 2025-04-29 17:58 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Richard Henderson, qemu-devel, qemu-ppc, Nicholas Piggin

BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Tue, 29 Apr 2025, Alex Bennée wrote:
>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>>> I have tried profiling the dst in real card vfio vram with dcbz
>>>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>>>> not sure I understand the results. vperm and dcbz show up but not
>>>>> too high. Can somebody explain what is happening here and where the
>>>>> overhead likely comes from? Here is the profile result I got:
>>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.): 122371086557
>>>>>    Children      Self  Command          Shared Object            Symbol
>>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.] cpu_exec_loop
>>>>>     - 98.49% cpu_exec_loop
>>>>>        - 98.48% cpu_tb_exec
>>>>>           - 90.95% 0x7f4e705d8f15
>>>>>                helper_ldub_mmu
>>>>>                do_ld_mmio_beN
>>>>>              - cpu_io_recompile
>>>>>                 - 45.79% cpu_loop_exit_noexc
>>>>
>>>> I think the real problem is the number of loop exits due to i/o.  If
>>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>>
>>>> I/O can only happen as the last insn of a translation block.
>>>
>>> I'm not sure I understand this. A comment above cpu_io_recompile says
>>> "In deterministic execution mode, instructions doing device I/Os must
>>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>>> apply if running with icount or something like that?
>>
>> That comment should be fixed. It used to only be the case for icount
>> mode but there was another race bug that meant we need to honour device
>> access as the last insn for both modes.
>>
>>>
>>>> When we detect that it has happened in the middle of a translation
>>>> block, we abort the block, compile a new one, and restart execution.
>>>
>>> Where does that happen? The calls of cpu_io_recompile in this case
>>> seem to come from io_prepare which is called from do_ld16_mmio_beN if
>>> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.
>>
>> Inline by set_can_do_io()
>
> That one I've found but don't know where the cpu_loop_exit returns
> from the end of cpu_io_recompile.

cpu_loop_exit longjmp's back to the top of the execution loop.
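
As a rough illustration of that unwinding (a toy sketch only, not QEMU
code: every toy_* name below is made up for this example, and plain
setjmp/longjmp stands in for whatever the real loop uses), each access
that is not yet allowed to do I/O costs one trip back to the top of the
loop:

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical stand-ins for per-CPU state; none of these are QEMU names. */
static jmp_buf toy_jmp_env;   /* jump target at the top of the exec loop */
static int toy_can_do_io;     /* nonzero: the next access may do I/O */
static int toy_exits;         /* how many times we unwound to the top */
static int toy_done;          /* how many accesses actually completed */

/* Plays the role of cpu_loop_exit(): never returns to its caller. */
static void toy_loop_exit(void)
{
    toy_exits++;
    longjmp(toy_jmp_env, 1);
}

/* An access that hits a device while I/O is not yet permitted. */
static void toy_mmio_access(void)
{
    if (!toy_can_do_io) {
        toy_can_do_io = 1;    /* the retry will be allowed to do the I/O */
        toy_loop_exit();      /* unwinds; execution resumes after setjmp() */
    }
    toy_done++;
    toy_can_do_io = 0;        /* the following access starts mid-block again */
}

/* Returns the number of loop exits needed to complete 'accesses' accesses. */
int toy_exec_loop(int accesses)
{
    assert(accesses >= 0);
    toy_can_do_io = 0;
    toy_exits = 0;
    toy_done = 0;

    setjmp(toy_jmp_env);      /* top of the execution loop */
    while (toy_done < accesses) {
        toy_mmio_access();
    }
    return toy_exits;
}
```

In this model toy_exec_loop(n) reports n exits for n accesses, which is
the one-exit-per-access pattern under discussion.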

>
>>>> Where this becomes a bottleneck is when this same translation block
>>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>>> could be addressed by invalidating the previous translation block
>>>> and creating a new one which always ends with the i/o.
>>>
>>> And where to do that? cpu_io_recompile just exits the TB but what
>>> generates the new TB? I need some more clues to understands how to do
>>> this.
>>
>>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>>
>> sets the cflags for the next cb, which typically will fail to find and
>> then regenerate. Normally cflags_next_tb is empty.
>
> Shouldn't this only regenerate the next TB on the first loop iteration
> and not afterwards?

If we've been here before (needing n insns from the base addr) we will
have a cached translation we can re-use. That doesn't stop the longer TB
being called again as we re-enter the loop.

>
> Regards,
> BALATON Zoltan

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 17:30           ` Richard Henderson
@ 2025-04-29 18:00             ` Alex Bennée
  2025-04-29 20:51               ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Bennée @ 2025-04-29 18:00 UTC (permalink / raw)
  To: Richard Henderson; +Cc: BALATON Zoltan, qemu-devel, qemu-ppc, Nicholas Piggin

Richard Henderson <richard.henderson@linaro.org> writes:

> On 4/29/25 08:27, Alex Bennée wrote:
>>>                 - 45.16% rr_cpu_thread_fn
>> Hmm you seem to be running in icount mode here for some reason.
>
> For some reason ppc32 does not enable mttcg.
> I'm not sure what's missing to enable it properly.

I seem to recall it may have been reverted due to instability but I
can't find the commit.

>
>
> r~

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 18:00             ` Alex Bennée
@ 2025-04-29 20:51               ` BALATON Zoltan
  0 siblings, 0 replies; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-29 20:51 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Richard Henderson, qemu-devel, qemu-ppc, Nicholas Piggin

On Tue, 29 Apr 2025, Alex Bennée wrote:
> Richard Henderson <richard.henderson@linaro.org> writes:
>
>> On 4/29/25 08:27, Alex Bennée wrote:
>>>>                 - 45.16% rr_cpu_thread_fn
>>> Hmm you seem to be running in icount mode here for some reason.
>>
>> For some reason ppc32 does not enable mttcg.
>> I'm not sure what's missing to enable it properly.
>
> I seem to recall it may have been reverted due to instability but I
> can't find the commit.

Or maybe it was never enabled? We've recently tried mttcg with the G4
mac99 machine and it seems to work, but the needed patches have not been
cleaned up for upstream yet, so they are using a fork for that now. But
that's a digression.

I've tried to rerun the benchmark with qemu-system-ppc64 instead of
qemu-system-ppc (no other change in the command) and it did not seem to
help much; it's still slow. Here's the profile:

   Children      Self  Command          Shared Object            Symbol
-   99.42%     0.78%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_exec_loop
    - 99.32% cpu_exec_loop
       - 99.32% cpu_tb_exec
          - 91.29% 0x7f25d079f8b4
               helper_ldub_mmu
               do_ld_mmio_beN
             - cpu_io_recompile
                - 49.05% mttcg_cpu_thread_fn
                   - 49.05% tcg_cpu_exec
                      - 49.05% cpu_exec
                         - 49.04% cpu_exec_setjmp
                            - cpu_exec_loop
                               - 49.03% cpu_tb_exec
                                    38.92% 0x7f25cf3f0000
                                  - 0.63% 0x7f25fe78bd93
                                       helper_VPERM
                                  - 0.61% 0x7f25fe78bed8
                                       helper_VPERM
                - 42.24% cpu_loop_exit_noexc
                     cpu_loop_exit
                     __longjmp_chk
                     cpu_exec_setjmp
                   - cpu_exec_loop
                      - 42.23% cpu_tb_exec
                           38.67% 0x7f25cf3f0000
                         - 0.62% 0x7f25fe78bd93
                              helper_VPERM
                         - 0.60% 0x7f25fe78bed8
                              helper_VPERM
          - 5.78% 0x7f25d0625055
               helper_raise_exception
               mttcg_cpu_thread_fn
               tcg_cpu_exec
               cpu_exec
               cpu_exec_setjmp
               cpu_exec_loop
               cpu_tb_exec
               0x7f25d0625055
               helper_raise_exception
               mttcg_cpu_thread_fn
               tcg_cpu_exec
               cpu_exec
               cpu_exec_setjmp
               cpu_exec_loop
             - cpu_tb_exec
                - 5.78% 0x7f25d0625055
                   - helper_raise_exception
                      - 5.49% mttcg_cpu_thread_fn
                         - 5.16% tcg_cpu_exec
                            - 5.11% cpu_exec
                               - 5.03% cpu_exec_setjmp
                                  - 5.01% cpu_exec_loop
                                     - 4.27% cpu_tb_exec
                                          1.60% 0x7f25cf3f0000
+   99.41%     0.25%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_tb_exec
+   99.41%     0.01%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_exec_setjmp
+   98.02%     0.17%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_exec
+   97.99%     0.02%  qemu-system-ppc  qemu-system-ppc64        [.] tcg_cpu_exec
+   97.98%     0.05%  qemu-system-ppc  qemu-system-ppc64        [.] mttcg_cpu_thread_fn
+   92.38%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_io_recompile
+   91.54%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] do_ld_mmio_beN
+   91.51%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] helper_ldub_mmu
+   91.49%     0.00%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25d079f8b4
+   81.15%     0.00%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25cf3f0000
+   44.70%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_loop_exit
+   44.50%     0.01%  qemu-system-ppc  libc.so.6                [.] __longjmp_chk
+   43.16%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_loop_exit_noexc
+    9.57%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] helper_raise_exception
+    8.02%     0.08%  qemu-system-ppc  qemu-system-ppc64        [.] notdirty_write.isra.0
+    7.60%     0.05%  qemu-system-ppc  qemu-system-ppc64        [.] mmu_lookup
+    7.50%     0.03%  qemu-system-ppc  qemu-system-ppc64        [.] tb_invalidate_phys_range_fast
+    7.34%     0.05%  qemu-system-ppc  qemu-system-ppc64        [.] do_st4_mmu
+    7.18%     0.02%  qemu-system-ppc  qemu-system-ppc64        [.] mmu_watch_or_dirty
+    6.99%     6.99%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7bba4b
+    6.82%     6.82%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7c6545
+    6.01%     6.01%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7bbac9
+    5.94%     5.94%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7bbb47
+    5.90%     5.90%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7bb968
+    5.85%     0.00%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25d0625055
+    5.45%     1.17%  qemu-system-ppc  qemu-system-ppc64        [.] page_collection_lock
+    5.13%     5.13%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7c654b
+    5.08%     5.08%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe71f74b
+    5.07%     5.07%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7c624f
+    5.05%     5.05%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe7c6249
+    4.93%     4.93%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe71f740
+    4.64%     4.64%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe71f890
+    4.49%     4.49%  qemu-system-ppc  [JIT] tid 12410          [.] 0x00007f25fe71f885
+    4.05%     1.51%  qemu-system-ppc  qemu-system-ppc64        [.] page_trylock_add
+    3.64%     3.62%  qemu-system-ppc  qemu-system-ppc64        [.] helper_VPERM
+    2.43%     1.40%  qemu-system-ppc  qemu-system-ppc64        [.] probe_access
+    2.16%     0.51%  qemu-system-ppc  libglib-2.0.so.0.7600.3  [.] g_tree_lookup
+    2.09%     0.00%  qemu-system-ppc  qemu-system-ppc64        [.] cpu_loop_exit_restore
+    1.66%     0.06%  qemu-system-ppc  qemu-system-ppc64        [.] helper_store_msr
+    1.61%     0.12%  qemu-system-ppc  qemu-system-ppc64        [.] hreg_store_msr
+    1.52%     1.52%  qemu-system-ppc  qemu-system-ppc64        [.] tb_invalidate_phys_page_range__locked.constprop.0
+    1.49%     0.05%  qemu-system-ppc  qemu-system-ppc64        [.] dcbz_common

The times with 100 iterations were:
mapping 0x80800000
src 0xb773a008 dst 0xb7638000
byte loop: 6.49 sec
memset: 0.44 sec
memcpy: 1.6 sec
copyToVRAMNoAltivec: 0.8 sec
copyToVRAMAltivec: 0.88 sec
copyFromVRAMNoAltivec: 8.15 sec
copyFromVRAMAltivec: 8.41 sec

(The previous results were with 10000 iterations. I did not rerun that
now, but I assume we can roughly multiply these results by 100 to compare
with them. If so, this may be even slower with qemu-system-ppc64, which is
plausible as some code is compiled out when TARGET_PPC64 is not defined.)

I'll try to investigate more but I'm still quite lost.

Regards,
BALATON Zoltan


* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 17:58                 ` Alex Bennée
@ 2025-04-29 21:09                   ` BALATON Zoltan
  2025-04-30  0:35                     ` Nicholas Piggin
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-29 21:09 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Richard Henderson, qemu-devel, qemu-ppc, Nicholas Piggin

On Tue, 29 Apr 2025, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>>>> I have tried profiling the dst in real card vfio vram with dcbz
>>>>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>>>>> not sure I understand the results. vperm and dcbz show up but not
>>>>>> too high. Can somebody explain what is happening here and where the
>>>>>> overhead likely comes from? Here is the profile result I got:
>>>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.):
>>>>>> 122371086557
>>>>>>    Children      Self  Command          Shared Object            Symbol
>>>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.]
>>>>>> cpu_exec_loop
>>>>>>     - 98.49% cpu_exec_loop
>>>>>>        - 98.48% cpu_tb_exec
>>>>>>           - 90.95% 0x7f4e705d8f15
>>>>>>                helper_ldub_mmu
>>>>>>                do_ld_mmio_beN
>>>>>>              - cpu_io_recompile
>>>>>>                 - 45.79% cpu_loop_exit_noexc
>>>>>
>>>>> I think the real problem is the number of loop exits due to i/o.  If
>>>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>>>
>>>>> I/O can only happen as the last insn of a translation block.
>>>>
>>>> I'm not sure I understand this. A comment above cpu_io_recompile says
>>>> "In deterministic execution mode, instructions doing device I/Os must
>>>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>>>> apply if running with icount or something like that?
>>>
>>> That comment should be fixed. It used to only be the case for icount
>>> mode but there was another race bug that meant we need to honour device
>>> access as the last insn for both modes.
>>>
>>>>
>>>>> When we detect that it has happened in the middle of a translation
>>>>> block, we abort the block, compile a new one, and restart execution.
>>>>
>>>> Where does that happen? The calls of cpu_io_recompile in this case
>>>> seem to come from io_prepare which is called from do_ld16_mmio_beN if
>>>> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.
>>>
>>> Inline by set_can_do_io()
>>
>> That one I've found but don't know where the cpu_loop_exit returns
>> from the end of cpu_io_recompile.
>
> cpu_loop_exit longjmp's back to the top of the execution loop.
>
>>
>>>>> Where this becomes a bottleneck is when this same translation block
>>>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>>>> could be addressed by invalidating the previous translation block
>>>>> and creating a new one which always ends with the i/o.
>>>>
>>>> And where to do that? cpu_io_recompile just exits the TB but what
>>>> generates the new TB? I need some more clues to understands how to do
>>>> this.
>>>
>>>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>>>
>>> sets the cflags for the next cb, which typically will fail to find and
>>> then regenerate. Normally cflags_next_tb is empty.
>>
>> Shouldn't this only regenerate the next TB on the first loop iteration
>> and not afterwards?
>
> if we've been here before (needing n insn from the base addr) we will
> have a cached translation we can re-use. It doesn't stop the longer TB
> being called again as we re-enter a loop.

So then maybe it should at least check if there's already a cached TB
where it can continue before calling cpu_io_recompile in io_prepare, and
only recompile if needed? I was thinking maybe we need a flag or counter
to see if cpu_io_recompile is called more than once, and after a limit
invalidate the TB and create two new ones: the first ending at the I/O,
and the second doing what cpu_io_recompile does now. As I understood it,
that was what Richard suggested, but I don't know how to do that.

Regards,
BALATON Zoltan


* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-29 21:09                   ` BALATON Zoltan
@ 2025-04-30  0:35                     ` Nicholas Piggin
  2025-04-30 11:20                       ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: Nicholas Piggin @ 2025-04-30  0:35 UTC (permalink / raw)
  To: BALATON Zoltan, Alex Bennée; +Cc: Richard Henderson, qemu-devel, qemu-ppc

On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
> On Tue, 29 Apr 2025, Alex Bennée wrote:
>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>>>>> I have tried profiling the dst in real card vfio vram with dcbz
>>>>>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>>>>>> not sure I understand the results. vperm and dcbz show up but not
>>>>>>> too high. Can somebody explain what is happening here and where the
>>>>>>> overhead likely comes from? Here is the profile result I got:
>>>>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.):
>>>>>>> 122371086557
>>>>>>>    Children      Self  Command          Shared Object            Symbol
>>>>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.]
>>>>>>> cpu_exec_loop
>>>>>>>     - 98.49% cpu_exec_loop
>>>>>>>        - 98.48% cpu_tb_exec
>>>>>>>           - 90.95% 0x7f4e705d8f15
>>>>>>>                helper_ldub_mmu
>>>>>>>                do_ld_mmio_beN
>>>>>>>              - cpu_io_recompile
>>>>>>>                 - 45.79% cpu_loop_exit_noexc
>>>>>>
>>>>>> I think the real problem is the number of loop exits due to i/o.  If
>>>>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>>>>
>>>>>> I/O can only happen as the last insn of a translation block.
>>>>>
>>>>> I'm not sure I understand this. A comment above cpu_io_recompile says
>>>>> "In deterministic execution mode, instructions doing device I/Os must
>>>>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>>>>> apply if running with icount or something like that?
>>>>
>>>> That comment should be fixed. It used to only be the case for icount
>>>> mode but there was another race bug that meant we need to honour device
>>>> access as the last insn for both modes.
>>>>
>>>>>
>>>>>> When we detect that it has happened in the middle of a translation
>>>>>> block, we abort the block, compile a new one, and restart execution.
>>>>>
>>>>> Where does that happen? The calls of cpu_io_recompile in this case
>>>>> seem to come from io_prepare which is called from do_ld16_mmio_beN if
>>>>> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.
>>>>
>>>> Inline by set_can_do_io()
>>>
>>> That one I've found but don't know where the cpu_loop_exit returns
>>> from the end of cpu_io_recompile.
>>
>> cpu_loop_exit longjmp's back to the top of the execution loop.
>>
>>>
>>>>>> Where this becomes a bottleneck is when this same translation block
>>>>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>>>>> could be addressed by invalidating the previous translation block
>>>>>> and creating a new one which always ends with the i/o.
>>>>>
>>>>> And where to do that? cpu_io_recompile just exits the TB but what
>>>>> generates the new TB? I need some more clues to understands how to do
>>>>> this.
>>>>
>>>>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>>>>
>>>> sets the cflags for the next cb, which typically will fail to find and
>>>> then regenerate. Normally cflags_next_tb is empty.
>>>
>>> Shouldn't this only regenerate the next TB on the first loop iteration
>>> and not afterwards?
>>
>> if we've been here before (needing n insn from the base addr) we will
>> have a cached translation we can re-use. It doesn't stop the longer TB
>> being called again as we re-enter a loop.
>
> So then maybe it should at least check if there's already a cached TB 
> where it can continue before calling cpu_io_recompile in io_prepare and 
> only recompile if needed?

It basically does do that AFAIKS. The cpu_io_recompile() name is
misleading: it does not cause a recompile, it just updates cflags and
exits. The next entry will look up a TB that has just 1 insn and enter
that.
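
A toy model of that lookup, to make the cost visible (this is only a
sketch under assumptions: the cache, the toy_* names, the addresses and
TOY_CF_ONE_INSN are invented here and are not the real TB hash or cflags
encoding). The single-insn TB is generated once and then found in the
cache, but every loop iteration still pays one exit from the long TB:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CACHE_SIZE  8
/* Stands in for "CF_MEMI_ONLY | CF_NOIRQ | n" with n == 1 (assumption). */
#define TOY_CF_ONE_INSN 1u

struct toy_tb {
    unsigned pc;
    unsigned cflags;
    int valid;
};

struct toy_stats {
    int translations;   /* how many TBs had to be generated */
    int loop_exits;     /* how many times the long TB was abandoned */
};

static struct toy_tb toy_cache[TOY_CACHE_SIZE];
static int toy_translations;

/* Look up a TB by (pc, cflags); "translate" a new one only on a miss. */
static struct toy_tb *toy_tb_lookup_or_gen(unsigned pc, unsigned cflags)
{
    struct toy_tb *free_slot = NULL;

    for (size_t i = 0; i < TOY_CACHE_SIZE; i++) {
        struct toy_tb *tb = &toy_cache[i];
        if (tb->valid && tb->pc == pc && tb->cflags == cflags) {
            return tb;              /* cache hit: nothing is regenerated */
        }
        if (!tb->valid && free_slot == NULL) {
            free_slot = tb;
        }
    }
    assert(free_slot != NULL);      /* the toy cache never fills up here */
    free_slot->pc = pc;
    free_slot->cflags = cflags;
    free_slot->valid = 1;
    toy_translations++;
    return free_slot;
}

/* Simulate a copy loop whose body contains one MMIO load mid-block. */
struct toy_stats toy_run_copy_loop(int iterations)
{
    const unsigned loop_pc = 0x1000;  /* start of the loop body (made up) */
    const unsigned io_pc = 0x1008;    /* the MMIO load inside it (made up) */
    struct toy_stats s = { 0, 0 };

    memset(toy_cache, 0, sizeof(toy_cache));
    toy_translations = 0;

    for (int i = 0; i < iterations; i++) {
        /* Normal entry: the long TB for the loop body, default cflags. */
        (void)toy_tb_lookup_or_gen(loop_pc, 0);
        /* The load turns out to be I/O mid-block: set cflags and exit. */
        s.loop_exits++;
        /* Next entry: the 1-insn TB at the I/O insn, cached after first use. */
        (void)toy_tb_lookup_or_gen(io_pc, TOY_CF_ONE_INSN);
    }
    s.translations = toy_translations;
    return s;
}
```

With 1000 iterations this model generates only 2 TBs but still takes 1000
exits, so in the model the overhead is in the exits rather than in any
retranslation.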

> I was thinking maybe we need a flag or counter 
> to see if cpu_io_recompile is called more than once and after a limit 
> invalidate the TB and create two new ones the first ending at the I/O and 
> then what cpu_io_recompile does now which as I understood was what Richard 
> suggested but I don't know how to do that.

memset/cpy routines had kind of the same problem with real hardware.
They wanted to use vector instructions for best performance, but when
those are used on MMIO they would trap and be very slow.

Problem is we don't know ahead of time if some routine will access
MMIO or not. You could recompile it with fewer instructions but then
it will be slow when used for regular memory.

Heuristics are tough because you could have, e.g., one initial big
memset to clear an MMIO region that iterates many times over an inner
loop of dcbz instructions, but is then never used again for MMIO while
still being important for regular page clearing. Something that
dynamically decays, or that periodically recompiles back to the non-IO
case, might work, but then complexity goes up.

I would prefer not to do that just for a microbenchmark, but if
you think it is a reasonable overall win for the average workloads of
your users then perhaps.
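
For what such a decaying heuristic might look like, here is a purely
hypothetical sketch (nothing like this exists in QEMU as far as this
thread says; the struct, the function names and the threshold of 64 are
all invented for illustration). A per-TB counter of mid-block I/O exits
switches the block to "end at the I/O insn", and a periodic decay lets it
fall back to the normal translation when the I/O use stops:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary illustrative threshold; a real value would need measuring. */
#define TOY_SPLIT_THRESHOLD 64u

struct toy_tb_heur {
    unsigned io_exits;  /* recent mid-block I/O exits seen for this TB */
    bool end_at_io;     /* true: translate so the block ends at the I/O insn */
};

/* Called on each mid-block I/O exit; the counter saturates at the threshold. */
void toy_note_io_exit(struct toy_tb_heur *h)
{
    assert(h != NULL);
    if (h->io_exits < TOY_SPLIT_THRESHOLD) {
        h->io_exits++;
    }
    if (h->io_exits >= TOY_SPLIT_THRESHOLD) {
        h->end_at_io = true;
    }
}

/* Called periodically; halves the counter so old I/O use is forgotten. */
void toy_decay(struct toy_tb_heur *h)
{
    assert(h != NULL);
    h->io_exits /= 2;
    if (h->io_exits < TOY_SPLIT_THRESHOLD / 4) {
        h->end_at_io = false;   /* back to the normal, longer translation */
    }
}
```

The gap between the set and clear thresholds is there so a block does not
flip back and forth on every decay tick, which is part of the extra
complexity mentioned above.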

Thanks,
Nick



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-30  0:35                     ` Nicholas Piggin
@ 2025-04-30 11:20                       ` BALATON Zoltan
  2025-04-30 13:47                         ` Alex Bennée
  0 siblings, 1 reply; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-30 11:20 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: Alex Bennée, Richard Henderson, qemu-devel, qemu-ppc

On Wed, 30 Apr 2025, Nicholas Piggin wrote:
> On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
>>>>>>>> I have tried profiling the dst in real card vfio vram with dcbz
>>>>>>>> case (with 100 iterations instead of 10000 in above tests) but I'm
>>>>>>>> not sure I understand the results. vperm and dcbz show up but not
>>>>>>>> too high. Can somebody explain what is happening here and where the
>>>>>>>> overhead likely comes from? Here is the profile result I got:
>>>>>>>> Samples: 104K of event 'cycles:Pu', Event count (approx.):
>>>>>>>> 122371086557
>>>>>>>>    Children      Self  Command          Shared Object            Symbol
>>>>>>>> -   99.44%     0.95%  qemu-system-ppc  qemu-system-ppc          [.]
>>>>>>>> cpu_exec_loop
>>>>>>>>     - 98.49% cpu_exec_loop
>>>>>>>>        - 98.48% cpu_tb_exec
>>>>>>>>           - 90.95% 0x7f4e705d8f15
>>>>>>>>                helper_ldub_mmu
>>>>>>>>                do_ld_mmio_beN
>>>>>>>>              - cpu_io_recompile
>>>>>>>>                 - 45.79% cpu_loop_exit_noexc
>>>>>>>
>>>>>>> I think the real problem is the number of loop exits due to i/o.  If
>>>>>>> I'm reading this rightly, 45% of execution is in cpu_io_recompile.
>>>>>>>
>>>>>>> I/O can only happen as the last insn of a translation block.
>>>>>>
>>>>>> I'm not sure I understand this. A comment above cpu_io_recompile says
>>>>>> "In deterministic execution mode, instructions doing device I/Os must
>>>>>> be at the end of the TB." Is that wrong? Otherwise shouldn't this only
>>>>>> apply if running with icount or something like that?
>>>>>
>>>>> That comment should be fixed. It used to only be the case for icount
>>>>> mode but there was another race bug that meant we need to honour device
>>>>> access as the last insn for both modes.
>>>>>
>>>>>>
>>>>>>> When we detect that it has happened in the middle of a translation
>>>>>>> block, we abort the block, compile a new one, and restart execution.
>>>>>>
>>>>>> Where does that happen? The calls of cpu_io_recompile in this case
>>>>>> seem to come from io_prepare which is called from do_ld16_mmio_beN if
>>>>>> (!cpu->neg.can_do_io) but I don't see how can_do_io is set.
>>>>>
>>>>> Inline by set_can_do_io()
>>>>
>>>> That one I've found but don't know where the cpu_loop_exit returns
>>>> from the end of cpu_io_recompile.
>>>
>>> cpu_loop_exit longjmp's back to the top of the execution loop.
>>>
>>>>
>>>>>>> Where this becomes a bottleneck is when this same translation block
>>>>>>> is in a loop.  Exactly this case of memset/memcpy of VRAM.  This
>>>>>>> could be addressed by invalidating the previous translation block
>>>>>>> and creating a new one which always ends with the i/o.
>>>>>>
>>>>>> And where to do that? cpu_io_recompile just exits the TB but what
>>>>>> generates the new TB? I need some more clues to understands how to do
>>>>>> this.
>>>>>
>>>>>  cpu->cflags_next_tb = curr_cflags(cpu) | CF_MEMI_ONLY | CF_NOIRQ | n;
>>>>>
>>>>> sets the cflags for the next cb, which typically will fail to find and
>>>>> then regenerate. Normally cflags_next_tb is empty.
>>>>
>>>> Shouldn't this only regenerate the next TB on the first loop iteration
>>>> and not afterwards?
>>>
>>> if we've been here before (needing n insn from the base addr) we will
>>> have a cached translation we can re-use. It doesn't stop the longer TB
>>> being called again as we re-enter a loop.
>>
>> So then maybe it should at least check if there's already a cached TB
>> where it can continue before calling cpu_io_recompile in io_prepare and
>> only recompile if needed?
>
> It basically does do that AFAIKS. cpu_io_recompile() name is misleading
> it does not cause a recompile, it just updates cflags and exits. Next
> entry will look up TB that has just 1 insn and enter that.

After reading it I came to the same conclusion, but then I don't
understand what causes the problem. Is it just that it will exit the loop
for every I/O to look up the recompiled TB? It looks like it tries to
chain TBs; why does that not work here?

>> I was thinking maybe we need a flag or counter
>> to see if cpu_io_recompile is called more than once and after a limit
>> invalidate the TB and create two new ones the first ending at the I/O and
>> then what cpu_io_recompile does now which as I understood was what Richard
>> suggested but I don't know how to do that.
>
> memset/cpy routines had kind of the same problem with real hardware.
> They wanted to use vector instructions for best performance, but when
> those are used on MMIO they would trap and be very slow.

Why do those trap on MMIO on a real machine? These routines were tested on
real machines, and the reasoning for using the widest possible access was
that a PCI transfer has overhead, which is minimised by transferring more
bits in one op. I think they also verified that it works at least for the
32 bit CPUs up to G4 that were used in real AmigaNG machines. There are
some benchmark results here:
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which is
also where the benchmark I used comes from, so this should be similar. I
think the MemCopy test on that page has the plain unoptimised copy as Copy
to/from VRAM and optimised routines similar to this benchmark as Read/Write
Pixel Array, but it's not easy to search. Some of the machines like
Pegasos II and AmigaOne XE were made with either G3 or G4 CPUs, so if I
find results from those with the same graphics card, that could show if
AltiVec is faster (although the G4s also had higher clocks so they are not
directly comparable). Some results there are also from QEMU, mostly those
with SiliconMotion 502, but that does not have this problem; only vfio-pci
pass through does. So maybe it's something with how vfio-pci maps PCI
memory BARs?

> Problem is we don't know ahead of time if some routine will access
> MMIO or not. You could recompile it with fewer instructions but then
> it will be slow when used for regular memory.
>
> Heuristics are tough because you could have e.g., one initial big
> memset to clear a MMIO region that iterates many times over inner
> loop of dcbz instructions, but then is never used again for MMIO but
> important for regular page clearing. Making something that dynamically
> decays or periodically would recompile to non-IO case perhaps, but
> then complexity goes up.
>
> I would prefer not like to do that just for a microbenchmark, but if
> you think it is reasonable overall win for average workloads of your
> users then perhaps.

I'm still trying to understand what to optimise. So far it looks like
dcbz has the least impact, then vperm a bit more but still only a few
percent. The biggest impact is still not known for sure, but we see
faster access on real machines that have slower PCIe (only 4x at best),
and CPU benchmarks don't show slower performance on QEMU; only accessing
the passed through card's VRAM is slower than expected. If there's a trap
involved, I've found before that exceptions are slower with QEMU, but I
did not see evidence of that in the profile.

Regards,
BALATON Zoltan


* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-30 11:20                       ` BALATON Zoltan
@ 2025-04-30 13:47                         ` Alex Bennée
  2025-04-30 15:14                           ` BALATON Zoltan
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Bennée @ 2025-04-30 13:47 UTC (permalink / raw)
  To: BALATON Zoltan; +Cc: Nicholas Piggin, Richard Henderson, qemu-devel, qemu-ppc

BALATON Zoltan <balaton@eik.bme.hu> writes:

> On Wed, 30 Apr 2025, Nicholas Piggin wrote:
>> On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
<snip>
>>>>
>>>> if we've been here before (needing n insn from the base addr) we will
>>>> have a cached translation we can re-use. It doesn't stop the longer TB
>>>> being called again as we re-enter a loop.
>>>
>>> So then maybe it should at least check if there's already a cached TB
>>> where it can continue before calling cpu_io_recompile in io_prepare and
>>> only recompile if needed?
>>
>> It basically does do that AFAIKS. cpu_io_recompile() name is misleading
>> it does not cause a recompile, it just updates cflags and exits. Next
>> entry will look up TB that has just 1 insn and enter that.
>
> After reading it I came to the same conclusion but then I don't
> understand what causes the problem. Is it just that it will exit the
> loop for every IO to look up the recompiled TB? It looks like it tries
> to chain TBs, why does that not work here?

Any MMIO access has to come via the slow path. Any MMIO also currently
has to be the last instruction in a block in case the operation triggers
a change in the translation regime that needs to be picked up by the
next instruction you execute.

This is a pathological case when modelling VRAM on a device because it's
going to be slow either way. At least if you model the multiple byte
access with a helper you can amortise some of the cost of the MMU lookup
with a single probe_() call.
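
To show the amortisation idea in isolation (a toy sketch, assuming the
target behaves like plain RAM so that one lookup can cover the whole
range; toy_probe() and the other toy_* names are invented here and are
not QEMU's probe_* API), compare one lookup per byte with one lookup per
block:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static unsigned toy_lookups;    /* counts simulated address lookups */
static uint8_t toy_ram[256];    /* stands in for a RAM-like region */

/* Hypothetical probe: validates [addr, addr + len) and returns a pointer. */
static uint8_t *toy_probe(size_t addr, size_t len)
{
    toy_lookups++;
    if (addr > sizeof(toy_ram) || len > sizeof(toy_ram) - addr) {
        return NULL;
    }
    return &toy_ram[addr];
}

/* Zero a range with one lookup per byte; returns the lookups used. */
unsigned toy_zero_bytewise(size_t addr, size_t len)
{
    toy_lookups = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t *p = toy_probe(addr + i, 1);
        assert(p != NULL);
        *p = 0;
    }
    return toy_lookups;
}

/* Zero the same range after a single lookup; returns the lookups used. */
unsigned toy_zero_block(size_t addr, size_t len)
{
    uint8_t *p;

    toy_lookups = 0;
    p = toy_probe(addr, len);
    assert(p != NULL);
    memset(p, 0, len);
    return toy_lookups;
}
```

For a 32 byte block the first variant does 32 lookups and the second does
1; whether a real helper can do this depends on the region being safe to
treat as RAM.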

>>> I was thinking maybe we need a flag or counter
>>> to see if cpu_io_recompile is called more than once and after a limit
>>> invalidate the TB and create two new ones the first ending at the I/O and
>>> then what cpu_io_recompile does now which as I understood was what Richard
>>> suggested but I don't know how to do that.
>>
>> memset/cpy routines had kind of the same problem with real hardware.
>> They wanted to use vector instructions for best performance, but when
>> those are used on MMIO they would trap and be very slow.
>
> Why do those trap on MMIO on real machine? These routines were tested
> on real machines and the reasoning to use the widest possible access
> was that PCI transfer has overhead and that is minimised by
> transferring more bits in one op. I think they also verified that it
> works at least for the 32 bit CPUs up to G4 that were used on real
> AmigaNG machines. There are some benchmark results here:
> https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which
> is also where the benchmark I used comes from so this should be
> similar. I think the MemCopy on that page has plain unoptimised copy
> as Copy to/from VRAM and optimised routines similar to this benchmark
> as Read/Write Pixel Array, but it's not easy to search. Some of the
> machines like Pegasos II and AmigaOne XE were made with both G3 or G4
> CPUs so if I find a result from those with same graphics card that
> could show if AltiVec is faster (although the G4s were also higher
> clock so not directly comparable). Some results there are also from
> QEMU, mostly those that are with SiliconMotion 502 but that does not
> have this problem only vfio-pci pass through.

They don't - what we need is to have a RAM-like-device model for QEMU
where we can relax the translation rules because we know we are writing
to RAM like things that don't have registers or other state changing
behaviour.

The poor behaviour is because QEMU currently treats all MMIO as
potentially system state altering, whereas for VRAM it doesn't need to.

> So maybe it's something
> with how vfio-pci maps PCI memory BARs?

I don't know about vfio-pci but blob resources mapped via virtio-gpu
just appear as chunks of RAM to the guest - hence no trapping.

>
>> Problem is we don't know ahead of time if some routine will access
>> MMIO or not. You could recompile it with fewer instructions but then
>> it will be slow when used for regular memory.
>>
>> Heuristics are tough because you could have e.g., one initial big
>> memset to clear a MMIO region that iterates many times over inner
>> loop of dcbz instructions, but then is never used again for MMIO but
>> important for regular page clearing. Making something that dynamically
>> decays or periodically would recompile to non-IO case perhaps, but
>> then complexity goes up.

We can't have heuristics when we must prioritise correctness. However we
could expand the device model to make the exact behaviour of different
devices clear and optimise when we know it is safe. 

>> I would prefer not like to do that just for a microbenchmark, but if
>> you think it is reasonable overall win for average workloads of your
>> users then perhaps.
>
> I'm still trying to understand what to optimise. So far it looks like
> that dcbz has the least impact, then vperm a bit bigger but still only
> about a few percent and the biggest impact is still not known for sure
> but we see faster access on real machines that run on slower PCIe
> (only 4x at best) while CPU benchmarks don't show slower performance
> on QEMU only accessing passed through card's VRAM is slower than
> expected. But if there's a trap involved I've found before that
> exceptions are slower with QEMU but I did not see evidence of that in
> the profile.
>
> Regards,
> BALATON Zoltan

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: [RFC PATCH] target/ppc: Inline most of dcbz helper
  2025-04-30 13:47                         ` Alex Bennée
@ 2025-04-30 15:14                           ` BALATON Zoltan
  0 siblings, 0 replies; 20+ messages in thread
From: BALATON Zoltan @ 2025-04-30 15:14 UTC (permalink / raw)
  To: Alex Bennée; +Cc: Nicholas Piggin, Richard Henderson, qemu-devel, qemu-ppc


On Wed, 30 Apr 2025, Alex Bennée wrote:
> BALATON Zoltan <balaton@eik.bme.hu> writes:
>> On Wed, 30 Apr 2025, Nicholas Piggin wrote:
>>> On Wed Apr 30, 2025 at 7:09 AM AEST, BALATON Zoltan wrote:
>>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>> On Tue, 29 Apr 2025, Alex Bennée wrote:
>>>>>>> BALATON Zoltan <balaton@eik.bme.hu> writes:
>>>>>>>> On Mon, 28 Apr 2025, Richard Henderson wrote:
>>>>>>>>> On 4/28/25 06:26, BALATON Zoltan wrote:
> <snip>
>>>>>
>>>>> if we've been here before (needing n insn from the base addr) we will
>>>>> have a cached translation we can re-use. It doesn't stop the longer TB
>>>>> being called again as we re-enter a loop.
>>>>
>>>> So then maybe it should at least check if there's already a cached TB
>>>> where it can continue before calling cpu_io_recompile in io_prepare and
>>>> only recompile if needed?
>>>
>>> It basically does do that AFAIKS. cpu_io_recompile() name is misleading
>>> it does not cause a recompile, it just updates cflags and exits. Next
>>> entry will look up TB that has just 1 insn and enter that.
>>
>> After reading it I came to the same conclusion but then I don't
>> understand what causes the problem. Is it just that it will exit the
>> loop for every IO to look up the recompiled TB? It looks like it tries
>> to chain TBs, why does that not work here?
>
> Any MMIO access has to come via the slow path. Any MMIO also currently
> has to be the last instruction in a block in case the operation triggers
> a change in the translation regime that needs to be picked up by the
> next instruction you execute.
>
> This is a pathological case when modelling VRAM on a device because it's
> going to be slow either way. At least if you model the multiple byte
> access with a helper you can amortise some of the cost of the MMU lookup
> with a single probe_() call.

I think there is some mix-up here because of all the different scenarios I
benchmarked, so let me try to clear that up. The goal is to find out why
access to vfio-pci passed through graphics card VRAM is slower than
expected when the host should be faster than the mostly embedded or old
PPCs used on real machines with only 4x PCIe or PCIe to PCI bridges. In
this case we are not emulating VRAM but mapping the framebuffer from the
real card and accessing that. To find where the slowdown comes from I've
benchmarked all the cases upthread, but here are the relevant parts again
for easier comparison:

First, both src and dst are in RAM (just malloced buffers, so this is the
baseline):

src 0xb79c8008 dst 0xb78c7008
byte loop: 21.16 sec
memset: 3.85 sec
memcpy: 5.07 sec
copyToVRAMNoAltivec: 2.52 sec
copyToVRAMAltivec: 2.42 sec
copyFromVRAMNoAltivec: 6.39 sec
copyFromVRAMAltivec: 7.02 sec

The FromVRAM cases use dcbz so that a real machine does not load RAM
contents into the cache that are about to be overwritten, so dcbz is never
applied to MMIO. (Arguably it should use dcba, but nobody remembers why it
uses dcbz instead.) The ToVRAM cases have dcbt, which is a noop on QEMU. I
guess the difference we see here is because of probe_access in dcbz, as was
shown by previous profiling. Replacing that with dcba (which is a noop in
QEMU) makes ToVRAM and FromVRAM run about the same (you can find that
case in the original message). FromVRAM is still a bit slower for some
reason, but most of this overhead can be attributed to dcbz.

In the second test dst is mmapped from the emulated ati-vga framebuffer BAR.
We can say we emulate VRAM here but that's just a RAM memory region created
in vga.c as:

memory_region_init_ram_nomigrate(&s->vram, obj, "vga.vram", s->vram_size, &local_err);

It also has dirty tracking enabled; I don't know if that has any effect.
This is shown in the left column here:

dst in emulated ati-vga               | dst in real card vfio vram
mapping 0x80800000                    | mapping 0x80800000
src 0xb78e0008 dst 0xb77de000         | src 0xb7ec5008 dst 0xb7dc3000
byte loop: 21.2 sec                   | byte loop: 563.98 sec
memset: 3.89 sec                      | memset: 39.25 sec
memcpy: 5.07 sec                      | memcpy: 140.49 sec
copyToVRAMNoAltivec: 2.53 sec         | copyToVRAMNoAltivec: 72.03 sec
copyToVRAMAltivec: 12.22 sec          | copyToVRAMAltivec: 78.12 sec
copyFromVRAMNoAltivec: 6.43 sec       | copyFromVRAMNoAltivec: 728.52 sec
copyFromVRAMAltivec: 35.33 sec        | copyFromVRAMAltivec: 754.95 sec

Here we see that the AltiVec cases have additional overhead, which I think
is related to vperm as that's the only op that does not seem to be compiled
to something sensible but calls an unoptimised helper (although that's
also there for RAM so I'm not sure why this is slower). But this shows no
other overhead due to MMIO being involved, as the NoAltivec cases are the
same as with RAM.

The last case, shown in the right column above, is when instead of ati-vga
I have a real ATI card passed through with vfio-pci, which is much slower
than what PCI overhead alone explains, and I'm trying to find out the
source of that slowdown.

I've now also run 1000 iterations (vs. 10000 above, so numbers here are 10
times lower than in the right column above) of the last case again (using
the real card with vfio-pci) with qemu-system-ppc vs. qemu-system-ppc64 to
see if mttcg has any effect:

1000 iterations qemu-system-ppc       | qemu-system-ppc64
mapping 0x80800000                    | mapping 0x80800000
src 0xb7dc6008 dst 0xb7cc4000         | src 0xb78b8008 dst 0xb77b6000
byte loop: 58.44 sec                  | byte loop: 57.72 sec
memset: 3.99 sec                      | memset: 3.93 sec
memcpy: 14.43 sec                     | memcpy: 14.24 sec
copyToVRAMNoAltivec: 7.27 sec         | copyToVRAMNoAltivec: 7.15 sec
copyToVRAMAltivec: 7.9 sec            | copyToVRAMAltivec: 7.78 sec
copyFromVRAMNoAltivec: 72.68 sec      | copyFromVRAMNoAltivec: 72.69 sec
copyFromVRAMAltivec: 75.15 sec        | copyFromVRAMAltivec: 75.05 sec

This does not seem to have much effect, so maybe not having mttcg does not
enable icount but just uses the same functions, which were confusing in the
profile.

Finally I dug up some comparable results from a real machine vs. QEMU.
These are with QEMU with the default -cpu 7454 and -cpu g3 (to check
AltiVec overhead, but there seems to be only about 1%):

https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2939
https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2941

and same card on real machine:

https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2414

It seems for larger rectangles we approach the same limits but smaller 
transfers (what I think VRAM copy also uses) have some big overhead 
compared to what PCIe communication alone explains.

Another card on QEMU:

https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2931

and on real machine:

https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/2372

or a similar card (I did not find exactly the same) with slower CPU real 
machine:

https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS/Result/1672

Also, on a real machine using the optimised routines does help, so using
wider transfers is better than the default unoptimised case.

>>>> I was thinking maybe we need a flag or counter
>>>> to see if cpu_io_recompile is called more than once and after a limit
>>>> invalidate the TB and create two new ones the first ending at the I/O and
>>>> then what cpu_io_recompile does now which as I understood was what Richard
>>>> suggested but I don't know how to do that.
>>>
>>> memset/cpy routines had kind of the same problem with real hardware.
>>> They wanted to use vector instructions for best performance, but when
>>> those are used on MMIO they would trap and be very slow.
>>
>> Why do those trap on MMIO on real machine? These routines were tested
>> on real machines and the reasoning to use the widest possible access
>> was that PCI transfer has overhead and that is minimised by
>> transferring more bits in one op. I think they also verified that it
>> works at least for the 32 bit CPUs up to G4 that were used on real
>> AmigaNG machines. There are some benchmark results here:
>> https://hdrlab.org.nz/benchmark/gfxbench2d/OS/AmigaOS?start=60 which
>> is also where the benchmark I used comes from so this should be
>> similar. I think the MemCopy on that page has plain unoptimised copy
>> as Copy to/from VRAM and optimised routines similar to this benchmark
>> as Read/Write Pixel Array, but it's not easy to search. Some of the
>> machines like Pegasos II and AmigaOne XE were made with both G3 or G4
>> CPUs so if I find a result from those with same graphics card that
>> could show if AltiVec is faster (although the G4s were also higher
>> clock so not directly comparable). Some results there are also from
>> QEMU, mostly those that are with SiliconMotion 502 but that does not
>> have this problem only vfio-pci pass through.
>
> They don't - what we need is to have a RAM-like-device model for QEMU
> where we can relax the translation rules because we know we are writing
> to RAM like things that don't have registers or other state changing
> behaviour.
>
> The poor behaviour is because QEMU currently treats all MMIO as
> potentially system state altering, whereas for VRAM it doesn't need to.

This does not seem to be the case with emulated ati-vga, and with vfio-pci
it should also be mapped memory from the graphics card, which technically
is MMIO, but how does QEMU decide that when it does not seem to consider
ati-vga as IO? Typically in QEMU MMIO is an io memory region that goes
through memops and that's understandably slow, but here we should
read/write mapped memory space. Maybe I should try to find out what
vfio-pci actually does here, but it is used for gaming with KVM and there
people get near native performance so I don't think there is an overhead
in vfio-pci.

So I could explain some small overheads with dcbz and maybe vperm, but the
biggest one seems to happen only when accessing real card VRAM with
vfio-pci. That does not seem to happen on a real machine and I could not
reproduce it with emulated ati-vga either, but that's all I could find out
so far and I still don't get where the biggest overhead comes from.

Regards,
BALATON Zoltan

>> So maybe it's something
>> with how vfio-pci maps PCI memory BARs?
>
> I don't know about vfio-pci but blob resources mapped via virtio-gpu
> just appear as chunks of RAM to the guest - hence no trapping.
>
>>
>>> Problem is we don't know ahead of time if some routine will access
>>> MMIO or not. You could recompile it with fewer instructions but then
>>> it will be slow when used for regular memory.
>>>
>>> Heuristics are tough because you could have e.g., one initial big
>>> memset to clear a MMIO region that iterates many times over inner
>>> loop of dcbz instructions, but then is never used again for MMIO but
>>> important for regular page clearing. Making something that dynamically
>>> decays or periodically would recompile to non-IO case perhaps, but
>>> then complexity goes up.
>
> We can't have heuristics when we must prioritise correctness. However we
> could expand the device model to make the exact behaviour of different
> devices clear and optimise when we know it is safe.
>
>>> I would prefer not like to do that just for a microbenchmark, but if
>>> you think it is reasonable overall win for average workloads of your
>>> users then perhaps.
>>
>> I'm still trying to understand what to optimise. So far it looks like
>> that dcbz has the least impact, then vperm a bit bigger but still only
>> about a few percent and the biggest impact is still not known for sure
>> but we see faster access on real machines that run on slower PCIe
>> (only 4x at best) while CPU benchmarks don't show slower performance
>> on QEMU only accessing passed through card's VRAM is slower than
>> expected. But if there's a trap involved I've found before that
>> exceptions are slower with QEMU but I did not see evidence of that in
>> the profile.
>>
>> Regards,
>> BALATON Zoltan
>
>


Thread overview: 20+ messages
2024-07-01  0:59 [RFC PATCH] target/ppc: Inline most of dcbz helper BALATON Zoltan
2025-04-24 12:45 ` BALATON Zoltan
2025-04-28  0:12   ` BALATON Zoltan
2025-04-28 10:44     ` BALATON Zoltan
2025-04-28 13:26       ` BALATON Zoltan
2025-04-28 13:47         ` Richard Henderson
2025-04-29 14:40           ` BALATON Zoltan
2025-04-29 16:04             ` Alex Bennée
2025-04-29 17:14               ` BALATON Zoltan
2025-04-29 17:58                 ` Alex Bennée
2025-04-29 21:09                   ` BALATON Zoltan
2025-04-30  0:35                     ` Nicholas Piggin
2025-04-30 11:20                       ` BALATON Zoltan
2025-04-30 13:47                         ` Alex Bennée
2025-04-30 15:14                           ` BALATON Zoltan
2025-04-29 15:27         ` Alex Bennée
2025-04-29 17:11           ` BALATON Zoltan
2025-04-29 17:30           ` Richard Henderson
2025-04-29 18:00             ` Alex Bennée
2025-04-29 20:51               ` BALATON Zoltan
