From: BALATON Zoltan <balaton@eik.bme.hu>
To: qemu-devel@nongnu.org, qemu-ppc@nongnu.org
Cc: Nicholas Piggin <npiggin@gmail.com>,
Daniel Henrique Barboza <danielhb413@gmail.com>,
Richard Henderson <richard.henderson@linaro.org>
Subject: [RFC PATCH] target/ppc: Inline most of dcbz helper
Date: Mon, 01 Jul 2024 02:59:39 +0200 (CEST)
Message-ID: <20240701005939.5A0AF4E6000@zero.eik.bme.hu>
This is an unfinished RFC patch, posted to show the idea and to test
this approach. I'm not sure it's correct, but I'm sure it can be
improved, so comments are welcome.

The test case I've used came out of a discussion about very slow
access to the VRAM of a graphics card passed through with vfio. The
reason for that slowness is still not clear, but it was already known
that MacOS and AmigaOS often use dcbz to clear memory and to avoid
reading values that are about to be overwritten. This is faster on a
real CPU but was found to be slower on QEMU. The optimised copy
routines were posted here:
https://www.amigans.net/modules/newbb/viewtopic.php?post_id=149123#forumpost149123
and the rest of the code, which I wrote to turn them into a test case,
is here:
http://zero.eik.bme.hu/~balaton/qemu/vramcopy.tar.xz
Replace the body of has_altivec() with just "return false". Sorry for
only giving pieces, but the code posted above has a copyright that does
not allow me to include it in the test. The test now measures plain
memory copy rather than VRAM access, but it still shows the effect of
dcbz. I've got these results with this patch:

Linux user master:                  Linux user patch:
byte loop: 2.2 sec                  byte loop: 2.2 sec
memcpy: 2.19 sec                    memcpy: 2.19 sec
copyToVRAMNoAltivec: 1.7 sec        copyToVRAMNoAltivec: 1.71 sec
copyToVRAMAltivec: 2.13 sec         copyToVRAMAltivec: 2.12 sec
copyFromVRAMNoAltivec: 5.11 sec     copyFromVRAMNoAltivec: 2.79 sec
copyFromVRAMAltivec: 5.87 sec       copyFromVRAMAltivec: 3.26 sec

Linux system master:                Linux system patch:
byte loop: 5.86 sec                 byte loop: 5.9 sec
memcpy: 5.45 sec                    memcpy: 5.47 sec
copyToVRAMNoAltivec: 2.51 sec       copyToVRAMNoAltivec: 2.53 sec
copyToVRAMAltivec: 3.84 sec         copyToVRAMAltivec: 3.85 sec
copyFromVRAMNoAltivec: 6.11 sec     copyFromVRAMNoAltivec: 3.92 sec
copyFromVRAMAltivec: 7.22 sec       copyFromVRAMAltivec: 5.51 sec

It could probably be optimised further by using vector instructions
(dcbz_size is between 32 and 128) or by eliminating the check left in
the helper for the 970, but I don't know how to do either. (The series
that converts AltiVec to use 128-bit accesses may also help, but I
haven't tested that; I'm only trying to optimise dcbz here.)

Signed-off-by: BALATON Zoltan <balaton@eik.bme.hu>
---
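For reviewers, the sequence that gen_dcbz() emits in the diff below can
be modelled in plain C roughly as follows. This is only a sketch under
stated assumptions: ModelCPU, model_dcbz() and the flat memory array
are made-up names for illustration, not QEMU code, and the model
ignores the MMU, faults and the 970 dcbz/dcbzl size check that stays in
the helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the state the inlined sequence touches. */
typedef struct {
    uint8_t *mem;      /* flat guest memory (assumption: no MMU, no faults) */
    uint64_t reserve;  /* models cpu_reserve; UINT64_MAX means no reservation */
} ModelCPU;

/*
 * Zero the cache line containing addr. size models dcbz_size and is
 * assumed to be a power of two that is at least 16 (32..128 in the patch).
 */
static void model_dcbz(ModelCPU *cpu, uint64_t addr, uint64_t size)
{
    assert(size >= 16 && (size & (size - 1)) == 0);

    /* mask = ~(dcbz_size - 1); addr &= mask */
    uint64_t mask = ~(size - 1);
    addr &= mask;

    /* Drop the reservation if it falls inside the line being cleared. */
    if ((cpu->reserve & mask) == addr) {
        cpu->reserve = UINT64_MAX;
    }

    /* One 16-byte zero store per iteration, like the i128 store loop. */
    do {
        memset(cpu->mem + addr, 0, 16);
        addr += 16;
        size -= 16;
    } while (size > 0);
}
```

With a 64-byte line, calling model_dcbz(&cpu, 0x5c, 64) clears bytes
0x40..0x7f and leaves the neighbouring lines untouched.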
target/ppc/helper.h | 1 +
target/ppc/mem_helper.c | 14 ++++++++++++++
target/ppc/translate.c | 34 ++++++++++++++++++++++++++++------
3 files changed, 43 insertions(+), 6 deletions(-)
diff --git a/target/ppc/helper.h b/target/ppc/helper.h
index 76b8f25c77..e49681c25b 100644
--- a/target/ppc/helper.h
+++ b/target/ppc/helper.h
@@ -46,6 +46,7 @@ DEF_HELPER_FLAGS_3(stmw, TCG_CALL_NO_WG, void, env, tl, i32)
DEF_HELPER_4(lsw, void, env, tl, i32, i32)
DEF_HELPER_5(lswx, void, env, tl, i32, i32, i32)
DEF_HELPER_FLAGS_4(stsw, TCG_CALL_NO_WG, void, env, tl, i32, i32)
+DEF_HELPER_FLAGS_2(dcbz_size, TCG_CALL_NO_WG_SE, tl, env, i32)
DEF_HELPER_FLAGS_3(dcbz, TCG_CALL_NO_WG, void, env, tl, i32)
DEF_HELPER_FLAGS_3(dcbzep, TCG_CALL_NO_WG, void, env, tl, i32)
DEF_HELPER_FLAGS_2(icbi, TCG_CALL_NO_WG, void, env, tl)
diff --git a/target/ppc/mem_helper.c b/target/ppc/mem_helper.c
index f88155ad45..b06cb2d00e 100644
--- a/target/ppc/mem_helper.c
+++ b/target/ppc/mem_helper.c
@@ -270,6 +270,20 @@ void helper_stsw(CPUPPCState *env, target_ulong addr, uint32_t nb,
}
}
+target_ulong helper_dcbz_size(CPUPPCState *env, uint32_t opcode)
+{
+ target_ulong dcbz_size = env->dcache_line_size;
+
+#if defined(TARGET_PPC64)
+ /* Check for dcbz vs dcbzl on 970 */
+ if (env->excp_model == POWERPC_EXCP_970 &&
+ !(opcode & 0x00200000) && ((env->spr[SPR_970_HID5] >> 7) & 0x3) == 1) {
+ dcbz_size = 32;
+ }
+#endif
+ return dcbz_size;
+}
+
static void dcbz_common(CPUPPCState *env, target_ulong addr,
uint32_t opcode, bool epid, uintptr_t retaddr)
{
diff --git a/target/ppc/translate.c b/target/ppc/translate.c
index 0bc16d7251..49221b8303 100644
--- a/target/ppc/translate.c
+++ b/target/ppc/translate.c
@@ -4445,14 +4445,36 @@ static void gen_dcblc(DisasContext *ctx)
/* dcbz */
static void gen_dcbz(DisasContext *ctx)
{
- TCGv tcgv_addr;
- TCGv_i32 tcgv_op;
+ TCGv addr, mask, dcbz_size, t0;
+ TCGv_i32 op = tcg_constant_i32(ctx->opcode & 0x03FF000);
+ TCGv_i64 z64 = tcg_constant_i64(0);
+ TCGv_i128 z128 = tcg_temp_new_i128();
+ TCGLabel *l;
+
+ addr = tcg_temp_new();
+ mask = tcg_temp_new();
+ dcbz_size = tcg_temp_new();
+ t0 = tcg_temp_new();
+ l = gen_new_label();
gen_set_access_type(ctx, ACCESS_CACHE);
- tcgv_addr = tcg_temp_new();
- tcgv_op = tcg_constant_i32(ctx->opcode & 0x03FF000);
- gen_addr_reg_index(ctx, tcgv_addr);
- gen_helper_dcbz(tcg_env, tcgv_addr, tcgv_op);
+ gen_helper_dcbz_size(dcbz_size, tcg_env, op);
+ tcg_gen_mov_tl(mask, dcbz_size);
+ tcg_gen_subi_tl(mask, mask, 1);
+ tcg_gen_not_tl(mask, mask);
+ gen_addr_reg_index(ctx, addr);
+ tcg_gen_and_tl(addr, addr, mask);
+ tcg_gen_mov_tl(t0, cpu_reserve);
+ tcg_gen_and_tl(t0, t0, mask);
+ tcg_gen_movcond_tl(TCG_COND_EQ, cpu_reserve, addr, t0,
+ tcg_constant_tl(-1), cpu_reserve);
+
+ tcg_gen_concat_i64_i128(z128, z64, z64);
+ gen_set_label(l);
+ tcg_gen_qemu_st_i128(z128, addr, ctx->mem_idx, DEF_MEMOP(MO_128));
+ tcg_gen_addi_tl(addr, addr, 16);
+ tcg_gen_subi_tl(dcbz_size, dcbz_size, 16);
+ tcg_gen_brcondi_tl(TCG_COND_GT, dcbz_size, 0, l);
}
/* dcbzep */
--
2.30.9