* [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32
From: Richard Henderson @ 2015-07-18 7:58 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, leon.alrae, aurelien
Enforce the invariant that 32-bit quantities are zero extended
in the register. This avoids having to re-zero-extend at memory
accesses for 32-bit guests.
Signed-off-by: Richard Henderson <rth@twiddle.net>
---
Here's an alternative to the other things we've been considering.
We could even make this conditional on USER_ONLY if you like.
This does in fact fix the mips test case. Consider the fact that
memory operations are probably more common than truncations, and
it would seem that we have a net size win by forcing the truncate
over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
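(Rough numbers, for illustration only: the zero-extend is a 2-byte
mov %esi,%esi at every memory access and ADDR32 is a 1-byte 0x67
prefix, while the truncate costs at most a 4-byte shr plus a 2-byte
mov, paid once per truncation.)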
Indeed, for 2.5, we could look at dropping the existing zero-extend
from the softmmu path. Also for 2.5, split trunc_shr into two parts,
trunc_lo and trunc_hi. No one really uses any other value for the
shift, and x86 could really use different constraints for the two.
Thoughts?
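For reference, a minimal C model of what the new op computes (an
illustration of the semantics only, not the TCG code itself), which
also shows why the two conditionals in the patch suffice:

    #include <stdint.h>

    /* dest = (uint32_t)(src >> shift), computed in place.
     * shift == 0 needs no SHR, and shift == 32 needs no zero-extend,
     * because a 64-bit right shift by 32 already clears the high half. */
    static uint64_t trunc_shr_i32_model(uint64_t src, unsigned shift)
    {
        if (shift > 0) {
            src >>= shift;          /* the SHR with REX.W */
        }
        if (shift < 32) {
            src = (uint32_t)src;    /* the zero-extending 32-bit MOV */
        }
        return src;
    }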
r~
---
tcg/i386/tcg-target.c | 9 +++++++++
tcg/i386/tcg-target.h | 2 +-
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ff4d9cf..9f01369 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -2040,6 +2040,14 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_ext32s_i64:
         tcg_out_ext32s(s, args[0], args[1]);
         break;
+    case INDEX_op_trunc_shr_i32:
+        if (args[2] > 0) {
+            tcg_out_shifti(s, SHIFT_SHR + P_REXW, args[0], args[2]);
+        }
+        if (args[2] < 32) {
+            tcg_out_ext32u(s, args[0], args[0]);
+        }
+        break;
 #endif

     OP_32_64(deposit):
@@ -2170,6 +2178,7 @@ static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_ext8u_i64, { "r", "r" } },
     { INDEX_op_ext16u_i64, { "r", "r" } },
     { INDEX_op_ext32u_i64, { "r", "r" } },
+    { INDEX_op_trunc_shr_i32, { "r", "0" } },

     { INDEX_op_deposit_i64, { "Q", "0", "Q" } },
     { INDEX_op_movcond_i64, { "r", "r", "re", "r", "0" } },
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 25b5133..8b87050 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -102,7 +102,7 @@ extern bool have_bmi1;
 #define TCG_TARGET_HAS_mulsh_i32        0

 #if TCG_TARGET_REG_BITS == 64
-#define TCG_TARGET_HAS_trunc_shr_i32    0
+#define TCG_TARGET_HAS_trunc_shr_i32    1
 #define TCG_TARGET_HAS_div2_i64         1
 #define TCG_TARGET_HAS_rot_i64          1
 #define TCG_TARGET_HAS_ext8s_i64        1
--
2.4.3
* Re: [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32
From: Aurelien Jarno @ 2015-07-18 21:18 UTC (permalink / raw)
To: Richard Henderson; +Cc: pbonzini, leon.alrae, qemu-devel
On 2015-07-18 08:58, Richard Henderson wrote:
> Enforce the invariant that 32-bit quantities are zero extended
> in the register. This avoids having to re-zero-extend at memory
> accesses for 32-bit guests.
>
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
> Here's an alternative to the other things we've been considering.
> We could even make this conditional on USER_ONLY if you like.
>
> This does in fact fix the mips test case. Consider the fact that
> memory operations are probably more common than truncations, and
> it would seem that we have a net size win by forcing the truncate
> over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
I think we should go with your previous patch for 2.4, and think calmly
about how to do that better for 2.5. It slightly increases the generated
code, but only in bytes, not in number of instructions, so I don't think
the performance impact is huge.
> Indeed, for 2.5, we could look at dropping the existing zero-extend
> from the softmmu path. Also for 2.5, split trunc_shr into two parts,
From a quick look, we need to move the address to new registers anyway,
so not zero-extending will mean adding the REXW prefix.
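(For instance, mov %ebp,%edi is 2 bytes and zero-extends as a side
effect, while mov %rbp,%rdi needs 3 bytes with the REX.W prefix.)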
> trunc_lo and trunc_hi. No one really uses any other value for the
> shift, and x86 could really use different constraints for the two.
That sounds like a good idea.
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net
* Re: [Qemu-devel] [PATCH for-2.4] tcg/i386: Implement trunc_shr_i32
From: Aurelien Jarno @ 2015-07-19 11:26 UTC (permalink / raw)
To: Richard Henderson; +Cc: pbonzini, leon.alrae, qemu-devel
On 2015-07-18 23:18, Aurelien Jarno wrote:
> On 2015-07-18 08:58, Richard Henderson wrote:
> > Enforce the invariant that 32-bit quantities are zero extended
> > in the register. This avoids having to re-zero-extend at memory
> > accesses for 32-bit guests.
> >
> > Signed-off-by: Richard Henderson <rth@twiddle.net>
> > ---
> > Here's an alternative to the other things we've been considering.
> > We could even make this conditional on USER_ONLY if you like.
> >
> > This does in fact fix the mips test case. Consider the fact that
> > memory operations are probably more common than truncations, and
> > it would seem that we have a net size win by forcing the truncate
> > over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
>
> I think we should go with your previous patch for 2.4, and think calmly
> about how to do that better for 2.5. It slightly increases the generated
> code, but only in bytes, not in number of instructions, so I don't think
> the performance impact is huge.
>
> > Indeed, for 2.5, we could look at dropping the existing zero-extend
> > from the softmmu path. Also for 2.5, split trunc_shr into two parts,
>
> From a quick look, we need to move the address to new registers anyway,
> so not zero-extending will mean adding the REXW prefix.
Well, looking in more detail, we can move one instruction from the
fast path to the slow path. Here is typical TLB code for a store:
fast-path:
    mov %rbp,%rdi
    mov %rbp,%rsi
    shr $0x7,%rdi
    and $0xfffffffffffff003,%rsi
    and $0x1fe0,%edi
    lea 0x4e68(%r14,%rdi,1),%rdi
    cmp (%rdi),%rsi
    mov %rbp,%rsi
    jne 0x7f45b8bcc800
    add 0x10(%rdi),%rsi
    mov %ebx,(%rsi)
slow-path:
    mov %r14,%rdi
    mov %ebx,%edx
    mov $0x22,%ecx
    lea -0x156(%rip),%r8
    push %r8
    jmpq 0x7f45cb337010
If we know that %rbp is properly zero-extended when needed, we can
change the end of the fast path into:
    cmp (%rdi),%rsi
    jne 0x7f45b8bcc800
    mov 0x10(%rdi),%rsi
    mov %ebx,(%rsi,%rbp,1)
However, that means %rsi no longer holds the address, so we have to
load it in the slow path. In the end it means moving one instruction
from the fast path to the slow path.
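To make that concrete, here is a rough C model of the fast-path check
above (the struct and constants only loosely mirror QEMU's softmmu
TLB, and slow_path_store is a hypothetical stub):

    #include <stdint.h>

    #define PAGE_BITS 12
    #define PAGE_MASK (~(uint64_t)((1 << PAGE_BITS) - 1))
    #define TLB_SIZE  256                  /* entries per MMU index */

    typedef struct {
        uint64_t addr_write;  /* tag: page address, low bits force misses */
        uint64_t addend;      /* guest address + addend = host address */
    } TLBEntry;

    void slow_path_store(uint64_t addr, uint32_t val);  /* hypothetical */

    static void store32(TLBEntry *tlb, uint64_t addr, uint32_t val)
    {
        TLBEntry *e = &tlb[(addr >> PAGE_BITS) & (TLB_SIZE - 1)];

        /* Keeping the low two bits in the compare sends any unaligned
         * store to the slow path, like the 0x...f003 mask above. */
        if ((addr & (PAGE_MASK | 3)) != e->addr_write) {
            slow_path_store(addr, val);
            return;
        }
        *(uint32_t *)(uintptr_t)(addr + e->addend) = val;
    }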
Now, I have no idea what would actually improve performance. A smaller
fast path, so there are fewer instructions to execute? Or smaller code
in general, so that the caches are better used?
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://www.aurel32.net