qemu-devel.nongnu.org archive mirror
* [PATCH] tcg/i386: Check for shorter instruction sequence for ARITH_AND
@ 2023-08-07 14:28 Helge Deller
  2023-08-07 18:57 ` Richard Henderson
  0 siblings, 1 reply; 3+ messages in thread
From: Helge Deller @ 2023-08-07 14:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: Helge Deller

The tcg uses tgen_arithi(ARITH_AND) during fast CPU TLB lookups,
which e.g. translates to:

0x7ff5b011556a:  48 81 e6 00 f0 ff ff     andq     $0xfffffffffffff000, %rsi

If the upper 48 bits of the mask are all set, the AND can instead
operate on the lower 16 bits of the target register (si), which gives
an instruction that is 2 bytes shorter:

0x7f4488097b31:  66 81 e6 00 f0           andw     $0xf000, %si

Signed-off-by: Helge Deller <deller@gmx.de>
---
 tcg/i386/tcg-target.c.inc | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 77482da070..1cb9759c9e 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -1342,6 +1342,13 @@ static void tgen_arithi(TCGContext *s, int c, int r0,
                 /* AND with no high bits set can use a 32-bit operation.  */
                 rexw = 0;
             }
+            if ((val & 0xffffffffffff0000) == 0xffffffffffff0000) {
+                /* mask lower 16 bits on 16-bit register */
+                tcg_out8(s, 0x66);
+                tcg_out_modrm(s, OPC_ARITH_EvIz, c, r0);
+                tcg_out16(s, val);
+                return;
+            }
         }
         if (val == 0xffu && (r0 < 4 || TCG_TARGET_REG_BITS == 64)) {
             tcg_out_ext8u(s, r0, r0);
--
2.41.0
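
The equivalence the commit message relies on can be checked outside
QEMU. The following is a minimal standalone sketch, not QEMU code:
and64(), and16_low(), upper48_all_set() and check_equivalence() are
hypothetical helpers introduced only for illustration. They model
"andq $imm, %reg" and "andw $imm16, %reg16" on a plain uint64_t,
assuming that a 16-bit operation leaves bits 16..63 of the register
untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models andq: the whole 64-bit register is ANDed with the mask. */
static uint64_t and64(uint64_t reg, uint64_t mask)
{
    return reg & mask;
}

/* Models andw: only bits 0..15 are ANDed; bits 16..63 are preserved. */
static uint64_t and16_low(uint64_t reg, uint16_t mask16)
{
    uint64_t low = (uint64_t)((uint16_t)reg & mask16);
    return (reg & ~UINT64_C(0xffff)) | low;
}

/* Same test as in the patch: are the upper 48 bits of the mask all set? */
static int upper48_all_set(uint64_t mask)
{
    const uint64_t high = UINT64_C(0xffffffffffff0000);
    return (mask & high) == high;
}

/* Asserts that both forms agree on a few sample register values. */
static void check_equivalence(uint64_t mask)
{
    const uint64_t samples[] = {
        0, 0x1234, UINT64_C(0x7ff5b011556a), UINT64_MAX
    };

    assert(upper48_all_set(mask));
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        assert(and64(samples[i], mask) ==
               and16_low(samples[i], (uint16_t)mask));
    }
}
```

Calling check_equivalence(0xfffffffffffff000) from a main() should
complete without tripping an assertion. This only shows that the two
forms produce the same value; it says nothing about how fast either
encoding decodes or executes.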




* Re: [PATCH] tcg/i386: Check for shorter instruction sequence for ARITH_AND
  2023-08-07 14:28 [PATCH] tcg/i386: Check for shorter instruction sequence for ARITH_AND Helge Deller
@ 2023-08-07 18:57 ` Richard Henderson
  2023-08-07 19:28   ` Helge Deller
  0 siblings, 1 reply; 3+ messages in thread
From: Richard Henderson @ 2023-08-07 18:57 UTC (permalink / raw)
  To: Helge Deller, qemu-devel

On 8/7/23 07:28, Helge Deller wrote:
> The tcg uses tgen_arithi(ARITH_AND) during fast CPU TLB lookups,
> which e.g. translates to:
> 
> 0x7ff5b011556a:  48 81 e6 00 f0 ff ff     andq     $0xfffffffffffff000, %rsi
> 
> If the upper 48 bits of the mask are all set, the AND can instead
> operate on the lower 16 bits of the target register (si), which gives
> an instruction that is 2 bytes shorter:
> 
> 0x7f4488097b31:  66 81 e6 00 f0           andw     $0xf000, %si
> 
> Signed-off-by: Helge Deller <deller@gmx.de>


Current Intel optimization guidelines

https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual.html

Section 3.4.2.3, Length Changing Prefixes, suggests that using 16-bit operands slows
decode from 1 cycle to 6 cycles.

Section 3.5.2.3, Partial Register Stalls, says that Skylake has fixed the major issues 
that older microarchitectures had with such stalls, but that these operations have two 
additional cycles of delay.

So on balance I don't think this is a good tradeoff.


r~
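
To illustrate why the 0x66 prefix counts as "length changing" here,
the sketch below computes the length of the two encodings quoted in
the patch. It is a hedged illustration only: arith_EvIz_length() is a
hypothetical helper, not part of QEMU or any decoder library, and it
assumes 64-bit mode and handles only the register-direct form of
opcode 0x81. The point is that the same opcode carries a 4-byte
immediate without the prefix and a 2-byte immediate with it, so the
instruction length cannot be known until the prefix has been seen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: length in bytes of "[66] [REX] 81 /r imm" with a
 * register operand (ModRM.mod == 3), assuming 64-bit mode.
 * Returns 0 for anything this sketch does not handle.
 */
static size_t arith_EvIz_length(const uint8_t *insn)
{
    size_t i = 0;
    int opsize16 = 0;
    int rex_w = 0;

    assert(insn != NULL);

    if (insn[i] == 0x66) {              /* operand-size override prefix */
        opsize16 = 1;
        i++;
    }
    if ((insn[i] & 0xf0) == 0x40) {     /* REX prefix (64-bit mode only) */
        rex_w = (insn[i] & 0x08) != 0;
        i++;
    }
    if (insn[i] != 0x81) {              /* only opcode 0x81 is handled */
        return 0;
    }
    i++;
    if ((insn[i] & 0xc0) != 0xc0) {     /* only register-direct ModRM */
        return 0;
    }
    i++;

    /* REX.W takes precedence over 0x66; otherwise 0x66 selects imm16. */
    i += (opsize16 && !rex_w) ? 2 : 4;
    return i;
}
```

Fed the byte sequences from the patch description, this yields 7 for
"48 81 e6 00 f0 ff ff" and 5 for "66 81 e6 00 f0", matching the 2-byte
saving, while the immediate size depends on the prefix.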



* Re: [PATCH] tcg/i386: Check for shorter instruction sequence for ARITH_AND
  2023-08-07 18:57 ` Richard Henderson
@ 2023-08-07 19:28   ` Helge Deller
  0 siblings, 0 replies; 3+ messages in thread
From: Helge Deller @ 2023-08-07 19:28 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel

On 8/7/23 20:57, Richard Henderson wrote:
> On 8/7/23 07:28, Helge Deller wrote:
>> The tcg uses tgen_arithi(ARITH_AND) during fast CPU TLB lookups,
>> which e.g. translates to:
>>
>> 0x7ff5b011556a:  48 81 e6 00 f0 ff ff     andq     $0xfffffffffffff000, %rsi
>>
>> If the upper 48 bits of the mask are all set, the AND can instead
>> operate on the lower 16 bits of the target register (si), which gives
>> an instruction that is 2 bytes shorter:
>>
>> 0x7f4488097b31:  66 81 e6 00 f0           andw     $0xf000, %si
>>
>> Signed-off-by: Helge Deller <deller@gmx.de>
>
>
> Current Intel optimization guidelines
>
> https://www.intel.com/content/www/us/en/content-details/671488/intel-64-and-ia-32-architectures-optimization-reference-manual.html
>
> Section 3.4.2.3, Length Changing Prefixes, suggests that using 16-bit operands slows decode from 1 cycle to 6 cycles.
>
> Section 3.5.2.3, Partial Register Stalls, says that Skylake has fixed the major issues that older microarchitectures had with such stalls, but that these operations have two additional cycles of delay.
>
> So on balance I don't think this is a good tradeoff.

Ok. Thanks for the links!

Helge



