* [PATCH v2 0/6] tcg: Improve extract and deposit code gen
@ 2026-02-04 5:24 Richard Henderson
2026-02-04 5:24 ` [PATCH v2 1/6] tcg/optimize: Lower unsupported deposit during optimize Richard Henderson
` (5 more replies)
0 siblings, 6 replies; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini
Supersedes: 20260119000740.50516-1-richard.henderson@linaro.org
[PATCH 0/3] tcg: Lower deposit/extract2 during optimize
Supersedes: 20260115135453.140870-1-pbonzini@redhat.com
[PATCH 0/2] tcg: improve instruction selection for extract and deposit_z
This is a merge of these two patch sets. I'm not sure what
inputs you were looking at, Paolo?
With random aarch64 guest binaries on an x86_64 host, I still
see the most benefit from the lowering during optimize. It's not
a lot, but every little bit helps, I guess.
r~
Paolo Bonzini (2):
tcg: Add tcg_op_imm_match
tcg: target-dependent lowering of extract to shr/and
Richard Henderson (4):
tcg/optimize: Lower unsupported deposit during optimize
tcg/optimize: Lower unsupported extract2 during optimize
tcg: Expand missing rotri with extract2
tcg/optimize: possibly expand deposit into zero with shifts
tcg/tcg-internal.h | 5 +
tcg/optimize.c | 279 ++++++++++++++++++++++++++++++++++++++++-----
tcg/tcg-op.c | 210 ++++++++--------------------------
tcg/tcg.c | 21 +++-
4 files changed, 322 insertions(+), 193 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 20+ messages in thread
* [PATCH v2 1/6] tcg/optimize: Lower unsupported deposit during optimize
2026-02-04 5:24 [PATCH v2 0/6] tcg: Improve extract and deposit code gen Richard Henderson
@ 2026-02-04 5:24 ` Richard Henderson
2026-02-25 13:34 ` Jim MacArthur
2026-02-04 5:24 ` [PATCH v2 2/6] tcg/optimize: Lower unsupported extract2 " Richard Henderson
` (4 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini
The expansions that we chose in tcg-op.c may be less than optimal.
Delay lowering until optimize, so that we have propagated constants
and have computed known zero/one masks.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/optimize.c | 194 +++++++++++++++++++++++++++++++++++++++++++------
tcg/tcg-op.c | 113 ++--------------------------
2 files changed, 178 insertions(+), 129 deletions(-)
diff --git a/tcg/optimize.c b/tcg/optimize.c
index 801a0a2c68..890c8068fb 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -1652,12 +1652,17 @@ static bool fold_ctpop(OptContext *ctx, TCGOp *op)
static bool fold_deposit(OptContext *ctx, TCGOp *op)
{
- TempOptInfo *t1 = arg_info(op->args[1]);
- TempOptInfo *t2 = arg_info(op->args[2]);
+ TCGArg ret = op->args[0];
+ TCGArg arg1 = op->args[1];
+ TCGArg arg2 = op->args[2];
int ofs = op->args[3];
int len = op->args[4];
- int width = 8 * tcg_type_size(ctx->type);
- uint64_t z_mask, o_mask, s_mask;
+ TempOptInfo *t1 = arg_info(arg1);
+ TempOptInfo *t2 = arg_info(arg2);
+ int width;
+ uint64_t z_mask, o_mask, s_mask, type_mask, len_mask;
+ TCGOp *op2;
+ bool valid;
if (ti_is_const(t1) && ti_is_const(t2)) {
return tcg_opt_gen_movi(ctx, op, op->args[0],
@@ -1665,35 +1670,182 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
ti_const_val(t2)));
}
- /* Inserting a value into zero at offset 0. */
- if (ti_is_const_val(t1, 0) && ofs == 0) {
- uint64_t mask = MAKE_64BIT_MASK(0, len);
+ width = 8 * tcg_type_size(ctx->type);
+ type_mask = MAKE_64BIT_MASK(0, width);
+ len_mask = MAKE_64BIT_MASK(0, len);
+ /* Inserting all-zero into a value. */
+ if ((t2->z_mask & len_mask) == 0) {
op->opc = INDEX_op_and;
- op->args[1] = op->args[2];
- op->args[2] = arg_new_constant(ctx, mask);
+ op->args[2] = arg_new_constant(ctx, ~(len_mask << ofs));
return fold_and(ctx, op);
}
- /* Inserting zero into a value. */
- if (ti_is_const_val(t2, 0)) {
- uint64_t mask = deposit64(-1, ofs, len, 0);
-
- op->opc = INDEX_op_and;
- op->args[2] = arg_new_constant(ctx, mask);
- return fold_and(ctx, op);
+ /* Inserting all-one into a value. */
+ if ((t2->o_mask & len_mask) == len_mask) {
+ op->opc = INDEX_op_or;
+ op->args[2] = arg_new_constant(ctx, len_mask << ofs);
+ return fold_or(ctx, op);
}
- /* The s_mask from the top portion of the deposit is still valid. */
- if (ofs + len == width) {
- s_mask = t2->s_mask << ofs;
- } else {
- s_mask = t1->s_mask & ~MAKE_64BIT_MASK(0, ofs + len);
+ valid = TCG_TARGET_deposit_valid(ctx->type, ofs, len);
+
+ /* Lower invalid deposit of constant as AND + OR. */
+ if (!valid && ti_is_const(t2)) {
+ uint64_t ins_val = (ti_const_val(t2) & len_mask) << ofs;
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
+ op2->args[0] = ret;
+ op2->args[1] = arg1;
+ op2->args[2] = arg_new_constant(ctx, ~(len_mask << ofs));
+ fold_and(ctx, op2);
+
+ op->opc = INDEX_op_or;
+ op->args[1] = ret;
+ op->args[2] = arg_new_constant(ctx, ins_val);
+ return fold_or(ctx, op);
}
+ /*
+ * Compute result masks before calling other fold_* subroutines
+ * which could modify the masks of our inputs.
+ */
z_mask = deposit64(t1->z_mask, ofs, len, t2->z_mask);
o_mask = deposit64(t1->o_mask, ofs, len, t2->o_mask);
+ if (ofs + len < width) {
+ s_mask = t1->s_mask & ~MAKE_64BIT_MASK(0, ofs + len);
+ } else {
+ s_mask = t2->s_mask << ofs;
+ }
+ /* Inserting a value into zero. */
+ if (ti_is_const_val(t1, 0)) {
+ uint64_t need_mask;
+
+ /* Always lower deposit into zero at 0 as AND. */
+ if (ofs == 0) {
+ op->opc = INDEX_op_and;
+ op->args[1] = arg2;
+ op->args[2] = arg_new_constant(ctx, len_mask);
+ return fold_and(ctx, op);
+ }
+
+ /*
+ * If the portion of the value outside len that remains after
+ * shifting is zero, we can elide the mask and just shift.
+ */
+ need_mask = t2->z_mask & ~len_mask;
+ need_mask = (need_mask << ofs) & type_mask;
+ if (!need_mask) {
+ op->opc = INDEX_op_shl;
+ op->args[1] = arg2;
+ op->args[2] = arg_new_constant(ctx, ofs);
+ goto done;
+ }
+
+ /* Lower invalid deposit into zero as AND + SHL or SHL + AND. */
+ if (!valid) {
+ if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len) &&
+ !TCG_TARGET_extract_valid(ctx->type, 0, len)) {
+ op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
+ op2->args[0] = ret;
+ op2->args[1] = arg2;
+ op2->args[2] = arg_new_constant(ctx, ofs);
+
+ op->opc = INDEX_op_extract;
+ op->args[1] = ret;
+ op->args[2] = 0;
+ op->args[3] = ofs + len;
+ goto done;
+ }
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
+ op2->args[0] = ret;
+ op2->args[1] = arg2;
+ op2->args[2] = arg_new_constant(ctx, len_mask);
+ fold_and(ctx, op2);
+
+ op->opc = INDEX_op_shl;
+ op->args[1] = ret;
+ op->args[2] = arg_new_constant(ctx, ofs);
+ goto done;
+ }
+ }
+
+ /* After special cases, lower invalid deposit. */
+ if (!valid) {
+ TCGArg tmp;
+ bool has_ext2 = tcg_op_supported(INDEX_op_extract2, ctx->type, 0);
+ bool has_rotl = tcg_op_supported(INDEX_op_rotl, ctx->type, 0);
+
+ /*
+ * ret = arg2:arg1 >> len
+ * ret = rotl(ret, len)
+ */
+ if (ofs == 0 && has_ext2 && has_rotl) {
+ op2 = opt_insert_before(ctx, op, INDEX_op_extract2, 4);
+ op2->args[0] = ret;
+ op2->args[1] = arg1;
+ op2->args[2] = arg2;
+ op2->args[3] = len;
+
+ op->opc = INDEX_op_rotl;
+ op->args[1] = ret;
+ op->args[2] = arg_new_constant(ctx, len);
+ goto done;
+ }
+
+ /*
+ * tmp = arg1 << len
+ * ret = arg2:tmp >> len
+ */
+ if (ofs + len == width && has_ext2) {
+ tmp = ret == arg2 ? arg_new_temp(ctx) : ret;
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_shl, 4);
+ op2->args[0] = tmp;
+ op2->args[1] = arg1;
+ op2->args[2] = arg_new_constant(ctx, len);
+
+ op->opc = INDEX_op_extract2;
+ op->args[0] = ret;
+ op->args[1] = tmp;
+ op->args[2] = arg2;
+ op->args[3] = len;
+ goto done;
+ }
+
+ /*
+ * tmp = arg2 & mask
+ * ret = arg1 & ~(mask << ofs)
+ * tmp = tmp << ofs
+ * ret = ret | tmp
+ */
+ tmp = arg_new_temp(ctx);
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
+ op2->args[0] = tmp;
+ op2->args[1] = arg2;
+ op2->args[2] = arg_new_constant(ctx, len_mask);
+ fold_and(ctx, op2);
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
+ op2->args[0] = tmp;
+ op2->args[1] = tmp;
+ op2->args[2] = arg_new_constant(ctx, ofs);
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
+ op2->args[0] = ret;
+ op2->args[1] = arg1;
+ op2->args[2] = arg_new_constant(ctx, ~(len_mask << ofs));
+ fold_and(ctx, op2);
+
+ op->opc = INDEX_op_or;
+ op->args[1] = ret;
+ op->args[2] = tmp;
+ }
+
+ done:
return fold_masks_zos(ctx, op, z_mask, o_mask, s_mask);
}
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 8d67acc4fc..96f72ba381 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -876,9 +876,6 @@ void tcg_gen_rotri_i32(TCGv_i32 ret, TCGv_i32 arg1, int32_t arg2)
void tcg_gen_deposit_i32(TCGv_i32 ret, TCGv_i32 arg1, TCGv_i32 arg2,
unsigned int ofs, unsigned int len)
{
- uint32_t mask;
- TCGv_i32 t1;
-
tcg_debug_assert(ofs < 32);
tcg_debug_assert(len > 0);
tcg_debug_assert(len <= 32);
@@ -886,39 +883,9 @@ void tcg_gen_deposit_i32(TCGv_i32 ret, TCGv_i32 arg1, TCGv_i32 arg2,
if (len == 32) {
tcg_gen_mov_i32(ret, arg2);
- return;
- }
- if (TCG_TARGET_deposit_valid(TCG_TYPE_I32, ofs, len)) {
- tcg_gen_op5ii_i32(INDEX_op_deposit, ret, arg1, arg2, ofs, len);
- return;
- }
-
- t1 = tcg_temp_ebb_new_i32();
-
- if (tcg_op_supported(INDEX_op_extract2, TCG_TYPE_I32, 0)) {
- if (ofs + len == 32) {
- tcg_gen_shli_i32(t1, arg1, len);
- tcg_gen_extract2_i32(ret, t1, arg2, len);
- goto done;
- }
- if (ofs == 0) {
- tcg_gen_extract2_i32(ret, arg1, arg2, len);
- tcg_gen_rotli_i32(ret, ret, len);
- goto done;
- }
- }
-
- mask = (1u << len) - 1;
- if (ofs + len < 32) {
- tcg_gen_andi_i32(t1, arg2, mask);
- tcg_gen_shli_i32(t1, t1, ofs);
} else {
- tcg_gen_shli_i32(t1, arg2, ofs);
+ tcg_gen_op5ii_i32(INDEX_op_deposit, ret, arg1, arg2, ofs, len);
}
- tcg_gen_andi_i32(ret, arg1, ~(mask << ofs));
- tcg_gen_or_i32(ret, ret, t1);
- done:
- tcg_temp_free_i32(t1);
}
void tcg_gen_deposit_z_i32(TCGv_i32 ret, TCGv_i32 arg,
@@ -932,28 +899,10 @@ void tcg_gen_deposit_z_i32(TCGv_i32 ret, TCGv_i32 arg,
if (ofs + len == 32) {
tcg_gen_shli_i32(ret, arg, ofs);
} else if (ofs == 0) {
- tcg_gen_andi_i32(ret, arg, (1u << len) - 1);
- } else if (TCG_TARGET_deposit_valid(TCG_TYPE_I32, ofs, len)) {
+ tcg_gen_extract_i32(ret, arg, 0, len);
+ } else {
TCGv_i32 zero = tcg_constant_i32(0);
tcg_gen_op5ii_i32(INDEX_op_deposit, ret, zero, arg, ofs, len);
- } else {
- /*
- * To help two-operand hosts we prefer to zero-extend first,
- * which allows ARG to stay live.
- */
- if (TCG_TARGET_extract_valid(TCG_TYPE_I32, 0, len)) {
- tcg_gen_extract_i32(ret, arg, 0, len);
- tcg_gen_shli_i32(ret, ret, ofs);
- return;
- }
- /* Otherwise prefer zero-extension over AND for code size. */
- if (TCG_TARGET_extract_valid(TCG_TYPE_I32, 0, ofs + len)) {
- tcg_gen_shli_i32(ret, arg, ofs);
- tcg_gen_extract_i32(ret, ret, 0, ofs + len);
- return;
- }
- tcg_gen_andi_i32(ret, arg, (1u << len) - 1);
- tcg_gen_shli_i32(ret, ret, ofs);
}
}
@@ -2148,9 +2097,6 @@ void tcg_gen_rotri_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2)
void tcg_gen_deposit_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2,
unsigned int ofs, unsigned int len)
{
- uint64_t mask;
- TCGv_i64 t1;
-
tcg_debug_assert(ofs < 64);
tcg_debug_assert(len > 0);
tcg_debug_assert(len <= 64);
@@ -2158,40 +2104,9 @@ void tcg_gen_deposit_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2,
if (len == 64) {
tcg_gen_mov_i64(ret, arg2);
- return;
- }
-
- if (TCG_TARGET_deposit_valid(TCG_TYPE_I64, ofs, len)) {
- tcg_gen_op5ii_i64(INDEX_op_deposit, ret, arg1, arg2, ofs, len);
- return;
- }
-
- t1 = tcg_temp_ebb_new_i64();
-
- if (tcg_op_supported(INDEX_op_extract2, TCG_TYPE_I64, 0)) {
- if (ofs + len == 64) {
- tcg_gen_shli_i64(t1, arg1, len);
- tcg_gen_extract2_i64(ret, t1, arg2, len);
- goto done;
- }
- if (ofs == 0) {
- tcg_gen_extract2_i64(ret, arg1, arg2, len);
- tcg_gen_rotli_i64(ret, ret, len);
- goto done;
- }
- }
-
- mask = (1ull << len) - 1;
- if (ofs + len < 64) {
- tcg_gen_andi_i64(t1, arg2, mask);
- tcg_gen_shli_i64(t1, t1, ofs);
} else {
- tcg_gen_shli_i64(t1, arg2, ofs);
+ tcg_gen_op5ii_i64(INDEX_op_deposit, ret, arg1, arg2, ofs, len);
}
- tcg_gen_andi_i64(ret, arg1, ~(mask << ofs));
- tcg_gen_or_i64(ret, ret, t1);
- done:
- tcg_temp_free_i64(t1);
}
void tcg_gen_deposit_z_i64(TCGv_i64 ret, TCGv_i64 arg,
@@ -2206,27 +2121,9 @@ void tcg_gen_deposit_z_i64(TCGv_i64 ret, TCGv_i64 arg,
tcg_gen_shli_i64(ret, arg, ofs);
} else if (ofs == 0) {
tcg_gen_andi_i64(ret, arg, (1ull << len) - 1);
- } else if (TCG_TARGET_deposit_valid(TCG_TYPE_I64, ofs, len)) {
+ } else {
TCGv_i64 zero = tcg_constant_i64(0);
tcg_gen_op5ii_i64(INDEX_op_deposit, ret, zero, arg, ofs, len);
- } else {
- /*
- * To help two-operand hosts we prefer to zero-extend first,
- * which allows ARG to stay live.
- */
- if (TCG_TARGET_extract_valid(TCG_TYPE_I64, 0, len)) {
- tcg_gen_extract_i64(ret, arg, 0, len);
- tcg_gen_shli_i64(ret, ret, ofs);
- return;
- }
- /* Otherwise prefer zero-extension over AND for code size. */
- if (TCG_TARGET_extract_valid(TCG_TYPE_I64, 0, ofs + len)) {
- tcg_gen_shli_i64(ret, arg, ofs);
- tcg_gen_extract_i64(ret, ret, 0, ofs + len);
- return;
- }
- tcg_gen_andi_i64(ret, arg, (1ull << len) - 1);
- tcg_gen_shli_i64(ret, ret, ofs);
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
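As an illustration of the lowerings in patch 1, here is a minimal standalone sketch in plain C on 64-bit values. It does not use the TCG API: the helpers low_mask, deposit_ref, extract2, rotl, deposit_generic, deposit_at_zero and deposit_at_top are hypothetical names invented for this example, and the real code in fold_deposit emits TCG ops rather than computing values. The sketch only checks that the three expansions described in the patch comments agree with the usual deposit semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low LEN bits set, for 1 <= len <= 64. */
static uint64_t low_mask(unsigned len)
{
    assert(len >= 1 && len <= 64);
    return len == 64 ? UINT64_MAX : (UINT64_C(1) << len) - 1;
}

/* Reference: replace bits [ofs, ofs + len) of arg1 with the low bits of arg2. */
static uint64_t deposit_ref(uint64_t arg1, uint64_t arg2,
                            unsigned ofs, unsigned len)
{
    assert(ofs < 64 && len >= 1 && ofs + len <= 64);
    uint64_t mask = low_mask(len) << ofs;
    return (arg1 & ~mask) | ((arg2 << ofs) & mask);
}

/* Model of extract2: low 64 bits of (ah:al >> shr), for 0 < shr < 64. */
static uint64_t extract2(uint64_t al, uint64_t ah, unsigned shr)
{
    assert(shr > 0 && shr < 64);
    return (al >> shr) | (ah << (64 - shr));
}

/* Rotate left by n, for 0 < n < 64. */
static uint64_t rotl(uint64_t x, unsigned n)
{
    assert(n > 0 && n < 64);
    return (x << n) | (x >> (64 - n));
}

/* Generic fallback from the patch comment: AND, SHL, AND, OR. */
static uint64_t deposit_generic(uint64_t arg1, uint64_t arg2,
                                unsigned ofs, unsigned len)
{
    uint64_t len_mask = low_mask(len);
    uint64_t tmp = arg2 & len_mask;            /* tmp = arg2 & mask          */
    tmp <<= ofs;                               /* tmp = tmp << ofs           */
    uint64_t ret = arg1 & ~(len_mask << ofs);  /* ret = arg1 & ~(mask << ofs) */
    return ret | tmp;                          /* ret = ret | tmp            */
}

/* ofs == 0 case: ret = arg2:arg1 >> len; ret = rotl(ret, len). */
static uint64_t deposit_at_zero(uint64_t arg1, uint64_t arg2, unsigned len)
{
    return rotl(extract2(arg1, arg2, len), len);
}

/* ofs + len == 64 case: tmp = arg1 << len; ret = arg2:tmp >> len. */
static uint64_t deposit_at_top(uint64_t arg1, uint64_t arg2, unsigned len)
{
    return extract2(arg1 << len, arg2, len);
}
```

For example, with arg1 = 0x1122334455667788 and arg2 = 0xABCD, deposit_at_zero with len 8 and deposit_ref with ofs 0, len 8 both replace only the low byte with 0xCD.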
* [PATCH v2 2/6] tcg/optimize: Lower unsupported extract2 during optimize
2026-02-04 5:24 [PATCH v2 0/6] tcg: Improve extract and deposit code gen Richard Henderson
2026-02-04 5:24 ` [PATCH v2 1/6] tcg/optimize: Lower unsupported deposit during optimize Richard Henderson
@ 2026-02-04 5:24 ` Richard Henderson
2026-02-25 14:47 ` Jim MacArthur
2026-02-04 5:24 ` [PATCH v2 3/6] tcg: Expand missing rotri with extract2 Richard Henderson
` (3 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, Manos Pitsidianakis
The expansions that we chose in tcg-op.c may be less than optimal.
Delay lowering until optimize, so that we have propagated constants
and have computed known zero/one masks.
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/optimize.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++----
tcg/tcg-op.c | 9 ++------
2 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/tcg/optimize.c b/tcg/optimize.c
index 890c8068fb..e6a16921c9 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -1933,21 +1933,74 @@ static bool fold_extract2(OptContext *ctx, TCGOp *op)
uint64_t z2 = t2->z_mask;
uint64_t o1 = t1->o_mask;
uint64_t o2 = t2->o_mask;
+ uint64_t zr, or;
int shr = op->args[3];
+ int shl;
if (ctx->type == TCG_TYPE_I32) {
z1 = (uint32_t)z1 >> shr;
o1 = (uint32_t)o1 >> shr;
- z2 = (uint64_t)((int32_t)z2 << (32 - shr));
- o2 = (uint64_t)((int32_t)o2 << (32 - shr));
+ shl = 32 - shr;
+ z2 = (uint64_t)((int32_t)z2 << shl);
+ o2 = (uint64_t)((int32_t)o2 << shl);
} else {
z1 >>= shr;
o1 >>= shr;
- z2 <<= 64 - shr;
- o2 <<= 64 - shr;
+ shl = 64 - shr;
+ z2 <<= shl;
+ o2 <<= shl;
+ }
+ zr = z1 | z2;
+ or = o1 | o2;
+
+ if (zr == or) {
+ return tcg_opt_gen_movi(ctx, op, op->args[0], zr);
}
- return fold_masks_zo(ctx, op, z1 | z2, o1 | o2);
+ if (z2 == 0) {
+ /* A zero high part folds to a simple right shift. */
+ op->opc = INDEX_op_shr;
+ op->args[2] = arg_new_constant(ctx, shr);
+ } else if (z1 == 0) {
+ /* A zero low part folds to a simple left shift. */
+ op->opc = INDEX_op_shl;
+ op->args[1] = op->args[2];
+ op->args[2] = arg_new_constant(ctx, shl);
+ } else if (!tcg_op_supported(INDEX_op_extract2, ctx->type, 0)) {
+ TCGArg tmp = arg_new_temp(ctx);
+ TCGOp *op2 = opt_insert_before(ctx, op, INDEX_op_shr, 3);
+
+ op2->args[0] = tmp;
+ op2->args[1] = op->args[1];
+ op2->args[2] = arg_new_constant(ctx, shr);
+
+ if (TCG_TARGET_deposit_valid(ctx->type, shl, shr)) {
+ /*
+ * Deposit has more arguments than extract2,
+ * so we need to create a new TCGOp.
+ */
+ op2 = opt_insert_before(ctx, op, INDEX_op_deposit, 5);
+ op2->args[0] = op->args[0];
+ op2->args[1] = tmp;
+ op2->args[2] = op->args[2];
+ op2->args[3] = shl;
+ op2->args[4] = shr;
+
+ tcg_op_remove(ctx->tcg, op);
+ op = op2;
+ } else {
+ op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
+ op2->args[0] = op->args[0];
+ op2->args[1] = op->args[2];
+ op2->args[2] = arg_new_constant(ctx, shl);
+
+ op->opc = INDEX_op_or;
+ op->args[1] = op->args[0];
+ op->args[2] = tmp;
+ }
+ }
+
+ return fold_masks_zo(ctx, op, zr, or);
}
static bool fold_exts(OptContext *ctx, TCGOp *op)
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 96f72ba381..8a4fd14ad5 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -1000,13 +1000,8 @@ void tcg_gen_extract2_i32(TCGv_i32 ret, TCGv_i32 al, TCGv_i32 ah,
tcg_gen_mov_i32(ret, ah);
} else if (al == ah) {
tcg_gen_rotri_i32(ret, al, ofs);
- } else if (tcg_op_supported(INDEX_op_extract2, TCG_TYPE_I32, 0)) {
- tcg_gen_op4i_i32(INDEX_op_extract2, ret, al, ah, ofs);
} else {
- TCGv_i32 t0 = tcg_temp_ebb_new_i32();
- tcg_gen_shri_i32(t0, al, ofs);
- tcg_gen_deposit_i32(ret, t0, ah, 32 - ofs, ofs);
- tcg_temp_free_i32(t0);
+ tcg_gen_op4i_i32(INDEX_op_extract2, ret, al, ah, ofs);
}
}
@@ -2221,7 +2216,7 @@ void tcg_gen_extract2_i64(TCGv_i64 ret, TCGv_i64 al, TCGv_i64 ah,
tcg_gen_mov_i64(ret, ah);
} else if (al == ah) {
tcg_gen_rotri_i64(ret, al, ofs);
- } else if (tcg_op_supported(INDEX_op_extract2, TCG_TYPE_I64, 0)) {
+ } else if (TCG_TARGET_REG_BITS == 64) {
tcg_gen_op4i_i64(INDEX_op_extract2, ret, al, ah, ofs);
} else {
TCGv_i64 t0 = tcg_temp_ebb_new_i64();
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
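The two fallback expansions of extract2 in patch 2 can be modelled in plain C as follows. This is only a sketch under assumptions: extract2_ref, deposit, extract2_shifts and extract2_deposit are hypothetical helper names, the values are 64-bit, and nothing here touches the TCG data structures. It shows that SHR + SHL + OR and SHR + DEPOSIT both produce the low 64 bits of the 128-bit value ah:al shifted right by shr.

```c
#include <assert.h>
#include <stdint.h>

/* Reference semantics: low 64 bits of (ah:al >> shr), for 0 < shr < 64. */
static uint64_t extract2_ref(uint64_t al, uint64_t ah, unsigned shr)
{
    assert(shr > 0 && shr < 64);
    return (al >> shr) | (ah << (64 - shr));
}

/* Deposit the low LEN bits of arg2 into arg1 at OFS, for 1 <= len < 64. */
static uint64_t deposit(uint64_t arg1, uint64_t arg2,
                        unsigned ofs, unsigned len)
{
    assert(len >= 1 && len < 64 && ofs + len <= 64);
    uint64_t mask = ((UINT64_C(1) << len) - 1) << ofs;
    return (arg1 & ~mask) | ((arg2 << ofs) & mask);
}

/* Fallback without deposit: tmp = al >> shr; ret = ah << shl; ret |= tmp. */
static uint64_t extract2_shifts(uint64_t al, uint64_t ah, unsigned shr)
{
    assert(shr > 0 && shr < 64);
    unsigned shl = 64 - shr;
    uint64_t tmp = al >> shr;
    uint64_t ret = ah << shl;
    return ret | tmp;
}

/* Fallback with deposit: tmp = al >> shr; ret = deposit(tmp, ah, shl, shr). */
static uint64_t extract2_deposit(uint64_t al, uint64_t ah, unsigned shr)
{
    assert(shr > 0 && shr < 64);
    unsigned shl = 64 - shr;
    uint64_t tmp = al >> shr;
    return deposit(tmp, ah, shl, shr);
}
```

The deposit variant works because the top shr bits of tmp are already zero after the right shift, so overwriting them with the low shr bits of ah is the same as OR-ing in ah << shl.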
* [PATCH v2 3/6] tcg: Expand missing rotri with extract2
2026-02-04 5:24 [PATCH v2 0/6] tcg: Improve extract and deposit code gen Richard Henderson
2026-02-04 5:24 ` [PATCH v2 1/6] tcg/optimize: Lower unsupported deposit during optimize Richard Henderson
2026-02-04 5:24 ` [PATCH v2 2/6] tcg/optimize: Lower unsupported extract2 " Richard Henderson
@ 2026-02-04 5:24 ` Richard Henderson
2026-02-25 14:54 ` Jim MacArthur
2026-02-04 5:24 ` [PATCH v2 4/6] tcg: Add tcg_op_imm_match Richard Henderson
` (2 subsequent siblings)
5 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini
Use extract2 to implement rotri. To make this easier,
redefine rotli in terms of rotri, rather than the reverse.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op.c | 52 ++++++++++++++++++++++++----------------------------
1 file changed, 24 insertions(+), 28 deletions(-)
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 8a4fd14ad5..078adce610 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -826,23 +826,12 @@ void tcg_gen_rotl_i32(TCGv_i32 ret, TCGv_i32 arg1, TCGv_i32 arg2)
void tcg_gen_rotli_i32(TCGv_i32 ret, TCGv_i32 arg1, int32_t arg2)
{
tcg_debug_assert(arg2 >= 0 && arg2 < 32);
- /* some cases can be optimized here */
if (arg2 == 0) {
tcg_gen_mov_i32(ret, arg1);
} else if (tcg_op_supported(INDEX_op_rotl, TCG_TYPE_I32, 0)) {
- TCGv_i32 t0 = tcg_constant_i32(arg2);
- tcg_gen_op3_i32(INDEX_op_rotl, ret, arg1, t0);
- } else if (tcg_op_supported(INDEX_op_rotr, TCG_TYPE_I32, 0)) {
- TCGv_i32 t0 = tcg_constant_i32(32 - arg2);
- tcg_gen_op3_i32(INDEX_op_rotr, ret, arg1, t0);
+ tcg_gen_op3_i32(INDEX_op_rotl, ret, arg1, tcg_constant_i32(arg2));
} else {
- TCGv_i32 t0 = tcg_temp_ebb_new_i32();
- TCGv_i32 t1 = tcg_temp_ebb_new_i32();
- tcg_gen_shli_i32(t0, arg1, arg2);
- tcg_gen_shri_i32(t1, arg1, 32 - arg2);
- tcg_gen_or_i32(ret, t0, t1);
- tcg_temp_free_i32(t0);
- tcg_temp_free_i32(t1);
+ tcg_gen_rotri_i32(ret, arg1, -arg2 & 31);
}
}
@@ -870,7 +859,16 @@ void tcg_gen_rotr_i32(TCGv_i32 ret, TCGv_i32 arg1, TCGv_i32 arg2)
void tcg_gen_rotri_i32(TCGv_i32 ret, TCGv_i32 arg1, int32_t arg2)
{
tcg_debug_assert(arg2 >= 0 && arg2 < 32);
- tcg_gen_rotli_i32(ret, arg1, -arg2 & 31);
+ if (arg2 == 0) {
+ tcg_gen_mov_i32(ret, arg1);
+ } else if (tcg_op_supported(INDEX_op_rotr, TCG_TYPE_I32, 0)) {
+ tcg_gen_op3_i32(INDEX_op_rotr, ret, arg1, tcg_constant_i32(arg2));
+ } else if (tcg_op_supported(INDEX_op_rotl, TCG_TYPE_I32, 0)) {
+ tcg_gen_op3_i32(INDEX_op_rotl, ret, arg1, tcg_constant_i32(32 - arg2));
+ } else {
+ /* Do not recurse with the rotri simplification. */
+ tcg_gen_op4i_i32(INDEX_op_extract2, ret, arg1, arg1, arg2);
+ }
}
void tcg_gen_deposit_i32(TCGv_i32 ret, TCGv_i32 arg1, TCGv_i32 arg2,
@@ -2042,23 +2040,12 @@ void tcg_gen_rotl_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2)
void tcg_gen_rotli_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2)
{
tcg_debug_assert(arg2 >= 0 && arg2 < 64);
- /* some cases can be optimized here */
if (arg2 == 0) {
tcg_gen_mov_i64(ret, arg1);
} else if (tcg_op_supported(INDEX_op_rotl, TCG_TYPE_I64, 0)) {
- TCGv_i64 t0 = tcg_constant_i64(arg2);
- tcg_gen_op3_i64(INDEX_op_rotl, ret, arg1, t0);
- } else if (tcg_op_supported(INDEX_op_rotr, TCG_TYPE_I64, 0)) {
- TCGv_i64 t0 = tcg_constant_i64(64 - arg2);
- tcg_gen_op3_i64(INDEX_op_rotr, ret, arg1, t0);
+ tcg_gen_op3_i64(INDEX_op_rotl, ret, arg1, tcg_constant_i64(arg2));
} else {
- TCGv_i64 t0 = tcg_temp_ebb_new_i64();
- TCGv_i64 t1 = tcg_temp_ebb_new_i64();
- tcg_gen_shli_i64(t0, arg1, arg2);
- tcg_gen_shri_i64(t1, arg1, 64 - arg2);
- tcg_gen_or_i64(ret, t0, t1);
- tcg_temp_free_i64(t0);
- tcg_temp_free_i64(t1);
+ tcg_gen_rotri_i64(ret, arg1, -arg2 & 63);
}
}
@@ -2086,7 +2073,16 @@ void tcg_gen_rotr_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2)
void tcg_gen_rotri_i64(TCGv_i64 ret, TCGv_i64 arg1, int64_t arg2)
{
tcg_debug_assert(arg2 >= 0 && arg2 < 64);
- tcg_gen_rotli_i64(ret, arg1, -arg2 & 63);
+ if (arg2 == 0) {
+ tcg_gen_mov_i64(ret, arg1);
+ } else if (tcg_op_supported(INDEX_op_rotr, TCG_TYPE_I64, 0)) {
+ tcg_gen_op3_i64(INDEX_op_rotr, ret, arg1, tcg_constant_i64(arg2));
+ } else if (tcg_op_supported(INDEX_op_rotl, TCG_TYPE_I64, 0)) {
+ tcg_gen_op3_i64(INDEX_op_rotl, ret, arg1, tcg_constant_i64(64 - arg2));
+ } else {
+ /* Do not recurse with the rotri simplification. */
+ tcg_gen_op4i_i64(INDEX_op_extract2, ret, arg1, arg1, arg2);
+ }
}
void tcg_gen_deposit_i64(TCGv_i64 ret, TCGv_i64 arg1, TCGv_i64 arg2,
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
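The identity behind patch 3 can be checked with a small standalone C model for 32-bit values. The helpers extract2_32, rotr32 and rotl32 are hypothetical names for this sketch, not QEMU functions. A rotate right by n is extract2 with the same value as both halves, and a rotate left by n is a rotate right by (32 - n) mod 32, which is what the patch writes as -arg2 & 31.

```c
#include <assert.h>
#include <stdint.h>

/* Model of 32-bit extract2: low 32 bits of (ah:al >> ofs), for 0 < ofs < 32. */
static uint32_t extract2_32(uint32_t al, uint32_t ah, unsigned ofs)
{
    assert(ofs > 0 && ofs < 32);
    return (al >> ofs) | (ah << (32 - ofs));
}

/* Rotate right: a zero count is a plain move, otherwise extract2(x, x, n). */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return n == 0 ? x : extract2_32(x, x, n);
}

/* Rotate left expressed through rotate right; (32 - n) & 31 maps 0 to 0. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n < 32);
    return rotr32(x, (32 - n) & 31);
}
```

In this model rotr32(0x12345678, 8) yields 0x78123456 and rotl32(0x12345678, 8) yields 0x34567812.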
* [PATCH v2 4/6] tcg: Add tcg_op_imm_match
2026-02-04 5:24 [PATCH v2 0/6] tcg: Improve extract and deposit code gen Richard Henderson
` (2 preceding siblings ...)
2026-02-04 5:24 ` [PATCH v2 3/6] tcg: Expand missing rotri with extract2 Richard Henderson
@ 2026-02-04 5:24 ` Richard Henderson
2026-02-25 15:06 ` Jim MacArthur
2026-02-04 5:24 ` [PATCH v2 5/6] tcg: target-dependent lowering of extract to shr/and Richard Henderson
2026-02-04 5:24 ` [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts Richard Henderson
5 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini
From: Paolo Bonzini <pbonzini@redhat.com>
Create a function to test whether the second operand of a
binary operation allows a given immediate.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[rth: Split out from a larger patch; keep the declaration internal.]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-internal.h | 5 +++++
tcg/tcg.c | 21 +++++++++++++++++----
2 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/tcg/tcg-internal.h b/tcg/tcg-internal.h
index 2cbfb5d5ca..c1ce50998e 100644
--- a/tcg/tcg-internal.h
+++ b/tcg/tcg-internal.h
@@ -94,4 +94,9 @@ TCGOp *tcg_op_insert_before(TCGContext *s, TCGOp *op,
TCGOp *tcg_op_insert_after(TCGContext *s, TCGOp *op,
TCGOpcode, TCGType, unsigned nargs);
+/*
+ * For a binary opcode OP, return true if the second input operand allows IMM.
+ */
+bool tcg_op_imm_match(TCGOpcode op, TCGType type, tcg_target_ulong imm);
+
#endif /* TCG_INTERNAL_H */
diff --git a/tcg/tcg.c b/tcg/tcg.c
index e7bf4dad4e..778268f5cd 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -3391,11 +3391,9 @@ static void process_constraint_sets(void)
}
}
-static const TCGArgConstraint *opcode_args_ct(const TCGOp *op)
+static const TCGArgConstraint *op_args_ct(TCGOpcode opc, TCGType type,
+ unsigned flags)
{
- TCGOpcode opc = op->opc;
- TCGType type = TCGOP_TYPE(op);
- unsigned flags = TCGOP_FLAGS(op);
const TCGOpDef *def = &tcg_op_defs[opc];
const TCGOutOp *outop = all_outop[opc];
TCGConstraintSetIndex con_set;
@@ -3422,6 +3420,21 @@ static const TCGArgConstraint *opcode_args_ct(const TCGOp *op)
return all_cts[con_set];
}
+static const TCGArgConstraint *opcode_args_ct(const TCGOp *op)
+{
+ return op_args_ct(op->opc, TCGOP_TYPE(op), TCGOP_FLAGS(op));
+}
+
+bool tcg_op_imm_match(TCGOpcode opc, TCGType type, tcg_target_ulong imm)
+{
+ const TCGArgConstraint *args_ct = op_args_ct(opc, type, 0);
+ const TCGOpDef *def = &tcg_op_defs[opc];
+
+ tcg_debug_assert(def->nb_oargs == 1);
+ tcg_debug_assert(def->nb_iargs == 2);
+ return tcg_target_const_match(imm, args_ct[2].ct, type, 0, 0);
+}
+
static void remove_label_use(TCGOp *op, int idx)
{
TCGLabel *label = arg_label(op->args[idx]);
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
* [PATCH v2 5/6] tcg: target-dependent lowering of extract to shr/and
2026-02-04 5:24 [PATCH v2 0/6] tcg: Improve extract and deposit code gen Richard Henderson
` (3 preceding siblings ...)
2026-02-04 5:24 ` [PATCH v2 4/6] tcg: Add tcg_op_imm_match Richard Henderson
@ 2026-02-04 5:24 ` Richard Henderson
2026-02-25 15:16 ` Jim MacArthur
2026-02-04 5:24 ` [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts Richard Henderson
5 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini
From: Paolo Bonzini <pbonzini@redhat.com>
Instead of assuming only small immediates are available for AND,
consult the backend in order to decide between SHL/SHR and SHR/AND.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[rth: Split from a larger patch]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/tcg-op.c | 36 ++++++++++++++++--------------------
1 file changed, 16 insertions(+), 20 deletions(-)
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 078adce610..263d208002 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -907,6 +907,8 @@ void tcg_gen_deposit_z_i32(TCGv_i32 ret, TCGv_i32 arg,
void tcg_gen_extract_i32(TCGv_i32 ret, TCGv_i32 arg,
unsigned int ofs, unsigned int len)
{
+ uint32_t mask;
+
tcg_debug_assert(ofs < 32);
tcg_debug_assert(len > 0);
tcg_debug_assert(len <= 32);
@@ -922,8 +924,10 @@ void tcg_gen_extract_i32(TCGv_i32 ret, TCGv_i32 arg,
tcg_gen_op4ii_i32(INDEX_op_extract, ret, arg, ofs, len);
return;
}
+
+ mask = (1u << len) - 1;
if (ofs == 0) {
- tcg_gen_andi_i32(ret, arg, (1u << len) - 1);
+ tcg_gen_andi_i32(ret, arg, mask);
return;
}
@@ -934,18 +938,12 @@ void tcg_gen_extract_i32(TCGv_i32 ret, TCGv_i32 arg,
return;
}
- /* ??? Ideally we'd know what values are available for immediate AND.
- Assume that 8 bits are available, plus the special case of 16,
- so that we get ext8u, ext16u. */
- switch (len) {
- case 1 ... 8: case 16:
+ if (tcg_op_imm_match(INDEX_op_and, TCG_TYPE_I32, mask)) {
tcg_gen_shri_i32(ret, arg, ofs);
- tcg_gen_andi_i32(ret, ret, (1u << len) - 1);
- break;
- default:
+ tcg_gen_andi_i32(ret, ret, mask);
+ } else {
tcg_gen_shli_i32(ret, arg, 32 - len - ofs);
tcg_gen_shri_i32(ret, ret, 32 - len);
- break;
}
}
@@ -2121,6 +2119,8 @@ void tcg_gen_deposit_z_i64(TCGv_i64 ret, TCGv_i64 arg,
void tcg_gen_extract_i64(TCGv_i64 ret, TCGv_i64 arg,
unsigned int ofs, unsigned int len)
{
+ uint64_t mask;
+
tcg_debug_assert(ofs < 64);
tcg_debug_assert(len > 0);
tcg_debug_assert(len <= 64);
@@ -2136,8 +2136,10 @@ void tcg_gen_extract_i64(TCGv_i64 ret, TCGv_i64 arg,
tcg_gen_op4ii_i64(INDEX_op_extract, ret, arg, ofs, len);
return;
}
+
+ mask = (1ull << len) - 1;
if (ofs == 0) {
- tcg_gen_andi_i64(ret, arg, (1ull << len) - 1);
+ tcg_gen_andi_i64(ret, arg, mask);
return;
}
@@ -2148,18 +2150,12 @@ void tcg_gen_extract_i64(TCGv_i64 ret, TCGv_i64 arg,
return;
}
- /* ??? Ideally we'd know what values are available for immediate AND.
- Assume that 8 bits are available, plus the special cases of 16 and 32,
- so that we get ext8u, ext16u, and ext32u. */
- switch (len) {
- case 1 ... 8: case 16: case 32:
+ if (tcg_op_imm_match(INDEX_op_and, TCG_TYPE_I64, mask)) {
tcg_gen_shri_i64(ret, arg, ofs);
- tcg_gen_andi_i64(ret, ret, (1ull << len) - 1);
- break;
- default:
+ tcg_gen_andi_i64(ret, ret, mask);
+ } else {
tcg_gen_shli_i64(ret, arg, 64 - len - ofs);
tcg_gen_shri_i64(ret, ret, 64 - len);
- break;
}
}
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
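The choice made in patch 5 can be sketched in plain C for 32-bit values. Everything below is an assumption for illustration: toy_and_imm_ok stands in for the backend query and models an imaginary host that accepts only 8-bit AND immediates, and extract_shr_and, extract_shl_shr and extract_lowered are hypothetical names. The real tcg_op_imm_match consults the backend constraints instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the backend query: 8-bit AND immediates only. */
static bool toy_and_imm_ok(uint32_t imm)
{
    return imm <= 0xff;
}

/* SHR + AND: shift the field down, then mask off everything above it. */
static uint32_t extract_shr_and(uint32_t arg, unsigned ofs, unsigned len)
{
    assert(len >= 1 && len < 32 && ofs + len <= 32);
    uint32_t mask = (1u << len) - 1;
    return (arg >> ofs) & mask;
}

/* SHL + SHR: push the field to the top, then shift it down to bit 0. */
static uint32_t extract_shl_shr(uint32_t arg, unsigned ofs, unsigned len)
{
    assert(len >= 1 && len <= 32 && ofs + len <= 32);
    uint32_t ret = arg << (32 - len - ofs);
    return ret >> (32 - len);
}

/* Pick the expansion depending on whether the mask fits as an immediate. */
static uint32_t extract_lowered(uint32_t arg, unsigned ofs, unsigned len)
{
    assert(len >= 1 && len < 32 && ofs + len <= 32);
    uint32_t mask = (1u << len) - 1;
    return toy_and_imm_ok(mask) ? extract_shr_and(arg, ofs, len)
                                : extract_shl_shr(arg, ofs, len);
}
```

Both expansions compute the same value; the predicate only decides which pair of instructions is likely to be cheaper on a given host.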
* [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-04 5:24 [PATCH v2 0/6] tcg: Improve extract and deposit code gen Richard Henderson
` (4 preceding siblings ...)
2026-02-04 5:24 ` [PATCH v2 5/6] tcg: target-dependent lowering of extract to shr/and Richard Henderson
@ 2026-02-04 5:24 ` Richard Henderson
2026-02-04 8:05 ` Paolo Bonzini
5 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 5:24 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini
Use tcg_op_imm_match to choose between expanding with AND+SHL vs SHL+SHR.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
tcg/optimize.c | 40 +++++++++++++++++++++++++++++++---------
1 file changed, 31 insertions(+), 9 deletions(-)
diff --git a/tcg/optimize.c b/tcg/optimize.c
index e6a16921c9..2944c5a748 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -1743,10 +1743,17 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
goto done;
}
- /* Lower invalid deposit into zero as AND + SHL or SHL + AND. */
+ /* Lower invalid deposit into zero. */
if (!valid) {
- if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len) &&
- !TCG_TARGET_extract_valid(ctx->type, 0, len)) {
+ if (TCG_TARGET_extract_valid(ctx->type, 0, len)) {
+ /* EXTRACT (at 0) + SHL */
+ op2 = opt_insert_before(ctx, op, INDEX_op_extract, 4);
+ op2->args[0] = ret;
+ op2->args[1] = arg2;
+ op2->args[2] = 0;
+ op2->args[3] = len;
+ } else if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len)) {
+ /* SHL + EXTRACT (at 0) */
op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
op2->args[0] = ret;
op2->args[1] = arg2;
@@ -1757,14 +1764,29 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
op->args[2] = 0;
op->args[3] = ofs + len;
goto done;
+ } else if (tcg_op_imm_match(INDEX_op_and, ctx->type, len_mask)) {
+ /* AND + SHL */
+ op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
+ op2->args[0] = ret;
+ op2->args[1] = arg2;
+ op2->args[2] = arg_new_constant(ctx, len_mask);
+ } else {
+ /* SHL + SHR */
+ int shl = width - len;
+ int shr = width - len - ofs;
+
+ op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
+ op2->args[0] = ret;
+ op2->args[1] = arg2;
+ op2->args[2] = arg_new_constant(ctx, shl);
+
+ op->opc = INDEX_op_shr;
+ op->args[1] = ret;
+ op->args[2] = arg_new_constant(ctx, shr);
+ goto done;
}
- op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
- op2->args[0] = ret;
- op2->args[1] = arg2;
- op2->args[2] = arg_new_constant(ctx, len_mask);
- fold_and(ctx, op2);
-
+ /* Finish the (EXTRACT|AND) + SHL cases. */
op->opc = INDEX_op_shl;
op->args[1] = ret;
op->args[2] = arg_new_constant(ctx, ofs);
--
2.43.0
^ permalink raw reply related [flat|nested] 20+ messages in thread
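For the deposit-into-zero case in patch 6, the AND + SHL and SHL + SHR expansions can be compared with a short C model on 64-bit values. The function names deposit_z_and_shl and deposit_z_shl_shr are hypothetical, and the shift counts follow the shl = width - len and shr = width - len - ofs computation from the patch with width assumed to be 64.

```c
#include <assert.h>
#include <stdint.h>

/* AND + SHL: mask the low LEN bits, then move them up to OFS. */
static uint64_t deposit_z_and_shl(uint64_t arg, unsigned ofs, unsigned len)
{
    assert(len >= 1 && len < 64 && ofs + len <= 64);
    uint64_t len_mask = (UINT64_C(1) << len) - 1;
    return (arg & len_mask) << ofs;
}

/* SHL + SHR: push the field to the top, then bring it down to OFS. */
static uint64_t deposit_z_shl_shr(uint64_t arg, unsigned ofs, unsigned len)
{
    assert(len >= 1 && len <= 64 && ofs + len <= 64);
    unsigned shl = 64 - len;        /* discards bits above the field   */
    unsigned shr = 64 - len - ofs;  /* leaves the field at bit OFS     */
    return (arg << shl) >> shr;
}
```

The shift pair needs no immediate mask at all, which is why it is the natural last resort when the host cannot encode len_mask as an AND immediate.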
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-04 5:24 ` [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts Richard Henderson
@ 2026-02-04 8:05 ` Paolo Bonzini
2026-02-04 9:06 ` Richard Henderson
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2026-02-04 8:05 UTC (permalink / raw)
To: Richard Henderson, qemu-devel
On 2/4/26 06:24, Richard Henderson wrote:
> Use tcg_op_imm_match to choose between expanding with AND+SHL vs SHL+SHR.
>
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> tcg/optimize.c | 40 +++++++++++++++++++++++++++++++---------
> 1 file changed, 31 insertions(+), 9 deletions(-)
>
> diff --git a/tcg/optimize.c b/tcg/optimize.c
> index e6a16921c9..2944c5a748 100644
> --- a/tcg/optimize.c
> +++ b/tcg/optimize.c
> @@ -1743,10 +1743,17 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
> goto done;
> }
>
> - /* Lower invalid deposit into zero as AND + SHL or SHL + AND. */
> + /* Lower invalid deposit into zero. */
> if (!valid) {
> - if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len) &&
> - !TCG_TARGET_extract_valid(ctx->type, 0, len)) {
> + if (TCG_TARGET_extract_valid(ctx->type, 0, len)) {
> + /* EXTRACT (at 0) + SHL */
> + op2 = opt_insert_before(ctx, op, INDEX_op_extract, 4);
> + op2->args[0] = ret;
> + op2->args[1] = arg2;
> + op2->args[2] = 0;
> + op2->args[3] = len;
> + } else if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len)) {
> + /* SHL + EXTRACT (at 0) */
> op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
> op2->args[0] = ret;
> op2->args[1] = arg2;
> @@ -1757,14 +1764,29 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
> op->args[2] = 0;
> op->args[3] = ofs + len;
> goto done;
> + } else if (tcg_op_imm_match(INDEX_op_and, ctx->type, len_mask)) {
> + /* AND + SHL */
Even if these extracts are valid, can they really be cheaper than an AND
with immediate argument, or back to back shifts? You still have a
dependency between the two instructions. I wouldn't bother with using
EXTRACT here.
Paolo
> + op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
> + op2->args[0] = ret;
> + op2->args[1] = arg2;
> + op2->args[2] = arg_new_constant(ctx, len_mask);
> + } else {
> + /* SHL + SHR */
> + int shl = width - len;
> + int shr = width - len - ofs;
> +
> + op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
> + op2->args[0] = ret;
> + op2->args[1] = arg2;
> + op2->args[2] = arg_new_constant(ctx, shl);
> +
> + op->opc = INDEX_op_shr;
> + op->args[1] = ret;
> + op->args[2] = arg_new_constant(ctx, shr);
> + goto done;
> }
>
> - op2 = opt_insert_before(ctx, op, INDEX_op_and, 3);
> - op2->args[0] = ret;
> - op2->args[1] = arg2;
> - op2->args[2] = arg_new_constant(ctx, len_mask);
> - fold_and(ctx, op2);
> -
> + /* Finish the (EXTRACT|AND) + SHL cases. */
> op->opc = INDEX_op_shl;
> op->args[1] = ret;
> op->args[2] = arg_new_constant(ctx, ofs);
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-04 8:05 ` Paolo Bonzini
@ 2026-02-04 9:06 ` Richard Henderson
2026-02-04 10:41 ` Paolo Bonzini
0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 9:06 UTC (permalink / raw)
To: Paolo Bonzini, qemu-devel
On 2/4/26 18:05, Paolo Bonzini wrote:
> On 2/4/26 06:24, Richard Henderson wrote:
>> Use tcg_op_imm_match to choose between expanding with AND+SHL vs SHL+SHR.
>>
>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>> tcg/optimize.c | 40 +++++++++++++++++++++++++++++++---------
>> 1 file changed, 31 insertions(+), 9 deletions(-)
>>
>> diff --git a/tcg/optimize.c b/tcg/optimize.c
>> index e6a16921c9..2944c5a748 100644
>> --- a/tcg/optimize.c
>> +++ b/tcg/optimize.c
>> @@ -1743,10 +1743,17 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
>> goto done;
>> }
>> - /* Lower invalid deposit into zero as AND + SHL or SHL + AND. */
>> + /* Lower invalid deposit into zero. */
>> if (!valid) {
>> - if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len) &&
>> - !TCG_TARGET_extract_valid(ctx->type, 0, len)) {
>> + if (TCG_TARGET_extract_valid(ctx->type, 0, len)) {
>> + /* EXTRACT (at 0) + SHL */
>> + op2 = opt_insert_before(ctx, op, INDEX_op_extract, 4);
>> + op2->args[0] = ret;
>> + op2->args[1] = arg2;
>> + op2->args[2] = 0;
>> + op2->args[3] = len;
>> + } else if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len)) {
>> + /* SHL + EXTRACT (at 0) */
>> op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
>> op2->args[0] = ret;
>> op2->args[1] = arg2;
>> @@ -1757,14 +1764,29 @@ static bool fold_deposit(OptContext *ctx, TCGOp *op)
>> op->args[2] = 0;
>> op->args[3] = ofs + len;
>> goto done;
>> + } else if (tcg_op_imm_match(INDEX_op_and, ctx->type, len_mask)) {
>> + /* AND + SHL */
>
> Even if these extracts are valid, can they really be cheaper than an AND with immediate
> argument, or back to back shifts?
This is primarily for x86.
(1) movz is 2 operand, so that may avoid clobbering an input,
(2) movz is 3-4 byte whereas and r/i32 is 6-7 byte.
Because of these, there's a comment somewhere that says we'll prefer extract over and
(perhaps in tcg_gen_andi_* or fold_and). IIRC this also happens to simplify ppc and s390x
insn selection (and vs rotate and mask). AFAIK, no other hosts are penalized.
r~
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-04 9:06 ` Richard Henderson
@ 2026-02-04 10:41 ` Paolo Bonzini
2026-02-04 20:45 ` Richard Henderson
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2026-02-04 10:41 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel
Il mer 4 feb 2026, 10:06 Richard Henderson <richard.henderson@linaro.org>
ha scritto:
> On 2/4/26 18:05, Paolo Bonzini wrote:
> > On 2/4/26 06:24, Richard Henderson wrote:
> >> Use tcg_op_imm_match to choose between expanding with AND+SHL vs
> SHL+SHR.
> >>
> >> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> >> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> >> ---
> >> tcg/optimize.c | 40 +++++++++++++++++++++++++++++++---------
> >> 1 file changed, 31 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/tcg/optimize.c b/tcg/optimize.c
> >> index e6a16921c9..2944c5a748 100644
> >> --- a/tcg/optimize.c
> >> +++ b/tcg/optimize.c
> >> @@ -1743,10 +1743,17 @@ static bool fold_deposit(OptContext *ctx, TCGOp
> *op)
> >> goto done;
> >> }
> >> - /* Lower invalid deposit into zero as AND + SHL or SHL + AND.
> */
> >> + /* Lower invalid deposit into zero. */
> >> if (!valid) {
> >> - if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len) &&
> >> - !TCG_TARGET_extract_valid(ctx->type, 0, len)) {
> >> + if (TCG_TARGET_extract_valid(ctx->type, 0, len)) {
> >> + /* EXTRACT (at 0) + SHL */
> >> + op2 = opt_insert_before(ctx, op, INDEX_op_extract, 4);
> >> + op2->args[0] = ret;
> >> + op2->args[1] = arg2;
> >> + op2->args[2] = 0;
> >> + op2->args[3] = len;
> >> + } else if (TCG_TARGET_extract_valid(ctx->type, 0, ofs +
> len)) {
> >> + /* SHL + EXTRACT (at 0) */
> >> op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
> >> op2->args[0] = ret;
> >> op2->args[1] = arg2;
> >> @@ -1757,14 +1764,29 @@ static bool fold_deposit(OptContext *ctx, TCGOp
> *op)
> >> op->args[2] = 0;
> >> op->args[3] = ofs + len;
> >> goto done;
> >> + } else if (tcg_op_imm_match(INDEX_op_and, ctx->type,
> len_mask)) {
> >> + /* AND + SHL */
> >
> > Even if these extracts are valid, can they really be cheaper than an AND
> with immediate
> > argument, or back to back shifts?
>
> This is primarily for x86.
>
> (1) movz is 2 operand, so that may avoid clobbering an input,
> (2) movz is 3-4 byte whereas and r/i32 is 6-7 byte.
>
> Because of these, there's a comment somewhere that says we'll prefer
> extract over and
> (perhaps in tcg_gen_andi_* or fold_and). IIRC this also happens to
> simplify ppc and s390x
> insn selection (and vs rotate and mask). AFAIK, no other hosts are
> penalized.
>
I think it would be better to pick a canonical form for AND with 2^n-1 and
handle conversion to extract (like PPC rotates or movz) in the backend.
Picking AND as the canonical form also makes the macros for extract
validity simpler too; adding an extra constraint for immediate 2^n-1 is
easier and it generalizes to other PPC rotate and mask cases.
Paolo
>
>
>
> r~
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-04 10:41 ` Paolo Bonzini
@ 2026-02-04 20:45 ` Richard Henderson
2026-02-05 8:22 ` Paolo Bonzini
0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-04 20:45 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel
On 2/4/26 20:41, Paolo Bonzini wrote:
> This is primarily for x86.
>
> (1) movz is 2 operand, so that may avoid clobbering an input,
> (2) movz is 3-4 byte whereas and r/i32 is 6-7 byte.
>
> Because of these, there's a comment somewhere that says we'll prefer extract over and
> (perhaps in tcg_gen_andi_* or fold_and). IIRC this also happens to simplify ppc and
> s390x
> insn selection (and vs rotate and mask). AFAIK, no other hosts are penalized.
>
>
> I think it would be better to pick a canonical form for AND with 2^n-1 and handle
> conversion to extract (like PPC rotates or movz) in the backend.
>
> Picking AND as the canonical form also makes the macros for extract validity
> simpler too; adding an extra constraint for immediate 2^n-1 is easier and it generalizes
> to other PPC rotate and mask cases.
Picking AND means we have to use "r,0,ri" for x86, losing register allocation flexibility.
r~
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-04 20:45 ` Richard Henderson
@ 2026-02-05 8:22 ` Paolo Bonzini
2026-02-05 22:29 ` Richard Henderson
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2026-02-05 8:22 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel
Il mer 4 feb 2026, 21:46 Richard Henderson <richard.henderson@linaro.org>
ha scritto:
> On 2/4/26 20:41, Paolo Bonzini wrote:
> > This is primarily for x86.
> >
> > (1) movz is 2 operand, so that may avoid clobbering an input,
> > (2) movz is 3-4 byte whereas and r/i32 is 6-7 byte.
> >
> > Because of these, there's a comment somewhere that says we'll prefer
> extract over and
> > (perhaps in tcg_gen_andi_* or fold_and). IIRC this also happens to
> simplify ppc and
> > s390x
> > insn selection (and vs rotate and mask). AFAIK, no other hosts are
> penalized.
> >
> >
> > I think it would be better to pick a canonical form for AND with 2^n-1
> and handle
> > conversion to extract (like PPC rotates or movz) in the backend.
> >
> > Picking AND as the canonical form also makes the macros for
> extract validity
> > simpler too; adding an extra constraint for immediate 2^n-1 is easier
> and it generalizes
> > to other PPC rotate and mask cases.
>
> Picking AND means we have to use "r,0,ri" for x86, losing register
> allocation flexibility.
>
Then could you wrap the target specific extract_valid with one that allows
ofs == 0 if AND allows the immediate 2^len-1? That would also simplify this
series.
Paolo
>
> r~
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
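[Editorial note] One possible reading of the wrapper suggested in the message above, sketched as standalone C. Everything here is an assumption made for illustration: toy_extract_valid and toy_and_imm_ok are hypothetical stand-ins for the backend queries (TCG_TARGET_extract_valid and tcg_op_imm_match in the series), with made-up rules, and wrapped_extract_valid is not a function from the patches. The follow-up messages show that the suggestion was not adopted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy backend: extract exists only as a zero-extend
 * of a byte, halfword or word at offset 0. */
static bool toy_extract_valid(unsigned ofs, unsigned len)
{
    return ofs == 0 && (len == 8 || len == 16 || len == 32);
}

/* Hypothetical toy backend: AND accepts only small immediates. */
static bool toy_and_imm_ok(uint64_t imm)
{
    return imm <= 2047;
}

/* Treat "extract at offset 0" as valid whenever the backend can
 * instead encode AND with the immediate 2^len - 1. */
static bool wrapped_extract_valid(unsigned ofs, unsigned len)
{
    assert(len >= 1 && ofs + len <= 64);
    if (toy_extract_valid(ofs, len)) {
        return true;
    }
    if (ofs == 0) {
        uint64_t mask = len >= 64 ? UINT64_MAX
                                  : (UINT64_C(1) << len) - 1;
        return toy_and_imm_ok(mask);
    }
    return false;
}
```

With these toy rules, a 5-bit field at offset 0 counts as extractable because the mask 31 is a legal AND immediate, while a 20-bit field at offset 0 does not.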
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-05 8:22 ` Paolo Bonzini
@ 2026-02-05 22:29 ` Richard Henderson
2026-02-05 23:22 ` Paolo Bonzini
0 siblings, 1 reply; 20+ messages in thread
From: Richard Henderson @ 2026-02-05 22:29 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel
On 2/5/26 18:22, Paolo Bonzini wrote:
>
>
> Il mer 4 feb 2026, 21:46 Richard Henderson <richard.henderson@linaro.org
> <mailto:richard.henderson@linaro.org>> ha scritto:
>
> On 2/4/26 20:41, Paolo Bonzini wrote:
> > This is primarily for x86.
> >
> > (1) movz is 2 operand, so that may avoid clobbering an input,
> > (2) movz is 3-4 byte whereas and r/i32 is 6-7 byte.
> >
> > Because of these, there's a comment somewhere that says we'll prefer extract
> over and
> > (perhaps in tcg_gen_andi_* or fold_and). IIRC this also happens to simplify
> ppc and
> > s390x
> > insn selection (and vs rotate and mask). AFAIK, no other hosts are penalized.
> >
> >
> > I think it would be better to pick a canonical form for AND with 2^n-1 and handle
> > conversion to extract (like PPC rotates or movz) in the backend.
> >
> > Picking AND as the canonical form also makes the macros for extract validity
> > simpler too; adding an extra constraint for immediate 2^n-1 is easier and it
> generalizes
> > to other PPC rotate and mask cases.
>
> Picking AND means we have to use "r,0,ri" for x86, losing register allocation flexibility.
>
>
> Then could you wrap the target specific extract_valid with one that allows ofs == 0 if AND
> allows the immediate 2^len-1? That would also simplify this series.
I don't understand your suggestion here.
r~
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-05 22:29 ` Richard Henderson
@ 2026-02-05 23:22 ` Paolo Bonzini
2026-02-06 1:09 ` Richard Henderson
0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2026-02-05 23:22 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-devel
Il gio 5 feb 2026, 23:29 Richard Henderson <richard.henderson@linaro.org>
ha scritto:
> > > I think it would be better to pick a canonical form for AND with
> 2^n-1 and handle
> > > conversion to extract (like PPC rotates or movz) in the backend.
> > >
> > > Picking AND as the canonical form also makes the macros
> for extract validity
> > > simpler too; adding an extra constraint for immediate 2^n-1 is
> easier and it
> > generalizes
> > > to other PPC rotate and mask cases.
> >
> > Picking AND means we have to use "r,0,ri" for x86, losing register
> allocation flexibility.
> >
> >
> > Then could you wrap the target specific extract_valid with one that
> allows ofs == 0 if AND
> > allows the immediate 2^len-1? That would also simplify this series.
>
> I don't understand your suggestion here.
>
I am not sure about it either... I am just not sure why extract is
guaranteed to be cheaper or have better constraints than AND.
It does happen to be true for x86, though only for len == 8 or 16; but is
it true of all targets that have a more expansive extract instruction?
Paolo
>
> r~
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 6/6] tcg/optimize: possibly expand deposit into zero with shifts
2026-02-05 23:22 ` Paolo Bonzini
@ 2026-02-06 1:09 ` Richard Henderson
0 siblings, 0 replies; 20+ messages in thread
From: Richard Henderson @ 2026-02-06 1:09 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel
On 2/6/26 09:22, Paolo Bonzini wrote:
>
>
> Il gio 5 feb 2026, 23:29 Richard Henderson <richard.henderson@linaro.org
> <mailto:richard.henderson@linaro.org>> ha scritto:
>
> > > I think it would be better to pick a canonical form for AND with 2^n-1 and
> handle
> > > conversion to extract (like PPC rotates or movz) in the backend.
> > >
> > > Picking AND as the canonical form also makes the macros for extract
> validity
> > > simpler too; adding an extra constraint for immediate 2^n-1 is easier and it
> > generalizes
> > > to other PPC rotate and mask cases.
> >
> > Picking AND means we have to use "r,0,ri" for x86, losing register allocation
> flexibility.
> >
> >
> > Then could you wrap the target specific extract_valid with one that allows ofs == 0
> if AND
> > allows the immediate 2^len-1? That would also simplify this series.
>
> I don't understand your suggestion here.
>
>
> I am not sure about it either... I am just not sure why extract is guaranteed to be
> cheaper or have better constraints than AND.
>
> It does happen to be true for x86, though only for len == 8 or 16; but is it true of all
> targets that have a more expansive extract instruction?
x86 includes len == 32 via 'movl', fwiw.
Similarly, riscv64 has quite a number of filter conditions for extract, mostly because of
a 12-bit signed argument for AND, and a collection of other zero-extend insns.
AArch64, loongarch64, and ppc64 all emit ANDI if possible during tgen_extract.
So it really is all about using extract if valid, and allowing the backend to use the more
favorable set of constraints.
r~
^ permalink raw reply [flat|nested] 20+ messages in thread
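[Editorial note] The remark above about riscv64 and its 12-bit signed AND immediate can be made concrete with a small check of which field lengths have a mask 2^len - 1 that fits such an immediate. This is a standalone sketch, not the riscv backend's actual filter; fits_simm12 and and_mask_fits_simm12 are hypothetical names, and the only assumption is a sign-extended 12-bit immediate covering [-2048, 2047].

```c
#include <assert.h>
#include <stdint.h>

/* Does the 64-bit pattern V equal some sign-extended 12-bit value?
 * Non-negative values up to 2047 fit, as do the patterns of
 * -2048..-1, which have the upper 53 bits all set. */
static int fits_simm12(uint64_t v)
{
    return v <= 2047 || v >= UINT64_C(0xFFFFFFFFFFFFF800);
}

/* Can AND-with-immediate implement a LEN-bit extract at offset 0? */
static int and_mask_fits_simm12(unsigned len)
{
    assert(len >= 1 && len <= 64);
    uint64_t mask = len >= 64 ? UINT64_MAX : (UINT64_C(1) << len) - 1;
    return fits_simm12(mask);
}
```

Under that assumption only lengths up to 11 qualify (plus the degenerate all-ones mask for len == 64), so wider fields need either a dedicated zero-extend instruction or a pair of shifts.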
* Re: [PATCH v2 1/6] tcg/optimize: Lower unsupported deposit during optimize
2026-02-04 5:24 ` [PATCH v2 1/6] tcg/optimize: Lower unsupported deposit during optimize Richard Henderson
@ 2026-02-25 13:34 ` Jim MacArthur
0 siblings, 0 replies; 20+ messages in thread
From: Jim MacArthur @ 2026-02-25 13:34 UTC (permalink / raw)
To: qemu-devel; +Cc: richard.henderson
On Wed, Feb 04, 2026 at 03:24:51PM +1000, Richard Henderson wrote:
> The expansions that we chose in tcg-op.c may be less than optimal.
> Delay lowering until optimize, so that we have propagated constants
> and have computed known zero/one masks.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> + /* Lower invalid deposit into zero as AND + SHL or SHL + AND. */
> + if (!valid) {
> + if (TCG_TARGET_extract_valid(ctx->type, 0, ofs + len) &&
> + !TCG_TARGET_extract_valid(ctx->type, 0, len)) {
> + op2 = opt_insert_before(ctx, op, INDEX_op_shl, 3);
> + op2->args[0] = ret;
> + op2->args[1] = arg2;
> + op2->args[2] = arg_new_constant(ctx, ofs);
> +
> + op->opc = INDEX_op_extract;
> + op->args[1] = ret;
> + op->args[2] = 0;
> + op->args[3] = ofs + len;
> + goto done;
> + }
I also had questions about extract vs shift/and here. You've explained this on patch 6, but a comment here about it might help future developers.
Nonetheless,
Reviewed-by: Jim MacArthur <jim.macarthur@linaro.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 2/6] tcg/optimize: Lower unsupported extract2 during optimize
2026-02-04 5:24 ` [PATCH v2 2/6] tcg/optimize: Lower unsupported extract2 " Richard Henderson
@ 2026-02-25 14:47 ` Jim MacArthur
0 siblings, 0 replies; 20+ messages in thread
From: Jim MacArthur @ 2026-02-25 14:47 UTC (permalink / raw)
To: qemu-devel; +Cc: Richard Henderson
On Wed, Feb 04, 2026 at 03:24:52PM +1000, Richard Henderson wrote:
> The expansions that we chose in tcg-op.c may be less than optimal.
> Delay lowering until optimize, so that we have propagated constants
> and have computed known zero/one masks.
>
> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Jim MacArthur <jim.macarthur@linaro.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 3/6] tcg: Expand missing rotri with extract2
2026-02-04 5:24 ` [PATCH v2 3/6] tcg: Expand missing rotri with extract2 Richard Henderson
@ 2026-02-25 14:54 ` Jim MacArthur
0 siblings, 0 replies; 20+ messages in thread
From: Jim MacArthur @ 2026-02-25 14:54 UTC (permalink / raw)
To: qemu-devel; +Cc: Richard Henderson
On Wed, Feb 04, 2026 at 03:24:53PM +1000, Richard Henderson wrote:
> Use extract2 to implement rotri. To make this easier,
> redefine rotli in terms of rotri, rather than the reverse.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
> tcg/tcg-op.c | 52 ++++++++++++++++++++++++----------------------------
> 1 file changed, 24 insertions(+), 28 deletions(-)
>
Reviewed-by: Jim MacArthur <jim.macarthur@linaro.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
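[Editorial note] The idea in the commit message above, rotate right expressed through extract2 and rotate left defined in terms of rotate right, can be modelled in a few lines of standalone C. This sketch assumes the usual extract2 meaning of taking a word-sized window out of the concatenation hi:lo starting at bit pos; the model_* names are hypothetical and are not the tcg-op.c functions.

```c
#include <assert.h>
#include <stdint.h>

/* Model of extract2: 64 bits taken from the 128-bit value HI:LO,
 * starting at bit POS of LO. */
static uint64_t model_extract2(uint64_t lo, uint64_t hi, unsigned pos)
{
    assert(pos <= 64);
    if (pos == 0) {
        return lo;
    }
    if (pos == 64) {
        return hi;
    }
    return (lo >> pos) | (hi << (64 - pos));
}

/* Rotate right: feed the same value in as both halves. */
static uint64_t model_rotri(uint64_t x, unsigned c)
{
    assert(c < 64);
    return model_extract2(x, x, c);
}

/* Rotate left by C is rotate right by (64 - C) modulo 64. */
static uint64_t model_rotli(uint64_t x, unsigned c)
{
    assert(c < 64);
    return model_rotri(x, (64 - c) & 63);
}
```

The explicit pos == 0 and pos == 64 cases avoid shifting a 64-bit value by 64, which is undefined in C.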
* Re: [PATCH v2 4/6] tcg: Add tcg_op_imm_match
2026-02-04 5:24 ` [PATCH v2 4/6] tcg: Add tcg_op_imm_match Richard Henderson
@ 2026-02-25 15:06 ` Jim MacArthur
0 siblings, 0 replies; 20+ messages in thread
From: Jim MacArthur @ 2026-02-25 15:06 UTC (permalink / raw)
To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini
On Wed, Feb 04, 2026 at 03:24:54PM +1000, Richard Henderson wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> Create a function to test whether the second operand of a
> binary operation allows a given immediate.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> [rth: Split out from a larger patch; keep the declaration internal.]
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Jim MacArthur <jim.macarthur@linaro.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v2 5/6] tcg: target-dependent lowering of extract to shr/and
2026-02-04 5:24 ` [PATCH v2 5/6] tcg: target-dependent lowering of extract to shr/and Richard Henderson
@ 2026-02-25 15:16 ` Jim MacArthur
0 siblings, 0 replies; 20+ messages in thread
From: Jim MacArthur @ 2026-02-25 15:16 UTC (permalink / raw)
To: qemu-devel; +Cc: Richard Henderson, Paolo Bonzini
On Wed, Feb 04, 2026 at 03:24:55PM +1000, Richard Henderson wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
>
> Instead of assuming only small immediates are available for AND,
> consult the backend in order to decide between SHL/SHR and SHR/AND.
>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> [rth: Split from a larger patch]
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Jim MacArthur <jim.macarthur@linaro.org>
^ permalink raw reply [flat|nested] 20+ messages in thread
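[Editorial note] The two candidate expansions of extract named in the commit message above, SHL/SHR and SHR/AND, are sketched below for a 64-bit width. This is a standalone model: the and_imm_ok parameter is a hypothetical stand-in for the backend query that the patch performs, and the function names do not come from the series.

```c
#include <assert.h>
#include <stdint.h>

/* SHR + AND: shift the field down, then mask it. */
static uint64_t extract_shr_and(uint64_t x, unsigned ofs, unsigned len)
{
    uint64_t mask = len >= 64 ? UINT64_MAX : (UINT64_C(1) << len) - 1;
    return (x >> ofs) & mask;
}

/* SHL + SHR: push the field to the top, then shift it back down,
 * letting the logical right shift supply the zero extension. */
static uint64_t extract_shl_shr(uint64_t x, unsigned ofs, unsigned len)
{
    return (x << (64 - ofs - len)) >> (64 - len);
}

/* Pick an expansion based on whether the mask is a legal AND
 * immediate; and_imm_ok models the answer from the backend. */
static uint64_t lower_extract(uint64_t x, unsigned ofs, unsigned len,
                              int and_imm_ok)
{
    assert(len >= 1 && ofs + len <= 64);
    return and_imm_ok ? extract_shr_and(x, ofs, len)
                      : extract_shl_shr(x, ofs, len);
}
```

Both paths return the same value, so the query only affects which instruction sequence the host ends up emitting.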