* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-14 20:29 [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions Pranith Kumar
@ 2016-07-19 3:53 ` Richard Henderson
2016-07-19 18:28 ` Pranith Kumar
2016-07-19 14:36 ` Alex Bennée
3 siblings, 1 reply; 9+ messages in thread
From: Richard Henderson @ 2016-07-19 3:53 UTC (permalink / raw)
To: Pranith Kumar, open list:All patches CC here; +Cc: serge.fdrv, alex.bennee
On 07/15/2016 01:59 AM, Pranith Kumar wrote:
> This patch applies on top of the fence generation patch series.
>
> This commit optimizes fence instructions. Two optimizations are
> currently implemented. These are:
>
> 1. Unnecessary duplicate fence instructions
>
> If the same fence instruction is detected consecutively, we remove
> one instance of it.
>
> ex: mb; mb => mb, strl; strl => strl
>
> 2. Merging weaker fence with subsequent/previous stronger fence
>
> load-acquire/store-release fence can be combined with a full fence
> without relaxing the ordering constraint.
>
> ex: a) ld; ldaq; mb => ld; mb
> b) mb; strl; st => mb; st
>
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg.h | 1 +
> 2 files changed, 60 insertions(+)
>
> diff --git a/tcg/optimize.c b/tcg/optimize.c
> index c0d975b..a655829 100644
> --- a/tcg/optimize.c
> +++ b/tcg/optimize.c
> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
> return false;
> }
>
> +/* Eliminate duplicate and unnecessary fence instructions */
> +void tcg_optimize_mb(TCGContext *s)
> +{
> + int oi, oi_next;
> + TCGArg prev_op_mb = -1;
> + TCGOp *prev_op;
> +
> + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
> + TCGOp *op = &s->gen_op_buf[oi];
> + TCGArg *args = &s->gen_opparam_buf[op->args];
> + TCGOpcode opc = op->opc;
> +
> + switch (opc) {
> + case INDEX_op_mb:
> + {
> + TCGBar curr_mb_type = args[0] & 0xF0;
> + TCGBar prev_mb_type = prev_op_mb & 0xF0;
No check to make sure that prev_op_mb != -1?
> +
> + if (curr_mb_type == prev_mb_type ||
> + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
> + /* Remove the current weaker barrier op. The previous
> + * barrier is stronger and sufficient.
> + * mb; strl => mb; st
> + */
> + tcg_op_remove(s, op);
> + } else if (curr_mb_type == TCG_BAR_SC &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Remove the previous weaker barrier op. The current
> + * barrier is stronger and sufficient.
> + * ldaq; mb => ld; mb
> + */
> + tcg_op_remove(s, prev_op);
> + } else if (curr_mb_type == TCG_BAR_STRL &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Consecutive load-acquire and store-release barriers
> + * can be merged into one stronger SC barrier
> + * ldaq; strl => ld; mb; st
> + */
> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> + tcg_op_remove(s, prev_op);
Don't you need to merge the bottom 4 bits as well?
> + default:
> + prev_op_mb = -1;
> + prev_op = NULL;
> + break;
A reasonable start, I suppose, but there's nothing stopping you from
considering barriers that aren't exactly adjacent.
You only need to reset prev_op_mb when you see qemu_ld, qemu_st, call, or
tcg_op_defs[opc].flags & TCG_OPF_BB_END.
r~
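Richard's point that the tracked barriers need not be exactly adjacent can be sketched with a toy pass (stand-in opcodes, not the QEMU API): the previously seen barrier is forgotten only at loads, stores, calls, and block ends, so two fences separated by plain arithmetic ops can still be merged.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_MB, OP_MOV, OP_QEMU_LD, OP_QEMU_ST, OP_CALL, OP_BR };

/* Return nonzero if op forces us to forget the previously seen barrier:
 * guest loads, stores, and calls must stay ordered relative to fences,
 * and a branch ends the basic block. */
static int resets_barrier_tracking(int op)
{
    return op == OP_QEMU_LD || op == OP_QEMU_ST ||
           op == OP_CALL    || op == OP_BR;
}

/* Compact the op array in place, dropping a barrier when another barrier
 * is still "live" (tcg_op_remove() plays this role in the real pass).
 * Return the new length. */
static size_t elide_fences(int *ops, size_t n)
{
    size_t out = 0;
    int have_prev_mb = 0;

    for (size_t i = 0; i < n; i++) {
        if (ops[i] == OP_MB) {
            if (have_prev_mb) {
                continue;               /* mb; ...; mb => mb; ... */
            }
            have_prev_mb = 1;
        } else if (resets_barrier_tracking(ops[i])) {
            have_prev_mb = 0;           /* ordering now matters again */
        }
        ops[out++] = ops[i];
    }
    return out;
}
```

In this model "mb; mov; mb" collapses to "mb; mov", while "mb; qemu_st; mb" keeps both fences.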
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-19 3:53 ` Richard Henderson
@ 2016-07-19 18:28 ` Pranith Kumar
0 siblings, 0 replies; 9+ messages in thread
From: Pranith Kumar @ 2016-07-19 18:28 UTC (permalink / raw)
To: Richard Henderson; +Cc: open list:All patches CC here, serge.fdrv, alex.bennee
Richard Henderson writes:
> On 07/15/2016 01:59 AM, Pranith Kumar wrote:
>> This patch applies on top of the fence generation patch series.
>>
>> This commit optimizes fence instructions. Two optimizations are
>> currently implemented. These are:
>>
>> 1. Unnecessary duplicate fence instructions
>>
>> If the same fence instruction is detected consecutively, we remove
>> one instance of it.
>>
>> ex: mb; mb => mb, strl; strl => strl
>>
>> 2. Merging weaker fence with subsequent/previous stronger fence
>>
>> load-acquire/store-release fence can be combined with a full fence
>> without relaxing the ordering constraint.
>>
>> ex: a) ld; ldaq; mb => ld; mb
>> b) mb; strl; st => mb; st
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>> ---
>> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> tcg/tcg.h | 1 +
>> 2 files changed, 60 insertions(+)
>>
>> diff --git a/tcg/optimize.c b/tcg/optimize.c
>> index c0d975b..a655829 100644
>> --- a/tcg/optimize.c
>> +++ b/tcg/optimize.c
>> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
>> return false;
>> }
>>
>> +/* Eliminate duplicate and unnecessary fence instructions */
>> +void tcg_optimize_mb(TCGContext *s)
>> +{
>> + int oi, oi_next;
>> + TCGArg prev_op_mb = -1;
>> + TCGOp *prev_op;
>> +
>> + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
>> + TCGOp *op = &s->gen_op_buf[oi];
>> + TCGArg *args = &s->gen_opparam_buf[op->args];
>> + TCGOpcode opc = op->opc;
>> +
>> + switch (opc) {
>> + case INDEX_op_mb:
>> + {
>> + TCGBar curr_mb_type = args[0] & 0xF0;
>> + TCGBar prev_mb_type = prev_op_mb & 0xF0;
>
> No check to make sure that prev_op_mb != -1?
>
I don't think that check is necessary: if prev_op_mb is -1, then prev_mb_type
will be 0xF0, which does not match any of the cases we compare against below.
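The sentinel arithmetic can be checked directly. TCGArg is an unsigned, pointer-sized integer in QEMU, so assigning -1 sets all bits and masking with 0xF0 yields 0xF0; that value matches none of the barrier kinds, so the conclusion holds. The TCG_BAR_* values below are the ones from tcg.h of this series and should be treated as assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Barrier-kind encodings, assumed from tcg.h of this era. */
enum {
    TCG_BAR_LDAQ = 1 << 4,   /* 0x10 */
    TCG_BAR_STRL = 1 << 5,   /* 0x20 */
    TCG_BAR_SC   = 3 << 4,   /* 0x30 */
};

/* Extract the barrier kind (the high nibble of the mb argument). */
static unsigned bar_kind(uintptr_t mb_arg)
{
    return (unsigned)(mb_arg & 0xF0);
}
```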
>> +
>> + if (curr_mb_type == prev_mb_type ||
>> + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
>> + /* Remove the current weaker barrier op. The previous
>> + * barrier is stronger and sufficient.
>> + * mb; strl => mb; st
>> + */
>> + tcg_op_remove(s, op);
>> + } else if (curr_mb_type == TCG_BAR_SC &&
>> + prev_mb_type == TCG_BAR_LDAQ) {
>> + /* Remove the previous weaker barrier op. The current
>> + * barrier is stronger and sufficient.
>> + * ldaq; mb => ld; mb
>> + */
>> + tcg_op_remove(s, prev_op);
>> + } else if (curr_mb_type == TCG_BAR_STRL &&
>> + prev_mb_type == TCG_BAR_LDAQ) {
>> + /* Consecutive load-acquire and store-release barriers
>> + * can be merged into one stronger SC barrier
>> + * ldaq; strl => ld; mb; st
>> + */
>> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> + tcg_op_remove(s, prev_op);
>
> Don't you need to merge the bottom 4 bits as well?
>
That is an interesting point. Ideally, the TCG_MO_* flags of both barriers
should be the same. At least for now, the flag is TCG_MO_ALL for both LDAQ and
STRL barriers, since we do not generate any other variant.
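The merge Richard asks about can be sketched as follows, assuming the encoding the patch uses (barrier kind in the high nibble, TCG_MO_* ordering flags in the low nibble): OR the flag nibbles of the two barriers together instead of keeping only the current op's flags.

```c
#include <assert.h>

#define MO_MASK    0x0F   /* TCG_MO_* ordering-constraint flags */
#define TCG_BAR_SC 0x30   /* assumed value, as in tcg.h of this series */

/* Combine a load-acquire barrier with a following store-release barrier
 * into one SC barrier, keeping the union of their ordering constraints. */
static unsigned merge_barriers(unsigned prev_arg, unsigned curr_arg)
{
    return ((prev_arg | curr_arg) & MO_MASK) | TCG_BAR_SC;
}
```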
>> + default:
>> + prev_op_mb = -1;
>> + prev_op = NULL;
>> + break;
>
> A reasonable start, I suppose, but there's nothing stopping you from
> considering barriers that aren't exactly adjacent.
>
> You only need to reset prev_op_mb when you see qemu_ld, qemu_st, call, or
> tcg_op_defs[opc].flags & TCG_OPF_BB_END.
>
Extending it that way sounds like a good idea. I will work on it.
Thanks!
--
Pranith
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-14 20:29 [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions Pranith Kumar
2016-07-19 3:53 ` Richard Henderson
@ 2016-07-19 14:36 ` Alex Bennée
2016-07-19 18:29 ` Pranith Kumar
2016-07-19 15:56 ` Alex Bennée
2016-07-19 17:16 ` Paolo Bonzini
3 siblings, 1 reply; 9+ messages in thread
From: Alex Bennée @ 2016-07-19 14:36 UTC (permalink / raw)
To: Pranith Kumar
Cc: Richard Henderson, open list:All patches CC here, serge.fdrv
Pranith Kumar <bobby.prani@gmail.com> writes:
> This patch applies on top of the fence generation patch series.
>
> This commit optimizes fence instructions. Two optimizations are
> currently implemented. These are:
>
> 1. Unnecessary duplicate fence instructions
>
> If the same fence instruction is detected consecutively, we remove
> one instance of it.
>
> ex: mb; mb => mb, strl; strl => strl
>
> 2. Merging weaker fence with subsequent/previous stronger fence
>
> load-acquire/store-release fence can be combined with a full fence
> without relaxing the ordering constraint.
>
> ex: a) ld; ldaq; mb => ld; mb
> b) mb; strl; st => mb; st
>
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg.h | 1 +
> 2 files changed, 60 insertions(+)
>
> diff --git a/tcg/optimize.c b/tcg/optimize.c
> index c0d975b..a655829 100644
> --- a/tcg/optimize.c
> +++ b/tcg/optimize.c
> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
> return false;
> }
>
> +/* Eliminate duplicate and unnecessary fence instructions */
> +void tcg_optimize_mb(TCGContext *s)
> +{
> + int oi, oi_next;
> + TCGArg prev_op_mb = -1;
> + TCGOp *prev_op;
The compiler throws up warnings about prev_op not being set:
/home/alex/lsrc/qemu/qemu.git/tcg/optimize.c: In function ‘tcg_optimize_mb’:
/home/alex/lsrc/qemu/qemu.git/tcg/optimize.c:611:17: error: ‘prev_op’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
tcg_op_remove(s, prev_op);
^
> +
> + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
> + TCGOp *op = &s->gen_op_buf[oi];
> + TCGArg *args = &s->gen_opparam_buf[op->args];
> + TCGOpcode opc = op->opc;
> +
> + switch (opc) {
> + case INDEX_op_mb:
> + {
> + TCGBar curr_mb_type = args[0] & 0xF0;
> + TCGBar prev_mb_type = prev_op_mb & 0xF0;
> +
> + if (curr_mb_type == prev_mb_type ||
> + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
> + /* Remove the current weaker barrier op. The previous
> + * barrier is stronger and sufficient.
> + * mb; strl => mb; st
> + */
> + tcg_op_remove(s, op);
> + } else if (curr_mb_type == TCG_BAR_SC &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Remove the previous weaker barrier op. The current
> + * barrier is stronger and sufficient.
> + * ldaq; mb => ld; mb
> + */
> + tcg_op_remove(s, prev_op);
> + } else if (curr_mb_type == TCG_BAR_STRL &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Consecutive load-acquire and store-release barriers
> + * can be merged into one stronger SC barrier
> + * ldaq; strl => ld; mb; st
> + */
> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> + tcg_op_remove(s, prev_op);
> + }
> + prev_op_mb = args[0];
> + prev_op = op;
> + break;
> + }
> + case INDEX_op_insn_start:
> + break;
> + default:
> + prev_op_mb = -1;
> + prev_op = NULL;
> + break;
> + }
> +
> + oi_next = op->next;
> + }
> +}
> +
> /* Propagate constants and copies, fold constant expressions. */
> void tcg_optimize(TCGContext *s)
> {
> @@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s)
> break;
> }
> }
> +
> + tcg_optimize_mb(s);
> }
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 60ebc2b..2d7a46d 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func,
> TCGArg ret, int nargs, TCGArg *args);
>
> void tcg_op_remove(TCGContext *s, TCGOp *op);
> +void tcg_optimize_mb(TCGContext *s);
> void tcg_optimize(TCGContext *s);
>
> /* only used for debugging purposes */
--
Alex Bennée
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-19 14:36 ` Alex Bennée
@ 2016-07-19 18:29 ` Pranith Kumar
0 siblings, 0 replies; 9+ messages in thread
From: Pranith Kumar @ 2016-07-19 18:29 UTC (permalink / raw)
To: Alex Bennée
Cc: Richard Henderson, open list:All patches CC here, serge.fdrv
Alex Bennée writes:
> Pranith Kumar <bobby.prani@gmail.com> writes:
>
>> This patch applies on top of the fence generation patch series.
>>
>> This commit optimizes fence instructions. Two optimizations are
>> currently implemented. These are:
>>
>> 1. Unnecessary duplicate fence instructions
>>
>> If the same fence instruction is detected consecutively, we remove
>> one instance of it.
>>
>> ex: mb; mb => mb, strl; strl => strl
>>
>> 2. Merging weaker fence with subsequent/previous stronger fence
>>
>> load-acquire/store-release fence can be combined with a full fence
>> without relaxing the ordering constraint.
>>
>> ex: a) ld; ldaq; mb => ld; mb
>> b) mb; strl; st => mb; st
>>
>> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
>> ---
>> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> tcg/tcg.h | 1 +
>> 2 files changed, 60 insertions(+)
>>
>> diff --git a/tcg/optimize.c b/tcg/optimize.c
>> index c0d975b..a655829 100644
>> --- a/tcg/optimize.c
>> +++ b/tcg/optimize.c
>> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
>> return false;
>> }
>>
>> +/* Eliminate duplicate and unnecessary fence instructions */
>> +void tcg_optimize_mb(TCGContext *s)
>> +{
>> + int oi, oi_next;
>> + TCGArg prev_op_mb = -1;
>> + TCGOp *prev_op;
>
>
> The compiler throws up warnings about prev_op not being set:
>
> /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c: In function ‘tcg_optimize_mb’:
> /home/alex/lsrc/qemu/qemu.git/tcg/optimize.c:611:17: error: ‘prev_op’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> tcg_op_remove(s, prev_op);
> ^
OK, I will fix this.
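The likely fix is a one-line initialization, mirroring the -1 sentinel used for prev_op_mb. A toy version of the scan, with stand-in types, shows the initialized pointer doubling as the "no previous barrier yet" test that the compiler could not prove for the uninitialized variable:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Op { int opc; } Op;   /* stand-in for TCGOp */

/* Return the earlier of two consecutive barriers (the one that would be
 * passed to tcg_op_remove()), or NULL when no duplicate pair exists.
 * prev_op starts out NULL, so the first barrier seen is well defined. */
static Op *prev_barrier_before_dup(Op *ops, size_t n)
{
    Op *prev_op = NULL;   /* the fix: defined before the first read */
    for (size_t i = 0; i < n; i++) {
        if (ops[i].opc == 1 /* mb */) {
            if (prev_op) {
                return prev_op;
            }
            prev_op = &ops[i];
        }
    }
    return NULL;
}
```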
Thanks,
--
Pranith
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-14 20:29 [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions Pranith Kumar
2016-07-19 3:53 ` Richard Henderson
2016-07-19 14:36 ` Alex Bennée
@ 2016-07-19 15:56 ` Alex Bennée
2016-07-19 18:33 ` Pranith Kumar
2016-07-19 17:16 ` Paolo Bonzini
3 siblings, 1 reply; 9+ messages in thread
From: Alex Bennée @ 2016-07-19 15:56 UTC (permalink / raw)
To: Pranith Kumar
Cc: Richard Henderson, open list:All patches CC here, serge.fdrv
Pranith Kumar <bobby.prani@gmail.com> writes:
> This patch applies on top of the fence generation patch series.
>
> This commit optimizes fence instructions. Two optimizations are
> currently implemented. These are:
>
> 1. Unnecessary duplicate fence instructions
>
> If the same fence instruction is detected consecutively, we remove
> one instance of it.
>
> ex: mb; mb => mb, strl; strl => strl
>
> 2. Merging weaker fence with subsequent/previous stronger fence
>
> load-acquire/store-release fence can be combined with a full fence
> without relaxing the ordering constraint.
>
> ex: a) ld; ldaq; mb => ld; mb
> b) mb; strl; st => mb; st
What test cases do you have for this?
Currently the litmus tests don't fire (as they have exactly what they
need). Most programs don't seem to trigger multiple barriers.
>
> Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
> ---
> tcg/optimize.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> tcg/tcg.h | 1 +
> 2 files changed, 60 insertions(+)
>
> diff --git a/tcg/optimize.c b/tcg/optimize.c
> index c0d975b..a655829 100644
> --- a/tcg/optimize.c
> +++ b/tcg/optimize.c
> @@ -569,6 +569,63 @@ static bool swap_commutative2(TCGArg *p1, TCGArg *p2)
> return false;
> }
>
> +/* Eliminate duplicate and unnecessary fence instructions */
> +void tcg_optimize_mb(TCGContext *s)
> +{
> + int oi, oi_next;
> + TCGArg prev_op_mb = -1;
> + TCGOp *prev_op;
> +
> + for (oi = s->gen_first_op_idx; oi >= 0; oi = oi_next) {
> + TCGOp *op = &s->gen_op_buf[oi];
> + TCGArg *args = &s->gen_opparam_buf[op->args];
> + TCGOpcode opc = op->opc;
> +
> + switch (opc) {
> + case INDEX_op_mb:
> + {
> + TCGBar curr_mb_type = args[0] & 0xF0;
> + TCGBar prev_mb_type = prev_op_mb & 0xF0;
> +
> + if (curr_mb_type == prev_mb_type ||
> + (curr_mb_type == TCG_BAR_STRL && prev_mb_type == TCG_BAR_SC)) {
> + /* Remove the current weaker barrier op. The previous
> + * barrier is stronger and sufficient.
> + * mb; strl => mb; st
> + */
> + tcg_op_remove(s, op);
> + } else if (curr_mb_type == TCG_BAR_SC &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Remove the previous weaker barrier op. The current
> + * barrier is stronger and sufficient.
> + * ldaq; mb => ld; mb
> + */
> + tcg_op_remove(s, prev_op);
> + } else if (curr_mb_type == TCG_BAR_STRL &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Consecutive load-acquire and store-release barriers
> + * can be merged into one stronger SC barrier
> + * ldaq; strl => ld; mb; st
> + */
> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> + tcg_op_remove(s, prev_op);
> + }
> + prev_op_mb = args[0];
> + prev_op = op;
> + break;
> + }
> + case INDEX_op_insn_start:
> + break;
> + default:
> + prev_op_mb = -1;
> + prev_op = NULL;
> + break;
> + }
> +
> + oi_next = op->next;
> + }
> +}
> +
> /* Propagate constants and copies, fold constant expressions. */
> void tcg_optimize(TCGContext *s)
> {
> @@ -1327,4 +1384,6 @@ void tcg_optimize(TCGContext *s)
> break;
> }
> }
> +
> + tcg_optimize_mb(s);
> }
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index 60ebc2b..2d7a46d 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -904,6 +904,7 @@ void tcg_gen_callN(TCGContext *s, void *func,
> TCGArg ret, int nargs, TCGArg *args);
>
> void tcg_op_remove(TCGContext *s, TCGOp *op);
> +void tcg_optimize_mb(TCGContext *s);
> void tcg_optimize(TCGContext *s);
>
> /* only used for debugging purposes */
--
Alex Bennée
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-19 15:56 ` Alex Bennée
@ 2016-07-19 18:33 ` Pranith Kumar
0 siblings, 0 replies; 9+ messages in thread
From: Pranith Kumar @ 2016-07-19 18:33 UTC (permalink / raw)
To: Alex Bennée
Cc: Richard Henderson, open list:All patches CC here, serge.fdrv
Alex Bennée writes:
> Pranith Kumar <bobby.prani@gmail.com> writes:
>
>> This patch applies on top of the fence generation patch series.
>>
>> This commit optimizes fence instructions. Two optimizations are
>> currently implemented. These are:
>>
>> 1. Unnecessary duplicate fence instructions
>>
>> If the same fence instruction is detected consecutively, we remove
>> one instance of it.
>>
>> ex: mb; mb => mb, strl; strl => strl
>>
>> 2. Merging weaker fence with subsequent/previous stronger fence
>>
>> load-acquire/store-release fence can be combined with a full fence
>> without relaxing the ordering constraint.
>>
>> ex: a) ld; ldaq; mb => ld; mb
>> b) mb; strl; st => mb; st
>
> What test cases do you have for this?
>
> Currently the litmus tests don't fire (as they have exactly what they
> need). Most programs don't seem to trigger multiple barriers.
>
Indeed, these cases are not commonly seen in the wild. To test it, I wrote
small test programs that use the C11 atomic builtins to generate the
appropriate instructions, and then verified the generated code by running
'qemu-aarch64 -d in_asm,out_asm'.
_Atomic int val;

/* barrier() is assumed to be a full-fence macro, e.g.: */
#define barrier() __atomic_thread_fence(__ATOMIC_SEQ_CST)

int main()
{
    val = __atomic_load_n(&val, __ATOMIC_ACQUIRE);
    __atomic_store_n(&val, val, __ATOMIC_RELEASE);
    barrier();
    barrier();
}
--
Pranith
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-14 20:29 [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions Pranith Kumar
2016-07-19 15:56 ` Alex Bennée
@ 2016-07-19 17:16 ` Paolo Bonzini
2016-07-19 18:55 ` Pranith Kumar
3 siblings, 1 reply; 9+ messages in thread
From: Paolo Bonzini @ 2016-07-19 17:16 UTC (permalink / raw)
To: Pranith Kumar, Richard Henderson, open list:All patches CC here
Cc: serge.fdrv, alex.bennee
On 14/07/2016 22:29, Pranith Kumar wrote:
> + } else if (curr_mb_type == TCG_BAR_STRL &&
> + prev_mb_type == TCG_BAR_LDAQ) {
> + /* Consecutive load-acquire and store-release barriers
> + * can be merged into one stronger SC barrier
> + * ldaq; strl => ld; mb; st
> + */
> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
> + tcg_op_remove(s, prev_op);
Is this really an optimization? For example the processor could reorder
"st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
if you change ldaq1/strl2 to ld1/mb/st2.
On x86 for example a memory fence costs ~50 clock cycles, while normal
loads and stores are of course faster.
Of course this is useful if your target doesn't have ldaq/strl
instructions. In this case, however, you probably want to lower ldaq to
"ld;mb" and strl to "mb;st"; the other optimizations then will remove
the unnecessary barrier.
Paolo
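Paolo's lowering suggestion can be sketched as a two-step rewrite over toy opcodes (not the QEMU API): first expand the acquire/release ops into a plain access plus a full fence, then let the duplicate-barrier pass collapse adjacent fences, so "ldaq; strl" ends up as "ld; mb; st".

```c
#include <assert.h>
#include <stddef.h>

enum { LD, ST, MB, LDAQ, STRL };

/* Lower acquire/release ops for a host without ldaq/strl:
 * ldaq => ld; mb   and   strl => mb; st.
 * out must have room for 2*n entries; returns the new length. */
static size_t lower(const int *in, size_t n, int *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        switch (in[i]) {
        case LDAQ: out[m++] = LD; out[m++] = MB; break;
        case STRL: out[m++] = MB; out[m++] = ST; break;
        default:   out[m++] = in[i];             break;
        }
    }
    return m;
}

/* The existing duplicate-barrier rule: mb; mb => mb. */
static size_t drop_adjacent_mb(int *ops, size_t n)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (ops[i] == MB && m > 0 && ops[m - 1] == MB) {
            continue;
        }
        ops[m++] = ops[i];
    }
    return m;
}
```

Hosts that do have load-acquire/store-release instructions would skip the lowering and keep the barriers distinct, per Paolo's cost argument.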
* Re: [Qemu-devel] [RFC PATCH] tcg: Optimize fence instructions
2016-07-19 17:16 ` Paolo Bonzini
@ 2016-07-19 18:55 ` Pranith Kumar
0 siblings, 0 replies; 9+ messages in thread
From: Pranith Kumar @ 2016-07-19 18:55 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Richard Henderson, open list:All patches CC here, serge.fdrv,
alex.bennee
Paolo Bonzini writes:
> On 14/07/2016 22:29, Pranith Kumar wrote:
>> + } else if (curr_mb_type == TCG_BAR_STRL &&
>> + prev_mb_type == TCG_BAR_LDAQ) {
>> + /* Consecutive load-acquire and store-release barriers
>> + * can be merged into one stronger SC barrier
>> + * ldaq; strl => ld; mb; st
>> + */
>> + args[0] = (args[0] & 0x0F) | TCG_BAR_SC;
>> + tcg_op_remove(s, prev_op);
>
> Is this really an optimization? For example the processor could reorder
> "st1; ldaq1; strl2; ld2" to "ldaq1; ld2; st1; strl2". It cannot do this
> if you change ldaq1/strl2 to ld1/mb/st2.
>
> On x86 for example a memory fence costs ~50 clock cycles, while normal
> loads and stores are of course faster.
>
> Of course this is useful if your target doesn't have ldaq/strl
> instructions. In this case, however, you probably want to lower ldaq to
> "ld;mb" and strl to "mb;st"; the other optimizations then will remove
> the unnecessary barrier.
>
I agree that this is a conservative optimization. The problem is that, even
for architectures which have ldaq/strl instructions, the TCG backend
currently does not generate them; TCG just generates plain loads and stores.
I guess we did not need them while TCG was single-threaded, before MTTCG.

I am trying to add support for generating these instructions on AArch64.
Once that is done, we can disable the above optimization.
--
Pranith