* Support for x86_64 on aarch64 emulation
@ 2022-04-08 12:21 Redha Gouicem
From: Redha Gouicem @ 2022-04-08 12:21 UTC (permalink / raw)
To: qemu-devel; +Cc: d.g.sprokholt
We are working on support for x86_64 emulation on aarch64, mainly
related to memory ordering issues. We first wanted to know what the
community thinks about our proposal, and its chance of getting merged
one day.
Note that we worked with qemu-user, so there may be issues in system
mode that we missed.
# Problem
When generating the TCG instructions for memory accesses, fences are
always inserted *before* the access, following this translation rule:
x86 --> TCG --> aarch64
-------------------------------------
RMOV --> Fm_ld; ld --> DMBLD; LDR
WMOV --> Fm_st; st --> DMBFF; STR
Here, Fm_ld is a fence that orders any preceding memory access with
the subsequent load. Fm_st is a fence that orders any preceding
memory access with the subsequent store. This means that, in TCG, all
memory accesses are ordered by fences. Thus, no memory accesses can be
re-ordered in TCG. This is a problem, because it is *stricter than
x86*. Consider when a program contains:
WMOV; RMOV
x86 allows re-ordering of independent store-load pairs, so the above
pair may safely be re-ordered on an x86 host. However, with QEMU's
current translation, it becomes:
DMBFF; STR; DMBLD; LDR
In this target aarch64 code, no re-ordering is possible. Hence, QEMU
enforces a stronger model than x86. While that is correct, it harms
performance.
# Solution
We propose an alternative scheme, which we formally proved correct
(paper under review):
x86 --> TCG --> aarch64
-------------------------------------
RMOV --> ld; Fld_m --> LDR; DMBLD
WMOV --> Fst_st; st --> DMBST; STR
This new scheme precisely captures the observable behaviors of the
input program (under x86's memory model), and the inserted fences
preserve this behavior in the resulting TCG and aarch64 programs
(formally verified). Note that this scheme enforces fewer orderings
than the previous (unnecessarily strong) mapping scheme, which
benefits performance. On the PARSEC benchmarks, we measured
improvements of up to 19.7%, 6.7% on average.
# Implementation Considerations
Different source and host architectures may demand different mapping
schemes. Some schemes place fences before an instruction, while others
place them after. The implementation of fence placement should thus be
flexible enough to allow either. Note, though, that write-read pairs
are unordered on almost all architectures.
We see two ways of doing this:
- extracting the placement of the fence from the
tcg_gen_qemu_ld/st_i32/i64 functions, and have each architecture
explicitly generate the fence at the correct place
- adding two parameters to these functions specifying the strength of
the "before" and "after" fences. The function would then generate
both fences in the IR (one of them may be a NOP fence), which in
turn will be translated back to the host
We are eager to see what you think about this change in TCG.
Cheers!
--
Redha Gouicem
Post doctoral researcher
Chair of Decentralized Systems Engineering
Department of Informatics, Technical University of Munich (TUM)
* Re: Support for x86_64 on aarch64 emulation
From: Richard Henderson @ 2022-04-08 15:27 UTC (permalink / raw)
To: Redha Gouicem, qemu-devel; +Cc: d.g.sprokholt
This has been on my to-do list for quite some time. My previous work was
https://patchew.org/QEMU/20210316220735.2048137-1-richard.henderson@linaro.org/
I have some further work (possibly not posted? I can't find a reference) which attempted
to strength reduce the barriers, and to use load-acquire/store-release insns when
alignment of the operation allows. Unfortunately, for the interesting cases in question
(x86 and s390x guests, with the strongest guest memory models), it was rare that we could
prove the alignment was sufficient, so it was a fair amount of work being done for no gain.
r~
* Re: Support for x86_64 on aarch64 emulation
From: Redha Gouicem @ 2022-04-14 12:24 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: d.g.sprokholt
I will start working on a cleaner patch to allow the correct memory
ordering enforcement to be implemented, as my current PoC is very hacky.
I'll post it to the mailing list as soon as I'm done.
Redha
--
Redha Gouicem
Post doctoral researcher
Chair of Decentralized Systems Engineering
Department of Informatics, Technical University of Munich (TUM)