* x86 TCG helpers clobbered registers
@ 2020-12-04 15:36 Stephane Duverger
2020-12-04 19:35 ` Richard Henderson
0 siblings, 1 reply; 7+ messages in thread
From: Stephane Duverger @ 2020-12-04 15:36 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson
Hello,
While looking at tcg/i386/tcg-target.c.inc:tcg_out_qemu_st(), I
discovered that the TCG generates a call to a store helper at the end
of the TB, which is executed on a TLB miss before getting back to the
remaining translated ops. I tried to mimic this behavior around the
fast path (right between tcg_out_tlb_load() and
tcg_out_qemu_st_direct()) to filter memory store accesses.
I know there are now TCG plugins for that purpose at the TCG IR
level, from which every tcg-target can benefit. FWIW, my design
choice was driven by the fact that I always work on an x86 host, and
plugins did not exist at the time. Anyway, the point is really about
generating a call to a helper at the TCG IR level (the classic
scenario) versus later, during tcg-target code generation (the slow
path, for instance).
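For reference, the classic IR-level route would look roughly like
this (only a sketch; the helper name and signature are mine,
following the usual DEF_HELPER conventions, and assuming addr is the
TCGv holding the guest address at that point):

    /* in the target front end's helper.h -- illustrative name */
    DEF_HELPER_3(st_filter, void, env, tl, i32)

    /* in the translator, right before the qemu_st op is emitted */
    TCGv_i32 t_opc = tcg_const_i32(memop);
    gen_helper_st_filter(cpu_env, addr, t_opc);
    tcg_temp_free_i32(t_opc);

Here the call goes through tcg_gen_callN(), so the register allocator
sees it and spills the call-clobbered registers by itself.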
The TCG, when calling a helper, knows that some registers will be
call-clobbered and as such must free them. This is what I observed in
tcg_reg_alloc_call():
    /* clobber call registers */
    for (i = 0; i < TCG_TARGET_NB_REGS; i++) {
        if (tcg_regset_test_reg(tcg_target_call_clobber_regs, i)) {
            tcg_reg_free(s, i, allocated_regs);
        }
    }
But in our case (i.e. INDEX_op_qemu_st_i32), the TCG code path comes
from:
    tcg_reg_alloc_op()
      tcg_out_op()
        tcg_out_qemu_st()
Then tcg_out_tlb_load() will inject a jump to the slow path, whose
generated code does not seem to take care of every call-clobbered
register, if we look at tcg_out_qemu_st_slow_path().
First, for an i386 (32-bit) tcg-target, as expected, the helper
arguments are placed on the stack. I noticed that 'esp' is not
shifted down before stacking up the args, which might corrupt the
last stacked words.
Second, for both the 32- and 64-bit tcg-targets, since not all of the
call-clobbered registers are preserved, it may happen that, depending
on the code executed by the helper (and so generated by GCC), these
registers get clobbered (e.g. R10 on x86-64).
While this never happened for the slow-path helper call, I observed
that my guest had trouble running when filtering memory the same way
the slow-path helper is called. Conversely, if I push/pop all of the
call-clobbered regs around the call to the helper, everything runs as
expected.
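Concretely, my workaround is something like this (only a sketch: the
general-purpose clobbers on x86-64, ignoring the xmm registers the
backend also marks call-clobbered):

    /* bracket the injected call with push/pop of the call-clobbered
       GP registers */
    static const TCGReg clobbered[] = {
        TCG_REG_RAX, TCG_REG_RCX, TCG_REG_RDX, TCG_REG_RSI, TCG_REG_RDI,
        TCG_REG_R8, TCG_REG_R9, TCG_REG_R10, TCG_REG_R11,
    };

    for (int i = 0; i < ARRAY_SIZE(clobbered); i++) {
        tcg_out_push(s, clobbered[i]);
    }
    tcg_out_push(s, TCG_REG_RAX);   /* pad: keep the stack 16-byte
                                       aligned at the call */

    /* ... marshal the args, then tcg_out_call(s, filter) ... */

    tcg_out_pop(s, TCG_REG_RAX);
    for (int i = ARRAY_SIZE(clobbered) - 1; i >= 0; i--) {
        tcg_out_pop(s, clobbered[i]);
    }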
Is this correct? Am I missing something?
Thanks a lot in advance for your eagle eye on this :)
* Re: x86 TCG helpers clobbered registers
  2020-12-04 15:36 x86 TCG helpers clobbered registers Stephane Duverger
@ 2020-12-04 19:35 ` Richard Henderson
  2020-12-05  1:34   ` Stephane Duverger
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2020-12-04 19:35 UTC (permalink / raw)
To: Stephane Duverger, qemu-devel; +Cc: Paolo Bonzini

On 12/4/20 9:36 AM, Stephane Duverger wrote:
> Hello,
>
> While looking at tcg/i386/tcg-target.c.inc:tcg_out_qemu_st(), I
> discovered that the TCG generates a call to a store helper at the end
> of the TB, which is executed on a TLB miss before getting back to the
> remaining translated ops. I tried to mimic this behavior around the
> fast path (right between tcg_out_tlb_load() and
> tcg_out_qemu_st_direct()) to filter memory store accesses.

There's your bug -- don't do that.

> I know there are now TCG plugins for that purpose at the TCG IR
> level, from which every tcg-target can benefit. FWIW, my design
> choice was driven by the fact that I always work on an x86 host, and
> plugins did not exist at the time. Anyway, the point is really about
> generating a call to a helper at the TCG IR level (the classic
> scenario) versus later, during tcg-target code generation (the slow
> path, for instance).

You can't just inject a call anywhere you like.  If you add it at the
IR level, then the rest of the compiler will see it and work properly.
If you add the call in the middle of another operation, the compiler
doesn't get to see it and Bad Things Happen.

> The TCG, when calling a helper, knows that some registers will be
> call-clobbered and as such must free them. This is what I observed in
> tcg_reg_alloc_call():
>
>     /* clobber call registers */
>     for (i = 0; i < TCG_TARGET_NB_REGS; i++) {
>         if (tcg_regset_test_reg(tcg_target_call_clobber_regs, i)) {
>             tcg_reg_free(s, i, allocated_regs);
>         }
>     }
>
> But in our case (i.e. INDEX_op_qemu_st_i32), the TCG code path comes
> from:
>
>     tcg_reg_alloc_op()
>       tcg_out_op()
>         tcg_out_qemu_st()
>
> Then tcg_out_tlb_load() will inject a jump to the slow path, whose
> generated code does not seem to take care of every call-clobbered
> register, if we look at tcg_out_qemu_st_slow_path().

You missed

>     if (def->flags & TCG_OPF_CALL_CLOBBER) {
>         /* XXX: permit generic clobber register list ? */
>         for (i = 0; i < TCG_TARGET_NB_REGS; i++) {
>             if (tcg_regset_test_reg(tcg_target_call_clobber_regs, i)) {
>                 tcg_reg_free(s, i, i_allocated_regs);
>             }
>         }
>     }

which handles this in tcg_reg_alloc_op().

> First, for an i386 (32-bit) tcg-target, as expected, the helper
> arguments are placed on the stack. I noticed that 'esp' is not
> shifted down before stacking up the args, which might corrupt the
> last stacked words.

No, we generate code for a constant esp, as if by gcc's -mno-push-args
option.  We have reserved TCG_STATIC_CALL_ARGS_SIZE bytes of stack for
the arguments (which is actually larger than necessary for any of the
tcg targets).

r~
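In concrete terms, the constant-esp scheme is set up once in the
prologue, roughly like this (a condensed paraphrase of
tcg/i386/tcg-target.c.inc, not a verbatim excerpt):

    /* tcg_target_qemu_prologue(): the frame reserved at TB entry
       already includes room for outgoing helper arguments. */
    stack_addend = FRAME_SIZE - PUSH_SIZE;  /* FRAME_SIZE accounts for
                                               TCG_STATIC_CALL_ARGS_SIZE */
    tcg_out_addi(s, TCG_REG_CALL_STACK, -stack_addend);

    /* Later, tcg_reg_alloc_call() writes stacked arguments into that
       reserved area with plain stores instead of pushes: */
    tcg_out_st(s, ts->type, reg, TCG_REG_CALL_STACK, stack_offset);

So esp stays constant across the TB body, and argument stores can
never collide with live stack words.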
* Re: x86 TCG helpers clobbered registers
  2020-12-04 19:35 ` Richard Henderson
@ 2020-12-05  1:34   ` Stephane Duverger
  2020-12-05 12:38     ` Richard Henderson
  0 siblings, 1 reply; 7+ messages in thread
From: Stephane Duverger @ 2020-12-05 1:34 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Paolo Bonzini

On Fri, Dec 04, 2020 at 01:35:55PM -0600, Richard Henderson wrote:

Thank you Richard for your answer. I don't want to start a debate, or
defend the way I initially did things. I really want to clarify these
internals, and I hope it will benefit other QEMU enthusiasts.

> You can't just inject a call anywhere you like. If you add it at
> the IR level, then the rest of the compiler will see it and work
> properly. If you add the call in the middle of another operation,
> the compiler doesn't get to see it and Bad Things Happen.

I do understand that, but surprisingly, isn't that what is done in
the qemu slow path? I mean, the call to the helper is not generated
at the IR level but rather injected through a jump right in the
middle of the currently generated instructions, plus code added at
the end of the TB.

What's the difference between the way it is currently done for the
slow path and something like:

static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
{
    [...]
    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, opc,
                     label_ptr, offsetof(CPUTLBEntry, addr_write));

    /* TLB Hit. */
    tcg_out_qemu_st_filter(s, opc, addrlo, addrhi, datalo, datahi);
    tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, -1, 0, 0, opc);

    /* Record the current context of a store into ldst label */
    add_qemu_ldst_label(s, false, is64, oi, datalo, datahi,
                        addrlo, addrhi, s->code_ptr, label_ptr);
}

Where:

static void tcg_out_qemu_st_filter(TCGContext *s, MemOp opc,
                                   TCGReg addrlo, TCGReg addrhi,
                                   TCGReg datalo, TCGReg datahi)
{
    MemOp s_bits = opc & MO_SIZE;

    /* used later on by tcg_out_qemu_st_direct */
    tcg_out_push(s, TCG_REG_L1);

    tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
                tcg_target_call_iarg_regs[0], addrlo);
    tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
                tcg_target_call_iarg_regs[1], datalo);
    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], opc);
    tcg_out_call(s, (void *)filter_store_memop);

    tcg_out_pop(s, TCG_REG_L1);
}

Does the ldst_label mechanism generating the slow-path code at the
TB's end change anything? There is still an injected 'jne' in
tcg_out_tlb_load() which redirects to the slow-path code, whatever
its location, just as I do in-place for tcg_out_qemu_st_filter(). For
sure the TCG is blind at some point, but it works for the slow path,
so it should work for the filter.

The TCG qemu_st_i32 op is

    DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 1,
        TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)

and as you stated, tcg_reg_alloc_op() has properly managed the
call-clobbered registers. So we should be safe calling a helper from
tcg_out_qemu_st(), and arguably that's why you do so for the slow
path?

> > I noticed that 'esp' is not shifted down before stacking up the
> > args, which might corrupt the last stacked words.
>
> No, we generate code for a constant esp, as if by gcc's
> -mno-push-args option. We have reserved TCG_STATIC_CALL_ARGS_SIZE
> bytes of stack for the arguments (which is actually larger than
> necessary for any of the tcg targets).

As this is done only in the TB prologue, do you mean that the TCG
will never generate the equivalent of a push *followed* by a memory
store/load? Our host esp will never point to a last stacked word
issued by the translation of a TCG op?
* Re: x86 TCG helpers clobbered registers
  2020-12-05  1:34   ` Stephane Duverger
@ 2020-12-05 12:38     ` Richard Henderson
  2020-12-07 10:10       ` Stephane Duverger
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2020-12-05 12:38 UTC (permalink / raw)
To: Stephane Duverger, qemu-devel; +Cc: Paolo Bonzini

On 12/4/20 7:34 PM, Stephane Duverger wrote:
>> You can't just inject a call anywhere you like. If you add it at
>> the IR level, then the rest of the compiler will see it and work
>> properly. If you add the call in the middle of another operation,
>> the compiler doesn't get to see it and Bad Things Happen.
>
> I do understand that, but surprisingly, isn't that what is done in
> the qemu slow path? I mean, the call to the helper is not generated
> at the IR level but rather injected through a jump right in the
> middle of the currently generated instructions, plus code added at
> the end of the TB.
>
> What's the difference between the way it is currently done for the
> slow path and something like:
>
> static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
> {
>     [...]
>     tcg_out_tlb_load(s, addrlo, addrhi, mem_index, opc,
>                      label_ptr, offsetof(CPUTLBEntry, addr_write));
>
>     /* TLB Hit. */
>     tcg_out_qemu_st_filter(s, opc, addrlo, addrhi, datalo, datahi);
>     tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, -1, 0, 0, opc);

The difference is that the slow path is aware that there are input
registers that are live, containing data (addrlo, addrhi, datalo,
datahi), which must be stored into the arguments for the slow path
call.  Those input registers (and all other call-clobbered registers)
are dead *after* the slow path call.

You are injecting your filter call while those input registers are
still live.  They will be next used by the fast-path store.

That is a very significant difference.

>> No, we generate code for a constant esp, as if by gcc's
>> -mno-push-args option. We have reserved TCG_STATIC_CALL_ARGS_SIZE
>> bytes of stack for the arguments (which is actually larger than
>> necessary for any of the tcg targets).
>
> As this is done only in the TB prologue, do you mean that the TCG
> will never generate the equivalent of a push *followed* by a memory
> store/load? Our host esp will never point to a last stacked word
> issued by the translation of a TCG op?

TCG will never generate a push for an argument register.  The only
push outside of the prologue is to store the return address for a
jmp, a "call" returning to a different address.

r~
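To make the hazard concrete, the emitted fast path with an injected
call would look roughly like this (a hand-written sketch; the
register choices are illustrative, not what the allocator would
actually pick):

    ; ...TLB compare...
    jne   slow_path             ; taken on TLB miss
    mov   rdi, <addrlo>         ; marshal the filter arguments
    mov   rsi, <datalo>
    call  filter_store_memop    ; compiled C code may clobber rax, rcx,
                                ; rdx, rsi, rdi, r8-r11 -- possibly the
                                ; very registers holding addr/data
    mov   [L1], <datalo>        ; fast-path store now reads a clobbered
                                ; register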
* Re: x86 TCG helpers clobbered registers
  2020-12-05 12:38     ` Richard Henderson
@ 2020-12-07 10:10       ` Stephane Duverger
  2020-12-08 21:18         ` Richard Henderson
  0 siblings, 1 reply; 7+ messages in thread
From: Stephane Duverger @ 2020-12-07 10:10 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Paolo Bonzini

On Sat, Dec 05, 2020 at 06:38:25AM -0600, Richard Henderson wrote:
> The difference is that the slow path is aware that there are input
> registers that are live, containing data (addrlo, addrhi, datalo,
> datahi), which must be stored into the arguments for the slow path
> call. Those input registers (and all other call-clobbered registers)
> are dead *after* the slow path call.
>
> You are injecting your filter call while those input registers are
> still live. They will be next used by the fast-path store.
>
> That is a very significant difference.

Ok. That's why I saved REG_L1 (prepared by tlb_load) for both the
st/ld_direct cases, plus datalo for st_direct only. I saw that datahi
is only used for MO_64 on a 32-bit tcg-target. I understand it better
now, thanks to you.

This leads me to this simple reflection:

If we want to filter every memory access *out of the fast path*, the
most natural place to do so would be store_helper() and load_helper()
in accel/tcg/cputlb.c (as sketched below). By doing so, every target
would benefit from filtering, and even specific helpers using the
cpu_ldst functions would be intercepted. No?

For the remaining fast-path case, it could be interesting to generate
it, this time at the IR level (tlb_load, jne to slow_path, direct
load/store)? Again, every target would benefit from filtering without
the need for a specific fast-path implementation in
tcg/<arch>/tcg-target.c.inc.

Wouldn't that be simpler than the actual mem plugin implementation,
which generates the filter callback *after* the load/store and has
specific extra work for tracking memory accesses performed from
helpers (afaiu)?
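The kind of hook I have in mind is something like this (only a
sketch: mem_store_filter is invented, and the signature follows the
5.x store_helper()):

    /* accel/tcg/cputlb.c -- hypothetical interception point */
    static inline void QEMU_ALWAYS_INLINE
    store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
                 TCGMemOpIdx oi, uintptr_t retaddr, MemOp op)
    {
        if (unlikely(mem_store_filter)) {   /* invented global hook */
            mem_store_filter(env, addr, val, op);
        }
        /* ... existing TLB lookup and actual store ... */
    }

Every target's slow path, as well as the cpu_ldst-style accessors
used by helpers, funnels through here.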
* Re: x86 TCG helpers clobbered registers
  2020-12-07 10:10       ` Stephane Duverger
@ 2020-12-08 21:18         ` Richard Henderson
  2020-12-08 22:39           ` Stephane Duverger
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2020-12-08 21:18 UTC (permalink / raw)
To: Stephane Duverger, qemu-devel; +Cc: Paolo Bonzini

On 12/7/20 4:10 AM, Stephane Duverger wrote:
> This leads me to this simple reflection:
>
> If we want to filter every memory access *out of the fast path*, the
> most natural place to do so would be store_helper() and
> load_helper() in accel/tcg/cputlb.c. By doing so, every target would
> benefit from filtering, and even specific helpers using the cpu_ldst
> functions would be intercepted. No?
>
> For the remaining fast-path case, it could be interesting to
> generate it, this time at the IR level (tlb_load, jne to slow_path,
> direct load/store)? Again, every target would benefit from filtering
> without the need for a specific fast-path implementation in
> tcg/<arch>/tcg-target.c.inc.
>
> Wouldn't that be simpler than the actual mem plugin implementation,
> which generates the filter callback *after* the load/store and has
> specific extra work for tracking memory accesses performed from
> helpers (afaiu)?

As for modifying store_helper(), the reason not to do it there is
that it misses the fast-path cases.

As for modifying the fast-path cases, the code is quite delicate, and
you run into problems with live registers.  Which could be worked
around in each backend, but... why?

Which naturally suggests instrumentation separate from the above,
which is exactly what we do.

So, no, I don't think it would be simpler any other way.

r~
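For reference, the existing plugin route looks like this minimal
sketch against the public plugin API (the filtering logic itself is
left empty; only stores are instrumented):

    #include <qemu-plugin.h>

    QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

    /* called back on every instrumented memory access */
    static void vcpu_mem(unsigned int cpu_index, qemu_plugin_meminfo_t info,
                         uint64_t vaddr, void *udata)
    {
        if (qemu_plugin_mem_is_store(info)) {
            /* filter/record the store here */
        }
    }

    /* instrument every instruction of every translated TB */
    static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
    {
        size_t n = qemu_plugin_tb_n_insns(tb);
        for (size_t i = 0; i < n; i++) {
            struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
            qemu_plugin_register_vcpu_mem_cb(insn, vcpu_mem,
                                             QEMU_PLUGIN_CB_NO_REGS,
                                             QEMU_PLUGIN_MEM_W, NULL);
        }
    }

    QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                               const qemu_info_t *info,
                                               int argc, char **argv)
    {
        qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
        return 0;
    }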
* Re: x86 TCG helpers clobbered registers
  2020-12-08 21:18         ` Richard Henderson
@ 2020-12-08 22:39           ` Stephane Duverger
  0 siblings, 0 replies; 7+ messages in thread
From: Stephane Duverger @ 2020-12-08 22:39 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Paolo Bonzini

On Tue, Dec 08, 2020 at 03:18:54PM -0600, Richard Henderson wrote:
> As for modifying the fast-path cases, the code is quite delicate,
> and you run into problems with live registers. Which could be
> worked around in each backend, but... why?

I was perhaps thinking that working at the IR level would prevent
these live-register issues, especially since vCPU TLB handling seems
to be tcg-target agnostic. But I do understand your position, I have
no patch suggesting a viable alternative implementation, and most of
all I don't want to take more of your time.

I appreciated the discussion and your help, Richard. Thanks again.