* x86 TCG helpers clobbered registers
@ 2020-12-04 15:36 Stephane Duverger
2020-12-04 19:35 ` Richard Henderson
0 siblings, 1 reply; 7+ messages in thread
From: Stephane Duverger @ 2020-12-04 15:36 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Richard Henderson
Hello,
While looking at tcg/i386/tcg-target.c.inc:tcg_out_qemu_st(), I
discovered that the TCG generates a call to a store helper at the end
of the TB, which is executed on a TLB miss before getting back to the
remaining translated ops. I tried to mimic this behavior around the
fast path (right between tcg_out_tlb_load() and
tcg_out_qemu_st_direct()) to filter memory store accesses.
I know there are now TCG plugins for that purpose at the TCG IR
level, from which every tcg-target can benefit. FWIW, my design
choice was driven by the fact that I always work on an x86 host, and
plugins did not exist at the time. Anyway, the point is really about
generating a call to a helper at the TCG IR level (the classic
scenario) versus later, during tcg-target code generation (the slow
path, for instance).
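For reference, the classic IR-level route would look roughly like
this (only a sketch; the helper name and signature are mine,
following the usual DEF_HELPER conventions, and assuming addr is the
TCGv holding the guest address at that point):

    /* in the target front end's helper.h -- illustrative name */
    DEF_HELPER_3(st_filter, void, env, tl, i32)

    /* in the translator, right before the qemu_st op is emitted */
    TCGv_i32 t_opc = tcg_const_i32(memop);
    gen_helper_st_filter(cpu_env, addr, t_opc);
    tcg_temp_free_i32(t_opc);

Here the call goes through tcg_gen_callN(), so the register allocator
sees it and spills the call-clobbered registers by itself.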
The TCG, when calling a helper, knows that some registers will be
call-clobbered and as such must free them. This is what I observed in
tcg_reg_alloc_call():
    /* clobber call registers */
    for (i = 0; i < TCG_TARGET_NB_REGS; i++) {
        if (tcg_regset_test_reg(tcg_target_call_clobber_regs, i)) {
            tcg_reg_free(s, i, allocated_regs);
        }
    }
But in our case (i.e. INDEX_op_qemu_st_i32), the TCG code path comes
from:
    tcg_reg_alloc_op()
      tcg_out_op()
        tcg_out_qemu_st()
Then tcg_out_tlb_load() will inject a jump to the slow path, whose
generated code does not seem to take care of every call-clobbered
register, if we look at tcg_out_qemu_st_slow_path().
First, for an i386 (32-bit) tcg-target, as expected, the helper
arguments are placed on the stack. I noticed that 'esp' is not
shifted down before stacking up the args, which might corrupt the
last stacked words.
Second, for both the 32- and 64-bit tcg-targets, since not all of the
call-clobbered registers are preserved, it may happen that, depending
on the code executed by the helper (and so generated by GCC), these
registers get clobbered (e.g. R10 on x86-64).
While this never happened for the slow-path helper call, I observed
that my guest had trouble running when filtering memory the same way
the slow-path helper is called. Conversely, if I push/pop all of the
call-clobbered regs around the call to the helper, everything runs as
expected.
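Concretely, my workaround is something like this (only a sketch: the
general-purpose clobbers on x86-64, ignoring the xmm registers the
backend also marks call-clobbered):

    /* bracket the injected call with push/pop of the call-clobbered
       GP registers */
    static const TCGReg clobbered[] = {
        TCG_REG_RAX, TCG_REG_RCX, TCG_REG_RDX, TCG_REG_RSI, TCG_REG_RDI,
        TCG_REG_R8, TCG_REG_R9, TCG_REG_R10, TCG_REG_R11,
    };

    for (int i = 0; i < ARRAY_SIZE(clobbered); i++) {
        tcg_out_push(s, clobbered[i]);
    }
    tcg_out_push(s, TCG_REG_RAX);   /* pad: keep the stack 16-byte
                                       aligned at the call */

    /* ... marshal the args, then tcg_out_call(s, filter) ... */

    tcg_out_pop(s, TCG_REG_RAX);
    for (int i = ARRAY_SIZE(clobbered) - 1; i >= 0; i--) {
        tcg_out_pop(s, clobbered[i]);
    }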
Is this correct? Am I missing something?
Thanks a lot in advance for your eagle eye on this :)
* Re: x86 TCG helpers clobbered registers
  2020-12-04 15:36 x86 TCG helpers clobbered registers Stephane Duverger
@ 2020-12-04 19:35 ` Richard Henderson
  2020-12-05  1:34   ` Stephane Duverger
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2020-12-04 19:35 UTC (permalink / raw)
To: Stephane Duverger, qemu-devel; +Cc: Paolo Bonzini

On 12/4/20 9:36 AM, Stephane Duverger wrote:
> Hello,
>
> While looking at tcg/i386/tcg-target.c.inc:tcg_out_qemu_st(), I
> discovered that the TCG generates a call to a store helper at the end
> of the TB, which is executed on a TLB miss before getting back to the
> remaining translated ops. I tried to mimic this behavior around the
> fast path (right between tcg_out_tlb_load() and
> tcg_out_qemu_st_direct()) to filter memory store accesses.

There's your bug -- don't do that.

> I know there are now TCG plugins for that purpose at the TCG IR
> level, from which every tcg-target can benefit. FWIW, my design
> choice was driven by the fact that I always work on an x86 host, and
> plugins did not exist at the time. Anyway, the point is really about
> generating a call to a helper at the TCG IR level (the classic
> scenario) versus later, during tcg-target code generation (the slow
> path, for instance).

You can't just inject a call anywhere you like.  If you add it at the
IR level, then the rest of the compiler will see it and work properly.
If you add the call in the middle of another operation, the compiler
doesn't get to see it and Bad Things Happen.

> The TCG, when calling a helper, knows that some registers will be
> call-clobbered and as such must free them. This is what I observed in
> tcg_reg_alloc_call():
>
>     /* clobber call registers */
>     for (i = 0; i < TCG_TARGET_NB_REGS; i++) {
>         if (tcg_regset_test_reg(tcg_target_call_clobber_regs, i)) {
>             tcg_reg_free(s, i, allocated_regs);
>         }
>     }
>
> But in our case (i.e. INDEX_op_qemu_st_i32), the TCG code path comes
> from:
>
>     tcg_reg_alloc_op()
>       tcg_out_op()
>         tcg_out_qemu_st()
>
> Then tcg_out_tlb_load() will inject a jump to the slow path, whose
> generated code does not seem to take care of every call-clobbered
> register, if we look at tcg_out_qemu_st_slow_path().

You missed

>     if (def->flags & TCG_OPF_CALL_CLOBBER) {
>         /* XXX: permit generic clobber register list ? */
>         for (i = 0; i < TCG_TARGET_NB_REGS; i++) {
>             if (tcg_regset_test_reg(tcg_target_call_clobber_regs, i)) {
>                 tcg_reg_free(s, i, i_allocated_regs);
>             }
>         }
>     }

which handles this in tcg_reg_alloc_op().

> First, for an i386 (32-bit) tcg-target, as expected, the helper
> arguments are placed on the stack. I noticed that 'esp' is not
> shifted down before stacking up the args, which might corrupt the
> last stacked words.

No, we generate code for a constant esp, as if by gcc's -mno-push-args
option.  We have reserved TCG_STATIC_CALL_ARGS_SIZE bytes of stack for
the arguments (which is actually larger than necessary for any of the
tcg targets).

r~
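In concrete terms, the constant-esp scheme is set up once in the
prologue, roughly like this (a condensed paraphrase of
tcg/i386/tcg-target.c.inc, not a verbatim excerpt):

    /* tcg_target_qemu_prologue(): the frame reserved at TB entry
       already includes room for outgoing helper arguments. */
    stack_addend = FRAME_SIZE - PUSH_SIZE;  /* FRAME_SIZE accounts for
                                               TCG_STATIC_CALL_ARGS_SIZE */
    tcg_out_addi(s, TCG_REG_CALL_STACK, -stack_addend);

    /* Later, tcg_reg_alloc_call() writes stacked arguments into that
       reserved area with plain stores instead of pushes: */
    tcg_out_st(s, ts->type, reg, TCG_REG_CALL_STACK, stack_offset);

So esp stays constant across the TB body, and argument stores can
never collide with live stack words.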
* Re: x86 TCG helpers clobbered registers
  2020-12-04 19:35 ` Richard Henderson
@ 2020-12-05  1:34   ` Stephane Duverger
  2020-12-05 12:38     ` Richard Henderson
  0 siblings, 1 reply; 7+ messages in thread
From: Stephane Duverger @ 2020-12-05 1:34 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Paolo Bonzini

On Fri, Dec 04, 2020 at 01:35:55PM -0600, Richard Henderson wrote:

Thank you Richard for your answer. I don't want to start a debate, or
defend the way I initially did things. I really want to clarify these
internals, and I hope it will benefit other QEMU enthusiasts.

> You can't just inject a call anywhere you like. If you add it at
> the IR level, then the rest of the compiler will see it and work
> properly. If you add the call in the middle of another operation,
> the compiler doesn't get to see it and Bad Things Happen.

I do understand that, but surprisingly, isn't that what is done in
the qemu slow path? I mean, the call to the helper is not generated
at the IR level but rather injected through a jump right in the
middle of the currently generated instructions, plus code added at
the end of the TB.

What's the difference between the way it is currently done for the
slow path and something like:

static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
{
    [...]
    tcg_out_tlb_load(s, addrlo, addrhi, mem_index, opc,
                     label_ptr, offsetof(CPUTLBEntry, addr_write));

    /* TLB Hit. */
    tcg_out_qemu_st_filter(s, opc, addrlo, addrhi, datalo, datahi);
    tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, -1, 0, 0, opc);

    /* Record the current context of a store into ldst label */
    add_qemu_ldst_label(s, false, is64, oi, datalo, datahi,
                        addrlo, addrhi, s->code_ptr, label_ptr);
}

Where:

static void tcg_out_qemu_st_filter(TCGContext *s, MemOp opc,
                                   TCGReg addrlo, TCGReg addrhi,
                                   TCGReg datalo, TCGReg datahi)
{
    MemOp s_bits = opc & MO_SIZE;

    /* used later on by tcg_out_qemu_st_direct */
    tcg_out_push(s, TCG_REG_L1);

    tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
                tcg_target_call_iarg_regs[0], addrlo);
    tcg_out_mov(s, (s_bits == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32),
                tcg_target_call_iarg_regs[1], datalo);
    tcg_out_movi(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[2], opc);
    tcg_out_call(s, (void *)filter_store_memop);

    tcg_out_pop(s, TCG_REG_L1);
}

Does the ldst_label mechanism generating the slow-path code at the
TB's end change anything? There is still an injected 'jne' in
tcg_out_tlb_load() which redirects to the slow-path code, whatever
its location, just as I do in-place for tcg_out_qemu_st_filter(). For
sure the TCG is blind at some point, but it works for the slow path,
so it should work for the filter.

The TCG qemu_st_i32 op is

    DEF(qemu_st_i32, 0, TLADDR_ARGS + 1, 1,
        TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS)

and as you stated, tcg_reg_alloc_op() has properly managed the
call-clobbered registers. So we should be safe calling a helper from
tcg_out_qemu_st(), and arguably that's why you do so for the slow
path?

> > I noticed that 'esp' is not shifted down before stacking up the
> > args, which might corrupt the last stacked words.
>
> No, we generate code for a constant esp, as if by gcc's
> -mno-push-args option. We have reserved TCG_STATIC_CALL_ARGS_SIZE
> bytes of stack for the arguments (which is actually larger than
> necessary for any of the tcg targets).

As this is done only in the TB prologue, do you mean that the TCG
will never generate the equivalent of a push *followed* by a memory
store/load? Our host esp will never point to a last stacked word
issued by the translation of a TCG op?
* Re: x86 TCG helpers clobbered registers
  2020-12-05  1:34   ` Stephane Duverger
@ 2020-12-05 12:38     ` Richard Henderson
  2020-12-07 10:10       ` Stephane Duverger
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2020-12-05 12:38 UTC (permalink / raw)
To: Stephane Duverger, qemu-devel; +Cc: Paolo Bonzini

On 12/4/20 7:34 PM, Stephane Duverger wrote:
>> You can't just inject a call anywhere you like. If you add it at
>> the IR level, then the rest of the compiler will see it and work
>> properly. If you add the call in the middle of another operation,
>> the compiler doesn't get to see it and Bad Things Happen.
>
> I do understand that, but surprisingly, isn't that what is done in
> the qemu slow path? I mean, the call to the helper is not generated
> at the IR level but rather injected through a jump right in the
> middle of the currently generated instructions, plus code added at
> the end of the TB.
>
> What's the difference between the way it is currently done for the
> slow path and something like:
>
> static void tcg_out_qemu_st(TCGContext *s, const TCGArg *args, bool is64)
> {
>     [...]
>     tcg_out_tlb_load(s, addrlo, addrhi, mem_index, opc,
>                      label_ptr, offsetof(CPUTLBEntry, addr_write));
>
>     /* TLB Hit. */
>     tcg_out_qemu_st_filter(s, opc, addrlo, addrhi, datalo, datahi);
>     tcg_out_qemu_st_direct(s, datalo, datahi, TCG_REG_L1, -1, 0, 0, opc);

The difference is that the slow path is aware that there are input
registers that are live, containing data (addrlo, addrhi, datalo,
datahi), which must be stored into the arguments for the slow path
call.  Those input registers (and all other call-clobbered registers)
are dead *after* the slow path call.

You are injecting your filter call while those input registers are
still live.  They will be next used by the fast-path store.

That is a very significant difference.

>> No, we generate code for a constant esp, as if by gcc's
>> -mno-push-args option. We have reserved TCG_STATIC_CALL_ARGS_SIZE
>> bytes of stack for the arguments (which is actually larger than
>> necessary for any of the tcg targets).
>
> As this is done only in the TB prologue, do you mean that the TCG
> will never generate the equivalent of a push *followed* by a memory
> store/load? Our host esp will never point to a last stacked word
> issued by the translation of a TCG op?

TCG will never generate a push for an argument register.  The only
push outside of the prologue is to store the return address for a
jmp, a "call" returning to a different address.

r~
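To make the hazard concrete, the emitted fast path with an injected
call would look roughly like this (a hand-written sketch; the
register choices are illustrative, not what the allocator would
actually pick):

    ; ...TLB compare...
    jne   slow_path             ; taken on TLB miss
    mov   rdi, <addrlo>         ; marshal the filter arguments
    mov   rsi, <datalo>
    call  filter_store_memop    ; compiled C code may clobber rax, rcx,
                                ; rdx, rsi, rdi, r8-r11 -- possibly the
                                ; very registers holding addr/data
    mov   [L1], <datalo>        ; fast-path store now reads a clobbered
                                ; register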
* Re: x86 TCG helpers clobbered registers
  2020-12-05 12:38     ` Richard Henderson
@ 2020-12-07 10:10       ` Stephane Duverger
  2020-12-08 21:18         ` Richard Henderson
  0 siblings, 1 reply; 7+ messages in thread
From: Stephane Duverger @ 2020-12-07 10:10 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Paolo Bonzini

On Sat, Dec 05, 2020 at 06:38:25AM -0600, Richard Henderson wrote:
> The difference is that the slow path is aware that there are input
> registers that are live, containing data (addrlo, addrhi, datalo,
> datahi), which must be stored into the arguments for the slow path
> call. Those input registers (and all other call-clobbered registers)
> are dead *after* the slow path call.
>
> You are injecting your filter call while those input registers are
> still live. They will be next used by the fast-path store.
>
> That is a very significant difference.

Ok. That's why I saved REG_L1 (prepared by tlb_load) for both the
st/ld_direct cases, plus datalo for st_direct only. I saw that datahi
is only used for MO_64 on a 32-bit tcg-target. I understand it better
now, thanks to you.

This leads me to this simple reflection:

If we want to filter every memory access *out of the fast path*, the
most natural place to do so would be store_helper() and load_helper()
in accel/tcg/cputlb.c (as sketched below). By doing so, every target
would benefit from filtering, and even specific helpers using the
cpu_ldst functions would be intercepted. No?

For the remaining fast-path case, it could be interesting to generate
it, this time at the IR level (tlb_load, jne to slow_path, direct
load/store)? Again, every target would benefit from filtering without
the need for a specific fast-path implementation in
tcg/<arch>/tcg-target.c.inc.

Wouldn't that be simpler than the actual mem plugin implementation,
which generates the filter callback *after* the load/store and has
specific extra work for tracking memory accesses performed from
helpers (afaiu)?
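The kind of hook I have in mind is something like this (only a
sketch: mem_store_filter is invented, and the signature follows the
5.x store_helper()):

    /* accel/tcg/cputlb.c -- hypothetical interception point */
    static inline void QEMU_ALWAYS_INLINE
    store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
                 TCGMemOpIdx oi, uintptr_t retaddr, MemOp op)
    {
        if (unlikely(mem_store_filter)) {   /* invented global hook */
            mem_store_filter(env, addr, val, op);
        }
        /* ... existing TLB lookup and actual store ... */
    }

Every target's slow path, as well as the cpu_ldst-style accessors
used by helpers, funnels through here.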
* Re: x86 TCG helpers clobbered registers
  2020-12-07 10:10       ` Stephane Duverger
@ 2020-12-08 21:18         ` Richard Henderson
  2020-12-08 22:39           ` Stephane Duverger
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Henderson @ 2020-12-08 21:18 UTC (permalink / raw)
To: Stephane Duverger, qemu-devel; +Cc: Paolo Bonzini

On 12/7/20 4:10 AM, Stephane Duverger wrote:
> This leads me to this simple reflection:
>
> If we want to filter every memory access *out of the fast path*, the
> most natural place to do so would be store_helper() and
> load_helper() in accel/tcg/cputlb.c. By doing so, every target would
> benefit from filtering, and even specific helpers using the cpu_ldst
> functions would be intercepted. No?
>
> For the remaining fast-path case, it could be interesting to
> generate it, this time at the IR level (tlb_load, jne to slow_path,
> direct load/store)? Again, every target would benefit from filtering
> without the need for a specific fast-path implementation in
> tcg/<arch>/tcg-target.c.inc.
>
> Wouldn't that be simpler than the actual mem plugin implementation,
> which generates the filter callback *after* the load/store and has
> specific extra work for tracking memory accesses performed from
> helpers (afaiu)?

As for modifying store_helper(), the reason not to do it there is
that it misses the fast-path cases.

As for modifying the fast-path cases, the code is quite delicate, and
you run into problems with live registers.  Which could be worked
around in each backend, but... why?

Which naturally suggests instrumentation separate from the above,
which is exactly what we do.

So, no, I don't think it would be simpler any other way.

r~
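For reference, the existing plugin route looks like this minimal
sketch against the public plugin API (the filtering logic itself is
left empty; only stores are instrumented):

    #include <qemu-plugin.h>

    QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

    /* called back on every instrumented memory access */
    static void vcpu_mem(unsigned int cpu_index, qemu_plugin_meminfo_t info,
                         uint64_t vaddr, void *udata)
    {
        if (qemu_plugin_mem_is_store(info)) {
            /* filter/record the store here */
        }
    }

    /* instrument every instruction of every translated TB */
    static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
    {
        size_t n = qemu_plugin_tb_n_insns(tb);
        for (size_t i = 0; i < n; i++) {
            struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
            qemu_plugin_register_vcpu_mem_cb(insn, vcpu_mem,
                                             QEMU_PLUGIN_CB_NO_REGS,
                                             QEMU_PLUGIN_MEM_W, NULL);
        }
    }

    QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                               const qemu_info_t *info,
                                               int argc, char **argv)
    {
        qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
        return 0;
    }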
* Re: x86 TCG helpers clobbered registers
  2020-12-08 21:18         ` Richard Henderson
@ 2020-12-08 22:39           ` Stephane Duverger
  0 siblings, 0 replies; 7+ messages in thread
From: Stephane Duverger @ 2020-12-08 22:39 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Paolo Bonzini

On Tue, Dec 08, 2020 at 03:18:54PM -0600, Richard Henderson wrote:
> As for modifying the fast-path cases, the code is quite delicate,
> and you run into problems with live registers. Which could be
> worked around in each backend, but... why?

I was perhaps thinking that working at the IR level would prevent
these live-register issues, especially since vCPU TLB handling seems
to be tcg-target agnostic. But I do understand your position, I have
no patch suggesting a viable alternative implementation, and most of
all I don't want to take more of your time.

I appreciated the discussion and your help, Richard. Thanks again.