[Qemu-devel] [RFC] Static instrumentation (aka guest code tracing)


* [Qemu-devel] [RFC] Static instrumentation (aka guest code tracing)
@ 2010-08-03 21:42 Lluís
  2010-11-26 19:06 ` Paul Brook
  0 siblings, 1 reply; 5+ messages in thread
From: Lluís @ 2010-08-03 21:42 UTC
  To: qemu-devel; +Cc: Stefan Hajnoczi, Yufei Chen, Eduardo Cruz, Jun Koi

Ok, sorry for the delay.

Here's a "report" on the current status. Please comment if you think any
decision has gone down the wrong path. Also, if you send me patches,
I'll happily push them into the repository.

Quick status summary
--------------------

 * minimal set of instrumentation points in place for testing (FETCH, VMEM)
 * examples at ./backdoor/examples and ./instrument/examples
 * code available at:
   https://projects.gso.ac.upc.edu/projects/qemu-instrument
   git clone https://code.gso.ac.upc.edu/git/qemu-instrument

How instrumentation currently works
-----------------------------------

Instrumentation points take the form of preprocessor macro calls. The user
defines each of these in a separate file (selected at configure time).
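
A minimal sketch of that mechanism, not the code in the repository. The
INSTR_GEN_FETCH / INSTR_GEN_VMEM names borrow the INSTR_GEN_* prefix used
elsewhere in this message, but their exact spelling and arguments are
assumptions, and the "user file" is inlined here instead of being selected
at configure time:

```c
#include <stdint.h>

/* Counters standing in for whatever the user's instrumentation does. */
unsigned user_fetch_events;
unsigned user_vmem_events;

/* --- Hypothetical user-provided file (would be chosen at configure time) --- */
#define INSTR_GEN_FETCH(vaddr, size) \
    do { (void)(vaddr); (void)(size); user_fetch_events++; } while (0)
#define INSTR_GEN_VMEM(vaddr, size, is_write) \
    do { (void)(vaddr); (void)(size); (void)(is_write); user_vmem_events++; } while (0)

/* --- Fallback: a point the user did not define compiles to nothing --- */
#ifndef INSTR_GEN_FETCH
#define INSTR_GEN_FETCH(vaddr, size) do { } while (0)
#endif
#ifndef INSTR_GEN_VMEM
#define INSTR_GEN_VMEM(vaddr, size, is_write) do { } while (0)
#endif

/* --- Toy translator: the points sit where guest code is decoded --- */
void toy_translate_insn(uint64_t pc, unsigned len,
                        int accesses_mem, uint64_t mem_addr)
{
    INSTR_GEN_FETCH(pc, len);
    if (accesses_mem) {
        INSTR_GEN_VMEM(mem_addr, 4, 0 /* read */);
    }
}
```

Because the points are macros, an unused point costs nothing at run time;
the price is that the translator must be rebuilt to change them.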

Each CPU has an "instrumentation state" variable that the host can change
dynamically (e.g., by defining a backdoor that calls the instrumentation
control API). The number of states is defined by the user.

The macros are called at code generation time, where the user can check if a
specific instrumentation state is active on the current CPU (assuming
'cpu_single_env' points to the cpu object that originated the request for
disassembling the current instruction).

Changing the instrumentation state triggers TB flushes to allow for new
disassembly calls to take into account the new instrumentation state.

As of now, all CPUs must have the same state (see below).
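
To illustrate the flush-on-change behaviour, here is a toy model. Every
name in it (ToyTB, toy_instr_set_state, ...) is made up for the example;
it only mimics the idea of recording the state at translation time and
discarding translations when the (currently global) state changes:

```c
#include <stdbool.h>
#include <stddef.h>

enum { INSTR_STATE_OFF = 0, INSTR_STATE_TRACE = 1 };

typedef struct ToyTB {
    unsigned long pc;   /* guest address the block was translated from */
    int instr_state;    /* state that was active at translation time */
    bool valid;
} ToyTB;

#define TOY_TB_MAX 8
ToyTB toy_tbs[TOY_TB_MAX];
int toy_instr_state = INSTR_STATE_OFF;  /* one state shared by all CPUs */
unsigned toy_translations;              /* number of "translations" done */

/* Drop every cached translation. */
void toy_tb_flush(void)
{
    for (size_t i = 0; i < TOY_TB_MAX; i++) {
        toy_tbs[i].valid = false;
    }
}

/* Changing the state invalidates code generated under the old one. */
void toy_instr_set_state(int new_state)
{
    if (new_state != toy_instr_state) {
        toy_instr_state = new_state;
        toy_tb_flush();
    }
}

/* Return the cached block for pc, "translating" it on a miss. */
ToyTB *toy_tb_find(unsigned long pc)
{
    ToyTB *slot = NULL;
    for (size_t i = 0; i < TOY_TB_MAX; i++) {
        if (toy_tbs[i].valid && toy_tbs[i].pc == pc) {
            return &toy_tbs[i];
        }
        if (!toy_tbs[i].valid && slot == NULL) {
            slot = &toy_tbs[i];
        }
    }
    if (slot == NULL) {          /* cache full: start over */
        toy_tb_flush();
        slot = &toy_tbs[0];
    }
    slot->pc = pc;
    slot->instr_state = toy_instr_state;  /* checked once, at generation time */
    slot->valid = true;
    toy_translations++;
    return slot;
}
```

The state is consulted only inside toy_tb_find, so executing an already
translated block never pays for a state check.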

What is lacking
---------------

1) immediate end of TB on backdoor instruction

I use backdoor instructions to control the instrumentation state from the guest,
triggering a call to a host-side code helper associated with the backdoor
instruction.

For this to work when controlling instrumentation state, the disassembly of an
instrumentation backdoor must immediately end the current TB.

The problem is that calling 'end_eob' (i386) produces code that infinitely
reexecutes that backdoor instruction.

2) instrumenting i386 is extremely time-consuming (for the developer)

As my work is not tied to a specific target architecture, I was thinking of
shifting into PPC, as the ISA is pretty regular and that would certainly make
the process easier by just patching a small set of places in the code.

3) per-CPU instrumentation state

The goal is to achieve minimal performance impact when executing TBs: no
instrumentation state checks when executing TBs (checks are performed at TB
generation time), and negligible performance impact when executing
non-instrumented TBs (it remains to be checked whether the modifications
described below have any performance impact).

The original idea was to expand the arrays holding TBs ('tbs' and
'tb_phys_hash') into 2-dimensional arrays, where the first dimension would
contain one entry for each possible instrumentation state.

When a CPU looks up a TB, the TB is searched for (or added) in the array for
that CPU's current state.

If the CPU-specific state changes, 'tb_jmp_cache' is flushed and lookups will
continue wherever they must according to the current state.
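
A rough sketch of the 2-dimensional idea, under heavy simplification: a
direct-mapped table stands in for 'tb_phys_hash', a tiny per-CPU array
stands in for 'tb_jmp_cache', and all type and function names are
hypothetical:

```c
#include <stddef.h>

#define TOY_INSTR_STATES 2
#define TOY_HASH_SIZE    16
#define TOY_JMP_CACHE    4

typedef struct ToyTB {
    unsigned long pc;
    int used;
} ToyTB;

typedef struct ToyCPU {
    int instr_state;                   /* per-CPU instrumentation state */
    ToyTB *jmp_cache[TOY_JMP_CACHE];   /* per-CPU fast path */
} ToyCPU;

/* First dimension: one table per instrumentation state. */
ToyTB toy_tb_table[TOY_INSTR_STATES][TOY_HASH_SIZE];

static size_t toy_slot(unsigned long pc, size_t n)
{
    return (size_t)(pc >> 2) % n;
}

/* A state change only flushes this CPU's fast-path cache. */
void toy_cpu_set_instr_state(ToyCPU *cpu, int state)
{
    if (cpu->instr_state != state) {
        cpu->instr_state = state;
        for (size_t i = 0; i < TOY_JMP_CACHE; i++) {
            cpu->jmp_cache[i] = NULL;
        }
    }
}

ToyTB *toy_tb_lookup(ToyCPU *cpu, unsigned long pc)
{
    ToyTB **fast = &cpu->jmp_cache[toy_slot(pc, TOY_JMP_CACHE)];
    if (*fast != NULL && (*fast)->pc == pc) {
        return *fast;
    }
    /* Slow path: search only the table for this CPU's current state. */
    ToyTB *tb = &toy_tb_table[cpu->instr_state][toy_slot(pc, TOY_HASH_SIZE)];
    if (!tb->used || tb->pc != pc) {
        tb->pc = pc;       /* "translate"; a colliding entry is evicted */
        tb->used = 1;
    }
    *fast = tb;
    return tb;
}
```

Two CPUs in different states get different translations of the same guest
address, and neither evicts the other's code.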

It is still unclear if PageDesc should also contain an array of 'first_tb' or if
'l1_map' should be 2-dimensional; I still have to look into that code in more
detail to see the feasibility and performance costs of each one.

4) KVM

I've performed tests only on i386-linux-user, but backdoor instructions and
calls to the instrumentation control API should switch from KVM to softmmu (and
disabling all instrumentation should jump back to KVM).

I don't really know what would happen right now.

What needs to be decided
------------------------

1) instrumentation points

Which static instrumentation points must be present, and which arguments
they should have in order to provide a target-agnostic interface.

The current example points are:

FETCH(vaddress, size, used_registers, defined_registers)
VMEM(vaddress, size, read_or_write)
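
One possible, purely illustrative way to give these two points
target-agnostic argument types. Encoding the register sets as 64-bit
masks and logging events to a fixed array are assumptions of this
sketch, not something the current code is claimed to do:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit n set means "register number n"; numbering is left to the target. */
#define TOY_REG(n) (UINT64_C(1) << (n))

typedef enum { TOY_EV_FETCH, TOY_EV_VMEM } ToyEvKind;

typedef struct ToyEvent {
    ToyEvKind kind;
    uint64_t vaddr;
    uint32_t size;
    uint64_t used_regs;     /* FETCH only */
    uint64_t defined_regs;  /* FETCH only */
    int is_write;           /* VMEM only */
} ToyEvent;

#define TOY_LOG_MAX 16
ToyEvent toy_log[TOY_LOG_MAX];
size_t toy_log_len;

/* Reserve the next log entry, or NULL when the log is full. */
static ToyEvent *toy_log_next(void)
{
    return toy_log_len < TOY_LOG_MAX ? &toy_log[toy_log_len++] : NULL;
}

void toy_fetch(uint64_t vaddr, uint32_t size,
               uint64_t used_regs, uint64_t defined_regs)
{
    ToyEvent *e = toy_log_next();
    if (e != NULL) {
        *e = (ToyEvent){ TOY_EV_FETCH, vaddr, size, used_regs, defined_regs, 0 };
    }
}

void toy_vmem(uint64_t vaddr, uint32_t size, int is_write)
{
    ToyEvent *e = toy_log_next();
    if (e != NULL) {
        *e = (ToyEvent){ TOY_EV_VMEM, vaddr, size, 0, 0, is_write };
    }
}
```

For example, an instruction at 0x1000 that reads registers 1 and 2 and
writes register 3 would be reported as
toy_fetch(0x1000, 4, TOY_REG(1) | TOY_REG(2), TOY_REG(3)).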

2) instrumentation from code helpers

It may be unavoidable to add a second set of calls to user-provided
macros for instrumenting from code helpers, as these must not generate code
but instead call the user-provided instrumentation code helpers.

Another option would be to re-define INSTR_GEN_* into a plain function call to
the user macro.

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


* Re: [Qemu-devel] [RFC] Static instrumentation (aka guest code tracing)
  2010-08-03 21:42 [Qemu-devel] [RFC] Static instrumentation (aka guest code tracing) Lluís
@ 2010-11-26 19:06 ` Paul Brook
  2010-11-26 20:19   ` Lluís
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Brook @ 2010-11-26 19:06 UTC
  To: qemu-devel; +Cc: Stefan Hajnoczi, Yufei Chen, Lluís, Eduardo Cruz, Jun Koi

> 2) instrumenting i386 is extremely time-consuming (for the developer)
> 
> As my work is not tied to a specific target architecture, I was thinking of
> shifting into PPC, as the ISA is pretty regular and that would certainly
> make the process easier by just patching a small set of places in the
> code.
> 
>... 
> The current example points are:
> 
> FETCH(vaddress, size, used_registers, defined_registers)

Duplicating the insn decoder to determine which registers are accessed is not 
a maintainable solution. Likewise requiring separate tracing hooks be added to 
the existing decoders is extremely unlikely to be a feasible long-term 
solution. Any solution that tries to separate CPU instrumentation/tracing
from code generation is IMO fundamentally flawed and will rapidly bitrot 
beyond usefulness.

I'd also posit that instrumenting changes in state is of very limited use if
you don't know what the new value is.

You almost certainly want to do this using the equivalent of a memory 
watchpoint on the CPUState structure.

Paul


* Re: [Qemu-devel] [RFC] Static instrumentation (aka guest code tracing)
  2010-11-26 19:06 ` Paul Brook
@ 2010-11-26 20:19   ` Lluís
  2010-11-26 21:33     ` Paul Brook
  0 siblings, 1 reply; 5+ messages in thread
From: Lluís @ 2010-11-26 20:19 UTC
  To: Paul Brook; +Cc: Stefan Hajnoczi, Yufei Chen, qemu-devel, Eduardo Cruz, Jun Koi

Paul Brook writes:

>> 2) instrumenting i386 is extremely time-consuming (for the developer)
>> 
>> As my work is not tied to a specific target architecture, I was thinking of
>> shifting into PPC, as the ISA is pretty regular and that would certainly
>> make the process easier by just patching a small set of places in the
>> code.
>> 
>> ... 
>> The current example points are:
>> 
>> FETCH(vaddress, size, used_registers, defined_registers)

> Duplicating the insn decoder to determine which registers are accessed is not 
> a maintainable solution.

Right. On ISAs like PowerPC this can be solved much more easily, but on
x86 the implementation was based on manually finding all uses of the
register arrays and adding the required call to mark the register as
used or defined.

Instead I could "hide" these structures in CPUState (not that I can
really do that in C), and provide two accessors that will do the job
instead.
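
Something along these lines, as a toy sketch only: ToyDisasCtx and the
gen_reg_* accessors are hypothetical names, and the "code generation" is
reduced to a counter so the example stays self-contained:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NB_REGS 32

/* Per-instruction translation context (hypothetical). */
typedef struct ToyDisasCtx {
    uint32_t used;      /* registers read by the current instruction */
    uint32_t defined;   /* registers written by the current instruction */
    unsigned n_ops;     /* stand-in for the ops that would be generated */
} ToyDisasCtx;

/* Only way the translator is supposed to read a guest register. */
void gen_reg_read(ToyDisasCtx *s, unsigned reg)
{
    assert(reg < TOY_NB_REGS);
    s->used |= UINT32_C(1) << reg;
    s->n_ops++;   /* the real thing would emit a load/move op here */
}

/* Only way the translator is supposed to write a guest register. */
void gen_reg_write(ToyDisasCtx *s, unsigned reg)
{
    assert(reg < TOY_NB_REGS);
    s->defined |= UINT32_C(1) << reg;
    s->n_ops++;   /* the real thing would emit a store/move op here */
}

/* Translating "add rd, rs1, rs2" then records the sets as a side effect. */
void toy_translate_add(ToyDisasCtx *s, unsigned rd, unsigned rs1, unsigned rs2)
{
    gen_reg_read(s, rs1);
    gen_reg_read(s, rs2);
    gen_reg_write(s, rd);
}
```

The instrumentation then never needs a second decoder: the used/defined
sets fall out of the same calls that generate the code.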

> Likewise requiring separate tracing hooks be added to the existing
> decoders is extremely unlikely to be a feasible long-term
> solution.

You mean having to modify each "translate.c"? The worst event to handle
is instruction fetch on x86. Memory accesses are already automatically
handled by simply including a header that wraps the tcg_gen_qemu_ld/st
functions, and other events like privilege level change are very
localized, so bitrotting is much harder there.
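
The wrapping trick looks roughly like this. The sketch uses a made-up
toy_gen_qemu_ld32 in place of the real tcg_gen_qemu_ld/st functions, whose
signatures are deliberately not reproduced here:

```c
unsigned toy_ops_emitted;   /* ops produced by the "real" generator */
unsigned toy_vmem_points;   /* times the VMEM point was reached */

/* Stand-in for the real code generator entry point. */
void toy_gen_qemu_ld32(int dst_tmp, int addr_tmp)
{
    (void)dst_tmp;
    (void)addr_tmp;
    toy_ops_emitted++;
}

/* --- contents of the hypothetical wrapping header --- */
static inline void instr_gen_qemu_ld32(int dst_tmp, int addr_tmp)
{
    toy_vmem_points++;                      /* the instrumentation point */
    toy_gen_qemu_ld32(dst_tmp, addr_tmp);   /* still the real function here */
}
/* From here on, every use of the original name goes through the wrapper. */
#define toy_gen_qemu_ld32(dst, addr) instr_gen_qemu_ld32((dst), (addr))
/* --- end of wrapping header --- */

/* Unmodified translator code, compiled after including the header. */
void toy_translate_load(int rd, int ra)
{
    toy_gen_qemu_ld32(rd, ra);
}
```

The translator source itself is untouched; only the include is added,
which is why this particular event is cheap to keep in sync.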

> Any solution that tries to separate CPU instrumentation/tracing
> from code generation is IMO fundamentally flawed and will rapidly
> bitrot beyond usefulness.

That's fundamentally correct, but I think only for certain events
and architectures.

As I said, this could be solved by forcing the programmer to use some
well-known interface for accessing, e.g., registers.

> I'd also posit that instrumenting changes in state is of very limited use if
> you don't know what the new value is.

I don't understand what you mean here.

> You almost certainly want to do this using the equivalent of a memory 
> watchpoint on the CPUState structure.

Sorry, do what?

Thanks,
        Lluis


* Re: [Qemu-devel] [RFC] Static instrumentation (aka guest code tracing)
  2010-11-26 20:19   ` Lluís
@ 2010-11-26 21:33     ` Paul Brook
  2010-11-29 15:04       ` Lluís
  0 siblings, 1 reply; 5+ messages in thread
From: Paul Brook @ 2010-11-26 21:33 UTC
  To: Lluís; +Cc: Stefan Hajnoczi, Yufei Chen, qemu-devel, Eduardo Cruz, Jun Koi

> > Likewise requiring separate tracing hooks be added to the existing
> > decoders is extremely unlikely to be a feasible long-term
> > solution.
> 
> You mean having to modify each "translate.c"? The worst event to handle
> is instruction fetch on x86.

Instruction fetches are trivial, you just intercept calls to ld*_code.

> > I'd also posit that instrumenting changes in state is of very limited use
> > if you don't know what the new value is.
> 
> I don't understand what you mean here.

Your proposed FETCH macro instrumented which registers are modified by an
insn, but not the actual values about to be written to those registers.

> > You almost certainly want to do this using the equivalent of a memory
> > watchpoint on the CPUState structure.
> 
> Sorry, do what?

All guest register values are held in the CPUState structure. So to instrument 
accesses to guest state you just need to intercept TCG accesses to this 
structure, either via explicit ld/st ops, or via a global_mem. To a first 
approximation you can probably get away with just the latter.
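
As a toy illustration of the global_mem variant (this is not TCG's API;
every name below is invented, and all operands are assumed to be globals
rather than temporaries):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ToyCPUState {
    uint32_t regs[4];
    uint32_t pc;
} ToyCPUState;

#define TOY_MAX_GLOBALS 8

/* A "global" is a named value backed by a fixed offset in ToyCPUState. */
typedef struct ToyGlobal {
    size_t env_offset;
    const char *name;
} ToyGlobal;

ToyGlobal toy_globals[TOY_MAX_GLOBALS];
int toy_nb_globals;

uint32_t toy_read_mask;     /* globals read by the current instruction */
uint32_t toy_written_mask;  /* globals written by the current instruction */

/* Register a global; returns its handle, or -1 when the table is full. */
int toy_global_mem_new(size_t env_offset, const char *name)
{
    if (toy_nb_globals >= TOY_MAX_GLOBALS) {
        return -1;
    }
    toy_globals[toy_nb_globals].env_offset = env_offset;
    toy_globals[toy_nb_globals].name = name;
    return toy_nb_globals++;
}

void toy_insn_start(void)
{
    toy_read_mask = 0;
    toy_written_mask = 0;
}

/* An op generator; handles are assumed valid. The interception lives
 * here, once, rather than in every target's decoder. */
void toy_gen_add(int dst, int a, int b)
{
    toy_read_mask |= (UINT32_C(1) << a) | (UINT32_C(1) << b);
    toy_written_mask |= UINT32_C(1) << dst;
}
```

The target's decoder keeps calling the op generators as before; the
bookkeeping happens underneath it.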

Paul


* Re: [Qemu-devel] [RFC] Static instrumentation (aka guest code tracing)
  2010-11-26 21:33     ` Paul Brook
@ 2010-11-29 15:04       ` Lluís
  0 siblings, 0 replies; 5+ messages in thread
From: Lluís @ 2010-11-29 15:04 UTC
  To: Paul Brook; +Cc: Stefan Hajnoczi, Yufei Chen, qemu-devel, Eduardo Cruz, Jun Koi

Paul Brook writes:

>> > Likewise requiring separate tracing hooks be added to the existing
>> > decoders is extremely unlikely to be a feasible long-term
>> > solution.
>> 
>> You mean having to modify each "translate.c"? The worst event to handle
>> is instruction fetch on x86.

> Instruction fetches are trivial, you just intercept calls to ld*_code.

For instruction fetch I mean:

* instruction address and length
* instruction opcode "type" (I'd like to have some kind of a rough
  generic opcode type: ALU, branch, etc.)
* set of used/defined registers

For the loads, ld*_code is not accurate, as the code might call it
multiple times, even on positions after the actual instruction, or
repeat calls for the same position. It's much easier (I think) to do
the kind of trivial math I do on x86 with "s->pc".
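
A toy decoder showing the difference. The two-opcode "ISA", the
toy_ldub_code helper and the ToyDisasCtx layout are all invented for the
example; only the pc_start / s->pc arithmetic mirrors what I mean:

```c
#include <stdint.h>

typedef struct ToyDisasCtx {
    const uint8_t *code;   /* guest code bytes */
    uint64_t base;         /* guest address of code[0] */
    uint64_t pc;           /* next byte to decode */
    unsigned ld_calls;     /* how many times the fetch helper ran */
} ToyDisasCtx;

/* Stand-in for an ld*_code style helper. */
uint8_t toy_ldub_code(ToyDisasCtx *s, uint64_t addr)
{
    s->ld_calls++;
    return s->code[addr - s->base];
}

/* Toy ISA: 0x0f is a prefix byte; opcode 0x01 takes one immediate byte. */
unsigned toy_disas_insn(ToyDisasCtx *s)
{
    uint64_t pc_start = s->pc;

    uint8_t op = toy_ldub_code(s, s->pc++);
    if (op == 0x0f) {
        op = toy_ldub_code(s, s->pc++);
    }
    /* Lookahead: reads a byte without consuming it. */
    (void)toy_ldub_code(s, s->pc);
    if (op == 0x01) {
        /* Same position read a second time, now consumed. */
        (void)toy_ldub_code(s, s->pc++);
    }
    /* Address is pc_start; length is simply how far pc advanced. */
    return (unsigned)(s->pc - pc_start);
}
```

Counting or summing the helper calls would report four bytes for what is
a three-byte instruction (0x0f 0x01 imm8), while the pc difference is exact.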

>> > I'd also posit that instrumenting changes in state is of very limited use
>> > if you don't know what the new value is.
>> 
>> I don't understand what you mean here.

> Your proposed FETCH macro instrumented which registers are modified by an
> insn, but not the actual values about to be written to those registers.

Right. Register values are not part of what I was looking for. Still,
with an accurate set of used/defined registers, you would have the
option of gathering their values at "commit" time by, e.g., adding a new
tracing event after generating code for an instruction, with a CPUState
argument.

The combinations of data you might want to gather from guest code are
nearly infinite, and that's why I wanted to provide the minimum set of
information that, when you enable instrumentation of tracing events,
will let you gather all the rest (like values).

>> > You almost certainly want to do this using the equivalent of a memory
>> > watchpoint on the CPUState structure.
>> 
>> Sorry, do what?

> All guest register values are held in the CPUState structure. So to instrument 
> accesses to guest state you just need to intercept TCG accesses to this 
> structure, either via explicit ld/st ops, or via a global_mem. To a first 
> approximation you can probably get away with just the latter.

I think I understand what you mean, and it would certainly simplify the
implementation, but it depends on being able to efficiently identify
which of all the memory accesses are directed at the interesting CPUState
fields. That is easy when using functions like "tcg_gen_ld/st_i64", but
becomes almost impossible when the code generator starts using multiple
tcg operations to calculate the address of a single access to a CPUState
field.

But indeed this would be much easier to maintain for the common
case, although probably slower:

- you have to register all the possible TCGv_ptr that can point to a
  CPUState (if you use something else, it will no longer work, although
  I don't know if there is any such case anywhere)
- register the "interesting" offsets (which can be multiple and
  non-consecutive: general-purpose registers, control registers in x86,
  mtrr, etc.)
- "decode" from those offsets which specific field is being accessed

This is supposing that CPUState field access detection is embedded into
"tcg-op.h", so that I have all the info after translating, not during
translated code execution (as then I would be unable to have all the
fetch info before actually executing translated code).
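
The three steps could look roughly like this at translation time. This
is a sketch with an invented ToyCPUState layout and invented names; the
"is the base pointer env?" question is reduced to a flag supplied by the
caller:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct ToyCPUState {
    uint64_t regs[8];
    uint64_t eip;
    uint64_t cr[5];
    uint64_t scratch;
} ToyCPUState;

typedef enum { FIELD_NONE, FIELD_GPR, FIELD_CR } ToyField;

/* One "interesting" byte range inside ToyCPUState. */
typedef struct ToyRange {
    size_t start;       /* first byte of the range */
    size_t end;         /* one past the last byte */
    size_t elem_size;   /* size of each element in the range */
    ToyField field;
} ToyRange;

/* Registered ranges: multiple and non-consecutive. */
static const ToyRange toy_ranges[] = {
    { offsetof(ToyCPUState, regs),
      offsetof(ToyCPUState, regs) + 8 * sizeof(uint64_t),
      sizeof(uint64_t), FIELD_GPR },
    { offsetof(ToyCPUState, cr),
      offsetof(ToyCPUState, cr) + 5 * sizeof(uint64_t),
      sizeof(uint64_t), FIELD_CR },
};

/* Decode a (base, constant offset) access seen while generating code.
 * Returns FIELD_NONE for other bases or uninteresting offsets. */
ToyField toy_decode_env_access(int base_is_env, size_t offset, unsigned *index)
{
    if (!base_is_env) {
        return FIELD_NONE;
    }
    for (size_t i = 0; i < sizeof(toy_ranges) / sizeof(toy_ranges[0]); i++) {
        const ToyRange *r = &toy_ranges[i];
        if (offset >= r->start && offset < r->end) {
            *index = (unsigned)((offset - r->start) / r->elem_size);
            return r->field;
        }
    }
    return FIELD_NONE;
}
```

This only works while the offset is a translation-time constant, which is
exactly the limitation mentioned above for computed addresses.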

Still, if this is what you meant, I think it's way better than the
time-consuming and easily bitrotting task that I did.

The only drawback is that it would force all targets to produce the
fetch event after the whole instruction translation, and thus be forced
to do the x86 trick of moving buffers at the end of the instruction
translation. I hoped that this would be necessary only for x86, and that
other architectures would let me do it more easily before starting the
real translation (e.g., using the translation tables found in PowerPC).

Thanks,
        Lluis
