Instruction virtual address in TCG Plugins


* Instruction virtual address in TCG Plugins
@ 2023-11-13 18:33 Mikhail Tyutin
  2023-11-13 20:58 ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Mikhail Tyutin @ 2023-11-13 18:33 UTC (permalink / raw)
  To: qemu-devel@nongnu.org
  Cc: Richard Henderson, Alex Bennée, erdnaxe@crans.org,
	ma.mandourr@gmail.com

Greetings,

What is the right way to get the virtual address of either a translation block or an instruction inside a TCG plugin? Does
the plugin API allow that, or does it need an extension?

So far I use qemu_plugin_tb_vaddr() inside my block translation callback to get the block's virtual address and then
pass it as the 'userdata' argument to qemu_plugin_register_vcpu_tb_exec_cb(), so that I can use it later during code execution.
This works well for user-mode emulation, but sometimes yields incorrect addresses in system-mode emulation.
I suspect this is because the guest OS changes memory mappings, which changes the virtual address of that block.

I also looked at the gen_empty_udata_cb() function and considered extending the plugin API to pass the program counter
value as an additional callback argument. I thought that would always give me a valid virtual address for an instruction.
Unfortunately, I didn't find a way to get the value of that register in an architecture-agnostic way (it is the 'pc' member of
the CPUArchState structure).

---
Mikhail


* Re: Instruction virtual address in TCG Plugins
  2023-11-13 18:33 Instruction virtual address in TCG Plugins Mikhail Tyutin
@ 2023-11-13 20:58 ` Alex Bennée
  2023-11-14  9:14   ` Mikhail Tyutin
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2023-11-13 20:58 UTC (permalink / raw)
  To: Mikhail Tyutin
  Cc: qemu-devel@nongnu.org, Richard Henderson, erdnaxe@crans.org,
	ma.mandourr@gmail.com

Mikhail Tyutin <m.tyutin@yadro.com> writes:

> Greetings,
>
> What is the right way to get virtual address of either translation block or instruction inside of TCG plugin? Does
> plugin API allow that or it needs some extension?
>
> So far I use qemu_plugin_tb_vaddr() inside of my block translation callback to get block virtual address and then
> pass it as 'userdata' argument into qemu_plugin_register_vcpu_tb_exec_cb(). I use it later during code execution.
> It works well for user-mode emulation, but sometimes leads to
> incorrect addresses in system-mode emulation.

You can use qemu_plugin_insn_vaddr and qemu_plugin_insn_haddr. But you're
right: something can be translated under one vaddr and be executed under
another with overlapping mappings. The haddr should be stable though, I think.
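
A minimal sketch of the idea of identifying blocks by haddr rather than
vaddr. It is plain C with no QEMU headers: the table, record_exec() and the
addresses are hypothetical stand-ins for whatever a plugin would store in its
own callbacks, and it assumes the haddr of a block really does stay fixed
while the guest remaps it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BLOCKS 64

/* One record per block, keyed by its host address (assumed stable). */
struct block_stat {
    uint64_t haddr;       /* key: host address of the block */
    uint64_t last_vaddr;  /* most recent guest virtual address observed */
    unsigned execs;       /* number of recorded executions */
    unsigned remaps;      /* times the vaddr differed from the previous one */
};

static struct block_stat table[MAX_BLOCKS];
static size_t n_blocks;

/* Find the record for haddr, creating it on first use. */
static struct block_stat *lookup(uint64_t haddr, uint64_t vaddr)
{
    for (size_t i = 0; i < n_blocks; i++) {
        if (table[i].haddr == haddr) {
            return &table[i];
        }
    }
    assert(n_blocks < MAX_BLOCKS);  /* toy table: fixed capacity */
    table[n_blocks] = (struct block_stat){ .haddr = haddr, .last_vaddr = vaddr };
    return &table[n_blocks++];
}

/* Record one execution of the block at haddr, seen under vaddr. */
static void record_exec(uint64_t haddr, uint64_t vaddr)
{
    struct block_stat *s = lookup(haddr, vaddr);

    if (s->execs > 0 && s->last_vaddr != vaddr) {
        s->remaps++;  /* same code, different virtual mapping */
    }
    s->last_vaddr = vaddr;
    s->execs++;
}
```

With this layout two executions of the same haddr under different vaddrs
land in one record, so per-block statistics are not split by aliasing.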

> I suspect it is because of memory mappings by guest OS that changes virtual addresses for that block.
>
> I also looked at gen_empty_udata_cb() function and considered to extend plugin API to pass a program counter
> value as additional callback argument. I thought it would always give me valid virtual address of an instruction.
> Unfortunately, I didn't find a way to get value of that register in architecture agnostic way (it is 'pc' member in
> CPUArchState structure).

When we merge the register api you should be able to do that. Although
during testing I realised that PC acted funny compared to everything
else because we don't actually update the shadow register every
instruction.

>
> ---
> Mikhail

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* RE: Instruction virtual address in TCG Plugins
  2023-11-13 20:58 ` Alex Bennée
@ 2023-11-14  9:14   ` Mikhail Tyutin
  2023-11-14 10:57     ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Mikhail Tyutin @ 2023-11-14  9:14 UTC (permalink / raw)
  To: Alex Bennée
  Cc: qemu-devel@nongnu.org, Richard Henderson, erdnaxe@crans.org,
	ma.mandourr@gmail.com

> > What is the right way to get virtual address of either translation block or instruction inside of TCG plugin? Does
> > plugin API allow that or it needs some extension?
> >
> > So far I use qemu_plugin_tb_vaddr() inside of my block translation callback to get block virtual address and then
> > pass it as 'userdata' argument into qemu_plugin_register_vcpu_tb_exec_cb(). I use it later during code execution.
> > It works well for user-mode emulation, but sometimes leads to
> > incorrect addresses in system-mode emulation.
> 
> You can use qemu_plugin_insn_vaddr and qemu_plugin_insn_haddr. But you're
> right: something can be translated under one vaddr and be executed under
> another with overlapping mappings. The haddr should be stable though, I think.

As far as I can see, haddr is fine and can be used to identify blocks. However, if I have the haddr at block execution time and
want to know the vaddr, there is no API to get that mapping. Maybe it can be extracted from the software MMU, but I
have no clue where to start.

> > I suspect it is because of memory mappings by guest OS that changes virtual addresses for that block.
> >
> > I also looked at gen_empty_udata_cb() function and considered to extend plugin API to pass a program counter
> > value as additional callback argument. I thought it would always give me valid virtual address of an instruction.
> > Unfortunately, I didn't find a way to get value of that register in architecture agnostic way (it is 'pc' member in
> > CPUArchState structure).
> 
> When we merge the register api you should be able to do that. Although
> during testing I realised that PC acted funny compared to everything
> else because we don't actually update the shadow register every
> instruction.

We implemented a similar API to read registers (by coincidence, I posted this patch at the same time as the API you
mentioned) and I observe similar behavior. As far as I can see, CPU state is only updated between executed translation
blocks. Switching to 'singlestep' mode helps to fix that, but the execution overhead is huge.

There is also the block 'chaining' mechanism, which likely contributes to corrupted block vaddrs inside callbacks.
My guess is that the 'pc' value for chained blocks points to the first block of the entire chain. Unfortunately, it is very
hard to debug, because I can only see block chains when I run a whole Linux guest OS. Does QEMU have a small test
application that triggers a long enough chain of translation blocks?

These complexities make me think of injecting code into translation blocks to compute the actual block
vaddr at execution time. The problem here is to find a variable into which I can load 'pc' at the start of a translation block.


* Re: Instruction virtual address in TCG Plugins
  2023-11-14  9:14   ` Mikhail Tyutin
@ 2023-11-14 10:57     ` Alex Bennée
  2023-11-21 16:39       ` Mikhail Tyutin
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2023-11-14 10:57 UTC (permalink / raw)
  To: Mikhail Tyutin
  Cc: qemu-devel@nongnu.org, Richard Henderson, erdnaxe@crans.org,
	ma.mandourr@gmail.com

Mikhail Tyutin <m.tyutin@yadro.com> writes:

>> > What is the right way to get virtual address of either translation block or instruction inside of TCG plugin? Does
>> > plugin API allow that or it needs some extension?
>> >
>> > So far I use qemu_plugin_tb_vaddr() inside of my block translation callback to get block virtual address and then
>> > pass it as 'userdata' argument into qemu_plugin_register_vcpu_tb_exec_cb(). I use it later during code execution.
>> > It works well for user-mode emulation, but sometimes leads to
>> > incorrect addresses in system-mode emulation.
>> 
>> You can use qemu_plugin_insn_vaddr and qemu_plugin_insn_haddr. But you're
>> right: something can be translated under one vaddr and be executed under
>> another with overlapping mappings. The haddr should be stable though, I think.
>
> As far as I see haddr is ok and can be used to identify blocks. However, if I have haddr at block execution phase and
> I want to know vaddr, there is no API to get such mapping. Maybe it is possible to extract from software MMU, but I
> have no clue where to start with.

The translator doesn't know (at least since CF_PCREL), because the whole
point of that change was to avoid re-translating the same code from
multiple mappings. However, we do have the ability to resolve a PC at
fault time, so we could expose that to an execution callback.
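
A toy model of why a translation shared between mappings cannot carry a
vaddr, and what can still be recovered at run time. It is only a sketch
under assumptions: 4 KiB pages, page-aligned aliases, and hypothetical names
(toy_tb, toy_translate, toy_resolve_vaddr) that are not QEMU APIs. The only
thing the "translation" keeps is the offset inside the page; the virtual page
has to come from the address the guest is executing at.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12u  /* assumption: 4 KiB guest pages */
#define PAGE_MASK ((UINT64_C(1) << PAGE_BITS) - 1)

/* What a mapping-independent translation can still know about its origin. */
struct toy_tb {
    uint64_t page_offset;  /* offset of the block inside its page */
};

/* "Translate" a block: keep the in-page offset, drop the virtual page. */
static struct toy_tb toy_translate(uint64_t vaddr_at_translation)
{
    return (struct toy_tb){ .page_offset = vaddr_at_translation & PAGE_MASK };
}

/* Rebuild the vaddr from any run-time address inside the executing page. */
static uint64_t toy_resolve_vaddr(struct toy_tb tb, uint64_t runtime_addr)
{
    assert(tb.page_offset <= PAGE_MASK);
    return (runtime_addr & ~PAGE_MASK) | tb.page_offset;
}
```

A vaddr captured once at translation time corresponds to calling
toy_resolve_vaddr() with the translation-time page only, which is why it goes
stale as soon as the same code runs under a second mapping.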

>> > I suspect it is because of memory mappings by guest OS that changes virtual addresses for that block.
>> >
>> > I also looked at gen_empty_udata_cb() function and considered to extend plugin API to pass a program counter
>> > value as additional callback argument. I thought it would always give me valid virtual address of an instruction.
>> > Unfortunately, I didn't find a way to get value of that register in architecture agnostic way (it is 'pc' member in
>> > CPUArchState structure).
>> 
>> When we merge the register api you should be able to do that. Although
>> during testing I realised that PC acted funny compared to everything
>> else because we don't actually update the shadow register every
>> instruction.
>
> We implemented similar API to read registers (by coincidence, I posted this patch at the same time as the API you
> mentioned) and I observe similar behavior. As far as I see, CPU state is only updated in between of executed translation
> blocks. Switching to 'singlestep' mode helps to fix that, but execution overhead is huge.
>
> There is also blocks 'chaining' mechanism which is likely contributes to corrupted blocks vaddr inside of callbacks.
> My guess is that 'pc' value for those chained blocks points to the first block of entire chain. Unfortunately, It is very
> hard to debug, because I can only see block chains when I run whole Linux guest OS. Does Qemu has small test
> application to trigger long enough chain of translation blocks?

No, all registers should be resolved by the end of any block. There is
currently no optimisation of register usage between TBs. If you are
seeing PC corruption that would be a bug - but fundamentally things
would break pretty quickly if tb_lookup() and friends didn't have an
accurate PC.

As for block chains, any moderately complex loop should trigger chaining.

> Having those complexities makes me think to inject appropriate code into translation blocks to compute actual block
> vaddr at execution stage. The problem here is to find a variable where
> I can load 'pc' at start of translation block.

I think it would be pretty easy to ensure there is a rectified PC value
written before calling any callback - arguably the QEMU_PLUGIN_CB_R_REGS
flag should do this.
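
A small model of the behaviour discussed above, as a sketch rather than a
description of TCG internals. The types and functions (lazy_cpu, lazy_tb,
lazy_exec_tb, remember_pc) are hypothetical, instructions are assumed to be
fixed-size, and the "pc" field is only stored at the end of a block unless
the caller asks for it to be written before each callback.

```c
#include <assert.h>
#include <stdint.h>

struct lazy_cpu { uint64_t pc; };  /* stored architectural state */

struct lazy_tb {
    uint64_t start;      /* address of the first instruction */
    unsigned n_insns;    /* number of instructions in the block */
    unsigned insn_size;  /* assumption: fixed-size instructions */
};

typedef void (*insn_cb)(const struct lazy_cpu *cpu, void *udata);

/*
 * Run one block, calling cb before every instruction. Without sync_pc the
 * stored pc is written only when the block finishes; with sync_pc it is
 * rectified before each callback.
 */
static void lazy_exec_tb(struct lazy_cpu *cpu, const struct lazy_tb *tb,
                         int sync_pc, insn_cb cb, void *udata)
{
    assert(tb->n_insns > 0);
    for (unsigned i = 0; i < tb->n_insns; i++) {
        if (sync_pc) {
            cpu->pc = tb->start + (uint64_t)i * tb->insn_size;
        }
        cb(cpu, udata);
    }
    cpu->pc = tb->start + (uint64_t)tb->n_insns * tb->insn_size;
}

/* Example callback: remember the last pc value it was shown. */
static void remember_pc(const struct lazy_cpu *cpu, void *udata)
{
    *(uint64_t *)udata = cpu->pc;
}
```

In this model a callback on the last instruction sees the block-entry pc
when sync_pc is off and the instruction's own address when it is on, while
the value at the end of the block is the same in both cases.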

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* RE: Instruction virtual address in TCG Plugins
  2023-11-14 10:57     ` Alex Bennée
@ 2023-11-21 16:39       ` Mikhail Tyutin
  2023-11-21 17:24         ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Mikhail Tyutin @ 2023-11-21 16:39 UTC (permalink / raw)
  To: Alex Bennée
  Cc: qemu-devel@nongnu.org, Richard Henderson, erdnaxe@crans.org,
	ma.mandourr@gmail.com

> >> > I suspect it is because of memory mappings by guest OS that changes virtual addresses for that block.
> >> >
> >> > I also looked at gen_empty_udata_cb() function and considered to extend plugin API to pass a program counter
> >> > value as additional callback argument. I thought it would always give me valid virtual address of an instruction.
> >> > Unfortunately, I didn't find a way to get value of that register in architecture agnostic way (it is 'pc' member in
> >> > CPUArchState structure).
> >>
> >> When we merge the register api you should be able to do that. Although
> >> during testing I realised that PC acted funny compared to everything
> >> else because we don't actually update the shadow register every
> >> instruction.
> >
> > We implemented similar API to read registers (by coincidence, I posted this patch at the same time as the API you
> > mentioned) and I observe similar behavior. As far as I see, CPU state is only updated in between of executed translation
> > blocks. Switching to 'singlestep' mode helps to fix that, but execution overhead is huge.
> >
> > There is also blocks 'chaining' mechanism which is likely contributes to corrupted blocks vaddr inside of callbacks.
> > My guess is that 'pc' value for those chained blocks points to the first block of entire chain. Unfortunately, It is very
> > hard to debug, because I can only see block chains when I run whole Linux guest OS. Does Qemu has small test
> > application to trigger long enough chain of translation blocks?
> 
> No all registers should be resolved by the end of any block. There is
> currently no optimisation of register usage between TBs. If you are
> seeing PC corruption that would be a bug - but fundamentally things
> would break pretty quick if tb_lookup() and friends didn't have an
> accurate PC.

I managed to find the root cause of the corrupted addresses in plugin callbacks.
There were basically two problems:

1. Memory IO operations force TCG to create special translation blocks to
process that memory load/store operation. The plugin is notified about
this translation block as well, but instrumentation callbacks other than
memory ones are silently ignored. To get this right, the plugin has to match
the instruction execution callback from the previous TB to the memory callback
from that special TB. The fix was to expose the internal 'memOnly' TB flag to
the plugin so that it can handle such TBs differently.

2. The other problem is related to interrupt handling. Since we can only insert
pre-callbacks on instructions, the plugin does not know whether an instruction
was actually executed or was interrupted by an interrupt or exception. In fact,
it mistakenly treats all interrupted instructions as executed. Adding an API
to receive interrupt notifications, and handling them appropriately, fixes
the problem.

I will send those patches for review shortly. Thank you for dissuading me
from going in the wrong direction!


* Re: Instruction virtual address in TCG Plugins
  2023-11-21 16:39       ` Mikhail Tyutin
@ 2023-11-21 17:24         ` Alex Bennée
  2023-11-22 12:28           ` Mikhail Tyutin
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2023-11-21 17:24 UTC (permalink / raw)
  To: Mikhail Tyutin
  Cc: qemu-devel@nongnu.org, Richard Henderson, erdnaxe@crans.org,
	ma.mandourr@gmail.com

Mikhail Tyutin <m.tyutin@yadro.com> writes:

>> >> > I suspect it is because of memory mappings by guest OS that changes virtual addresses for that block.
>> >> >
>> >> > I also looked at gen_empty_udata_cb() function and considered to extend plugin API to pass a program counter
>> >> > value as additional callback argument. I thought it would always give me valid virtual address of an instruction.
>> >> > Unfortunately, I didn't find a way to get value of that register in architecture agnostic way (it is 'pc' member in
>> >> > CPUArchState structure).
>> >>
>> >> When we merge the register api you should be able to do that. Although
>> >> during testing I realised that PC acted funny compared to everything
>> >> else because we don't actually update the shadow register every
>> >> instruction.
>> >
>> > We implemented similar API to read registers (by coincidence, I posted this patch at the same time as the API you
>> > mentioned) and I observe similar behavior. As far as I see, CPU state is only updated in between of executed translation
>> > blocks. Switching to 'singlestep' mode helps to fix that, but execution overhead is huge.
>> >
>> > There is also blocks 'chaining' mechanism which is likely contributes to corrupted blocks vaddr inside of callbacks.
>> > My guess is that 'pc' value for those chained blocks points to the first block of entire chain. Unfortunately, It is very
>> > hard to debug, because I can only see block chains when I run whole Linux guest OS. Does Qemu has small test
>> > application to trigger long enough chain of translation blocks?
>> 
>> No all registers should be resolved by the end of any block. There is
>> currently no optimisation of register usage between TBs. If you are
>> seeing PC corruption that would be a bug - but fundamentally things
>> would break pretty quick if tb_lookup() and friends didn't have an
>> accurate PC.
>
> I managed to root cause source of corrupted addresses in plugin callbacks.
> There were basically 2 problems:
>
> 1. Memory IO operations force TCG to create special translation blocks to
> process that memory load/store operation. The plugin gets notification for
> this translation block as well, but instrumentation callbacks other than
> memory ones are silently ignored. To make it correct, the plugin has to match
> instruction execution callback from previous TB to memory callback from that
> special TB. The fix was to expose internal ‘memOnly’ TB flag to the plugin to
> handle such TBs differently.

Are you talking about the CF_MEMI_ONLY compile flag? We added this to
avoid double counting executed instructions. Has there been a clash with
the other changes to always cpu_recompile_io? This was a change added to
fix: https://gitlab.com/qemu-project/qemu/-/issues/1866

Richard is going to look at optimising the cpu_recompile_io code so we
"lock in" a shortened translation once we discover a block is doing
MMIO. See https://linaro.atlassian.net/browse/QEMU-605 for an overview.

> 2. Another problem is related to interrupts handling. Since we can insert pre-
> callback on instructions only, the plugin is not aware if instruction is
> actually executed or interrupted by an interrupt or exception. In fact, it
> mistakenly interprets all interrupted instructions as executed. Adding API
> to receive interrupt notification and appropriate handling of it fixes
> the problem.

We don't process any interrupts until the start of each block so no
asynchronous IRQs should interrupt execution. However it is possible
that any given instruction could generate a synchronous exception so if
you need a precise count of execution you need to instrument every
single instruction. With enough knowledge the plugin could avoid
instrumenting stuff that will never fault but that relies on baking
additional knowledge into the plugin.

Generally it's only memory operations that can fault (although I guess
FPU and some more esoteric integer ops can too).
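
To illustrate the counting argument, here is a self-contained sketch (not
plugin code) comparing two strategies on a block where one instruction raises
a synchronous exception. The names count_tb, count_per_block and
count_per_insn are hypothetical; fault_at is the index of the faulting
instruction, or -1 for no fault.

```c
#include <assert.h>

struct count_tb { unsigned n_insns; };

/* Per-block counting: add the whole block length when the block is entered. */
static unsigned count_per_block(const struct count_tb *tb, int fault_at)
{
    (void)fault_at;  /* this strategy never learns about the fault */
    return tb->n_insns;
}

/*
 * Per-instruction counting: a pre-instruction callback bumps the counter,
 * then the instruction runs; a fault at index fault_at leaves the block.
 */
static unsigned count_per_insn(const struct count_tb *tb, int fault_at)
{
    unsigned count = 0;

    assert(fault_at < (int)tb->n_insns);
    for (unsigned i = 0; i < tb->n_insns; i++) {
        count++;  /* pre-instruction callback fires */
        if ((int)i == fault_at) {
            break;  /* synchronous exception: rest of the block is skipped */
        }
    }
    return count;
}
```

For a five-instruction block faulting at index 2, the per-block strategy
reports 5 and the per-instruction strategy reports 3. The faulting
instruction is still included in that 3, because the callback runs before the
instruction does, which is the "interrupted instructions counted as executed"
effect described earlier in the thread.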

>
> I will send those patches for review shortly and thank you for dissuading me
> from going to wrong direction!

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* RE: Instruction virtual address in TCG Plugins
  2023-11-21 17:24         ` Alex Bennée
@ 2023-11-22 12:28           ` Mikhail Tyutin
  0 siblings, 0 replies; 7+ messages in thread
From: Mikhail Tyutin @ 2023-11-22 12:28 UTC (permalink / raw)
  To: Alex Bennée
  Cc: qemu-devel@nongnu.org, Richard Henderson, erdnaxe@crans.org,
	ma.mandourr@gmail.com

> > 1. Memory IO operations force TCG to create special translation blocks to
> > process that memory load/store operation. The plugin gets notification for
> > this translation block as well, but instrumentation callbacks other than
> > memory ones are silently ignored. To make it correct, the plugin has to match
> > instruction execution callback from previous TB to memory callback from that
> > special TB. The fix was to expose internal ‘memOnly’ TB flag to the plugin to
> > handle such TBs differently.
> 
> Are you talking about the CF_MEMI_ONLY compile flag? We added this to
> avoid double counting executed instructions. Has there been a clash with
> the other changes to always cpu_recompile_io? This was a change added to
> fix: https://gitlab.com/qemu-project/qemu/-/issues/1866

Yes, that's it. The qemu_plugin_tb structure has a 'mem_only' field for those blocks.
I only added an API for a plugin to read this flag.
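
A rough sketch of how a plugin might use such a flag, written as standalone C
so it does not depend on an API that is not merged. The seen_tb type and
count_insns() are hypothetical; each trace entry stands for one executed
block and the number of instruction callbacks it would fire if instrumented
normally. The assumption is that the instruction in a mem-only block was
already counted in the block that preceded it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct seen_tb {
    unsigned n_insns;  /* instructions in the block */
    bool mem_only;     /* block exists only to redo a memory access */
};

/* Count guest instructions over a trace, skipping mem-only blocks. */
static unsigned long count_insns(const struct seen_tb *trace, size_t len)
{
    unsigned long total = 0;

    assert(trace != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (trace[i].mem_only) {
            continue;  /* already counted in the previous block */
        }
        total += trace[i].n_insns;
    }
    return total;
}
```

Memory callbacks from a mem-only block would then be attributed to the last
instruction counted before it instead of being treated as a new instruction.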

 
> > 2. Another problem is related to interrupts handling. Since we can insert pre-
> > callback on instructions only, the plugin is not aware if instruction is
> > actually executed or interrupted by an interrupt or exception. In fact, it
> > mistakenly interprets all interrupted instructions as executed. Adding API
> > to receive interrupt notification and appropriate handling of it fixes
> > the problem.
> 
> We don't process any interrupts until the start of each block so no
> asynchronous IRQs should interrupt execution. However it is possible
> that any given instruction could generate a synchronous exception so if
> you need a precise count of execution you need to instrument every
> single instruction. With enough knowledge the plugin could avoid
> instrumenting stuff that will never fault but that relies on baking
> additional knowledge into the plugin.
> 
> Generally its only memory operations that can fault (although I guess
> FPU and some more esoteric integer ops can).

That matches my observation. I see interrupts either on a TB boundary
(e.g. timers) or on memory load instructions.

