[Qemu-devel] save compiled qemu traces.

* [Qemu-devel] save compiled qemu traces.
@ 2013-12-09  6:36 Xin Tong
  2013-12-09 15:25 ` Alex Bennée
  2013-12-09 15:32 ` Peter Maydell
  0 siblings, 2 replies; 8+ messages in thread
From: Xin Tong @ 2013-12-09  6:36 UTC (permalink / raw)
  To: qemu-devel

Does anyone have profiles showing how much time QEMU spends translating
instructions? QEMU has neither a baseline interpreter nor trace-granularity
translation, so I imagine it must spend quite a bit of time translating
instructions.

Would it be possible for QEMU to avoid some translations by attaching a
signature (e.g. a hash) to every translated basic block and reusing
translated blocks based on that signature as much as possible? Reuse
could come from rerunning programs, or from the same libraries being
statically linked into different programs.

This could end up saving some translation time.
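
A minimal sketch of the idea above, assuming a content-addressed cache keyed by a SHA-256 digest of the guest code bytes. The names `ContentCache` and `fake_translate` are hypothetical and do not correspond to anything in QEMU; the "translation" is a stand-in that only records that work was done.

```python
import hashlib


def fake_translate(guest_bytes: bytes) -> bytes:
    """Stand-in for a real translator (hypothetical): just tags the input."""
    return b"host:" + guest_bytes


class ContentCache:
    """Cache translated blocks by the hash of their guest bytes, not their address."""

    def __init__(self):
        self.blocks = {}        # digest -> translated code
        self.translations = 0   # how many times the translator actually ran

    def lookup(self, guest_bytes: bytes) -> bytes:
        # The digest is computed on every lookup, hit or miss.
        key = hashlib.sha256(guest_bytes).digest()
        if key not in self.blocks:
            self.blocks[key] = fake_translate(guest_bytes)
            self.translations += 1
        return self.blocks[key]


cache = ContentCache()
block = bytes.fromhex("4889e5c3")      # arbitrary example bytes
cache.lookup(block)                    # first sighting: translated
cache.lookup(block)                    # same content seen again: reused
cache.lookup(bytes.fromhex("90c3"))    # different content: translated
print(cache.translations)
```

Note that in this sketch every lookup pays for a hash of the block's bytes, whether or not it hits, so any saving depends on how that cost compares with translating the block.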

Thank you,
Xin


* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-09  6:36 [Qemu-devel] save compiled qemu traces Xin Tong
@ 2013-12-09 15:25 ` Alex Bennée
  2013-12-12  4:07   ` Xin Tong
  2013-12-09 15:32 ` Peter Maydell
  1 sibling, 1 reply; 8+ messages in thread
From: Alex Bennée @ 2013-12-09 15:25 UTC (permalink / raw)
  To: Xin Tong; +Cc: qemu-devel

trent.tong@gmail.com writes:

> Does anyone have profiles on how much time QEMU spends in translating
> instructions. QEMU does not have a baseline interpreter nor does it
> translate on trace-granularity.  so i imagine QEMU must spend quite a bit
> of time translating instructions.

Not as much as you'd think. The translation stage isn't very complex and
blocks only get translated once (modulo exceptions and self-modifying
code). If you run perf on your task you should see that most of the time
is spent in the generated code. If not, please send the test case to the
list.

I suspect a more useful statistic would be a breakdown of the
translation blocks showing which ones are most heavily used, so that we
can examine whether QEMU has done as good a job as it can of translating
them.

> Is it possible for QEMU to obviate some of the translations by attaching a
> signature (e.g. a hash) with every translated basic block and try to reuse
> translated basic block based on the signature as much as possible ? Reuses
> can be a result of rerunning programs or same libraries statically linked
> to programs.

You're right, a translation cache *could* save some translation time,
especially if you end up translating the same program over and over
again. Having said that, you might find that the cost of computing the
checksum outweighs any speed-up from skipping the translation. After all,
QEMU normally only needs to look at each subject instruction once.
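
The trade-off described here can be written down as a rough cost model. This is only a sketch under stated assumptions: the hash is computed on every block lookup, a hit skips exactly one translation, and all costs share one arbitrary time unit. The function names and the numbers are made up for illustration and are not measurements.

```python
def net_saving_per_block(hit_rate, translate_cost, hash_cost):
    """Expected time saved per block lookup; negative means the cache is a loss.

    Assumed model: the hash is paid on every lookup, and a hit avoids
    one translation.
    """
    return hit_rate * translate_cost - hash_cost


def break_even_hit_rate(translate_cost, hash_cost):
    """Hit rate above which the cache starts to pay off under this model."""
    return hash_cost / translate_cost


# Illustrative numbers only: translation taken to cost 5x the hash.
print(break_even_hit_rate(translate_cost=10.0, hash_cost=2.0))
print(net_saving_per_block(hit_rate=0.1, translate_cost=10.0, hash_cost=2.0))
```

Under this model the scheme only helps when the hit rate exceeds the ratio of hashing cost to translation cost, which is why measured numbers are needed before drawing a conclusion.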

Using QEMU linux-user for cross-building would be the obvious pain
point. However, as the usual use case is building for embedded platforms,
most users are happy just to fully utilise their 80-core build machines
in preference to having a farm of slow embedded processors.

> This could end up saving some translation time.

I think you would need to do some performance analysis and come up with
some numbers before making that assumption.

Cheers,

-- 
Alex Bennée
QEMU/KVM Hacker for Linaro

* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-09 15:25 ` Alex Bennée
@ 2013-12-12  4:07   ` Xin Tong
  2013-12-12  4:51     ` Xin Tong
  2013-12-12 13:37     ` Laurent Desnogues
  0 siblings, 2 replies; 8+ messages in thread
From: Xin Tong @ 2013-12-12  4:07 UTC (permalink / raw)
  To: Alex Bennée; +Cc: QEMU Developers

See my comments below.

On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> trent.tong@gmail.com writes:
>
>> Does anyone have profiles on how much time QEMU spends in translating
>> instructions. QEMU does not have a baseline interpreter nor does it
>> translate on trace-granularity.  so i imagine QEMU must spend quite a bit
>> of time translating instructions.
>
> Not as much as you'd think. The translation stage isn't very complex and
> blocks only get translated once (modulo exceptions and self modifying
> code). If you run perf on your task you should see most of the time is
> spent in the generated code - if not please send the test case to the
> list.

I took a profile running SPEC CPU2006 403.gcc with the test input on an
Intel Xeon machine. We spent only 44.76% of the time in the code cache
(i.e. about 1.34M CPU_CLK_UNHALTED samples in the code cache), while
40.97% of the time was spent in qemu-system-x86_64 itself. Some of the hot
functions in qemu-system-x86_64 are listed below.

*You are right*: we do not spend much time in the translation routines.
Instead we spend a significant amount of time in address translation
code.

CPU_CLK_UNHALTED       %  Symbol/Functions
         1340512  100.00  anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)

CPU_CLK_UNHALTED       %  Symbol/Functions
          314655   25.64  address_space_translate_internal
          308942   25.18  cpu_x86_exec
          128922   10.51  ldq_phys
           92345    7.53  cpu_x86_handle_mmu_fault
           62456    5.09  tlb_set_page
           49332    4.02  memory_region_is_ram
           31055    2.53  helper_le_ldq_mmu
           22048    1.80  memory_region_get_ram_addr
           19223    1.57  memory_region_section_get_iotlb
           15873    1.29  tcg_optimize
           14526    1.18  get_page_addr_code
           12601    1.03  memory_region_get_ram_ptr
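
To put the listed symbols on the same scale as the code-cache figure, the sample counts can be converted to shares of the whole run. The sketch below assumes the anon region is the code cache and accounts for 44.76% of all samples, as stated in this message. The grouping of symbols into "translator" and "memory path" is an assumption made for illustration, not something the profile itself reports.

```python
# Sample counts copied from the profile above.
code_cache_samples = 1340512      # the "anon" region
code_cache_share = 0.4476         # its stated share of the whole run

qemu_symbols = {
    "address_space_translate_internal": 314655,
    "cpu_x86_exec": 308942,
    "ldq_phys": 128922,
    "cpu_x86_handle_mmu_fault": 92345,
    "tlb_set_page": 62456,
    "memory_region_is_ram": 49332,
    "helper_le_ldq_mmu": 31055,
    "memory_region_get_ram_addr": 22048,
    "memory_region_section_get_iotlb": 19223,
    "tcg_optimize": 15873,
    "get_page_addr_code": 14526,
    "memory_region_get_ram_ptr": 12601,
}

# Total samples in the run, implied by the code-cache row.
total_samples = code_cache_samples / code_cache_share

# Assumed grouping (illustrative): the only listed symbol that belongs to the
# translator proper, versus symbols on the guest memory access path.
translator = ["tcg_optimize"]
memory_path = [
    name for name in qemu_symbols
    if name not in ("tcg_optimize", "cpu_x86_exec")
]


def share_of_run(names):
    """Fraction of all samples in the run attributed to the given symbols."""
    return sum(qemu_symbols[n] for n in names) / total_samples


print(f"translator:  {share_of_run(translator):.2%}")
print(f"memory path: {share_of_run(memory_path):.2%}")
```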

Xin



* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-12  4:07   ` Xin Tong
@ 2013-12-12  4:51     ` Xin Tong
  2013-12-12 13:37     ` Laurent Desnogues
  1 sibling, 0 replies; 8+ messages in thread
From: Xin Tong @ 2013-12-12  4:51 UTC (permalink / raw)
  To: Alex Bennée; +Cc: QEMU Developers

On Thu, Dec 12, 2013 at 1:07 PM, Xin Tong <trent.tong@gmail.com> wrote:
> see questions below.
>
> On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> trent.tong@gmail.com writes:
>>
>>> Does anyone have profiles on how much time QEMU spends in translating
>>> instructions. QEMU does not have a baseline interpreter nor does it
>>> translate on trace-granularity.  so i imagine QEMU must spend quite a bit
>>> of time translating instructions.
>>
>> Not as much as you'd think. The translation stage isn't very complex and
>> blocks only get translated once (modulo exceptions and self modifying
>> code). If you run perf on your task you should see most of the time is
>> spent in the generated code - if not please send the test case to the
>> list.
>
> I took a profile running speccpu2006 403.gcc with test input on a
> intel xeon machine. we only spent 44.76% of the time in the code cache
> (i.e. 13M ticks in the code cache), while 40.97% of the time is spent
> in the qemu-system-x86_64. some of the hot functions in
> qemu-system-x86_64 are listed below.
>
> *you are right* we do not spend much time in translation routines.
> instead we spend significant amount of time in address translation
> code.
>
> CPU_CLK_UNHALTED %     Symbol/Functions
> 1340512         100.00 anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)
>
>
> CPU_CLK_UNHALTED %     Symbol/Functions
> 314655           25.64 address_space_translate_internal
> 308942           25.18 cpu_x86_exec
> 128922           10.51 ldq_phys
> 92345           7.53 cpu_x86_handle_mmu_fault
> 62456           5.09 tlb_set_page
> 49332           4.02 memory_region_is_ram
> 31055           2.53 helper_le_ldq_mmu
> 22048           1.80 memory_region_get_ram_addr
> 19223           1.57 memory_region_section_get_iotlb
> 15873           1.29 tcg_optimize
> 14526           1.18 get_page_addr_code
> 12601           1.03 memory_region_get_ram_ptr

However, being able to reuse cached blocks based on their content may be
a step towards reusing translated blocks across multiple invocations of
QEMU.

* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-12  4:07   ` Xin Tong
  2013-12-12  4:51     ` Xin Tong
@ 2013-12-12 13:37     ` Laurent Desnogues
  1 sibling, 0 replies; 8+ messages in thread
From: Laurent Desnogues @ 2013-12-12 13:37 UTC (permalink / raw)
  To: Xin Tong; +Cc: Alex Bennée, QEMU Developers

On Thu, Dec 12, 2013 at 5:07 AM, Xin Tong <trent.tong@gmail.com> wrote:
> see questions below.
>
> On Tue, Dec 10, 2013 at 12:25 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> trent.tong@gmail.com writes:
>>
>>> Does anyone have profiles on how much time QEMU spends in translating
>>> instructions. QEMU does not have a baseline interpreter nor does it
>>> translate on trace-granularity.  so i imagine QEMU must spend quite a bit
>>> of time translating instructions.
>>
>> Not as much as you'd think. The translation stage isn't very complex and
>> blocks only get translated once (modulo exceptions and self modifying
>> code). If you run perf on your task you should see most of the time is
>> spent in the generated code - if not please send the test case to the
>> list.
>
> I took a profile running speccpu2006 403.gcc with test input on a
> intel xeon machine. we only spent 44.76% of the time in the code cache
> (i.e. 13M ticks in the code cache), while 40.97% of the time is spent
> in the qemu-system-x86_64. some of the hot functions in
> qemu-system-x86_64 are listed below.
>
> *you are right* we do not spend much time in translation routines.
> instead we spend significant amount of time in address translation
> code.
>
> CPU_CLK_UNHALTED %     Symbol/Functions
> 1340512         100.00 anon (tgid:7106 range:0x7f97815ca000-0x7f979a692000)
>
>
> CPU_CLK_UNHALTED %     Symbol/Functions
> 314655           25.64 address_space_translate_internal
> 308942           25.18 cpu_x86_exec
> 128922           10.51 ldq_phys
> 92345           7.53 cpu_x86_handle_mmu_fault
> 62456           5.09 tlb_set_page
> 49332           4.02 memory_region_is_ram
> 31055           2.53 helper_le_ldq_mmu
> 22048           1.80 memory_region_get_ram_addr
> 19223           1.57 memory_region_section_get_iotlb
> 15873           1.29 tcg_optimize
> 14526           1.18 get_page_addr_code
> 12601           1.03 memory_region_get_ram_ptr

You could perhaps redo the same experiment using user-mode QEMU.
That would give you another interesting data point.

Another experiment is booting a kernel: much of that code is likely to
run only once, which should make the code translation functions climb up
the profile.
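
The reasoning behind the kernel-boot suggestion can be sketched as an amortisation model: a block is translated once and then executed some number of times, so the fewer times it runs, the larger the share of its time that goes to translation. The function name and the cost ratio below are assumptions for illustration, not measurements.

```python
def translation_share(translate_cost, exec_cost, executions):
    """Fraction of a block's total time spent translating it.

    Assumed model: one translation, then `executions` runs of the
    generated code at `exec_cost` each.
    """
    return translate_cost / (translate_cost + exec_cost * executions)


# Illustrative costs only: translating is taken to cost 100x one execution.
for n in (1, 100, 10000):
    print(n, round(translation_share(100.0, 1.0, n), 3))
```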


Laurent

* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-09  6:36 [Qemu-devel] save compiled qemu traces Xin Tong
  2013-12-09 15:25 ` Alex Bennée
@ 2013-12-09 15:32 ` Peter Maydell
  2013-12-10  2:41   ` Xin Tong
  2013-12-10 10:04   ` Alex Bennée
  1 sibling, 2 replies; 8+ messages in thread
From: Peter Maydell @ 2013-12-09 15:32 UTC (permalink / raw)
  To: Xin Tong; +Cc: QEMU Developers

On 9 December 2013 06:36, Xin Tong <trent.tong@gmail.com> wrote:
> Is it possible for QEMU to obviate some of the translations by attaching a
> signature (e.g. a hash) with every translated basic block and try to reuse
> translated basic block based on the signature as much as possible ? Reuses
> can be a result of rerunning programs or same libraries statically linked to
> programs.

We already cache translated results. See tb_find_fast()
and tb_find_slow() which do the lookup into the cache.
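
As a rough illustration of a fast-path/slow-path lookup, here is a toy two-level cache. It assumes, as noted elsewhere in this thread, that lookups are keyed by guest address. It is not QEMU's code: the class name, the slot count, and the `translate` callback are all hypothetical, and the real functions check more state than an address.

```python
class AddressKeyedCache:
    """Toy two-level lookup keyed by guest address (not QEMU's actual code)."""

    FAST_SLOTS = 256  # arbitrary size chosen for the example

    def __init__(self, translate):
        self.translate = translate
        self.fast = [None] * self.FAST_SLOTS   # direct-mapped: slot -> (addr, block)
        self.slow = {}                         # full table: addr -> block
        self.stats = {"fast": 0, "slow": 0, "miss": 0}

    def find(self, addr):
        slot = addr % self.FAST_SLOTS
        entry = self.fast[slot]
        if entry is not None and entry[0] == addr:
            self.stats["fast"] += 1
            return entry[1]
        if addr in self.slow:
            self.stats["slow"] += 1
            block = self.slow[addr]
        else:
            self.stats["miss"] += 1
            block = self.translate(addr)
            self.slow[addr] = block
        # Refill the fast slot so the next lookup of this address is cheap.
        self.fast[slot] = (addr, block)
        return block


cache = AddressKeyedCache(translate=lambda addr: f"tb@{addr:#x}")
cache.find(0x1000)   # miss: translated and cached
cache.find(0x1000)   # fast-path hit
cache.find(0x1100)   # miss; maps to the same fast slot and evicts 0x1000 from it
cache.find(0x1000)   # slow-path hit
print(cache.stats)
```

Because the key is an address, identical code at a different address, or in a later run of the emulator, would still be translated again.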

thanks
-- PMM

* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-09 15:32 ` Peter Maydell
@ 2013-12-10  2:41   ` Xin Tong
  2013-12-10 10:04   ` Alex Bennée
  1 sibling, 0 replies; 8+ messages in thread
From: Xin Tong @ 2013-12-10  2:41 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers

tb_find_fast() and tb_find_slow() find translated blocks based on the
guest physical address. I am thinking about finding TBs by content, e.g.
using a hash signature. This could potentially save translations.

Xin




* Re: [Qemu-devel] save compiled qemu traces.
  2013-12-09 15:32 ` Peter Maydell
  2013-12-10  2:41   ` Xin Tong
@ 2013-12-10 10:04   ` Alex Bennée
  1 sibling, 0 replies; 8+ messages in thread
From: Alex Bennée @ 2013-12-10 10:04 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Xin Tong, QEMU Developers


peter.maydell@linaro.org writes:

> On 9 December 2013 06:36, Xin Tong <trent.tong@gmail.com> wrote:
>> Is it possible for QEMU to obviate some of the translations by attaching a
>> signature (e.g. a hash) with every translated basic block and try to reuse
>> translated basic block based on the signature as much as possible ? Reuses
>> can be a result of rerunning programs or same libraries statically linked to
>> programs.
>
> We already cache translated results. See tb_find_fast()
> and tb_find_slow() which do the lookup into the cache.

These are for the current execution context though, aren't they? I
thought Xin was talking about caching translations between invocations
of QEMU.

I suspect address space randomisation would be another wrinkle for any
such scheme, though.
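
A small simulation of the randomisation wrinkle, under stated assumptions: the same library blocks are loaded at a different base address on each run, and two sets that persist across runs stand in for a cache shared between invocations. All names, addresses, and block contents are made up. The sketch also ignores that generated code is likely to embed guest addresses, which a real scheme would have to handle.

```python
import hashlib

# Made-up contents for eight distinct library blocks.
LIBRARY_BLOCKS = [bytes([i, i + 1, 0xC3]) for i in range(8)]


def run_once(base, by_addr, by_content):
    """Simulate one invocation; return (address-keyed hits, content-keyed hits)."""
    addr_hits = content_hits = 0
    for i, code in enumerate(LIBRARY_BLOCKS):
        addr = base + i * 16
        digest = hashlib.sha256(code).digest()
        addr_hits += int(addr in by_addr)
        content_hits += int(digest in by_content)
        by_addr.add(addr)
        by_content.add(digest)
    return addr_hits, content_hits


by_addr, by_content = set(), set()                  # persist across "invocations"
first = run_once(0x400000, by_addr, by_content)     # cold caches
second = run_once(0x7F0000, by_addr, by_content)    # same code, different load base
print(first, second)
```

In this model an address-keyed cache gets no reuse on the second run, while a content-keyed one recognises every block.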

>
> thanks
> -- PMM

-- 
Alex Bennée
QEMU/KVM Hacker for Linaro
