[Qemu-devel] performance monitor

qemu-devel.nongnu.org archive mirror

* [Qemu-devel] performance monitor
@ 2008-01-03 20:36 Clemens Kolbitsch
  2008-01-03 21:29 ` Paul Brook
  0 siblings, 1 reply; 10+ messages in thread
From: Clemens Kolbitsch @ 2008-01-03 20:36 UTC (permalink / raw)
  To: qemu-devel

Hi!
Has anyone ever used some "real" performance monitoring tools (like papiex, 
perfex, pfmon, etc.) on qemu? I'm running Debian Linux and would like to 
time some applications inside qemu, and have tried the perfmon2 kernel patch 
(http://perfmon2.sourceforge.net/) for testing.

Sadly, it does not work... dmesg tells me that the CPU is not identified 
correctly ("unsupported family=6"). Now I am not really sure what type of 
hardware support the monitor relies on (I think PMU is the correct term, but 
I'm not sure about that) and what CPUs are supported (dmesg tells me that 
qemu simulates a Pentium M, but that's probably because I've compiled the 
kernel on my *real* Pentium M).

... Ok, to cut a long question short: Is there any hardware support in qemu 
for doing monitoring (that goes deeper than using "time") and has anyone ever 
tested something that could work?

Thanks!
Clemens


* Re: [Qemu-devel] performance monitor
  2008-01-03 20:36 [Qemu-devel] performance monitor Clemens Kolbitsch
@ 2008-01-03 21:29 ` Paul Brook
  2008-01-03 21:38   ` Clemens Kolbitsch
  0 siblings, 1 reply; 10+ messages in thread
From: Paul Brook @ 2008-01-03 21:29 UTC (permalink / raw)
  To: qemu-devel; +Cc: Clemens Kolbitsch

> ... Ok, to cut a long question short: Is there any hardware support in qemu
> for doing monitoring (that goes deeper than using "time") and has anyone
> ever tested something that could work?

Probably your application wants the performance counters. Qemu doesn't emulate 
those.

Besides which, qemu is not cycle accurate. Any performance measurements you 
make are pretty much meaningless, and bear absolutely no relationship to real 
hardware.

Paul


* Re: [Qemu-devel] performance monitor
  2008-01-03 21:29 ` Paul Brook
@ 2008-01-03 21:38   ` Clemens Kolbitsch
  2008-01-03 22:07     ` Paul Brook
  2008-01-04  8:49     ` Rob Landley
  0 siblings, 2 replies; 10+ messages in thread
From: Clemens Kolbitsch @ 2008-01-03 21:38 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Thursday 03 January 2008 22:29:06 Paul Brook wrote:
> > ... Ok, to cut a long question short: Is there any hardware support in
> > qemu for doing monitoring (that goes deeper than using "time") and has
> > anyone ever tested something that could work?
>
> Probably your application wants the performance counters. Qemu doesn't
> emulate those.
>
> Besides which, qemu is not cycle accurate. Any performance measurements
> you make are pretty much meaningless, and bear absolutely no relationship
> to real hardware.

Thanks for the quick answer Paul! Not really what I wanted to hear, but 
probably true ;-)

Does anyone have an idea on how I can measure performance in qemu to a 
somewhat accurate level? I have modified qemu (the memory handling) and the 
linux kernel and want to find out the penalty this introduced... does anyone 
have any comments / ideas on this?

Thanks!


* Re: [Qemu-devel] performance monitor
  2008-01-03 21:38   ` Clemens Kolbitsch
@ 2008-01-03 22:07     ` Paul Brook
  2008-01-03 22:11       ` Clemens Kolbitsch
  2008-01-04  8:49     ` Rob Landley
  1 sibling, 1 reply; 10+ messages in thread
From: Paul Brook @ 2008-01-03 22:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Clemens Kolbitsch

> Does anyone have an idea on how I can measure performance in qemu to a
> somewhat accurate level? I have modified qemu (the memory handling) and the
> linux kernel and want to find out the penalty this introduced... does
> anyone have any comments / ideas on this?

Short answer is you probably can't. And even if you can I won't believe your 
results unless you've verified them on real hardware :-)

With the exception of some very small embedded cores, modern CPUs have complex 
out of order execution pipelines and multi-level cache hierarchies. It's 
common for performance to be dominated by these secondary factors rather than 
raw instruction throughput.

Exactly what features dominate performance is very application specific. 
Determining which factor dominates is unlikely to be something qemu can help 
with.

However if e.g. you know that for your application there's a good correlation 
between performance and L2 cache misses you could instrument qemu to add 
an L1/L2 cache model. The overhead will be fairly severe (easily 10x slower), 
and completely screw up any realtime measurements. However it would produce 
some useful cache use statistics that you could use to guesstimate actual 
performance. This is similar to how cachegrind works. Obviously if your 
application isn't cache bound then these figures will be meaningless.
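
To make the idea concrete, here is a minimal sketch of the kind of cache model meant above, written in Python for brevity. It is not qemu code: the class name, the cache geometries and the address trace are all made up for illustration, and a real instrumentation would feed it the guest addresses seen by the emulator's load/store path.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy LRU set-associative cache that only counts hits and misses."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One OrderedDict per set; keys are tags, insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Record one access to addr; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used tag
        entries[tag] = True
        return False


# Assumed geometries, chosen only for the example.
l1 = SetAssociativeCache(size_bytes=32 * 1024, line_bytes=64, ways=8)
l2 = SetAssociativeCache(size_bytes=256 * 1024, line_bytes=64, ways=8)

# Hypothetical trace: walk a 64 KiB buffer twice in 4-byte steps.
for _ in range(2):
    for addr in range(0, 64 * 1024, 4):
        if not l1.access(addr):
            l2.access(addr)  # L2 only sees what misses in L1

print("L1 hits/misses:", l1.hits, l1.misses)
print("L2 hits/misses:", l2.hits, l2.misses)
```

The counts say nothing about wall-clock time; they are only useful as input to a guess, and only if the workload really is cache bound.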

Paul


* Re: [Qemu-devel] performance monitor
  2008-01-03 22:07     ` Paul Brook
@ 2008-01-03 22:11       ` Clemens Kolbitsch
  2008-01-03 22:18         ` Paul Brook
  2008-01-03 22:19         ` Laurent Desnogues
  0 siblings, 2 replies; 10+ messages in thread
From: Clemens Kolbitsch @ 2008-01-03 22:11 UTC (permalink / raw)
  To: qemu-devel

On Thursday 03 January 2008 23:07:07 you wrote:
> > Does anyone have an idea on how I can measure performance in qemu to a
> > somewhat accurate level? I have modified qemu (the memory handling) and
> > the linux kernel and want to find out the penalty this introduced... does
> > anyone have any comments / ideas on this?
>
> Short answer is you probably can't. And even if you can I won't believe
> your results unless you've verified them on real hardware :-)
>
> With the exception of some very small embedded cores, modern CPUs have
> complex out of order execution pipelines and multi-level cache hierarchies.
> It's common for performance to be dominated by these secondary factors
> rather than raw instruction throughput.
>
> Exactly what features dominate performance is very application specific.
> Determining which factor dominates is unlikely to be something qemu can
> help with.
>
> However if e.g. you know that for your application there's a good
> correlation between performance and L2 cache misses you could
> instrument qemu to add an L1/L2 cache model. The overhead will be fairly
> severe (easily 10x slower), and completely screw up any realtime
> measurements. However it would produce some useful cache use statistics
> that you could use to guesstimate actual performance. This is similar to
> how cachegrind works. Obviously if your application isn't cache bound then
> these figures will be meaningless.

Well, the measuring I had in mind partly concentrates on TLB misses, page 
faults, etc. (in addition to the cycle measuring). Guess I'll have to 
implement something for myself in qemu :-/

But thanks a lot for helping me out!


* Re: [Qemu-devel] performance monitor
  2008-01-03 22:11       ` Clemens Kolbitsch
@ 2008-01-03 22:18         ` Paul Brook
  2008-01-03 22:21           ` Clemens Kolbitsch
  2008-01-03 22:19         ` Laurent Desnogues
  1 sibling, 1 reply; 10+ messages in thread
From: Paul Brook @ 2008-01-03 22:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Clemens Kolbitsch

> Well, the measuring I had in mind partly concentrates on TLB misses, page
> faults, etc. (in addition to the cycle measuring). Guess I'll have to
> implement something for myself in qemu :-/

Be aware that the TLB qemu uses behaves very differently to a real CPU TLB. If 
you want to get TLB miss statistics you'll need to model a "real" TLB for 
that separately.
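
A separately modelled TLB could look roughly like the sketch below. It is a toy, fully associative LRU model in Python; the 4 KiB page size, the 64-entry capacity and the access pattern are assumptions for illustration and do not describe any particular CPU or qemu's own softmmu TLB.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed 4 KiB pages


class ToyTLB:
    """Fully associative LRU TLB model that counts hits and misses."""

    def __init__(self, entries):
        self.capacity = entries
        self.pages = OrderedDict()  # virtual page number -> placeholder
        self.hits = 0
        self.misses = 0

    def lookup(self, vaddr):
        """Record a translation request for vaddr; return True on a hit."""
        vpn = vaddr // PAGE_SIZE
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # refresh recency
            self.hits += 1
            return True
        self.misses += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # drop the least recently used entry
        self.pages[vpn] = True
        return False


# Hypothetical workload: touch 32 pages round robin, ten times over.
tlb = ToyTLB(entries=64)
for _ in range(10):
    for page in range(32):
        tlb.lookup(page * PAGE_SIZE)

print("TLB hits/misses:", tlb.hits, tlb.misses)
```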

Page faults should be straightforward, but any half-decent guest OS would be 
able to tell you those anyway.
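
As an example of asking the guest OS, a process on a Linux guest can read its own fault counters through `getrusage`. The sketch below uses Python's standard `resource` module (Unix only); the 16 MiB allocation is just a made-up workload, and how many faults it triggers depends on the allocator and kernel.

```python
import resource


def page_faults():
    """Return (minor, major) page-fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


before = page_faults()
buf = bytearray(16 * 1024 * 1024)  # allocating fresh memory may cause minor faults
buf[::4096] = b"\x01" * len(buf[::4096])  # touch one byte in every 4 KiB page
after = page_faults()

minor_delta = after[0] - before[0]
major_delta = after[1] - before[1]
print(f"minor faults: {minor_delta}, major faults: {major_delta}")
```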

Paul


* Re: [Qemu-devel] performance monitor
  2008-01-03 22:18         ` Paul Brook
@ 2008-01-03 22:21           ` Clemens Kolbitsch
  0 siblings, 0 replies; 10+ messages in thread
From: Clemens Kolbitsch @ 2008-01-03 22:21 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

On Thursday 03 January 2008 23:18:58 Paul Brook wrote:
> > Well, the measuring I had in mind partly concentrates on TLB misses, page
> > faults, etc. (in addition to the cycle measuring). Guess I'll have to
> > implement something for myself in qemu :-/
>
> Be aware that the TLB qemu uses behaves very differently to a real CPU TLB.
> If you want to get TLB miss statistics you'll need to model a "real" TLB
> for that separately.

Sure, yes. But I don't even care what it would be like on a real CPU. I just 
want to know the impact it has on the emulated CPU ;-)

> Page faults should be straightforward, but any half-decent guest OS would
> be able to tell you those anyway.

True *g*


* Re: [Qemu-devel] performance monitor
  2008-01-03 22:11       ` Clemens Kolbitsch
  2008-01-03 22:18         ` Paul Brook
@ 2008-01-03 22:19         ` Laurent Desnogues
  1 sibling, 0 replies; 10+ messages in thread
From: Laurent Desnogues @ 2008-01-03 22:19 UTC (permalink / raw)
  To: qemu-devel

On Jan 3, 2008 11:11 PM, Clemens Kolbitsch <clemens.kol@gmx.at> wrote:
>
> Well, the measuring I had in mind partly concentrates on TLB misses, page
> faults, etc. (in addition to the cycle measuring). Guess I'll have to
> implement something for myself in qemu :-/

There's something not clear here:  do you want to measure your kernel
changes or do you want to profile Qemu?

As Paul clearly explained you can't do both :)

If you want to measure kernel performance oprofile is probably worth
looking at.  But you will need the real hardware.

Another option, though much more intrusive, would be to add explicit
performance counters in places you need to look at (this method can
be applied to Qemu too).
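
The explicit-counter approach is easy to sketch. The Python below only illustrates the pattern (in the kernel or in Qemu it would be C counters bumped at the points of interest); `count`, `timed` and `handle_request` are hypothetical names invented for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

counters = defaultdict(int)    # event name -> number of occurrences
elapsed_ns = defaultdict(int)  # region name -> accumulated wall-clock nanoseconds


def count(event, n=1):
    """Bump a named event counter."""
    counters[event] += n


@contextmanager
def timed(region):
    """Accumulate the time spent inside a code region and count its entries."""
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed_ns[region] += time.perf_counter_ns() - start
        counters[region] += 1


def handle_request(size):
    """Hypothetical code path instrumented with explicit counters."""
    with timed("handle_request"):
        count("bytes_seen", size)
        return sum(range(size))


for size in (10, 100, 1000):
    handle_request(size)

for name in sorted(counters):
    print(name, counters[name], elapsed_ns.get(name, 0))
```

The event counts are deterministic and therefore comparable between runs; the accumulated times are subject to all the caveats about timing discussed in this thread.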

And to say it again:  nobody can expect to measure OS performance
on a simulator, unless the simulator is directly derived from the HDL
code written by designers.  At least I would never trust such a
result ;)

Laurent


* Re: [Qemu-devel] performance monitor
  2008-01-03 21:38   ` Clemens Kolbitsch
  2008-01-03 22:07     ` Paul Brook
@ 2008-01-04  8:49     ` Rob Landley
  2008-01-04 15:09       ` Clemens Kolbitsch
  1 sibling, 1 reply; 10+ messages in thread
From: Rob Landley @ 2008-01-04  8:49 UTC (permalink / raw)
  To: qemu-devel; +Cc: Clemens Kolbitsch, Paul Brook

On Thursday 03 January 2008 15:38:02 Clemens Kolbitsch wrote:
> Does anyone have an idea on how I can measure performance in qemu to a
> somewhat accurate level?

hwclock --show > time1
tar xvjf linux-2.6.23.tar.bz2 && cd linux-2.6.23 && make allnoconfig && make 
cd ..
hwclock --show > time2

Do that on host and guest, and you've got a ratio of the performance of qemu 
to your host that should be good to within a few percent.
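
Turning the two timestamp files into a ratio is simple arithmetic. The sketch below assumes the files hold ISO 8601-style timestamps such as `2008-01-04 09:00:00.000000+01:00`; hwclock's output format differs between versions, so the parsing may need adapting. The sample values are made up.

```python
from datetime import datetime


def elapsed_seconds(start_text, end_text):
    """Seconds between two ISO 8601-style timestamps (assumed format)."""
    start = datetime.fromisoformat(start_text.strip())
    end = datetime.fromisoformat(end_text.strip())
    return (end - start).total_seconds()


# Made-up contents of time1/time2 from the host and from the guest.
host = elapsed_seconds("2008-01-04 09:00:00.000000+01:00",
                       "2008-01-04 09:02:30.000000+01:00")
guest = elapsed_seconds("2008-01-04 10:00:00.000000+01:00",
                        "2008-01-04 10:27:30.000000+01:00")

slowdown = guest / host
print(f"host {host:.0f}s, guest {guest:.0f}s, slowdown {slowdown:.1f}x")
```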

> I have modified qemu (the memory handling) and the 
> linux kernel and want to find out the penalty this introduced... does
> anyone have any comments / ideas on this?

If it's something big, you can compare the result in minutes and seconds.  
That's probably the best you're going to do.  (Although really you want 
hwclock --show before and after, and then do the math.  That tunnels out to 
the host system to get its idea of the time, which doesn't get thrown off by 
timer interrupt delivery (as a signal) getting deferred by the host system's 
scheduler.  Of course the fact that hwclock _takes_ a second or so to read 
the clock is a bit of a downer, but anything that takes less than a minute or 
so to run isn't going to give you a very accurate time because the 
performance of qemu isn't constant, and your results are going to skew all 
over the place.)

Especially for small things, the performance varies from run to run.  Start by 
imagining qemu as having the mother of all page fault latencies.  The cost of 
faulting code into the L2 cache includes dynamic recompilation, which is 
expensive.

Worse, when the dynamic recompilation buffer fills up it blanks the whole 
thing, and recompiles every new page it hits one at a time until the buffer 
fills up again.  (What is it these days, 16 megs of translated code before it 
resets?)  No LRU or anything, no cache management at _all_, just "when the 
bucket fills up, dump it and start over".  (Well, that's what it did back 
around the last stable release anyway.  It has been almost a year since then, 
so maybe it's changed.  I've been busy with other things and not really 
keeping track of changes that didn't affect what I could and couldn't get to 
run.)
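
A toy model shows why "dump it and start over" makes run time so sensitive to the working set. This is not qemu's code, just a Python sketch of the policy as described above: blocks are numbered, the buffer capacity is counted in blocks rather than megabytes, and the trace is invented.

```python
def count_translations(block_trace, capacity):
    """Count translations and full flushes under a flush-everything policy.

    block_trace: sequence of guest code block ids in execution order.
    capacity: how many translated blocks fit before the buffer is dumped.
    """
    translated = set()
    translations = 0
    flushes = 0
    for block in block_trace:
        if block in translated:
            continue  # already translated, runs at full speed
        if len(translated) >= capacity:
            translated.clear()  # buffer full: throw everything away
            flushes += 1
        translated.add(block)
        translations += 1
    return translations, flushes


# Hypothetical trace: a loop over five blocks, executed three times.
trace = [0, 1, 2, 3, 4] * 3

fits = count_translations(trace, capacity=5)
too_small = count_translations(trace, capacity=4)
print("capacity 5:", fits)
print("capacity 4:", too_small)
```

With room for all five blocks each is translated once; with room for only four, every block ends up being retranslated on every pass through the loop.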

So anyway, depending on what code you run in what order, the performance can 
_differ_ from one run to the next due to when the cache gets blanked and 
stuff gets retranslated.  By a lot.  There's no obvious way to predict this 
or control it.  And the "software" clock inside your emulated system can lie 
to you about it if timer interrupts get deferred.

All this should pretty much average out if you do something big with lots of 
execs (like build a linux kernel from source).  But if you do something small 
expect serious butterfly effects.  Expect microbenchmarks to swing around 
wildly.
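
One way to see how much the numbers swing is to repeat the measurement and look at the spread as well as the average. The sketch below uses Python's `time.perf_counter` and `statistics`; the workload is a stand-in, and inside a guest the clock it reads is the emulated one, so the caveat above about the guest clock still applies.

```python
import statistics
import time


def time_runs(workload, runs):
    """Run workload() several times and return the wall-clock samples in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean if mean else 0.0}


# Stand-in workload; replace with whatever is actually being measured.
samples = time_runs(lambda: sum(range(200_000)), runs=5)
summary = summarize(samples)
print(summary)
```

A large coefficient of variation is a hint that the benchmark is too small for the average to mean much.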

Quick analogy: you know the performance difference faulting your executable in 
from disk vs running it out of cache?  Imagine a daemon that makes random 
intermittent calls to "echo 1 > /proc/sys/vm/drop_caches", and now try to do 
a sane benchmark.  No matter what you use to measure, what you're measuring 
isn't going to be consistent from one run to the next.

Performance should be better (and more stable) with kqemu or kvm.  Maybe that 
you can benchmark sanely, I wouldn't know.  Ask somebody else. :)

P.S.  Take the above with a large grain of salt, I'm not close to an expert in 
this area...

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.


* Re: [Qemu-devel] performance monitor
  2008-01-04  8:49     ` Rob Landley
@ 2008-01-04 15:09       ` Clemens Kolbitsch
  0 siblings, 0 replies; 10+ messages in thread
From: Clemens Kolbitsch @ 2008-01-04 15:09 UTC (permalink / raw)
  To: Rob Landley; +Cc: qemu-devel, Paul Brook

On Friday 04 January 2008 09:49:22 Rob Landley wrote:
> On Thursday 03 January 2008 15:38:02 Clemens Kolbitsch wrote:
> > Does anyone have an idea on how I can measure performance in qemu to a
> > somewhat accurate level?
>
> hwclock --show > time1
> tar xvjf linux-2.6.23.tar.bz2 && cd linux-2.6.23 && make allnoconfig && make
> cd ..
> hwclock --show > time2
>
> Do that on host and guest, and you've got a ratio of the performance of
> qemu to your host that should be good to within a few percent.
>
> > I have modified qemu (the memory handling) and the
> > linux kernel and want to find out the penalty this introduced... does
> > anyone have any comments / ideas on this?
>
> If it's something big, you can compare the result in minutes and seconds.
> That's probably the best you're going to do.  (Although really you want
> hwclock --show before and after, and then do the math.  That tunnels out to
> the host system to get its idea of the time, which doesn't get thrown off
> by timer interrupt delivery (as a signal) getting deferred by the host
> system's scheduler.  Of course the fact that hwclock _takes_ a second or so
> to read the clock is a bit of a downer, but anything that takes less than a
> minute or so to run isn't going to give you a very accurate time because
> the performance of qemu isn't constant, and your results are going to skew
> all over the place.)
>
> Especially for small things, the performance varies from run to run.  Start
> by imagining qemu as having the mother of all page fault latencies.  The
> cost of faulting code into the L2 cache includes dynamic recompilation,
> which is expensive.
>
> Worse, when the dynamic recompilation buffer fills up it blanks the whole
> thing, and recompiles every new page it hits one at a time until the buffer
> fills up again.  (What is it these days, 16 megs of translated code before
> it resets?)  No LRU or anything, no cache management at _all_, just "when
> the bucket fills up, dump it and start over".  (Well, that's what it did
> back around the last stable release anyway.  It has been almost a year
> since then, so maybe it's changed.  I've been busy with other things and
> not really keeping track of changes that didn't affect what I could and
> couldn't get to run.)
>
> So anyway, depending on what code you run in what order, the performance
> can _differ_ from one run to the next due to when the cache gets blanked
> and stuff gets retranslated.  By a lot.  There's no obvious way to predict
> this or control it.  And the "software" clock inside your emulated system
> can lie to you about it if timer interrupts get deferred.
>
> All this should pretty much average out if you do something big with lots
> of execs (like build a linux kernel from source).  But if you do something
> small expect serious butterfly effects.  Expect microbenchmarks to swing
> around wildly.
>
> Quick analogy: you know the performance difference faulting your executable
> in from disk vs running it out of cache?  Imagine a daemon that makes random
> intermittent calls to "echo 1 > /proc/sys/vm/drop_caches", and now try to
> do a sane benchmark.  No matter what you use to measure, what you're
> measuring isn't going to be consistent from one run to the next.
>
> Performance should be better (and more stable) with kqemu or kvm.  Maybe
> that you can benchmark sanely, I wouldn't know.  Ask somebody else. :)
>
> P.S.  Take the above with a large grain of salt, I'm not close to an expert
> in this area...

:-)

Ok. What you've said pretty much covers how I've made up my mind in the last 
couple of hours trying to think about the problem *g*

Guess I'll have to be happy counting TLB misses and page faults, adding up 
executed instructions (in user/kernel mode) per process and doing some timing 
stuff... then running the examples a lot of times, making an average of all 
numbers and finally just ignoring them since I *know* that they are bogus ;-)

No, seriously... I understand the problem, but I think the above is the best I 
can do since I'm really only interested in the effect it has on QEMU for the 
moment :-)

Thanks again for your ideas!!
