* [Qemu-devel] performance monitor
From: Clemens Kolbitsch @ 2008-01-03 20:36 UTC
To: qemu-devel

hi!

has anyone ever used some "real" performance monitoring tools (like
papiex, perfex, pfmon, etc.) on qemu? i'm running a debian linux and
would like to time some applications inside qemu and have tried the
perfmon2 kernel-patch (http://perfmon2.sourceforge.net/) for testing.

sadly, it does not work... dmesg tells me that the CPU is not identified
correctly ("unsupported family=6"). Now i am not really sure what type of
hardware-support the monitor relies on (i think PMU is the correct term,
but I'm not sure about that) and what CPUs are supported (dmesg tells me
that qemu simulates a Pentium M, but that's probably because I've
compiled the kernel on my *real* Pentium M).

... Ok, to cut a long question short: Is there any hardware support in
qemu for doing monitoring (that goes deeper than using "time") and has
anyone ever tested something that could work?

Thanks!
Clemens

* Re: [Qemu-devel] performance monitor
From: Paul Brook @ 2008-01-03 21:29 UTC
To: qemu-devel
Cc: Clemens Kolbitsch

> ... Ok, to cut a long question short: Is there any hardware support in
> qemu for doing monitoring (that goes deeper than using "time") and has
> anyone ever tested something that could work?

Probably your application wants the performance counters. Qemu doesn't
emulate those.

Besides which, qemu is not cycle accurate. Any performance measurements
you make are pretty much meaningless, and bear absolutely no relationship
to real hardware.

Paul

* Re: [Qemu-devel] performance monitor
From: Clemens Kolbitsch @ 2008-01-03 21:38 UTC
To: Paul Brook
Cc: qemu-devel

On Thursday 03 January 2008 22:29:06 Paul Brook wrote:
> > ... Ok, to cut a long question short: Is there any hardware support in
> > qemu for doing monitoring (that goes deeper than using "time") and has
> > anyone ever tested something that could work?
>
> Probably your application wants the performance counters. Qemu doesn't
> emulate those.
>
> Besides which, qemu is not cycle accurate. Any performance measurements
> you make are pretty much meaningless, and bear absolutely no relationship
> to real hardware.

Thanks for the quick answer Paul! Not really what I wanted to hear, but
probably true ;-)

Does anyone have an idea on how I can measure performance in qemu to a
somewhat accurate level? I have modified qemu (the memory handling) and
the linux kernel and want to find out the penalty this introduced... does
anyone have any comments / ideas on this?

Thanks!

* Re: [Qemu-devel] performance monitor
From: Paul Brook @ 2008-01-03 22:07 UTC
To: qemu-devel
Cc: Clemens Kolbitsch

> Does anyone have an idea on how I can measure performance in qemu to a
> somewhat accurate level? I have modified qemu (the memory handling) and
> the linux kernel and want to find out the penalty this introduced... does
> anyone have any comments / ideas on this?

Short answer is you probably can't. And even if you can I won't believe
your results unless you've verified them on real hardware :-)

With the exception of some very small embedded cores, modern CPUs have
complex out of order execution pipelines and multi-level cache
hierarchies. It's common for performance to be dominated by these
secondary factors rather than raw instruction throughput.

Exactly what features dominate performance is very application specific.
Determining which factor dominates is unlikely to be something qemu can
help with.

However if e.g. you know that for your application there's a good
correlation between performance and L2 cache misses you could instrument
qemu to add an L1/L2 cache model. The overhead will be fairly severe
(easily 10x slower), and completely screw up any realtime measurements.
However it would produce some useful cache use statistics that you could
use to guesstimate actual performance. This is similar to how cachegrind
works. Obviously if your application isn't cache bound then these figures
will be meaningless.

Paul

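The L1/L2 cache model suggested above can be sketched roughly as follows. This is a minimal illustration in Python, not QEMU code: the class `SetAssociativeCache`, the function `simulate`, the LRU replacement policy and all sizes are assumptions chosen for the example, and a real instrumentation would have to feed it the guest's memory accesses from inside the emulator.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (hypothetical model)."""

    def __init__(self, size_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Record one access; return True on a hit, False on a miss."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)   # evict the least recently used line
        entries[tag] = True
        return False


def simulate(addresses, l1, l2):
    """Send each address to L1 and forward L1 misses to L2."""
    for addr in addresses:
        if not l1.access(addr):
            l2.access(addr)
    return l1.misses, l2.misses
```

Feeding a short address trace through a tiny direct-mapped L1 and a 2-way L2 gives per-level miss counts, which is the kind of statistic that could then be correlated with measurements on real hardware. The model only counts; it says nothing about timing.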
* Re: [Qemu-devel] performance monitor
From: Clemens Kolbitsch @ 2008-01-03 22:11 UTC
To: qemu-devel

On Thursday 03 January 2008 23:07:07 you wrote:
> > Does anyone have an idea on how I can measure performance in qemu to a
> > somewhat accurate level? I have modified qemu (the memory handling) and
> > the linux kernel and want to find out the penalty this introduced...
> > does anyone have any comments / ideas on this?
>
> Short answer is you probably can't. And even if you can I won't believe
> your results unless you've verified them on real hardware :-)
>
> With the exception of some very small embedded cores, modern CPUs have
> complex out of order execution pipelines and multi-level cache
> hierarchies. It's common for performance to be dominated by these
> secondary factors rather than raw instruction throughput.
>
> Exactly what features dominate performance is very application specific.
> Determining which factor dominates is unlikely to be something qemu can
> help with.
>
> However if e.g. you know that for your application there's a good
> correlation between performance and L2 cache misses you could instrument
> qemu to add an L1/L2 cache model. The overhead will be fairly severe
> (easily 10x slower), and completely screw up any realtime measurements.
> However it would produce some useful cache use statistics that you could
> use to guesstimate actual performance. This is similar to how cachegrind
> works. Obviously if your application isn't cache bound then these
> figures will be meaningless.

Well, the measuring I had in mind partly concentrates on TLB misses, page
faults, etc. (in addition to the cycle measuring). guess i'll have to
implement something for myself in qemu :-/

But thanks a lot for helping me out!

* Re: [Qemu-devel] performance monitor
From: Paul Brook @ 2008-01-03 22:18 UTC
To: qemu-devel
Cc: Clemens Kolbitsch

> Well, the measuring I had in mind partly concentrates on TLB misses, page
> faults, etc. (in addition to the cycle measuring). guess i'll have to
> implement something for myself in qemu :-/

Be aware that the TLB qemu uses behaves very differently to a real CPU
TLB. If you want to get TLB miss statistics you'll need to model a "real"
TLB for that separately.

Page faults should be straightforward, but any half-decent guest OS would
be able to tell you those anyway.

Paul

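A separately modelled TLB of the kind mentioned above might look like the sketch below. It is only an illustration under stated assumptions: a fully associative TLB with LRU replacement, a 4 KiB page size, and the hypothetical names `TLBModel`, `translate` and `flush`. None of this reflects how qemu's own TLB works, and a real CPU's TLB organisation may differ considerably.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size in bytes


class TLBModel:
    """Toy fully associative TLB with LRU replacement (hypothetical model)."""

    def __init__(self, entries):
        self.capacity = entries
        self.pages = OrderedDict()  # virtual page number -> present
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        """Record one lookup; return True on a TLB hit, False on a miss."""
        vpn = vaddr // PAGE_SIZE
        if vpn in self.pages:
            self.pages.move_to_end(vpn)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict least recently used entry
        self.pages[vpn] = True
        return False

    def flush(self):
        """Drop all entries, e.g. to model an address-space switch."""
        self.pages.clear()
```

Calling `translate` for every guest virtual address that is accessed and `flush` wherever the modelled CPU would invalidate its TLB yields hit and miss counts that are independent of the emulator's internal lookup structures.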
* Re: [Qemu-devel] performance monitor
From: Clemens Kolbitsch @ 2008-01-03 22:21 UTC
To: Paul Brook
Cc: qemu-devel

On Thursday 03 January 2008 23:18:58 Paul Brook wrote:
> > Well, the measuring I had in mind partly concentrates on TLB misses,
> > page faults, etc. (in addition to the cycle measuring). guess i'll have
> > to implement something for myself in qemu :-/
>
> Be aware that the TLB qemu uses behaves very differently to a real CPU
> TLB. If you want to get TLB miss statistics you'll need to model a "real"
> TLB for that separately.

Sure, yes. But I don't even care what it would be like on a real CPU. I
just want to know the impact it has on the emulated CPU ;-)

> Page faults should be straightforward, but any half-decent guest OS would
> be able to tell you those anyway.

True *g*

* Re: [Qemu-devel] performance monitor
From: Laurent Desnogues @ 2008-01-03 22:19 UTC
To: qemu-devel

On Jan 3, 2008 11:11 PM, Clemens Kolbitsch <clemens.kol@gmx.at> wrote:
>
> Well, the measuring I had in mind partly concentrates on TLB misses, page
> faults, etc. (in addition to the cycle measuring). guess i'll have to
> implement something for myself in qemu :-/

There's something not clear here: do you want to measure your kernel
changes or do you want to profile Qemu? As Paul clearly explained you
can't do both :)

If you want to measure kernel performance oprofile is probably worth
looking at. But you will need the real hardware. Another option, though
much more intrusive, would be to add explicit performance counters in
places you need to look at (this method can be applied to Qemu too).

And to say it again: nobody can expect to measure OS performance on a
simulator, unless the simulator is directly derived from the HDL code
written by designers. At least I would never trust such a result ;)

Laurent

* Re: [Qemu-devel] performance monitor
From: Rob Landley @ 2008-01-04 8:49 UTC
To: qemu-devel
Cc: Clemens Kolbitsch, Paul Brook

On Thursday 03 January 2008 15:38:02 Clemens Kolbitsch wrote:
> Does anyone have an idea on how I can measure performance in qemu to a
> somewhat accurate level?

hwclock --show > time1
tar xvjf linux-2.6.23.tar.bz2 && cd linux-2.6.23 && make allnoconfig && make
cd ..
hwclock --show > time2

Do that on host and client, and you've got a ratio of the performance of
qemu to your host that should be good to within a few percent.

> I have modified qemu (the memory handling) and the
> linux kernel and want to find out the penalty this introduced... does
> anyone have any comments / ideas on this?

If it's something big, you can compare the result in minutes and seconds.
That's probably the best you're going to do. (Although really you want
hwclock --show before and after, and then do the math. That tunnels out
to the host system to get its idea of the time, which doesn't get thrown
off by timer interrupt delivery (as a signal) getting deferred by the
host system's scheduler.) Of course the fact that hwclock _takes_ a
second or so to read the clock is a bit of a downer, but anything that
takes less than a minute or so to run isn't going to give you a very
accurate time because the performance of qemu isn't constant, and your
results are going to skew all over the place.

Especially for small things, the performance varies from run to run.
Start by imagining qemu as having the mother of all page fault latencies.
The cost of faulting code into the L2 cache includes dynamic
recompilation, which is expensive.

Worse, when the dynamic recompilation buffer fills up it blanks the whole
thing, and recompiles every new page it hits one at a time until the
buffer fills up again. (What is it these days, 16 megs of translated code
before it resets?) No LRU or anything, no cache management at _all_, just
"when the bucket fills up, dump it and start over". (Well, that's what it
did back around the last stable release anyway. It has been almost a year
since then, so maybe it's changed. I've been busy with other things and
not really keeping track of changes that didn't affect what I could and
couldn't get to run.)

So anyway, depending on what code you run in what order, the performance
can _differ_ from one run to the next due to when the cache gets blanked
and stuff gets retranslated. By a lot. There's no obvious way to predict
this or control it. And the "software" clock inside your emulated system
can lie to you about it if timer interrupts get deferred.

All this should pretty much average out if you do something big with lots
of execs (like build a linux kernel from source). But if you do something
small expect serious butterfly effects. Expect microbenchmarks to swing
around wildly.

Quick analogy: you know the performance difference faulting your
executable in from disk vs running it out of cache? Imagine a daemon that
makes random intermittent calls to "echo 1 > /proc/sys/vm/drop_caches",
and now try to do a sane benchmark. No matter what you use to measure,
what you're measuring isn't going to be consistent from one run to the
next.

Performance should be better (and more stable) with kqemu or kvm. Maybe
that you can benchmark sanely, I wouldn't know. Ask somebody else. :)

P.S. Take the above with a large grain of salt, I'm not close to an
expert in this area...

Rob
--
"One of my most productive days was throwing away 1000 lines of code."
- Ken Thompson.

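The "when the bucket fills up, dump it and start over" behaviour described above can be illustrated with a toy model. The sketch below is not derived from qemu's source: the function `count_translations`, the block identifiers and the capacity are all hypothetical, and capacity is counted in blocks rather than in bytes of translated code. It only shows why the order in which code runs can change how much retranslation happens.

```python
def count_translations(block_trace, capacity):
    """Count translations under a 'flush everything when full' policy.

    block_trace: sequence of hashable block identifiers in execution order.
    capacity:    number of translated blocks the buffer can hold.
    Returns (translations, flushes).
    """
    cache = set()
    translations = 0
    flushes = 0
    for block in block_trace:
        if block in cache:
            continue               # already translated, no extra cost
        if len(cache) >= capacity:
            cache.clear()          # no LRU: drop every translated block
            flushes += 1
        cache.add(block)
        translations += 1
    return translations, flushes
```

With a capacity of two blocks, the trace `[1, 2, 3, 1, 2, 3]` needs six translations and two flushes, while `[1, 1, 2, 2, 3, 3]` executes the same blocks equally often but needs only three translations and one flush.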
* Re: [Qemu-devel] performance monitor
From: Clemens Kolbitsch @ 2008-01-04 15:09 UTC
To: Rob Landley
Cc: qemu-devel, Paul Brook paul

On Friday 04 January 2008 09:49:22 Rob Landley wrote:
> On Thursday 03 January 2008 15:38:02 Clemens Kolbitsch wrote:
> > Does anyone have an idea on how I can measure performance in qemu to a
> > somewhat accurate level?
>
> hwclock --show > time1
> tar xvjf linux-2.6.23.tar.bz2 && cd linux-2.6.23 && make allnoconfig && make
> cd ..
> hwclock --show > time2
>
> Do that on host and client, and you've got a ratio of the performance of
> qemu to your host that should be good to within a few percent.
>
> > I have modified qemu (the memory handling) and the
> > linux kernel and want to find out the penalty this introduced... does
> > anyone have any comments / ideas on this?
>
> If it's something big, you can compare the result in minutes and seconds.
> That's probably the best you're going to do. (Although really you want
> hwclock --show before and after, and then do the math. That tunnels out
> to the host system to get its idea of the time, which doesn't get thrown
> off by timer interrupt delivery (as a signal) getting deferred by the
> host system's scheduler.) Of course the fact that hwclock _takes_ a
> second or so to read the clock is a bit of a downer, but anything that
> takes less than a minute or so to run isn't going to give you a very
> accurate time because the performance of qemu isn't constant, and your
> results are going to skew all over the place.
>
> Especially for small things, the performance varies from run to run.
> Start by imagining qemu as having the mother of all page fault latencies.
> The cost of faulting code into the L2 cache includes dynamic
> recompilation, which is expensive.
>
> Worse, when the dynamic recompilation buffer fills up it blanks the whole
> thing, and recompiles every new page it hits one at a time until the
> buffer fills up again. (What is it these days, 16 megs of translated code
> before it resets?) No LRU or anything, no cache management at _all_, just
> "when the bucket fills up, dump it and start over". (Well, that's what it
> did back around the last stable release anyway. It has been almost a year
> since then, so maybe it's changed. I've been busy with other things and
> not really keeping track of changes that didn't affect what I could and
> couldn't get to run.)
>
> So anyway, depending on what code you run in what order, the performance
> can _differ_ from one run to the next due to when the cache gets blanked
> and stuff gets retranslated. By a lot. There's no obvious way to predict
> this or control it. And the "software" clock inside your emulated system
> can lie to you about it if timer interrupts get deferred.
>
> All this should pretty much average out if you do something big with lots
> of execs (like build a linux kernel from source). But if you do something
> small expect serious butterfly effects. Expect microbenchmarks to swing
> around wildly.
>
> Quick analogy: you know the performance difference faulting your
> executable in from disk vs running it out of cache? Imagine a daemon that
> makes random intermittent calls to "echo 1 > /proc/sys/vm/drop_caches",
> and now try to do a sane benchmark. No matter what you use to measure,
> what you're measuring isn't going to be consistent from one run to the
> next.
>
> Performance should be better (and more stable) with kqemu or kvm. Maybe
> that you can benchmark sanely, I wouldn't know. Ask somebody else. :)
>
> P.S. Take the above with a large grain of salt, I'm not close to an
> expert in this area...

:-)

Ok. What you've said pretty much covers how I've made up my mind in the
last couple of hours trying to think about the problem *g*

Guess I'll have to be happy counting TLB misses and page faults, adding
up executed instructions (in user/kernel mode) per process and doing some
timing stuff... then running the examples a lot of times, making an
average of all numbers and finally just ignoring them since I *know* that
they are bogus ;-)

No, seriously... I understand the problem, but I think the above is the
best I can do since I'm really only interested in the effect it has on
QEMU for the moment :-)

Thanks again for your ideas!!

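The "run it a lot of times and average" step can be sketched as below. This is a minimal illustration using Python's standard `statistics` module; the helper names `summarize` and `slowdown` are hypothetical, and the timings are assumed to be wall-clock durations in seconds collected by whatever method is used (for example the difference between two hwclock readings).

```python
import statistics


def summarize(samples):
    """Return (mean, sample standard deviation) of a list of timings."""
    mean = statistics.mean(samples)
    # stdev needs at least two samples; report 0.0 for a single run.
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return mean, stdev


def slowdown(guest_seconds, host_seconds):
    """Ratio of mean guest time to mean host time for the same workload."""
    return statistics.mean(guest_seconds) / statistics.mean(host_seconds)
```

Reporting the standard deviation next to the mean makes the run-to-run variation discussed in this thread visible: if it is large relative to the difference being measured, the comparison is probably not meaningful.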
Thread overview: 10+ messages
2008-01-03 20:36 [Qemu-devel] performance monitor Clemens Kolbitsch
2008-01-03 21:29 ` Paul Brook
2008-01-03 21:38   ` Clemens Kolbitsch
2008-01-03 22:07     ` Paul Brook
2008-01-03 22:11       ` Clemens Kolbitsch
2008-01-03 22:18         ` Paul Brook
2008-01-03 22:21           ` Clemens Kolbitsch
2008-01-03 22:19         ` Laurent Desnogues
2008-01-04  8:49     ` Rob Landley
2008-01-04 15:09       ` Clemens Kolbitsch