* [Qemu-devel] Performance Monitoring
From: Cheif Jones @ 2008-05-20 18:56 UTC
To: qemu-devel

Hi All,

I'm doing a research project in which I want to run an OS under an emulator for a period of time and get full CPU opcode statistics (how many times every opcode was executed). As far as I understand the Qemu design, it does "JIT" translation of target opcodes into host opcodes to improve performance, so there is no easy way to count target opcodes (e.g. a loop is translated just-in-time and then runs natively).

Is it possible to disable Qemu's JIT capabilities and get target opcode statistics?

Cheif
* Re: [Qemu-devel] Performance Monitoring
From: Paul Brook @ 2008-05-20 22:06 UTC
To: qemu-devel; +Cc: Cheif Jones

> I'm doing a research project in which I want to run an OS under an emulator
> for a period of time and get full CPU opcode statistics (how many times
> every opcode was executed). [...]
>
> Is it possible to disable Qemu's JIT capabilities and get target opcode
> statistics?

You have a couple of options:

- Disable TB caching (so code is re-translated every time it is run), and do the counting during translation. Performance is going to be fairly sucky.

- Inject the counters into the translated code. This is maybe a bit more work, but should perform much better.

With either alternative you'll still have issues with exceptions. MMU faults abort a TB early, so will screw up your statistics. One possibility is to terminate a TB on every memory access, like we do for watchpoints.

You probably already know this, but I'd be surprised if the statistics you get have much/any correlation with real world performance on modern hardware.

Paul
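A minimal sketch of the second option above (injecting counters into the translated code), kept deliberately generic because the translator hook points differ between Qemu versions: while decoding each target instruction, the translator would emit a call to a small helper like the one below, so the increment happens on every execution of the translated block, cached or not. The helper name, the histogram table, and the demo main() are illustrative assumptions, not existing Qemu API; the field extraction assumes a MIPS-like encoding where the major opcode is bits 31..26 of the instruction word.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t opcode_histogram[64];          /* 6-bit MIPS-style major opcode */

static void dump_histogram(void)
{
    for (int op = 0; op < 64; op++)
        if (opcode_histogram[op])
            fprintf(stderr, "major opcode 0x%02x: %llu\n",
                    op, (unsigned long long)opcode_histogram[op]);
}

/* Hypothetical counting helper: the translator would generate a call to
 * this for every guest instruction it decodes, so the count is taken at
 * execution time, inside the (possibly cached) translated block. */
void count_target_opcode(uint32_t insn)
{
    static int registered;
    if (!registered) {
        atexit(dump_histogram);                /* dump the table at exit */
        registered = 1;
    }
    opcode_histogram[(insn >> 26) & 0x3f]++;
}

/* Stand-alone demo: feed a few real MIPS instruction words through the
 * counter (addiu, lw, addiu).  In Qemu the call would instead be emitted
 * into each translation block. */
int main(void)
{
    uint32_t fake[] = { 0x24020001, 0x8c430000, 0x24020001 };
    for (int i = 0; i < 3; i++)
        count_target_opcode(fake[i]);
    return 0;
}

Keeping the counts inside the emulator like this avoids writing a per-instruction trace to disk; the trade-off is that you only get the histogram, not the instruction stream itself.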
* Re: [Qemu-devel] Performance Monitoring
From: Vince Weaver @ 2008-05-21 0:09 UTC
To: qemu-devel

On Tue, 20 May 2008, Paul Brook wrote:

> You probably already know this, but I'd be surprised if the statistics you get
> have much/any correlation with real world performance on modern hardware.

You might be surprised. I've been adding hooks to Qemu to generate traces for use in computer architecture research. I compare the results to hardware performance counters, and things like branch prediction, cache miss rates, and total instruction mixes come across very well (with errors of less than 5-10%). [This is simulating MIPS and also x86.]

Things that depend a lot on actual timing, like cycle counts and CPI, you can't do.

I've had a paper published that involves using Qemu in a similar fashion. It generates traces at least as good as Valgrind and Pin can.

http://www.csl.cornell.edu/~vince/papers/hipeac08.html

Vince
* Re: [Qemu-devel] Performance Monitoring
From: Laurent Desnogues @ 2008-05-21 6:41 UTC
To: qemu-devel

On 5/21/08, Paul Brook <paul@codesourcery.com> wrote:
> > get full CPU opcode statistics (how many times
> > every opcode was executed). ...
> You probably already know this, but I'd be surprised if the statistics you get
> have much/any correlation with real world performance on modern hardware.

The OP wants to count instruction opcodes, not cycles or other timing-dependent statistics. Is there something that prevents exact counting of that?

Laurent
* RE: [Qemu-devel] Performance Monitoring
From: Stanislav Shwartsman @ 2008-05-20 22:44 UTC
To: qemu-devel

Hello,

I think it is better to use Bochs for such a purpose.

Stanislav

> From: Cheif Jones
> Sent: Tuesday, May 20, 2008 8:56 PM
> Subject: [Qemu-devel] Performance Monitoring
>
> Is it possible to disable Qemu's JIT capabilities and get target opcode
> statistics?
* [Qemu-devel] Re: Performance Monitoring
From: Charles Duffy @ 2008-05-21 16:10 UTC
To: qemu-devel

Cheif Jones wrote:
> I'm doing a research project in which I want to run an OS under an
> emulator for a period of time and get full CPU opcode statistics (how
> many times every opcode was executed). [...]

Is an actual emulator necessary, or could you use something like an instrumented UML kernel? I'd think that would be a much simpler approach.
* Re: [Qemu-devel] Re: Performance Monitoring
From: Glauber Costa @ 2008-05-22 3:46 UTC
To: qemu-devel

On Wed, May 21, 2008 at 1:10 PM, Charles Duffy <Charles_Duffy@messageone.com> wrote:
> Is an actual emulator necessary, or could you use something like an
> instrumented UML kernel? I'd think that would be a much simpler approach.

Since he claims to need opcode statistics, and UML won't trap any instructions but the privileged ones, that does not seem feasible.

--
Glauber Costa.
"Free as in Freedom" http://glommer.net

"The less confident you are, the more serious you have to act."
* Re: [Qemu-devel] Re: Performance Monitoring
From: Cheif Jones @ 2008-05-22 6:13 UTC
To: qemu-devel

Hi All,

Thanks for all the good advice! I do not have the hardware, so I need an emulator. Also, the target arch is not x86; it has to be some RISC (ARM, PPC).

I would like to run an OS, say Linux, and take a sample for a small period of time (seconds) while some app(s) are running, and get a list of opcode names and how many times they were executed. I'm not interested in CPI at the moment.

Vince, I briefly read your paper and went through the patch. It is very interesting and maybe I can use it for what I need. Questions:

- Is there a 1-1 correspondence between a BBV and a target CPU opcode?

- If I use Qemu to generate BBVs, and the other tools you mentioned that take BBVs as input, will I be able to see the opcode name (and not just a unique ID that goes with the BBV)?

- Paul mentioned "With either alternative you'll still have issues with exceptions. MMU faults abort a TB early, so will screw up your statistics. One possibility is to terminate a TB on every memory access, like we do for watchpoints." - is this an issue addressed by your patch?

Thanks,
Chief
* Re: [Qemu-devel] Re: Performance Monitoring
From: Vince Weaver @ 2008-05-23 3:38 UTC
To: qemu-devel

> I would like to run an OS, say Linux, and take a sample for a small period
> of time (seconds) while some app(s) are running, and get a list of opcode
> names and how many times they were executed. I'm not interested in CPI at
> the moment.

What you are trying to do is relatively straightforward, especially if you are going to be running binaries from a RISC-type machine.

The way I'd recommend doing it is getting Qemu to output the raw instruction stream, and then writing an external program that decodes the instructions and counts how many of each kind appear. This is fairly straightforward to do on an arch like MIPS; it would be very complicated on something like x86.

I have some code I can dig up that does this kind of thing (I used it to run a branch-predictor simulator). I'll include it at the end of this e-mail.

> - Paul mentioned "With either alternative you'll still have issues with
> exceptions. MMU faults abort a TB early, so will screw up your statistics.
> One possibility is to terminate a TB on every memory access, like we do for
> watchpoints." - is this an issue addressed by your patch?

I've only tested my method of generating traces with the userspace (linux-user) type of emulation; I haven't tested it at all when doing full-system emulation. I'd imagine it would still work.

Here's the code. It's based on a pre-TCG version of Qemu, so you can't use it on the latest snapshots. It also only works with MIPS, but it will probably be similar for other architectures. The code buffers a large block of values before writing them out (for performance). To avoid creating huge traces on disk (and they will be huge) you can write to a named pipe (mkfifo) and have your analysis routine run at the same time, reading from the same pipe.

Hopefully if I am doing something horribly wrong with this code, someone will correct me. I've been using it for a while now, though, and have been getting good results when compared to hardware performance counters.

This adds code to dump the pc and instruction for every executed instruction:

--- ./target-mips/translate.c   2008-04-23 12:23:55.000000000 -0400
+++ ./target-mips/translate.c   2008-05-22 23:31:13.000000000 -0400
@@ -6696,6 +6696,7 @@
             gen_opc_instr_start[lj] = 1;
         }
         ctx.opcode = ldl_code(ctx.pc);
+        gen_op_dump_brpred(ctx.pc,ctx.opcode);
         decode_opc(env, &ctx);
         ctx.pc += 4;

Add this to "op.c":

/* Translation-time op: passes the guest PC and raw instruction word
   through to the helper. */
void op_dump_brpred(void)
{
    helper_dump_brpred(PARAM1, PARAM2);
}

Add this to "helper.c":

#include <fcntl.h>      /* creat() */
#include <unistd.h>     /* write() */

static int brpred_fd = -1, brpred_ptr = 0;

static char error_message[] = "Write error!\n";

/* One trace record: guest PC and raw instruction word. */
struct brpredtype {
    unsigned int addr;
    unsigned int insn;
} __attribute__((__packed__));

#define TRACE_UNITS 4096

static struct brpredtype brpred_buf[TRACE_UNITS];

/* Called once per executed guest instruction: buffer the (pc, insn)
   pair and flush the buffer to the trace file when it fills up. */
void helper_dump_brpred(unsigned long address, unsigned long insn)
{
    int result;

    if (brpred_fd < 0) {
        brpred_fd = creat("trace.bpred", 0666);
    }

    brpred_buf[brpred_ptr].addr = address;
    brpred_buf[brpred_ptr].insn = insn;

    brpred_ptr++;

    if (brpred_ptr >= TRACE_UNITS) {
        brpred_ptr = 0;
        result = write(brpred_fd, brpred_buf,
                       TRACE_UNITS * sizeof(struct brpredtype));
        if (result != TRACE_UNITS * sizeof(struct brpredtype)) {
            write(2, error_message, 13);
        }
    }
}
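For the "external program that decodes the instructions" step, a rough sketch of a reader for the trace format produced by the patch above: each record is a packed pair of 32-bit words (guest PC, raw instruction), as in struct brpredtype, and the sketch assumes the trace is read on the same host that wrote it (same byte order). It only tallies the 6-bit MIPS major opcode field; a mnemonic-level breakdown would additionally need to decode the SPECIAL/REGIMM/COP function fields. To use it with a named pipe, create the pipe with mkfifo trace.bpred, start this reader on the pipe, then start the instrumented qemu.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Must match the record layout written by helper_dump_brpred() above. */
struct record {
    uint32_t addr;
    uint32_t insn;
} __attribute__((__packed__));

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "trace.bpred";
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }

    uint64_t count[64];
    memset(count, 0, sizeof count);

    struct record rec;
    uint64_t total = 0;
    while (fread(&rec, sizeof rec, 1, f) == 1) {
        count[(rec.insn >> 26) & 0x3f]++;      /* MIPS major opcode: bits 31..26 */
        total++;
    }
    fclose(f);

    printf("%llu instructions\n", (unsigned long long)total);
    for (int op = 0; op < 64; op++)
        if (count[op])
            printf("major opcode 0x%02x: %llu\n",
                   op, (unsigned long long)count[op]);
    return 0;
}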
* Re: [Qemu-devel] Re: Performance Monitoring
From: Cheif Jones @ 2008-05-25 12:22 UTC
To: qemu-devel

Vince,

Your suggested patch is a good solution. One thing bothers me: there is a TB caching mechanism on top of the opcode translation mechanism. If the TB cache is not disabled, your patch might give inaccurate results (e.g. a MIPS loop is translated to the host CPU, gets cached, and is executed N times from the cache without being re-translated). The TB cache is implemented in tb_find_*(), FYI.

Am I missing something?

Chief

On Fri, May 23, 2008 at 6:38 AM, Vince Weaver <vince@csl.cornell.edu> wrote:
> Here's the code. It's based on a pre-TCG version of Qemu, so you can't use
> it on the latest snapshots. It also only works with MIPS, but it will
> probably be similar for other architectures. [...]
* Re: [Qemu-devel] Re: Performance Monitoring
From: Vince Weaver @ 2008-05-28 0:21 UTC
To: qemu-devel

On Sun, 25 May 2008, Cheif Jones wrote:

> Your suggested patch is a good solution. One thing bothers me: there is a
> TB caching mechanism on top of the opcode translation mechanism. If the TB
> cache is not disabled, your patch might give inaccurate results (e.g. a MIPS
> loop is translated to the host CPU, gets cached, and is executed N times
> from the cache without being re-translated). The TB cache is implemented in
> tb_find_*(), FYI.
>
> Am I missing something?

I'm pretty sure the code I gave you inserts the dumping code into the TB itself, so the fact that the TB is cached shouldn't matter. I could be wrong, as the Qemu internals can be a bit confusing.

I've run experiments with MIPS binaries on both actual R12k hardware with performance counters and with the Qemu-generated traces, though, and the results match to less than 1% difference in instructions retired for the SPEC CPU 2000 benchmarks.

Vince
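To see why the TB cache should not distort the counts as long as the counting call is generated into the translated block itself, here is a tiny self-contained model with no Qemu code in it (the "block" is just a C function standing in for a cached TB, and the opcode numbers are arbitrary): a counter bumped at translation time sees the block's instructions once, while a counter bumped by code emitted into the block sees every execution.

#include <stdio.h>

#define N_OPS 4

static long translate_count[N_OPS];   /* bumped once, when the block is translated */
static long execute_count[N_OPS];     /* bumped by the "emitted" code, every run    */

/* Stand-in for a cached translation block: the counter update runs each
 * time the block is executed, like a helper call generated into a TB. */
static void run_cached_block(const int *ops, int n)
{
    for (int i = 0; i < n; i++)
        execute_count[ops[i]]++;
}

int main(void)
{
    const int block[] = { 0, 1, 1, 2 };        /* a 4-instruction "basic block" */
    const int n = sizeof block / sizeof block[0];

    /* "Translate" once (this is where translation-time counting happens)... */
    for (int i = 0; i < n; i++)
        translate_count[block[i]]++;

    /* ...then execute the cached block many times with no re-translation. */
    for (int iter = 0; iter < 1000; iter++)
        run_cached_block(block, n);

    for (int op = 0; op < N_OPS; op++)
        printf("op %d: counted %ld at translation, %ld at execution\n",
               op, translate_count[op], execute_count[op]);
    return 0;
}

The execution-time counts are the ones that correspond to "how many times every opcode was executed"; translation-time counts only match them if the TB cache is disabled so that every execution forces a re-translation.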
Thread overview: 11+ messages

2008-05-20 18:56 [Qemu-devel] Performance Monitoring Cheif Jones
2008-05-20 22:06 ` Paul Brook
2008-05-21  0:09   ` Vince Weaver
2008-05-21  6:41   ` Laurent Desnogues
2008-05-20 22:44 ` Stanislav Shwartsman
2008-05-21 16:10 ` [Qemu-devel] Re: Performance Monitoring Charles Duffy
2008-05-22  3:46   ` Glauber Costa
2008-05-22  6:13     ` Cheif Jones
2008-05-23  3:38       ` Vince Weaver
2008-05-25 12:22         ` Cheif Jones
2008-05-28  0:21           ` Vince Weaver