Date: Tue, 27 May 2008 20:21:33 -0400 (EDT)
From: Vince Weaver
Subject: Re: [Qemu-devel] Re: Performance Monitoring
In-Reply-To: <3000d2e90805250522o54fdaa17g43d716d8f15dfe9d@mail.gmail.com>
Message-ID: <20080527201827.D48790-100000@bell.csl.cornell.edu>
Reply-To: qemu-devel@nongnu.org
To: qemu-devel@nongnu.org

On Sun, 25 May 2008, Cheif Jones wrote:

> Your suggested patch is a good solution. One thing bothers me: there is
> a TB caching mechanism on top of the opcode translation mechanism. If the TB
> cache is not disabled, your patch might give inaccurate results (e.g. a MIPS
> loop is translated to host CPU code, gets cached, and is executed N times
> from the cache without being re-translated). The TB cache is implemented in
> tb_find_*() FYI.

I'm pretty sure the code I gave you inserts the dumping code into the TB
itself, so it runs every time the TB is executed and the fact that the TB
is cached shouldn't matter.
I could be wrong, as the Qemu internals can be a bit confusing. However, I've
run experiments with MIPS binaries both on actual r12k hardware with
performance counters and under Qemu, and the instructions-retired counts
for the SPEC CPU 2000 benchmarks differ by less than 1%.

Vince

>
> Am I missing something?
>
> Chief
>
> On Fri, May 23, 2008 at 6:38 AM, Vince Weaver wrote:
>
> >
> > > I would like to run an OS, say Linux, and take a sample for a small period
> > > of time (seconds) while some app(s) are running and get a list of opcode
> > > names and how many times they were executed. I'm not interested in CPI at
> > > the moment.
> >
> > What you are trying to do is relatively straightforward, especially if you
> > are going to be running binaries from a RISC type machine.
> >
> > The way I'd recommend doing it is getting Qemu to output the raw
> > instruction stream, and then writing an external program that
> > decodes the instructions and counts how many of each kind there are.
> > This is fairly straightforward to do on an arch like MIPS; it would be
> > very complicated on something like x86.
> >
> > I have some code I can dig up that does this kind of thing (I used
> > it to run a branch predictor simulator). I'll include it at the end of
> > this e-mail.
> >
> > > - Paul mentioned "With either alternative you'll still have issues with
> > > exceptions. MMU faults abort a TB early, so will screw up your statistics.
> > > One possibility is to terminate a TB on every memory access, like we do for
> > > watchpoints." - is this an issue addressed by your patch?
> >
> > I've actually only tested my method of generating things with the
> > userspace linux-user type method of emulation; I haven't tested it at all
> > when doing full-system simulation. I'd imagine it would still work.
> >
> >
> > Here's the code. It's based on a pre-TCG version of Qemu, so you can't use
> > it on the latest snapshots.
> > It also only works with MIPS, but it
> > will probably be similar for other architectures. The code
> > buffers a large block of values before writing it out (for performance).
> > To avoid writing huge traces to disk (and they will be huge) you
> > can write to a named pipe (mkfifo) and have your analysis routine
> > run at the same time, reading from the same pipe.
> >
> > Hopefully if I am doing something horribly wrong with this code, someone
> > will correct me. I've been using it for a while now though and have been
> > getting good results when compared to hw perf counters.
> >
> >
> > This adds code to dump the pc and instruction for every executed instruction:
> >
> > --- ./target-mips/translate.c 2008-04-23 12:23:55.000000000 -0400
> > +++ ./target-mips/translate.c 2008-05-22 23:31:13.000000000 -0400
> > @@ -6696,6 +6696,7 @@
> >              gen_opc_instr_start[lj] = 1;
> >          }
> >          ctx.opcode = ldl_code(ctx.pc);
> > +        gen_op_dump_brpred(ctx.pc,ctx.opcode);
> >          decode_opc(env, &ctx);
> >          ctx.pc += 4;
> >
> >
> > Add this to "op.c"
> >
> > void op_dump_brpred(void) {
> >     helper_dump_brpred(PARAM1,PARAM2);
> > }
> >
> > Add this to "helper.c":
> >
> > #include <fcntl.h>   /* creat() */
> > #include <unistd.h>  /* write() */
> >
> > static int brpred_fd=-1,brpred_ptr=0;
> >
> > static char error_message[]="Write error!\n";
> >
> > /* One trace record: pc and raw instruction word */
> > struct brpredtype {
> >     unsigned int addr;
> >     unsigned int insn;
> > } __attribute__((__packed__));
> >
> > #define TRACE_UNITS 4096
> >
> > static struct brpredtype brpred_buf[TRACE_UNITS];
> >
> > void helper_dump_brpred(unsigned long address,unsigned long insn) {
> >
> >     int result;
> >
> >     /* Open the trace file on first use */
> >     if (brpred_fd<0) {
> >         brpred_fd=creat("trace.bpred",0666);
> >     }
> >
> >     brpred_buf[brpred_ptr].addr=address;
> >     brpred_buf[brpred_ptr].insn=insn;
> >
> >     brpred_ptr++;
> >
> >     /* Write the buffer out once it is full */
> >     if (brpred_ptr>=TRACE_UNITS) {
> >         brpred_ptr=0;
> >         result=write(brpred_fd,brpred_buf,
> >                      TRACE_UNITS*sizeof(struct brpredtype));
> >         if (result!=TRACE_UNITS*sizeof(struct brpredtype)) {
> >             write(2,error_message,13);
> >         }
> >     }
> > }
> >
> >
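
For the external program that reads the trace and counts instruction
kinds, something along the lines of the sketch below might be a starting
point. It is only an illustration, not code from the original patch. The
names trace_rec, major_opcode and count_trace are made up here, and it
assumes that "unsigned int" was 32 bits on the machine that ran Qemu and
that the reader runs on a host with the same byte order. It counts by
the 6-bit primary opcode field only (bits 31..26 of the instruction
word); mapping those numbers to mnemonics, and splitting up primary
opcode 0 (SPECIAL) by its function field, is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to mirror the packed {addr, insn} record written by
   helper_dump_brpred(); this is an illustrative name, not Qemu's. */
struct trace_rec {
    uint32_t addr;   /* pc of the instruction */
    uint32_t insn;   /* raw MIPS instruction word */
};

/* The reader relies on an 8-byte record with no padding. */
static_assert(sizeof(struct trace_rec) == 8, "record must be 8 bytes");

#define NUM_MAJOR 64   /* the MIPS primary opcode field is 6 bits wide */

/* Primary opcode: bits 31..26 of the instruction word. */
static unsigned major_opcode(uint32_t insn) {
    return insn >> 26;
}

/* Read records from 'in' until EOF, incrementing counts[] per primary
   opcode. Returns the number of complete records read. */
static unsigned long count_trace(FILE *in, unsigned long counts[NUM_MAJOR]) {
    struct trace_rec rec;
    unsigned long total = 0;

    while (fread(&rec, sizeof rec, 1, in) == 1) {
        counts[major_opcode(rec.insn)]++;
        total++;
    }
    return total;
}

#ifdef TRACE_COUNT_MAIN
/* Standalone filter: reads the trace on stdin, prints non-zero counts. */
int main(void) {
    unsigned long counts[NUM_MAJOR] = {0};
    unsigned long total = count_trace(stdin, counts);
    unsigned i;

    for (i = 0; i < NUM_MAJOR; i++) {
        if (counts[i]) {
            printf("opcode 0x%02x: %lu\n", i, counts[i]);
        }
    }
    printf("total: %lu\n", total);
    return 0;
}
#endif
```

Built with -DTRACE_COUNT_MAIN this becomes a filter that reads the trace
on stdin, so it can read either the trace.bpred file or a named pipe of
that name created with mkfifo before Qemu is started. Note that the
helper as posted only writes full buffers, so records still sitting in
the buffer when Qemu exits never reach the trace.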
--
/* Vince Weaver vince@csl.cornell.edu http://csl.cornell.edu/~vince */

main(){char O,o[66]="|\n\\/_ ",*I=o+7,l[]="B!FhhBHCWE9C?cJFKET$+h'Iq*chT"
,i=0,_;while(_=l[i++])for(O=0;O++<_>>5;)*I=*(I++-(_&31));*I=0;puts(o+5);}