Vince,

Your suggested patch is a good solution. One thing bothers me: there is
a TB caching mechanism on top of the opcode translation mechanism. If
the TB cache is not disabled, your patch might give inaccurate results
(e.g. a MIPS loop is translated to host code, gets cached, and is
executed N times from the cache without being re-translated). The TB
cache is implemented in tb_find_*(), FYI. Am I missing something?

Chief

On Fri, May 23, 2008 at 6:38 AM, Vince Weaver wrote:
>
> > I would like to run an OS, say Linux, and take a sample for a small
> > period of time (seconds) while some app(s) are running and get a
> > list of opcode names and how many times they were executed. I'm not
> > interested in CPI at the moment.
>
> What you are trying to do is relatively straightforward, especially if
> you are going to be running binaries from a RISC type machine.
>
> The way I'd recommend doing it is getting Qemu to output the raw
> instruction stream, and then writing an external program that decodes
> the instructions and counts what kinds are in each. This is fairly
> straightforward to do on an arch like MIPS; it would be very
> complicated on something like x86.
>
> I have some code I can dig up that does this kind of thing (I used it
> to run a branch predictor simulator). I'll include it at the end of
> this e-mail, along with a sketch of the matching counting side.
>
> > - Paul mentioned "With either alternative you'll still have issues
> > with exceptions. MMU faults abort a TB early, so will screw up your
> > statistics. One possibility is to terminate a TB on every memory
> > access, like we do for watchpoints." - is this an issue addressed
> > by your patch?
>
> I've actually only tested my method of generating things with the
> userspace linux-user method of emulation; I haven't tested it at all
> when doing full-system simulation. I'd imagine it would still work.
>
> Here's the code. It's based on a pre-TCG version of Qemu, so you
> can't use it on the latest snapshots. It also only works with MIPS,
> but it would probably be similar for other architectures. The code
> buffers a large block of values before writing it out (for
> performance). To avoid creating huge traces on disk (and they will be
> huge) you can write to a named pipe (mkfifo) and have your analysis
> routine run at the same time, reading from that pipe.
>
> Hopefully if I am doing something horribly wrong with this code,
> someone will correct me. I've been using it for a while now though
> and have been getting good results when compared to hw perf counters.
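> First, a rough sketch of what the counting side could look like. This
> is untested as posted here and just a starting point: it assumes the
> same packed (addr, insn) record layout that the Qemu-side code below
> writes out, and it only tallies the MIPS major opcode field (the top
> 6 bits of each instruction), so the SPECIAL/REGIMM sub-opcodes get
> lumped together. The file name count_ops.c is made up.
>
> /* count_ops.c - read a trace.bpred record stream on stdin and
>  * tally MIPS major opcodes. Build with: gcc -o count_ops count_ops.c
>  */
> #include <stdio.h>
>
> struct brpredtype {
>     unsigned int addr;
>     unsigned int insn;
> } __attribute__((__packed__));
>
> int main(void) {
>     struct brpredtype rec;
>     unsigned long long counts[64] = {0}, total = 0;
>     int i;
>
>     /* Records are fixed-size; fread() deals with short reads from
>      * a pipe for us. */
>     while (fread(&rec, sizeof(rec), 1, stdin) == 1) {
>         counts[(rec.insn >> 26) & 0x3f]++;  /* major opcode field */
>         total++;
>     }
>
>     for (i = 0; i < 64; i++)
>         if (counts[i])
>             printf("opcode %2d: %llu\n", i, counts[i]);
>     printf("total:     %llu\n", total);
>     return 0;
> }
>
> With the named-pipe trick mentioned above, you'd do something like
> "mkfifo trace.bpred; ./count_ops < trace.bpred &" in the directory
> where Qemu runs, so the trace never hits the disk.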
> This adds code to dump the pc and instruction for every executed
> instruction:
>
> --- ./target-mips/translate.c   2008-04-23 12:23:55.000000000 -0400
> +++ ./target-mips/translate.c   2008-05-22 23:31:13.000000000 -0400
> @@ -6696,6 +6696,7 @@
>              gen_opc_instr_start[lj] = 1;
>          }
>          ctx.opcode = ldl_code(ctx.pc);
> +        gen_op_dump_brpred(ctx.pc,ctx.opcode);
>          decode_opc(env, &ctx);
>          ctx.pc += 4;
>
> Add this to "op.c":
>
> /* dyngen turns this into gen_op_dump_brpred(param1, param2) */
> void op_dump_brpred(void) {
>     helper_dump_brpred(PARAM1, PARAM2);
> }
>
> Add this to "helper.c":
>
> #include <fcntl.h>   /* creat() */
> #include <unistd.h>  /* write() */
>
> static int brpred_fd = -1, brpred_ptr = 0;
>
> static char error_message[] = "Write error!\n";
>
> struct brpredtype {
>     unsigned int addr;
>     unsigned int insn;
> } __attribute__((__packed__));
>
> #define TRACE_UNITS 4096
>
> static struct brpredtype brpred_buf[TRACE_UNITS];
>
> void helper_dump_brpred(unsigned long address, unsigned long insn) {
>
>     int result;
>
>     /* Open the trace file on first use. */
>     if (brpred_fd < 0) {
>         brpred_fd = creat("trace.bpred", 0666);
>     }
>
>     brpred_buf[brpred_ptr].addr = address;
>     brpred_buf[brpred_ptr].insn = insn;
>
>     brpred_ptr++;
>
>     /* Flush once the buffer fills up; >= (not >) so we never index
>      * past the end of brpred_buf. */
>     if (brpred_ptr >= TRACE_UNITS) {
>         brpred_ptr = 0;
>         result = write(brpred_fd, brpred_buf,
>                        TRACE_UNITS * sizeof(struct brpredtype));
>         if (result != TRACE_UNITS * sizeof(struct brpredtype)) {
>             write(2, error_message, sizeof(error_message) - 1);
>         }
>     }
> }
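> One caveat with the buffering: whatever is still sitting in
> brpred_buf when Qemu exits never makes it to the trace file, so the
> tail of the run is silently dropped. A sketch of a fix (flush_brpred
> is a name I'm making up): add the function below next to the helper
> and register it with atexit(flush_brpred) right after the creat()
> call (needs <stdlib.h> for atexit). Depending on how Qemu tears down
> (e.g. a direct _exit() on the guest's exit path), atexit handlers may
> not always fire, so calling it explicitly from the shutdown path may
> be safer.
>
> static void flush_brpred(void) {
>     /* Write out any records still sitting in the buffer. */
>     if (brpred_fd >= 0 && brpred_ptr > 0) {
>         write(brpred_fd, brpred_buf,
>               brpred_ptr * sizeof(struct brpredtype));
>     }
> }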