From mboxrd@z Thu Jan  1 00:00:00 1970
From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Date: Tue, 15 Apr 2008 07:34:47 +0000
Subject: Re: [kvm-ppc-devel] [PATCH] [1/3] kvmppc: add detailed instruction
Message-Id: <48045A97.90804@linux.vnet.ibm.com>
List-Id: <kvm-ppc.vger.kernel.org>
References: <12081758423120-git-send-email-ehrhardt@linux.vnet.ibm.com>
In-Reply-To: <12081758423120-git-send-email-ehrhardt@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
To: kvm-ppc@vger.kernel.org

Hollis Blanchard wrote:
> On Monday 14 April 2008 07:23:56 ehrhardt@linux.vnet.ibm.com wrote:
>> From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
>>
>> This extends the kvm_stat reports for kvm on embedded powerpc. Since kvm=
ppc
>> is using emulation (no hardware support yet) this gives people interested
>> in performance a detailed split of the emul_instruction counter already
>> available. This statistic does not cover e.g. operants of the commands, =
but
>> that way it should have only a small perf impact (never break what you w=
ant
>> to measure). This feature is configurable in .config under the kvmppc
>> virtualization itself.
>=20
> This array-based approach seems to add a lot of lines of code, and it's a=
lso=20
> copy/paste stuff that is just begging for a typo (e.g. miscounting STHX i=
n=20
> the STHUX bucket) or being forgotten entirely when adding new emulation c=
ode.
>=20
> A more general approach would be to record just the opcode/extended opcod=
e in=20
> a variable-sized structure, allocating a new bucket as we encounter a=20
> previously unseen instruction. I'm thinking of something like this:

I already thought about some bucket like counting, but I focused small runt=
ime overhead above everything else here by hijacking the already existent s=
witch/case and avoiding any new if's & function calls.
I agree that the array like solution is messy in the aspect of lines and is=
 more error prone by forgetting something.

> log_instruction(inst) {
> 	bucket =3D hash_u32(inst, 5);
> 	list_for_each(l, vcpu->instlog[bucket]) {
> 		struct instlog =3D list_entry(l);
> 		if (instlog->inst =3D inst) {
> 			instlog->count++;
> 			break;
> 		}
> 	}
> }
>=20
> emulate(inst) {
> 	log =3D get_op(inst);
> 	switch (get_op(inst)) {
> 		...
> 	case 31:
> 		log |=3D get_xop(inst);
> 		switch (inst) {
> 			...
> 		}
> 	}
> 	log_instruction(log);
> }

I agree that this looks better, but is has per instruction:
1x hash function
a loop wich runs ~1-2 times
an if (do we actually have long pipelines that suffer from wrong branch pre=
diction?)
When we add more functions here like mapping sprn's and whatever else comes=
 in mind later this might get even more complex, while this statistics shou=
ld be the low overhead stats.
The thing I want to point out is that the tracing with the relay channel al=
so only have (speaking of the reduced version discussed in the other thread=
 with only 3 integers) 3 integers to move into a buffer and some overhead t=
o select the right buffer. The userspace application reads 4096b =3D> 341 r=
ecords per scheduling of the reading app and we might even increase that si=
ze. I really like the "full trace" feeling of the relay based patch, so I d=
on't like to increase the functionality&overhead of this patch which was in=
tended to be low-overhead.

As stated above I agree with you about lines of code and error proneness of=
 that patch. What about that:
- for integration into our upstream code only the relay based approach is t=
ake into account, giving a full trace while it has some overhead
- we keep the slight overhead, but error prone kvm_stat based patch in our =
private queues and only use it for our own measurements until we either nee=
d or change it once we have more experience with the relay based tracing (e=
.g. know the perf. overhead of that better)
- That way we would have two very similar solutions that share code and are=
 configurable so developers can use them as they want
- additionally I think again about Arnd's comments and if we might need/wan=
t a runtime switch e.g. proc/sys interface to enable/disable that tracing f=
unctions on the fly

Comments welcome

> It looks like we could build a hash table pretty easily with hash_long() =
and=20
> list_entry stuff (see fs/mbcache.c for example). So far you've found 17=20
> different instructions types emulated in the "boot" workload, so 32 entri=
es=20
> would probably be a reasonable hash size. The same approach could be used=
 for=20
> SPR accesses, where you've hit 31 different registers. Really I think tho=
se=20
> numbers won't vary much by workload, but rather by guest kernel...
>=20


--=20

Gr=FCsse / regards,=20
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference=20
Don't miss this year's exciting event. There's still time to save $100.=20
Use priority code J8TL2D2.=20
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/java=
one
_______________________________________________
kvm-ppc-devel mailing list
kvm-ppc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-ppc-devel