* exit timing analysis v1 - comments&discussions welcome
@ 2008-09-24 9:24 Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-09-24 9:24 UTC (permalink / raw)
To: kvm-ppc
[-- Attachment #1: Type: text/plain, Size: 5637 bytes --]
Hi everyone,
I wrote a patch that tries to gather data on where we spend our time.
This should give us something to think about when we talk about
virtualization on non-hardware-assisted platforms like the 440.
A good example of what this data can help us discuss is the comparison
between a paravirtualized and a non-paravirtualized guest (using the
magic page mechanism I posted recently). You can see that the pv
extension reduces the number of instruction emulation exits by 50%, but
unfortunately that only reduces the time we spend on emulation by ~20%.
This data is for your information, and comments/discussions are welcome
(I expect that we will need to trace the emulation timing per
instruction to identify the slow ones and think about optimizing them).
The basic approach is to read TB[U&L] when entering the guest and when
exiting it. The time from exit->enter is the time spent handling the
exit and is accounted per exit type. The time from enter->exit is the
time spent in the guest and is accounted as well.
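To make that concrete, here is a minimal sketch of the timestamp
handling (illustrative only - the attached patch contains the real
implementation; the union layout and the do_div() conversion below
mirror it):

/* Illustrative sketch, not the patch itself: a TBU/TBL pair read at
 * exit and at entry is treated as one 64-bit timebase value, and the
 * difference is converted to microseconds for accounting. */
#include <asm/time.h>           /* tb_ticks_per_usec */
#include <asm-generic/div64.h>  /* do_div() */

union tb_sample {
	u64 tv64;
	struct {
		u32 tbu, tbl;   /* big endian: upper word first */
	} tv32;
};

static inline u64 tb_delta_usec(union tb_sample enter, union tb_sample exit)
{
	u64 ticks = exit.tv64 - enter.tv64;

	do_div(ticks, tb_ticks_per_usec);  /* timebase ticks -> usec */
	return ticks;                      /* accounted to the exit type */
}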
For the details see the attached patch that I used to measure the
timing. I verified it with several assertions that I removed from the
final version, such as "testing for double accounting of an exit" or
"testing whether we missed accounting an exit". Additionally I verified
the exit counts against the data reported by our kvm_stat support, and
they match. The sum of the accounted time also matches the real time
spent running the test (the test was booting an Ubuntu guest, logging in
as root, executing an ls, and killing the guest).
I think this pretty much explains why the guest sometimes feels
slow :-).
Treat this data as semi-experimental and let me know what you think.
NO PV support
unit is usec
traced duration 202,244,233 usec
             count        min      max          sum        avg     stddev   time%
EMULINST     6,423,699     12  247,582   91,732,705    14.2804    241.200   45.36
ITLBVIRT     1,777,242     21  264,257   47,116,557    26.5111    286.040   23.30
DTLBVIRT     1,544,241     22  263,947   41,765,447    27.0459    218.997   20.65
TIMEINGUEST 10,000,000      0    3,673   10,861,792     1.0862      4.790    5.37
MMIO             9,318     47   10,686    3,544,934   380.4394    507.292    1.75
EXTINT           3,523     33    4,475    2,557,183   725.8538    619.026    1.26
DEC             16,937     50    4,866    1,441,084    85.0850    203.856    0.71
DTLBREAL        95,497     13      105    1,349,255    14.1288      2.811    0.67
ITLBREAL        80,622     13      146    1,079,052    13.3841      2.870    0.53
DSI             22,498     11       99      249,554    11.0923      1.975    0.12
DCR              4,854     41    3,758      232,657    47.9310     55.802    0.12
ISI             11,167     11       67      124,210    11.1230      2.242    0.06
SYSCALL          9,459     11      100      113,972    12.0491      2.763    0.06
SIGNAL             635     66    6,025       72,438   114.0756    335.235    0.04
FP_UNAVAIL         308     11       12        3,393    11.0162      0.126    0.00
*never hit USR_PR_INST, HCALL, HALT, DTLBPV, DEBUG
With PV Support
unit is usec
traced duration 136,640,181 usec
             count       min      max          sum        avg      stddev   time%
EMULINST     3,160,786    12  247,587   50,808,972    16.0748     205.445   37.18
ITLBVIRT     1,333,725    21  264,336   35,094,300    26.3130     235.204   25.68
DTLBVIRT     1,212,505    22   16,753   32,382,302    26.7069      51.475   23.70
TIMEINGUEST  6,000,000     0    3,846    7,259,180     1.2099       9.317    5.31
MMIO             9,955    47   12,289    3,570,654   358.6795     539.759    2.61
EXTINT           2,380    31    5,285    1,791,705   752.8172     665.021    1.31
ITLBREAL        80,690    13  247,582    1,320,621    16.3666     871.535    0.97
DTLBREAL        93,213    13      108    1,315,333    14.1110       2.799    0.96
DTLBPV          46,941    20    3,696    1,100,399    23.4422      50.127    0.81
DEC             10,376    50    3,672      928,878    89.5218     217.384    0.68
SIGNAL             658    66  264,544      339,911   516.5821   10303.957    0.25
DSI             21,799    11       87      242,343    11.1172       2.227    0.18
DCR              5,042    41      258      233,095    46.2307       6.465    0.17
ISI             11,582    11       66      128,625    11.1056       2.133    0.09
SYSCALL         10,052    11       59      120,597    11.9973       2.415    0.09
FP_UNAVAIL         295    11       12        3,247    11.0068       0.082    0.00
HCALL                1    19       19           19    19.0000       0.000    0.00
*never hit USR_PR_INST, HALT, DEBUG
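(For reference, the derived columns follow directly from the raw
counters the patch prints - a small post-processing sketch, assuming sum
and sum_quad are kept in usec as above:)

/* Post-processing sketch (not part of the patch): derive avg, stddev
 * and time% for one row from count, sum and sum_quad. */
#include <math.h>
#include <stdio.h>

static void print_row(const char *name, unsigned long count, double sum,
		      double sum_quad, double traced_duration)
{
	double avg    = sum / count;
	double stddev = sqrt(sum_quad / count - avg * avg);
	double timep  = 100.0 * sum / traced_duration;

	printf("%-12s %10lu %14.0f %10.4f %10.3f %6.2f\n",
	       name, count, sum, avg, stddev, timep);
}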
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: kvmppc-exit-timing.diff --]
[-- Type: text/x-diff; name="kvmppc-exit-timing.diff", Size: 12342 bytes --]
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -346,6 +346,11 @@
DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear));
DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
+
+ DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbu));
+ DEFINE(VCPU_TIMING_EXIT_TBL, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbl));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBU, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbu));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBL, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbl));
#endif
return 0;
diff --git a/arch/powerpc/kvm/booke_guest.c b/arch/powerpc/kvm/booke_guest.c
--- a/arch/powerpc/kvm/booke_guest.c
+++ b/arch/powerpc/kvm/booke_guest.c
@@ -56,6 +56,14 @@
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ NULL }
};
+
+u32 last_exit_type;
+u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_last_exit;
static const u32 interrupt_msr_mask[16] = {
[BOOKE_INTERRUPT_CRITICAL] = MSR_ME,
@@ -261,6 +269,9 @@
enum emulation_result er;
int r = RESUME_HOST;
+ /* update before a new last_exit_type is written */
+ update_exit_timing(&(vcpu->arch));
+
local_irq_enable();
run->exit_reason = KVM_EXIT_UNKNOWN;
@@ -285,10 +296,13 @@
* misses before ceding control. */
if (need_resched())
cond_resched();
- if (exit_nr == BOOKE_INTERRUPT_DECREMENTER)
+ if (exit_nr == BOOKE_INTERRUPT_DECREMENTER) {
vcpu->stat.dec_exits++;
- else
+ set_last_exit_type(DEC_EXITS);
+ } else {
vcpu->stat.ext_intr_exits++;
+ set_last_exit_type(EXT_INTR_EXITS);
+ }
r = RESUME_GUEST;
break;
@@ -299,6 +313,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, BOOKE_INTERRUPT_PROGRAM);
r = RESUME_GUEST;
+ set_last_exit_type(USR_PR_INST);
break;
}
@@ -308,10 +323,12 @@
/* Future optimization: only reload non-volatiles if
* they were actually modified by emulation. */
vcpu->stat.emulated_inst_exits++;
+ set_last_exit_type(EMULATED_INST_EXITS);
r = RESUME_GUEST_NV;
break;
case EMULATE_DO_DCR:
run->exit_reason = KVM_EXIT_DCR;
+ set_last_exit_type(DCR_EXITS);
r = RESUME_HOST;
break;
case EMULATE_FAIL:
@@ -331,6 +348,7 @@
case BOOKE_INTERRUPT_FP_UNAVAIL:
kvmppc_queue_exception(vcpu, exit_nr);
+ set_last_exit_type(FP_UNAVAIL);
r = RESUME_GUEST;
break;
@@ -339,6 +357,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.dsi_exits++;
+ set_last_exit_type(DSI_EXITS);
r = RESUME_GUEST;
break;
@@ -346,6 +365,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.isi_exits++;
+ set_last_exit_type(ISI_EXITS);
r = RESUME_GUEST;
break;
@@ -353,9 +373,11 @@
if (vcpu->arch.last_inst == KVM_HYPERCALL_BIN) {
kvmppc_do_hypercall(vcpu);
vcpu->stat.hcall_exits++;
+ set_last_exit_type(HCALL_EXITS);
} else {
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.syscall_exits++;
+ set_last_exit_type(SYSCALL_EXITS);
}
r = RESUME_GUEST;
break;
@@ -370,6 +392,7 @@
vcpu->arch.pvmem_gpaddr >> KVM_PPCPV_MAGIC_PAGE_SHIFT,
0, KVM_PPCPV_MAGIC_PAGE_FLAGS);
vcpu->stat.dtlb_pvmem_miss_exits++;
+ set_last_exit_type(DTLB_PVMEM_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -382,6 +405,7 @@
vcpu->arch.dear = vcpu->arch.fault_dear;
vcpu->arch.esr = vcpu->arch.fault_esr;
vcpu->stat.dtlb_real_miss_exits++;
+ set_last_exit_type(DTLB_REAL_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -399,11 +423,13 @@
kvmppc_mmu_map(vcpu, eaddr, gfn, gtlbe->tid,
gtlbe->word2);
vcpu->stat.dtlb_virt_miss_exits++;
+ set_last_exit_type(DTLB_VIRT_MISS_EXITS);
r = RESUME_GUEST;
} else {
/* Guest has mapped and accessed a page which is not
* actually RAM. */
r = kvmppc_emulate_mmio(run, vcpu);
+ set_last_exit_type(MMIO_EXITS);
}
break;
@@ -422,10 +448,12 @@
/* The guest didn't have a mapping for it. */
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.itlb_real_miss_exits++;
+ set_last_exit_type(ITLB_REAL_MISS_EXITS);
break;
}
vcpu->stat.itlb_virt_miss_exits++;
+ set_last_exit_type(ITLB_VIRT_MISS_EXITS);
gfn = tlb_xlate(gtlbe, eaddr) >> PAGE_SHIFT;
@@ -458,6 +486,7 @@
mtspr(SPRN_DBSR, dbsr);
run->exit_reason = KVM_EXIT_DEBUG;
+ set_last_exit_type(DEBUG_EXITS);
r = RESUME_HOST;
break;
}
@@ -482,6 +511,7 @@
r = (-EINTR << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
} else {
vcpu->stat.light_exits++;
}
@@ -495,6 +525,7 @@
break;
case KVM_EXIT_INTR:
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
break;
}
}
@@ -535,6 +566,8 @@
* real timebase frequency. Accordingly, it must see the state of
* CCR1[TCS]. */
vcpu->arch.ccr1 = mfspr(SPRN_CCR1);
+
+ init_exit_timing(&(vcpu->arch));
return 0;
}
diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -109,6 +109,16 @@
li r6, 1
slw r6, r6, r5
+
+ /* save exit time */
+..exit_tbu_overflow_loop:
+ mfspr r7, SPRN_TBRU
+ mfspr r8, SPRN_TBRL
+ mfspr r9, SPRN_TBRU
+ cmpw r9, r7
+ bne ..exit_tbu_overflow_loop
+ stw r8, VCPU_TIMING_EXIT_TBL(r4)
+ stw r9, VCPU_TIMING_EXIT_TBU(r4)
/* Save the faulting instruction and all GPRs for emulation. */
andi. r7, r6, NEED_INST_MASK
@@ -415,6 +425,17 @@
lwz r3, VCPU_SPRG7(r4)
mtspr SPRN_SPRG7, r3
+ /* save enter time */
+
+..enter_tbu_overflow_loop:
+ mfspr r6, SPRN_TBRU
+ mfspr r7, SPRN_TBRL
+ mfspr r8, SPRN_TBRU
+ cmpw r8, r6
+ bne ..enter_tbu_overflow_loop
+ stw r7, VCPU_TIMING_LAST_ENTER_TBL(r4)
+ stw r8, VCPU_TIMING_LAST_ENTER_TBU(r4)
+
/* Finish loading guest volatiles and jump to guest. */
lwz r3, VCPU_CTR(r4)
mtctr r3
diff --git a/include/asm-powerpc/kvm_host.h b/include/asm-powerpc/kvm_host.h
--- a/include/asm-powerpc/kvm_host.h
+++ b/include/asm-powerpc/kvm_host.h
@@ -25,6 +25,8 @@
#include <linux/types.h>
#include <linux/kvm_types.h>
#include <asm/kvm_asm.h>
+#include <asm/time.h>
+#include <asm-generic/div64.h>
#define KVM_MAX_VCPUS 1
#define KVM_MEMORY_SLOTS 32
@@ -66,6 +68,53 @@
u32 halt_wakeup;
};
+enum kvm_exit_types {
+ MMIO_EXITS,
+ DCR_EXITS,
+ SIGNAL_EXITS,
+ ITLB_REAL_MISS_EXITS,
+ ITLB_VIRT_MISS_EXITS,
+ DTLB_REAL_MISS_EXITS,
+ DTLB_PVMEM_MISS_EXITS,
+ DTLB_VIRT_MISS_EXITS,
+ SYSCALL_EXITS,
+ HCALL_EXITS,
+ ISI_EXITS,
+ DSI_EXITS,
+ EMULATED_INST_EXITS,
+ DEC_EXITS,
+ EXT_INTR_EXITS,
+ HALT_WAKEUP,
+ USR_PR_INST,
+ FP_UNAVAIL,
+ DEBUG_EXITS,
+ TIMEINGUEST,
+ __NUMBER_OF_KVM_EXIT_TYPES
+};
+
+static const char* kvm_exit_names[__NUMBER_OF_KVM_EXIT_TYPES] = {
+ "MMIO",
+ "DCR",
+ "SIGNAL",
+ "ITLBREAL",
+ "ITLBVIRT",
+ "DTLBREAL",
+ "DTLBPV",
+ "DTLBVIRT",
+ "SYSCALL",
+ "HCALL",
+ "ISI",
+ "DSI",
+ "EMULINST",
+ "DEC",
+ "EXTINT",
+ "HALT",
+ "USR_PR_INST",
+ "FP_UNAVAIL",
+ "DEBUG",
+ "TIMEINGUEST"
+};
+
struct tlbe {
u32 tid; /* Only the low 8 bits are used. */
u32 word0;
@@ -75,6 +124,14 @@
struct kvm_arch {
};
+
+/* allow access to big endian 32bit upper/lower parts and 64bit var */
+typedef union {
+ u64 tv64;
+ struct {
+ u32 tbu, tbl;
+ } tv32;
+} exit_timing_t;
struct kvm_vcpu_arch {
/* Unmodified copy of the guest's TLB. */
@@ -139,6 +196,10 @@
u32 dbcr0;
u32 dbcr1;
+ /* to trace exit timing */
+ exit_timing_t timing_exit;
+ exit_timing_t timing_last_enter;
+
u32 last_inst;
u32 fault_dear;
u32 fault_esr;
@@ -165,4 +226,121 @@
int singlestep;
};
+/* helpers for instruction timing, allocated in arch/powerpc/kvm/booke_guest.c */
+extern u32 last_exit_type;
+extern u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_last_exit;
+
+static inline void init_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ int i;
+ last_exit_type=0xDEAD;
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ timing_count_type[i]=0;
+ timing_max_duration[i]=0;
+ timing_min_duration[i]=0xFFFFFFFF;
+ timing_sum_duration[i]=0;
+ timing_sum_quad_duration[i]=0;
+ }
+ vcpu_arch->timing_exit.tv64 = 0;
+ timing_last_exit = 0;
+ vcpu_arch->timing_last_enter.tv64 = 0;
+ printk(KERN_ERR"### initialized exit timing (print start timestamp to log) ###\n");
+}
+
+static inline void set_last_exit_type(int type)
+{
+ last_exit_type=type;
+}
+
+static inline void print_exit_timing(void)
+{
+ int i;
+ u64 min, max;
+
+ printk(KERN_ERR"### start exit timing statistics ###\n");
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ if (timing_min_duration[i]==0xFFFFFFFF)
+ min = 0;
+ else
+ min = timing_min_duration[i];
+ if (timing_max_duration[i]==0)
+ max = 0;
+ else
+ max = timing_max_duration[i];
+
+ printk(KERN_ERR"exit type %2d (%12s) count %10d min %10lld max %10lld sum %20lld sum_quad %20lld\n",
+ i, kvm_exit_names[i], timing_count_type[i], min, max,
+ timing_sum_duration[i], timing_sum_quad_duration[i]);
+ }
+ printk(KERN_ERR"please post calc: avg = sum/count; stddev = sqrt((sum_quad/count)-(sum_dur/count)������)\n");
+ printk(KERN_ERR"### end exit timing statistics ###\n");
+}
+
+static inline void add_exit_timing(u64 duration, int type)
+{
+ u64 old;
+
+ do_div(duration, tb_ticks_per_usec);
+ if (unlikely(duration > 0xFFFFFFFF)) {
+ printk(KERN_ERR"%s - duration too big, **2 would overflow duration %lld max 4294967295 type %d exit # of type %d (tb_ticks_per_usec=%ld)\n",
+ __func__, duration, type, timing_count_type[type], tb_ticks_per_usec);
+ print_exit_timing();
+ }
+
+
+ timing_count_type[type]++;
+
+ /* sum */
+ old = timing_sum_duration[type];
+ timing_sum_duration[type] += duration;
+ if (unlikely(old > timing_sum_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* square sum */
+ old = timing_sum_quad_duration[type];
+ timing_sum_quad_duration[type] += (duration*duration);
+ if (unlikely(old > timing_sum_quad_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of squared durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_quad_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* set min/max */
+ if (unlikely(duration < timing_min_duration[type]))
+ timing_min_duration[type] = duration;
+ if (unlikely(duration > timing_max_duration[type]))
+ timing_max_duration[type] = duration;
+}
+
+static inline void update_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ u64 exit = timing_last_exit;
+ u64 enter = vcpu_arch->timing_last_enter.tv64;
+
+ /* save exit time to use it next exit when the related enter time is known */
+ timing_last_exit = vcpu_arch->timing_exit.tv64;
+
+ if (unlikely(last_exit_type==0xDEAD))
+ return; /* skip first incomplete enter/exit cycle */
+
+ /* update statistics for average and standard deviation */
+ add_exit_timing((enter - exit), last_exit_type);
+ /* enter -> timing_last_exit is time spent in guest - log this too */
+ add_exit_timing((timing_last_exit - enter), TIMEINGUEST);
+
+ /* report regularly every 1000000 exits */
+ if (unlikely(timing_count_type[TIMEINGUEST] % 1000000 == 0)) {
+ printk(KERN_ERR"%s - %d exits report timing statistics\n", __func__,timing_count_type[TIMEINGUEST]);
+ print_exit_timing();
+ }
+}
+
#endif /* __POWERPC_KVM_HOST_H__ */
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
@ 2008-09-24 15:14 ` Hollis Blanchard
2008-09-25 9:32 ` Liu Yu-B13201
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hollis Blanchard @ 2008-09-24 15:14 UTC (permalink / raw)
To: kvm-ppc
On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
> count min max
> sum avg stddev time%
> EMULINST 6,423,699 12 247,582 91,732,705
> 14.2804 241.200 45.36
> ITLBVIRT 1,777,242 21 264,257 47,116,557
> 26.5111 286.040 23.30
> DTLBVIRT 1,544,241 22 263,947 41,765,447
> 27.0459 218.997 20.65
(The max here is of course when the guest was de-scheduled in favor of
another host process.)
I think it's interesting that the min instruction emulation time is 12
usecs. In fact, our "null" exits where we do almost no processing (ISI,
DSI, syscall) are 11 usecs, which suggests we have a problem with
interrupt handler overhead (for all exit types).
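(At a ~667 MHz core clock, those 11 usecs already correspond to roughly
7,300 cycles of fixed overhead per exit.)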
--
Hollis Blanchard
IBM Linux Technology Center
* RE: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
@ 2008-09-25 9:32 ` Liu Yu-B13201
2008-09-25 15:18 ` Hollis Blanchard
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Liu Yu-B13201 @ 2008-09-25 9:32 UTC (permalink / raw)
To: kvm-ppc
>
> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
> > count min max
> > sum avg stddev time%
> > EMULINST 6,423,699 12 247,582
> 91,732,705
> > 14.2804 241.200 45.36
> > ITLBVIRT 1,777,242 21 264,257
> 47,116,557
> > 26.5111 286.040 23.30
> > DTLBVIRT 1,544,241 22 263,947
> 41,765,447
> > 27.0459 218.997 20.65
>
> (The max here is of course when the guest was de-scheduled in favor of
> another host process.)
>
> I think it's interesting that the min instruction emulation time is 12
> usecs. In fact, our "null" exits where we do almost no
> processing (ISI,
> DSI, syscall) are 11 usecs, which suggests we have a problem with
> interrupt handler overhead (for all exit types).
>
Will you consider moving the tlb manipulation out of the entering path?
Examining the modify array may cost some time.
* RE: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
2008-09-25 9:32 ` Liu Yu-B13201
@ 2008-09-25 15:18 ` Hollis Blanchard
2008-10-02 12:02 ` Christian Ehrhardt
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hollis Blanchard @ 2008-09-25 15:18 UTC (permalink / raw)
To: kvm-ppc
On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
> >
> > On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
> > > count min max
> > > sum avg stddev time%
> > > EMULINST 6,423,699 12 247,582
> > 91,732,705
> > > 14.2804 241.200 45.36
> > > ITLBVIRT 1,777,242 21 264,257
> > 47,116,557
> > > 26.5111 286.040 23.30
> > > DTLBVIRT 1,544,241 22 263,947
> > 41,765,447
> > > 27.0459 218.997 20.65
> >
> > (The max here is of course when the guest was de-scheduled in favor of
> > another host process.)
> >
> > I think it's interesting that the min instruction emulation time is 12
> > usecs. In fact, our "null" exits where we do almost no
> > processing (ISI,
> > DSI, syscall) are 11 usecs, which suggests we have a problem with
> > interrupt handler overhead (for all exit types).
> >
>
> Will you consider about moving tlb manipulation out of entering path?
> Examining the modify array may cost some time.
Yes, definitely.
That's about the only thing I can see in the 440 interrupt path that
might take significant time. Unfortunately we can't profile that code
because we have no performance counters, so finding the problem will
require some experimentation.
--
Hollis Blanchard
IBM Linux Technology Center
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (2 preceding siblings ...)
2008-09-25 15:18 ` Hollis Blanchard
@ 2008-10-02 12:02 ` Christian Ehrhardt
2008-10-07 14:36 ` Christian Ehrhardt
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-02 12:02 UTC (permalink / raw)
To: kvm-ppc
[-- Attachment #1: Type: text/plain, Size: 7192 bytes --]
Today I refactored my code in a way that it might eventually go upstream.
That means I removed some debug blocks, made it configurable, and now
print output only on request via sysrq.
Feel free to review that code now, or wait for my separate RFC post that
will follow ~next week.
Additionally, I realized in a discussion that my host system was still
configured in demo mode (X11+xfce, dhcpd, avahi, ... running on the host).
I took some updated numbers without all those background activities -
the test case was again boot, login, ls, kill.
I intend to examine the irq path in detail as suggested, and first of
all I'll try to get some timing for the "segments" of such a call.
This should show us where we lose our "constant base time" and give me
some hints for looking at it in detail.
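One possible way to take those segment timings (a hypothetical sketch,
not necessarily what I will end up using): stamp the timebase at a few
fixed points along the exit path and charge the deltas to named
segments.

/* Hypothetical segment-timing helper: call seg_mark() at the end of
 * each segment; the elapsed timebase ticks since the previous mark are
 * charged to that segment. */
#include <asm/time.h>   /* get_tb() */

enum exit_segment {
	SEG_ASM_SAVE,     /* save guest state (asm) */
	SEG_TO_HANDLER,   /* reach the C exit handler */
	SEG_HANDLE,       /* per-exit handling */
	SEG_POST,         /* post-processing */
	SEG_ASM_RESTORE,  /* return and restore guest state (asm) */
	NR_SEGMENTS
};

static u64 seg_last_stamp;
static u64 seg_sum[NR_SEGMENTS];

static inline void seg_mark(enum exit_segment seg)
{
	u64 now = get_tb();   /* full 64-bit timebase */

	seg_sum[seg] += now - seg_last_stamp;
	seg_last_stamp = now;
}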
FYI - new numbers without high background load (monospace recommended):
NOPV:
MMIO        count   10105  min 46  max    1534  sum   2952467  avg 292.1788  stddev 295.933  %  2.04
DCR         count    5428  min 40  max     209  sum    246168  avg  45.3515  stddev   6.758  %  0.17
SIGNAL      count     695  min 65  max    3755  sum     89767  avg 129.1612  stddev 314.421  %  0.06
ITLBREAL    count   80045  min 13  max     108  sum   1063127  avg  13.2816  stddev   2.338  %  0.73
ITLBVIRT    count 1000585  min 21  max  264827  sum  24300596  avg  24.2864  stddev 264.753  % 16.78
DTLBREAL    count   91206  min 13  max      69  sum   1285214  avg  14.0913  stddev   2.225  %  0.89
DTLBVIRT    count  977434  min 21  max    1446  sum  24007008  avg  24.5613  stddev   4.426  % 16.58
SYSCALL     count   10460  min 11  max      55  sum    116447  avg  11.1326  stddev   1.929  %  0.08
ISI         count   11724  min 11  max      61  sum    130007  avg  11.0890  stddev   1.929  %  0.09
DSI         count   20737  min 11  max      57  sum    230009  avg  11.0917  stddev   1.914  %  0.16
EMULINST    count 5683356  min 11  max    3778  sum  79339467  avg  13.9600  stddev  50.275  % 54.78
DEC         count   13079  min 50  max     826  sum    732712  avg  56.0220  stddev  22.382  %  0.51
EXTINT      count      55  min 30  max    1478  sum     10996  avg 199.9273  stddev 238.150  %  0.01
FP_UNAVAIL  count     280  min 11  max      53  sum      3163  avg  11.2964  stddev   3.495  %  0.00
TIMEINGUEST count 7905189  min  0  max    3688  sum  10330742  avg   1.3068  stddev   8.970  %  7.13
csum 15810378
sumsum 144837890 => ~2:24 runtime
sumavg 7241894
PV:
MMIO        count   12505  min 46  max    3087  sum   3693782  avg 295.3844  stddev 260.788  %  4.01
DCR         count    5595  min 40  max     706  sum    273578  avg  48.8969  stddev  31.305  %  0.30
SIGNAL      count     654  min 65  max    4132  sum    300027  avg 458.7569  stddev 571.130  %  0.33
ITLBREAL    count   71711  min 13  max     104  sum    943053  avg  13.1507  stddev   2.360  %  1.02
ITLBVIRT    count  750649  min 21  max    1503  sum  18178245  avg  24.2167  stddev   7.335  % 19.71
DTLBREAL    count   83356  min 13  max     102  sum   1146242  avg  13.7512  stddev   2.406  %  1.24
DTLBPV      count   30086  min 20  max     237  sum    653556  avg  21.7229  stddev   4.639  %  0.71
DTLBVIRT    count  772811  min 21  max     713  sum  19079477  avg  24.6884  stddev   6.593  % 20.69
SYSCALL     count    7647  min 11  max      57  sum     84821  avg  11.0921  stddev   1.897  %  0.09
HCALL       count       1  min 19  max      19  sum        19  avg  19.0000  stddev   0.000  %  0.00
ISI         count    9895  min 11  max      73  sum    109667  avg  11.0831  stddev   1.904  %  0.12
DSI         count   17974  min 10  max      57  sum    199504  avg  11.0996  stddev   2.046  %  0.22
EMULINST    count 2567245  min 11  max    4212  sum  40501314  avg  15.7762  stddev  65.673  % 43.92
DEC         count    7488  min 51  max     641  sum    426813  avg  56.9996  stddev  23.893  %  0.46
EXTINT      count    2215  min 31  max    1677  sum    297495  avg 134.3093  stddev 116.219  %  0.32
FP_UNAVAIL  count     258  min 11  max      11  sum      2838  avg  11.0000  stddev   0.000  %  0.00
TIMEINGUEST count 4340090  min  0  max    3850  sum   6316079  avg   1.4553  stddev  12.599  %  6.85
csum 8680180
sumsum 92206510 => ~1:32 runtime
sumavg 4610325
Hollis Blanchard wrote:
> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>> count min max
>>>> sum avg stddev time%
>>>> EMULINST 6,423,699 12 247,582
>>> 91,732,705
>>>> 14.2804 241.200 45.36
>>>> ITLBVIRT 1,777,242 21 264,257
>>> 47,116,557
>>>> 26.5111 286.040 23.30
>>>> DTLBVIRT 1,544,241 22 263,947
>>> 41,765,447
>>>> 27.0459 218.997 20.65
>>> (The max here is of course when the guest was de-scheduled in favor of
>>> another host process.)
>>>
>>> I think it's interesting that the min instruction emulation time is 12
>>> usecs. In fact, our "null" exits where we do almost no
>>> processing (ISI,
>>> DSI, syscall) are 11 usecs, which suggests we have a problem with
>>> interrupt handler overhead (for all exit types).
>>>
>> Will you consider about moving tlb manipulation out of entering path?
>> Examining the modify array may cost some time.
>
> Yes, definitely.
>
> That's about the only thing I can see in the 440 interrupt path that
> might take significant time. Unfortunately we can't profile that code
> because we have no performance counters, so finding the problem will
> require some experimentation.
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: kvmppc-exit-timing.diff --]
[-- Type: text/x-diff; name="kvmppc-exit-timing.diff", Size: 13787 bytes --]
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -348,5 +348,12 @@
DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
#endif
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbu));
+ DEFINE(VCPU_TIMING_EXIT_TBL, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbl));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBU, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbu));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBL, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbl));
+#endif
+
return 0;
}
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -37,6 +37,14 @@
Provides host support for KVM on Book E PowerPC processors. Currently
this works on 440 processors only.
+config KVM_BOOKE_EXIT_TIMING
+ bool "Trace detailed exit Timing"
+ depends on KVM_BOOKE_HOST
+ ---help---
+ Inserts code to trace timestamps for every exit/enter cycle. A report
+ can be triggered using debugfs entry TODO.
+ This adds overhead, it's not recommended for production environments.
+
config KVM_TRACE
bool "KVM trace support"
depends on KVM && MARKERS && SYSFS
diff --git a/arch/powerpc/kvm/booke_guest.c b/arch/powerpc/kvm/booke_guest.c
--- a/arch/powerpc/kvm/booke_guest.c
+++ b/arch/powerpc/kvm/booke_guest.c
@@ -56,6 +56,21 @@
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ NULL }
};
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+u32 last_exit_type;
+u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_last_exit;
+struct sysrq_key_op trigger_print_exit_timing_op = {
+ .handler = print_exit_timing_handler,
+ .help_msg = "Print KVM PowerPC exit timings",
+ .action_msg = "Printing KVM PowerPC exit timings",
+};
+#endif
static const u32 interrupt_msr_mask[16] = {
[BOOKE_INTERRUPT_CRITICAL] = MSR_ME,
@@ -261,6 +276,9 @@
enum emulation_result er;
int r = RESUME_HOST;
+ /* update before a new last_exit_type is written */
+ update_exit_timing(&(vcpu->arch));
+
local_irq_enable();
run->exit_reason = KVM_EXIT_UNKNOWN;
@@ -285,10 +303,13 @@
* misses before ceding control. */
if (need_resched())
cond_resched();
- if (exit_nr == BOOKE_INTERRUPT_DECREMENTER)
+ if (exit_nr == BOOKE_INTERRUPT_DECREMENTER) {
vcpu->stat.dec_exits++;
- else
+ set_last_exit_type(DEC_EXITS);
+ } else {
vcpu->stat.ext_intr_exits++;
+ set_last_exit_type(EXT_INTR_EXITS);
+ }
r = RESUME_GUEST;
break;
@@ -299,6 +320,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, BOOKE_INTERRUPT_PROGRAM);
r = RESUME_GUEST;
+ set_last_exit_type(USR_PR_INST);
break;
}
@@ -308,10 +330,12 @@
/* Future optimization: only reload non-volatiles if
* they were actually modified by emulation. */
vcpu->stat.emulated_inst_exits++;
+ set_last_exit_type(EMULATED_INST_EXITS);
r = RESUME_GUEST_NV;
break;
case EMULATE_DO_DCR:
run->exit_reason = KVM_EXIT_DCR;
+ set_last_exit_type(DCR_EXITS);
r = RESUME_HOST;
break;
case EMULATE_FAIL:
@@ -331,6 +355,7 @@
case BOOKE_INTERRUPT_FP_UNAVAIL:
kvmppc_queue_exception(vcpu, exit_nr);
+ set_last_exit_type(FP_UNAVAIL);
r = RESUME_GUEST;
break;
@@ -339,6 +364,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.dsi_exits++;
+ set_last_exit_type(DSI_EXITS);
r = RESUME_GUEST;
break;
@@ -346,6 +372,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.isi_exits++;
+ set_last_exit_type(ISI_EXITS);
r = RESUME_GUEST;
break;
@@ -353,9 +380,11 @@
if (vcpu->arch.last_inst == KVM_HYPERCALL_BIN) {
kvmppc_do_hypercall(vcpu);
vcpu->stat.hcall_exits++;
+ set_last_exit_type(HCALL_EXITS);
} else {
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.syscall_exits++;
+ set_last_exit_type(SYSCALL_EXITS);
}
r = RESUME_GUEST;
break;
@@ -370,6 +399,7 @@
vcpu->arch.pvmem_gpaddr >> KVM_PPCPV_MAGIC_PAGE_SHIFT,
0, KVM_PPCPV_MAGIC_PAGE_FLAGS);
vcpu->stat.dtlb_pvmem_miss_exits++;
+ set_last_exit_type(DTLB_PVMEM_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -382,6 +412,7 @@
vcpu->arch.dear = vcpu->arch.fault_dear;
vcpu->arch.esr = vcpu->arch.fault_esr;
vcpu->stat.dtlb_real_miss_exits++;
+ set_last_exit_type(DTLB_REAL_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -399,11 +430,13 @@
kvmppc_mmu_map(vcpu, eaddr, gfn, gtlbe->tid,
gtlbe->word2);
vcpu->stat.dtlb_virt_miss_exits++;
+ set_last_exit_type(DTLB_VIRT_MISS_EXITS);
r = RESUME_GUEST;
} else {
/* Guest has mapped and accessed a page which is not
* actually RAM. */
r = kvmppc_emulate_mmio(run, vcpu);
+ set_last_exit_type(MMIO_EXITS);
}
break;
@@ -422,10 +455,12 @@
/* The guest didn't have a mapping for it. */
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.itlb_real_miss_exits++;
+ set_last_exit_type(ITLB_REAL_MISS_EXITS);
break;
}
vcpu->stat.itlb_virt_miss_exits++;
+ set_last_exit_type(ITLB_VIRT_MISS_EXITS);
gfn = tlb_xlate(gtlbe, eaddr) >> PAGE_SHIFT;
@@ -458,6 +493,7 @@
mtspr(SPRN_DBSR, dbsr);
run->exit_reason = KVM_EXIT_DEBUG;
+ set_last_exit_type(DEBUG_EXITS);
r = RESUME_HOST;
break;
}
@@ -482,6 +518,7 @@
r = (-EINTR << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
} else {
vcpu->stat.light_exits++;
}
@@ -495,6 +532,7 @@
break;
case KVM_EXIT_INTR:
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
break;
}
}
@@ -535,6 +573,8 @@
* real timebase frequency. Accordingly, it must see the state of
* CCR1[TCS]. */
vcpu->arch.ccr1 = mfspr(SPRN_CCR1);
+
+ init_exit_timing(&(vcpu->arch));
return 0;
}
diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -109,6 +109,18 @@
li r6, 1
slw r6, r6, r5
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ /* save exit time */
+..exit_tbu_overflow_loop:
+ mfspr r7, SPRN_TBRU
+ mfspr r8, SPRN_TBRL
+ mfspr r9, SPRN_TBRU
+ cmpw r9, r7
+ bne ..exit_tbu_overflow_loop
+ stw r8, VCPU_TIMING_EXIT_TBL(r4)
+ stw r9, VCPU_TIMING_EXIT_TBU(r4)
+#endif
/* Save the faulting instruction and all GPRs for emulation. */
andi. r7, r6, NEED_INST_MASK
@@ -415,6 +427,18 @@
lwz r3, VCPU_SPRG7(r4)
mtspr SPRN_SPRG7, r3
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ /* save enter time */
+..enter_tbu_overflow_loop:
+ mfspr r6, SPRN_TBRU
+ mfspr r7, SPRN_TBRL
+ mfspr r8, SPRN_TBRU
+ cmpw r8, r6
+ bne ..enter_tbu_overflow_loop
+ stw r7, VCPU_TIMING_LAST_ENTER_TBL(r4)
+ stw r8, VCPU_TIMING_LAST_ENTER_TBU(r4)
+#endif
+
/* Finish loading guest volatiles and jump to guest. */
lwz r3, VCPU_CTR(r4)
mtctr r3
diff --git a/include/asm-powerpc/kvm_host.h b/include/asm-powerpc/kvm_host.h
--- a/include/asm-powerpc/kvm_host.h
+++ b/include/asm-powerpc/kvm_host.h
@@ -25,6 +25,12 @@
#include <linux/types.h>
#include <linux/kvm_types.h>
#include <asm/kvm_asm.h>
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+#include <asm/time.h>
+#include <asm-generic/div64.h>
+#include <linux/sysrq.h>
+#endif
#define KVM_MAX_VCPUS 1
#define KVM_MEMORY_SLOTS 32
@@ -66,6 +72,55 @@
u32 halt_wakeup;
};
+enum kvm_exit_types {
+ MMIO_EXITS,
+ DCR_EXITS,
+ SIGNAL_EXITS,
+ ITLB_REAL_MISS_EXITS,
+ ITLB_VIRT_MISS_EXITS,
+ DTLB_REAL_MISS_EXITS,
+ DTLB_PVMEM_MISS_EXITS,
+ DTLB_VIRT_MISS_EXITS,
+ SYSCALL_EXITS,
+ HCALL_EXITS,
+ ISI_EXITS,
+ DSI_EXITS,
+ EMULATED_INST_EXITS,
+ DEC_EXITS,
+ EXT_INTR_EXITS,
+ HALT_WAKEUP,
+ USR_PR_INST,
+ FP_UNAVAIL,
+ DEBUG_EXITS,
+ TIMEINGUEST,
+ __NUMBER_OF_KVM_EXIT_TYPES
+};
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+static const char* kvm_exit_names[__NUMBER_OF_KVM_EXIT_TYPES] = {
+ "MMIO",
+ "DCR",
+ "SIGNAL",
+ "ITLBREAL",
+ "ITLBVIRT",
+ "DTLBREAL",
+ "DTLBPV",
+ "DTLBVIRT",
+ "SYSCALL",
+ "HCALL",
+ "ISI",
+ "DSI",
+ "EMULINST",
+ "DEC",
+ "EXTINT",
+ "HALT",
+ "USR_PR_INST",
+ "FP_UNAVAIL",
+ "DEBUG",
+ "TIMEINGUEST"
+};
+#endif
+
struct tlbe {
u32 tid; /* Only the low 8 bits are used. */
u32 word0;
@@ -75,6 +130,16 @@
struct kvm_arch {
};
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+/* allow access to big endian 32bit upper/lower parts and 64bit var */
+typedef union {
+ u64 tv64;
+ struct {
+ u32 tbu, tbl;
+ } tv32;
+} exit_timing_t;
+#endif
struct kvm_vcpu_arch {
/* Unmodified copy of the guest's TLB. */
@@ -139,6 +204,11 @@
u32 dbcr0;
u32 dbcr1;
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ exit_timing_t timing_exit;
+ exit_timing_t timing_last_enter;
+#endif
+
u32 last_inst;
u32 fault_dear;
u32 fault_esr;
@@ -165,4 +235,131 @@
int singlestep;
};
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+/* helpers for instruction timing, allocated in arch/powerpc/kvm/booke_guest.c */
+extern u32 last_exit_type;
+extern u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_last_exit;
+extern struct sysrq_key_op trigger_print_exit_timing_op;
+
+static void print_exit_timing_handler(int key, struct tty_struct *tty)
+{
+ int i;
+ u64 min, max;
+
+ printk(KERN_ERR"### start exit timing statistics ###\n");
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ if (timing_min_duration[i]==0xFFFFFFFF)
+ min = 0;
+ else
+ min = timing_min_duration[i];
+ if (timing_max_duration[i]==0)
+ max = 0;
+ else
+ max = timing_max_duration[i];
+
+ printk(KERN_ERR"exit type %2d (%12s) count %10d min %10lld max %10lld sum %20lld sum_quad %20lld\n",
+ i, kvm_exit_names[i], timing_count_type[i], min, max,
+ timing_sum_duration[i], timing_sum_quad_duration[i]);
+ }
+ printk(KERN_ERR"please post calc: avg = sum/count; stddev = sqrt((sum_quad/count)-(sum_dur/count)������)\n");
+ printk(KERN_ERR"### end exit timing statistics ###\n");
+}
+
+static inline void print_exit_timing(void) {
+ print_exit_timing_handler(0,NULL);
+}
+
+static inline void init_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ int i;
+
+ last_exit_type=0xDEAD;
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ timing_count_type[i]=0;
+ timing_max_duration[i]=0;
+ timing_min_duration[i]=0xFFFFFFFF;
+ timing_sum_duration[i]=0;
+ timing_sum_quad_duration[i]=0;
+ }
+ vcpu_arch->timing_exit.tv64 = 0;
+ timing_last_exit = 0;
+ vcpu_arch->timing_last_enter.tv64 = 0;
+
+ if (register_sysrq_key('y', &trigger_print_exit_timing_op))
+ printk(KERN_ERR"%s - failed to register sysrq key\n", __func__);
+
+/* TODO unregister_sysrq_key(int key, struct sysrq_key_op *op_p) */
+}
+
+static inline void set_last_exit_type(int type)
+{
+ last_exit_type=type;
+}
+
+static inline void add_exit_timing(u64 duration, int type)
+{
+ u64 old;
+
+ do_div(duration, tb_ticks_per_usec);
+ if (unlikely(duration > 0xFFFFFFFF)) {
+ printk(KERN_ERR"%s - duration too big, **2 would overflow duration %lld max 4294967295 type %d exit # of type %d (tb_ticks_per_usec=%ld)\n",
+ __func__, duration, type, timing_count_type[type], tb_ticks_per_usec);
+ print_exit_timing();
+ }
+
+
+ timing_count_type[type]++;
+
+ /* sum */
+ old = timing_sum_duration[type];
+ timing_sum_duration[type] += duration;
+ if (unlikely(old > timing_sum_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* square sum */
+ old = timing_sum_quad_duration[type];
+ timing_sum_quad_duration[type] += (duration*duration);
+ if (unlikely(old > timing_sum_quad_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of squared durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_quad_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* set min/max */
+ if (unlikely(duration < timing_min_duration[type]))
+ timing_min_duration[type] = duration;
+ if (unlikely(duration > timing_max_duration[type]))
+ timing_max_duration[type] = duration;
+}
+
+static inline void update_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ u64 exit = timing_last_exit;
+ u64 enter = vcpu_arch->timing_last_enter.tv64;
+
+ /* save exit time to use it next exit when the related enter time is known */
+ timing_last_exit = vcpu_arch->timing_exit.tv64;
+
+ if (unlikely(last_exit_type==0xDEAD))
+ return; /* skip first incomplete enter/exit cycle */
+
+ /* update statistics for average and standard deviation */
+ add_exit_timing((enter - exit), last_exit_type);
+ /* enter -> timing_last_exit is time spent in guest - log this too */
+ add_exit_timing((timing_last_exit - enter), TIMEINGUEST);
+}
+#else
+static inline void init_exit_timing(struct kvm_vcpu_arch* vcpu_arch) {}
+static inline void set_last_exit_type(int type) {}
+static inline void update_exit_timing(struct kvm_vcpu_arch* vcpu_arch) {}
+#endif /* CONFIG_KVM_BOOKE_EXIT_TIMING */
+
#endif /* __POWERPC_KVM_HOST_H__ */
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (3 preceding siblings ...)
2008-10-02 12:02 ` Christian Ehrhardt
@ 2008-10-07 14:36 ` Christian Ehrhardt
2008-10-08 13:49 ` Christian Ehrhardt
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-07 14:36 UTC (permalink / raw)
To: kvm-ppc
As intended, I separated the time of one of our exits into its segments
to see where we lose the constant base time.
I looked at the syscall exit, which is pretty fast and does not deviate
very much.
A little tracing and debugging later, I can now show a rough breakdown
of where the time is spent for that exit:
avg time spent in the segments I checked:
a) 11.5% - exit, saving guest state (booke_interrupt.S)
b) 8.5% - reaching kvmppc_handle_exit
c) 10% - syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d) 30.5% - post-processing for all exits = kvmppc_check_and_deliver_interrupts and updating some statistics
e) 5.5% - returning from kvmppc_handle_exit to booke_interrupt.S
f) 34% - restoring guest state tlb/regs (booke_interrupt.S)
Ok, looking at these numbers I think we have two areas to check in detail.
- On one hand, the post-processing part (d) really should not take that
much time.
- On the other hand, we might be able to improve the last segment by
changing our tlb management (Liu made a suggestion on the list two weeks ago).
Thinking about this breakdown with the complete timing in mind (see
below) shows that the average time spent in one of our most frequent
exits (EMUL) is near the minimum duration. This means that if we can
reduce some of the constant overhead, it might really help us in those
frequent and hot paths (the tlb misses will also benefit from that).
I'll continue on that and keep you all updated.
Christian
Christian Ehrhardt wrote:
> Today I refactored my code in a way that it might eventually get
> upstream.
> That means I removed some debug blocks, made it configurable and print
> output only on request per sysrq.
> Feel free to review that code now, otherwise wait for my separate rfc
> post that will follow ~next week.
>
> Additionally I realized in a discussion that my host system was still
> configured in demo mode (X11+xfce, dhcpd, avahi, ... running in host).
> I took some updated numbers without all those background activities -
> testacase again boot, login, ls, kill.
>
> I intend to the irq path in detail as suggested and first of all I'll
> try to get some timing of the "segments" of such a call.
> This should show us where we loose our "constant base time" and give
> me some hints when looking at it in detail.
>
> FYI - new numbers without high background load (monospace recommended):
> NOPV:
> MMIO count 10105 min 46 max 1534
> sum 2952467 avg 292.1788 stddev 295.933 % 2.04
> DCR count 5428 min 40 max 209
> sum 246168 avg 45.3515 stddev 6.758 % 0.17
> SIGNAL count 695 min 65 max 3755
> sum 89767 avg 129.1612 stddev 314.421 % 0.06
> ITLBREAL count 80045 min 13 max 108
> sum 1063127 avg 13.2816 stddev 2.338 % 0.73
> ITLBVIRT count 1000585 min 21 max 264827
> sum 24300596 avg 24.2864 stddev 264.753 % 16.78
> DTLBREAL count 91206 min 13 max 69
> sum 1285214 avg 14.0913 stddev 2.225 % 0.89
> DTLBVIRT count 977434 min 21 max 1446
> sum 24007008 avg 24.5613 stddev 4.426 % 16.58
> SYSCALL count 10460 min 11 max 55
> sum 116447 avg 11.1326 stddev 1.929 % 0.08
> ISI count 11724 min 11 max 61
> sum 130007 avg 11.0890 stddev 1.929 % 0.09
> DSI count 20737 min 11 max 57
> sum 230009 avg 11.0917 stddev 1.914 % 0.16
> EMULINST count 5683356 min 11 max 3778
> sum 79339467 avg 13.9600 stddev 50.275 % 54.78
> DEC count 13079 min 50 max 826
> sum 732712 avg 56.0220 stddev 22.382 % 0.51
> EXTINT count 55 min 30 max 1478
> sum 10996 avg 199.9273 stddev 238.150 % 0.01
> FP_UNAVAIL count 280 min 11 max 53
> sum 3163 avg 11.2964 stddev 3.495 % 0.00
> TIMEINGUEST count 7905189 min 0 max 3688
> sum 10330742 avg 1.3068 stddev 8.970 % 7.13
> csum 15810378
> sumsum 144837890 => ~2:24 runtime
>
> sumavg 7241894
>
> PV:
> MMIO count 12505 min 46 max 3087
> sum 3693782 avg 295.3844 stddev 260.788 % 4.01
> DCR count 5595 min 40 max 706
> sum 273578 avg 48.8969 stddev 31.305 % 0.30
> SIGNAL count 654 min 65 max 4132
> sum 300027 avg 458.7569 stddev 571.130 % 0.33
> ITLBREAL count 71711 min 13 max 104
> sum 943053 avg 13.1507 stddev 2.360 % 1.02
> ITLBVIRT count 750649 min 21 max 1503
> sum 18178245 avg 24.2167 stddev 7.335 % 19.71
> DTLBREAL count 83356 min 13 max 102
> sum 1146242 avg 13.7512 stddev 2.406 % 1.24
> DTLBPV count 30086 min 20 max 237
> sum 653556 avg 21.7229 stddev 4.639 % 0.71
> DTLBVIRT count 772811 min 21 max 713
> sum 19079477 avg 24.6884 stddev 6.593 % 20.69
> SYSCALL count 7647 min 11 max 57
> sum 84821 avg 11.0921 stddev 1.897 % 0.09
> HCALL count 1 min 19 max 19
> sum 19 avg 19.0000 stddev 0.000 % 0.00
> ISI count 9895 min 11 max 73
> sum 109667 avg 11.0831 stddev 1.904 % 0.12
> DSI count 17974 min 10 max 57
> sum 199504 avg 11.0996 stddev 2.046 % 0.22
> EMULINST count 2567245 min 11 max 4212
> sum 40501314 avg 15.7762 stddev 65.673 % 43.92
> DEC count 7488 min 51 max 641
> sum 426813 avg 56.9996 stddev 23.893 % 0.46
> EXTINT count 2215 min 31 max 1677
> sum 297495 avg 134.3093 stddev 116.219 % 0.32
> FP_UNAVAIL count 258 min 11 max 11
> sum 2838 avg 11.0000 stddev 0.000 % 0.00
> TIMEINGUEST count 4340090 min 0 max 3850
> sum 6316079 avg 1.4553 stddev 12.599 % 6.85
> csum 8680180
> sumsum 92206510 => ~1:32 runtime
>
> sumavg 4610325
>
> Hollis Blanchard wrote:
>> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>>> count min max
>>>>> sum avg stddev time%
>>>>> EMULINST 6,423,699 12 247,582
>>>> 91,732,705
>>>>> 14.2804 241.200 45.36
>>>>> ITLBVIRT 1,777,242 21 264,257
>>>> 47,116,557
>>>>> 26.5111 286.040 23.30
>>>>> DTLBVIRT 1,544,241 22 263,947
>>>> 41,765,447
>>>>> 27.0459 218.997 20.65
>>>> (The max here is of course when the guest was de-scheduled in favor of
>>>> another host process.)
>>>>
>>>> I think it's interesting that the min instruction emulation time is 12
>>>> usecs. In fact, our "null" exits where we do almost no processing
>>>> (ISI,
>>>> DSI, syscall) are 11 usecs, which suggests we have a problem with
>>>> interrupt handler overhead (for all exit types).
>>>>
>>> Will you consider about moving tlb manipulation out of entering path?
>>> Examining the modify array may cost some time.
>>
>> Yes, definitely.
>>
>> That's about the only thing I can see in the 440 interrupt path that
>> might take significant time. Unfortunately we can't profile that code
>> because we have no performance counters, so finding the problem will
>> require some experimentation.
>
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (4 preceding siblings ...)
2008-10-07 14:36 ` Christian Ehrhardt
@ 2008-10-08 13:49 ` Christian Ehrhardt
2008-10-08 15:41 ` Hollis Blanchard
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-08 13:49 UTC (permalink / raw)
To: kvm-ppc
Wondering about that 30.5% for post-processing and
kvmppc_check_and_deliver_interrupts, I quickly checked it in detail -
part d is now divided into 4 subparts.
I also looked at the return-to-guest path to see whether the expected
part (restoring the tlb) really is the main time eater there. The result
clearly shows that it is.
more detailed breakdown:
a) 10.94% - exit, saving guest state (booke_interrupt.S)
b) 8.12% - reaching kvmppc_handle_exit
c) 7.59% - syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1) 3.33% - some checks for all exits
d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
d3) 17.20% - can_deliver/clear&deliver exception in kvmppc_check_and_deliver_interrupts
d4) 4.47% - updating kvm_stat statistics
e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
f1) 29.18% - restoring guest tlb
f2) 4.69% - restoring guest state ([s]regs)
These fractions are % of our ~12µs syscall exit.
=> restoring tlb on each reenter = 4µs constant overhead
=> looking a bit into irq delivery and other constant things like kvm_stat updating
---
I changed some of the code for the statistics updating and the interrupt
delivery and got this:
      base    -  impirq (d3)  -  impstat (d5)  -  impboth
a)   12.57%   -   11.13%      -   12.05%       -   11.03%   exit, saving guest state (booke_interrupt.S)
b)    7.37%   -    9.38%      -    8.69%       -    8.07%   reaching kvmppc_handle_exit
c)    7.38%   -    7.20%      -    7.49%       -    9.78%   syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1)   2.49%   -    3.39%      -    2.56%       -    3.30%   some checks for all exits
d2)   8.84%   -    8.56%      -    9.28%       -    8.31%   finding first bit in kvmppc_check_and_deliver_interrupts
d3)   6.53%   -    5.25%      -    6.63%       -    5.10%   can_deliver in kvmppc_check_and_deliver_interrupts
d4)  13.66%   -   15.37%      -   14.12%       -   14.92%   clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)   3.65%   -    4.57%      -    2.68%       -    4.41%   updating kvm_stat statistics
e)    6.55%   -    6.30%      -    6.30%       -    5.89%   returning from kvmppc_handle_exit to booke_interrupt.S
f1)  30.90%   -   28.78%      -   30.16%       -   29.16%   restoring guest tlb
f2)   4.81%   -    4.77%      -    5.06%       -    4.66%   restoring guest state ([s]regs)
Well, the improvements (if any) are too close to the measuring
inaccuracy. Therefore I can't draw any conclusions right now, but the
patches at least seem to work.
I will test them soon with a longer test case and with some of the
measuring overhead removed, to avoid some of that inaccuracy.
The patch to our kvm_stat counters would make sense for style reasons
alone, so it is worth trying :-)
Now I'll go for the TLB replacement in f1.
Christian
Christian Ehrhardt wrote:
> As intended I separated the time of one of our exits into it's
> segments to see where we loose the constant base time.
> I looked at the syscall exit which is pretty fast and does not deviate
> very much.
>
> A little tracing and debugging later I can now show the rough
> breakdown where the time is spent for that exit:
>
> avg time spent in the segments I checked:
> a) 11.5% - exit, saving guest state (booke_interrupt.S)
> b) 8.5% - reaching kvmppc_handle_exit
> c) 10% - syscall exit is checked and a interrupt is queued using
> kvmppc_queue_exception
> d) 30.5% - postprocessing for all exits =
> kvmppc_check_and_deliver_interrupts and updating some statistics
> e) 5.5% - returning from kvmppc_handle_exit to booke_interrupt.S
> f) 34% - restoring guest state tlb/regs (booke_interrupt.S)
>
> Ok, looking at these numbers I think we have two areas to check in
> detail.
> - On one hand the post processing part (d) actually should not take
> that much time.
> - On the other hand we might be able to improve the last segment by
> chanign our tlb management (Liu made a suggestion on the list two
> weeks ago)
>
> When thinking about this breakdown while having the complete timing in
> mind (see below) shows us that the average time spent for one of our
> most frequent exit is near the minimum duration (EMUL). This means if
> we can reduce some of the constant overehad this might really help us
> in those frequent and hot paths (also birt tlb misses will benefit
> from that).
>
> I continue on that and keep you all updated.
> Christian
>
> Christian Ehrhardt wrote:
>> Today I refactored my code in a way that it might eventually get
>> upstream.
>> That means I removed some debug blocks, made it configurable and
>> print output only on request per sysrq.
>> Feel free to review that code now, otherwise wait for my separate rfc
>> post that will follow ~next week.
>>
>> Additionally I realized in a discussion that my host system was still
>> configured in demo mode (X11+xfce, dhcpd, avahi, ... running in host).
>> I took some updated numbers without all those background activities -
>> testacase again boot, login, ls, kill.
>>
>> I intend to the irq path in detail as suggested and first of all I'll
>> try to get some timing of the "segments" of such a call.
>> This should show us where we loose our "constant base time" and give
>> me some hints when looking at it in detail.
>>
>> FYI - new numbers without high background load (monospace recommended):
>> NOPV:
>> MMIO count 10105 min 46 max 1534
>> sum 2952467 avg 292.1788 stddev 295.933 %
>> 2.04
>> DCR count 5428 min 40 max 209
>> sum 246168 avg 45.3515 stddev 6.758 %
>> 0.17
>> SIGNAL count 695 min 65 max 3755
>> sum 89767 avg 129.1612 stddev 314.421 %
>> 0.06
>> ITLBREAL count 80045 min 13 max 108
>> sum 1063127 avg 13.2816 stddev 2.338 %
>> 0.73
>> ITLBVIRT count 1000585 min 21 max 264827
>> sum 24300596 avg 24.2864 stddev 264.753 %
>> 16.78
>> DTLBREAL count 91206 min 13 max 69
>> sum 1285214 avg 14.0913 stddev 2.225 %
>> 0.89
>> DTLBVIRT count 977434 min 21 max 1446
>> sum 24007008 avg 24.5613 stddev 4.426 %
>> 16.58
>> SYSCALL count 10460 min 11 max 55
>> sum 116447 avg 11.1326 stddev 1.929 %
>> 0.08
>> ISI count 11724 min 11 max 61
>> sum 130007 avg 11.0890 stddev 1.929 %
>> 0.09
>> DSI count 20737 min 11 max 57
>> sum 230009 avg 11.0917 stddev 1.914 %
>> 0.16
>> EMULINST count 5683356 min 11 max 3778
>> sum 79339467 avg 13.9600 stddev 50.275 %
>> 54.78
>> DEC count 13079 min 50 max 826
>> sum 732712 avg 56.0220 stddev 22.382 %
>> 0.51
>> EXTINT count 55 min 30 max 1478
>> sum 10996 avg 199.9273 stddev 238.150 %
>> 0.01
>> FP_UNAVAIL count 280 min 11 max 53
>> sum 3163 avg 11.2964 stddev 3.495 %
>> 0.00
>> TIMEINGUEST count 7905189 min 0 max 3688
>> sum 10330742 avg 1.3068 stddev 8.970 %
>> 7.13
>> csum 15810378
>> sumsum 144837890 => ~2:24 runtime
>>
>> sumavg 7241894
>>
>> PV:
>> MMIO count 12505 min 46 max 3087
>> sum 3693782 avg 295.3844 stddev 260.788 %
>> 4.01
>> DCR count 5595 min 40 max 706
>> sum 273578 avg 48.8969 stddev 31.305 %
>> 0.30
>> SIGNAL count 654 min 65 max 4132
>> sum 300027 avg 458.7569 stddev 571.130 %
>> 0.33
>> ITLBREAL count 71711 min 13 max 104
>> sum 943053 avg 13.1507 stddev 2.360 %
>> 1.02
>> ITLBVIRT count 750649 min 21 max 1503
>> sum 18178245 avg 24.2167 stddev 7.335 %
>> 19.71
>> DTLBREAL count 83356 min 13 max 102
>> sum 1146242 avg 13.7512 stddev 2.406 %
>> 1.24
>> DTLBPV count 30086 min 20 max 237
>> sum 653556 avg 21.7229 stddev 4.639 %
>> 0.71
>> DTLBVIRT count 772811 min 21 max 713
>> sum 19079477 avg 24.6884 stddev 6.593 %
>> 20.69
>> SYSCALL count 7647 min 11 max 57
>> sum 84821 avg 11.0921 stddev 1.897 %
>> 0.09
>> HCALL count 1 min 19 max 19
>> sum 19 avg 19.0000 stddev 0.000 %
>> 0.00
>> ISI count 9895 min 11 max 73
>> sum 109667 avg 11.0831 stddev 1.904 %
>> 0.12
>> DSI count 17974 min 10 max 57
>> sum 199504 avg 11.0996 stddev 2.046 %
>> 0.22
>> EMULINST count 2567245 min 11 max 4212
>> sum 40501314 avg 15.7762 stddev 65.673 %
>> 43.92
>> DEC count 7488 min 51 max 641
>> sum 426813 avg 56.9996 stddev 23.893 %
>> 0.46
>> EXTINT count 2215 min 31 max 1677
>> sum 297495 avg 134.3093 stddev 116.219 %
>> 0.32
>> FP_UNAVAIL count 258 min 11 max 11
>> sum 2838 avg 11.0000 stddev 0.000 %
>> 0.00
>> TIMEINGUEST count 4340090 min 0 max 3850
>> sum 6316079 avg 1.4553 stddev 12.599 %
>> 6.85
>> csum 8680180
>> sumsum 92206510 => ~1:32 runtime
>>
>> sumavg 4610325
>>
>> Hollis Blanchard wrote:
>>> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>>>> count min max
>>>>>> sum avg stddev time%
>>>>>> EMULINST 6,423,699 12 247,582
>>>>> 91,732,705
>>>>>> 14.2804 241.200 45.36
>>>>>> ITLBVIRT 1,777,242 21 264,257
>>>>> 47,116,557
>>>>>> 26.5111 286.040 23.30
>>>>>> DTLBVIRT 1,544,241 22 263,947
>>>>> 41,765,447
>>>>>> 27.0459 218.997 20.65
>>>>> (The max here is of course when the guest was de-scheduled in
>>>>> favor of
>>>>> another host process.)
>>>>>
>>>>> I think it's interesting that the min instruction emulation time
>>>>> is 12
>>>>> usecs. In fact, our "null" exits where we do almost no processing
>>>>> (ISI,
>>>>> DSI, syscall) are 11 usecs, which suggests we have a problem with
>>>>> interrupt handler overhead (for all exit types).
>>>>>
>>>> Will you consider about moving tlb manipulation out of entering path?
>>>> Examining the modify array may cost some time.
>>>
>>> Yes, definitely.
>>>
>>> That's about the only thing I can see in the 440 interrupt path that
>>> might take significant time. Unfortunately we can't profile that code
>>> because we have no performance counters, so finding the problem will
>>> require some experimentation.
>>
>>
>
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (5 preceding siblings ...)
2008-10-08 13:49 ` Christian Ehrhardt
@ 2008-10-08 15:41 ` Hollis Blanchard
2008-10-09 8:02 ` Christian Ehrhardt
` (3 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hollis Blanchard @ 2008-10-08 15:41 UTC (permalink / raw)
To: kvm-ppc
On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
> Wondering about that 30.5% for postprocessing and
> kvmppc_check_and_deliver_interrupts I quickly checked that in detail -
> part d is now divided in 4 subparts.
> I also looked at the return to guest path if the expected part
> (restoring tlb) is really the main time eater there. The result shows
> clearly that it is.
>
> more detailed breakdown:
> a) 10.94% - exit, saving guest state (booke_interrupt.S)
> b) 8.12% - reaching kvmppc_handle_exit
> c) 7.59% - syscall exit is checked and a interrupt is queued using
> kvmppc_queue_exception
> d1) 3.33% - some checks for all exits
> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
> d3) 17.20% - can_deliver/clear&deliver exception in
> kvmppc_check_and_deliver_interrupts
> d4) 4.47% - updating kvm_stat statistics
> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
> f1) 29.18% - restoring guest tlb
> f2) 4.69% - restoring guest state ([s]regs)
>
> These fractions are % of our ~12µs syscall exit.
> => restoring tlb on each reenter = 4µs constant overhead
> => looking a bit into irq delivery and other constant things like
> kvm_stat updating
>
...
>
> Now I go for the TLB replacement in f1.
Hang on... does d3 make sense to you? It doesn't to me, and if there's a
bug there it will be easier to fix than rewriting the TLB code. :)
I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
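(Spelling that out: 17.20% of 12,000 ns is 2,064 ns, and 2,064 ns at
1.5 ns per cycle is about 1,376 cycles - so roughly 1300-1400, with the
12µs syscall exit time itself being an approximation.)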
Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
that time is going. We're assuming the find_first_bit() loop usually
executes once, for syscall. Does it actually execute more than that? I
don't expect any of kvmppc_can_deliver_interrupt(),
kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
take lots of time.
Could it be cache effects? exception_priority[] and priority_exception[]
are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
fit into one... except they're not aligned.
Also, it looks like we use the generic find_first_bit(). That may be
more expensive than we'd like. However, since
vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
bitfield), we should be able to use ffs() instead, which has an
optimized PowerPC implementation. That might help a lot.
We might even be able to replace find_next_bit() too, by shifting a mask
over each loop, but I don't think we'll have to, since I expect the
common case to be we can deliver the first pending exception. (Worth
checking? :)
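To make that concrete, here is a minimal, hedged sketch (standalone
user-space C, not the actual kvmppc code; can_deliver()/deliver() are
stand-ins for the real kvmppc_* helpers, and __builtin_ctzl stands in
for an arch-optimized ffs()) of scanning a single-long pending bitmap
and masking off visited bits instead of calling find_next_bit():

#include <stdio.h>

static int can_deliver(unsigned int priority)
{
        (void)priority;
        return 1;       /* stand-in for kvmppc_can_deliver_interrupt() */
}

static void deliver(unsigned int priority)
{
        /* stand-in for kvmppc_booke_clear_exception() +
         * kvmppc_booke_deliver_interrupt() */
        printf("delivering priority %u\n", priority);
}

static void deliver_pending(unsigned long pending)
{
        while (pending) {
                /* lowest set bit via a single ctz-class instruction;
                 * roughly what an optimized __ffs()/ffs() does */
                unsigned int priority = (unsigned int)__builtin_ctzl(pending);

                if (can_deliver(priority))
                        deliver(priority);

                /* drop the lowest set bit instead of calling the
                 * generic find_next_bit() again */
                pending &= pending - 1;
        }
}

int main(void)
{
        deliver_pending(0x89UL);        /* bits 0, 3 and 7 pending */
        return 0;
}

Whether the masking variant buys anything depends on how often more than
one bit is set at exit time - which is exactly the "worth checking"
question above.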
--
Hollis Blanchard
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (6 preceding siblings ...)
2008-10-08 15:41 ` Hollis Blanchard
@ 2008-10-09 8:02 ` Christian Ehrhardt
2008-10-09 9:35 ` Christian Ehrhardt
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-09 8:02 UTC (permalink / raw)
To: kvm-ppc
Hollis Blanchard wrote:
> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>
>> Wondering about that 30.5% for postprocessing and
>> kvmppc_check_and_deliver_interrupts I quickly checked that in detail -
>> part d is now divided in 4 subparts.
>> I also looked at the return to guest path if the expected part
>> (restoring tlb) is really the main time eater there. The result shows
>> clearly that it is.
>>
>> more detailed breakdown:
>> a) 10.94% - exit, saving guest state (booke_interrupt.S)
>> b) 8.12% - reaching kvmppc_handle_exit
>> c) 7.59% - syscall exit is checked and a interrupt is queued using
>> kvmppc_queue_exception
>> d1) 3.33% - some checks for all exits
>> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>> d3) 17.20% - can_deliver/clear&deliver exception in
>> kvmppc_check_and_deliver_interrupts
>> d4) 4.47% - updating kvm_stat statistics
>> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>> f1) 29.18% - restoring guest tlb
>> f2) 4.69% - restoring guest state ([s]regs)
>>
>> These fractions are % of our ~12µs syscall exit.
>> => restoring tlb on each reenter = 4µs constant overhead
>> => looking a bit into irq delivery and other constant things like
>> kvm_stat updating
>>
>>
> ...
>
>> Now I go for the TLB replacement in f1.
>>
>
> Hang on... does d3 make sense to you? It doesn't to me, and if there's a
> bug there it will be easier to fix than rewriting the TLB code. :)
>
I have not given up on improving that part either :-)
> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>
I get the same results. 1% ~ 80 cycles.
> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
> that time is going. We're assuming the find_first_bit() loop usually
> executes once, for syscall. Does it actually execute more than that? I
> don't expect any of kvmppc_can_deliver_interrupt(),
> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
> take lots of time.
>
You can see below that I already had a more detailed breakdown in my old
mail:
[...]
d2) 8.84% - 8.56% - 9.28% - 8.31% finding first bit in
kvmppc_check_and_deliver_interrupts
d3) 6.53% - 5.25% - 6.63% - 5.10% can_deliver in
kvmppc_check_and_deliver_interrupts
d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver
exception in kvmppc_check_and_deliver_interrupts
[...]
> Could it be cache effects? exception_priority[] and priority_exception[]
> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
> fit into one... except they're not aligned.
>
I would be so happy if I had hardware performance counters for things
like cache misses :-)
> Also, it looks like we use the generic find_first_bit(). That may be
> more expensive than we'd like. However, since
> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
> bitfield), we should be able to use ffs() instead, which has an
> optimized PowerPC implementation. That might help a lot.
>
good idea.
I'll check this and some other small improvements I have in mind.
> We might even be able to replace find_next_bit() too, by shifting a mask
> over each loop, but I don't think we'll have to, since I expect the
> common case to be we can deliver the first pending exception. (Worth
> checking? :)
>
I'm not sure. It's certainly worth checking how often that second
find_next_bit is called.
If that number turns out to be very small, it's not worth it.
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (7 preceding siblings ...)
2008-10-09 8:02 ` Christian Ehrhardt
@ 2008-10-09 9:35 ` Christian Ehrhardt
2008-10-09 14:49 ` Christian Ehrhardt
2008-10-10 8:32 ` Christian Ehrhardt
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-09 9:35 UTC (permalink / raw)
To: kvm-ppc
I modified the code according to your comments and my own ideas; the new
values are shown in the column impISF (irq delivery, Stat, FindFirstBit).
I changed some of the statistics-updating and interrupt-delivery code
and got this:
      base - impirq (d3) - impstat (d5) - impboth - impISF
a)  12.57% - 11.13% - 12.05% - 11.03% - 12.28%  exit, saving guest state (booke_interrupt.S)
b)   7.37% -  9.38% -  8.69% -  8.07% - 10.13%  reaching kvmppc_handle_exit
c)   7.38% -  7.20% -  7.49% -  9.78% -  7.85%  syscall exit is checked and an interrupt is queued
                                                using kvmppc_queue_exception
d1)  2.49% -  3.39% -  2.56% -  3.30% -  3.70%  some checks for all exits
d2)  8.84% -  8.56% -  9.28% -  8.31% -  6.07%  finding first bit in kvmppc_check_and_deliver_interrupts
d3)  6.53% -  5.25% -  6.63% -  5.10% -  4.27%  can_deliver in kvmppc_check_and_deliver_interrupts
d4) 13.66% - 15.37% - 14.12% - 14.92% - 13.96%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)  3.65% -  4.57% -  2.68% -  4.41% -  3.77%  updating kvm_stat statistics
e)   6.55% -  6.30% -  6.30% -  5.89% -  6.74%  returning from kvmppc_handle_exit to booke_interrupt.S
f1) 30.90% - 28.78% - 30.16% - 29.16% - 31.19%  restoring guest tlb
f2)  4.81% -  4.77% -  5.06% -  4.66% -  5.17%  restoring guest state ([s]regs)
We can all see the measurement inaccuracy, but the last columns look good
in the improved sections d2, d3 and d4.
I'll remove this detailed tracing soon and run a larger test, hoping that
it will not show this inaccuracy.
But for now I still wonder about the ~14% for clear&deliver - it just
should not be "that" much.
It is worth looking into that section once again in more detail first.
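For reference, a minimal sketch of the kind of per-section accounting
used for the a)..f2) breakdown - assuming each section is simply
bracketed by timebase reads; the section names, array and macro below
are illustrative only, not the actual tracing patch, and the snippet is
PowerPC-only:

#include <stdio.h>

/* PowerPC-only: read the lower timebase register */
static inline unsigned long read_tbl(void)
{
        unsigned long tbl;
        asm volatile("mftb %0" : "=r" (tbl));
        return tbl;
}

enum { SEC_A_SAVE, SEC_D4_CLEAR_DELIVER, SEC_F1_RESTORE_TLB, SEC_MAX };

static unsigned long section_sum[SEC_MAX];
static unsigned long section_cnt[SEC_MAX];

/* bracket a section with timebase reads and account the delta */
#define ACCOUNT_SECTION(sec, stmt) do {                 \
        unsigned long __t0 = read_tbl();                \
        stmt;                                           \
        section_sum[sec] += read_tbl() - __t0;          \
        section_cnt[sec]++;                             \
} while (0)

int main(void)
{
        volatile int dummy = 0;

        ACCOUNT_SECTION(SEC_D4_CLEAR_DELIVER, dummy++);
        printf("d4: %lu timebase ticks over %lu calls\n",
               section_sum[SEC_D4_CLEAR_DELIVER],
               section_cnt[SEC_D4_CLEAR_DELIVER]);
        return 0;
}

The timebase reads themselves add a few cycles per section, which is
presumably part of the measurement inaccuracy mentioned above.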
Christian Ehrhardt wrote:
> Hollis Blanchard wrote:
>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>
>>> Wondering about that 30.5% for postprocessing and
>>> kvmppc_check_and_deliver_interrupts I quickly checked that in detail
>>> - part d is now divided in 4 subparts.
>>> I also looked at the return to guest path if the expected part
>>> (restoring tlb) is really the main time eater there. The result
>>> shows clearly that it is.
>>>
>>> more detailed breakdown:
>>> a) 10.94% - exit, saving guest state (booke_interrupt.S)
>>> b) 8.12% - reaching kvmppc_handle_exit
>>> c) 7.59% - syscall exit is checked and a interrupt is queued
>>> using kvmppc_queue_exception
>>> d1) 3.33% - some checks for all exits
>>> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>> d3) 17.20% - can_deliver/clear&deliver exception in
>>> kvmppc_check_and_deliver_interrupts
>>> d4) 4.47% - updating kvm_stat statistics
>>> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>> f1) 29.18% - restoring guest tlb
>>> f2) 4.69% - restoring guest state ([s]regs)
>>>
>>> These fractions are % of our ~12µs syscall exit.
>>> => restoring tlb on each reenter = 4µs constant overhead
>>> => looking a bit into irq delivery and other constant things like
>>> kvm_stat updating
>>>
>>>
>> ...
>>
>>> Now I go for the TLB replacement in f1.
>>>
>>
>> Hang on... does d3 make sense to you? It doesn't to me, and if there's a
>> bug there it will be easier to fix than rewriting the TLB code. :)
>>
> I did not give up improving that part too :-)
>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
>> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>
> I get the same results. 1% ~ 80 cycles.
>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
>> that time is going. We're assuming the find_first_bit() loop usually
>> executes once, for syscall. Does it actually execute more than that? I
>> don't expect any of kvmppc_can_deliver_interrupt(),
>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
>> take lots of time.
>>
> You can see below that I already had a more detailed breakdown in my
> old mail:
> [...]
> d2) 8.84% - 8.56% - 9.28% - 8.31% finding first bit in
> kvmppc_check_and_deliver_interrupts
> d3) 6.53% - 5.25% - 6.63% - 5.10% can_deliver in
> kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver
> exception in kvmppc_check_and_deliver_interrupts
> [...]
>> Could it be cache effects? exception_priority[] and priority_exception[]
>> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
>> fit into one... except they're not aligned.
>>
> I would be so happy if I would have hardware performance counters like
> cache misses :-)
>> Also, it looks like we use the generic find_first_bit(). That may be
>> more expensive than we'd like. However, since
>> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
>> bitfield), we should be able to use ffs() instead, which has an
>> optimized PowerPC implementation. That might help a lot.
>>
> good idea.
> I'll check this and some other small improvements I have in mind.
>
>> We might even be able to replace find_next_bit() too, by shifting a mask
>> over each loop, but I don't think we'll have to, since I expect the
>> common case to be we can deliver the first pending exception. (Worth
>> checking? :)
>>
> I'm not sure. It's surely worth checking how often that second
> find_next_bit is called.
> If that number is far too small it's not worth.
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (8 preceding siblings ...)
2008-10-09 9:35 ` Christian Ehrhardt
@ 2008-10-09 14:49 ` Christian Ehrhardt
2008-10-10 8:32 ` Christian Ehrhardt
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-09 14:49 UTC (permalink / raw)
To: kvm-ppc
Ok, looking at clear&deliver in detail I made some minor changes, but in
the end there is no single hot spot to improve.
Most of the ~14% spent in clear&deliver is lost in
kvmppc_mmu_priv_switch (~8%); the rest is spread evenly across the code -
nothing obvious (maybe some cache misses, though).
This code has to be touched anyway when going for a change in the
guest/host (shadow) TLB management.
Christian Ehrhardt wrote:
> I modified the code according to your comments and my ideas, the new
> values are shown in column impISF (irq delivery, Stat, FindFirstBit)
>
> I changed some code of the statistic updating and the interrupt
> delivery and got this:
> base - impirq (d3) - impstat (d5) - impboth - impISF
> a) 12.57% - 11.13% - 12.05% - 11.03% - 12.28% exit,
> saving guest state (booke_interrupt.S)
> b) 7.37% - 9.38% - 8.69% - 8.07% - 10.13% reaching
> kvmppc_handle_exit
> c) 7.38% - 7.20% - 7.49% - 9.78% - 7.85% syscall
> exit is checked and a interrupt is queued using kvmppc_queue_exception
> d1) 2.49% - 3.39% - 2.56% - 3.30% - 3.70% some
> checks for all exits
> d2) 8.84% - 8.56% - 9.28% - 8.31% - 6.07% finding
> first bit in kvmppc_check_and_deliver_interrupts
> d3) 6.53% - 5.25% - 6.63% - 5.10% - 4.27%
> can_deliver in kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92% - 13.96%
> clear&deliver exception in kvmppc_check_and_deliver_interrupts
> d5) 3.65% - 4.57% - 2.68% - 4.41% - 3.77% updating
> kvm_stat statistics
> e) 6.55% - 6.30% - 6.30% - 5.89% - 6.74% returning
> from kvmppc_handle_exit to booke_interrupt.S
> f1) 30.90% - 28.78% - 30.16% - 29.16% - 31.19% restoring
> guest tlb
> f2) 4.81% - 4.77% - 5.06% - 4.66% - 5.17% restoring
> guest state ([s]regs)
>
> We all see the measurement inaccuracy, but the last columns look good
> at the improved sections d2, d3 and d4.
> I'll remove these detailed tracing soon and make a larger test hoping
> that this will not have the inaccuracy.
> But for now I still wonder about the ~14% for clear&deliver - that
> should just not be "that" much.
> It should be worth to look into that section once again more in detail
> first.
>
> Christian Ehrhardt wrote:
>> Hollis Blanchard wrote:
>>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>>
>>>> Wondering about that 30.5% for postprocessing and
>>>> kvmppc_check_and_deliver_interrupts I quickly checked that in
>>>> detail - part d is now divided in 4 subparts.
>>>> I also looked at the return to guest path if the expected part
>>>> (restoring tlb) is really the main time eater there. The result
>>>> shows clearly that it is.
>>>>
>>>> more detailed breakdown:
>>>> a) 10.94% - exit, saving guest state (booke_interrupt.S)
>>>> b) 8.12% - reaching kvmppc_handle_exit
>>>> c) 7.59% - syscall exit is checked and a interrupt is queued
>>>> using kvmppc_queue_exception
>>>> d1) 3.33% - some checks for all exits
>>>> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>>> d3) 17.20% - can_deliver/clear&deliver exception in
>>>> kvmppc_check_and_deliver_interrupts
>>>> d4) 4.47% - updating kvm_stat statistics
>>>> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>>> f1) 29.18% - restoring guest tlb
>>>> f2) 4.69% - restoring guest state ([s]regs)
>>>>
>>>> These fractions are % of our ~12µs syscall exit.
>>>> => restoring tlb on each reenter = 4µs constant overhead
>>>> => looking a bit into irq delivery and other constant things like
>>>> kvm_stat updating
>>>>
>>>>
>>> ...
>>>
>>>> Now I go for the TLB replacement in f1.
>>>>
>>>
>>> Hang on... does d3 make sense to you? It doesn't to me, and if
>>> there's a
>>> bug there it will be easier to fix than rewriting the TLB code. :)
>>>
>> I did not give up improving that part too :-)
>>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
>>> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>>
>> I get the same results. 1% ~ 80 cycles.
>>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
>>> that time is going. We're assuming the find_first_bit() loop usually
>>> executes once, for syscall. Does it actually execute more than that? I
>>> don't expect any of kvmppc_can_deliver_interrupt(),
>>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
>>> take lots of time.
>>>
>> You can see below that I already had a more detailed breakdown in my
>> old mail:
>> [...]
>> d2) 8.84% - 8.56% - 9.28% - 8.31% finding first bit
>> in kvmppc_check_and_deliver_interrupts
>> d3) 6.53% - 5.25% - 6.63% - 5.10% can_deliver in
>> kvmppc_check_and_deliver_interrupts
>> d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver
>> exception in kvmppc_check_and_deliver_interrupts
>> [...]
>>> Could it be cache effects? exception_priority[] and
>>> priority_exception[]
>>> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should
>>> both
>>> fit into one... except they're not aligned.
>>>
>> I would be so happy if I would have hardware performance counters
>> like cache misses :-)
>>> Also, it looks like we use the generic find_first_bit(). That may be
>>> more expensive than we'd like. However, since
>>> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
>>> bitfield), we should be able to use ffs() instead, which has an
>>> optimized PowerPC implementation. That might help a lot.
>>>
>> good idea.
>> I'll check this and some other small improvements I have in mind.
>>
>>> We might even be able to replace find_next_bit() too, by shifting a
>>> mask
>>> over each loop, but I don't think we'll have to, since I expect the
>>> common case to be we can deliver the first pending exception. (Worth
>>> checking? :)
>>>
>> I'm not sure. It's surely worth checking how often that second
>> find_next_bit is called.
>> If that number is far too small it's not worth.
>>
>
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (9 preceding siblings ...)
2008-10-09 14:49 ` Christian Ehrhardt
@ 2008-10-10 8:32 ` Christian Ehrhardt
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-10 8:32 UTC (permalink / raw)
To: kvm-ppc
Hollis Blanchard wrote:
> Also, it looks like we use the generic find_first_bit(). That may be
> more expensive than we'd like. However, since
> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
> bitfield), we should be able to use ffs() instead, which has an
> optimized PowerPC implementation. That might help a lot.
>
> We might even be able to replace find_next_bit() too, by shifting a mask
> over each loop, but I don't think we'll have to, since I expect the
> common case to be we can deliver the first pending exception. (Worth
> checking? :)
>
FFS for the first lookup is working fine.
To check whether it is worth converting the following find_next_bit to a
mask & __ffs approach as well, I did some quick accounting, and it seems
it is not.
On the exits caused by external interrupts we hit "find next bit" in
~50% of the cases (MMIO, DCR, DEC, EXTINT).
But on all the frequent exit types like DTLBVIRT, ITLBVIRT and EMULINST we
are at ~2.75%.
Therefore it should be ok to keep the generic find_next_bit there.
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-10-10 8:32 UTC | newest]
Thread overview: 12+ messages
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
2008-09-25 9:32 ` Liu Yu-B13201
2008-09-25 15:18 ` Hollis Blanchard
2008-10-02 12:02 ` Christian Ehrhardt
2008-10-07 14:36 ` Christian Ehrhardt
2008-10-08 13:49 ` Christian Ehrhardt
2008-10-08 15:41 ` Hollis Blanchard
2008-10-09 8:02 ` Christian Ehrhardt
2008-10-09 9:35 ` Christian Ehrhardt
2008-10-09 14:49 ` Christian Ehrhardt
2008-10-10 8:32 ` Christian Ehrhardt