* exit timing analysis v1 - comments&discussions welcome
@ 2008-09-24 9:24 Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
` (10 more replies)
0 siblings, 11 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-09-24 9:24 UTC (permalink / raw)
To: kvm-ppc
[-- Attachment #1: Type: text/plain, Size: 5637 bytes --]
Hi everyone,
I wrote a patch that tries to gather data on where we spend our time.
This should give us something to think about when we talk about
virtualization on non-hardware-assisted platforms like the 440.
A good example of what this data can help us discuss is the comparison
between a paravirtualized and a non-paravirtualized guest (using the
magic page mechanism I posted recently). You can see that the pv
extension reduces the number of instruction emulation exits by 50%, but
unfortunately that only reduces the time we spend on emulation by ~20%.
This data is for your information, and comments/discussions are welcome
(I expect that we will need to trace the emulation timing per
instruction to identify the slow ones and think about optimizing them).
The basic approach is to read TB[U&L] when entering the guest and when
exiting it. The time from exit->enter is the time spent handling the
exit and is accounted per exit type. The time from enter->exit is the
time spent in the guest and is accounted as well.
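To make that concrete, here is a minimal sketch of the timestamp
handling (illustrative only - the attached patch contains the real
implementation; the union layout and the do_div() conversion below
mirror it):

/* Illustrative sketch, not the patch itself: a TBU/TBL pair read at
 * exit and at entry is treated as one 64-bit timebase value, and the
 * difference is converted to microseconds for accounting. */
#include <asm/time.h>           /* tb_ticks_per_usec */
#include <asm-generic/div64.h>  /* do_div() */

union tb_sample {
	u64 tv64;
	struct {
		u32 tbu, tbl;   /* big endian: upper word first */
	} tv32;
};

static inline u64 tb_delta_usec(union tb_sample enter, union tb_sample exit)
{
	u64 ticks = exit.tv64 - enter.tv64;

	do_div(ticks, tb_ticks_per_usec);  /* timebase ticks -> usec */
	return ticks;                      /* accounted to the exit type */
}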
For the details see the attached patch that I used to measure the
timing. I verified it with several assertions that I removed from the
final version, such as "testing for double accounting of an exit" or
"testing whether we missed accounting an exit". Additionally I verified
the exit counts against the data reported by our kvm_stat support, and
they match. The sum of the accounted time also matches the real time
spent running the test (the test was booting an Ubuntu guest, logging in
as root, executing an ls, and killing the guest).
I think this pretty much explains why the guest sometimes feels
slow :-).
Treat this data as semi-experimental and let me know what you think.
NO PV support
unit is usec
traced duration 202,244,233 usec
             count        min      max          sum        avg     stddev   time%
EMULINST     6,423,699     12  247,582   91,732,705    14.2804    241.200   45.36
ITLBVIRT     1,777,242     21  264,257   47,116,557    26.5111    286.040   23.30
DTLBVIRT     1,544,241     22  263,947   41,765,447    27.0459    218.997   20.65
TIMEINGUEST 10,000,000      0    3,673   10,861,792     1.0862      4.790    5.37
MMIO             9,318     47   10,686    3,544,934   380.4394    507.292    1.75
EXTINT           3,523     33    4,475    2,557,183   725.8538    619.026    1.26
DEC             16,937     50    4,866    1,441,084    85.0850    203.856    0.71
DTLBREAL        95,497     13      105    1,349,255    14.1288      2.811    0.67
ITLBREAL        80,622     13      146    1,079,052    13.3841      2.870    0.53
DSI             22,498     11       99      249,554    11.0923      1.975    0.12
DCR              4,854     41    3,758      232,657    47.9310     55.802    0.12
ISI             11,167     11       67      124,210    11.1230      2.242    0.06
SYSCALL          9,459     11      100      113,972    12.0491      2.763    0.06
SIGNAL             635     66    6,025       72,438   114.0756    335.235    0.04
FP_UNAVAIL         308     11       12        3,393    11.0162      0.126    0.00
*never hit USR_PR_INST, HCALL, HALT, DTLBPV, DEBUG
With PV Support
unit is usec
traced duration 136,640,181 usec
             count       min      max          sum        avg      stddev   time%
EMULINST     3,160,786    12  247,587   50,808,972    16.0748     205.445   37.18
ITLBVIRT     1,333,725    21  264,336   35,094,300    26.3130     235.204   25.68
DTLBVIRT     1,212,505    22   16,753   32,382,302    26.7069      51.475   23.70
TIMEINGUEST  6,000,000     0    3,846    7,259,180     1.2099       9.317    5.31
MMIO             9,955    47   12,289    3,570,654   358.6795     539.759    2.61
EXTINT           2,380    31    5,285    1,791,705   752.8172     665.021    1.31
ITLBREAL        80,690    13  247,582    1,320,621    16.3666     871.535    0.97
DTLBREAL        93,213    13      108    1,315,333    14.1110       2.799    0.96
DTLBPV          46,941    20    3,696    1,100,399    23.4422      50.127    0.81
DEC             10,376    50    3,672      928,878    89.5218     217.384    0.68
SIGNAL             658    66  264,544      339,911   516.5821   10303.957    0.25
DSI             21,799    11       87      242,343    11.1172       2.227    0.18
DCR              5,042    41      258      233,095    46.2307       6.465    0.17
ISI             11,582    11       66      128,625    11.1056       2.133    0.09
SYSCALL         10,052    11       59      120,597    11.9973       2.415    0.09
FP_UNAVAIL         295    11       12        3,247    11.0068       0.082    0.00
HCALL                1    19       19           19    19.0000       0.000    0.00
*never hit USR_PR_INST, HALT, DEBUG
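(For reference, the derived columns follow directly from the raw
counters the patch prints - a small post-processing sketch, assuming sum
and sum_quad are kept in usec as above:)

/* Post-processing sketch (not part of the patch): derive avg, stddev
 * and time% for one row from count, sum and sum_quad. */
#include <math.h>
#include <stdio.h>

static void print_row(const char *name, unsigned long count, double sum,
		      double sum_quad, double traced_duration)
{
	double avg    = sum / count;
	double stddev = sqrt(sum_quad / count - avg * avg);
	double timep  = 100.0 * sum / traced_duration;

	printf("%-12s %10lu %14.0f %10.4f %10.3f %6.2f\n",
	       name, count, sum, avg, stddev, timep);
}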
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: kvmppc-exit-timing.diff --]
[-- Type: text/x-diff; name="kvmppc-exit-timing.diff", Size: 12342 bytes --]
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -346,6 +346,11 @@
DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
DEFINE(VCPU_FAULT_DEAR, offsetof(struct kvm_vcpu, arch.fault_dear));
DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
+
+ DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbu));
+ DEFINE(VCPU_TIMING_EXIT_TBL, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbl));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBU, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbu));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBL, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbl));
#endif
return 0;
diff --git a/arch/powerpc/kvm/booke_guest.c b/arch/powerpc/kvm/booke_guest.c
--- a/arch/powerpc/kvm/booke_guest.c
+++ b/arch/powerpc/kvm/booke_guest.c
@@ -56,6 +56,14 @@
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ NULL }
};
+
+u32 last_exit_type;
+u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_last_exit;
static const u32 interrupt_msr_mask[16] = {
[BOOKE_INTERRUPT_CRITICAL] = MSR_ME,
@@ -261,6 +269,9 @@
enum emulation_result er;
int r = RESUME_HOST;
+ /* update before a new last_exit_type is written */
+ update_exit_timing(&(vcpu->arch));
+
local_irq_enable();
run->exit_reason = KVM_EXIT_UNKNOWN;
@@ -285,10 +296,13 @@
* misses before ceding control. */
if (need_resched())
cond_resched();
- if (exit_nr == BOOKE_INTERRUPT_DECREMENTER)
+ if (exit_nr == BOOKE_INTERRUPT_DECREMENTER) {
vcpu->stat.dec_exits++;
- else
+ set_last_exit_type(DEC_EXITS);
+ } else {
vcpu->stat.ext_intr_exits++;
+ set_last_exit_type(EXT_INTR_EXITS);
+ }
r = RESUME_GUEST;
break;
@@ -299,6 +313,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, BOOKE_INTERRUPT_PROGRAM);
r = RESUME_GUEST;
+ set_last_exit_type(USR_PR_INST);
break;
}
@@ -308,10 +323,12 @@
/* Future optimization: only reload non-volatiles if
* they were actually modified by emulation. */
vcpu->stat.emulated_inst_exits++;
+ set_last_exit_type(EMULATED_INST_EXITS);
r = RESUME_GUEST_NV;
break;
case EMULATE_DO_DCR:
run->exit_reason = KVM_EXIT_DCR;
+ set_last_exit_type(DCR_EXITS);
r = RESUME_HOST;
break;
case EMULATE_FAIL:
@@ -331,6 +348,7 @@
case BOOKE_INTERRUPT_FP_UNAVAIL:
kvmppc_queue_exception(vcpu, exit_nr);
+ set_last_exit_type(FP_UNAVAIL);
r = RESUME_GUEST;
break;
@@ -339,6 +357,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.dsi_exits++;
+ set_last_exit_type(DSI_EXITS);
r = RESUME_GUEST;
break;
@@ -346,6 +365,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.isi_exits++;
+ set_last_exit_type(ISI_EXITS);
r = RESUME_GUEST;
break;
@@ -353,9 +373,11 @@
if (vcpu->arch.last_inst == KVM_HYPERCALL_BIN) {
kvmppc_do_hypercall(vcpu);
vcpu->stat.hcall_exits++;
+ set_last_exit_type(HCALL_EXITS);
} else {
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.syscall_exits++;
+ set_last_exit_type(SYSCALL_EXITS);
}
r = RESUME_GUEST;
break;
@@ -370,6 +392,7 @@
vcpu->arch.pvmem_gpaddr >> KVM_PPCPV_MAGIC_PAGE_SHIFT,
0, KVM_PPCPV_MAGIC_PAGE_FLAGS);
vcpu->stat.dtlb_pvmem_miss_exits++;
+ set_last_exit_type(DTLB_PVMEM_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -382,6 +405,7 @@
vcpu->arch.dear = vcpu->arch.fault_dear;
vcpu->arch.esr = vcpu->arch.fault_esr;
vcpu->stat.dtlb_real_miss_exits++;
+ set_last_exit_type(DTLB_REAL_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -399,11 +423,13 @@
kvmppc_mmu_map(vcpu, eaddr, gfn, gtlbe->tid,
gtlbe->word2);
vcpu->stat.dtlb_virt_miss_exits++;
+ set_last_exit_type(DTLB_VIRT_MISS_EXITS);
r = RESUME_GUEST;
} else {
/* Guest has mapped and accessed a page which is not
* actually RAM. */
r = kvmppc_emulate_mmio(run, vcpu);
+ set_last_exit_type(MMIO_EXITS);
}
break;
@@ -422,10 +448,12 @@
/* The guest didn't have a mapping for it. */
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.itlb_real_miss_exits++;
+ set_last_exit_type(ITLB_REAL_MISS_EXITS);
break;
}
vcpu->stat.itlb_virt_miss_exits++;
+ set_last_exit_type(ITLB_VIRT_MISS_EXITS);
gfn = tlb_xlate(gtlbe, eaddr) >> PAGE_SHIFT;
@@ -458,6 +486,7 @@
mtspr(SPRN_DBSR, dbsr);
run->exit_reason = KVM_EXIT_DEBUG;
+ set_last_exit_type(DEBUG_EXITS);
r = RESUME_HOST;
break;
}
@@ -482,6 +511,7 @@
r = (-EINTR << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
} else {
vcpu->stat.light_exits++;
}
@@ -495,6 +525,7 @@
break;
case KVM_EXIT_INTR:
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
break;
}
}
@@ -535,6 +566,8 @@
* real timebase frequency. Accordingly, it must see the state of
* CCR1[TCS]. */
vcpu->arch.ccr1 = mfspr(SPRN_CCR1);
+
+ init_exit_timing(&(vcpu->arch));
return 0;
}
diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -109,6 +109,16 @@
li r6, 1
slw r6, r6, r5
+
+ /* save exit time */
+..exit_tbu_overflow_loop:
+ mfspr r7, SPRN_TBRU
+ mfspr r8, SPRN_TBRL
+ mfspr r9, SPRN_TBRU
+ cmpw r9, r7
+ bne ..exit_tbu_overflow_loop
+ stw r8, VCPU_TIMING_EXIT_TBL(r4)
+ stw r9, VCPU_TIMING_EXIT_TBU(r4)
/* Save the faulting instruction and all GPRs for emulation. */
andi. r7, r6, NEED_INST_MASK
@@ -415,6 +425,17 @@
lwz r3, VCPU_SPRG7(r4)
mtspr SPRN_SPRG7, r3
+ /* save enter time */
+
+..enter_tbu_overflow_loop:
+ mfspr r6, SPRN_TBRU
+ mfspr r7, SPRN_TBRL
+ mfspr r8, SPRN_TBRU
+ cmpw r8, r6
+ bne ..enter_tbu_overflow_loop
+ stw r7, VCPU_TIMING_LAST_ENTER_TBL(r4)
+ stw r8, VCPU_TIMING_LAST_ENTER_TBU(r4)
+
/* Finish loading guest volatiles and jump to guest. */
lwz r3, VCPU_CTR(r4)
mtctr r3
diff --git a/include/asm-powerpc/kvm_host.h b/include/asm-powerpc/kvm_host.h
--- a/include/asm-powerpc/kvm_host.h
+++ b/include/asm-powerpc/kvm_host.h
@@ -25,6 +25,8 @@
#include <linux/types.h>
#include <linux/kvm_types.h>
#include <asm/kvm_asm.h>
+#include <asm/time.h>
+#include <asm-generic/div64.h>
#define KVM_MAX_VCPUS 1
#define KVM_MEMORY_SLOTS 32
@@ -66,6 +68,53 @@
u32 halt_wakeup;
};
+enum kvm_exit_types {
+ MMIO_EXITS,
+ DCR_EXITS,
+ SIGNAL_EXITS,
+ ITLB_REAL_MISS_EXITS,
+ ITLB_VIRT_MISS_EXITS,
+ DTLB_REAL_MISS_EXITS,
+ DTLB_PVMEM_MISS_EXITS,
+ DTLB_VIRT_MISS_EXITS,
+ SYSCALL_EXITS,
+ HCALL_EXITS,
+ ISI_EXITS,
+ DSI_EXITS,
+ EMULATED_INST_EXITS,
+ DEC_EXITS,
+ EXT_INTR_EXITS,
+ HALT_WAKEUP,
+ USR_PR_INST,
+ FP_UNAVAIL,
+ DEBUG_EXITS,
+ TIMEINGUEST,
+ __NUMBER_OF_KVM_EXIT_TYPES
+};
+
+static const char* kvm_exit_names[__NUMBER_OF_KVM_EXIT_TYPES] = {
+ "MMIO",
+ "DCR",
+ "SIGNAL",
+ "ITLBREAL",
+ "ITLBVIRT",
+ "DTLBREAL",
+ "DTLBPV",
+ "DTLBVIRT",
+ "SYSCALL",
+ "HCALL",
+ "ISI",
+ "DSI",
+ "EMULINST",
+ "DEC",
+ "EXTINT",
+ "HALT",
+ "USR_PR_INST",
+ "FP_UNAVAIL",
+ "DEBUG",
+ "TIMEINGUEST"
+};
+
struct tlbe {
u32 tid; /* Only the low 8 bits are used. */
u32 word0;
@@ -75,6 +124,14 @@
struct kvm_arch {
};
+
+/* allow access to big endian 32bit upper/lower parts and 64bit var */
+typedef union {
+ u64 tv64;
+ struct {
+ u32 tbu, tbl;
+ } tv32;
+} exit_timing_t;
struct kvm_vcpu_arch {
/* Unmodified copy of the guest's TLB. */
@@ -139,6 +196,10 @@
u32 dbcr0;
u32 dbcr1;
+ /* to trace exit timing */
+ exit_timing_t timing_exit;
+ exit_timing_t timing_last_enter;
+
u32 last_inst;
u32 fault_dear;
u32 fault_esr;
@@ -165,4 +226,121 @@
int singlestep;
};
+/* helpers for instruction timing, allocated in arch/powerpc/kvm/booke_guest.c */
+extern u32 last_exit_type;
+extern u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_last_exit;
+
+static inline void init_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ int i;
+ last_exit_type=0xDEAD;
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ timing_count_type[i]=0;
+ timing_max_duration[i]=0;
+ timing_min_duration[i]=0xFFFFFFFF;
+ timing_sum_duration[i]=0;
+ timing_sum_quad_duration[i]=0;
+ }
+ vcpu_arch->timing_exit.tv64 = 0;
+ timing_last_exit = 0;
+ vcpu_arch->timing_last_enter.tv64 = 0;
+ printk(KERN_ERR"### initialized exit timing (print start timestamp to log) ###\n");
+}
+
+static inline void set_last_exit_type(int type)
+{
+ last_exit_type=type;
+}
+
+static inline void print_exit_timing(void)
+{
+ int i;
+ u64 min, max;
+
+ printk(KERN_ERR"### start exit timing statistics ###\n");
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ if (timing_min_duration[i]==0xFFFFFFFF)
+ min = 0;
+ else
+ min = timing_min_duration[i];
+ if (timing_max_duration[i]==0)
+ max = 0;
+ else
+ max = timing_max_duration[i];
+
+ printk(KERN_ERR"exit type %2d (%12s) count %10d min %10lld max %10lld sum %20lld sum_quad %20lld\n",
+ i, kvm_exit_names[i], timing_count_type[i], min, max,
+ timing_sum_duration[i], timing_sum_quad_duration[i]);
+ }
+ printk(KERN_ERR"please post calc: avg = sum/count; stddev = sqrt((sum_quad/count)-(sum_dur/count)������)\n");
+ printk(KERN_ERR"### end exit timing statistics ###\n");
+}
+
+static inline void add_exit_timing(u64 duration, int type)
+{
+ u64 old;
+
+ do_div(duration, tb_ticks_per_usec);
+ if (unlikely(duration > 0xFFFFFFFF)) {
+ printk(KERN_ERR"%s - duration too big, **2 would overflow duration %lld max 4294967295 type %d exit # of type %d (tb_ticks_per_usec=%ld)\n",
+ __func__, duration, type, timing_count_type[type], tb_ticks_per_usec);
+ print_exit_timing();
+ }
+
+
+ timing_count_type[type]++;
+
+ /* sum */
+ old = timing_sum_duration[type];
+ timing_sum_duration[type] += duration;
+ if (unlikely(old > timing_sum_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* square sum */
+ old = timing_sum_quad_duration[type];
+ timing_sum_quad_duration[type] += (duration*duration);
+ if (unlikely(old > timing_sum_quad_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of squared durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_quad_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* set min/max */
+ if (unlikely(duration < timing_min_duration[type]))
+ timing_min_duration[type] = duration;
+ if (unlikely(duration > timing_max_duration[type]))
+ timing_max_duration[type] = duration;
+}
+
+static inline void update_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ u64 exit = timing_last_exit;
+ u64 enter = vcpu_arch->timing_last_enter.tv64;
+
+ /* save exit time to use it next exit when the related enter time is known */
+ timing_last_exit = vcpu_arch->timing_exit.tv64;
+
+ if (unlikely(last_exit_type==0xDEAD))
+ return; /* skip first incomplete enter/exit cycle */
+
+ /* update statistics for average and standard deviation */
+ add_exit_timing((enter - exit), last_exit_type);
+ /* enter -> timing_last_exit is time spent in guest - log this too */
+ add_exit_timing((timing_last_exit - enter), TIMEINGUEST);
+
+ /* report regularly every 1000000 exits */
+ if (unlikely(timing_count_type[TIMEINGUEST] % 1000000 == 0)) {
+ printk(KERN_ERR"%s - %d exits report timing statistics\n", __func__,timing_count_type[TIMEINGUEST]);
+ print_exit_timing();
+ }
+}
+
#endif /* __POWERPC_KVM_HOST_H__ */
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
@ 2008-09-24 15:14 ` Hollis Blanchard
2008-09-25 9:32 ` Liu Yu-B13201
` (9 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hollis Blanchard @ 2008-09-24 15:14 UTC (permalink / raw)
To: kvm-ppc
On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
> count min max
> sum avg stddev time%
> EMULINST 6,423,699 12 247,582 91,732,705
> 14.2804 241.200 45.36
> ITLBVIRT 1,777,242 21 264,257 47,116,557
> 26.5111 286.040 23.30
> DTLBVIRT 1,544,241 22 263,947 41,765,447
> 27.0459 218.997 20.65
(The max here is of course when the guest was de-scheduled in favor of
another host process.)
I think it's interesting that the min instruction emulation time is 12
usecs. In fact, our "null" exits where we do almost no processing (ISI,
DSI, syscall) are 11 usecs, which suggests we have a problem with
interrupt handler overhead (for all exit types).
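(At a ~667 MHz core clock, those 11 usecs already correspond to roughly
7,300 cycles of fixed overhead per exit.)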
--
Hollis Blanchard
IBM Linux Technology Center
* RE: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
@ 2008-09-25 9:32 ` Liu Yu-B13201
2008-09-25 15:18 ` Hollis Blanchard
` (8 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Liu Yu-B13201 @ 2008-09-25 9:32 UTC (permalink / raw)
To: kvm-ppc
>
> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
> > count min max
> > sum avg stddev time%
> > EMULINST 6,423,699 12 247,582
> 91,732,705
> > 14.2804 241.200 45.36
> > ITLBVIRT 1,777,242 21 264,257
> 47,116,557
> > 26.5111 286.040 23.30
> > DTLBVIRT 1,544,241 22 263,947
> 41,765,447
> > 27.0459 218.997 20.65
>
> (The max here is of course when the guest was de-scheduled in favor of
> another host process.)
>
> I think it's interesting that the min instruction emulation time is 12
> usecs. In fact, our "null" exits where we do almost no
> processing (ISI,
> DSI, syscall) are 11 usecs, which suggests we have a problem with
> interrupt handler overhead (for all exit types).
>
Will you consider moving the tlb manipulation out of the entering path?
Examining the modify array may cost some time.
* RE: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
2008-09-25 9:32 ` Liu Yu-B13201
@ 2008-09-25 15:18 ` Hollis Blanchard
2008-10-02 12:02 ` Christian Ehrhardt
` (7 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hollis Blanchard @ 2008-09-25 15:18 UTC (permalink / raw)
To: kvm-ppc
On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
> >
> > On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
> > > count min max
> > > sum avg stddev time%
> > > EMULINST 6,423,699 12 247,582
> > 91,732,705
> > > 14.2804 241.200 45.36
> > > ITLBVIRT 1,777,242 21 264,257
> > 47,116,557
> > > 26.5111 286.040 23.30
> > > DTLBVIRT 1,544,241 22 263,947
> > 41,765,447
> > > 27.0459 218.997 20.65
> >
> > (The max here is of course when the guest was de-scheduled in favor of
> > another host process.)
> >
> > I think it's interesting that the min instruction emulation time is 12
> > usecs. In fact, our "null" exits where we do almost no
> > processing (ISI,
> > DSI, syscall) are 11 usecs, which suggests we have a problem with
> > interrupt handler overhead (for all exit types).
> >
>
> Will you consider about moving tlb manipulation out of entering path?
> Examining the modify array may cost some time.
Yes, definitely.
That's about the only thing I can see in the 440 interrupt path that
might take significant time. Unfortunately we can't profile that code
because we have no performance counters, so finding the problem will
require some experimentation.
--
Hollis Blanchard
IBM Linux Technology Center
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (2 preceding siblings ...)
2008-09-25 15:18 ` Hollis Blanchard
@ 2008-10-02 12:02 ` Christian Ehrhardt
2008-10-07 14:36 ` Christian Ehrhardt
` (6 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-02 12:02 UTC (permalink / raw)
To: kvm-ppc
[-- Attachment #1: Type: text/plain, Size: 7192 bytes --]
Today I refactored my code in a way that it might eventually go upstream.
That means I removed some debug blocks, made it configurable, and now
print output only on request via sysrq.
Feel free to review that code now, or wait for my separate RFC post that
will follow ~next week.
Additionally, I realized in a discussion that my host system was still
configured in demo mode (X11+xfce, dhcpd, avahi, ... running on the host).
I took some updated numbers without all those background activities -
the test case was again boot, login, ls, kill.
I intend to examine the irq path in detail as suggested, and first of
all I'll try to get some timing for the "segments" of such a call.
This should show us where we lose our "constant base time" and give me
some hints for looking at it in detail.
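One possible way to take those segment timings (a hypothetical sketch,
not necessarily what I will end up using): stamp the timebase at a few
fixed points along the exit path and charge the deltas to named
segments.

/* Hypothetical segment-timing helper: call seg_mark() at the end of
 * each segment; the elapsed timebase ticks since the previous mark are
 * charged to that segment. */
#include <asm/time.h>   /* get_tb() */

enum exit_segment {
	SEG_ASM_SAVE,     /* save guest state (asm) */
	SEG_TO_HANDLER,   /* reach the C exit handler */
	SEG_HANDLE,       /* per-exit handling */
	SEG_POST,         /* post-processing */
	SEG_ASM_RESTORE,  /* return and restore guest state (asm) */
	NR_SEGMENTS
};

static u64 seg_last_stamp;
static u64 seg_sum[NR_SEGMENTS];

static inline void seg_mark(enum exit_segment seg)
{
	u64 now = get_tb();   /* full 64-bit timebase */

	seg_sum[seg] += now - seg_last_stamp;
	seg_last_stamp = now;
}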
FYI - new numbers without high background load (monospace recommended):
NOPV:
MMIO        count   10105  min 46  max    1534  sum   2952467  avg 292.1788  stddev 295.933  %  2.04
DCR         count    5428  min 40  max     209  sum    246168  avg  45.3515  stddev   6.758  %  0.17
SIGNAL      count     695  min 65  max    3755  sum     89767  avg 129.1612  stddev 314.421  %  0.06
ITLBREAL    count   80045  min 13  max     108  sum   1063127  avg  13.2816  stddev   2.338  %  0.73
ITLBVIRT    count 1000585  min 21  max  264827  sum  24300596  avg  24.2864  stddev 264.753  % 16.78
DTLBREAL    count   91206  min 13  max      69  sum   1285214  avg  14.0913  stddev   2.225  %  0.89
DTLBVIRT    count  977434  min 21  max    1446  sum  24007008  avg  24.5613  stddev   4.426  % 16.58
SYSCALL     count   10460  min 11  max      55  sum    116447  avg  11.1326  stddev   1.929  %  0.08
ISI         count   11724  min 11  max      61  sum    130007  avg  11.0890  stddev   1.929  %  0.09
DSI         count   20737  min 11  max      57  sum    230009  avg  11.0917  stddev   1.914  %  0.16
EMULINST    count 5683356  min 11  max    3778  sum  79339467  avg  13.9600  stddev  50.275  % 54.78
DEC         count   13079  min 50  max     826  sum    732712  avg  56.0220  stddev  22.382  %  0.51
EXTINT      count      55  min 30  max    1478  sum     10996  avg 199.9273  stddev 238.150  %  0.01
FP_UNAVAIL  count     280  min 11  max      53  sum      3163  avg  11.2964  stddev   3.495  %  0.00
TIMEINGUEST count 7905189  min  0  max    3688  sum  10330742  avg   1.3068  stddev   8.970  %  7.13
csum 15810378
sumsum 144837890 => ~2:24 runtime
sumavg 7241894
PV:
MMIO        count   12505  min 46  max    3087  sum   3693782  avg 295.3844  stddev 260.788  %  4.01
DCR         count    5595  min 40  max     706  sum    273578  avg  48.8969  stddev  31.305  %  0.30
SIGNAL      count     654  min 65  max    4132  sum    300027  avg 458.7569  stddev 571.130  %  0.33
ITLBREAL    count   71711  min 13  max     104  sum    943053  avg  13.1507  stddev   2.360  %  1.02
ITLBVIRT    count  750649  min 21  max    1503  sum  18178245  avg  24.2167  stddev   7.335  % 19.71
DTLBREAL    count   83356  min 13  max     102  sum   1146242  avg  13.7512  stddev   2.406  %  1.24
DTLBPV      count   30086  min 20  max     237  sum    653556  avg  21.7229  stddev   4.639  %  0.71
DTLBVIRT    count  772811  min 21  max     713  sum  19079477  avg  24.6884  stddev   6.593  % 20.69
SYSCALL     count    7647  min 11  max      57  sum     84821  avg  11.0921  stddev   1.897  %  0.09
HCALL       count       1  min 19  max      19  sum        19  avg  19.0000  stddev   0.000  %  0.00
ISI         count    9895  min 11  max      73  sum    109667  avg  11.0831  stddev   1.904  %  0.12
DSI         count   17974  min 10  max      57  sum    199504  avg  11.0996  stddev   2.046  %  0.22
EMULINST    count 2567245  min 11  max    4212  sum  40501314  avg  15.7762  stddev  65.673  % 43.92
DEC         count    7488  min 51  max     641  sum    426813  avg  56.9996  stddev  23.893  %  0.46
EXTINT      count    2215  min 31  max    1677  sum    297495  avg 134.3093  stddev 116.219  %  0.32
FP_UNAVAIL  count     258  min 11  max      11  sum      2838  avg  11.0000  stddev   0.000  %  0.00
TIMEINGUEST count 4340090  min  0  max    3850  sum   6316079  avg   1.4553  stddev  12.599  %  6.85
csum 8680180
sumsum 92206510 => ~1:32 runtime
sumavg 4610325
Hollis Blanchard wrote:
> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>> count min max
>>>> sum avg stddev time%
>>>> EMULINST 6,423,699 12 247,582
>>> 91,732,705
>>>> 14.2804 241.200 45.36
>>>> ITLBVIRT 1,777,242 21 264,257
>>> 47,116,557
>>>> 26.5111 286.040 23.30
>>>> DTLBVIRT 1,544,241 22 263,947
>>> 41,765,447
>>>> 27.0459 218.997 20.65
>>> (The max here is of course when the guest was de-scheduled in favor of
>>> another host process.)
>>>
>>> I think it's interesting that the min instruction emulation time is 12
>>> usecs. In fact, our "null" exits where we do almost no
>>> processing (ISI,
>>> DSI, syscall) are 11 usecs, which suggests we have a problem with
>>> interrupt handler overhead (for all exit types).
>>>
>> Will you consider about moving tlb manipulation out of entering path?
>> Examining the modify array may cost some time.
>
> Yes, definitely.
>
> That's about the only thing I can see in the 440 interrupt path that
> might take significant time. Unfortunately we can't profile that code
> because we have no performance counters, so finding the problem will
> require some experimentation.
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: kvmppc-exit-timing.diff --]
[-- Type: text/x-diff; name="kvmppc-exit-timing.diff", Size: 13787 bytes --]
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -348,5 +348,12 @@
DEFINE(VCPU_FAULT_ESR, offsetof(struct kvm_vcpu, arch.fault_esr));
#endif
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbu));
+ DEFINE(VCPU_TIMING_EXIT_TBL, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbl));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBU, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbu));
+ DEFINE(VCPU_TIMING_LAST_ENTER_TBL, offsetof(struct kvm_vcpu, arch.timing_last_enter.tv32.tbl));
+#endif
+
return 0;
}
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -37,6 +37,14 @@
Provides host support for KVM on Book E PowerPC processors. Currently
this works on 440 processors only.
+config KVM_BOOKE_EXIT_TIMING
+ bool "Trace detailed exit Timing"
+ depends on KVM_BOOKE_HOST
+ ---help---
+ Inserts code to trace timestamps for every exit/enter cycle. A report
+ can be triggered using debugfs entry TODO.
+ This adds overhead, it's not recommended for production environments.
+
config KVM_TRACE
bool "KVM trace support"
depends on KVM && MARKERS && SYSFS
diff --git a/arch/powerpc/kvm/booke_guest.c b/arch/powerpc/kvm/booke_guest.c
--- a/arch/powerpc/kvm/booke_guest.c
+++ b/arch/powerpc/kvm/booke_guest.c
@@ -56,6 +56,21 @@
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ NULL }
};
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+u32 last_exit_type;
+u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+u64 timing_last_exit;
+struct sysrq_key_op trigger_print_exit_timing_op = {
+ .handler = print_exit_timing_handler,
+ .help_msg = "Print KVM PowerPC exit timings",
+ .action_msg = "Printing KVM PowerPC exit timings",
+};
+#endif
static const u32 interrupt_msr_mask[16] = {
[BOOKE_INTERRUPT_CRITICAL] = MSR_ME,
@@ -261,6 +276,9 @@
enum emulation_result er;
int r = RESUME_HOST;
+ /* update before a new last_exit_type is written */
+ update_exit_timing(&(vcpu->arch));
+
local_irq_enable();
run->exit_reason = KVM_EXIT_UNKNOWN;
@@ -285,10 +303,13 @@
* misses before ceding control. */
if (need_resched())
cond_resched();
- if (exit_nr == BOOKE_INTERRUPT_DECREMENTER)
+ if (exit_nr == BOOKE_INTERRUPT_DECREMENTER) {
vcpu->stat.dec_exits++;
- else
+ set_last_exit_type(DEC_EXITS);
+ } else {
vcpu->stat.ext_intr_exits++;
+ set_last_exit_type(EXT_INTR_EXITS);
+ }
r = RESUME_GUEST;
break;
@@ -299,6 +320,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, BOOKE_INTERRUPT_PROGRAM);
r = RESUME_GUEST;
+ set_last_exit_type(USR_PR_INST);
break;
}
@@ -308,10 +330,12 @@
/* Future optimization: only reload non-volatiles if
* they were actually modified by emulation. */
vcpu->stat.emulated_inst_exits++;
+ set_last_exit_type(EMULATED_INST_EXITS);
r = RESUME_GUEST_NV;
break;
case EMULATE_DO_DCR:
run->exit_reason = KVM_EXIT_DCR;
+ set_last_exit_type(DCR_EXITS);
r = RESUME_HOST;
break;
case EMULATE_FAIL:
@@ -331,6 +355,7 @@
case BOOKE_INTERRUPT_FP_UNAVAIL:
kvmppc_queue_exception(vcpu, exit_nr);
+ set_last_exit_type(FP_UNAVAIL);
r = RESUME_GUEST;
break;
@@ -339,6 +364,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.dsi_exits++;
+ set_last_exit_type(DSI_EXITS);
r = RESUME_GUEST;
break;
@@ -346,6 +372,7 @@
vcpu->arch.esr = vcpu->arch.fault_esr;
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.isi_exits++;
+ set_last_exit_type(ISI_EXITS);
r = RESUME_GUEST;
break;
@@ -353,9 +380,11 @@
if (vcpu->arch.last_inst == KVM_HYPERCALL_BIN) {
kvmppc_do_hypercall(vcpu);
vcpu->stat.hcall_exits++;
+ set_last_exit_type(HCALL_EXITS);
} else {
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.syscall_exits++;
+ set_last_exit_type(SYSCALL_EXITS);
}
r = RESUME_GUEST;
break;
@@ -370,6 +399,7 @@
vcpu->arch.pvmem_gpaddr >> KVM_PPCPV_MAGIC_PAGE_SHIFT,
0, KVM_PPCPV_MAGIC_PAGE_FLAGS);
vcpu->stat.dtlb_pvmem_miss_exits++;
+ set_last_exit_type(DTLB_PVMEM_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -382,6 +412,7 @@
vcpu->arch.dear = vcpu->arch.fault_dear;
vcpu->arch.esr = vcpu->arch.fault_esr;
vcpu->stat.dtlb_real_miss_exits++;
+ set_last_exit_type(DTLB_REAL_MISS_EXITS);
r = RESUME_GUEST;
break;
}
@@ -399,11 +430,13 @@
kvmppc_mmu_map(vcpu, eaddr, gfn, gtlbe->tid,
gtlbe->word2);
vcpu->stat.dtlb_virt_miss_exits++;
+ set_last_exit_type(DTLB_VIRT_MISS_EXITS);
r = RESUME_GUEST;
} else {
/* Guest has mapped and accessed a page which is not
* actually RAM. */
r = kvmppc_emulate_mmio(run, vcpu);
+ set_last_exit_type(MMIO_EXITS);
}
break;
@@ -422,10 +455,12 @@
/* The guest didn't have a mapping for it. */
kvmppc_queue_exception(vcpu, exit_nr);
vcpu->stat.itlb_real_miss_exits++;
+ set_last_exit_type(ITLB_REAL_MISS_EXITS);
break;
}
vcpu->stat.itlb_virt_miss_exits++;
+ set_last_exit_type(ITLB_VIRT_MISS_EXITS);
gfn = tlb_xlate(gtlbe, eaddr) >> PAGE_SHIFT;
@@ -458,6 +493,7 @@
mtspr(SPRN_DBSR, dbsr);
run->exit_reason = KVM_EXIT_DEBUG;
+ set_last_exit_type(DEBUG_EXITS);
r = RESUME_HOST;
break;
}
@@ -482,6 +518,7 @@
r = (-EINTR << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
} else {
vcpu->stat.light_exits++;
}
@@ -495,6 +532,7 @@
break;
case KVM_EXIT_INTR:
vcpu->stat.signal_exits++;
+ set_last_exit_type(SIGNAL_EXITS);
break;
}
}
@@ -535,6 +573,8 @@
* real timebase frequency. Accordingly, it must see the state of
* CCR1[TCS]. */
vcpu->arch.ccr1 = mfspr(SPRN_CCR1);
+
+ init_exit_timing(&(vcpu->arch));
return 0;
}
diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -109,6 +109,18 @@
li r6, 1
slw r6, r6, r5
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ /* save exit time */
+..exit_tbu_overflow_loop:
+ mfspr r7, SPRN_TBRU
+ mfspr r8, SPRN_TBRL
+ mfspr r9, SPRN_TBRU
+ cmpw r9, r7
+ bne ..exit_tbu_overflow_loop
+ stw r8, VCPU_TIMING_EXIT_TBL(r4)
+ stw r9, VCPU_TIMING_EXIT_TBU(r4)
+#endif
/* Save the faulting instruction and all GPRs for emulation. */
andi. r7, r6, NEED_INST_MASK
@@ -415,6 +427,18 @@
lwz r3, VCPU_SPRG7(r4)
mtspr SPRN_SPRG7, r3
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ /* save enter time */
+..enter_tbu_overflow_loop:
+ mfspr r6, SPRN_TBRU
+ mfspr r7, SPRN_TBRL
+ mfspr r8, SPRN_TBRU
+ cmpw r8, r6
+ bne ..enter_tbu_overflow_loop
+ stw r7, VCPU_TIMING_LAST_ENTER_TBL(r4)
+ stw r8, VCPU_TIMING_LAST_ENTER_TBU(r4)
+#endif
+
/* Finish loading guest volatiles and jump to guest. */
lwz r3, VCPU_CTR(r4)
mtctr r3
diff --git a/include/asm-powerpc/kvm_host.h b/include/asm-powerpc/kvm_host.h
--- a/include/asm-powerpc/kvm_host.h
+++ b/include/asm-powerpc/kvm_host.h
@@ -25,6 +25,12 @@
#include <linux/types.h>
#include <linux/kvm_types.h>
#include <asm/kvm_asm.h>
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+#include <asm/time.h>
+#include <asm-generic/div64.h>
+#include <linux/sysrq.h>
+#endif
#define KVM_MAX_VCPUS 1
#define KVM_MEMORY_SLOTS 32
@@ -66,6 +72,55 @@
u32 halt_wakeup;
};
+enum kvm_exit_types {
+ MMIO_EXITS,
+ DCR_EXITS,
+ SIGNAL_EXITS,
+ ITLB_REAL_MISS_EXITS,
+ ITLB_VIRT_MISS_EXITS,
+ DTLB_REAL_MISS_EXITS,
+ DTLB_PVMEM_MISS_EXITS,
+ DTLB_VIRT_MISS_EXITS,
+ SYSCALL_EXITS,
+ HCALL_EXITS,
+ ISI_EXITS,
+ DSI_EXITS,
+ EMULATED_INST_EXITS,
+ DEC_EXITS,
+ EXT_INTR_EXITS,
+ HALT_WAKEUP,
+ USR_PR_INST,
+ FP_UNAVAIL,
+ DEBUG_EXITS,
+ TIMEINGUEST,
+ __NUMBER_OF_KVM_EXIT_TYPES
+};
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+static const char* kvm_exit_names[__NUMBER_OF_KVM_EXIT_TYPES] = {
+ "MMIO",
+ "DCR",
+ "SIGNAL",
+ "ITLBREAL",
+ "ITLBVIRT",
+ "DTLBREAL",
+ "DTLBPV",
+ "DTLBVIRT",
+ "SYSCALL",
+ "HCALL",
+ "ISI",
+ "DSI",
+ "EMULINST",
+ "DEC",
+ "EXTINT",
+ "HALT",
+ "USR_PR_INST",
+ "FP_UNAVAIL",
+ "DEBUG",
+ "TIMEINGUEST"
+};
+#endif
+
struct tlbe {
u32 tid; /* Only the low 8 bits are used. */
u32 word0;
@@ -75,6 +130,16 @@
struct kvm_arch {
};
+
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+/* allow access to big endian 32bit upper/lower parts and 64bit var */
+typedef union {
+ u64 tv64;
+ struct {
+ u32 tbu, tbl;
+ } tv32;
+} exit_timing_t;
+#endif
struct kvm_vcpu_arch {
/* Unmodified copy of the guest's TLB. */
@@ -139,6 +204,11 @@
u32 dbcr0;
u32 dbcr1;
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+ exit_timing_t timing_exit;
+ exit_timing_t timing_last_enter;
+#endif
+
u32 last_inst;
u32 fault_dear;
u32 fault_esr;
@@ -165,4 +235,131 @@
int singlestep;
};
+#ifdef CONFIG_KVM_BOOKE_EXIT_TIMING
+/* helpers for instruction timing, allocated in arch/powerpc/kvm/booke_guest.c */
+extern u32 last_exit_type;
+extern u32 timing_count_type[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_sum_quad_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_min_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_max_duration[__NUMBER_OF_KVM_EXIT_TYPES];
+extern u64 timing_last_exit;
+extern struct sysrq_key_op trigger_print_exit_timing_op;
+
+static void print_exit_timing_handler(int key, struct tty_struct *tty)
+{
+ int i;
+ u64 min, max;
+
+ printk(KERN_ERR"### start exit timing statistics ###\n");
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ if (timing_min_duration[i]==0xFFFFFFFF)
+ min = 0;
+ else
+ min = timing_min_duration[i];
+ if (timing_max_duration[i]==0)
+ max = 0;
+ else
+ max = timing_max_duration[i];
+
+ printk(KERN_ERR"exit type %2d (%12s) count %10d min %10lld max %10lld sum %20lld sum_quad %20lld\n",
+ i, kvm_exit_names[i], timing_count_type[i], min, max,
+ timing_sum_duration[i], timing_sum_quad_duration[i]);
+ }
+ printk(KERN_ERR"please post calc: avg = sum/count; stddev = sqrt((sum_quad/count)-(sum_dur/count)������)\n");
+ printk(KERN_ERR"### end exit timing statistics ###\n");
+}
+
+static inline void print_exit_timing(void) {
+ print_exit_timing_handler(0,NULL);
+}
+
+static inline void init_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ int i;
+
+ last_exit_type=0xDEAD;
+ for (i=0; i<__NUMBER_OF_KVM_EXIT_TYPES; i++) {
+ timing_count_type[i]=0;
+ timing_max_duration[i]=0;
+ timing_min_duration[i]=0xFFFFFFFF;
+ timing_sum_duration[i]=0;
+ timing_sum_quad_duration[i]=0;
+ }
+ vcpu_arch->timing_exit.tv64 = 0;
+ timing_last_exit = 0;
+ vcpu_arch->timing_last_enter.tv64 = 0;
+
+ if (register_sysrq_key('y', &trigger_print_exit_timing_op))
+ printk(KERN_ERR"%s - failed to register sysrq key\n", __func__);
+
+/* TODO unregister_sysrq_key(int key, struct sysrq_key_op *op_p) */
+}
+
+static inline void set_last_exit_type(int type)
+{
+ last_exit_type=type;
+}
+
+static inline void add_exit_timing(u64 duration, int type)
+{
+ u64 old;
+
+ do_div(duration, tb_ticks_per_usec);
+ if (unlikely(duration > 0xFFFFFFFF)) {
+ printk(KERN_ERR"%s - duration too big, **2 would overflow duration %lld max 4294967295 type %d exit # of type %d (tb_ticks_per_usec=%ld)\n",
+ __func__, duration, type, timing_count_type[type], tb_ticks_per_usec);
+ print_exit_timing();
+ }
+
+
+ timing_count_type[type]++;
+
+ /* sum */
+ old = timing_sum_duration[type];
+ timing_sum_duration[type] += duration;
+ if (unlikely(old > timing_sum_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* square sum */
+ old = timing_sum_quad_duration[type];
+ timing_sum_quad_duration[type] += (duration*duration);
+ if (unlikely(old > timing_sum_quad_duration[type])) {
+ printk(KERN_ERR"%s - wraparound adding sum of squared durations old %lld new %lld type %d exit # of type %d\n",
+ __func__, old, timing_sum_quad_duration[type], type, timing_count_type[type]);
+ print_exit_timing();
+ }
+
+ /* set min/max */
+ if (unlikely(duration < timing_min_duration[type]))
+ timing_min_duration[type] = duration;
+ if (unlikely(duration > timing_max_duration[type]))
+ timing_max_duration[type] = duration;
+}
+
+static inline void update_exit_timing(struct kvm_vcpu_arch* vcpu_arch)
+{
+ u64 exit = timing_last_exit;
+ u64 enter = vcpu_arch->timing_last_enter.tv64;
+
+ /* save exit time to use it next exit when the related enter time is known */
+ timing_last_exit = vcpu_arch->timing_exit.tv64;
+
+ if (unlikely(last_exit_type==0xDEAD))
+ return; /* skip first incomplete enter/exit cycle */
+
+ /* update statistics for average and standard deviation */
+ add_exit_timing((enter - exit), last_exit_type);
+ /* enter -> timing_last_exit is time spent in guest - log this too */
+ add_exit_timing((timing_last_exit - enter), TIMEINGUEST);
+}
+#else
+static inline void init_exit_timing(struct kvm_vcpu_arch* vcpu_arch) {}
+static inline void set_last_exit_type(int type) {}
+static inline void update_exit_timing(struct kvm_vcpu_arch* vcpu_arch) {}
+#endif /* CONFIG_KVM_BOOKE_EXIT_TIMING */
+
#endif /* __POWERPC_KVM_HOST_H__ */
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (3 preceding siblings ...)
2008-10-02 12:02 ` Christian Ehrhardt
@ 2008-10-07 14:36 ` Christian Ehrhardt
2008-10-08 13:49 ` Christian Ehrhardt
` (5 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-07 14:36 UTC (permalink / raw)
To: kvm-ppc
As intended, I separated the time of one of our exits into its segments
to see where we lose the constant base time.
I looked at the syscall exit, which is pretty fast and does not deviate
very much.
A little tracing and debugging later, I can now show a rough breakdown
of where the time is spent for that exit:
avg time spent in the segments I checked:
a) 11.5% - exit, saving guest state (booke_interrupt.S)
b) 8.5% - reaching kvmppc_handle_exit
c) 10% - syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d) 30.5% - post-processing for all exits = kvmppc_check_and_deliver_interrupts and updating some statistics
e) 5.5% - returning from kvmppc_handle_exit to booke_interrupt.S
f) 34% - restoring guest state tlb/regs (booke_interrupt.S)
Ok, looking at these numbers I think we have two areas to check in detail.
- On one hand, the post-processing part (d) really should not take that
much time.
- On the other hand, we might be able to improve the last segment by
changing our tlb management (Liu made a suggestion on the list two weeks ago).
Thinking about this breakdown with the complete timing in mind (see
below) shows that the average time spent in one of our most frequent
exits (EMUL) is near the minimum duration. This means that if we can
reduce some of the constant overhead, it might really help us in those
frequent and hot paths (the tlb misses will also benefit from that).
I'll continue on that and keep you all updated.
Christian
Christian Ehrhardt wrote:
> Today I refactored my code in a way that it might eventually get
> upstream.
> That means I removed some debug blocks, made it configurable and print
> output only on request per sysrq.
> Feel free to review that code now, otherwise wait for my separate rfc
> post that will follow ~next week.
>
> Additionally I realized in a discussion that my host system was still
> configured in demo mode (X11+xfce, dhcpd, avahi, ... running in host).
> I took some updated numbers without all those background activities -
> testacase again boot, login, ls, kill.
>
> I intend to the irq path in detail as suggested and first of all I'll
> try to get some timing of the "segments" of such a call.
> This should show us where we loose our "constant base time" and give
> me some hints when looking at it in detail.
>
> FYI - new numbers without high background load (monospace recommended):
> NOPV:
> MMIO count 10105 min 46 max 1534
> sum 2952467 avg 292.1788 stddev 295.933 % 2.04
> DCR count 5428 min 40 max 209
> sum 246168 avg 45.3515 stddev 6.758 % 0.17
> SIGNAL count 695 min 65 max 3755
> sum 89767 avg 129.1612 stddev 314.421 % 0.06
> ITLBREAL count 80045 min 13 max 108
> sum 1063127 avg 13.2816 stddev 2.338 % 0.73
> ITLBVIRT count 1000585 min 21 max 264827
> sum 24300596 avg 24.2864 stddev 264.753 % 16.78
> DTLBREAL count 91206 min 13 max 69
> sum 1285214 avg 14.0913 stddev 2.225 % 0.89
> DTLBVIRT count 977434 min 21 max 1446
> sum 24007008 avg 24.5613 stddev 4.426 % 16.58
> SYSCALL count 10460 min 11 max 55
> sum 116447 avg 11.1326 stddev 1.929 % 0.08
> ISI count 11724 min 11 max 61
> sum 130007 avg 11.0890 stddev 1.929 % 0.09
> DSI count 20737 min 11 max 57
> sum 230009 avg 11.0917 stddev 1.914 % 0.16
> EMULINST count 5683356 min 11 max 3778
> sum 79339467 avg 13.9600 stddev 50.275 % 54.78
> DEC count 13079 min 50 max 826
> sum 732712 avg 56.0220 stddev 22.382 % 0.51
> EXTINT count 55 min 30 max 1478
> sum 10996 avg 199.9273 stddev 238.150 % 0.01
> FP_UNAVAIL count 280 min 11 max 53
> sum 3163 avg 11.2964 stddev 3.495 % 0.00
> TIMEINGUEST count 7905189 min 0 max 3688
> sum 10330742 avg 1.3068 stddev 8.970 % 7.13
> csum 15810378
> sumsum 144837890 => ~2:24 runtime
>
> sumavg 7241894
>
> PV:
> MMIO count 12505 min 46 max 3087
> sum 3693782 avg 295.3844 stddev 260.788 % 4.01
> DCR count 5595 min 40 max 706
> sum 273578 avg 48.8969 stddev 31.305 % 0.30
> SIGNAL count 654 min 65 max 4132
> sum 300027 avg 458.7569 stddev 571.130 % 0.33
> ITLBREAL count 71711 min 13 max 104
> sum 943053 avg 13.1507 stddev 2.360 % 1.02
> ITLBVIRT count 750649 min 21 max 1503
> sum 18178245 avg 24.2167 stddev 7.335 % 19.71
> DTLBREAL count 83356 min 13 max 102
> sum 1146242 avg 13.7512 stddev 2.406 % 1.24
> DTLBPV count 30086 min 20 max 237
> sum 653556 avg 21.7229 stddev 4.639 % 0.71
> DTLBVIRT count 772811 min 21 max 713
> sum 19079477 avg 24.6884 stddev 6.593 % 20.69
> SYSCALL count 7647 min 11 max 57
> sum 84821 avg 11.0921 stddev 1.897 % 0.09
> HCALL count 1 min 19 max 19
> sum 19 avg 19.0000 stddev 0.000 % 0.00
> ISI count 9895 min 11 max 73
> sum 109667 avg 11.0831 stddev 1.904 % 0.12
> DSI count 17974 min 10 max 57
> sum 199504 avg 11.0996 stddev 2.046 % 0.22
> EMULINST count 2567245 min 11 max 4212
> sum 40501314 avg 15.7762 stddev 65.673 % 43.92
> DEC count 7488 min 51 max 641
> sum 426813 avg 56.9996 stddev 23.893 % 0.46
> EXTINT count 2215 min 31 max 1677
> sum 297495 avg 134.3093 stddev 116.219 % 0.32
> FP_UNAVAIL count 258 min 11 max 11
> sum 2838 avg 11.0000 stddev 0.000 % 0.00
> TIMEINGUEST count 4340090 min 0 max 3850
> sum 6316079 avg 1.4553 stddev 12.599 % 6.85
> csum 8680180
> sumsum 92206510 => ~1:32 runtime
>
> sumavg 4610325
>
> Hollis Blanchard wrote:
>> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>>> count min max
>>>>> sum avg stddev time%
>>>>> EMULINST 6,423,699 12 247,582
>>>> 91,732,705
>>>>> 14.2804 241.200 45.36
>>>>> ITLBVIRT 1,777,242 21 264,257
>>>> 47,116,557
>>>>> 26.5111 286.040 23.30
>>>>> DTLBVIRT 1,544,241 22 263,947
>>>> 41,765,447
>>>>> 27.0459 218.997 20.65
>>>> (The max here is of course when the guest was de-scheduled in favor of
>>>> another host process.)
>>>>
>>>> I think it's interesting that the min instruction emulation time is 12
>>>> usecs. In fact, our "null" exits where we do almost no processing
>>>> (ISI,
>>>> DSI, syscall) are 11 usecs, which suggests we have a problem with
>>>> interrupt handler overhead (for all exit types).
>>>>
>>> Will you consider about moving tlb manipulation out of entering path?
>>> Examining the modify array may cost some time.
>>
>> Yes, definitely.
>>
>> That's about the only thing I can see in the 440 interrupt path that
>> might take significant time. Unfortunately we can't profile that code
>> because we have no performance counters, so finding the problem will
>> require some experimentation.
>
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (4 preceding siblings ...)
2008-10-07 14:36 ` Christian Ehrhardt
@ 2008-10-08 13:49 ` Christian Ehrhardt
2008-10-08 15:41 ` Hollis Blanchard
` (4 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-08 13:49 UTC (permalink / raw)
To: kvm-ppc
Wondering about that 30.5% for post-processing and
kvmppc_check_and_deliver_interrupts, I quickly checked it in detail -
part d is now divided into 4 subparts.
I also looked at the return-to-guest path to see whether the expected
part (restoring the tlb) really is the main time eater there. The result
clearly shows that it is.
more detailed breakdown:
a) 10.94% - exit, saving guest state (booke_interrupt.S)
b) 8.12% - reaching kvmppc_handle_exit
c) 7.59% - syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1) 3.33% - some checks for all exits
d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
d3) 17.20% - can_deliver/clear&deliver exception in kvmppc_check_and_deliver_interrupts
d4) 4.47% - updating kvm_stat statistics
e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
f1) 29.18% - restoring guest tlb
f2) 4.69% - restoring guest state ([s]regs)
These fractions are % of our ~12µs syscall exit.
=> restoring tlb on each reenter = 4µs constant overhead
=> looking a bit into irq delivery and other constant things like kvm_stat updating
---
I changed some of the code for the statistics updating and the interrupt
delivery and got this:
      base    -  impirq (d3)  -  impstat (d5)  -  impboth
a)   12.57%   -   11.13%      -   12.05%       -   11.03%   exit, saving guest state (booke_interrupt.S)
b)    7.37%   -    9.38%      -    8.69%       -    8.07%   reaching kvmppc_handle_exit
c)    7.38%   -    7.20%      -    7.49%       -    9.78%   syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1)   2.49%   -    3.39%      -    2.56%       -    3.30%   some checks for all exits
d2)   8.84%   -    8.56%      -    9.28%       -    8.31%   finding first bit in kvmppc_check_and_deliver_interrupts
d3)   6.53%   -    5.25%      -    6.63%       -    5.10%   can_deliver in kvmppc_check_and_deliver_interrupts
d4)  13.66%   -   15.37%      -   14.12%       -   14.92%   clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)   3.65%   -    4.57%      -    2.68%       -    4.41%   updating kvm_stat statistics
e)    6.55%   -    6.30%      -    6.30%       -    5.89%   returning from kvmppc_handle_exit to booke_interrupt.S
f1)  30.90%   -   28.78%      -   30.16%       -   29.16%   restoring guest tlb
f2)   4.81%   -    4.77%      -    5.06%       -    4.66%   restoring guest state ([s]regs)
Well, the improvements (if any) are too close to the measuring
inaccuracy. Therefore I can't draw any conclusions right now, but the
patches at least seem to work.
I will test them soon with a longer test case and with some of the
measuring overhead removed, to avoid some of that inaccuracy.
The patch to our kvm_stat counters would make sense for style reasons
alone, so it is worth trying :-)
Now I'll go for the TLB replacement in f1.
Christian
Christian Ehrhardt wrote:
> As intended I separated the time of one of our exits into it's
> segments to see where we loose the constant base time.
> I looked at the syscall exit which is pretty fast and does not deviate
> very much.
>
> A little tracing and debugging later I can now show the rough
> breakdown where the time is spent for that exit:
>
> avg time spent in the segments I checked:
> a) 11.5% - exit, saving guest state (booke_interrupt.S)
> b) 8.5% - reaching kvmppc_handle_exit
> c) 10% - syscall exit is checked and a interrupt is queued using
> kvmppc_queue_exception
> d) 30.5% - postprocessing for all exits =
> kvmppc_check_and_deliver_interrupts and updating some statistics
> e) 5.5% - returning from kvmppc_handle_exit to booke_interrupt.S
> f) 34% - restoring guest state tlb/regs (booke_interrupt.S)
>
> Ok, looking at these numbers I think we have two areas to check in
> detail.
> - On one hand the post processing part (d) actually should not take
> that much time.
> - On the other hand we might be able to improve the last segment by
> chanign our tlb management (Liu made a suggestion on the list two
> weeks ago)
>
> When thinking about this breakdown while having the complete timing in
> mind (see below) shows us that the average time spent for one of our
> most frequent exit is near the minimum duration (EMUL). This means if
> we can reduce some of the constant overehad this might really help us
> in those frequent and hot paths (also birt tlb misses will benefit
> from that).
>
> I continue on that and keep you all updated.
> Christian
>
> Christian Ehrhardt wrote:
>> Today I refactored my code in a way that it might eventually get
>> upstream.
>> That means I removed some debug blocks, made it configurable and
>> print output only on request per sysrq.
>> Feel free to review that code now, otherwise wait for my separate rfc
>> post that will follow ~next week.
>>
>> Additionally I realized in a discussion that my host system was still
>> configured in demo mode (X11+xfce, dhcpd, avahi, ... running in host).
>> I took some updated numbers without all those background activities -
>> testacase again boot, login, ls, kill.
>>
>> I intend to the irq path in detail as suggested and first of all I'll
>> try to get some timing of the "segments" of such a call.
>> This should show us where we loose our "constant base time" and give
>> me some hints when looking at it in detail.
>>
>> FYI - new numbers without high background load (monospace recommended):
>> NOPV:
>> MMIO count 10105 min 46 max 1534
>> sum 2952467 avg 292.1788 stddev 295.933 %
>> 2.04
>> DCR count 5428 min 40 max 209
>> sum 246168 avg 45.3515 stddev 6.758 %
>> 0.17
>> SIGNAL count 695 min 65 max 3755
>> sum 89767 avg 129.1612 stddev 314.421 %
>> 0.06
>> ITLBREAL count 80045 min 13 max 108
>> sum 1063127 avg 13.2816 stddev 2.338 %
>> 0.73
>> ITLBVIRT count 1000585 min 21 max 264827
>> sum 24300596 avg 24.2864 stddev 264.753 %
>> 16.78
>> DTLBREAL count 91206 min 13 max 69
>> sum 1285214 avg 14.0913 stddev 2.225 %
>> 0.89
>> DTLBVIRT count 977434 min 21 max 1446
>> sum 24007008 avg 24.5613 stddev 4.426 %
>> 16.58
>> SYSCALL count 10460 min 11 max 55
>> sum 116447 avg 11.1326 stddev 1.929 %
>> 0.08
>> ISI count 11724 min 11 max 61
>> sum 130007 avg 11.0890 stddev 1.929 %
>> 0.09
>> DSI count 20737 min 11 max 57
>> sum 230009 avg 11.0917 stddev 1.914 %
>> 0.16
>> EMULINST count 5683356 min 11 max 3778
>> sum 79339467 avg 13.9600 stddev 50.275 %
>> 54.78
>> DEC count 13079 min 50 max 826
>> sum 732712 avg 56.0220 stddev 22.382 %
>> 0.51
>> EXTINT count 55 min 30 max 1478
>> sum 10996 avg 199.9273 stddev 238.150 %
>> 0.01
>> FP_UNAVAIL count 280 min 11 max 53
>> sum 3163 avg 11.2964 stddev 3.495 %
>> 0.00
>> TIMEINGUEST count 7905189 min 0 max 3688
>> sum 10330742 avg 1.3068 stddev 8.970 %
>> 7.13
>> csum 15810378
>> sumsum 144837890 => ~2:24 runtime
>>
>> sumavg 7241894
>>
>> PV:
>> MMIO count 12505 min 46 max 3087
>> sum 3693782 avg 295.3844 stddev 260.788 %
>> 4.01
>> DCR count 5595 min 40 max 706
>> sum 273578 avg 48.8969 stddev 31.305 %
>> 0.30
>> SIGNAL count 654 min 65 max 4132
>> sum 300027 avg 458.7569 stddev 571.130 %
>> 0.33
>> ITLBREAL count 71711 min 13 max 104
>> sum 943053 avg 13.1507 stddev 2.360 %
>> 1.02
>> ITLBVIRT count 750649 min 21 max 1503
>> sum 18178245 avg 24.2167 stddev 7.335 %
>> 19.71
>> DTLBREAL count 83356 min 13 max 102
>> sum 1146242 avg 13.7512 stddev 2.406 %
>> 1.24
>> DTLBPV count 30086 min 20 max 237
>> sum 653556 avg 21.7229 stddev 4.639 %
>> 0.71
>> DTLBVIRT count 772811 min 21 max 713
>> sum 19079477 avg 24.6884 stddev 6.593 %
>> 20.69
>> SYSCALL count 7647 min 11 max 57
>> sum 84821 avg 11.0921 stddev 1.897 %
>> 0.09
>> HCALL count 1 min 19 max 19
>> sum 19 avg 19.0000 stddev 0.000 %
>> 0.00
>> ISI count 9895 min 11 max 73
>> sum 109667 avg 11.0831 stddev 1.904 %
>> 0.12
>> DSI count 17974 min 10 max 57
>> sum 199504 avg 11.0996 stddev 2.046 %
>> 0.22
>> EMULINST count 2567245 min 11 max 4212
>> sum 40501314 avg 15.7762 stddev 65.673 %
>> 43.92
>> DEC count 7488 min 51 max 641
>> sum 426813 avg 56.9996 stddev 23.893 %
>> 0.46
>> EXTINT count 2215 min 31 max 1677
>> sum 297495 avg 134.3093 stddev 116.219 %
>> 0.32
>> FP_UNAVAIL count 258 min 11 max 11
>> sum 2838 avg 11.0000 stddev 0.000 %
>> 0.00
>> TIMEINGUEST count 4340090 min 0 max 3850
>> sum 6316079 avg 1.4553 stddev 12.599 %
>> 6.85
>> csum 8680180
>> sumsum 92206510 => ~1:32 runtime
>>
>> sumavg 4610325
>>
>> Hollis Blanchard wrote:
>>> On Thu, 2008-09-25 at 17:32 +0800, Liu Yu-B13201 wrote:
>>>>> On Wed, 2008-09-24 at 11:24 +0200, Christian Ehrhardt wrote:
>>>>>> count min max
>>>>>> sum avg stddev time%
>>>>>> EMULINST 6,423,699 12 247,582
>>>>> 91,732,705
>>>>>> 14.2804 241.200 45.36
>>>>>> ITLBVIRT 1,777,242 21 264,257
>>>>> 47,116,557
>>>>>> 26.5111 286.040 23.30
>>>>>> DTLBVIRT 1,544,241 22 263,947
>>>>> 41,765,447
>>>>>> 27.0459 218.997 20.65
>>>>> (The max here is of course when the guest was de-scheduled in
>>>>> favor of
>>>>> another host process.)
>>>>>
>>>>> I think it's interesting that the min instruction emulation time
>>>>> is 12
>>>>> usecs. In fact, our "null" exits where we do almost no processing
>>>>> (ISI,
>>>>> DSI, syscall) are 11 usecs, which suggests we have a problem with
>>>>> interrupt handler overhead (for all exit types).
>>>>>
>>>> Will you consider about moving tlb manipulation out of entering path?
>>>> Examining the modify array may cost some time.
>>>
>>> Yes, definitely.
>>>
>>> That's about the only thing I can see in the 440 interrupt path that
>>> might take significant time. Unfortunately we can't profile that code
>>> because we have no performance counters, so finding the problem will
>>> require some experimentation.
>>
>>
>
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (5 preceding siblings ...)
2008-10-08 13:49 ` Christian Ehrhardt
@ 2008-10-08 15:41 ` Hollis Blanchard
2008-10-09 8:02 ` Christian Ehrhardt
` (3 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Hollis Blanchard @ 2008-10-08 15:41 UTC (permalink / raw)
To: kvm-ppc
On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
> Wondering about that 30.5% for postprocessing and
> kvmppc_check_and_deliver_interrupts I quickly checked that in detail -
> part d is now divided in 4 subparts.
> I also looked at the return to guest path if the expected part
> (restoring tlb) is really the main time eater there. The result shows
> clearly that it is.
>
> more detailed breakdown:
> a) 10.94% - exit, saving guest state (booke_interrupt.S)
> b) 8.12% - reaching kvmppc_handle_exit
> c) 7.59% - syscall exit is checked and a interrupt is queued using
> kvmppc_queue_exception
> d1) 3.33% - some checks for all exits
> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
> d3) 17.20% - can_deliver/clear&deliver exception in
> kvmppc_check_and_deliver_interrupts
> d4) 4.47% - updating kvm_stat statistics
> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
> f1) 29.18% - restoring guest tlb
> f2) 4.69% - restoring guest state ([s]regs)
>
> These fractions are % of our ~12µs syscall exit.
> => restoring tlb on each reenter = 4µs constant overhead
> => looking a bit into irq delivery and other constant things like
> kvm_stat updating
>
...
>
> Now I go for the TLB replacement in f1.
Hang on... does d3 make sense to you? It doesn't to me, and if there's a
bug there it will be easier to fix than rewriting the TLB code. :)
I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
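(Spelling that out: 17.20% of 12,000 ns is 2,064 ns, and 2,064 ns at
1.5 ns per cycle is about 1,376 cycles - so roughly 1300-1400, with the
12µs syscall exit time itself being an approximation.)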
Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
that time is going. We're assuming the find_first_bit() loop usually
executes once, for syscall. Does it actually execute more than that? I
don't expect any of kvmppc_can_deliver_interrupt(),
kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
take lots of time.
Could it be cache effects? exception_priority[] and priority_exception[]
are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
fit into one... except they're not aligned.
Also, it looks like we use the generic find_first_bit(). That may be
more expensive than we'd like. However, since
vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
bitfield), we should be able to use ffs() instead, which has an
optimized PowerPC implementation. That might help a lot.
We might even be able to replace find_next_bit() too, by shifting a mask
over each loop, but I don't think we'll have to, since I expect the
common case to be we can deliver the first pending exception. (Worth
checking? :)
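To make that concrete, here is a minimal, hedged sketch (standalone
user-space C, not the actual kvmppc code; can_deliver()/deliver() are
stand-ins for the real kvmppc_* helpers, and __builtin_ctzl stands in
for an arch-optimized ffs()) of scanning a single-long pending bitmap
and masking off visited bits instead of calling find_next_bit():

#include <stdio.h>

static int can_deliver(unsigned int priority)
{
        (void)priority;
        return 1;       /* stand-in for kvmppc_can_deliver_interrupt() */
}

static void deliver(unsigned int priority)
{
        /* stand-in for kvmppc_booke_clear_exception() +
         * kvmppc_booke_deliver_interrupt() */
        printf("delivering priority %u\n", priority);
}

static void deliver_pending(unsigned long pending)
{
        while (pending) {
                /* lowest set bit via a single ctz-class instruction;
                 * roughly what an optimized __ffs()/ffs() does */
                unsigned int priority = (unsigned int)__builtin_ctzl(pending);

                if (can_deliver(priority))
                        deliver(priority);

                /* drop the lowest set bit instead of calling the
                 * generic find_next_bit() again */
                pending &= pending - 1;
        }
}

int main(void)
{
        deliver_pending(0x89UL);        /* bits 0, 3 and 7 pending */
        return 0;
}

Whether the masking variant buys anything depends on how often more than
one bit is set at exit time - which is exactly the "worth checking"
question above.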
--
Hollis Blanchard
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (6 preceding siblings ...)
2008-10-08 15:41 ` Hollis Blanchard
@ 2008-10-09 8:02 ` Christian Ehrhardt
2008-10-09 9:35 ` Christian Ehrhardt
` (2 subsequent siblings)
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-09 8:02 UTC (permalink / raw)
To: kvm-ppc
Hollis Blanchard wrote:
> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>
>> Wondering about that 30.5% for postprocessing and
>> kvmppc_check_and_deliver_interrupts I quickly checked that in detail -
>> part d is now divided in 4 subparts.
>> I also looked at the return to guest path if the expected part
>> (restoring tlb) is really the main time eater there. The result shows
>> clearly that it is.
>>
>> more detailed breakdown:
>> a) 10.94% - exit, saving guest state (booke_interrupt.S)
>> b) 8.12% - reaching kvmppc_handle_exit
>> c) 7.59% - syscall exit is checked and a interrupt is queued using
>> kvmppc_queue_exception
>> d1) 3.33% - some checks for all exits
>> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>> d3) 17.20% - can_deliver/clear&deliver exception in
>> kvmppc_check_and_deliver_interrupts
>> d4) 4.47% - updating kvm_stat statistics
>> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>> f1) 29.18% - restoring guest tlb
>> f2) 4.69% - restoring guest state ([s]regs)
>>
>> These fractions are % of our ~12µs syscall exit.
>> => restoring tlb on each reenter = 4µs constant overhead
>> => looking a bit into irq delivery and other constant things like
>> kvm_stat updating
>>
>>
> ...
>
>> Now I go for the TLB replacement in f1.
>>
>
> Hang on... does d3 make sense to you? It doesn't to me, and if there's a
> bug there it will be easier to fix than rewriting the TLB code. :)
>
I have not given up on improving that part either :-)
> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>
I get the same results. 1% ~ 80 cycles.
> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
> that time is going. We're assuming the find_first_bit() loop usually
> executes once, for syscall. Does it actually execute more than that? I
> don't expect any of kvmppc_can_deliver_interrupt(),
> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
> take lots of time.
>
You can see below that I already had a more detailed breakdown in my old
mail:
[...]
d2) 8.84% - 8.56% - 9.28% - 8.31% finding first bit in
kvmppc_check_and_deliver_interrupts
d3) 6.53% - 5.25% - 6.63% - 5.10% can_deliver in
kvmppc_check_and_deliver_interrupts
d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver
exception in kvmppc_check_and_deliver_interrupts
[...]
> Could it be cache effects? exception_priority[] and priority_exception[]
> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
> fit into one... except they're not aligned.
>
I would be so happy if I had hardware performance counters for things
like cache misses :-)
> Also, it looks like we use the generic find_first_bit(). That may be
> more expensive than we'd like. However, since
> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
> bitfield), we should be able to use ffs() instead, which has an
> optimized PowerPC implementation. That might help a lot.
>
good idea.
I'll check this and some other small improvements I have in mind.
> We might even be able to replace find_next_bit() too, by shifting a mask
> over each loop, but I don't think we'll have to, since I expect the
> common case to be we can deliver the first pending exception. (Worth
> checking? :)
>
I'm not sure. It's certainly worth checking how often that second
find_next_bit is called.
If that number turns out to be very small, it's not worth it.
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (7 preceding siblings ...)
2008-10-09 8:02 ` Christian Ehrhardt
@ 2008-10-09 9:35 ` Christian Ehrhardt
2008-10-09 14:49 ` Christian Ehrhardt
2008-10-10 8:32 ` Christian Ehrhardt
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-09 9:35 UTC (permalink / raw)
To: kvm-ppc
I modified the code according to your comments and my own ideas; the new
values are shown in the column impISF (irq delivery, Stat, FindFirstBit).
I changed some of the statistics-updating and interrupt-delivery code
and got this:
      base - impirq (d3) - impstat (d5) - impboth - impISF
a)  12.57% - 11.13% - 12.05% - 11.03% - 12.28%  exit, saving guest state (booke_interrupt.S)
b)   7.37% -  9.38% -  8.69% -  8.07% - 10.13%  reaching kvmppc_handle_exit
c)   7.38% -  7.20% -  7.49% -  9.78% -  7.85%  syscall exit is checked and an interrupt is queued
                                                using kvmppc_queue_exception
d1)  2.49% -  3.39% -  2.56% -  3.30% -  3.70%  some checks for all exits
d2)  8.84% -  8.56% -  9.28% -  8.31% -  6.07%  finding first bit in kvmppc_check_and_deliver_interrupts
d3)  6.53% -  5.25% -  6.63% -  5.10% -  4.27%  can_deliver in kvmppc_check_and_deliver_interrupts
d4) 13.66% - 15.37% - 14.12% - 14.92% - 13.96%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)  3.65% -  4.57% -  2.68% -  4.41% -  3.77%  updating kvm_stat statistics
e)   6.55% -  6.30% -  6.30% -  5.89% -  6.74%  returning from kvmppc_handle_exit to booke_interrupt.S
f1) 30.90% - 28.78% - 30.16% - 29.16% - 31.19%  restoring guest tlb
f2)  4.81% -  4.77% -  5.06% -  4.66% -  5.17%  restoring guest state ([s]regs)
We can all see the measurement inaccuracy, but the last columns look good
in the improved sections d2, d3 and d4.
I'll remove this detailed tracing soon and run a larger test, hoping that
it will not show this inaccuracy.
But for now I still wonder about the ~14% for clear&deliver - it just
should not be "that" much.
It is worth looking into that section once again in more detail first.
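For reference, a minimal sketch of the kind of per-section accounting
used for the a)..f2) breakdown - assuming each section is simply
bracketed by timebase reads; the section names, array and macro below
are illustrative only, not the actual tracing patch, and the snippet is
PowerPC-only:

#include <stdio.h>

/* PowerPC-only: read the lower timebase register */
static inline unsigned long read_tbl(void)
{
        unsigned long tbl;
        asm volatile("mftb %0" : "=r" (tbl));
        return tbl;
}

enum { SEC_A_SAVE, SEC_D4_CLEAR_DELIVER, SEC_F1_RESTORE_TLB, SEC_MAX };

static unsigned long section_sum[SEC_MAX];
static unsigned long section_cnt[SEC_MAX];

/* bracket a section with timebase reads and account the delta */
#define ACCOUNT_SECTION(sec, stmt) do {                 \
        unsigned long __t0 = read_tbl();                \
        stmt;                                           \
        section_sum[sec] += read_tbl() - __t0;          \
        section_cnt[sec]++;                             \
} while (0)

int main(void)
{
        volatile int dummy = 0;

        ACCOUNT_SECTION(SEC_D4_CLEAR_DELIVER, dummy++);
        printf("d4: %lu timebase ticks over %lu calls\n",
               section_sum[SEC_D4_CLEAR_DELIVER],
               section_cnt[SEC_D4_CLEAR_DELIVER]);
        return 0;
}

The timebase reads themselves add a few cycles per section, which is
presumably part of the measurement inaccuracy mentioned above.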
Christian Ehrhardt wrote:
> Hollis Blanchard wrote:
>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>
>>> Wondering about that 30.5% for postprocessing and
>>> kvmppc_check_and_deliver_interrupts I quickly checked that in detail
>>> - part d is now divided in 4 subparts.
>>> I also looked at the return to guest path if the expected part
>>> (restoring tlb) is really the main time eater there. The result
>>> shows clearly that it is.
>>>
>>> more detailed breakdown:
>>> a) 10.94% - exit, saving guest state (booke_interrupt.S)
>>> b) 8.12% - reaching kvmppc_handle_exit
>>> c) 7.59% - syscall exit is checked and a interrupt is queued
>>> using kvmppc_queue_exception
>>> d1) 3.33% - some checks for all exits
>>> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>> d3) 17.20% - can_deliver/clear&deliver exception in
>>> kvmppc_check_and_deliver_interrupts
>>> d4) 4.47% - updating kvm_stat statistics
>>> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>> f1) 29.18% - restoring guest tlb
>>> f2) 4.69% - restoring guest state ([s]regs)
>>>
>>> These fractions are % of our ~12µs syscall exit.
>>> => restoring tlb on each reenter = 4µs constant overhead
>>> => looking a bit into irq delivery and other constant things like
>>> kvm_stat updating
>>>
>>>
>> ...
>>
>>> Now I go for the TLB replacement in f1.
>>>
>>
>> Hang on... does d3 make sense to you? It doesn't to me, and if there's a
>> bug there it will be easier to fix than rewriting the TLB code. :)
>>
> I did not give up improving that part too :-)
>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
>> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>
> I get the same results. 1% ~ 80 cycles.
>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
>> that time is going. We're assuming the find_first_bit() loop usually
>> executes once, for syscall. Does it actually execute more than that? I
>> don't expect any of kvmppc_can_deliver_interrupt(),
>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
>> take lots of time.
>>
> You can see below that I already had a more detailed breakdown in my
> old mail:
> [...]
> d2) 8.84% - 8.56% - 9.28% - 8.31% finding first bit in
> kvmppc_check_and_deliver_interrupts
> d3) 6.53% - 5.25% - 6.63% - 5.10% can_deliver in
> kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver
> exception in kvmppc_check_and_deliver_interrupts
> [...]
>> Could it be cache effects? exception_priority[] and priority_exception[]
>> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
>> fit into one... except they're not aligned.
>>
> I would be so happy if I would have hardware performance counters like
> cache misses :-)
>> Also, it looks like we use the generic find_first_bit(). That may be
>> more expensive than we'd like. However, since
>> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
>> bitfield), we should be able to use ffs() instead, which has an
>> optimized PowerPC implementation. That might help a lot.
>>
> good idea.
> I'll check this and some other small improvements I have in mind.
>
>> We might even be able to replace find_next_bit() too, by shifting a mask
>> over each loop, but I don't think we'll have to, since I expect the
>> common case to be we can deliver the first pending exception. (Worth
>> checking? :)
>>
> I'm not sure. It's surely worth checking how often that second
> find_next_bit is called.
> If that number is far too small it's not worth.
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (8 preceding siblings ...)
2008-10-09 9:35 ` Christian Ehrhardt
@ 2008-10-09 14:49 ` Christian Ehrhardt
2008-10-10 8:32 ` Christian Ehrhardt
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-09 14:49 UTC (permalink / raw)
To: kvm-ppc
Ok, looking at clear&deliver in detail I made some minor changes, but in
the end there is no single hot spot to improve.
Most of the ~14% spent in clear&deliver is lost in
kvmppc_mmu_priv_switch (~8%); the rest is spread evenly across the code -
nothing obvious (maybe some cache misses, though).
This code has to be touched anyway when going for a change in the
guest/host (shadow) TLB management.
Christian Ehrhardt wrote:
> I modified the code according to your comments and my ideas, the new
> values are shown in column impISF (irq delivery, Stat, FindFirstBit)
>
> I changed some code of the statistic updating and the interrupt
> delivery and got this:
> base - impirq (d3) - impstat (d5) - impboth - impISF
> a) 12.57% - 11.13% - 12.05% - 11.03% - 12.28% exit,
> saving guest state (booke_interrupt.S)
> b) 7.37% - 9.38% - 8.69% - 8.07% - 10.13% reaching
> kvmppc_handle_exit
> c) 7.38% - 7.20% - 7.49% - 9.78% - 7.85% syscall
> exit is checked and a interrupt is queued using kvmppc_queue_exception
> d1) 2.49% - 3.39% - 2.56% - 3.30% - 3.70% some
> checks for all exits
> d2) 8.84% - 8.56% - 9.28% - 8.31% - 6.07% finding
> first bit in kvmppc_check_and_deliver_interrupts
> d3) 6.53% - 5.25% - 6.63% - 5.10% - 4.27%
> can_deliver in kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92% - 13.96%
> clear&deliver exception in kvmppc_check_and_deliver_interrupts
> d5) 3.65% - 4.57% - 2.68% - 4.41% - 3.77% updating
> kvm_stat statistics
> e) 6.55% - 6.30% - 6.30% - 5.89% - 6.74% returning
> from kvmppc_handle_exit to booke_interrupt.S
> f1) 30.90% - 28.78% - 30.16% - 29.16% - 31.19% restoring
> guest tlb
> f2) 4.81% - 4.77% - 5.06% - 4.66% - 5.17% restoring
> guest state ([s]regs)
>
> We all see the measurement inaccuracy, but the last columns look good
> at the improved sections d2, d3 and d4.
> I'll remove these detailed tracing soon and make a larger test hoping
> that this will not have the inaccuracy.
> But for now I still wonder about the ~14% for clear&deliver - that
> should just not be "that" much.
> It should be worth to look into that section once again more in detail
> first.
>
> Christian Ehrhardt wrote:
>> Hollis Blanchard wrote:
>>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>>
>>>> Wondering about that 30.5% for postprocessing and
>>>> kvmppc_check_and_deliver_interrupts I quickly checked that in
>>>> detail - part d is now divided in 4 subparts.
>>>> I also looked at the return to guest path if the expected part
>>>> (restoring tlb) is really the main time eater there. The result
>>>> shows clearly that it is.
>>>>
>>>> more detailed breakdown:
>>>> a) 10.94% - exit, saving guest state (booke_interrupt.S)
>>>> b) 8.12% - reaching kvmppc_handle_exit
>>>> c) 7.59% - syscall exit is checked and a interrupt is queued
>>>> using kvmppc_queue_exception
>>>> d1) 3.33% - some checks for all exits
>>>> d2) 8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>>> d3) 17.20% - can_deliver/clear&deliver exception in
>>>> kvmppc_check_and_deliver_interrupts
>>>> d4) 4.47% - updating kvm_stat statistics
>>>> e) 6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>>> f1) 29.18% - restoring guest tlb
>>>> f2) 4.69% - restoring guest state ([s]regs)
>>>>
>>>> These fractions are % of our ~12µs syscall exit.
>>>> => restoring tlb on each reenter = 4µs constant overhead
>>>> => looking a bit into irq delivery and other constant things like
>>>> kvm_stat updating
>>>>
>>>>
>>> ...
>>>
>>>> Now I go for the TLB replacement in f1.
>>>>
>>>
>>> Hang on... does d3 make sense to you? It doesn't to me, and if
>>> there's a
>>> bug there it will be easier to fix than rewriting the TLB code. :)
>>>
>> I did not give up improving that part too :-)
>>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
>>> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>>
>> I get the same results. 1% ~ 80 cycles.
>>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
>>> that time is going. We're assuming the find_first_bit() loop usually
>>> executes once, for syscall. Does it actually execute more than that? I
>>> don't expect any of kvmppc_can_deliver_interrupt(),
>>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
>>> take lots of time.
>>>
>> You can see below that I already had a more detailed breakdown in my
>> old mail:
>> [...]
>> d2) 8.84% - 8.56% - 9.28% - 8.31% finding first bit
>> in kvmppc_check_and_deliver_interrupts
>> d3) 6.53% - 5.25% - 6.63% - 5.10% can_deliver in
>> kvmppc_check_and_deliver_interrupts
>> d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver
>> exception in kvmppc_check_and_deliver_interrupts
>> [...]
>>> Could it be cache effects? exception_priority[] and
>>> priority_exception[]
>>> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should
>>> both
>>> fit into one... except they're not aligned.
>>>
>> I would be so happy if I would have hardware performance counters
>> like cache misses :-)
>>> Also, it looks like we use the generic find_first_bit(). That may be
>>> more expensive than we'd like. However, since
>>> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
>>> bitfield), we should be able to use ffs() instead, which has an
>>> optimized PowerPC implementation. That might help a lot.
>>>
>> good idea.
>> I'll check this and some other small improvements I have in mind.
>>
>>> We might even be able to replace find_next_bit() too, by shifting a
>>> mask
>>> over each loop, but I don't think we'll have to, since I expect the
>>> common case to be we can deliver the first pending exception. (Worth
>>> checking? :)
>>>
>> I'm not sure. It's surely worth checking how often that second
>> find_next_bit is called.
>> If that number is far too small it's not worth.
>>
>
>
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: exit timing analysis v1 - comments&discussions welcome
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
` (9 preceding siblings ...)
2008-10-09 14:49 ` Christian Ehrhardt
@ 2008-10-10 8:32 ` Christian Ehrhardt
10 siblings, 0 replies; 12+ messages in thread
From: Christian Ehrhardt @ 2008-10-10 8:32 UTC (permalink / raw)
To: kvm-ppc
Hollis Blanchard wrote:
> Also, it looks like we use the generic find_first_bit(). That may be
> more expensive than we'd like. However, since
> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
> bitfield), we should be able to use ffs() instead, which has an
> optimized PowerPC implementation. That might help a lot.
>
> We might even be able to replace find_next_bit() too, by shifting a mask
> over each loop, but I don't think we'll have to, since I expect the
> common case to be we can deliver the first pending exception. (Worth
> checking? :)
>
FFS for the first lookup is working fine.
To check whether it is worth converting the following find_next_bit to a
mask & __ffs approach as well, I did some quick accounting, and it seems
it is not.
On the exits caused by external interrupts we hit "find next bit" in
~50% of the cases (MMIO, DCR, DEC, EXTINT).
But on all the frequent exit types like DTLBVIRT, ITLBVIRT and EMULINST we
are at ~2.75%.
Therefore it should be ok to keep the generic find_next_bit there.
--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2008-10-10 8:32 UTC | newest]
Thread overview: 12+ messages
2008-09-24 9:24 exit timing analysis v1 - comments&discussions welcome Christian Ehrhardt
2008-09-24 15:14 ` Hollis Blanchard
2008-09-25 9:32 ` Liu Yu-B13201
2008-09-25 15:18 ` Hollis Blanchard
2008-10-02 12:02 ` Christian Ehrhardt
2008-10-07 14:36 ` Christian Ehrhardt
2008-10-08 13:49 ` Christian Ehrhardt
2008-10-08 15:41 ` Hollis Blanchard
2008-10-09 8:02 ` Christian Ehrhardt
2008-10-09 9:35 ` Christian Ehrhardt
2008-10-09 14:49 ` Christian Ehrhardt
2008-10-10 8:32 ` Christian Ehrhardt