From: Christian Ehrhardt
Date: Thu, 09 Oct 2008 14:49:12 +0000
Subject: Re: exit timing analysis v1 - comments&discussions welcome
Message-Id: <48EE19E8.9080002@linux.vnet.ibm.com>
References: <48DA0747.3020004@linux.vnet.ibm.com>
In-Reply-To: <48DA0747.3020004@linux.vnet.ibm.com>
To: kvm-ppc@vger.kernel.org

Ok, looking at clear&deliver in detail I made some minor changes, but in
the end there is no single spot left to improve. Most of the ~14% spent
in clear&deliver is lost in kvmppc_mmu_priv_switch (~8%); the rest is
spread evenly across the code - nothing obvious (maybe some cache
misses, though). This code has to be touched anyway when we change the
guest/host (shadow) TLB management.

Christian Ehrhardt wrote:
> I modified the code according to your comments and my ideas; the new
> values are shown in column impISF (Irq delivery, Stat, FindFirstBit).
>
> I changed some code of the statistics updating and the interrupt
> delivery and got this:
>
>        base - impirq (d3) - impstat (d5) - impboth - impISF
> a)  12.57% - 11.13% - 12.05% - 11.03% - 12.28%  exit, saving guest state (booke_interrupt.S)
> b)   7.37% -  9.38% -  8.69% -  8.07% - 10.13%  reaching kvmppc_handle_exit
> c)   7.38% -  7.20% -  7.49% -  9.78% -  7.85%  syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
> d1)  2.49% -  3.39% -  2.56% -  3.30% -  3.70%  some checks for all exits
> d2)  8.84% -  8.56% -  9.28% -  8.31% -  6.07%  finding first bit in kvmppc_check_and_deliver_interrupts
> d3)  6.53% -  5.25% -  6.63% -  5.10% -  4.27%  can_deliver in kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92% - 13.96%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
> d5)  3.65% -  4.57% -  2.68% -  4.41% -  3.77%  updating kvm_stat statistics
> e)   6.55% -  6.30% -  6.30% -  5.89% -  6.74%  returning from kvmppc_handle_exit to booke_interrupt.S
> f1) 30.90% - 28.78% - 30.16% - 29.16% - 31.19%  restoring guest tlb
> f2)  4.81% -  4.77% -  5.06% -  4.66% -  5.17%  restoring guest state ([s]regs)
>
> We all see the measurement inaccuracy, but the last column looks good
> in the improved sections d2, d3 and d4.
> I'll remove this detailed tracing soon and run a larger test, hoping
> that this will not show the inaccuracy.
> But for now I still wonder about the ~14% for clear&deliver - that
> should just not be "that" much.
> It is worth looking into that section once again in more detail first.
>
> Christian Ehrhardt wrote:
>> Hollis Blanchard wrote:
>>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>>
>>>> Wondering about that 30.5% for postprocessing and
>>>> kvmppc_check_and_deliver_interrupts I quickly checked that in
>>>> detail - part d is now divided into 4 subparts.
>>>> I also looked at the return-to-guest path to see whether the
>>>> expected part (restoring the tlb) really is the main time eater
>>>> there. The result shows clearly that it is.
>>>>
>>>> more detailed breakdown:
>>>> a)  10.94% - exit, saving guest state (booke_interrupt.S)
>>>> b)   8.12% - reaching kvmppc_handle_exit
>>>> c)   7.59% - syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
>>>> d1)  3.33% - some checks for all exits
>>>> d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>>> d3) 17.20% - can_deliver/clear&deliver exception in kvmppc_check_and_deliver_interrupts
>>>> d4)  4.47% - updating kvm_stat statistics
>>>> e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>>> f1) 29.18% - restoring guest tlb
>>>> f2)  4.69% - restoring guest state ([s]regs)
>>>>
>>>> These fractions are % of our ~12µs syscall exit.
>>>> => restoring the tlb on each reenter = ~4µs constant overhead
>>>> => look a bit into irq delivery and other constant costs like
>>>>    kvm_stat updating
>>>>
>>> ...
>>>
>>>> Now I go for the TLB replacement in f1.
>>>>
>>> Hang on... does d3 make sense to you? It doesn't to me, and if
>>> there's a bug there it will be easier to fix than rewriting the TLB
>>> code. :)
>>>
>> I have not given up on improving that part either :-)
>>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
>>> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>>
>> I get the same result: 1% ~ 80 cycles.
>>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
>>> that time is going. We're assuming the find_first_bit() loop usually
>>> executes once, for syscall. Does it actually execute more often than
>>> that? I don't expect any of kvmppc_can_deliver_interrupt(),
>>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
>>> take lots of time.
>>>
>> You can see below that I already had a more detailed breakdown in my
>> old mail:
>> [...]
>> d2)  8.84% -  8.56% -  9.28% -  8.31%  finding first bit in kvmppc_check_and_deliver_interrupts
>> d3)  6.53% -  5.25% -  6.63% -  5.10%  can_deliver in kvmppc_check_and_deliver_interrupts
>> d4) 13.66% - 15.37% - 14.12% - 14.92%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
>> [...]
>>> Could it be cache effects? exception_priority[] and priority_exception[]
>>> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should
>>> both fit into one... except they're not aligned.
>>>
>> I would be so happy if I had hardware performance counters for things
>> like cache misses :-)
>>> Also, it looks like we use the generic find_first_bit(). That may be
>>> more expensive than we'd like. However, since
>>> vcpu->arch.pending_exceptions is a single long (not an arbitrarily
>>> sized bitfield), we should be able to use ffs() instead, which has an
>>> optimized PowerPC implementation. That might help a lot.
>>>
>> Good idea.
>> I'll check this and some other small improvements I have in mind.
>>
>>> We might even be able to replace find_next_bit() too, by shifting a
>>> mask over each loop, but I don't think we'll have to, since I expect
>>> the common case to be that we can deliver the first pending exception.
>>> (Worth checking? :)
>>>
>> I'm not sure. It's surely worth checking how often that second
>> find_next_bit is called.
>> If that number turns out to be very small, it's not worth it.
>>

-- 

Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
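PS: To make the ffs() idea concrete, here is a rough sketch of how the
delivery loop could look. This is only an illustration under
assumptions, not a patch: the loop structure, the exact signatures of
kvmppc_can_deliver_interrupt() and friends, and the priority_exception[]
indexing are guessed from this thread, and the local bit-mask stands in
for find_next_bit() as suggested above.

  #include <linux/bitops.h>    /* __ffs(); has an optimized PowerPC version */
  #include <linux/kvm_host.h>  /* struct kvm_vcpu */

  /* Sketch only: walk vcpu->arch.pending_exceptions with __ffs()
   * instead of the generic find_first_bit()/find_next_bit(). This
   * works because pending_exceptions is a single unsigned long. */
  void kvmppc_check_and_deliver_interrupts(struct kvm_vcpu *vcpu)
  {
          /* Local copy: clearing bits here replaces find_next_bit(). */
          unsigned long pending = vcpu->arch.pending_exceptions;
          unsigned int priority;
          unsigned int exception;

          while (pending) {
                  /* __ffs() must never see 0; the loop condition
                   * guarantees that. */
                  priority = __ffs(pending);
                  exception = priority_exception[priority];

                  if (kvmppc_can_deliver_interrupt(vcpu, exception)) {
                          kvmppc_booke_clear_exception(vcpu, exception);
                          kvmppc_booke_deliver_interrupt(vcpu, exception);
                          /* Deliver at most one interrupt per pass in
                           * this sketch. */
                          break;
                  }

                  /* Not deliverable right now: mask it out locally and
                   * try the next pending exception. */
                  pending &= ~(1UL << priority);
          }
  }

In the common case (only the syscall bit set, and deliverable) this
calls __ffs() exactly once, which is the path the d2/d3 numbers above
are measuring.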