From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christian Ehrhardt
Date: Thu, 09 Oct 2008 08:02:46 +0000
Subject: Re: exit timing analysis v1 - comments&discussions welcome
Message-Id: <48EDBAA6.9050308@linux.vnet.ibm.com>
List-Id:
References: <48DA0747.3020004@linux.vnet.ibm.com>
In-Reply-To: <48DA0747.3020004@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
To: kvm-ppc@vger.kernel.org

Hollis Blanchard wrote:
> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>
>> Wondering about that 30.5% for postprocessing and
>> kvmppc_check_and_deliver_interrupts, I quickly checked that in
>> detail - part d is now divided into 4 subparts.
>> I also looked at the return-to-guest path to see whether the
>> expected part (restoring the TLB) is really the main time eater
>> there. The result shows clearly that it is.
>>
>> More detailed breakdown:
>> a)  10.94% - exit, saving guest state (booke_interrupt.S)
>> b)   8.12% - reaching kvmppc_handle_exit
>> c)   7.59% - syscall exit is checked and an interrupt is queued
>>              using kvmppc_queue_exception
>> d1)  3.33% - some checks for all exits
>> d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>> d3) 17.20% - can_deliver/clear&deliver exception in
>>              kvmppc_check_and_deliver_interrupts
>> d4)  4.47% - updating kvm_stat statistics
>> e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>> f1) 29.18% - restoring guest TLB
>> f2)  4.69% - restoring guest state ([s]regs)
>>
>> These fractions are % of our ~12µs syscall exit.
>> => restoring the TLB on each reentry = 4µs constant overhead
>> => looking a bit into irq delivery and other constant things like
>>    kvm_stat updating
>>
> ...
>
>> Now I go for the TLB replacement in f1.
>>
> Hang on... does d3 make sense to you? It doesn't to me, and if there's a
> bug there it will be easier to fix than rewriting the TLB code.
:)
I did not give up on improving that part too :-)

> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>
I get the same results: 1% ~ 80 cycles.

> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
> that time is going. We're assuming the find_first_bit() loop usually
> executes once, for syscall. Does it actually execute more than that? I
> don't expect any of kvmppc_can_deliver_interrupt(),
> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
> take lots of time.
>
You can see below that I already had a more detailed breakdown in my
old mail:
[...]
d2)  8.84% -  8.56% -  9.28% -  8.31% finding first bit in
     kvmppc_check_and_deliver_interrupts
d3)  6.53% -  5.25% -  6.63% -  5.10% can_deliver in
     kvmppc_check_and_deliver_interrupts
d4) 13.66% - 15.37% - 14.12% - 14.92% clear&deliver exception in
     kvmppc_check_and_deliver_interrupts
[...]

> Could it be cache effects? exception_priority[] and priority_exception[]
> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should both
> fit into one... except they're not aligned.
>
I would be so happy if I had hardware performance counters for things
like cache misses :-)

> Also, it looks like we use the generic find_first_bit(). That may be
> more expensive than we'd like. However, since
> vcpu->arch.pending_exceptions is a single long (not an arbitrary sized
> bitfield), we should be able to use ffs() instead, which has an
> optimized PowerPC implementation. That might help a lot.
>
Good idea. I'll check this and some other small improvements I have in
mind.

> We might even be able to replace find_next_bit() too, by shifting a mask
> over each loop, but I don't think we'll have to, since I expect the
> common case to be that we can deliver the first pending exception.
> (Worth checking? :)
>
I'm not sure. It's surely worth checking how often that second
find_next_bit is actually called.
If that number is far too small, it's not worth it.

--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization