From: Christian Ehrhardt
Date: Thu, 09 Oct 2008 14:49:12 +0000
Subject: Re: exit timing analysis v1 - comments&discussions welcome
Message-Id: <48EE19E8.9080002@linux.vnet.ibm.com>
References: <48DA0747.3020004@linux.vnet.ibm.com>
In-Reply-To: <48DA0747.3020004@linux.vnet.ibm.com>
To: kvm-ppc@vger.kernel.org

Ok, looking at clear&deliver in detail I made some minor changes, but in
the end there is no single spot left to improve. Most of the ~14% spent
in clear&deliver is lost in kvmppc_mmu_priv_switch (~8%); the rest is
spread evenly across the code - nothing obvious (maybe some cache
misses, though). This code has to be touched anyway when we change the
guest/host (shadow) TLB management.

Christian Ehrhardt wrote:
> I modified the code according to your comments and my ideas; the new
> values are shown in column impISF (Irq delivery, Stat, FindFirstBit).
>
> I changed some code of the statistics updating and the interrupt
> delivery and got this:
>
>        base - impirq (d3) - impstat (d5) - impboth - impISF
> a)  12.57% - 11.13% - 12.05% - 11.03% - 12.28%  exit, saving guest state (booke_interrupt.S)
> b)   7.37% -  9.38% -  8.69% -  8.07% - 10.13%  reaching kvmppc_handle_exit
> c)   7.38% -  7.20% -  7.49% -  9.78% -  7.85%  syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
> d1)  2.49% -  3.39% -  2.56% -  3.30% -  3.70%  some checks for all exits
> d2)  8.84% -  8.56% -  9.28% -  8.31% -  6.07%  finding first bit in kvmppc_check_and_deliver_interrupts
> d3)  6.53% -  5.25% -  6.63% -  5.10% -  4.27%  can_deliver in kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92% - 13.96%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
> d5)  3.65% -  4.57% -  2.68% -  4.41% -  3.77%  updating kvm_stat statistics
> e)   6.55% -  6.30% -  6.30% -  5.89% -  6.74%  returning from kvmppc_handle_exit to booke_interrupt.S
> f1) 30.90% - 28.78% - 30.16% - 29.16% - 31.19%  restoring guest tlb
> f2)  4.81% -  4.77% -  5.06% -  4.66% -  5.17%  restoring guest state ([s]regs)
>
> We all see the measurement inaccuracy, but the last column looks good
> in the improved sections d2, d3 and d4.
> I'll remove this detailed tracing soon and run a larger test, hoping
> that this will not show the inaccuracy.
> But for now I still wonder about the ~14% for clear&deliver - that
> should just not be "that" much.
> It is worth looking into that section once again in more detail first.
>
> Christian Ehrhardt wrote:
>> Hollis Blanchard wrote:
>>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>>
>>>> Wondering about that 30.5% for postprocessing and
>>>> kvmppc_check_and_deliver_interrupts I quickly checked that in
>>>> detail - part d is now divided into 4 subparts.
>>>> I also looked at the return-to-guest path to see whether the
>>>> expected part (restoring the tlb) really is the main time eater
>>>> there. The result shows clearly that it is.
>>>>
>>>> more detailed breakdown:
>>>> a)  10.94% - exit, saving guest state (booke_interrupt.S)
>>>> b)   8.12% - reaching kvmppc_handle_exit
>>>> c)   7.59% - syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
>>>> d1)  3.33% - some checks for all exits
>>>> d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>>> d3) 17.20% - can_deliver/clear&deliver exception in kvmppc_check_and_deliver_interrupts
>>>> d4)  4.47% - updating kvm_stat statistics
>>>> e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>>> f1) 29.18% - restoring guest tlb
>>>> f2)  4.69% - restoring guest state ([s]regs)
>>>>
>>>> These fractions are % of our ~12µs syscall exit.
>>>> => restoring the tlb on each reenter = ~4µs constant overhead
>>>> => look a bit into irq delivery and other constant costs like
>>>>    kvm_stat updating
>>>>
>>> ...
>>>
>>>> Now I go for the TLB replacement in f1.
>>>>
>>> Hang on... does d3 make sense to you? It doesn't to me, and if
>>> there's a bug there it will be easier to fix than rewriting the TLB
>>> code. :)
>>>
>> I have not given up on improving that part either :-)
>>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle. 17.20%
>>> of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>>
>> I get the same result: 1% ~ 80 cycles.
>>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure where
>>> that time is going. We're assuming the find_first_bit() loop usually
>>> executes once, for syscall. Does it actually execute more often than
>>> that? I don't expect any of kvmppc_can_deliver_interrupt(),
>>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt() to
>>> take lots of time.
>>>
>> You can see below that I already had a more detailed breakdown in my
>> old mail:
>> [...]
>> d2)  8.84% -  8.56% -  9.28% -  8.31%  finding first bit in kvmppc_check_and_deliver_interrupts
>> d3)  6.53% -  5.25% -  6.63% -  5.10%  can_deliver in kvmppc_check_and_deliver_interrupts
>> d4) 13.66% - 15.37% - 14.12% - 14.92%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
>> [...]
>>> Could it be cache effects? exception_priority[] and priority_exception[]
>>> are 16 bytes each, and our L1 cacheline is 32 bytes, so they should
>>> both fit into one... except they're not aligned.
>>>
>> I would be so happy if I had hardware performance counters for things
>> like cache misses :-)
>>> Also, it looks like we use the generic find_first_bit(). That may be
>>> more expensive than we'd like. However, since
>>> vcpu->arch.pending_exceptions is a single long (not an arbitrarily
>>> sized bitfield), we should be able to use ffs() instead, which has an
>>> optimized PowerPC implementation. That might help a lot.
>>>
>> Good idea.
>> I'll check this and some other small improvements I have in mind.
>>
>>> We might even be able to replace find_next_bit() too, by shifting a
>>> mask over each loop, but I don't think we'll have to, since I expect
>>> the common case to be that we can deliver the first pending exception.
>>> (Worth checking? :)
>>>
>> I'm not sure. It's surely worth checking how often that second
>> find_next_bit is called.
>> If that number turns out to be very small, it's not worth it.
>>

-- 

Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
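PS: To make the ffs() idea concrete, here is a rough sketch of how the
delivery loop could look. This is only an illustration under
assumptions, not a patch: the loop structure, the exact signatures of
kvmppc_can_deliver_interrupt() and friends, and the priority_exception[]
indexing are guessed from this thread, and the local bit-mask stands in
for find_next_bit() as suggested above.

  #include <linux/bitops.h>    /* __ffs(); has an optimized PowerPC version */
  #include <linux/kvm_host.h>  /* struct kvm_vcpu */

  /* Sketch only: walk vcpu->arch.pending_exceptions with __ffs()
   * instead of the generic find_first_bit()/find_next_bit(). This
   * works because pending_exceptions is a single unsigned long. */
  void kvmppc_check_and_deliver_interrupts(struct kvm_vcpu *vcpu)
  {
          /* Local copy: clearing bits here replaces find_next_bit(). */
          unsigned long pending = vcpu->arch.pending_exceptions;
          unsigned int priority;
          unsigned int exception;

          while (pending) {
                  /* __ffs() must never see 0; the loop condition
                   * guarantees that. */
                  priority = __ffs(pending);
                  exception = priority_exception[priority];

                  if (kvmppc_can_deliver_interrupt(vcpu, exception)) {
                          kvmppc_booke_clear_exception(vcpu, exception);
                          kvmppc_booke_deliver_interrupt(vcpu, exception);
                          /* Deliver at most one interrupt per pass in
                           * this sketch. */
                          break;
                  }

                  /* Not deliverable right now: mask it out locally and
                   * try the next pending exception. */
                  pending &= ~(1UL << priority);
          }
  }

In the common case (only the syscall bit set, and deliverable) this
calls __ffs() exactly once, which is the path the d2/d3 numbers above
are measuring.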