From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christian Ehrhardt
Date: Thu, 09 Oct 2008 09:35:05 +0000
Subject: Re: exit timing analysis v1 - comments&discussions welcome
Message-Id: <48EDD049.1000709@linux.vnet.ibm.com>
List-Id:
References: <48DA0747.3020004@linux.vnet.ibm.com>
In-Reply-To: <48DA0747.3020004@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
To: kvm-ppc@vger.kernel.org

I modified the code according to your comments and my ideas; the new
values are shown in the column impISF (irq delivery, Stat, FindFirstBit).
I changed some code of the statistic updating and the interrupt delivery
and got this:

      base   - impirq(d3) - impstat(d5) - impboth - impISF
a)   12.57% -  11.13%    -  12.05%     -  11.03% - 12.28%  exit, saving guest state (booke_interrupt.S)
b)    7.37% -   9.38%    -   8.69%     -   8.07% - 10.13%  reaching kvmppc_handle_exit
c)    7.38% -   7.20%    -   7.49%     -   9.78% -  7.85%  syscall exit is checked and an interrupt is queued using kvmppc_queue_exception
d1)   2.49% -   3.39%    -   2.56%     -   3.30% -  3.70%  some checks for all exits
d2)   8.84% -   8.56%    -   9.28%     -   8.31% -  6.07%  finding first bit in kvmppc_check_and_deliver_interrupts
d3)   6.53% -   5.25%    -   6.63%     -   5.10% -  4.27%  can_deliver in kvmppc_check_and_deliver_interrupts
d4)  13.66% -  15.37%    -  14.12%     -  14.92% - 13.96%  clear&deliver exception in kvmppc_check_and_deliver_interrupts
d5)   3.65% -   4.57%    -   2.68%     -   4.41% -  3.77%  updating kvm_stat statistics
e)    6.55% -   6.30%    -   6.30%     -   5.89% -  6.74%  returning from kvmppc_handle_exit to booke_interrupt.S
f1)  30.90% -  28.78%    -  30.16%     -  29.16% - 31.19%  restoring guest tlb
f2)   4.81% -   4.77%    -   5.06%     -   4.66% -  5.17%  restoring guest state ([s]regs)

We can all see the measurement inaccuracy, but the last column looks
good in the improved sections d2, d3 and d4. I'll remove this detailed
tracing soon and run a larger test, hoping that it will not show the
same inaccuracy.
But for now I still wonder about the ~14% for clear&deliver - it just
should not be "that" much. It seems worth looking into that section in
more detail first.

Christian Ehrhardt wrote:
> Hollis Blanchard wrote:
>> On Wed, 2008-10-08 at 15:49 +0200, Christian Ehrhardt wrote:
>>
>>> Wondering about that 30.5% for postprocessing and
>>> kvmppc_check_and_deliver_interrupts I quickly checked that in detail
>>> - part d is now divided into 4 subparts.
>>> I also looked at the return-to-guest path to see whether the
>>> expected part (restoring the tlb) is really the main time eater
>>> there. The result shows clearly that it is.
>>>
>>> more detailed breakdown:
>>> a)  10.94% - exit, saving guest state (booke_interrupt.S)
>>> b)   8.12% - reaching kvmppc_handle_exit
>>> c)   7.59% - syscall exit is checked and an interrupt is queued
>>>              using kvmppc_queue_exception
>>> d1)  3.33% - some checks for all exits
>>> d2)  8.29% - finding first bit in kvmppc_check_and_deliver_interrupts
>>> d3) 17.20% - can_deliver/clear&deliver exception in
>>>              kvmppc_check_and_deliver_interrupts
>>> d4)  4.47% - updating kvm_stat statistics
>>> e)   6.13% - returning from kvmppc_handle_exit to booke_interrupt.S
>>> f1) 29.18% - restoring guest tlb
>>> f2)  4.69% - restoring guest state ([s]regs)
>>>
>>> These fractions are % of our ~12µs syscall exit.
>>> => restoring the tlb on each reentry = 4µs constant overhead
>>> => looking a bit into irq delivery and other constant things like
>>>    kvm_stat updating
>>>
>> ...
>>
>>> Now I go for the TLB replacement in f1.
>>>
>> Hang on... does d3 make sense to you? It doesn't to me, and if
>> there's a bug there it will be easier to fix than rewriting the TLB
>> code. :)
>>
> I did not give up on improving that part too :-)
>> I think your core runs at 667MHz, right? So that's 1.5 ns/cycle.
>> 17.20% of 12µs is 2064ns, or about 1300 cycles. (Check my math.)
>>
> I get the same results. 1% ~ 80 cycles.
>> Now when I look at kvmppc_core_deliver_interrupts(), I'm not sure
>> where that time is going. We're assuming the find_first_bit() loop
>> usually executes once, for syscall. Does it actually execute more
>> than that? I don't expect any of kvmppc_can_deliver_interrupt(),
>> kvmppc_booke_clear_exception(), or kvmppc_booke_deliver_interrupt()
>> to take lots of time.
>>
> You can see below that I already had a more detailed breakdown in my
> old mail:
> [...]
> d2)  8.84% -  8.56% -  9.28% -  8.31%  finding first bit in
>      kvmppc_check_and_deliver_interrupts
> d3)  6.53% -  5.25% -  6.63% -  5.10%  can_deliver in
>      kvmppc_check_and_deliver_interrupts
> d4) 13.66% - 15.37% - 14.12% - 14.92%  clear&deliver exception in
>      kvmppc_check_and_deliver_interrupts
> [...]
>> Could it be cache effects? exception_priority[] and
>> priority_exception[] are 16 bytes each, and our L1 cacheline is 32
>> bytes, so they should both fit into one... except they're not
>> aligned.
>>
> I would be so happy if I had hardware performance counters for things
> like cache misses :-)
>> Also, it looks like we use the generic find_first_bit(). That may be
>> more expensive than we'd like. However, since
>> vcpu->arch.pending_exceptions is a single long (not an arbitrarily
>> sized bitfield), we should be able to use ffs() instead, which has an
>> optimized PowerPC implementation. That might help a lot.
>>
> Good idea. I'll check this and some other small improvements I have
> in mind.
>
>> We might even be able to replace find_next_bit() too, by shifting a
>> mask over each loop, but I don't think we'll have to, since I expect
>> the common case to be that we can deliver the first pending
>> exception. (Worth checking? :)
>>
> I'm not sure. It's certainly worth checking how often that second
> find_next_bit is called. If that number is too small, it's not worth
> it.

-- 

Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization