From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: "Andrew M. Theurer" <habanero@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>,
Thomas Gleixner <tglx@linutronix.de>,
Marcelo Tosatti <mtosatti@redhat.com>,
Ingo Molnar <mingo@redhat.com>, Avi Kivity <avi@redhat.com>,
Rik van Riel <riel@redhat.com>, S390 <linux-s390@vger.kernel.org>,
Carsten Otte <cotte@de.ibm.com>,
Christian Borntraeger <borntraeger@de.ibm.com>,
KVM <kvm@vger.kernel.org>, chegu vinod <chegu_vinod@hp.com>,
LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
Gleb Natapov <gleb@redhat.com>,
linux390@de.ibm.com,
Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
Joerg Roedel <joerg.roedel@amd.com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
Date: Tue, 10 Jul 2012 14:56:12 +0530
Message-ID: <4FFBF534.5040107@linux.vnet.ibm.com>
In-Reply-To: <1341870457.2909.27.camel@oc2024037011.ibm.com>
On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
>> Currently Pause Loop Exit (PLE) handler is doing directed yield to a
>> random VCPU on PL exit. Though we already have filtering while choosing
>> the candidate to yield_to, we can do better.
>
> Hi, Raghu.
Hi Andrew,
Thank you for your analysis and inputs.
>
>> Problem is, for large vcpu guests, we have more probability of yielding
>> to a bad vcpu. We are not able to prevent directed yield to same guy who
>> has done PL exit recently, who perhaps spins again and wastes CPU.
>>
>> Fix that by keeping track of who has done PL exit. So the algorithm in series
>> gives a chance to a VCPU which has:
>>
>> (a) Not done PLE exit at all (probably he is preempted lock-holder)
>>
>> (b) VCPU skipped in last iteration because it did PL exit, and probably
>> has become eligible now (next eligible lock holder)
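To make rules (a) and (b) above concrete, here is a minimal user-space sketch of the selection logic. It is only a model and not the code from the patches: the struct, its field names and the helper functions are all hypothetical, and it leaves out the point at which the PL-exit flag gets cleared again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vcpu state; names are illustrative, not the patch's. */
struct vcpu_model {
	bool pl_exited;    /* this vcpu has taken a pause-loop exit */
	bool skipped_once; /* it was passed over as a yield target last time */
};

/* Record that a vcpu took a PL exit (it was spinning). */
static void note_pl_exit(struct vcpu_model *v)
{
	v->pl_exited = true;
}

/*
 * Rule (a): a vcpu that never PL-exited is eligible (possibly a
 * preempted lock holder).
 * Rule (b): a vcpu that PL-exited is skipped once, then becomes
 * eligible on the next look.
 */
static bool eligible_for_directed_yield(struct vcpu_model *v)
{
	if (!v->pl_exited)
		return true;
	if (v->skipped_once) {
		v->skipped_once = false;
		return true;
	}
	v->skipped_once = true;
	return false;
}

/* Scan the other vcpus round-robin; return the first eligible index or -1. */
static int pick_yield_target(struct vcpu_model *vcpus, size_t n, size_t me)
{
	assert(vcpus != NULL && me < n);
	for (size_t i = 1; i < n; i++) {
		size_t cand = (me + i) % n;

		if (eligible_for_directed_yield(&vcpus[cand]))
			return (int)cand;
	}
	return -1;
}
```

With four vcpus where 0 and 1 have PL-exited, a scan on behalf of vcpu 0 skips vcpu 1 and picks vcpu 2; if vcpu 2 then PL-exits too, the next scan picks vcpu 1 because it was skipped the previous time.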
>>
>> Future enhancements:
>> (1) Currently we have a boolean to decide on eligibility of vcpu. It
>> would be nice if I get feedback on guest (>32 vcpu) whether we can
>> improve better with integer counter. (with counter = say f(log n )).
>>
>> (2) We have not considered system load during iteration of vcpu. With
>> that information we can limit the scan and also decide whether schedule()
>> is better. [ I am able to use #kicked vcpus to decide on this. But may
>> be there are better ideas like information from global loadavg.]
>>
>> (3) We can exploit this further with PV patches since it also knows about
>> next eligible lock-holder.
>>
>> Summary: There is a huge improvement for moderate / no overcommit scenario
>> for kvm based guest on PLE machine (which is difficult ;) ).
>>
>> Result:
>> Base : kernel 3.5.0-rc5 with Rik's Ple handler fix
>>
>> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM,
>> 32 core machine
>
> Is this with HT enabled, therefore 64 CPU threads?
No. HT disabled with 32 online CPUs
>
>> Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
>> with test kernels
>>
>> Guest: fedora 16 with 32 vcpus 8GB memory.
>
> Can you briefly explain the 1x and 2x configs? This of course is highly
> dependent whether or not HT is enabled...
1x config: kernbench/ebizzy/sysbench running on 1 guest (32 vcpus);
all the benchmarks have 2*#vcpu = 64 threads.
2x config: kernbench/ebizzy/sysbench running on 2 guests (each with 32
vcpus); all the benchmarks have 2*#vcpu = 64 threads.
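As a quick worked check of the naming, the "Nx" label can be read as total vcpus across guests divided by online host CPUs. The helper names below are made up for illustration, and the reading of 0.5x as 40 vcpus on 80 host threads is taken from the setup described in this thread.

```c
#include <assert.h>

/* Overcommit ratio: total vcpus across all guests / online host CPUs. */
static double overcommit_ratio(unsigned guests, unsigned vcpus_per_guest,
			       unsigned host_cpus)
{
	assert(host_cpus > 0);
	return (double)(guests * vcpus_per_guest) / host_cpus;
}

/* Each benchmark is run with 2 * #vcpu threads inside a guest. */
static unsigned benchmark_threads(unsigned vcpus)
{
	return 2 * vcpus;
}
```

So one 32-vcpu guest on 32 online CPUs is 1x, two such guests are 2x, and a single 40-vcpu guest on 80 host threads is 0.5x.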
>
> FWIW, I started testing what I would call "0.5x", where I have one 40
> vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
> enabled, no extra load on the system). For ebizzy, the results are
> quite erratic from run to run, so I am inclined to discard it as a
I will be posting the full run details (individual runs) in reply to this
mail since they are big. I have also posted stdev with the results; it has
not shown too much deviation.
> workload, but maybe I should try "1x" and "2x" cpu over-commit as well.
>
> From initial observations, at least for the ebizzy workload, the
> percentage of exits that result in a yield_to() are very low, around 1%,
> before these patches.
Hmm, OK.
IMO for an under-committed workload, a low percentage of yield_to was
probably expected, but I am not sure whether 1% is too low.
More importantly, the number of successful yield_to calls by itself can
never measure the benefit.
What I am trying to address with this patch is to ensure that a successful
yield_to results in a benefit.
> So, I am concerned that at least for this test,
> reducing that number even more has diminishing returns. I am however
> still concerned about the scalability problem with yield_to(),
So did you mean you expect to see more yield_to overhead with large
guests?
As already mentioned in future enhancements, the things I will be trying
in future are:
a. have a counter instead of a boolean for skipping yield_to
b. scan only about f(log(n)) vcpus for a yield candidate and then
schedule()/return depending on system load,
so that we reduce the overall vcpu iteration in the PLE handler from
O(n * n) to O(n log n).
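To put rough numbers on that bound: f() is not pinned down anywhere in this mail, so the sketch below simply assumes f(log n) = ceil(log2(n)) and counts candidates examined in the worst case where all n vcpus are spinning. The function names are hypothetical.

```c
#include <assert.h>

/* ceil(log2(n)) for 1 <= n <= 2^30, using integer arithmetic only. */
static unsigned ceil_log2(unsigned n)
{
	unsigned bits = 0;

	assert(n > 0 && n <= (1u << 30));
	while ((1u << bits) < n)
		bits++;
	return bits;
}

/* Worst case today: each of n spinning vcpus scans up to n candidates. */
static unsigned long full_scan_cost(unsigned n)
{
	return (unsigned long)n * n;
}

/* Assumed bounded scan: each spinner looks at only ceil(log2(n)) candidates. */
static unsigned long bounded_scan_cost(unsigned n)
{
	return (unsigned long)n * ceil_log2(n);
}
```

For a 32-vcpu guest this works out to 32 * 32 = 1024 candidate checks against 32 * 5 = 160 with the bounded scan.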
> which
> shows like this for me (perf):
>
>> 63.56%  282095  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>>  5.42%   24420  qemu-kvm  [kvm]              [k] kvm_vcpu_yield_to
>>  5.33%   26481  qemu-kvm  [kernel.kallsyms]  [k] get_pid_task
>>  4.35%   20049  qemu-kvm  [kernel.kallsyms]  [k] yield_to
>>  2.74%   15652  qemu-kvm  [kvm]              [k] kvm_apic_present
>>  1.70%    8657  qemu-kvm  [kvm]              [k] kvm_vcpu_on_spin
>>  1.45%    7889  qemu-kvm  [kvm]              [k] vcpu_enter_guest
>
> For the cpu threads in the host that are actually active (in this case
> 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This
> is for a no-IO workload, so that's just incredible to see so much cpu
> wasted. I feel that 2 important areas to tackle are a more scalable
> yield_to() and reducing the number of pause exits itself (hopefully by
> just tuning ple_window for the latter).
I think this is a concern and as you stated I agree that tuning
ple_window helps here.
>
> Honestly, I am not confident addressing this problem will improve the
> ebizzy score. That workload is so erratic for me, that I do not trust
> the results at all. I have however seen consistent improvements in
> disabling PLE for a http guest workload and a very high IOPS guest
> workload, both with much time spent in host in the double runqueue lock
> for yield_to(), so that's why I still gravitate toward that issue.
The problem starts (with PLE disabled) when we have a workload just > 1x. We
start burning a lot of cpu.
IIRC, in 2x overcommit, a kernel compilation that takes 10hr on non-PLE
used to take just 1hr after pv patches (and should be the same with PLE enabled).
Leaving aside the PLE disabled case, I do not expect any degradation even in
the 0.5x scenario, though you say results are erratic.
Could you please let me know whether, with PLE enabled, you saw any
degradation for 0.5x before and after the patch?
> -Andrew Theurer
>
>