From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: habanero@linux.vnet.ibm.com
Cc: Avi Kivity <avi@redhat.com>,
Marcelo Tosatti <mtosatti@redhat.com>,
Ingo Molnar <mingo@redhat.com>, Rik van Riel <riel@redhat.com>,
Srikar <srikar@linux.vnet.ibm.com>, KVM <kvm@vger.kernel.org>,
chegu vinod <chegu_vinod@hp.com>,
LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
Gleb Natapov <gleb@redhat.com>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
Peter Zijlstra <peterz@infradead.org>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
Date: Fri, 07 Sep 2012 23:36:40 +0530
Message-ID: <504A37B0.7020605@linux.vnet.ibm.com>
In-Reply-To: <1347023509.10325.53.camel@oc6622382223.ibm.com>
CCing PeterZ also.
On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> I have noticed recently that PLE/yield_to() is still not that scalable
> for really large guests, sometimes even with no CPU over-commit. I have
> a small change that makes a very big difference.
>
> First, let me explain what I saw:
>
> Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
> thread Westmere-EX system: 645 seconds!
>
> Host cpu: ~98% in kernel, nearly all of it in spin_lock from the
> double runqueue lock taken in yield_to()
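
(For context: double_rq_lock() in kernel/sched/core.c looks roughly
like the below -- paraphrased from memory, not a verbatim copy -- so
every yield_to() attempt serializes on two rq->lock spinlocks at once:

	static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
	{
		if (rq1 == rq2) {
			raw_spin_lock(&rq1->lock);
		} else if (rq1 < rq2) {
			/* take the lower-addressed lock first: avoids ABBA deadlock */
			raw_spin_lock(&rq1->lock);
			raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
		} else {
			raw_spin_lock(&rq2->lock);
			raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
		}
	}

With 80 vcpus PLE-exiting at around the same time, they all funnel
into the same small set of rq locks at once.)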
>
> So, I added some schedstats to yield_to(), one to count when we failed
> this test in yield_to()
>
> if (task_running(p_rq, p) || p->state)
>
> and one when we pass all the conditions and get to actually yield:
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>
>
> And during boot up of this guest, I saw:
>
>
> failed yield_to() because task is running: 8368810426
> successful yield_to(): 13077658 (0.156022% of yield_to() calls,
> i.e. about 1 out of every 640)
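
Side note: the schedstat changes themselves are not in the patch
below. A minimal sketch of how such counters could be wired up, where
the per-rq fields yld_to_task_running and yld_to_yielded are
hypothetical names invented here purely for illustration:

	/* in yield_to(), with two hypothetical per-rq schedstat fields */
	if (task_running(p_rq, p) || p->state) {
		schedstat_inc(rq, yld_to_task_running);	/* failed attempt */
		goto out;
	}

	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
	if (yielded)
		schedstat_inc(rq, yld_to_yielded);	/* successful yield */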
>
> Obviously, we have a problem. Every exit causes a loop over 80 vcpus,
> each one trying to get two locks. This is happening on all [but one]
> vcpus at around the same time. Not going to work well.
>
True and interesting. I had once thought of reducing the overall O(n^2)
iterations to O(n log(n)) by shrinking the number of candidates searched
from the current O(n) down to O(log(n)). Maybe I have to get back to my
experiments.
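
Something along these lines, perhaps (a rough, untested sketch;
ple_directed_yield() is a made-up name, while kvm->online_vcpus,
kvm->last_boosted_vcpu, kvm_get_vcpu() and kvm_vcpu_yield_to() are the
existing pieces I would build on):

	/* Probe only ~log2(n) candidates per PLE exit instead of all n. */
	static void ple_directed_yield(struct kvm *kvm, struct kvm_vcpu *me)
	{
		int n = atomic_read(&kvm->online_vcpus);
		int tries = ilog2(n) + 1;	/* O(log(n)) candidates */
		int i = kvm->last_boosted_vcpu;	/* continue round-robin */

		while (tries--) {
			struct kvm_vcpu *vcpu;

			i = (i + 1) % n;
			vcpu = kvm_get_vcpu(kvm, i);
			if (!vcpu || vcpu == me)
				continue;
			if (kvm_vcpu_yield_to(vcpu)) {
				kvm->last_boosted_vcpu = i;
				break;
			}
		}
	}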
> So, since the check for a running task is nearly always true, I moved
> that -before- the double runqueue lock, so 99.84% of the attempts do not
> take the locks. Now, I do not know if this [not taking the locks] is a
> problem. However, I'd rather have a slightly inaccurate test for a
> running vcpu than burn 98% of CPU in the host kernel. With the change
> the VM boot time went down to 100 seconds, an 85% reduction.
>
> I also wanted to check to see this did not affect truly over-committed
> situations, so I first started with smaller VMs at 2x cpu over-commit:
>
> 16 VMs, 8-way each, all running dbench (2x cpu over-commit)
> throughput +/- stddev
> ----- -----
> ple off: 2281 +/- 7.32% (really bad as expected)
> ple on: 19796 +/- 1.36%
> ple on: w/fix: 19796 +/- 1.37% (no degradation at all)
>
> In this case the VMs are small enough that we do not loop through
> enough vcpus to trigger the problem. Host CPU is very low (3-4% range)
> both for default ple and with the yield_to() fix.
>
> So I went on to a bigger VM:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
> throughput +/- stddev
> ----- -----
> ple on: 2552 +/- .70%
> ple on: w/fix: 4621 +/- 2.12% (81% improvement!)
>
> This is where we start seeing a major difference. Without the fix, host
> cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and
> guest went from 30 to 40%). I believe this is on the right track to
> reduce the spin lock contention, still get a proper directed yield, and
> therefore improve the CPU available to the guest and its performance.
>
> However, we still have lock contention, and I think we can reduce it
> even more. We have eliminated some attempts at the double runqueue lock
> acquire, because the check for whether the target vcpu is running now
> happens before the lock. However, even if the target-to-yield-to vcpu
> [belonging to the same guest that PLE-exited] is not running, the
> physical processor/runqueue that the target-to-yield-to vcpu sits on
> could be running a different VM's vcpu -and- going through a directed
> yield itself, so that runqueue lock may already be acquired. We do not
> want to just spin and wait; we want to move on to the next candidate
> vcpu. We need a check to see if that processor/runqueue is already in a
> directed yield. Or perhaps we just check whether that cpu is in guest
> mode, and if not, skip the yield attempt for that vcpu and move to the
> next candidate. So, my question is: given a runqueue, what's the best
> way to check whether the corresponding phys cpu is not in guest mode?
>
We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin() path. Doesn't that suffice?
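
From memory, the check in kvm_vcpu_yield_to() (virt/kvm/kvm_main.c)
goes roughly like this -- paraphrased, not a verbatim copy:

	bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
	{
		struct pid *pid;
		struct task_struct *task = NULL;

		rcu_read_lock();
		pid = rcu_dereference(target->pid);
		if (pid)
			task = get_pid_task(pid, PIDTYPE_PID);
		rcu_read_unlock();
		if (!task)
			return false;
		if (task->flags & PF_VCPU) {
			/* target is running guest code right now; skip it */
			put_task_struct(task);
			return false;
		}
		if (yield_to(task, 1)) {
			put_task_struct(task);
			return true;
		}
		put_task_struct(task);
		return false;
	}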
> Here are the changes so far (schedstat changes not included here):
>
> Signed-off-by: Andrew Theurer <habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..f8eff8c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
> again:
> p_rq = task_rq(p);
> + if (task_running(p_rq, p) || p->state) {
> + goto out_no_unlock;
> + }
> double_rq_lock(rq, p_rq);
> while (task_rq(p) != p_rq) {
> double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
> if (curr->sched_class != p->sched_class)
> goto out;
>
> - if (task_running(p_rq, p) || p->state)
> - goto out;
>
> yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
>
> out:
> double_rq_unlock(rq, p_rq);
> +out_no_unlock:
> local_irq_restore(flags);
>
> if (yielded)
>