From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755282Ab2IGSKN (ORCPT );
	Fri, 7 Sep 2012 14:10:13 -0400
Received: from e23smtp05.au.ibm.com ([202.81.31.147]:40699 "EHLO
	e23smtp05.au.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752803Ab2IGSKH (ORCPT );
	Fri, 7 Sep 2012 14:10:07 -0400
Message-ID: <504A37B0.7020605@linux.vnet.ibm.com>
Date: Fri, 07 Sep 2012 23:36:40 +0530
From: Raghavendra K T
Organization: IBM
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20120216
	Thunderbird/10.0.1
MIME-Version: 1.0
To: habanero@linux.vnet.ibm.com
CC: Avi Kivity, Marcelo Tosatti, Ingo Molnar, Rik van Riel, Srikar,
	KVM, chegu vinod, LKML, X86, Gleb Natapov, Srivatsa Vaddagiri,
	Peter Zijlstra
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
References: <20120718133717.5321.71347.sendpatchset@codeblue.in.ibm.com>
	<500D2162.8010209@redhat.com>
	<1347023509.10325.53.camel@oc6622382223.ibm.com>
In-Reply-To: <1347023509.10325.53.camel@oc6622382223.ibm.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
x-cbid: 12090718-1396-0000-0000-000001D71CA6
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

CCing PeterZ also.

On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> I have noticed recently that PLE/yield_to() is still not that scalable
> for really large guests, sometimes even with no CPU over-commit. I have
> a small change that makes a very big difference.
>
> First, let me explain what I saw:
>
> Time to boot a 3.6-rc kernel in an 80-way VM on a 4-socket, 40-core,
> 80-thread Westmere-EX system: 645 seconds!
>
> Host cpu: ~98% in kernel, nearly all of it in spin_lock from the
> double runqueue lock for yield_to().
>
> So, I added some schedstats to yield_to(): one to count when we fail
> this test in yield_to():
>
> 	if (task_running(p_rq, p) || p->state)
>
> and one when we pass all the conditions and get to actually yield:
>
> 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>
> And during boot-up of this guest, I saw:
>
> 	failed yield_to() because task is running:  8368810426
> 	successful yield_to():                        13077658
>
> That is 0.156022% of yield_to() calls, about 1 out of every 640.
>
> Obviously, we have a problem. Every exit causes a loop over 80 vcpus,
> each one trying to get two locks. This is happening on all [but one]
> vcpus at around the same time. Not going to work well.

True and interesting. I had once thought of reducing the overall O(n^2)
iteration to O(n log(n)) by cutting the number of candidates searched
per exit from the current O(n) to O(log(n)). Maybe I have to get back
to my experiment modes. (The loop in question is sketched below for
reference.)

> So, since the check for a running task is nearly always true, I moved
> that -before- the double runqueue lock, so 99.84% of the attempts do
> not take the locks. Now, I do not know if this [not getting the locks]
> is a problem. However, I'd rather have a slightly inaccurate test for
> a running vcpu than burn 98% of host CPU in the kernel. With the
> change, the VM boot time went to 100 seconds, an 85% reduction.
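For reference, the loop we are both talking about is the candidate scan
in kvm_vcpu_on_spin() (virt/kvm/kvm_main.c). Roughly, as a simplified
sketch from memory rather than the exact upstream code:

/*
 * Simplified sketch of the PLE handler's directed-yield scan.  On
 * every PLE exit the spinning vcpu walks all N vcpus of its guest,
 * and each kvm_vcpu_yield_to() attempt ends in yield_to(), which
 * takes two runqueue locks.
 */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	int last_boosted_vcpu = kvm->last_boosted_vcpu;
	struct kvm_vcpu *vcpu;
	int yielded = 0;
	int pass, i;

	/*
	 * Two passes: first scan the vcpus after the one boosted last
	 * time, then wrap around and scan the rest.
	 */
	for (pass = 0; pass < 2 && !yielded; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!pass && i <= last_boosted_vcpu) {
				i = last_boosted_vcpu;
				continue;
			} else if (pass && i > last_boosted_vcpu)
				break;
			if (vcpu == me)
				continue;
			if (waitqueue_active(&vcpu->wq))
				continue;	/* halted, not spinning */
			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
				continue;	/* dy_eligible heuristic */
			if (kvm_vcpu_yield_to(vcpu)) {
				kvm->last_boosted_vcpu = i;
				yielded = 1;
				break;
			}
		}
	}
}

So with an 80-way guest, each exiting vcpu can attempt up to ~80 double
runqueue lock acquisitions per scan, and your schedstats show almost
all of those attempts fail the task_running() test anyway.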
> I also wanted to check that this did not affect truly over-committed
> situations, so I first started with smaller VMs at 2x cpu over-commit:
>
> 16 VMs, 8-way each, all running dbench (2x cpu over-commit)
>
>                  throughput  +/- stddev
>                  ----------  ----------
> ple off:               2281  +/- 7.32%   (really bad, as expected)
> ple on:               19796  +/- 1.36%
> ple on, w/ fix:       19796  +/- 1.37%   (no degradation at all)
>
> In this case the VMs are small enough that we do not loop through
> enough vcpus to trigger the problem. Host CPU is very low (3-4% range)
> for both default ple and ple with the yield_to() fix.
>
> So I went on to a bigger VM:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
>
>                  throughput  +/- stddev
>                  ----------  ----------
> ple on:                2552  +/- 0.70%
> ple on, w/ fix:        4621  +/- 2.12%   (81% improvement!)
>
> This is where we start seeing a major difference. Without the fix,
> host cpu was around 70%, mostly in spin_lock. That was reduced to 60%
> (and guest time went from 30% to 40%). I believe this is on the right
> track to reduce the spin_lock contention, still get a proper directed
> yield, and therefore improve the CPU available to the guest and its
> performance.
>
> However, we still have lock contention, and I think we can reduce it
> even more. We have eliminated some attempts at the double runqueue
> lock because the check for a running target vcpu now comes before the
> lock. However, even if the target-to-yield-to vcpu [for the same guest
> on which we PLE-exited] is not running, the physical
> processor/runqueue that the target vcpu is located on could be running
> a different VM's vcpu -and- going through a directed yield itself, so
> that runqueue lock may already be acquired. We do not want to just
> spin and wait; we want to move on to the next candidate vcpu. We need
> a check to see if the smp processor/runqueue is already in a directed
> yield. Or, perhaps we just check if that cpu is not in guest mode, and
> if so, we skip that yield attempt for that vcpu and move on to the
> next candidate vcpu. So, my question is: given a runqueue, what is the
> best way to check whether the corresponding physical cpu is not in
> guest mode?

We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin() path. Doesn't that suffice?
(A sketch of where that check sits is at the end of this mail.)

> Here are the changes so far (schedstat changes not included here):
>
> Signed-off-by: Andrew Theurer
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..f8eff8c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>  
>  again:
>  	p_rq = task_rq(p);
> +	if (task_running(p_rq, p) || p->state) {
> +		goto out_no_unlock;
> +	}
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
>  	if (curr->sched_class != p->sched_class)
>  		goto out;
>  
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
>  
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
>  
>  out:
>  	double_rq_unlock(rq, p_rq);
> +out_no_unlock:
>  	local_irq_restore(flags);
>  
>  	if (yielded)
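To be concrete about the PF_VCPU point: the check sits in
kvm_vcpu_yield_to() (virt/kvm/kvm_main.c), before we ever reach the
yield_to() that your patch modifies. Again as a simplified sketch from
memory, not the exact upstream code:

/*
 * Simplified sketch of kvm_vcpu_yield_to().  PF_VCPU is set on a
 * vcpu task only while it is executing guest code, so a target with
 * PF_VCPU set is in guest mode and gets skipped here, before any
 * runqueue lock is taken.
 */
bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
	struct task_struct *task = NULL;
	struct pid *pid;

	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);
	rcu_read_unlock();
	if (!task)
		return false;
	if (task->flags & PF_VCPU) {
		/* candidate is in guest mode: not a useful target */
		put_task_struct(task);
		return false;
	}
	if (yield_to(task, 1)) {
		put_task_struct(task);
		return true;
	}
	put_task_struct(task);
	return false;
}

Note that this tests the candidate vcpu task itself, and it does so
without holding any runqueue lock.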