Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: habanero@linux.vnet.ibm.com
Cc: Avi Kivity <avi@redhat.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Ingo Molnar <mingo@redhat.com>, Rik van Riel <riel@redhat.com>,
	Srikar <srikar@linux.vnet.ibm.com>, KVM <kvm@vger.kernel.org>,
	chegu vinod <chegu_vinod@hp.com>,
	LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
	Gleb Natapov <gleb@redhat.com>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
Date: Fri, 07 Sep 2012 23:36:40 +0530	[thread overview]
Message-ID: <504A37B0.7020605@linux.vnet.ibm.com> (raw)
In-Reply-To: <1347023509.10325.53.camel@oc6622382223.ibm.com>

CCing PeterZ also.

On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> I have noticed recently that PLE/yield_to() is still not that scalable
> for really large guests, sometimes even with no CPU over-commit.  I have
> a small change that make a very big difference.
>
> First, let me explain what I saw:
>
> Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
> thread Westmere-EX system:  645 seconds!
>
> Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
> runqueue lock for yield_to()
>
> So, I added some schedstats to yield_to(), one to count when we failed
> this test in yield_to()
>
>      if (task_running(p_rq, p) || p->state)
>
> and one when we pass all the conditions and get to actually yield:
>
>       yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>
>
> And during boot up of this guest, I saw:
>
>
> failed yield_to() because task is running: 8368810426
> successful yield_to(): 13077658
>                        0.156022% of yield_to calls
>                        1 out of 640 yield_to calls
>
> Obviously, we have a problem.  Every exit causes a loop over 80 vcpus,
> each one trying to get two locks.  This is happening on all [but one]
> vcpus at around the same time.  Not going to work well.
>

True and interesting. I had once thought of reducing overall O(n^2)
iteration to O(n log(n)) iterations by reducing number of candidates
to search to O(log(n)) instead of current O(n). May be I have to get 
back to my experiment modes.

> So, since the check for a running task is nearly always true, I moved
> that -before- the double runqueue lock, so 99.84% of the attempts do not
> take the locks.  Now, I do not know is this [not getting the locks] is a
> problem.  However, I'd rather have a little inaccurate test for a
> running vcpu than burning 98% of CPU in host kernel.  With the change
> the VM boot time went to:  100 seconds, an 85% reduction in time.
>
> I also wanted to check to see this did not affect truly over-committed
> situations, so I first started with smaller VMs at 2x cpu over-commit:
>
> 16 VMs, 8-way each, all running dbench (2x cpu over-commmit)
>             throughput +/- stddev
>                 -----     -----
> ple off:        2281 +/- 7.32%  (really bad as expected)
> ple on:        19796 +/- 1.36%
> ple on: w/fix: 19796 +/- 1.37%  (no degrade at all)
>
> In this case the VMs are small enough, that we do not loop through
> enough vcpus to trigger the problem.  host CPU is very low (3-4% range)
> for both default ple and with yield_to() fix.
>
> So I went on to a bigger VM:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
>             throughput +/- stddev
>                 -----     -----
> ple on:         2552 +/- .70%
> ple on: w/fix:  4621 +/- 2.12%  (81% improvement!)
>
> This is where we start seeing a major difference.  Without the fix, host
> cpu was around 70%, mostly in spin_lock.  That was reduced to 60% (and
> guest went from 30 to 40%).  I believe this is on the right track to
> reduce the spin lock contention, still get proper directed yield, and
> therefore improve the guest CPU available and its performance.
>
> However, we still have lock contention, and I think we can reduce it
> even more.  We have eliminated some attempts at double runqueue lock
> acquire because the check for the target vcpu is running is now before
> the lock.  However, even if the target-to-yield-to vcpu [for the same
> guest upon we PLE exited] is not running, the physical
> processor/runqueue that target-to-yield-to vcpu is located on could be
> running a different VM's vcpu -and- going through a directed yield,
> therefore that run queue lock may already acquired.  We do not want to
> just spin and wait, we want to move to the next candidate vcpu.  We need
> a check to see if the smp processor/runqueue is already in a directed
> yield.  Or, perhaps we just check if that cpu is not in guest mode, and
> if so, we skip that yield attempt for that vcpu and move to the next
> candidate vcpu.  So, my question is:  given a runqueue, what's the best
> way to check if that corresponding phys cpu is not in guest mode?
>

We are indeed avoiding CPUS in guest mode when we check
task->flags & PF_VCPU in vcpu_on_spin path.  Doesn't that suffice?

> Here's the changes so far (schedstat changes not included here):
>
> signed-off-by:  Andrew Theurer<habanero@linux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..f8eff8c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
> preempt)
>
>   again:
>   	p_rq = task_rq(p);
> +	if (task_running(p_rq, p) || p->state) {
> +		goto out_no_unlock;
> +	}
>   	double_rq_lock(rq, p_rq);
>   	while (task_rq(p) != p_rq) {
>   		double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
>   	if (curr->sched_class != p->sched_class)
>   		goto out;
>
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
>
>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>   	if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
>
>   out:
>   	double_rq_unlock(rq, p_rq);
> +out_no_unlock:
>   	local_irq_restore(flags);
>
>   	if (yielded)
>

next prev parent reply	other threads:[~2012-09-07 18:06 UTC|newest]

Thread overview: 41+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
2012-07-18 14:39   ` Raghavendra K T
2012-07-19  9:47     ` [RESEND PATCH " Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-22 12:34   ` Raghavendra K T
2012-07-22 12:43     ` Avi Kivity
2012-07-23  7:35       ` Christian Borntraeger
2012-07-22 17:58     ` Rik van Riel
2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11   ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
2012-09-07 18:06     ` Raghavendra K T [this message]
2012-09-07 19:42       ` Andrew Theurer
2012-09-08  8:43         ` Srikar Dronamraju
2012-09-10 13:16           ` Andrew Theurer
2012-09-10 16:03             ` Peter Zijlstra
2012-09-10 16:56               ` Srikar Dronamraju
2012-09-10 17:12                 ` Peter Zijlstra
2012-09-10 19:10                   ` Raghavendra K T
2012-09-10 20:12                   ` Andrew Theurer
2012-09-10 20:19                     ` Peter Zijlstra
2012-09-10 20:31                       ` Rik van Riel
2012-09-11  6:08                     ` Raghavendra K T
2012-09-11 12:48                       ` Andrew Theurer
2012-09-11 18:27                       ` Andrew Theurer
2012-09-13 11:48                         ` Raghavendra K T
2012-09-13 21:30                           ` Andrew Theurer
2012-09-14 17:10                             ` Andrew Jones
2012-09-15 16:08                               ` Raghavendra K T
2012-09-17 13:48                                 ` Andrew Jones
2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
2012-09-17  8:02                               ` Andrew Jones
2012-09-16  8:55                             ` Avi Kivity
2012-09-17  8:10                               ` Andrew Jones
2012-09-18  3:03                               ` Andrew Theurer
2012-09-19 13:39                                 ` Avi Kivity
2012-09-13 12:13                         ` Avi Kivity
2012-09-11  7:04                   ` Srikar Dronamraju
2012-09-10 14:43         ` Raghavendra K T

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=504A37B0.7020605@linux.vnet.ibm.com \
    --to=raghavendra.kt@linux.vnet.ibm.com \
    --cc=avi@redhat.com \
    --cc=chegu_vinod@hp.com \
    --cc=gleb@redhat.com \
    --cc=habanero@linux.vnet.ibm.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=mtosatti@redhat.com \
    --cc=peterz@infradead.org \
    --cc=riel@redhat.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=srivatsa.vaddagiri@gmail.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.