Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

From: Avi Kivity <avi@redhat.com>
To: habanero@linux.vnet.ibm.com
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	Ingo Molnar <mingo@redhat.com>, Rik van Riel <riel@redhat.com>,
	KVM <kvm@vger.kernel.org>, chegu vinod <chegu_vinod@hp.com>,
	LKML <linux-kernel@vger.kernel.org>, X86 <x86@kernel.org>,
	Gleb Natapov <gleb@redhat.com>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
Date: Wed, 19 Sep 2012 16:39:29 +0300
Message-ID: <5059CB11.2000804@redhat.com>
In-Reply-To: <1347937398.10325.190.camel@oc6622382223.ibm.com>

On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>> 
>> > The concern I have is that even though we have gone through changes to
>> > help reduce the candidate vcpus we yield to, we still have a very poor
>> > idea of which vcpu really needs to run.  The result is high cpu usage in
>> > the get_pid_task and still some contention in the double runqueue lock.
>> > To make this scalable, we either need to significantly reduce the
>> > occurrence of the lock-holder preemption, or do a much better job of
>> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
>> > which do not need to run).
>> > 
>> > On reducing the occurrence:  The worst case for lock-holder preemption
>> > is having vcpus of same VM on the same runqueue.  This guarantees the
>> > situation of 1 vcpu running while another [of the same VM] is not.  To
>> > prove the point, I ran the same test, but with vcpus restricted to a
>> > range of host cpus, such that any single VM's vcpus can never be on the
>> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
>> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
>> > 
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
>> > 
>> > On picking a better vcpu to yield to:  I really hesitate to rely on
>> > paravirt hint [telling us which vcpu is holding a lock], but I am not
>> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
>> > are yielding to way more vcpus than are preempted lock-holders, and that
>> > IMO is just work accomplishing nothing.  Trying to think of a way to
>> > further reduce candidate vcpus....
>> 
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in pause loop itself) and
>> the yielding vcpu gets put to sleep for a while, so it doesn't spend
>> cycles spinning.  While we haven't fixed the problem at least the guest
>> is accomplishing work, and meanwhile the real lock holder may get
>> naturally scheduled and clear the lock.
> 
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful.  I was thinking of the worst case scenario, where any other
> vcpu would likely spin as well, and the host side cpu-time for switching
> vcpu threads was not all that productive.  Well, I suppose it does help
> eliminate potential lock holding vcpus; it just seems to be not that
> efficient or fast enough.

If we have N-1 vcpus spin-waiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right.  Eventually
its not-a-timeslice will expire and we go through this again.  If
N*t_yield is comparable to the timeslice, we start losing efficiency.
Because of lock contention, t_yield can scale with the number of host
cpus.  So in this worst case, we get quadratic behaviour.
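
To put rough numbers on the worst case above, here is a toy cost model.
It is only a sketch: it assumes t_yield grows linearly with the number
of host cpus, and that a spinner yields to all N-1 other vcpus before it
reaches the lock holder.  The function names and the unit cost are made
up for illustration, they are not measurements or kernel code.

```python
def t_yield(host_cpus, base_cost=1):
    """Assumed per-yield cost: linear in host cpus (runqueue lock contention)."""
    return base_cost * host_cpus


def worst_case_scan_cost(n_vcpus, host_cpus, base_cost=1):
    """Cost of yielding to all N-1 other vcpus before reaching the lock holder."""
    return (n_vcpus - 1) * t_yield(host_cpus, base_cost)


if __name__ == "__main__":
    # Scale guest size and host size together (hypothetical sizes).
    for n in (10, 20, 40, 80):
        print(n, worst_case_scan_cost(n_vcpus=n, host_cpus=n))
```

Under these assumptions, doubling both the vcpu count and the host cpu
count roughly quadruples the scan cost, which is the quadratic behaviour.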

One way out is to increase the not-a-timeslice.  Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?

That's not guaranteed to help: if we boost a running vcpu too much, it
will skew how vcpu runtime is distributed even after the lock is released.

> 
>> The main problem with this theory is that the experiments don't seem to
>> bear it out.
> 
> Granted, my test case is quite brutal.  It's nothing but over-committed
> VMs which always have some spin lock activity.  However, we really
> should try to fix the worst case scenario.

Yes.  And other guests may not scale as well as Linux, so they may show
this behaviour more often.

> 
>>   So maybe one of the assumptions is wrong - the yielding
>> vcpu gets scheduled early.  That could be the case if the two vcpus are
>> on different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself.  Is this
>> possible with the current code?
>> 
>> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
>> and only fall back to remote vcpus when we see it didn't help.
>> 
>> Let's examine a few cases:
>> 
>> 1. spinner on cpu 0, lock holder on cpu 0
>> 
>> win!
>> 
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>> 
>> Spinner gets put to sleep, random vcpus get to work, low lock contention
>> (no double_rq_lock), by the time spinner gets scheduled we might have won
>> 
>> 3. spinner on cpu 0, another spinner on cpu 0
>> 
>> Worst case, we'll just spin some more.  Need to detect this case and
>> migrate something in.
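
The three cases above can be sketched as a candidate ordering: try
vcpus on the spinner's own runqueue first, keep remote ones as the
fallback, and flag case 3 when everything else on the local runqueue is
spinning too.  This is an illustrative model only.  The Vcpu fields and
both helper names are hypothetical, and skipping running or spinning
vcpus is an assumption, not what the current code does.

```python
from dataclasses import dataclass


@dataclass
class Vcpu:
    vcpu_id: int
    cpu: int        # host cpu (runqueue) the vcpu thread sits on
    running: bool   # currently executing on that cpu
    spinning: bool  # recently took a pause-loop exit


def yield_candidates(me, vcpus):
    """Order yield_to targets: same runqueue first, remote as fallback."""
    eligible = [v for v in vcpus
                if v.vcpu_id != me.vcpu_id and not v.running and not v.spinning]
    local = [v for v in eligible if v.cpu == me.cpu]
    remote = [v for v in eligible if v.cpu != me.cpu]
    return local + remote


def needs_migration(me, vcpus):
    """Case 3: every other vcpu on our runqueue is also spinning."""
    others = [v for v in vcpus if v.vcpu_id != me.vcpu_id and v.cpu == me.cpu]
    return bool(others) and all(v.spinning for v in others)
```

Local targets come first because yielding to them avoids the
double_rq_lock path mentioned in case 2.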
> 
> Well, we can certainly experiment and see what we get.
> 
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding cpu -quickly-.  What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu.  I guess I need to find a faster way to get there.

pvspinlocks will find the right one, every time.  Otherwise I see no way
to do this.

-- 
error compiling committee.c: too many arguments to function

Thread overview: 41+ messages
2012-07-18 13:37 [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 1/3] kvm/config: Add config to support ple or cpu relax optimzation Raghavendra K T
2012-07-18 13:37 ` [PATCH RFC V5 2/3] kvm: Note down when cpu relax intercepted or pause loop exited Raghavendra K T
2012-07-18 13:38 ` [PATCH RFC V5 3/3] kvm: Choose better candidate for directed yield Raghavendra K T
2012-07-18 14:39   ` Raghavendra K T
2012-07-19  9:47     ` [RESEND PATCH " Raghavendra K T
2012-07-20 17:36 ` [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler Marcelo Tosatti
2012-07-22 12:34   ` Raghavendra K T
2012-07-22 12:43     ` Avi Kivity
2012-07-23  7:35       ` Christian Borntraeger
2012-07-22 17:58     ` Rik van Riel
2012-07-23 10:03 ` Avi Kivity
2012-09-07 13:11   ` [RFC][PATCH] Improving directed yield scalability for " Andrew Theurer
2012-09-07 18:06     ` Raghavendra K T
2012-09-07 19:42       ` Andrew Theurer
2012-09-08  8:43         ` Srikar Dronamraju
2012-09-10 13:16           ` Andrew Theurer
2012-09-10 16:03             ` Peter Zijlstra
2012-09-10 16:56               ` Srikar Dronamraju
2012-09-10 17:12                 ` Peter Zijlstra
2012-09-10 19:10                   ` Raghavendra K T
2012-09-10 20:12                   ` Andrew Theurer
2012-09-10 20:19                     ` Peter Zijlstra
2012-09-10 20:31                       ` Rik van Riel
2012-09-11  6:08                     ` Raghavendra K T
2012-09-11 12:48                       ` Andrew Theurer
2012-09-11 18:27                       ` Andrew Theurer
2012-09-13 11:48                         ` Raghavendra K T
2012-09-13 21:30                           ` Andrew Theurer
2012-09-14 17:10                             ` Andrew Jones
2012-09-15 16:08                               ` Raghavendra K T
2012-09-17 13:48                                 ` Andrew Jones
2012-09-14 20:34                             ` Konrad Rzeszutek Wilk
2012-09-17  8:02                               ` Andrew Jones
2012-09-16  8:55                             ` Avi Kivity
2012-09-17  8:10                               ` Andrew Jones
2012-09-18  3:03                               ` Andrew Theurer
2012-09-19 13:39                                 ` Avi Kivity [this message]
2012-09-13 12:13                         ` Avi Kivity
2012-09-11  7:04                   ` Srikar Dronamraju
2012-09-10 14:43         ` Raghavendra K T
