From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
To: habanero@linux.vnet.ibm.com, Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
"H. Peter Anvin" <hpa@zytor.com>,
Marcelo Tosatti <mtosatti@redhat.com>,
Ingo Molnar <mingo@redhat.com>, Rik van Riel <riel@redhat.com>,
Srikar <srikar@linux.vnet.ibm.com>,
"Nikunj A. Dadhania" <nikunj@linux.vnet.ibm.com>,
KVM <kvm@vger.kernel.org>, Jiannan Ouyang <ouyang@cs.pitt.edu>,
Chegu Vinod <chegu_vinod@hp.com>,
LKML <linux-kernel@vger.kernel.org>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@gmail.com>,
Gleb Natapov <gleb@redhat.com>, Andrew Jones <drjones@redhat.com>
Subject: Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
Date: Wed, 31 Oct 2012 12:06:34 +0530
Message-ID: <5090C6F2.5030103@linux.vnet.ibm.com>
In-Reply-To: <1351599420.23105.14.camel@oc6622382223.ibm.com>
On 10/30/2012 05:47 PM, Andrew Theurer wrote:
> On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
>> In some special scenarios like #vcpu <= #pcpu, the PLE handler may
>> prove very costly: there is no need to iterate over vcpus there, and
>> the unsuccessful yield_to attempts just burn CPU.
>>
>> Similarly, when we have a large number of small guests, it is
>> possible that a spinning vcpu fails to yield_to any vcpu of the same
>> VM and goes back to spinning. This is also ineffective when we are
>> over-committed. Instead, we do a yield() so that we give other VMs
>> a chance to run.
>>
>> This patch series tries to optimize the above scenarios.
>>
>> The first patch optimizes all the yield_to by bailing out when there
>> is no need to continue yield_to (i.e., when there is only one task
>> in source and target rq).
>>
>> Second patch uses that in PLE handler.
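
A rough user-space sketch of the idea behind patches 1 and 2, not the
actual kernel code. The names `rq_sketch`, `vcpu_sketch`,
`yield_to_sketch()` and `ple_handler_sketch()`, the `preempted` flag, and
the use of -ESRCH as the failure code are all assumptions made for
illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a runqueue; only the task count matters here. */
struct rq_sketch {
	unsigned int nr_running;
};

/* Hypothetical candidate vcpu: the rq it sits on and whether it is waiting. */
struct vcpu_sketch {
	struct rq_sketch rq;
	bool preempted;
};

/*
 * Patch 1 idea: if the source and the target runqueue each hold a single
 * task, nobody is waiting for a CPU, so a yield cannot help. The -ESRCH
 * return value is an assumption; any distinct error code would do.
 */
static int yield_to_sketch(const struct rq_sketch *src,
			   const struct vcpu_sketch *dst)
{
	if (src->nr_running == 1 && dst->rq.nr_running == 1)
		return -ESRCH;
	if (!dst->preempted)
		return 0;	/* not a useful target, caller may try another */
	return 1;		/* pretend the yield succeeded */
}

/*
 * Patch 2 idea: the PLE handler stops walking the candidate vcpus as soon
 * as yield_to reports the likely-undercommit case (negative return) or
 * succeeds. Returns the number of yield_to attempts made.
 */
static size_t ple_handler_sketch(const struct rq_sketch *me,
				 const struct vcpu_sketch *targets, size_t n)
{
	size_t attempts = 0;

	for (size_t i = 0; i < n; i++) {
		int ret = yield_to_sketch(me, &targets[i]);

		attempts++;
		if (ret != 0)
			break;
	}
	return attempts;
}
```

With every runqueue holding one task, the handler gives up after a single
attempt instead of walking all the vcpus.
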
>>
>> Third patch uses overall system load knowledge to take a decision on
>> continuing in the yield_to handler, and also on yielding in overcommit cases.
>> To be precise,
>> * loadavg is converted to a scale of 2048 per CPU
>> * a load value of less than 1024 is considered as undercommit and we
>> return from PLE handler in those cases
>> * a load value of greater than 3584 (1.75 * 2048) is considered as overcommit
>> and we yield to other VMs in such cases.
>>
>> (let threshold = 2048)
>> Rationale for using threshold/2 for undercommit limit:
>> Requiring the load to be below (0.5 * threshold) avoids the scenario
>> (the concern raised by Rik) where a lock-holder-preempted vcpu is still
>> waiting to be scheduled. (This scenario arises when the rq length is > 1
>> even though we are undercommitted.)
>>
>> Rationale for using (1.75 * threshold) for overcommit scenario:
>> This is a heuristic where we should probably see rq length > 1
>> and a vcpu of a different VM is waiting to be scheduled.
>>
>> Related future work (independent of this series):
>>
>> - Dynamically changing PLE window depending on system load.
>>
>> Result on 3.7.0-rc1 kernel shows around 146% improvement for ebizzy 1x
>> with 32 core PLE machine with 32 vcpu guest.
>> I believe we should get very good improvements for overcommit (especially > 2)
>> on large machines with small vcpu guests. (Could not test this as I do not have
>> access to a bigger machine)
>>
>> base = 3.7.0-rc1
>> machine: 32 core mx3850 x5 PLE mc
>>
>> ebizzy (rec/sec, higher is better)
>> ----+------------+-----------+------------+-----------+----------
>>     |       base |     stdev |    patched |     stdev |  %improve
>> ----+------------+-----------+------------+-----------+----------
>> 1x  |  2543.3750 |   20.2903 |  6279.3750 |   82.5226 | 146.89143
>> 2x  |  2410.8750 |   96.4327 |  2450.7500 |  207.8136 |   1.65396
>> 3x  |  2184.9167 |  205.5226 |  2178.3333 |   97.2034 |  -0.30131
>> ----+------------+-----------+------------+-----------+----------
>>
>> dbench (throughput in MB/sec, higher is better)
>> ----+------------+-----------+------------+-----------+----------
>>     |       base |     stdev |    patched |     stdev |  %improve
>> ----+------------+-----------+------------+-----------+----------
>> 1x  |  5545.4330 |  596.4344 |  7042.8510 | 1012.0924 |  27.00272
>> 2x  |  1993.0970 |   43.6548 |  1990.6200 |   75.7837 |  -0.12428
>> 3x  |  1295.3867 |   22.3997 |  1315.5208 |   36.0075 |   1.55429
>> ----+------------+-----------+------------+-----------+----------
>
> Could you include a PLE-off result for 1x over-commit, so we know what
> the best possible result is?
Yes. Base kernel with PLE off:

ebizzy_1x    7651.3000 rec/sec
ebizzy_2x      51.5000 rec/sec
dbench_1x   12631.4210 MB/sec
dbench_2x      45.0842 MB/sec

For ebizzy we are closer to this best case.
(Strangely, the dbench 1x result is sometimes not consistent despite 10 runs
of 3 min + 30 sec warmup on a 3G tmpfs. But it surely shows the trend.)
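
To put the PLE-off numbers next to the patched 1x results from the tables,
here is a small helper (the name `percent_of()` is just for illustration).
With the figures quoted in this thread, patched ebizzy 1x comes to roughly
82% of the PLE-off result and patched dbench 1x to roughly 56%.

```c
/* Express a measured throughput as a percentage of a reference value. */
static double percent_of(double measured, double reference)
{
	return 100.0 * measured / reference;
}

/* Figures copied from this thread: patched 1x vs. base 1x with PLE off. */
static double ebizzy_1x_vs_no_ple(void)
{
	return percent_of(6279.3750, 7651.3000);
}

static double dbench_1x_vs_no_ple(void)
{
	return percent_of(7042.8510, 12631.4210);
}
```
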
>
> Looks like skipping the yield_to() for rq = 1 helps, but I'd like to
> know if the performance is the same as PLE off for 1x. I am concerned
> the vcpu to task lookup is still expensive.
>
Yes. I still see that.
> Based on Peter's comments I would say the 3rd patch and the 2x,3x
> results are not conclusive at this time.
Avi, IMO patch 1 and 2 seem to be good to go. Please let me know.
>
> I think we should also discuss what we think a good target is. We
> should know what our high-water mark is, and IMO, if we cannot get
> close, then I do not feel we are heading down the right path. For
> example, if dbench aggregate throughput for 1x with PLE off is 10000
> MB/sec, then the best possible 2x,3x result should be a little lower
> than that due to task switching the vcpus and sharing caches. This
> should be quite evident with current PLE handler and smaller VMs (like
> 10 vcpus or less).
Very much agree here. If we look at the 2x and 3x results (any or all of
them), the aggregate is not near the 1x number. Maybe even 70% of it is a
good target.