* [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Darrick J. Wong @ 2008-07-15 23:18 UTC
To: kvm
Hi all,
This patch set attempts to distinguish which guests are generating load
on the host because of interrupts, overhead, etc, against the guests
that are generating host load because they're truly doing something.
This is done by presenting cpufreq tables to kvm guests. A guest that
is truly busy will select the highest CPU frequency, whereas a mostly
idle guest will select the lowest speed; based on this, we can change
the nice level of the guest CPU thread.
I envision four scenarios:
0. Guests that don't know about cpufreq still run at whatever nice level
they started with.
1. If we have a system with a lot of idle VMs, they will all run with +5
nice and this patch has no effect.
2. If we have a system with a lot of busy VMs, they all run with -5 nice
and this patch also has no effect.
3. If, however, we have a lot of idle VMs and a few busy ones, then the
-5 nice of the busy VMs will get those VMs extra CPU time. On a really
crummy FPU microbenchmark I have, the score goes from about 500 to 2000
with the patch applied, though of course YMMV. In some respects this
implementation shares a few ideas with the current Intel Dynamic
Acceleration implementation--you ask it for a speed that is higher than
what's written on the box, and if everything else is idle you actually
get the higher speed. Otherwise you get what's written on the box. But
you can't really know for sure.
There are some warts to this patch--most notably, the current
implementation uses the Intel MSRs and EST feature flag ... even if the
guest reports the CPU as being AuthenticAMD. Also, there could be
timing problems introduced by this change--the OS thinks the CPU
frequency changes, but I don't know the effect on the guest CPU TSCs.
Questions? Comments? Please don't apply this to mainline.
---
This patch implements the Intel cpufreq control MSR. Writes to the MSR are
used to bump up the nice level of the guest CPU thread if the OS picks a
sufficiently high p-state.
Control values are as follows:
0: Nobody's touched cpufreq. nice is whatever the default is.
1: Lowest speed. nice +5.
2: Medium speed. nice is reset.
3: High speed. nice -5.
The actual nice value is set via differential, so if a VM is started with a
nondefault nice priority it will fluctuate up and down from the initial value.
(This requires ACPI support from kvm-userspace, etc.)
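A minimal userspace model of the differential renice logic described above, using the +5/-5 offsets listed in this description (helper names are illustrative, not the patch's):

```c
#include <assert.h>

/* Map a guest PERF_CTL value to a nice offset relative to the VM
 * thread's starting nice level.  *ok is cleared for unknown values.
 * Offsets follow the +5/0/-5 scheme described in the cover letter. */
static int pctl_to_offset(int pctl, int *ok)
{
    switch (pctl) {
    case 1: *ok = 1; return  5;  /* lowest speed  */
    case 2: *ok = 1; return  0;  /* medium speed  */
    case 3: *ok = 1; return -5;  /* highest speed */
    default: *ok = 0; return 0;  /* untouched/unknown: leave nice alone */
    }
}

/* Compute the new nice value as a differential from the old PERF_CTL
 * state, clamped to the scheduler's [-20, 19] range, so a VM started
 * at a nondefault nice fluctuates around its initial value. */
static int renice(int cur_nice, int old_pctl, int new_pctl)
{
    int ok_old, ok_new, n;
    int old_off = pctl_to_offset(old_pctl, &ok_old);
    int new_off = pctl_to_offset(new_pctl, &ok_new);

    if (!ok_new)
        return cur_nice;    /* ignore bogus writes */
    if (!ok_old)
        old_off = 0;        /* first touch: baseline is the start nice */

    n = cur_nice + (new_off - old_off);
    if (n < -20) n = -20;
    if (n > 19)  n = 19;
    return n;
}
```

Starting from nice 0, a write of 3 lands at -5, and a later write of 1 moves to +5, matching the differential behavior described above.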
Applies against vanilla 2.6.26.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
---
arch/x86/kvm/x86.c | 51 +++++++++++++++++++++++++++++++++++++++++---
include/asm-x86/kvm_host.h | 1 +
2 files changed, 49 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63a77ca..233ded2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -27,6 +27,7 @@
#include <linux/module.h>
#include <linux/mman.h>
#include <linux/highmem.h>
+#include <linux/security.h>
#include <asm/uaccess.h>
#include <asm/msr.h>
@@ -431,7 +432,7 @@ static u32 msrs_to_save[] = {
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
#endif
MSR_IA32_TIME_STAMP_COUNTER, MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
- MSR_IA32_PERF_STATUS,
+ MSR_IA32_PERF_STATUS, MSR_IA32_PERF_CTL,
};
static unsigned num_msrs_to_save;
@@ -604,6 +605,44 @@ static void kvm_write_guest_time(struct kvm_vcpu *v)
mark_page_dirty(v->kvm, vcpu->time >> PAGE_SHIFT);
}
+static int perf_ctl_to_nice(int pctl)
+{
+ switch (pctl) {
+ case 3: /* most favorable */
+ return 10;
+ case 2:
+ return 20;
+ case 1: /* least favorable */
+ return 30;
+ default:
+ return -EINVAL;
+ }
+}
+
+static void write_perf_ctl(struct kvm_vcpu *vcpu, u64 pctl)
+{
+ int new_nice;
+ int old_nice_boost = perf_ctl_to_nice(vcpu->arch.ia32_perf_ctl);
+ int new_nice_boost = perf_ctl_to_nice(pctl);
+
+ if (old_nice_boost < 0)
+ old_nice_boost = 0;
+ else
+ old_nice_boost -= 20;
+
+ if (new_nice_boost < 0)
+ return;
+ new_nice_boost -= 20;
+
+ new_nice = (new_nice_boost - old_nice_boost) + task_nice(current);
+ if (new_nice < -20)
+ new_nice = -20;
+ if (new_nice > 19)
+ new_nice = 19;
+
+ set_user_nice(current, new_nice);
+ vcpu->arch.ia32_perf_ctl = pctl;
+}
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
@@ -633,6 +672,9 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
case MSR_IA32_MISC_ENABLE:
vcpu->arch.ia32_misc_enable_msr = data;
break;
+ case MSR_IA32_PERF_CTL:
+ write_perf_ctl(vcpu, data);
+ break;
case MSR_KVM_WALL_CLOCK:
vcpu->kvm->arch.wall_clock = data;
kvm_write_wall_clock(vcpu->kvm, data);
@@ -717,13 +759,15 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
data = kvm_get_apic_base(vcpu);
break;
case MSR_IA32_MISC_ENABLE:
- data = vcpu->arch.ia32_misc_enable_msr;
+ data = vcpu->arch.ia32_misc_enable_msr | 0x10000;
break;
case MSR_IA32_PERF_STATUS:
/* TSC increment by tick */
data = 1000ULL;
/* CPU multiplier */
data |= (((uint64_t)4ULL) << 40);
+ break;
+ case MSR_IA32_PERF_CTL:
+ data = vcpu->arch.ia32_perf_ctl;
break;
case MSR_EFER:
data = vcpu->arch.shadow_efer;
@@ -1113,7 +1157,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
bit(X86_FEATURE_3DNOWEXT) |
bit(X86_FEATURE_3DNOW);
const u32 kvm_supported_word3_x86_features =
- bit(X86_FEATURE_XMM3) | bit(X86_FEATURE_CX16);
+ bit(X86_FEATURE_XMM3) | bit(X86_FEATURE_CX16) |
+ bit(X86_FEATURE_EST);
const u32 kvm_supported_word6_x86_features =
bit(X86_FEATURE_LAHF_LM) | bit(X86_FEATURE_CMP_LEGACY);
diff --git a/include/asm-x86/kvm_host.h b/include/asm-x86/kvm_host.h
index 844f2a8..0bfa7bb 100644
--- a/include/asm-x86/kvm_host.h
+++ b/include/asm-x86/kvm_host.h
@@ -231,6 +231,7 @@ struct kvm_vcpu_arch {
int mp_state;
int sipi_vector;
u64 ia32_misc_enable_msr;
+ u64 ia32_perf_ctl;
bool tpr_access_reporting;
struct kvm_mmu mmu;
* RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Tian, Kevin @ 2008-07-16 5:56 UTC
To: djwong, kvm
>From: Darrick J. Wong
>Sent: July 16, 2008 7:18
>
>I envision four scenarios:
>
>0. Guests that don't know about cpufreq still run at whatever
>nice level
>they started with.
>
>1. If we have a system with a lot of idle VMs, they will all
>run with +5
>nice and this patch has no effect.
>
>2. If we have a system with a lot of busy VMs, they all run
>with -5 nice
>and this patch also has no effect.
>
>3. If, however, we have a lot of idle VMs and a few busy ones, then the
>-5 nice of the busy VMs will get those VMs extra CPU time. On a really
>crummy FPU microbenchmark I have, the score goes from about 500 to 2000
>with the patch applied, though of course YMMV. In some respects this
How many VMs did you run in this test? All the VMs are idle except
the one where your benchmark runs?
How about the actual effect when several VMs are doing some stuff?
There's another scenario where some VMs don't support cpufreq while
others do. Is it unfair here to just renice the latter when the former
are not being 'nice' at all?
Guess this feature has to be applied with some qualifications, e.g.
in a group of VMs with known same PM abilities...
>
>There are some warts to this patch--most notably, the current
>implementation uses the Intel MSRs and EST feature flag ... even if the
>guest reports the CPU as being AuthenticAMD. Also, there could be
>timing problems introduced by this change--the OS thinks the CPU
>frequency changes, but I don't know the effect on the guest CPU TSCs.
You can report constant tsc feature in cpuid virtualization. Of course
if physical TSC is unstable, it's another story about how to mark guest
TSC untrustable. (e.g. Marcelo develops one method by simulating C2)
>
>Control values are as as follows:
>0: Nobody's touched cpufreq. nice is the whatever the default is.
>1: Lowest speed. nice +5.
>2: Medium speed. nice is reset.
>3: High speed. nice -5.
This description seems to mismatch the implementation, which applies
+10 and -10 for the 1 and 3 cases. Maybe I misinterpreted the code?
One interesting point is the initial value of PERF_CTL MSR. Current 'zero'
doesn't reflect a meanful state to guest, since there's no perf entry in
ACPI table to carry such value. One likely result is that guest'd think the
cur freq as 0 when initializing ACPI cpufreq driver. So it would make more
sense to set initial value to 2 (P1), as keeping the default nice value, or
even 3 (P0), if you take that state as IDA style which may over-clock but
not assure.
More critical points to be further thought of, if expecting this feature to be
in real use, are the definition of the exposed virtual freq states, and how these
states can be mapped to scheduler knobs. Inappropriate exposure may
cause guest to excessively bounce between virtual freq points. For example,
'nice' value is only a relative hint to scheduler and there's no guarantee that
same portion of cpu cycles are added as what 'nice' value changes. There's
even the case where guest requests lowest speed while actual cpu cycles
allocated to it keeps similar as last epoch when it's in high speed. This
will fool the guest that lowest speed can satisfy its requirement. It's similar
to the requirement on core-based hardware coordination logic, where some
feedback mechanism (e.g. APERF/MPERF MSR pair) is required to reveal
actual freq in last sampling period. Here the VM case may need similar
virtualized feedback mechanism. Not sure whether 'actual' freq is easily
deduced however.
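The feedback idea above amounts to scaling a known base frequency by the ratio of the APERF/MPERF deltas over the sampling period; a minimal sketch (function name illustrative):

```c
#include <stdint.h>

/* Effective frequency in kHz over a sampling period, derived from
 * APERF/MPERF counter deltas: MPERF ticks at a fixed reference rate,
 * APERF at the actually delivered rate, so
 *   actual = base * d_aperf / d_mperf. */
static uint64_t effective_khz(uint64_t base_khz,
                              uint64_t d_aperf, uint64_t d_mperf)
{
    if (d_mperf == 0)
        return base_khz;    /* no reference ticks: fall back to base */
    return base_khz * d_aperf / d_mperf;
}
```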
Maybe it's worthwhile to compare the freq change count for the same benchmark
between VM and native, and more interesting is, how's the effect when
multiple VMs all take use of such features? For example, whether the
expected effect is counteracted with only overhead added? Any strange
behaviors exposed as in real 'nice' won't be changed so frequently in dozens
of ms level? :-)
Thanks,
Kevin
* Re: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Darrick J. Wong @ 2008-07-17 19:05 UTC
To: Tian, Kevin; +Cc: kvm
On Wed, Jul 16, 2008 at 01:56:51PM +0800, Tian, Kevin wrote:
>
> How many VMs did you run in this test?
100 idle
> All the VMs are idle except
> the one where your benchmark runs?
Yes.
> How about the actual effect when several VMs are doing some stuff?
If there are multiple VMs that are busy, the busy ones will fight among
themselves for CPU time. I still see some priority boost, just not as
much.
> There's another scenario where some VMs don't support cpufreq while
> others do. Is it unfair here to just renice the latter when the former
> are not being 'nice' at all?
>
> Guess this feature has to be applied with some qualifications, e.g.
> in a group of VMs with known same PM abilities...
Agreed it's not very convenient for guests that don't know about
cpufreq, though I was planning to make it the case that unaware guests
get no priority boost/reduction.
> You can report constant tsc feature in cpuid virtualization. Of course
> if physical TSC is unstable, it's another story about how to mark guest
> TSC untrustable. (e.g. Marcelo develops one method by simulating C2)
I wonder how stable the virtual tsc is...? Will have to study this.
> This description seems to mismatch the implementation, which applies
> +10 and -10 for the 1 and 3 cases. Maybe I misinterpreted the code?
Nope, that's a mistake on my part.
> One interesting point is the initial value of PERF_CTL MSR. Current 'zero'
> doesn't reflect a meaningful state to the guest, since there's no perf entry in
> ACPI table to carry such value. One likely result is that guest'd think the
> cur freq as 0 when initializing ACPI cpufreq driver. So it would make more
> sense to set initial value to 2 (P1), as keeping the default nice value, or
> even 3 (P0), if you take that state as IDA style which may over-clock but
> not assure.
Indeed. I had pondered this point considerably myself. For this RFC I
decided that I could leave the MSR as zero as a way of detecting a guest
that didn't know anything, in case that ability is useful. However, the
Linux drivers seem to give you either 0MHz or some arbitrary p-state, so
I think I'll change it to value 1.
> More critical points to be further thought of, if expecting this feature to be
> in real use, are the definition of the exposed virtual freq states, and how these
> states can be mapped to scheduler knobs. Inappropriate exposure may
> cause guest to excessively bounce between virtual freq points. For example,
> 'nice' value is only a relative hint to scheduler and there's no guarantee that
> same portion of cpu cycles are added as what 'nice' value changes. There's
IDA has the same problem... the T61 BIOS "compensates" for this fakery
by reporting a frequency of $max_freq + 1 so if you're smart then you'll
somehow know that you might see a boost that you can't measure. :P
I suppose the problem here is that p-states were designed on the
assumption that you're directly manipulating hardware speeds, whereas
what we really want in both this patch and IDA are qualitative values
("medium speed", "highest speed", "ludicrous speed?")
> even the case where guest requests lowest speed while actual cpu cycles
> allocated to it keeps similar as last epoch when it's in high speed. This
> will fool the guest that lowest speed can satisfy its requirement. It's similar
On the other hand, if you get the same performance at both high and low
speeds, then it doesn't really matter which one you choose. At least
not until the load changes. I suppose the next question is, how much
software is dependent on knowing the exact CPU frequency, and are
workload schedulers smart enough to realize that performance
characteristics can change over time (throttling, TM1/TM2, etc)?
Inasmuch as you actually ever know, since with hardware coordination of
cpufreq the hardware can do whatever it wants.
> to the requirement on core-based hardware coordination logic, where some
> feedback mechanism (e.g. APERF/MPERF MSR pair) is required to reveal
> actual freq in last sampling period. Here the VM case may need similar
> virtualized feedback mechanism. Not sure whether 'actual' freq is easily
> deduced however.
I don't think it's easily deduced. I also don't think APERF/MPERF are emulated in
kvm yet. I suppose it wouldn't be difficult to add those two, though
measuring that might be a bit messy.
Maybe the cheap workaround for now is to report the CPU speeds in the
table as n-1, n, n+1.
> Maybe it's worthwhile to compare the freq change count for the same benchmark
> between VM and native, and more interesting is, how's the effect when
> multiple VMs all take use of such features? For example, whether the
> expected effect is counteracted with only overhead added? Any strange
> behaviors exposed as in real 'nice' won't be changed so frequently in dozens
> of ms level? :-)
I'll run some benchmarks and see what happens over the next week.
--D
* RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Tian, Kevin @ 2008-07-18 5:44 UTC
To: djwong; +Cc: kvm
>From: Darrick J. Wong [mailto:djwong@us.ibm.com]
>Sent: July 18, 2008 3:05
>
>If there are multiple VMs that are busy, the busy ones will fight among
>themselves for CPU time. I still see some priority boost, just not as
>much.
Some micro-level analysis would be useful here.
>
>I wonder how stable the virtual tsc is...? Will have to study this.
My point is that exposing virtual freq states doesn't change whether
the virtual tsc is stable, since the interception logic for virtual
freq change requests only impacts nice. That's the expected behavior.
Whether the virtual tsc is stable is a separate issue, outside this
feature.
>
>IDA has the same problem... the T61 BIOS "compensates" for this fakery
>by reporting a frequency of $max_freq + 1 so if you're smart
>then you'll
>somehow know that you might see a boost that you can't measure. :P
It can be measured; one necessary requirement pushed on any
hardware-coordinated logic is to provide some type of feedback
mechanism. For example, Intel processors provide the APERF/MPERF
pair, with MPERF incremented in proportion to a fixed boot frequency
while APERF increments in proportion to actual performance.
Software should use APERF/MPERF to understand the actual freq over
the elapsed sampling period.
>
>I suppose the problem here is that p-states were designed on the
>assumption that you're directly manipulating hardware speeds, whereas
>what we really want in both this patch and IDA are qualitative values
>("medium speed", "highest speed", "ludicrous speed?")
It's still a bit different.
For IDA, when ludicrous speed is requested, it may be granted.
However, when it's not, the actual freq will still be the highest
speed and never lower.
For this feature, however, how many cpu cycles are granted is not
decided by a single 'nice' value; it depends instead on the number
of active vcpus at a given time on a given cpu. Whatever speed is
requested, whether medium, highest or ludicrous, the granted cycles
can always vary from some minimum (many vcpus contending) to 100%
(only the current one active).
>On the other hand, if you get the same performance at both
>high and low
>speeds, then it doesn't really matter which one you choose. At least
>not until the load changes. I suppose the next question is, how much
>software is dependent on knowing the exact CPU frequency, and are
>workload schedulers smart enough to realize that performance
>characteristics can change over time (throttling, TM1/TM2, etc)?
>Inasmuch as you actually ever know, since with hardware coordination of
>cpufreq the hardware can do whatever it wants.
Throttling and TM1/TM2 are related to thermal events when some
threshold is reached. Here let's focus on DBS (Demand-Based
Switching), which is actively conducted by OSPM based on workload
estimation. A typical freq demotion flow is as follows:
If (PercentBusy * Pc/Pn) < threshold
switch Pc to Pn;
Here PercentBusy represents CPU utilization in the elapsed sampling
period, Pc stands for the freq used in the elapsed period, and Pn is
the candidate lower freq to change to. If the freq change can still
keep CPU utilization under the predefined threshold, the transition
is viable.
The key point is PercentBusy and Pc, which may make the final
decision pointless if inaccurate. That's why hardware coordination
logic is required to provide some feedback to get Pc.
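The demotion rule above can be written directly as a small predicate (names and units are illustrative):

```c
/* OSPM demand-based switching demotion check: project utilization at
 * the candidate lower freq Pn and demote only if it stays under the
 * threshold.  Frequencies in kHz; utilization and threshold in
 * percent. */
static int should_demote(unsigned pct_busy, unsigned pc_khz,
                         unsigned pn_khz, unsigned threshold_pct)
{
    /* projected utilization if we slow from Pc down to Pn */
    unsigned projected = pct_busy * pc_khz / pn_khz;
    return projected < threshold_pct;
}
```

E.g. 20% busy at 2 GHz projects to 40% at 1 GHz, so a demotion under an 80% threshold is allowed, while 50% busy projects to 100% and is refused.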
I agree that the guest should eventually be able to catch up if a
wrong decision leaves its workload restrained or over-granted. E.g.,
when there's only one vcpu active on a pcpu and it requests a medium
speed, 100% of cycles are granted, making it think that medium speed
is enough for its current workload. Later, when other vcpus are
active on the same pcpu, its granted cycles drop; it may then
realize medium speed is not enough and request the highest speed,
which may add back some cycles via a lower nice value.
But it's better to do some micro-level analysis to understand
whether this works as expected and, more importantly, how fast this
catch-up is. Note that the guest checks for freq changes at roughly
the 20ms level, and we then need to make sure no thrashing is caused
that would mess up both guest and host.
Another concern just raised is the measurement of PercentBusy.
Take Linux for example: it normally subtracts idle time from elapsed
time. If the guest doesn't understand steal time, PercentBusy may
not reflect the facts at all. For example, say a vcpu is
continuously busy for 10ms, then happens to enter the idle loop and
is scheduled out for 20ms. The next time it is re-scheduled in, its
dbs timer will get PercentBusy as 33.3%, though actually it was
fully busy. However, when a vcpu is scheduled out outside of the
idle loop, that steal time is counted as busy. I'm still not clear
how this may affect this patch...
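The 10ms/20ms example above can be checked with a toy utilization calculation; a naive guest charges stolen time as idle, while a steal-aware one discounts it (function names are illustrative):

```c
/* Guest-visible utilization in tenths of a percent.  busy/idle are
 * milliseconds the guest spent busy or in its idle loop; steal is
 * time the vcpu was scheduled out while in the idle loop. */
static unsigned pct_busy_naive(unsigned busy_ms, unsigned idle_ms,
                               unsigned steal_ms)
{
    /* steal charged as idle: the 10ms busy / 20ms stolen case
     * above comes out as 10/30 = 33.3% despite being fully busy */
    return 1000 * busy_ms / (busy_ms + idle_ms + steal_ms);
}

static unsigned pct_busy_steal_aware(unsigned busy_ms, unsigned idle_ms)
{
    /* discount stolen time: utilization of the cycles actually granted */
    return 1000 * busy_ms / (busy_ms + idle_ms);
}
```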
>
>I don't think it's easily deduced. I also don't think
>APERF/MPERF are emulated in
>kvm yet. I suppose it wouldn't be difficult to add those two, though
>measuring that might be a bit messy.
>
>Maybe the cheap workaround for now is to report the CPU speeds in the
>table as n-1, n, n+1.
Yes, we may report at least at a qualitative level to see the effect.
>
>I'll run some benchmarks and see what happens over the next week.
>
Thanks for your work.
Kevin
* Re: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Avi Kivity @ 2008-07-27 8:27 UTC
To: Tian, Kevin; +Cc: djwong, kvm
Tian, Kevin wrote:
>> From: Darrick J. Wong
>> Sent: July 16, 2008 7:18
>>
>> I envision four scenarios:
>>
>> 0. Guests that don't know about cpufreq still run at whatever
>> nice level
>> they started with.
>>
>> 1. If we have a system with a lot of idle VMs, they will all
>> run with +5
>> nice and this patch has no effect.
>>
>> 2. If we have a system with a lot of busy VMs, they all run
>> with -5 nice
>> and this patch also has no effect.
>>
>> 3. If, however, we have a lot of idle VMs and a few busy ones, then the
>> -5 nice of the busy VMs will get those VMs extra CPU time. On a really
>> crummy FPU microbenchmark I have, the score goes from about 500 to 2000
>> with the patch applied, though of course YMMV. In some respects this
>>
>
> How many VMs did you run in this test? All the VMs are idle except
> the one where your benchmark runs?
>
> How about the actual effect when several VMs are doing some stuff?
>
> There's another scenario where some VMs don't support cpufreq while
> others do. Here is it unfair to just renice the latter when the former is
> not 'nice' at all?
>
I guess the solution for such issues is not to have kvm (or qemu) play
with nice levels, but instead send notifications on virtual frequency
changes on the qemu monitor. The management application can then choose
whether to ignore the information, play with nice levels, or even
propagate the frequency change to the host (useful in client-side
virtualization).
--
error compiling committee.c: too many arguments to function
* RE: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Tian, Kevin @ 2008-07-28 0:56 UTC
To: Avi Kivity; +Cc: djwong, kvm
>From: Avi Kivity [mailto:avi@qumranet.com]
>Sent: July 27, 2008 16:27
>
>Tian, Kevin wrote:
>>> From: Darrick J. Wong
>>> Sent: July 16, 2008 7:18
>>>
>>> I envision four scenarios:
>>>
>>> 0. Guests that don't know about cpufreq still run at whatever
>>> nice level
>>> they started with.
>>>
>>> 1. If we have a system with a lot of idle VMs, they will all
>>> run with +5
>>> nice and this patch has no effect.
>>>
>>> 2. If we have a system with a lot of busy VMs, they all run
>>> with -5 nice
>>> and this patch also has no effect.
>>>
>>> 3. If, however, we have a lot of idle VMs and a few busy
>ones, then the
>>> -5 nice of the busy VMs will get those VMs extra CPU time.
>On a really
>>> crummy FPU microbenchmark I have, the score goes from about
>500 to 2000
>>> with the patch applied, though of course YMMV. In some
>respects this
>>>
>>
>> How many VMs did you run in this test? All the VMs are idle except
>> the one where your benchmark runs?
>>
>> How about the actual effect when several VMs are doing some stuff?
>>
>> There's another scenario where some VMs don't support cpufreq while
>> others do. Here is it unfair to just renice the latter when
>the former is
>> not 'nice' at all?
>>
>
>I guess the solution for such issues is not to have kvm (or qemu) play
>with nice levels, but instead send notifications on virtual frequency
>changes on the qemu monitor. The management application can then choose
>whether to ignore the information, play with nice levels, or even
>propagate the frequency change to the host (useful in client-side
>virtualization).
>
Yes, that'd be more flexible and cleaner.
Thanks,
Kevin
* Re: [RFC 1/2] Simulate Intel cpufreq MSRs in kvm guests to influence nice priority
From: Darrick J. Wong @ 2008-09-04 19:38 UTC
To: Tian, Kevin; +Cc: Avi Kivity, kvm
On Mon, Jul 28, 2008 at 08:56:34AM +0800, Tian, Kevin wrote:
> >I guess the solution for such issues is not to have kvm (or qemu) play
> >with nice levels, but instead send notifications on virtual frequency
> >changes on the qemu monitor. The management application can then choose
> >whether to ignore the information, play with nice levels, or even
> >propagate the frequency change to the host (useful in client-side
> >virtualization).
I like this idea too.
I've been giving a little more thought to how we present cpufreq
"control" to the guest. According to the ACPI specs, either we can
implement a fixed hardware implementation (i.e. MSRs) or we can provide
a system i/o address that (presumably) traps to the firmware so that the
BIOS can do the actual work. Since the MSR controls are different
between Intel and AMD (Linux refuses to use Intel MSRs on an AMD CPU and
vice versa), I'm thinking it might be easier to use the system I/O route
because then we don't have to spend any code emulating the hardware
mechanisms when we don't need to do so.
Of course, that's assuming that it's easy to set up a "magic" I/O port
on the guest that will trap into the VMM so that we can perform whatever
magic we want. I would assume that this is the case, though I've only
just now gotten back to this patch set. I've also not studied the speed
difference between the emulated wrmsr command and this manner of I/O
port access, but I suppose I can try it and find out. :)
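A rough sketch of what the VMM-side trap for such a "magic" port might look like; the port number, struct, and handler names here are hypothetical, not anything that exists in kvm:

```c
#include <stdint.h>

/* Hypothetical layout: the guest writes its requested p-state (1..3)
 * as a single byte to a magic I/O port; the VMM intercepts the write
 * and applies the same policy an MSR write would.  0xb2 is chosen only
 * as an example, in the style of the APM/SMI command port. */
#define PSTATE_PORT 0xb2

struct vcpu_model {
    uint8_t perf_ctl;   /* last accepted p-state request */
};

/* Returns 1 if the write was ours and handled, 0 to let other port
 * handlers run.  The renice policy hook would be called where noted. */
static int pstate_pio_write(struct vcpu_model *v, uint16_t port,
                            uint8_t val)
{
    if (port != PSTATE_PORT)
        return 0;
    if (val >= 1 && val <= 3)
        v->perf_ctl = val;  /* policy hook (renice/notification) here */
    return 1;
}
```

The same dispatch shape would work for either the nice-level policy in this patch or the monitor-notification approach Avi suggested.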
At least in theory this would also eliminate an obstacle to migrating
VMs from Intel to AMD CPUs, but I suspect that's not really feasible
anyway.
--D