* [patch 1/3] cpufreq: implement min/max/up/down functions
2017-03-01 15:04 [patch 0/3] KVM CPU frequency change hypercalls (resend) Marcelo Tosatti
@ 2017-03-01 15:04 ` Marcelo Tosatti
2017-03-01 15:04 ` [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls Marcelo Tosatti
` (2 subsequent siblings)
3 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2017-03-01 15:04 UTC (permalink / raw)
To: kvm, linux-pm
Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
Marcelo Tosatti
[-- Attachment #1: 1 --]
[-- Type: text/plain, Size: 6522 bytes --]
Implement functions in the cpufreq userspace governor code to:
* Change the current frequency to the {max,min,up,down} frequency,
up/down being relative to the current one.
These will be used to implement KVM hypercalls that let the guest
issue frequency changes.
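For illustration only, the up/down selection can be sketched in plain C outside the kernel. The names freq_step_up and freq_step_down are hypothetical and the table is an unsorted array of kHz values; the patch itself walks policy->freq_table with cpufreq_for_each_valid_entry.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch, not the kernel code: return the lowest table entry
 * strictly above `cur`, capped at `max`. Returns `cur` if there is none.
 */
static unsigned int freq_step_up(const unsigned int *table, size_t n,
                                 unsigned int cur, unsigned int max)
{
    unsigned int next = max;
    size_t i;

    assert(table != NULL);
    for (i = 0; i < n; i++) {
        if (table[i] > cur && table[i] < next)
            next = table[i];
    }
    return next > cur ? next : cur;
}

/*
 * Mirror image: return the highest table entry strictly below `cur`,
 * floored at `min`. Returns `cur` if there is none.
 */
static unsigned int freq_step_down(const unsigned int *table, size_t n,
                                   unsigned int cur, unsigned int min)
{
    unsigned int prev = min;
    size_t i;

    assert(table != NULL);
    for (i = 0; i < n; i++) {
        if (table[i] < cur && table[i] > prev)
            prev = table[i];
    }
    return prev < cur ? prev : cur;
}
```

A result equal to `cur` corresponds to the "already at the limit" case, for which the functions in this patch return 1.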
Current situation with DPDK and frequency changes is as follows:
An algorithm in the guest decides when to increase/decrease
frequency based on the queue length of the device.
On the host, a power manager daemon is used to listen for
frequency change requests (on another core) and issue these
requests.
However frequency changes are performance sensitive events because:
On a change from low load condition to max load condition,
the frequency should be raised as soon as possible.
Sending a virtio-serial notification to another pCPU,
waiting for that pCPU to initiate an IPI to the requestor pCPU
to change frequency, is slower and more cache costly than
a direct hypercall to host to switch the frequency.
Moreover, if the pCPU where the power manager daemon is running
is not busy spinning on requests from the isolated DPDK vcpus,
there is also the cost of HLT wakeup for that pCPU.
Setup instructions:
Disable the intel_pstate driver (intel_pstate=disable on the host kernel
command line), and set the cpufreq userspace governor for
the isolated pCPU.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
drivers/cpufreq/cpufreq_userspace.c | 172 ++++++++++++++++++++++++++++++++++++
include/linux/cpufreq.h | 7 +
2 files changed, 179 insertions(+)
Index: kvm-pvfreq/drivers/cpufreq/cpufreq_userspace.c
===================================================================
--- kvm-pvfreq.orig/drivers/cpufreq/cpufreq_userspace.c 2017-01-31 10:41:54.102575877 -0200
+++ kvm-pvfreq/drivers/cpufreq/cpufreq_userspace.c 2017-02-02 15:32:53.456262640 -0200
@@ -118,6 +118,178 @@
mutex_unlock(&userspace_mutex);
}
+static int cpufreq_is_userspace_governor(int cpu)
+{
+ int ret;
+
+ mutex_lock(&userspace_mutex);
+ ret = per_cpu(cpu_is_managed, cpu);
+ mutex_unlock(&userspace_mutex);
+
+ return ret;
+}
+
+int cpufreq_userspace_freq_up(int cpu)
+{
+ unsigned int curfreq, nextminfreq;
+ unsigned int ret = 0;
+ struct cpufreq_frequency_table *pos, *table;
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+
+ if (!policy)
+ return -EINVAL;
+
+ if (!cpufreq_is_userspace_governor(cpu)) {
+ cpufreq_cpu_put(policy);
+ return -EINVAL;
+ }
+
+ cpufreq_cpu_put(policy);
+
+ mutex_lock(&userspace_mutex);
+ table = policy->freq_table;
+ if (!table) {
+ mutex_unlock(&userspace_mutex);
+ return -ENODEV;
+ }
+ nextminfreq = cpufreq_quick_get_max(cpu);
+ curfreq = policy->cur;
+
+ cpufreq_for_each_valid_entry(pos, table) {
+ if (pos->frequency > curfreq &&
+ pos->frequency < nextminfreq)
+ nextminfreq = pos->frequency;
+ }
+
+ if (nextminfreq != curfreq) {
+ unsigned int *setspeed = policy->governor_data;
+
+ *setspeed = nextminfreq;
+ ret = __cpufreq_driver_target(policy, nextminfreq,
+ CPUFREQ_RELATION_L);
+ } else
+ ret = 1;
+ mutex_unlock(&userspace_mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_up);
+
+int cpufreq_userspace_freq_down(int cpu)
+{
+ unsigned int curfreq, prevmaxfreq;
+ unsigned int ret = 0;
+ struct cpufreq_frequency_table *pos, *table;
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+
+ if (!policy)
+ return -EINVAL;
+
+ if (!cpufreq_is_userspace_governor(cpu)) {
+ cpufreq_cpu_put(policy);
+ return -EINVAL;
+ }
+
+ cpufreq_cpu_put(policy);
+
+ mutex_lock(&userspace_mutex);
+ table = policy->freq_table;
+ if (!table) {
+ mutex_unlock(&userspace_mutex);
+ return -ENODEV;
+ }
+ prevmaxfreq = policy->min;
+ curfreq = policy->cur;
+
+ cpufreq_for_each_valid_entry(pos, table) {
+ if (pos->frequency < curfreq &&
+ pos->frequency > prevmaxfreq)
+ prevmaxfreq = pos->frequency;
+ }
+
+ if (prevmaxfreq != curfreq) {
+ unsigned int *setspeed = policy->governor_data;
+
+ *setspeed = prevmaxfreq;
+ ret = __cpufreq_driver_target(policy, prevmaxfreq,
+ CPUFREQ_RELATION_L);
+ } else
+ ret = 1;
+ mutex_unlock(&userspace_mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_down);
+
+int cpufreq_userspace_freq_max(int cpu)
+{
+ unsigned int maxfreq;
+ unsigned int ret = 0;
+ struct cpufreq_frequency_table *table;
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+ unsigned int *setspeed;
+
+ if (!policy)
+ return -EINVAL;
+ setspeed = policy->governor_data;
+
+ if (!cpufreq_is_userspace_governor(cpu)) {
+ cpufreq_cpu_put(policy);
+ return -EINVAL;
+ }
+
+ cpufreq_cpu_put(policy);
+
+ mutex_lock(&userspace_mutex);
+ table = policy->freq_table;
+ if (!table) {
+ mutex_unlock(&userspace_mutex);
+ return -ENODEV;
+ }
+ maxfreq = cpufreq_quick_get_max(cpu);
+
+ *setspeed = maxfreq;
+ ret = __cpufreq_driver_target(policy, maxfreq, CPUFREQ_RELATION_L);
+ mutex_unlock(&userspace_mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_max);
+
+int cpufreq_userspace_freq_min(int cpu)
+{
+ unsigned int minfreq, *setspeed;
+ unsigned int ret = 0;
+ struct cpufreq_frequency_table *table;
+ struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
+
+ if (!policy)
+ return -EINVAL;
+ setspeed = policy->governor_data;
+
+ if (!cpufreq_is_userspace_governor(cpu)) {
+ cpufreq_cpu_put(policy);
+ return -EINVAL;
+ }
+ minfreq = policy->min;
+
+ cpufreq_cpu_put(policy);
+
+ mutex_lock(&userspace_mutex);
+ table = policy->freq_table;
+ if (!table) {
+ mutex_unlock(&userspace_mutex);
+ return -ENODEV;
+ }
+
+ *setspeed = minfreq;
+ ret = __cpufreq_driver_target(policy, minfreq, CPUFREQ_RELATION_L);
+ mutex_unlock(&userspace_mutex);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(cpufreq_userspace_freq_min);
+
static struct cpufreq_governor cpufreq_gov_userspace = {
.name = "userspace",
.init = cpufreq_userspace_policy_init,
Index: kvm-pvfreq/include/linux/cpufreq.h
===================================================================
--- kvm-pvfreq.orig/include/linux/cpufreq.h 2017-01-31 10:41:54.102575877 -0200
+++ kvm-pvfreq/include/linux/cpufreq.h 2017-01-31 14:20:00.508613672 -0200
@@ -890,4 +890,11 @@
int cpufreq_generic_init(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int transition_latency);
+#ifdef CONFIG_CPU_FREQ
+int cpufreq_userspace_freq_down(int cpu);
+int cpufreq_userspace_freq_up(int cpu);
+int cpufreq_userspace_freq_max(int cpu);
+int cpufreq_userspace_freq_min(int cpu);
+#else
+#endif
#endif /* _LINUX_CPUFREQ_H */
* [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls
2017-03-01 15:04 [patch 0/3] KVM CPU frequency change hypercalls (resend) Marcelo Tosatti
2017-03-01 15:04 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
@ 2017-03-01 15:04 ` Marcelo Tosatti
2017-03-01 15:04 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
2017-03-02 10:15 ` [patch 0/3] KVM CPU frequency change hypercalls (resend) Paolo Bonzini
3 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2017-03-01 15:04 UTC (permalink / raw)
To: kvm, linux-pm
Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
Marcelo Tosatti
[-- Attachment #1: 2 --]
[-- Type: text/plain, Size: 3480 bytes --]
For most VMs, modifying the host frequency is an undesired
operation. Introduce ioctl to enable the guest to
modify host CPU frequency.
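As a rough sketch of how a VMM might use this, assuming the uapi additions proposed in this patch: the struct layout and ioctl number below are copied from the diff and do not exist in mainline headers, KVMIO is 0xAE as in include/uapi/linux/kvm.h, and the helper name vcpu_allow_freq_hypercalls is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Copied from the uapi changes proposed in this patch (not in mainline). */
struct kvm_vcpu_allow_freq {
    uint16_t enable;
    uint16_t pad[7];
};
static_assert(sizeof(struct kvm_vcpu_allow_freq) == 16,
              "layout must match the proposed uapi struct");

#ifndef KVMIO
#define KVMIO 0xAE
#endif
#define KVM_SET_VCPU_ALLOW_FREQ_HC _IO(KVMIO, 0xb8)

/*
 * Hypothetical helper: enable or disable frequency hypercalls on one vCPU.
 * vcpu_fd is a descriptor obtained from KVM_CREATE_VCPU.
 * Returns 0 on success, -1 with errno set on failure (ioctl convention).
 */
static int vcpu_allow_freq_hypercalls(int vcpu_fd, int enable)
{
    struct kvm_vcpu_allow_freq freq = { .enable = enable ? 1 : 0 };

    return ioctl(vcpu_fd, KVM_SET_VCPU_ALLOW_FREQ_HC, &freq);
}
```

A caller would normally probe KVM_CAP_ALLOW_FREQ_HC with KVM_CHECK_EXTENSION first, since a kernel without this patch rejects the ioctl.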
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/include/uapi/asm/kvm.h | 5 +++++
arch/x86/kvm/x86.c | 20 ++++++++++++++++++++
include/uapi/linux/kvm.h | 3 +++
virt/kvm/kvm_main.c | 2 ++
5 files changed, 32 insertions(+)
Index: kvm-pvfreq/arch/x86/kvm/x86.c
===================================================================
--- kvm-pvfreq.orig/arch/x86/kvm/x86.c 2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/arch/x86/kvm/x86.c 2017-01-31 10:34:25.443618639 -0200
@@ -3665,6 +3665,26 @@
r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
break;
}
+ case KVM_SET_VCPU_ALLOW_FREQ_HC: {
+ struct kvm_vcpu_allow_freq freq;
+
+ r = -EFAULT;
+ if (copy_from_user(&freq, argp, sizeof(freq)))
+ goto out;
+ vcpu->arch.allow_freq_hypercall = freq.enable;
+ r = 0;
+ break;
+ }
+ case KVM_GET_VCPU_ALLOW_FREQ_HC: {
+ struct kvm_vcpu_allow_freq freq = {};
+
+ freq.enable = vcpu->arch.allow_freq_hypercall;
+ r = -EFAULT;
+ if (copy_to_user(argp, &freq, sizeof(freq)))
+ break;
+ r = 0;
+ break;
+ }
default:
r = -EINVAL;
}
Index: kvm-pvfreq/include/uapi/linux/kvm.h
===================================================================
--- kvm-pvfreq.orig/include/uapi/linux/kvm.h 2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/include/uapi/linux/kvm.h 2017-01-31 10:32:38.000389402 -0200
@@ -871,6 +871,7 @@
#define KVM_CAP_S390_USER_INSTR0 130
#define KVM_CAP_MSI_DEVID 131
#define KVM_CAP_PPC_HTM 132
+#define KVM_CAP_ALLOW_FREQ_HC 133
#ifdef KVM_CAP_IRQ_ROUTING
@@ -1281,6 +1282,8 @@
#define KVM_S390_GET_IRQ_STATE _IOW(KVMIO, 0xb6, struct kvm_s390_irq_state)
/* Available with KVM_CAP_X86_SMM */
#define KVM_SMI _IO(KVMIO, 0xb7)
+#define KVM_SET_VCPU_ALLOW_FREQ_HC _IO(KVMIO, 0xb8)
+#define KVM_GET_VCPU_ALLOW_FREQ_HC _IO(KVMIO, 0xb9)
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
Index: kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h
===================================================================
--- kvm-pvfreq.orig/arch/x86/include/uapi/asm/kvm.h 2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/arch/x86/include/uapi/asm/kvm.h 2017-01-31 10:32:38.000389402 -0200
@@ -357,4 +357,9 @@
#define KVM_X86_QUIRK_LINT0_REENABLED (1 << 0)
#define KVM_X86_QUIRK_CD_NW_CLEARED (1 << 1)
+struct kvm_vcpu_allow_freq {
+ __u16 enable;
+ __u16 pad[7];
+};
+
#endif /* _ASM_X86_KVM_H */
Index: kvm-pvfreq/virt/kvm/kvm_main.c
===================================================================
--- kvm-pvfreq.orig/virt/kvm/kvm_main.c 2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/virt/kvm/kvm_main.c 2017-01-31 10:32:38.001389404 -0200
@@ -2938,6 +2938,8 @@
#endif
case KVM_CAP_MAX_VCPU_ID:
return KVM_MAX_VCPU_ID;
+ case KVM_CAP_ALLOW_FREQ_HC:
+ return 1;
default:
break;
}
Index: kvm-pvfreq/arch/x86/include/asm/kvm_host.h
===================================================================
--- kvm-pvfreq.orig/arch/x86/include/asm/kvm_host.h 2017-01-31 10:32:33.023378783 -0200
+++ kvm-pvfreq/arch/x86/include/asm/kvm_host.h 2017-01-31 10:32:38.001389404 -0200
@@ -678,6 +678,8 @@
/* GPA available (AMD only) */
bool gpa_available;
+
+ bool allow_freq_hypercall;
};
struct kvm_lpage_info {
* [patch 3/3] KVM: x86: frequency change hypercalls
2017-03-01 15:04 [patch 0/3] KVM CPU frequency change hypercalls (resend) Marcelo Tosatti
2017-03-01 15:04 ` [patch 1/3] cpufreq: implement min/max/up/down functions Marcelo Tosatti
2017-03-01 15:04 ` [patch 2/3] KVM: x86: introduce ioctl to allow frequency hypercalls Marcelo Tosatti
@ 2017-03-01 15:04 ` Marcelo Tosatti
2017-03-02 10:15 ` [patch 0/3] KVM CPU frequency change hypercalls (resend) Paolo Bonzini
3 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2017-03-01 15:04 UTC (permalink / raw)
To: kvm, linux-pm
Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
Marcelo Tosatti
[-- Attachment #1: 3 --]
[-- Type: text/plain, Size: 4840 bytes --]
Implement min/max/up/down frequency change
KVM hypercalls. To be used by DPDK implementation.
Also allow such hypercalls from guest userspace.
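The resulting permission rule can be modelled in a few lines of standalone C. This is a sketch of the logic only: hypercall_permitted is a hypothetical name, and the hypercall numbers are the ones proposed in this series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypercall numbers as proposed in this series (include/uapi/linux/kvm_para.h). */
#define KVM_HC_FREQ_UP   10
#define KVM_HC_FREQ_DOWN 11
#define KVM_HC_FREQ_MAX  12
#define KVM_HC_FREQ_MIN  13

static bool is_freq_hypercall(unsigned long nr)
{
    return nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
           nr == KVM_HC_FREQ_MAX || nr == KVM_HC_FREQ_MIN;
}

/*
 * Model of the check: frequency hypercalls skip the CPL test when the
 * per-vCPU flag is set; every other hypercall still requires CPL 0.
 * cpl is the guest privilege level (0 = kernel, 3 = userspace).
 */
static bool hypercall_permitted(unsigned long nr, int cpl,
                                bool allow_freq_hypercall)
{
    assert(cpl >= 0 && cpl <= 3);
    if (is_freq_hypercall(nr) && allow_freq_hypercall)
        return true;
    return cpl == 0;
}
```

Note that with the flag cleared, frequency hypercalls are still accepted from guest kernel mode; the flag only widens access to guest userspace.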
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
Documentation/virtual/kvm/hypercalls.txt | 45 +++++++++++++++++++
arch/x86/kvm/x86.c | 71 ++++++++++++++++++++++++++++++-
include/uapi/linux/kvm_para.h | 5 ++
3 files changed, 120 insertions(+), 1 deletion(-)
Index: kvm-pvfreq/arch/x86/kvm/x86.c
===================================================================
--- kvm-pvfreq.orig/arch/x86/kvm/x86.c 2017-02-02 11:17:17.063756725 -0200
+++ kvm-pvfreq/arch/x86/kvm/x86.c 2017-02-02 11:17:17.822752510 -0200
@@ -6219,10 +6219,58 @@
kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
}
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+/* call into cpufreq-userspace governor */
+static int kvm_pvfreq_up(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_up(cpu);
+ put_cpu();
+
+ return ret;
+}
+
+static int kvm_pvfreq_down(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_down(cpu);
+ put_cpu();
+
+ return ret;
+}
+
+static int kvm_pvfreq_max(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_max(cpu);
+ put_cpu();
+
+ return ret;
+}
+
+static int kvm_pvfreq_min(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_min(cpu);
+ put_cpu();
+
+ return ret;
+}
+#endif
+
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
unsigned long nr, a0, a1, a2, a3, ret;
int op_64_bit, r;
+ bool cpl_check;
r = kvm_skip_emulated_instruction(vcpu);
@@ -6246,7 +6294,13 @@
a3 &= 0xFFFFFFFF;
}
- if (kvm_x86_ops->get_cpl(vcpu) != 0) {
+ cpl_check = true;
+ if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
+ nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
+ if (vcpu->arch.allow_freq_hypercall == true)
+ cpl_check = false;
+
+ if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
ret = -KVM_EPERM;
goto out;
}
@@ -6262,6 +6316,21 @@
case KVM_HC_CLOCK_PAIRING:
ret = kvm_pv_clock_pairing(vcpu, a0, a1);
break;
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+ case KVM_HC_FREQ_UP:
+ ret = kvm_pvfreq_up(vcpu);
+ break;
+ case KVM_HC_FREQ_DOWN:
+ ret = kvm_pvfreq_down(vcpu);
+ break;
+ case KVM_HC_FREQ_MAX:
+ ret = kvm_pvfreq_max(vcpu);
+ break;
+ case KVM_HC_FREQ_MIN:
+ ret = kvm_pvfreq_min(vcpu);
+ break;
+#endif
+
default:
ret = -KVM_ENOSYS;
break;
Index: kvm-pvfreq/include/uapi/linux/kvm_para.h
===================================================================
--- kvm-pvfreq.orig/include/uapi/linux/kvm_para.h 2017-02-02 10:51:53.741217306 -0200
+++ kvm-pvfreq/include/uapi/linux/kvm_para.h 2017-02-02 11:17:17.824752499 -0200
@@ -25,6 +25,11 @@
#define KVM_HC_MIPS_EXIT_VM 7
#define KVM_HC_MIPS_CONSOLE_OUTPUT 8
#define KVM_HC_CLOCK_PAIRING 9
+#define KVM_HC_FREQ_UP 10
+#define KVM_HC_FREQ_DOWN 11
+#define KVM_HC_FREQ_MAX 12
+#define KVM_HC_FREQ_MIN 13
+
/*
* hypercalls use architecture specific
Index: kvm-pvfreq/Documentation/virtual/kvm/hypercalls.txt
===================================================================
--- kvm-pvfreq.orig/Documentation/virtual/kvm/hypercalls.txt 2017-02-02 10:51:53.741217306 -0200
+++ kvm-pvfreq/Documentation/virtual/kvm/hypercalls.txt 2017-02-02 15:29:24.401692793 -0200
@@ -116,3 +116,48 @@
Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+7. KVM_HC_FREQ_UP
+-----------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to increase frequency to the next
+higher frequency.
+Usage example: DPDK power aware applications, that run on
+isolated CPUs. No input argument, returns 0 if success,
+1 if already at highest frequency, negative error otherwise.
+
+8. KVM_HC_FREQ_DOWN
+---------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to decrease frequency to the next
+lower frequency.
+Usage example: DPDK power aware applications, that run on
+isolated CPUs. No input argument, returns 0 if success,
+1 if already at lowest frequency, negative error otherwise.
+
+9. KVM_HC_FREQ_MIN
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to decrease frequency to the
+minimum frequency.
+Usage example: DPDK power aware applications, that run
+on isolated CPUs. No input argument, returns 0 if success,
+error otherwise.
+
+10. KVM_HC_FREQ_MAX
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to increase frequency to the
+maximum frequency.
+Usage example: DPDK power aware applications, that run
+on isolated CPUs. No input argument, returns 0 if success,
+error otherwise.
+
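For illustration, a guest-side loop that steps the frequency up until the host reports that no further step is possible might look like the sketch below. It assumes the return convention used by the code in this series (0 on success, 1 when already at the limit, negative on error). The hypercall_fn type, ramp_up and the fake_host stand-in are all hypothetical; in a real guest kernel the call would go through something like kvm_hypercall0().

```c
#include <assert.h>
#include <stddef.h>

#define KVM_HC_FREQ_UP 10 /* number proposed in this series */

/* Stand-in for the instruction that actually issues the hypercall. */
typedef long (*hypercall_fn)(unsigned long nr);

/*
 * Step up until the host returns 1 (no higher frequency) or an error.
 * Returns the number of successful steps, or the negative error code.
 * max_steps bounds the loop in case the host never reports the limit.
 */
static long ramp_up(hypercall_fn hc, unsigned int max_steps)
{
    long steps = 0;

    assert(hc != NULL);
    while ((unsigned long)steps < max_steps) {
        long ret = hc(KVM_HC_FREQ_UP);

        if (ret < 0)
            return ret;
        if (ret == 1)
            break;
        steps++;
    }
    return steps;
}

/* Fake host with four frequency levels (0..3), for demonstration only. */
static unsigned int fake_level;

static long fake_host(unsigned long nr)
{
    if (nr != KVM_HC_FREQ_UP)
        return -1; /* generic error for anything else */
    if (fake_level == 3)
        return 1;  /* already at the highest level */
    fake_level++;
    return 0;
}
```

Starting from the lowest level, ramp_up(fake_host, 16) takes three steps and then stops on the return value 1.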
* Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)
2017-03-01 15:04 [patch 0/3] KVM CPU frequency change hypercalls (resend) Marcelo Tosatti
` (2 preceding siblings ...)
2017-03-01 15:04 ` [patch 3/3] KVM: x86: frequency change hypercalls Marcelo Tosatti
@ 2017-03-02 10:15 ` Paolo Bonzini
2017-03-02 13:59 ` Marcelo Tosatti
3 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-02 10:15 UTC (permalink / raw)
To: Marcelo Tosatti, kvm, linux-pm
Cc: Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
On 01/03/2017 16:04, Marcelo Tosatti wrote:
>
> Paolo: please comment on your objections and what should
> be done instead. Note the case "multiple vcpus
> on a given pcpu" is not part of the usecase in question.
I would like to understand the intended usecase of cpufreq-userspace.
My understanding is that you would have a daemon handling a systemwide
policy; examples are the historical (and now obsolete) users such as
cpufreqd, cpudyn, powernowd, or cpuspeed.
The user alternatively can play the role of the daemon by writing to
sysfs, but I've never seen userspace tasks talking to cpufreq-userspace
to set their own running frequency.
Apparently DPDK does that, and I would like to know the opinion of the
linux-pm folks; one obvious downside is that any application that you
run after DPDK will have its CPU frequency hardcoded to something that
is not appropriate. This might be acceptable for DPDK, but it is worse
for KVM which tries to provide isolation to its vCPU tasks.
Here are two possibilities that I could think of:
1) Introduce a mechanism that allows a task to override the governor's
choice of CPU frequency. This could be a ioctl, a prctl, a cgroup-based
mechanism or whatever else. As Marcelo pointed out in the original kvm@
thread, the latency and overhead of switching frequencies make it
impractical to associate a desired CPU frequency with a task, because
multiple tasks could be requesting a given frequency. One possibility
could be to treat the per-task CPU frequency as advisory and only obey
it in restricted cases---for example only if nohz_full is in effect.
2) In the KVM API, the userspace program that enables the hypercalls
must pass writable file descriptors for the physical CPU's
scaling_setspeed files. These file descriptors act as a "proof" that
the userspace program could anyway modify the speed. However, this is
just a very handwavy description, and in particular I haven't thought
very much of how to handle either task migration or CPU hotplug.
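As a minimal sketch of the "proof" check in option 2, assuming the kernel side would at least test the access mode of the descriptor it is handed: the userspace equivalent uses fcntl(F_GETFL). The function name fd_is_writable is hypothetical, and a real implementation would also have to confirm that the descriptor refers to the right CPU's scaling_setspeed file, which this sketch does not do.

```c
#include <fcntl.h>
#include <stdbool.h>

/*
 * Return true if fd was opened with write access (O_WRONLY or O_RDWR).
 * Invalid descriptors are reported as not writable.
 */
static bool fd_is_writable(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    int acc;

    if (flags < 0)
        return false;
    acc = flags & O_ACCMODE;
    return acc == O_WRONLY || acc == O_RDWR;
}
```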
Thanks in advance to anyone contributing to the discussions.
Paolo
* Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)
2017-03-02 10:15 ` [patch 0/3] KVM CPU frequency change hypercalls (resend) Paolo Bonzini
@ 2017-03-02 13:59 ` Marcelo Tosatti
2017-03-14 16:40 ` Paolo Bonzini
0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2017-03-02 13:59 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-pm, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
On Thu, Mar 02, 2017 at 11:15:00AM +0100, Paolo Bonzini wrote:
>
>
> On 01/03/2017 16:04, Marcelo Tosatti wrote:
> >
> > Paolo: please comment on your objections and what should
> > be done instead. Note the case "multiple vcpus
> > on a given pcpu" is not part of the usecase in question.
>
> I would like to understand the intended usecase of cpufreq-userspace.
>
> My understanding is that you would have a daemon handling a systemwide
> policy; examples are the historical (and now obsolete) users such as
> cpufreqd, cpudyn, powernowd, or cpuspeed.
>
> The user alternatively can play the role of the daemon by writing to
> sysfs, but I've never seen userspace tasks talking to cpufreq-userspace
> to set their own running frequency.
>
> Apparently DPDK does that, and I would like to know the opinion of the
> linux-pm folks;
Only from the number of in-use RX/TX queue entries can you correctly
set the processor frequency (for this use case, where the machine
performs only network processing).
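A queue-occupancy policy of this kind could look roughly like the sketch below. The thresholds and the names pick_freq_request and enum freq_request are assumptions made for illustration; they are not taken from DPDK.

```c
#include <assert.h>

enum freq_request { FREQ_NONE, FREQ_UP, FREQ_DOWN, FREQ_MAX };

/* Illustrative thresholds (percent of ring in use), not DPDK values. */
#define QUEUE_BURST_PCT 90
#define QUEUE_HIGH_PCT  60
#define QUEUE_LOW_PCT   20

/*
 * Map ring occupancy to a frequency request: jump straight to max when
 * the ring is nearly full, step up when it is filling, step down when
 * it is nearly empty, and otherwise leave the frequency alone.
 */
static enum freq_request pick_freq_request(unsigned int in_use,
                                           unsigned int ring_size)
{
    unsigned int pct;

    assert(ring_size > 0 && in_use <= ring_size);
    pct = in_use * 100 / ring_size;

    if (pct >= QUEUE_BURST_PCT)
        return FREQ_MAX;
    if (pct >= QUEUE_HIGH_PCT)
        return FREQ_UP;
    if (pct <= QUEUE_LOW_PCT)
        return FREQ_DOWN;
    return FREQ_NONE;
}
```

Each non-NONE result would map onto one of the four hypercalls proposed in this series.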
> one obvious downside is that any application that you
> run after DPDK will have its CPU frequency hardcoded to something that
> is not appropriate.
To isolate the CPU where DPDK runs it is already necessary to perform
special procedures such as changing the cpumask of other tasks, changing
cpumask of interrupt handlers (to remove the isolated CPU from that
cpumask), etc. Changing the cpufreq governor to userspace is another
step of that setup phase.
On shutdown (or CPU unpin), you can switch back the CPU to the previous
governor, which can switch the frequency to whatever it finds suitable.
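The save-and-restore step can be sketched from userspace as follows. The helper names are hypothetical; on a real host the path would be /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor for the isolated CPU, and writing it requires appropriate permissions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the current governor name from `path` into buf (newline stripped). */
static int governor_read(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");

    assert(buf != NULL && len > 0);
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Write a governor name to `path`. Returns 0 on success, -1 on failure. */
static int governor_write(const char *path, const char *name)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = fputs(name, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Switch to `name`, remembering the previous governor in `saved`. */
static int governor_switch(const char *path, const char *name,
                           char *saved, size_t len)
{
    if (governor_read(path, saved, len) != 0)
        return -1;
    return governor_write(path, name);
}
```

At setup time the caller would run governor_switch(path, "userspace", saved, sizeof(saved)), and on shutdown or CPU unpin governor_write(path, saved).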
> This might be acceptable for DPDK, but it is worse
> for KVM which tries to provide isolation to its vCPU tasks.
Well, in this case you know that the only program executing
on the CPU is the one handling network packets, and therefore you allow
that program to control the frequency.
> Here are two possibilities that I could think of:
>
> 1) Introduce a mechanism that allows a task to override the governor's
> choice of CPU frequency. This could be a ioctl, a prctl, a cgroup-based
> mechanism or whatever else. As Marcelo pointed out in the original kvm@
> thread, the latency and overhead of switching frequencies make it
> impractical to associate a desired CPU frequency with a task, because
> multiple tasks could be requesting a given frequency. One possibility
> could be to treat the per-task CPU frequency as advisory
DPDK can't afford the frequency as advisory: failure in setting the
processor frequency when requested means dropped packets (not
dropping packets being a requirement).
> and only obey
> it in restricted cases---for example only if nohz_full is in effect.
From cpufreq documentation:
"On all other cpufreq implementations, these boundaries still need to
be set. Then, a "governor" must be selected. Such a "governor" decides
what speed the processor shall run within the boundaries. One such
"governor" is the "userspace" governor. This one allows the user - or
a yet-to-implement userspace program - to decide what specific speed
the processor shall run at."
(it seems the cpufreq-hypercall+cpufreq-userspace combination is in
accord with what cpufreq-userspace has been designed for).
Secondly, setting frequencies for multiple tasks is somewhat
contradictory:
In the DPDK context, or in any context actually, it makes sense for a
program to lower processor frequency when it decides the current
frequency is sufficient to handle the job: that is lowering the
frequency will still make it possible to handle the load.
With multiple applications sharing that processor, the percentage
of time given to a certain application also interferes with the
time it spends handling the job. So the other variable that
affects "instructions per second" is timeslice given to the
task by the scheduler, not only "frequency".
Having a task request for a particular frequency in that case becomes
ambiguous: you could be asking for "increased timeslice".
* Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)
2017-03-02 13:59 ` Marcelo Tosatti
@ 2017-03-14 16:40 ` Paolo Bonzini
2017-03-14 23:27 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-14 16:40 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: kvm, linux-pm, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
On 02/03/2017 14:59, Marcelo Tosatti wrote:
> On Thu, Mar 02, 2017 at 11:15:00AM +0100, Paolo Bonzini wrote:
>> one obvious downside is that any application that you
>> run after DPDK will have its CPU frequency hardcoded to something that
>> is not appropriate.
>
> To isolate the CPU where DPDK runs it is already necessary to perform
> special procedures such as changing the cpumask of other tasks, changing
> cpumask of interrupt handlers (to remove the isolated CPU from that
> cpumask), etc. Changing the cpufreq governor to userspace is another
> step of that setup phase.
>
> On shutdown (or CPU unpin), you can switch back the CPU to the previous
> governor, which can switch the frequency to whatever it finds suitable.
But I thought that one of the reasons to do NFV is to simplify this
setup. If you now have to do the same thing on virtual machines, things
become more complicated to set up, and I don't think that NFV virtual
machines are _that_ special.
In addition, in the list of setup steps above you forgot "chmod the
sysfs files for cpufreq so that DPDK can access it". Doing that chmod
is a very explicit act, and that's unlike the functionality of this patch.
By letting virtual machines do the same with a simple hypercall, you're
giving powers to whoever opens /dev/kvm that they didn't have before
(unless the userspace process also had access to sysfs). Worse, the
effects last beyond the moment /dev/kvm is closed.
So, the question then is how to design the hypervisor so that these NFV
virtual machines can play with cpufreq, but there are no adverse
indefinite effects. One possibility is to have some kind of per-task
cpufreq. Another is to do everything in userspace with virtual ACPI
P-states and the userspace governor in the VM.
I was hoping to get more feedback from linux-pm.
>> Here are two possibilities that I could think of:
>>
>> 1) Introduce a mechanism that allows a task to override the governor's
>> choice of CPU frequency. This could be a ioctl, a prctl, a cgroup-based
>> mechanism or whatever else. As Marcelo pointed out in the original kvm@
>> thread, the latency and overhead of switching frequencies make it
>> impractical to associate a desired CPU frequency with a task, because
>> multiple tasks could be requesting a given frequency. One possibility
>> could be to treat the per-task CPU frequency as advisory
>
> DPDK can't afford the frequency as advisory: failure in setting the
> processor frequency when requested means dropped packets (not
> dropping packets being a requirement).
It can be advisory if you document a proper configuration where it's obeyed.
Paolo
>> and only obey
>> it in restricted cases---for example only if nohz_full is in effect.
>
> From cpufreq documentation:
>
> "On all other cpufreq implementations, these boundaries still need to
> be set. Then, a "governor" must be selected. Such a "governor" decides
> what speed the processor shall run within the boundaries. One such
> "governor" is the "userspace" governor. This one allows the user - or
> a yet-to-implement userspace program - to decide what specific speed
> the processor shall run at."
>
> (it seems the cpufreq-hypercall+cpufreq-userspace combination is in
> accord with what cpufreq-userspace has been designed for).
>
> Secondly, setting frequencies for multiple tasks is somewhat
> contradictory:
>
> In the DPDK context, or in any context actually, it makes sense for a
> program to lower processor frequency when it decides the current
> frequency is sufficient to handle the job: that is lowering the
> frequency will still make it possible to handle the load.
>
> With multiple applications sharing that processor, the percentage
> of time given to a certain application also interferes with the
> time it spends handling the job. So the other variable that
> affects "instructions per second" is timeslice given to the
> task by the scheduler, not only "frequency".
>
> Having a task request for a particular frequency in that case becomes
> ambiguous: you could be asking for "increased timeslice".
* Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)
2017-03-14 16:40 ` Paolo Bonzini
@ 2017-03-14 23:27 ` Marcelo Tosatti
2017-03-15 8:23 ` Paolo Bonzini
0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2017-03-14 23:27 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-pm, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
Hi Paolo,
On Tue, Mar 14, 2017 at 05:40:21PM +0100, Paolo Bonzini wrote:
>
>
> On 02/03/2017 14:59, Marcelo Tosatti wrote:
> > On Thu, Mar 02, 2017 at 11:15:00AM +0100, Paolo Bonzini wrote:
> >> one obvious downside is that any application that you
> >> run after DPDK will have its CPU frequency hardcoded to something that
> >> is not appropriate.
> >
> > To isolate the CPU where DPDK runs it is already necessary to perform
> > special procedures such as changing the cpumask of other tasks, changing
> > cpumask of interrupt handlers (to remove the isolated CPU from that
> > cpumask), etc. Changing the cpufreq governor to userspace is another
> > step of that setup phase.
> >
> > On shutdown (or CPU unpin), you can switch back the CPU to the previous
> > governor, which can switch the frequency to whatever it finds suitable.
>
> But I thought that one of the reasons to do NFV is to simplify this
> setup. If you now have to do the same thing on virtual machines, things
> become more complicated to set up, and I don't think that NFV virtual
> machines are _that_ special.
>
> In addition, in the list of setup steps above you forgot "chmod the
> sysfs files for cpufreq so that DPDK can access it". Doing that chmod
> is a very explicit act, and that's unlike the functionality of this patch.
>
> By letting virtual machines do the same with a simple hypercall, you're
> giving powers to whoever opens /dev/kvm that they didn't have before
> (unless the userspace process also had access to sysfs). Worse, the
> effects last beyond the moment /dev/kvm is closed.
This can be fixed by requiring the qemu-kvm vcpu thread, which runs
the hypercall, to have sufficient priority (similar to other cpufreq
users). Fine, good point.
> So, the question then is how to design the hypervisor so that these NFV
> virtual machines can play with cpufreq, but there are no adverse
> indefinite effects.
Ok, we can build on the cpufreq cgroups patch series:
"The first three patches of this series introduces
capacity_{min,max} tracking
in the core scheduler, as an extension of the CPU controller."
and have the hypercalls set capacity_min == capacity_max (which forces
the CPU to run at that frequency, given there are no other tasks
requesting frequency information on that CPU).
This is good enough for DPDK.
> One possibility is to have some kind of per-task
> cpufreq. Another is to do everything in userspace with virtual ACPI
> P-states and the userspace governor in the VM.
Virtual ACPI P-states are an option. But why not make it
in-kernel? The exit to userspace can be a significant
fraction of the total cost if the frequency change time is small (say,
10us for the frequency change and 5us for the userspace exit).
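Taking those example figures at face value (they are illustrative numbers from this mail, not measurements), the share of each request spent on the userspace exit would be:

```latex
\frac{t_{\text{exit}}}{t_{\text{change}} + t_{\text{exit}}}
  = \frac{5\,\mu\text{s}}{10\,\mu\text{s} + 5\,\mu\text{s}}
  = \frac{1}{3} \approx 33\%
```

Put differently, the exit would add 50% on top of the 10us frequency change itself.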
> I was hoping to get more feedback from linux-pm.
>
> >> Here are two possibilities that I could think of:
> >>
> >> 1) Introduce a mechanism that allows a task to override the governor's
> >> choice of CPU frequency. This could be a ioctl, a prctl, a cgroup-based
> >> mechanism or whatever else. As Marcelo pointed out in the original kvm@
> >> thread, the latency and overhead of switching frequencies make it
> >> impractical to associate a desired CPU frequency with a task, because
> >> multiple tasks could be requesting a given frequency. One possibility
> >> could be to treat the per-task CPU frequency as advisory
> >
> > DPDK can't afford the frequency as advisory: failure in setting the
> > processor frequency when requested means dropped packets (not
> > dropping packets being a requirement).
>
> It can be advisory if you document a proper configuration where it's obeyed.
Sure.
>
> Paolo
>
> >> and only obey
> >> it in restricted cases---for example only if nohz_full is in effect.
> >
> > From cpufreq documentation:
> >
> > "On all other cpufreq implementations, these boundaries still need to
> > be set. Then, a "governor" must be selected. Such a "governor" decides
> > what speed the processor shall run within the boundaries. One such
> > "governor" is the "userspace" governor. This one allows the user - or
> > a yet-to-implement userspace program - to decide what specific speed
> > the processor shall run at."
> >
> > (it seems the cpufreq-hypercall+cpufreq-userspace combination is in
> > accord with what cpufreq-userspace has been designed for).
> >
> > Secondly, setting frequencies for multiple tasks is somewhat
> > contradictory:
> >
> > In the DPDK context, or in any context actually, it makes sense for a
> > program to lower processor frequency when it decides the current
> > frequency is sufficient to handle the job: that is lowering the
> > frequency will still make it possible to handle the load.
> >
> > With multiple applications sharing that processor, the percentage
> > of time given to a certain application also interferes with the
> > time it spends handling the job. So the other variable that
> > affects "instructions per second" is timeslice given to the
> > task by the scheduler, not only "frequency".
> >
> > Having a task request for a particular frequency in that case becomes
> > ambiguous: you could be asking for "increased timeslice".
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)
2017-03-14 23:27 ` Marcelo Tosatti
@ 2017-03-15 8:23 ` Paolo Bonzini
2017-03-15 18:30 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2017-03-15 8:23 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: kvm, linux-pm, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
On 15/03/2017 00:27, Marcelo Tosatti wrote:
>> So, the question then is how to design the hypervisor so that these NFV
>> virtual machines can play with cpufreq, but there are no adverse
>> indefinite effects.
> Ok, we can modify the cpufreq cgroups patch, to, from the hypercalls
> set the:
>
> "The first three patches of this series introduces
> capacity_{min,max} tracking
> in the core scheduler, as an extension of the CPU controller."
>
> capacity_min == capacity_max values (which forces the CPU to run
> at that frequency, given there are no other tasks requesting
> frequency information on that CPU).
>
> This is good enough for DPDK.
So this sounds like a plan?
>> One possibility is to have some kind of per-task
>> cpufreq. Another is to do everything in userspace with virtual ACPI
>> P-states and the userspace governor in the VM.
>
> Virtual ACPI P-state, that is an option. But why not make it
> in-kernel, the exit to userspace can be a significant
> fraction of the total if the frequency change time is small (say, 10us
> freq change and 5us for userspace exit).
The advantage of doing it in userspace is that the sysfs chmod is a
clear way to say "this VM should have the privilege of setting cpufreq".
In effect, userspace's file descriptor for the sysfs files represents
the capability to set cpufreq for the VM. You can even pass the file
descriptor with SCM_RIGHTS if you wish to do so.
But of course that's only needed if the frequency change is global per
physical CPU. If the CPU controller gains the ability to do per-task
frequency switching, that's even better for KVM. Then the hypercalls
are just fine and we can have a KVM-specific cpufreq controller.
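
The SCM_RIGHTS hand-off mentioned above can be sketched as follows. This
is only an illustration under assumptions: the helper names send_fd() and
recv_fd() are made up here, and the descriptor being passed would be
whatever sysfs cpufreq file the privileged process opened (for example the
userspace governor's scaling_setspeed file). Nothing in this thread
specifies how a VMM would actually wire this up.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Hypothetical helper: pass fd over the Unix-domain socket sock.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd)
{
	char dummy = 'F';	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive a descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
static int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver ends up with its own descriptor referring to the same open
file, so it can write to it without ever having had permission to open
the path itself; that is the sense in which the descriptor acts as a
capability.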
Paolo
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch 0/3] KVM CPU frequency change hypercalls (resend)
2017-03-15 8:23 ` Paolo Bonzini
@ 2017-03-15 18:30 ` Marcelo Tosatti
0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2017-03-15 18:30 UTC (permalink / raw)
To: Paolo Bonzini
Cc: kvm, linux-pm, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
On Wed, Mar 15, 2017 at 09:23:10AM +0100, Paolo Bonzini wrote:
>
>
> On 15/03/2017 00:27, Marcelo Tosatti wrote:
> >> So, the question then is how to design the hypervisor so that these NFV
> >> virtual machines can play with cpufreq, but there are no adverse
> >> indefinite effects.
> > Ok, we can modify the cpufreq cgroups patch, to, from the hypercalls
> > set the:
> >
> > "The first three patches of this series introduces
> > capacity_{min,max} tracking
> > in the core scheduler, as an extension of the CPU controller."
> >
> > capacity_min == capacity_max values (which forces the CPU to run
> > at that frequency, given there are no other tasks requesting
> > frequency information on that CPU).
> >
> > This is good enough for DPDK.
>
> So this sounds like a plan?
Yes, trying that now...
>
> >> One possibility is to have some kind of per-task
> >> cpufreq. Another is to do everything in userspace with virtual ACPI
> >> P-states and the userspace governor in the VM.
> >
> > Virtual ACPI P-state, that is an option. But why not make it
> > in-kernel, the exit to userspace can be a significant
> > fraction of the total if the frequency change time is small (say, 10us
> > freq change and 5us for userspace exit).
>
> The advantage of doing it in userspace is that the sysfs chmod is a
> clear way to say "this VM should have the privilege of setting cpufreq".
> In effect, userspace's file descriptor for the sysfs files represents
> the capability to set cpufreq for the VM. You can even pass the file
> descriptor with SCM_RIGHTS if you wish to do so.
>
> But of course that's only needed if the frequency change is global per
> physical CPU. If the CPU controller gains the ability to do per-task
> frequency switching, that's even better for KVM. Then the hypercalls
> are just fine and we can have a KVM-specific cpufreq controller.
>
> Paolo
I see, thanks.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [patch 3/3] KVM: x86: frequency change hypercalls
2017-02-02 17:47 [patch 0/3] KVM CPU frequency change hypercalls Marcelo Tosatti
@ 2017-02-02 17:47 ` Marcelo Tosatti
2017-02-02 18:01 ` Marcelo Tosatti
2017-02-03 17:40 ` Radim Krcmar
0 siblings, 2 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 17:47 UTC (permalink / raw)
To: kvm, linux-kernel
Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar,
Marcelo Tosatti
[-- Attachment #1: kvm-cpufreq-api --]
[-- Type: text/plain, Size: 4840 bytes --]
Implement min/max/up/down frequency change
KVM hypercalls. To be used by DPDK implementation.
Also allow such hypercalls from guest userspace.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
Documentation/virtual/kvm/hypercalls.txt | 45 +++++++++++++++++++
arch/x86/kvm/x86.c | 71 ++++++++++++++++++++++++++++++-
include/uapi/linux/kvm_para.h | 5 ++
3 files changed, 120 insertions(+), 1 deletion(-)
Index: kvm-pvfreq/arch/x86/kvm/x86.c
===================================================================
--- kvm-pvfreq.orig/arch/x86/kvm/x86.c 2017-02-02 11:17:17.063756725 -0200
+++ kvm-pvfreq/arch/x86/kvm/x86.c 2017-02-02 11:17:17.822752510 -0200
@@ -6219,10 +6219,58 @@
kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
}
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+/* call into cpufreq-userspace governor */
+static int kvm_pvfreq_up(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_up(cpu);
+ put_cpu();
+
+ return ret;
+}
+
+static int kvm_pvfreq_down(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_down(cpu);
+ put_cpu();
+
+ return ret;
+}
+
+static int kvm_pvfreq_max(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_max(cpu);
+ put_cpu();
+
+ return ret;
+}
+
+static int kvm_pvfreq_min(struct kvm_vcpu *vcpu)
+{
+ int ret;
+ int cpu = get_cpu();
+
+ ret = cpufreq_userspace_freq_min(cpu);
+ put_cpu();
+
+ return ret;
+}
+#endif
+
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
{
unsigned long nr, a0, a1, a2, a3, ret;
int op_64_bit, r;
+ bool cpl_check;
r = kvm_skip_emulated_instruction(vcpu);
@@ -6246,7 +6294,13 @@
a3 &= 0xFFFFFFFF;
}
- if (kvm_x86_ops->get_cpl(vcpu) != 0) {
+ cpl_check = true;
+ if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
+ nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
+ if (vcpu->arch.allow_freq_hypercall == true)
+ cpl_check = false;
+
+ if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
ret = -KVM_EPERM;
goto out;
}
@@ -6262,6 +6316,21 @@
case KVM_HC_CLOCK_PAIRING:
ret = kvm_pv_clock_pairing(vcpu, a0, a1);
break;
+#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
+ case KVM_HC_FREQ_UP:
+ ret = kvm_pvfreq_up(vcpu);
+ break;
+ case KVM_HC_FREQ_DOWN:
+ ret = kvm_pvfreq_down(vcpu);
+ break;
+ case KVM_HC_FREQ_MAX:
+ ret = kvm_pvfreq_max(vcpu);
+ break;
+ case KVM_HC_FREQ_MIN:
+ ret = kvm_pvfreq_min(vcpu);
+ break;
+#endif
+
default:
ret = -KVM_ENOSYS;
break;
Index: kvm-pvfreq/include/uapi/linux/kvm_para.h
===================================================================
--- kvm-pvfreq.orig/include/uapi/linux/kvm_para.h 2017-02-02 10:51:53.741217306 -0200
+++ kvm-pvfreq/include/uapi/linux/kvm_para.h 2017-02-02 11:17:17.824752499 -0200
@@ -25,6 +25,11 @@
#define KVM_HC_MIPS_EXIT_VM 7
#define KVM_HC_MIPS_CONSOLE_OUTPUT 8
#define KVM_HC_CLOCK_PAIRING 9
+#define KVM_HC_FREQ_UP 10
+#define KVM_HC_FREQ_DOWN 11
+#define KVM_HC_FREQ_MAX 12
+#define KVM_HC_FREQ_MIN 13
+
/*
* hypercalls use architecture specific
Index: kvm-pvfreq/Documentation/virtual/kvm/hypercalls.txt
===================================================================
--- kvm-pvfreq.orig/Documentation/virtual/kvm/hypercalls.txt 2017-02-02 10:51:53.741217306 -0200
+++ kvm-pvfreq/Documentation/virtual/kvm/hypercalls.txt 2017-02-02 15:29:24.401692793 -0200
@@ -116,3 +116,48 @@
Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
+
+7. KVM_HC_FREQ_UP
+-----------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to increase frequency to the next
+higher frequency.
+Usage example: DPDK power aware applications, that run on
+isolated CPUs. No input argument, returns 0 if success,
+1 if already at highest frequency, error otherwise.
+
+8. KVM_HC_FREQ_DOWN
+---------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to decrease frequency to the next
+lower frequency.
+Usage example: DPDK power aware applications, that run on
+isolated CPUs. No input argument, returns 0 if success,
+1 if already at lowest frequency, negative error otherwise.
+
+9. KVM_HC_FREQ_MIN
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to decrease frequency to the
+minimum frequency.
+Usage example: DPDK power aware applications, that run
+on isolated CPUs. No input argument, returns 0 if success,
+error otherwise.
+
+10. KVM_HC_FREQ_MAX
+-------------------
+
+Architecture: x86
+Status: active
+Purpose: Hypercall used to increase frequency to the
+maximum frequency.
+Usage example: DPDK power aware applications, that run
+on isolated CPUs. No input argument, returns 0 if success,
+error otherwise.
+
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch 3/3] KVM: x86: frequency change hypercalls
2017-02-02 17:47 ` [patch 3/3] KVM: x86: " Marcelo Tosatti
@ 2017-02-02 18:01 ` Marcelo Tosatti
2017-02-03 17:40 ` Radim Krcmar
1 sibling, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2017-02-02 18:01 UTC (permalink / raw)
To: kvm, linux-kernel
Cc: Paolo Bonzini, Radim Krcmar, Rafael J. Wysocki, Viresh Kumar
On Thu, Feb 02, 2017 at 03:47:58PM -0200, Marcelo Tosatti wrote:
> Implement min/max/up/down frequency change
> KVM hypercalls. To be used by DPDK implementation.
>
> Also allow such hypercalls from guest userspace.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> ---
> Documentation/virtual/kvm/hypercalls.txt | 45 +++++++++++++++++++
> arch/x86/kvm/x86.c | 71 ++++++++++++++++++++++++++++++-
> include/uapi/linux/kvm_para.h | 5 ++
> 3 files changed, 120 insertions(+), 1 deletion(-)
>
> Index: kvm-pvfreq/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/kvm/x86.c 2017-02-02 11:17:17.063756725 -0200
> +++ kvm-pvfreq/arch/x86/kvm/x86.c 2017-02-02 11:17:17.822752510 -0200
> @@ -6219,10 +6219,58 @@
> kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
> }
>
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
> +/* call into cpufreq-userspace governor */
> +static int kvm_pvfreq_up(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> + int cpu = get_cpu();
> +
> + ret = cpufreq_userspace_freq_up(cpu);
> + put_cpu();
> +
> + return ret;
> +}
> +
> +static int kvm_pvfreq_down(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> + int cpu = get_cpu();
> +
> + ret = cpufreq_userspace_freq_down(cpu);
> + put_cpu();
> +
> + return ret;
> +}
> +
> +static int kvm_pvfreq_max(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> + int cpu = get_cpu();
> +
> + ret = cpufreq_userspace_freq_max(cpu);
> + put_cpu();
> +
> + return ret;
> +}
> +
> +static int kvm_pvfreq_min(struct kvm_vcpu *vcpu)
> +{
> + int ret;
> + int cpu = get_cpu();
> +
> + ret = cpufreq_userspace_freq_min(cpu);
> + put_cpu();
> +
> + return ret;
> +}
> +#endif
> +
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> unsigned long nr, a0, a1, a2, a3, ret;
> int op_64_bit, r;
> + bool cpl_check;
>
> r = kvm_skip_emulated_instruction(vcpu);
>
> @@ -6246,7 +6294,13 @@
> a3 &= 0xFFFFFFFF;
> }
>
> - if (kvm_x86_ops->get_cpl(vcpu) != 0) {
> + cpl_check = true;
> + if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
> + nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
> + if (vcpu->arch.allow_freq_hypercall == true)
> + cpl_check = false;
> +
> + if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
> ret = -KVM_EPERM;
> goto out;
This should fail with EPERM if vcpu->arch.allow_freq_hypercall ==
false, regardless of the CPL.
Will resend with that (and other comments) in v2.
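
A minimal standalone sketch of the check described above, for
illustration only. The hypercall numbers are the ones proposed in this
patch; hypercall_permission() and is_freq_hypercall() are hypothetical
names, and cpl and allow_freq_hypercall are passed in as plain arguments
instead of being read from the vcpu. Frequency hypercalls are gated only
by the per-vcpu flag, everything else keeps the existing CPL 0
requirement.

```c
#include <errno.h>
#include <stdbool.h>

/* Local mirrors of include/uapi/linux/kvm_para.h for this sketch. */
#define KVM_EPERM		EPERM
#define KVM_HC_FREQ_UP		10
#define KVM_HC_FREQ_DOWN	11
#define KVM_HC_FREQ_MAX		12
#define KVM_HC_FREQ_MIN		13

/* Hypothetical helper: true for the four frequency hypercalls. */
static bool is_freq_hypercall(unsigned long nr)
{
	return nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
	       nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX;
}

/* Hypothetical helper: returns 0 if hypercall nr may proceed,
 * -KVM_EPERM otherwise. */
static int hypercall_permission(unsigned long nr, int cpl,
				bool allow_freq_hypercall)
{
	/* Frequency hypercalls: the flag decides, CPL is ignored. */
	if (is_freq_hypercall(nr))
		return allow_freq_hypercall ? 0 : -KVM_EPERM;

	/* All other hypercalls: guest kernel (CPL 0) only. */
	return cpl == 0 ? 0 : -KVM_EPERM;
}
```

With this shape, a guest kernel at CPL 0 can no longer issue frequency
hypercalls when the flag is clear, which the posted hunk would have
allowed.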
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch 3/3] KVM: x86: frequency change hypercalls
2017-02-02 17:47 ` [patch 3/3] KVM: x86: " Marcelo Tosatti
2017-02-02 18:01 ` Marcelo Tosatti
@ 2017-02-03 17:40 ` Radim Krcmar
2017-02-03 18:24 ` Marcelo Tosatti
1 sibling, 1 reply; 15+ messages in thread
From: Radim Krcmar @ 2017-02-03 17:40 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar
2017-02-02 15:47-0200, Marcelo Tosatti:
> Implement min/max/up/down frequency change
> KVM hypercalls. To be used by DPDK implementation.
>
> Also allow such hypercalls from guest userspace.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> ---
> Index: kvm-pvfreq/arch/x86/kvm/x86.c
> ===================================================================
> --- kvm-pvfreq.orig/arch/x86/kvm/x86.c 2017-02-02 11:17:17.063756725 -0200
> +++ kvm-pvfreq/arch/x86/kvm/x86.c 2017-02-02 11:17:17.822752510 -0200
> @@ -6219,10 +6219,58 @@
[Here lived copy-paste.]
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> {
> unsigned long nr, a0, a1, a2, a3, ret;
> int op_64_bit, r;
> + bool cpl_check;
>
> r = kvm_skip_emulated_instruction(vcpu);
>
> @@ -6246,7 +6294,13 @@
> a3 &= 0xFFFFFFFF;
> }
>
> - if (kvm_x86_ops->get_cpl(vcpu) != 0) {
> + cpl_check = true;
> + if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
> + nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
> + if (vcpu->arch.allow_freq_hypercall == true)
> + cpl_check = false;
> +
> + if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
> ret = -KVM_EPERM;
> goto out;
> }
> @@ -6262,6 +6316,21 @@
> case KVM_HC_CLOCK_PAIRING:
> ret = kvm_pv_clock_pairing(vcpu, a0, a1);
> break;
> +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
CONFIG_CPU_FREQ_GOV_USERSPACE should be checked when enabling the
capability.
> + case KVM_HC_FREQ_UP:
> + ret = kvm_pvfreq_up(vcpu);
> + break;
> + case KVM_HC_FREQ_DOWN:
> + ret = kvm_pvfreq_down(vcpu);
> + break;
> + case KVM_HC_FREQ_MAX:
> + ret = kvm_pvfreq_max(vcpu);
> + break;
> + case KVM_HC_FREQ_MIN:
> + ret = kvm_pvfreq_min(vcpu);
> + break;
Having 4 hypercalls for this is overkill.
You can make it one hypercall with an argument.
And the argument doesn't have to be enum {UP, DOWN, MAX, MIN}, but an
int, which would also allow you to do -2 steps.
A number over the capabilities of stepping would just map to MAX/MIN.
Avoiding an absolute scale for interface simplifies migration, where the
guest cannot really depend much on this. Except that calling it with
MIN (INT_MIN) will get the minimum and MAX (INT_MAX) the maximum
frequency.
Please explicitly say in documentation that things like the number of
steps, which the guest can learn by doing MAX and then -1 until the
hypercall fails, is undefined and should not be depended upon.
Userspace might still want to know the number of steps to avoid useless
hypercall -- I think we should return a different value when the limit
is reached, not just after the guest wants to go past it.
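
The single-hypercall interface suggested above could look roughly like
this on the host side. It is a sketch under assumptions, not the proposed
implementation: struct pvfreq_state, pvfreq_step() and the two return
codes are hypothetical, and the frequency table is reduced to an index
range. The step count is clamped, so INT_MIN and INT_MAX always land on
the minimum and maximum, and the "limit" return value is given as soon as
the limit is reached.

```c
#include <limits.h>

/* Hypothetical return values; the thread does not fix actual numbers. */
#define PVFREQ_OK	0	/* frequency set, not at a limit */
#define PVFREQ_AT_LIMIT	1	/* frequency is now the minimum or maximum */

/* Hypothetical host-side state for one pCPU. */
struct pvfreq_state {
	int cur;	/* current index, 0 is the lowest frequency */
	int nr_steps;	/* table size; host-specific, not guest ABI */
};

/* Move delta steps (negative is down), clamping to the table bounds. */
static int pvfreq_step(struct pvfreq_state *s, int delta)
{
	/* Widen first so that cur + INT_MAX cannot overflow. */
	long long target = (long long)s->cur + delta;
	int last = s->nr_steps - 1;

	if (target < 0)
		target = 0;
	if (target > last)
		target = last;
	s->cur = (int)target;

	/* Report the limit on arrival, not only on the next attempt. */
	return (s->cur == 0 || s->cur == last) ? PVFREQ_AT_LIMIT : PVFREQ_OK;
}
```

A guest that sees PVFREQ_AT_LIMIT after a step knows that further steps
in the same direction are pointless, without learning how many steps the
host has.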
> +#endif
> +
> default:
> ret = -KVM_ENOSYS;
> break;
And thinking more about migration, userspace cannot learn the current
frequency (at least MIN/MAX), so the new host will just pick at random,
which will break userspace's expectations that it cannot increase or
decrease the frequency. Is migration left for the future, because DPDK
doesn't migrate anyway?
Thanks.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch 3/3] KVM: x86: frequency change hypercalls
2017-02-03 17:40 ` Radim Krcmar
@ 2017-02-03 18:24 ` Marcelo Tosatti
2017-02-03 19:28 ` Radim Krcmar
0 siblings, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2017-02-03 18:24 UTC (permalink / raw)
To: Radim Krcmar
Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar
On Fri, Feb 03, 2017 at 06:40:34PM +0100, Radim Krcmar wrote:
> 2017-02-02 15:47-0200, Marcelo Tosatti:
> > Implement min/max/up/down frequency change
> > KVM hypercalls. To be used by DPDK implementation.
> >
> > Also allow such hypercalls from guest userspace.
> >
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> >
> > ---
> > Index: kvm-pvfreq/arch/x86/kvm/x86.c
> > ===================================================================
> > --- kvm-pvfreq.orig/arch/x86/kvm/x86.c 2017-02-02 11:17:17.063756725 -0200
> > +++ kvm-pvfreq/arch/x86/kvm/x86.c 2017-02-02 11:17:17.822752510 -0200
> > @@ -6219,10 +6219,58 @@
>
> [Here lived copy-paste.]
>
> > int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > {
> > unsigned long nr, a0, a1, a2, a3, ret;
> > int op_64_bit, r;
> > + bool cpl_check;
> >
> > r = kvm_skip_emulated_instruction(vcpu);
> >
> > @@ -6246,7 +6294,13 @@
> > a3 &= 0xFFFFFFFF;
> > }
> >
> > - if (kvm_x86_ops->get_cpl(vcpu) != 0) {
> > + cpl_check = true;
> > + if (nr == KVM_HC_FREQ_UP || nr == KVM_HC_FREQ_DOWN ||
> > + nr == KVM_HC_FREQ_MIN || nr == KVM_HC_FREQ_MAX)
> > + if (vcpu->arch.allow_freq_hypercall == true)
> > + cpl_check = false;
> > +
> > + if (cpl_check == true && kvm_x86_ops->get_cpl(vcpu) != 0) {
> > ret = -KVM_EPERM;
> > goto out;
> > }
> > @@ -6262,6 +6316,21 @@
> > case KVM_HC_CLOCK_PAIRING:
> > ret = kvm_pv_clock_pairing(vcpu, a0, a1);
> > break;
> > +#ifdef CONFIG_CPU_FREQ_GOV_USERSPACE
>
> CONFIG_CPU_FREQ_GOV_USERSPACE should be checked when enabling the
> capability.
>
> > + case KVM_HC_FREQ_UP:
> > + ret = kvm_pvfreq_up(vcpu);
> > + break;
> > + case KVM_HC_FREQ_DOWN:
> > + ret = kvm_pvfreq_down(vcpu);
> > + break;
> > + case KVM_HC_FREQ_MAX:
> > + ret = kvm_pvfreq_max(vcpu);
> > + break;
> > + case KVM_HC_FREQ_MIN:
> > + ret = kvm_pvfreq_min(vcpu);
> > + break;
>
> Having 4 hypercalls for this is an overkill.
> You can make it one hypercall with an argument.
Fine.
> And the argument doesn't have to be enum {UP, DOWN, MAX, MIN}, but an
> int, which would also allow you to do -2 steps.
Are you suggesting to have an integer to signify the number of steps up
or down?
> A number over the capabilities of stepping would just map to MAX/MIN.
Then MAX == any positive value above the number of steps
MIN == any negative value below the negative of number of steps
Sure.
> Avoiding an absolute scale for interface simplifies migration, where the
> guest cannot really depend much on this. Except that calling it with
> MIN (INT_MIN) will get the minimum and MAX (INT_MAX) the maximum
> frequency.
Are you suggesting for the hypercall to return the maximum/minimum
frequency if called with the highest integer and lowest negative integer
respectively? (That same hypercall).
Sure.
> Please explicitly say in documentation that things like the number of
> steps, which the guest can learn by doing MAX and then -1 until the
> hypercall fails, is undefined and should not be depended upon.
Sure, because it fails over migration.
> Userspace might still want to know the number of steps to avoid useless
> hypercall -- I think we should return a different value when the limit
> is reached, not just after the guest wants to go past it.
Are you suggesting to return a different value when going from
max-1 -> max
and
min+1 -> min
frequencies?
Fine.
> > +#endif
> > +
> > default:
> > ret = -KVM_ENOSYS;
> > break;
>
> And thinking more about migration, userspace cannot learn the current
> frequency (at least MIN/MAX), so the new host will just pick at random,
> which will break userspace's expectations that it cannot increase or
> decrease the frequency. Is migration left for the future, because DPDK
> doesn't migrate anyway?
>
> Thanks.
The new host should start with the highest frequency always. Then
the frequency tuning algorithm can reduce frequency afterwards.
Migration is a desired feature for DPDK, so it should be supported
(that's one reason why virtio-net drivers are used in the guest BTW).
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [patch 3/3] KVM: x86: frequency change hypercalls
2017-02-03 18:24 ` Marcelo Tosatti
@ 2017-02-03 19:28 ` Radim Krcmar
0 siblings, 0 replies; 15+ messages in thread
From: Radim Krcmar @ 2017-02-03 19:28 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: kvm, linux-kernel, Paolo Bonzini, Rafael J. Wysocki, Viresh Kumar
2017-02-03 16:24-0200, Marcelo Tosatti:
> On Fri, Feb 03, 2017 at 06:40:34PM +0100, Radim Krcmar wrote:
>> You can make it one hypercall with an argument.
>
> Fine.
>
>> And the argument doesn't have to be enum {UP, DOWN, MAX, MIN}, but an
>> int, which would also allow you to do -2 steps.
>
> Are you suggesting to have an integer to signify the number of steps up
> or down?
Yes.
>> A number over the capabilities of stepping would just map to MAX/MIN.
>
> Then MAX == any positive value above the number of steps
> MIN == any negative value below the negative of number of steps
>
> Sure.
>
>> Avoiding an absolute scale for interface simplifies migration, where the
>> guest cannot really depend much on this. Except that calling it with
>> MIN (INT_MIN) will get the minimum and MAX (INT_MAX) the maximum
>> frequency.
>
> Are you suggesting for the hypercall to return the maximum/minimum
> frequency if called with the highest integer and lowest negative integer
> respectively? (That same hypercall).
No, I meant that we will guarantee that the guest will always get (the
CPU will be in) the minimal frequency when hypercall parameter is
INT_MIN and the maximal with INT_MAX -- just so the guest wouldn't lose
the ability which you provided by MIN and MAX hypercalls.
(We could also make a stronger assertion that there is never going to be
more than INT_MAX steps, CPUs that run KVM will probably never have
that fine frequency control.)
>> Please explicitly say in documentation that things like the number of
>> steps, which the guest can learn by doing MAX and then -1 until the
>> hypercall fails, is undefined and should not be depended upon.
>
> Sure, because it fails over migration.
>
>> Userspace might still want to know the number of steps to avoid useless
>> hypercall -- I think we should return a different value when the limit
>> is reached, not just after the guest wants to go past it.
>
> Are you suggesting to return a different value when going from
>
> max-1 -> max
> and
> min+1 -> min
>
> frequencies?
Yes. Like you do now when going "up" from "max".
It saves one call of the hypercall.
> Fine.
>
>> > +#endif
>> > +
>> > default:
>> > ret = -KVM_ENOSYS;
>> > break;
>>
>> And thinking more about migration, userspace cannot learn the current
>> frequency (at least MIN/MAX), so the new host will just pick at random,
>> which will break userspace's expectations that it cannot increase or
>> decrease the frequency. Is migration left for the future, because DPDK
>> doesn't migrate anyway?
>>
>> Thanks.
>
> The new host should start with the highest frequency always. Then
> the frequency tuning algorithm can reduce frequency afterwards.
That is not going to work on migration.
Suppose we do that and the CPU is in minimal frequency before the
migration. This means that queue is below the threshold and userspace
knows that it is in minimum frequency (because we provide that
information when going down), so it doesn't trigger useless hypercalls.
After migration, the host would set frequency to maximum, but userspace
would still think that it is minimal, so it would not decrease it.
The only reason for this series -- power saving -- is lost.
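
The scenario can be replayed with a toy model. Everything here is
hypothetical (struct names, helpers, the index-based frequency table); it
only assumes what is described above: the guest remembers that it reached
the minimum and skips further "down" hypercalls, and the destination host
starts at its maximum frequency.

```c
#include <stdbool.h>

/* Toy host: cur is a frequency index, 0 is the lowest, last the highest. */
struct host {
	int cur;
	int last;
};

/* Toy guest governor: remembers whether the host reported the minimum. */
struct guest {
	bool thinks_at_min;
};

/* Host side of a "down" request; returns 1 once the minimum is reached. */
static int host_step_down(struct host *h)
{
	if (h->cur > 0)
		h->cur--;
	return h->cur == 0;
}

/* Guest reaction to low load. Returns the number of hypercalls issued
 * (0 or 1); the call is skipped when the guest believes it is at min. */
static int guest_on_low_load(struct guest *g, struct host *h)
{
	if (g->thinks_at_min)
		return 0;
	g->thinks_at_min = host_step_down(h);
	return 1;
}

/* Migration as proposed: the destination starts at its maximum
 * frequency, while the guest's cached belief travels along unchanged. */
static void migrate(struct host *h, int dest_last)
{
	h->last = dest_last;
	h->cur = dest_last;
}
```

After migrate(), guest_on_low_load() keeps returning 0 and the host stays
at its highest index, because nothing tells the guest that its cached
state is stale.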
> Migration is a desired feature for DPDK, so it should be supported
> (that's one reason why virtio-net drivers are used in the guest BTW).
Oh, nice,
thanks.
^ permalink raw reply [flat|nested] 15+ messages in thread