xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0 of 2] [RFC] Patches to work with processor-passthru driver (v1).
@ 2012-02-24  6:43 Konrad Rzeszutek Wilk
  2012-02-24  6:43 ` [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc) Konrad Rzeszutek Wilk
  2012-02-24  6:43 ` [PATCH 2 of 2] linux-xencommons: Load processor-passthru Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-24  6:43 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk


These two patches provide the neccessary infrastructure changes
for the processor-passthru driver [www.spinics.net/lists/linux-acpi/msg34655.html]
to properly function.

The first one is quite easy - we just modprobe the processor-passthru driver.

The second allows it to work under AMD machines by exposing the PM RDMSR
to dom0. It has been tested with 2.6.32 kernel as well to make sure it does
not negativly impact them and it does not. 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
  2012-02-24  6:43 [PATCH 0 of 2] [RFC] Patches to work with processor-passthru driver (v1) Konrad Rzeszutek Wilk
@ 2012-02-24  6:43 ` Konrad Rzeszutek Wilk
  2012-02-24 10:38   ` Jan Beulich
  2012-02-24  6:43 ` [PATCH 2 of 2] linux-xencommons: Load processor-passthru Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-24  6:43 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1330065828 18000
# Node ID a34b652afb0ca484b7416008c95fd36ffbbea334
# Parent  a4d93d0e0df2fafe5b3e2dab3e34799498a875e2
traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)

The restriction to read and write the AMD power management MSRs is gated if the
domain 0 is the PM domain (so FREQCTL_dom0_kernel is set). But we can
relax this restriction and allow the privileged domain to read the MSRs
(but not write). This allows the priviliged domain to harvest the power
management information (ACPI _PSS states) and send it to the hypervisor.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r a4d93d0e0df2 -r a34b652afb0c xen/arch/x86/traps.c
--- a/xen/arch/x86/traps.c	Wed Feb 22 14:33:24 2012 +0000
+++ b/xen/arch/x86/traps.c	Fri Feb 24 01:43:48 2012 -0500
@@ -2610,7 +2610,7 @@ static int emulate_privileged_op(struct 
         case MSR_K8_PSTATE7:
             if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
                 goto fail;
-            if ( !is_cpufreq_controller(v->domain) )
+            if ( !is_cpufreq_controller(v->domain) && !IS_PRIV(v->domain) )
             {
                 regs->eax = regs->edx = 0;
                 break;

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2 of 2] linux-xencommons: Load processor-passthru
  2012-02-24  6:43 [PATCH 0 of 2] [RFC] Patches to work with processor-passthru driver (v1) Konrad Rzeszutek Wilk
  2012-02-24  6:43 ` [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc) Konrad Rzeszutek Wilk
@ 2012-02-24  6:43 ` Konrad Rzeszutek Wilk
  2012-02-24 10:44   ` Jan Beulich
  1 sibling, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-24  6:43 UTC (permalink / raw)
  To: xen-devel, Ian.Jackson, stefano.stabellini, Ian.Campbell; +Cc: konrad.wilk

# HG changeset patch
# User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
# Date 1330065828 18000
# Node ID aa3a294327ed075bafbe3c070b39e3dbacbc49cd
# Parent  a34b652afb0ca484b7416008c95fd36ffbbea334
linux-xencommons: Load processor-passthru

The processor-passthru module is used in the upstream kernels
to upload power management information to the hypervisor. We
need to load it to work first.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff -r a34b652afb0c -r aa3a294327ed tools/hotplug/Linux/init.d/xencommons
--- a/tools/hotplug/Linux/init.d/xencommons	Fri Feb 24 01:43:48 2012 -0500
+++ b/tools/hotplug/Linux/init.d/xencommons	Fri Feb 24 01:43:48 2012 -0500
@@ -58,6 +58,7 @@ do_start () {
 	modprobe xen-gntdev 2>/dev/null
 	modprobe evtchn 2>/dev/null
 	modprobe gntdev 2>/dev/null
+	modprobe processor-passthru 2>/dev/null
 
 	if ! `xenstore-read -s / >/dev/null 2>&1`
 	then

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
  2012-02-24  6:43 ` [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc) Konrad Rzeszutek Wilk
@ 2012-02-24 10:38   ` Jan Beulich
  2012-03-01 16:33     ` Keir Fraser
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2012-02-24 10:38 UTC (permalink / raw)
  To: konrad.wilk; +Cc: xen-devel, Ian.Jackson, Ian.Campbell, stefano.stabellini

>>> On 24.02.12 at 07:43, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> # HG changeset patch
> # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> # Date 1330065828 18000
> # Node ID a34b652afb0ca484b7416008c95fd36ffbbea334
> # Parent  a4d93d0e0df2fafe5b3e2dab3e34799498a875e2
> traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
> 
> The restriction to read and write the AMD power management MSRs is gated if 
> the
> domain 0 is the PM domain (so FREQCTL_dom0_kernel is set). But we can
> relax this restriction and allow the privileged domain to read the MSRs
> (but not write). This allows the priviliged domain to harvest the power
> management information (ACPI _PSS states) and send it to the hypervisor.
> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> diff -r a4d93d0e0df2 -r a34b652afb0c xen/arch/x86/traps.c
> --- a/xen/arch/x86/traps.c	Wed Feb 22 14:33:24 2012 +0000
> +++ b/xen/arch/x86/traps.c	Fri Feb 24 01:43:48 2012 -0500
> @@ -2610,7 +2610,7 @@ static int emulate_privileged_op(struct 
>          case MSR_K8_PSTATE7:
>              if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
>                  goto fail;
> -            if ( !is_cpufreq_controller(v->domain) )
> +            if ( !is_cpufreq_controller(v->domain) && !IS_PRIV(v->domain) )

As said already in response to your Linux side overview mail, while
I don't view this as having potential to cause any problems, I also
don't see the need.

And then if this indeed is wanted, replacing is_cpufreq_controller()
by IS_PRIV() rather than adding the latter would seem preferable
(because of being more strait forward).

Jan

>              {
>                  regs->eax = regs->edx = 0;
>                  break;
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2 of 2] linux-xencommons: Load processor-passthru
  2012-02-24  6:43 ` [PATCH 2 of 2] linux-xencommons: Load processor-passthru Konrad Rzeszutek Wilk
@ 2012-02-24 10:44   ` Jan Beulich
  2012-02-27  1:46     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2012-02-24 10:44 UTC (permalink / raw)
  To: konrad.wilk; +Cc: xen-devel, Ian.Jackson, Ian.Campbell, stefano.stabellini

>>> On 24.02.12 at 07:43, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> # HG changeset patch
> # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> # Date 1330065828 18000
> # Node ID aa3a294327ed075bafbe3c070b39e3dbacbc49cd
> # Parent  a34b652afb0ca484b7416008c95fd36ffbbea334
> linux-xencommons: Load processor-passthru
> 
> The processor-passthru module is used in the upstream kernels
> to upload power management information to the hypervisor. We
> need to load it to work first.

Hmm, I don't really like this (and prior) pvops-specific additions here,
even if stderr gets directed into nowhere. I don't think this (and any
other) script intended to target Linux in general should target just
a specific implementation.

Furthermore, given the purpose of the driver, I'm thinking that this
is too late in the boot process anyway, and the driver should rather
be loaded at the point where other CPUFreq drivers would normally
be by a particular distro (which would still be later than when the
respective code gets run on traditional Xenified Linux).

Jan

> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> 
> diff -r a34b652afb0c -r aa3a294327ed tools/hotplug/Linux/init.d/xencommons
> --- a/tools/hotplug/Linux/init.d/xencommons	Fri Feb 24 01:43:48 2012 -0500
> +++ b/tools/hotplug/Linux/init.d/xencommons	Fri Feb 24 01:43:48 2012 -0500
> @@ -58,6 +58,7 @@ do_start () {
>  	modprobe xen-gntdev 2>/dev/null
>  	modprobe evtchn 2>/dev/null
>  	modprobe gntdev 2>/dev/null
> +	modprobe processor-passthru 2>/dev/null
>  
>  	if ! `xenstore-read -s / >/dev/null 2>&1`
>  	then
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org 
> http://lists.xen.org/xen-devel 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2 of 2] linux-xencommons: Load processor-passthru
  2012-02-24 10:44   ` Jan Beulich
@ 2012-02-27  1:46     ` Konrad Rzeszutek Wilk
  2012-02-27  8:31       ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-02-27  1:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Ian.Jackson, Ian.Campbell, stefano.stabellini

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

On Fri, Feb 24, 2012 at 10:44:54AM +0000, Jan Beulich wrote:
> >>> On 24.02.12 at 07:43, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> > # HG changeset patch
> > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> > # Date 1330065828 18000
> > # Node ID aa3a294327ed075bafbe3c070b39e3dbacbc49cd
> > # Parent  a34b652afb0ca484b7416008c95fd36ffbbea334
> > linux-xencommons: Load processor-passthru
> > 
> > The processor-passthru module is used in the upstream kernels
> > to upload power management information to the hypervisor. We
> > need to load it to work first.
> 
> Hmm, I don't really like this (and prior) pvops-specific additions here,
> even if stderr gets directed into nowhere. I don't think this (and any
> other) script intended to target Linux in general should target just
> a specific implementation.
> 
> Furthermore, given the purpose of the driver, I'm thinking that this
> is too late in the boot process anyway, and the driver should rather
> be loaded at the point where other CPUFreq drivers would normally
> be by a particular distro (which would still be later than when the
> respective code gets run on traditional Xenified Linux).

My thinking is to keep the amount of changes to be within the Xen-ecosystem.

Doing the changes in the old-fashioned way in drivers/acpi has not been
very productive - so I've been looking at just some form of uploading
the PM information to the hypervisor. And I've been piggybacking on
top of the cpufreq scaling drivers collecting the information. So with that 
in mind, the next idea I had was to provide a cpufreq governor, called 'xen'
which would inhibit the cpufreq scaling drivers from changing anything in dom0
(so that the hypervisor can do it instead). Also it would be responsible
for uploading the PM information to the hypervisor. See attached.

The fix to the tool-stack would be something along:
# HG changeset patch
# Parent 917f845f3e838dcc1efccb132abe2d1f254cb7b8

diff -r 917f845f3e83 tools/hotplug/Linux/init.d/xencommons
--- a/tools/hotplug/Linux/init.d/xencommons	Thu Feb 23 13:29:00 2012 -0500
+++ b/tools/hotplug/Linux/init.d/xencommons	Fri Feb 24 21:42:20 2012 -0500
@@ -58,7 +58,7 @@ do_start () {
 	modprobe xen-gntdev 2>/dev/null
 	modprobe evtchn 2>/dev/null
 	modprobe gntdev 2>/dev/null
-
+	for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo xen > $file; done 2>/dev/null
 	if ! `xenstore-read -s / >/dev/null 2>&1`
 	then
 		test -z "$XENSTORED_ROOTDIR" || XENSTORED_ROOTDIR="/var/lib/xenstored"

[-- Attachment #2: 0001-CPUFREQ-xen-governor-for-Xen-hypervisor-frequency-sc.patch --]
[-- Type: text/plain, Size: 17087 bytes --]

>From 20e7a07fa0f8a0dbe30a0f732686d78849d29d96 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 3 Feb 2012 16:03:20 -0500
Subject: [PATCH] [CPUFREQ] xen: governor for Xen hypervisor frequency
 scaling.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This CPU freq governor leaves the frequency decision to the Xen hypervisor.

To do that the driver parses the Power Management data and uploads said
information to the Xen hypervisor. Then the Xen hypervisor can select the
proper Cx and Pxx states for the initial domain and all other domains.

To upload the information, this CPU frequency driver reads Power Management (PM)
(_Pxx and _Cx) which are populated in the 'struct acpi_processor' structure.
It simply reads the contents of that structure and pass it up the Xen hypervisor.
For that to work we depend on the appropriate CPU frequency scaling driver
to do the heavy-lifting - so that the contents is correct.

The CPU frequency governor it has been loaded also sets up a timer
to check if the ACPI IDs count is different from the APIC ID count - which
can happen if the user choose to use dom0_max_vcpu argument. In such a case
a backup of the PM structure is used and uploaded to the hypervisor.

[v1-v2: Initial RFC implementations that were posted]
[v3: Changed the name to passthru suggested by Pasi Kärkkäinen <pasik@iki.fi>]
[v4: Added vCPU != pCPU support - aka dom0_max_vcpus support]
[v5: Cleaned up the driver, fix bug under Athlon XP]
[v6: Changed the driver to a CPU frequency governor]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/xen/Kconfig       |   15 ++
 drivers/xen/Makefile      |    2 +-
 drivers/xen/cpufreq_xen.c |  445 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 461 insertions(+), 1 deletions(-)
 create mode 100644 drivers/xen/cpufreq_xen.c

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index a1ced52..28ba371 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -178,4 +178,19 @@ config XEN_PRIVCMD
 	depends on XEN
 	default m
 
+config CPU_FREQ_GOV_XEN
+	tristate "'xen' governor for hypervisor scaling"
+	depends on XEN && X86 && ACPI_PROCESSOR && CPU_FREQ
+	default m
+	help
+          This cpufreq governor leaves the frequency decision to the Xen hypervisor.
+
+	  To do that the driver parses the Power Management data and uploads said
+	  information to the Xen hypervisor. Then the Xen hypervisor can select the
+          proper Cx and Pxx states.
+
+          To compile this driver as a module, choose M here: the
+          module will be called cpufreq_xen.  If you do not know what to choose,
+          select M here.
+
 endmenu
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index aa31337..5802220 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -20,7 +20,7 @@ obj-$(CONFIG_SWIOTLB_XEN)		+= swiotlb-xen.o
 obj-$(CONFIG_XEN_DOM0)			+= pci.o
 obj-$(CONFIG_XEN_PCIDEV_BACKEND)	+= xen-pciback/
 obj-$(CONFIG_XEN_PRIVCMD)		+= xen-privcmd.o
-
+obj-$(CONFIG_CPU_FREQ_GOV_XEN)		+= cpufreq_xen.o
 xen-evtchn-y				:= evtchn.o
 xen-gntdev-y				:= gntdev.o
 xen-gntalloc-y				:= gntalloc.o
diff --git a/drivers/xen/cpufreq_xen.c b/drivers/xen/cpufreq_xen.c
new file mode 100644
index 0000000..1b709bfc
--- /dev/null
+++ b/drivers/xen/cpufreq_xen.c
@@ -0,0 +1,445 @@
+/*
+ * Copyright 2012 by Oracle Inc
+ * Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
+ *
+ * This code borrows ideas from https://lkml.org/lkml/2011/11/30/249
+ * so many thanks go to Kevin Tian <kevin.tian@intel.com>
+ * and Yu Ke <ke.yu@intel.com>.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ */
+
+#include <linux/cpumask.h>
+#include <linux/cpufreq.h>
+#include <linux/freezer.h>
+#include <linux/kernel.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <acpi/acpi_bus.h>
+#include <acpi/acpi_drivers.h>
+#include <acpi/processor.h>
+
+#include <xen/interface/platform.h>
+#include <asm/xen/hypercall.h>
+
+#define DRV_NAME "cpufreq-xen"
+
+static int no_hypercall;
+MODULE_PARM_DESC(off, "Inhibit the hypercall.");
+module_param_named(off, no_hypercall, int, 0400);
+
+/*
+ * Mutex to protect the acpi_ids_done.
+ */
+static DEFINE_MUTEX(acpi_ids_mutex);
+/*
+ * Don't think convert this to cpumask_var_t or use cpumask_bit - as those
+ * shrink to nr_cpu_bits (which is dependent on possible_cpu), which can be
+ * less than what we want to put in.
+ */
+#define NR_ACPI_CPUS	NR_CPUS
+#define MAX_ACPI_BITS	(BITS_TO_LONGS(NR_ACPI_CPUS))
+static unsigned long *acpi_ids_done;
+/*
+ * Again, don't convert to cpumask - as we are reading the raw ACPI CPU ids
+ * which can go beyond what we presently see.
+ */
+static unsigned long *acpi_id_present;
+
+/*
+ * Pertient data for the timer to be launched to check if the # of
+ * ACPI CPU ids is different from the one we have processed.
+ */
+#define DELAY_TIMER	msecs_to_jiffies(5000 /* 5 sec */)
+static struct acpi_processor *pr_backup;
+static struct delayed_work work;
+
+static int push_cxx_to_hypervisor(struct acpi_processor *_pr)
+{
+	struct xen_platform_op op = {
+		.cmd			= XENPF_set_processor_pminfo,
+		.interface_version	= XENPF_INTERFACE_VERSION,
+		.u.set_pminfo.id	= _pr->acpi_id,
+		.u.set_pminfo.type	= XEN_PM_CX,
+	};
+	struct xen_processor_cx *dst_cx, *dst_cx_states = NULL;
+	struct acpi_processor_cx *cx;
+	int i, ok, ret = 0;
+
+	dst_cx_states = kcalloc(_pr->power.count,
+				sizeof(struct xen_processor_cx), GFP_KERNEL);
+	if (!dst_cx_states)
+		return -ENOMEM;
+
+	for (ok = 0, i = 1; i <= _pr->power.count; i++) {
+		cx = &_pr->power.states[i];
+		if (!cx->valid)
+			continue;
+
+		dst_cx = &(dst_cx_states[ok++]);
+
+		dst_cx->reg.space_id = ACPI_ADR_SPACE_SYSTEM_IO;
+		if (cx->entry_method == ACPI_CSTATE_SYSTEMIO) {
+			dst_cx->reg.bit_width = 8;
+			dst_cx->reg.bit_offset = 0;
+			dst_cx->reg.access_size = 1;
+		} else {
+			dst_cx->reg.space_id = ACPI_ADR_SPACE_FIXED_HARDWARE;
+			if (cx->entry_method == ACPI_CSTATE_FFH) {
+				/* NATIVE_CSTATE_BEYOND_HALT */
+				dst_cx->reg.bit_offset = 2;
+				dst_cx->reg.bit_width = 1; /* VENDOR_INTEL */
+			}
+			dst_cx->reg.access_size = 0;
+		}
+		dst_cx->reg.address = cx->address;
+
+		dst_cx->type = cx->type;
+		dst_cx->latency = cx->latency;
+		dst_cx->power = cx->power;
+
+		dst_cx->dpcnt = 0;
+		set_xen_guest_handle(dst_cx->dp, NULL);
+#ifdef DEBUG
+		pr_debug(DRV_NAME ": CX: ID:%d [C%d:%s] entry:%d\n",
+			_pr->acpi_id, cx->type, cx->desc, cx->entry_method);
+#endif
+	}
+	if (!ok) {
+		pr_err(DRV_NAME ": No _Cx for CPU %d\n", _pr->acpi_id);
+		kfree(dst_cx_states);
+		return -EINVAL;
+	}
+	op.u.set_pminfo.power.count = ok;
+	op.u.set_pminfo.power.flags.bm_control = _pr->flags.bm_control;
+	op.u.set_pminfo.power.flags.bm_check = _pr->flags.bm_check;
+	op.u.set_pminfo.power.flags.has_cst = _pr->flags.has_cst;
+	op.u.set_pminfo.power.flags.power_setup_done =
+		_pr->flags.power_setup_done;
+
+	set_xen_guest_handle(op.u.set_pminfo.power.states, dst_cx_states);
+
+	if (!no_hypercall)
+		ret = HYPERVISOR_dom0_op(&op);
+
+	if (ret)
+		pr_err(DRV_NAME "(CX): Hypervisor error (%d) for ACPI ID: %d\n",
+		       ret, _pr->acpi_id);
+
+	kfree(dst_cx_states);
+
+	return ret;
+}
+static struct xen_processor_px *
+xen_copy_pss_data(struct acpi_processor *_pr,
+		  struct xen_processor_performance *dst_perf)
+{
+	struct xen_processor_px *dst_states = NULL;
+	int i;
+
+	BUILD_BUG_ON(sizeof(struct xen_processor_px) !=
+		     sizeof(struct acpi_processor_px));
+
+	dst_states = kcalloc(_pr->performance->state_count,
+			     sizeof(struct xen_processor_px), GFP_KERNEL);
+	if (!dst_states)
+		return ERR_PTR(-ENOMEM);
+
+	dst_perf->state_count = _pr->performance->state_count;
+	for (i = 0; i < _pr->performance->state_count; i++) {
+		/* Fortunatly for us, they are both the same size */
+		memcpy(&(dst_states[i]), &(_pr->performance->states[i]),
+		       sizeof(struct acpi_processor_px));
+	}
+	return dst_states;
+}
+static int xen_copy_psd_data(struct acpi_processor *_pr,
+			     struct xen_processor_performance *dst)
+{
+	BUILD_BUG_ON(sizeof(struct xen_psd_package) !=
+		     sizeof(struct acpi_psd_package));
+
+	if (_pr->performance->shared_type != CPUFREQ_SHARED_TYPE_NONE) {
+		dst->shared_type = _pr->performance->shared_type;
+
+		memcpy(&(dst->domain_info), &(_pr->performance->domain_info),
+		       sizeof(struct acpi_psd_package));
+	} else {
+		if ((&cpu_data(0))->x86_vendor != X86_VENDOR_AMD)
+			return -EINVAL;
+
+		/* On AMD, the powernow-k8 is loaded before acpi_cpufreq
+		 * meaning that acpi_processor_preregister_performance never
+		 * gets called which would parse the _PSD. The only relevant
+		 * information from _PSD we need is whether it is HW_ALL or any
+		 * other type. AMD K8 >= are SW_ALL or SW_ANY, AMD K7<= HW_ANY.
+		 * This driver checks at the start whether it is K8 so it
+		 * if we get here it can only be K8.
+		 */
+		dst->shared_type = CPUFREQ_SHARED_TYPE_ANY;
+		dst->domain_info.coord_type = DOMAIN_COORD_TYPE_SW_ANY;
+		dst->domain_info.num_processors = num_online_cpus();
+	}
+	return 0;
+}
+static int xen_copy_pct_data(struct acpi_pct_register *pct,
+			     struct xen_pct_register *dst_pct)
+{
+	/* It would be nice if you could just do 'memcpy(pct, dst_pct') but
+	 * sadly the Xen structure did not have the proper padding so the
+	 * descriptor field takes two (dst_pct) bytes instead of one (pct).
+	 */
+	dst_pct->descriptor = pct->descriptor;
+	dst_pct->length = pct->length;
+	dst_pct->space_id = pct->space_id;
+	dst_pct->bit_width = pct->bit_width;
+	dst_pct->bit_offset = pct->bit_offset;
+	dst_pct->reserved = pct->reserved;
+	dst_pct->address = pct->address;
+	return 0;
+}
+static int push_pxx_to_hypervisor(struct acpi_processor *_pr)
+{
+	int ret = 0;
+	struct xen_platform_op op = {
+		.cmd			= XENPF_set_processor_pminfo,
+		.interface_version	= XENPF_INTERFACE_VERSION,
+		.u.set_pminfo.id	= _pr->acpi_id,
+		.u.set_pminfo.type	= XEN_PM_PX,
+	};
+	struct xen_processor_performance *dst_perf;
+	struct xen_processor_px *dst_states = NULL;
+
+	dst_perf = &op.u.set_pminfo.perf;
+
+	dst_perf->platform_limit = _pr->performance_platform_limit;
+	dst_perf->flags |= XEN_PX_PPC;
+	xen_copy_pct_data(&(_pr->performance->control_register),
+			  &dst_perf->control_register);
+	xen_copy_pct_data(&(_pr->performance->status_register),
+			  &dst_perf->status_register);
+	dst_perf->flags |= XEN_PX_PCT;
+	dst_states = xen_copy_pss_data(_pr, dst_perf);
+	if (!IS_ERR_OR_NULL(dst_states)) {
+		set_xen_guest_handle(dst_perf->states, dst_states);
+		dst_perf->flags |= XEN_PX_PSS;
+	}
+	if (!xen_copy_psd_data(_pr, dst_perf))
+		dst_perf->flags |= XEN_PX_PSD;
+
+	if (!no_hypercall)
+		ret = HYPERVISOR_dom0_op(&op);
+
+	if (ret)
+		pr_err(DRV_NAME "(_PXX): Hypervisor error (%d) for ACPI ID %d\n",
+		       ret, _pr->acpi_id);
+
+	if (!IS_ERR_OR_NULL(dst_states))
+		kfree(dst_states);
+
+	return ret;
+}
+static int upload_pm_data(struct acpi_processor *_pr)
+{
+	int err = 0;
+
+	if (__test_and_set_bit(_pr->acpi_id, acpi_ids_done))
+		return -EBUSY;
+
+	if (_pr->flags.power)
+		err = push_cxx_to_hypervisor(_pr);
+
+	if (_pr->performance && _pr->performance->states)
+		err |= push_pxx_to_hypervisor(_pr);
+
+	return err;
+}
+static acpi_status
+read_acpi_id(acpi_handle handle, u32 lvl, void *context, void **rv)
+{
+	u32 acpi_id;
+	acpi_status status;
+	acpi_object_type acpi_type;
+	unsigned long long tmp;
+	union acpi_object object = { 0 };
+	struct acpi_buffer buffer = { sizeof(union acpi_object), &object };
+
+	status = acpi_get_type(handle, &acpi_type);
+	if (ACPI_FAILURE(status))
+		return AE_OK;
+
+	switch (acpi_type) {
+	case ACPI_TYPE_PROCESSOR:
+		status = acpi_evaluate_object(handle, NULL, NULL, &buffer);
+		if (ACPI_FAILURE(status))
+			return AE_OK;
+		acpi_id = object.processor.proc_id;
+		break;
+	case ACPI_TYPE_DEVICE:
+		status = acpi_evaluate_integer(handle, "_UID", NULL, &tmp);
+		if (ACPI_FAILURE(status))
+			return AE_OK;
+		acpi_id = tmp;
+		break;
+	default:
+		return AE_OK;
+	}
+	if (acpi_id > NR_ACPI_CPUS) {
+		WARN_ONCE(1, "There are %d ACPI processors, but kernel can only do %d!\n",
+		     acpi_id, NR_ACPI_CPUS);
+		return AE_OK;
+	}
+	__set_bit(acpi_id, acpi_id_present);
+
+	return AE_OK;
+}
+static unsigned int more_acpi_ids(void)
+{
+	unsigned int n = 0;
+
+	acpi_walk_namespace(ACPI_TYPE_PROCESSOR, ACPI_ROOT_OBJECT,
+			    ACPI_UINT32_MAX,
+			    read_acpi_id, NULL, NULL, NULL);
+	acpi_get_devices("ACPI0007", read_acpi_id, NULL, NULL);
+
+	mutex_lock(&acpi_ids_mutex);
+	if (!bitmap_equal(acpi_id_present, acpi_ids_done, MAX_ACPI_BITS))
+		n = bitmap_weight(acpi_id_present, MAX_ACPI_BITS);
+	mutex_unlock(&acpi_ids_mutex);
+
+	return n;
+}
+static void do_check_acpi_id_timer(struct work_struct *_work)
+{
+	/* All online CPUs have been processed at this stage. Now verify
+	 * whether in fact "online CPUs" == physical CPUs.
+	 */
+	acpi_id_present = kcalloc(MAX_ACPI_BITS, sizeof(unsigned long), GFP_KERNEL);
+	if (!acpi_id_present)
+		return;
+	memset(acpi_id_present, 0, MAX_ACPI_BITS * sizeof(unsigned long));
+
+	if (more_acpi_ids()) {
+		int cpu;
+		if (!pr_backup) {
+			schedule_delayed_work(&work, DELAY_TIMER);
+			return;
+		}
+		for_each_set_bit(cpu, acpi_id_present, MAX_ACPI_BITS) {
+			pr_backup->acpi_id = cpu;
+			mutex_lock(&acpi_ids_mutex);
+			(void)upload_pm_data(pr_backup);
+			mutex_unlock(&acpi_ids_mutex);
+		}
+	}
+	kfree(acpi_id_present);
+	acpi_id_present = NULL;
+}
+
+static int cpufreq_governor_xen(struct cpufreq_policy *policy,
+				unsigned int event)
+{
+	struct acpi_processor *_pr;
+
+	switch (event) {
+	case CPUFREQ_GOV_START:
+	case CPUFREQ_GOV_LIMITS:
+		/* Set it to max and let the hypervisor take over */
+		__cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
+
+		_pr = per_cpu(processors, policy->cpu /* APIC ID */);
+		if (!_pr)
+			break;
+
+		mutex_lock(&acpi_ids_mutex);
+		if (!pr_backup) {
+			pr_backup = kzalloc(sizeof(struct acpi_processor), GFP_KERNEL);
+			memcpy(pr_backup, _pr, sizeof(struct acpi_processor));
+
+			INIT_DELAYED_WORK_DEFERRABLE(&work, do_check_acpi_id_timer);
+			schedule_delayed_work(&work, DELAY_TIMER);
+		}
+		(void)upload_pm_data(_pr);
+		mutex_unlock(&acpi_ids_mutex);
+		break;
+	default:
+		break;
+	}
+	return 0;
+}
+static struct cpufreq_governor cpufreq_gov_xen = {
+	.name		= "xen",
+	.governor	= cpufreq_governor_xen,
+	.owner		= THIS_MODULE,
+};
+static int __init check_prereq(void)
+{
+	struct cpuinfo_x86 *c = &cpu_data(0);
+
+	if (!xen_initial_domain())
+		return -ENODEV;
+
+	if (!acpi_gbl_FADT.smi_command)
+		return -ENODEV;
+
+	if (c->x86_vendor == X86_VENDOR_INTEL) {
+		if (!cpu_has(c, X86_FEATURE_EST))
+			return -ENODEV;
+
+		return 0;
+	}
+	if (c->x86_vendor == X86_VENDOR_AMD) {
+		u32 hi = 0, lo = 0;
+		/* Copied from powernow-k8.h, can't include ../cpufreq/powernow
+		 * as we get compile warnings for the static functions.
+		 */
+#define MSR_PSTATE_CUR_LIMIT    0xc0010061 /* pstate current limit MSR */
+		rdmsr(MSR_PSTATE_CUR_LIMIT, lo, hi);
+
+		/* If the MSR cannot provide the data, the powernow-k8
+		 * won't process the data properly either.
+		 */
+		if (hi || lo)
+			return 0;
+	}
+	return -ENODEV;
+}
+
+static int __init xen_processor_passthru_init(void)
+{
+	int rc = check_prereq();
+
+	if (rc)
+		return rc;
+
+	acpi_ids_done = kcalloc(MAX_ACPI_BITS, sizeof(unsigned long), GFP_KERNEL);
+	if (!acpi_ids_done)
+		return -ENOMEM;
+	memset(acpi_ids_done, 0, MAX_ACPI_BITS * sizeof(unsigned long));
+
+	return cpufreq_register_governor(&cpufreq_gov_xen);
+}
+static void __exit xen_processor_passthru_exit(void)
+{
+	cpufreq_unregister_governor(&cpufreq_gov_xen);
+	cancel_delayed_work_sync(&work);
+	kfree(acpi_ids_done);
+	kfree(pr_backup);
+}
+
+MODULE_AUTHOR("Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>");
+MODULE_DESCRIPTION("CPUfreq policy governor 'xen' which uploads PM data to Xen hypervisor");
+MODULE_LICENSE("GPL");
+
+late_initcall(xen_processor_passthru_init);
+module_exit(xen_processor_passthru_exit);
-- 
1.7.9.48.g85da4d


[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2 of 2] linux-xencommons: Load processor-passthru
  2012-02-27  1:46     ` Konrad Rzeszutek Wilk
@ 2012-02-27  8:31       ` Jan Beulich
  2012-03-05  2:49         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2012-02-27  8:31 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: xen-devel, Ian.Jackson, Ian.Campbell, stefano.stabellini

>>> On 27.02.12 at 02:46, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> On Fri, Feb 24, 2012 at 10:44:54AM +0000, Jan Beulich wrote:
>> >>> On 24.02.12 at 07:43, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> > # HG changeset patch
>> > # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> > # Date 1330065828 18000
>> > # Node ID aa3a294327ed075bafbe3c070b39e3dbacbc49cd
>> > # Parent  a34b652afb0ca484b7416008c95fd36ffbbea334
>> > linux-xencommons: Load processor-passthru
>> > 
>> > The processor-passthru module is used in the upstream kernels
>> > to upload power management information to the hypervisor. We
>> > need to load it to work first.
>> 
>> Hmm, I don't really like this (and prior) pvops-specific additions here,
>> even if stderr gets directed into nowhere. I don't think this (and any
>> other) script intended to target Linux in general should target just
>> a specific implementation.
>> 
>> Furthermore, given the purpose of the driver, I'm thinking that this
>> is too late in the boot process anyway, and the driver should rather
>> be loaded at the point where other CPUFreq drivers would normally
>> be by a particular distro (which would still be later than when the
>> respective code gets run on traditional Xenified Linux).
> 
> My thinking is to keep the amount of changes to be within the Xen-ecosystem.
> 
> Doing the changes in the old-fashioned way in drivers/acpi has not been
> very productive - so I've been looking at just some form of uploading
> the PM information to the hypervisor. And I've been piggybacking on
> top of the cpufreq scaling drivers collecting the information. So with that 
> in mind, the next idea I had was to provide a cpufreq governor, called 'xen'
> which would inhibit the cpufreq scaling drivers from changing anything in 
> dom0
> (so that the hypervisor can do it instead). Also it would be responsible
> for uploading the PM information to the hypervisor. See attached.
> 
> The fix to the tool-stack would be something along:
> # HG changeset patch
> # Parent 917f845f3e838dcc1efccb132abe2d1f254cb7b8
> 
> diff -r 917f845f3e83 tools/hotplug/Linux/init.d/xencommons
> --- a/tools/hotplug/Linux/init.d/xencommons	Thu Feb 23 13:29:00 2012 -0500
> +++ b/tools/hotplug/Linux/init.d/xencommons	Fri Feb 24 21:42:20 2012 -0500
> @@ -58,7 +58,7 @@ do_start () {
>  	modprobe xen-gntdev 2>/dev/null
>  	modprobe evtchn 2>/dev/null
>  	modprobe gntdev 2>/dev/null
> -
> +	for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo xen > $file; done 2>/dev/null

While not an unreasonable idea (yes, I sort of contradict my other mail
from a few minutes ago, given your explanation above), this won't work
with subsequently onlined vCPU-s (and hence, consistent with my other
mail, having the thing be a governor is likely the wrong choice - it ought
to be a driver).

Jan

>  	if ! `xenstore-read -s / >/dev/null 2>&1`
>  	then
>  		test -z "$XENSTORED_ROOTDIR" || XENSTORED_ROOTDIR="/var/lib/xenstored"

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
  2012-02-24 10:38   ` Jan Beulich
@ 2012-03-01 16:33     ` Keir Fraser
  2012-03-12 18:11       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: Keir Fraser @ 2012-03-01 16:33 UTC (permalink / raw)
  To: Jan Beulich, Konrad Rzeszutek Wilk
  Cc: xen-devel, Ian Jackson, Ian.Campbell, Stefano Stabellini

On 24/02/2012 10:38, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 24.02.12 at 07:43, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
>> # HG changeset patch
>> # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> # Date 1330065828 18000
>> # Node ID a34b652afb0ca484b7416008c95fd36ffbbea334
>> # Parent  a4d93d0e0df2fafe5b3e2dab3e34799498a875e2
>> traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
>> 
>> The restriction to read and write the AMD power management MSRs is gated if
>> the
>> domain 0 is the PM domain (so FREQCTL_dom0_kernel is set). But we can
>> relax this restriction and allow the privileged domain to read the MSRs
>> (but not write). This allows the priviliged domain to harvest the power
>> management information (ACPI _PSS states) and send it to the hypervisor.
>> 
>> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
>> 
>> diff -r a4d93d0e0df2 -r a34b652afb0c xen/arch/x86/traps.c
>> --- a/xen/arch/x86/traps.c Wed Feb 22 14:33:24 2012 +0000
>> +++ b/xen/arch/x86/traps.c Fri Feb 24 01:43:48 2012 -0500
>> @@ -2610,7 +2610,7 @@ static int emulate_privileged_op(struct
>>          case MSR_K8_PSTATE7:
>>              if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
>>                  goto fail;
>> -            if ( !is_cpufreq_controller(v->domain) )
>> +            if ( !is_cpufreq_controller(v->domain) && !IS_PRIV(v->domain) )
> 
> As said already in response to your Linux side overview mail, while
> I don't view this as having potential to cause any problems, I also
> don't see the need.
> 
> And then if this indeed is wanted, replacing is_cpufreq_controller()
> by IS_PRIV() rather than adding the latter would seem preferable
> (because of being more strait forward).

Any response to this, Konrad?

 K.

> Jan
> 
>>              {
>>                  regs->eax = regs->edx = 0;
>>                  break;
>> 
>> 
>> 
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
> 
> 
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2 of 2] linux-xencommons: Load processor-passthru
  2012-02-27  8:31       ` Jan Beulich
@ 2012-03-05  2:49         ` Konrad Rzeszutek Wilk
  2012-03-05 11:43           ` Jan Beulich
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-03-05  2:49 UTC (permalink / raw)
  To: Jan Beulich, davej
  Cc: xen-devel, Ian.Jackson, Ian.Campbell, stefano.stabellini,
	Konrad Rzeszutek Wilk

>>> Furthermore, given the purpose of the driver, I'm thinking that this
>>> is too late in the boot process anyway, and the driver should rather
>>> be loaded at the point where other CPUFreq drivers would normally
>>> be by a particular distro (which would still be later than when the

In Ubuntu and Fedora they are set to:
CONFIG_X86_POWERNOW_K8=y
CONFIG_X86_ACPI_CPUFREQ=y

(thought previous to Fedora 16 POWERNOW_K8 was =m, per
https://bugzilla.redhat.com/show_bug.cgi?id=739159, " powernow-k8
transitions fail in xen dom0")

Looking at how the built is done with =y, the cpufreq drivers are loaded
in the late_init stage and there is a strict order (drivers/cpufreq/Makefile):
  # x86 drivers.
  # Link order matters. K8 is preferred to ACPI because of firmware
bugs in early
  # K8 systems. ACPI is preferred to all other hardware-specific drivers.
  # speedstep-* is preferred over p4-clockmod.

Both of them (acpi-cpufreq.c and powernow-k8.c) have a symbol
dependency on drivers/acpi/processor.c

>>> respective code gets run on traditional Xenified Linux).
>>
>> My thinking is to keep the amount of changes to be within the Xen-ecosystem.

.. snip..
>> The fix to the tool-stack would be something along:
.. snip..
>> diff -r 917f845f3e83 tools/hotplug/Linux/init.d/xencommons
>> --- a/tools/hotplug/Linux/init.d/xencommons   Thu Feb 23 13:29:00 2012 -0500
>> +++ b/tools/hotplug/Linux/init.d/xencommons   Fri Feb 24 21:42:20 2012 -0500
>> @@ -58,7 +58,7 @@ do_start () {
>>       modprobe xen-gntdev 2>/dev/null
>>       modprobe evtchn 2>/dev/null
>>       modprobe gntdev 2>/dev/null
>> -
>> +     for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo xen > $file; done 2>/dev/null
>
> While not an unreasonable idea (yes, I sort of contradict my other mail
> from a few minutes ago, given your explanation above), this won't work
> with subsequently onlined vCPU-s (and hence, consistent with my other

This udev rule could do it? [completely untested]

ACTION=="add", SUBSYSTEM=="cpu", RUN+="/bin/sh -c '[ ! -e
/sys$devpath/online ] || echo xen >
/sys$devpath/cpufreq/scaling_governor'"

> mail, having the thing be a governor is likely the wrong choice - it ought
> to be a driver).
>

By driver, I think you mean either a) a new cpufreq scaling driver, or
b) the processor-core driver.

The b) is what the old Xen-O-Linux tree had, and a simplified version
of this had been posted on LKML: https://lkml.org/lkml/2011/11/30/245
. There was no response from Len Brown about it and from the off-topic
email conversations I got the hint that Len wasn't particularly
thrilled about it.

For a), this would mean some form of unregistering the existing
cpufreq scaling drivers.The reason
for that is we want to use the generic ones (acpi-cpufreq and
powernow-k8) b/c they do all the filtering and parsing of the ACPI
data instead of re-implementing it in our own cpufreq-xen-scaling.
Thought one other option is to export both powernow-k8 and
acpi-cpufreq functions that do this and use them within the
cpufreq-xen-scaling-driver but that sounds icky.

I think the "unregistering" would be akin to
"cpufreq_unregister_driver" but without any parameters.
However, we would be the only user of this mechanism and I am not sure
what Dave Jones feelings are about that (CC-ed here) Here is what I
was thinking (this is of course completely untested/uncompiled):

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 622013f..049fca9 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1881,9 +1881,12 @@ int cpufreq_unregister_driver(struct
cpufreq_driver *driver)
 {
 	unsigned long flags;

-	if (!cpufreq_driver || (driver != cpufreq_driver))
+	if (!cpufreq_driver)
 		return -EINVAL;

+	if (!driver)
+		driver = cpufreq_driver;
+
 	pr_debug("unregistering driver %s\n", driver->name);

 	subsys_interface_unregister(&cpufreq_interface);

There would also be the need to do some tweaking of the order of drivers to load
(so that cpufreq-xen-scaling-driver gets loaded _after_ all of the
other ones). And I am afraid
about the scenario where a cpufreq-gov-X tries to set a specific
frequency and we yank it out
from within and might end up with a NULL pointer. But I think the
"cpufreq_driver_lock" inhibits that. The other thing that strikes me
as dangerous is that we don't call the module_exit of the cpufreq
scaling driver.

While I was writing this up I thought about the overall picture.
Right now, the APIs work in this way:

[acpi processors] -> [cpufreq-scaling] <- [cpufreq-gov-*]

And we want to leverage them to do two things:
1). Inhibit cpufreq-gov-* telling cpufreq-scaling drivers what to do
[which is a bug
right now with Linux dom0 pvops - as the cpufreq-gov-* has no clue
about the other domains and might set the P states to something
incorrect]
2). Upload the power management information up to the hypervisor.

The 1) option can be either done by ignoring the WRMSR or outb in the hypervisor
(but for that we need 2) to know _which_ ones to inhibit). Or it can
be done by your
suggestion by introducing a cpufreq scaling driver that would nop all
calls and unregister
the other drivers. Or lastly by the cpufreq-gov-xen driver which would
leave it to the hypervisor to do it.

The 2). Can be done by either the proposed cpufreq-gov-xen, or the
earlier patch set I sent called processor-passthru
[https://lwn.net/Articles/483668/] which is a simple driver that
checks when cpufreq-scaling driver is loaded and then it upload the
power management data. It does nothing
with the cpufreq-[scaling|gov] APIs. Granted it does need a 'modprobe
processor-passthru' in the xencommons to take advantage of the data -
and yes, it does deal with hot-cpu on lining.

Thoughts?

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2 of 2] linux-xencommons: Load processor-passthru
  2012-03-05  2:49         ` Konrad Rzeszutek Wilk
@ 2012-03-05 11:43           ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2012-03-05 11:43 UTC (permalink / raw)
  To: konrad, davej
  Cc: xen-devel, Ian.Jackson, Ian.Campbell, Konrad Rzeszutek Wilk,
	stefano.stabellini

>>> On 05.03.12 at 03:49, Konrad Rzeszutek Wilk <konrad@darnok.org> wrote:
>>>> Furthermore, given the purpose of the driver, I'm thinking that this
>>>> is too late in the boot process anyway, and the driver should rather
>>>> be loaded at the point where other CPUFreq drivers would normally
>>>> be by a particular distro (which would still be later than when the
> 
> In Ubuntu and Fedora they are set to:
> CONFIG_X86_POWERNOW_K8=y
> CONFIG_X86_ACPI_CPUFREQ=y
> 
> (thought previous to Fedora 16 POWERNOW_K8 was =m, per
> https://bugzilla.redhat.com/show_bug.cgi?id=739159, " powernow-k8
> transitions fail in xen dom0")
> 
> Looking at how the built is done with =y, the cpufreq drivers are loaded
> in the late_init stage and there is a strict order 
> (drivers/cpufreq/Makefile):
>   # x86 drivers.
>   # Link order matters. K8 is preferred to ACPI because of firmware
> bugs in early
>   # K8 systems. ACPI is preferred to all other hardware-specific drivers.
>   # speedstep-* is preferred over p4-clockmod.
> 
> Both of them (acpi-cpufreq.c and powernow-k8.c) have a symbol
> dependency on drivers/acpi/processor.c

But them being 'm' or 'y' shouldn't matter in the end.

>> mail, having the thing be a governor is likely the wrong choice - it ought
>> to be a driver).
>>
> 
> By driver, I think you mean either a) a new cpufreq scaling driver, or
> b) the processor-core driver.

I really meant a).

> For a), this would mean some form of unregistering the existing
> cpufreq scaling drivers.The reason

Or loading before them (and not depending on them), thus
preventing them from loading successfully.

> for that is we want to use the generic ones (acpi-cpufreq and
> powernow-k8) b/c they do all the filtering and parsing of the ACPI
> data instead of re-implementing it in our own cpufreq-xen-scaling.
> Thought one other option is to export both powernow-k8 and
> acpi-cpufreq functions that do this and use them within the
> cpufreq-xen-scaling-driver but that sounds icky.

Indeed.

> 2). Upload the power management information up to the hypervisor.

Which doesn't require cpufreq drivers at all (in non-pv-ops we simply
suppress the CPU_FREQ config option when XEN is set).

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
  2012-03-01 16:33     ` Keir Fraser
@ 2012-03-12 18:11       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-03-12 18:11 UTC (permalink / raw)
  To: Keir Fraser
  Cc: xen-devel, Ian Jackson, Ian.Campbell, Jan Beulich,
	Stefano Stabellini

On Thu, Mar 01, 2012 at 04:33:42PM +0000, Keir Fraser wrote:
> On 24/02/2012 10:38, "Jan Beulich" <JBeulich@suse.com> wrote:
> 
> >>>> On 24.02.12 at 07:43, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote:
> >> # HG changeset patch
> >> # User Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> # Date 1330065828 18000
> >> # Node ID a34b652afb0ca484b7416008c95fd36ffbbea334
> >> # Parent  a4d93d0e0df2fafe5b3e2dab3e34799498a875e2
> >> traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc)
> >> 
> >> The restriction to read and write the AMD power management MSRs is gated if
> >> the
> >> domain 0 is the PM domain (so FREQCTL_dom0_kernel is set). But we can
> >> relax this restriction and allow the privileged domain to read the MSRs
> >> (but not write). This allows the priviliged domain to harvest the power
> >> management information (ACPI _PSS states) and send it to the hypervisor.
> >> 
> >> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> >> 
> >> diff -r a4d93d0e0df2 -r a34b652afb0c xen/arch/x86/traps.c
> >> --- a/xen/arch/x86/traps.c Wed Feb 22 14:33:24 2012 +0000
> >> +++ b/xen/arch/x86/traps.c Fri Feb 24 01:43:48 2012 -0500
> >> @@ -2610,7 +2610,7 @@ static int emulate_privileged_op(struct
> >>          case MSR_K8_PSTATE7:
> >>              if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
> >>                  goto fail;
> >> -            if ( !is_cpufreq_controller(v->domain) )
> >> +            if ( !is_cpufreq_controller(v->domain) && !IS_PRIV(v->domain) )
> > 
> > As said already in response to your Linux side overview mail, while
> > I don't view this as having potential to cause any problems, I also
> > don't see the need.
> > 
> > And then if this indeed is wanted, replacing is_cpufreq_controller()
> > by IS_PRIV() rather than adding the latter would seem preferable
> > (because of being more strait forward).
> 
> Any response to this, Konrad?

Ah, lets drop this patch. Jan got me thinking about doing it in a much
nicer way without a need to do the MSRs.

> 
>  K.
> 
> > Jan
> > 
> >>              {
> >>                  regs->eax = regs->edx = 0;
> >>                  break;
> >> 
> >> 
> >> 
> >> _______________________________________________
> >> Xen-devel mailing list
> >> Xen-devel@lists.xen.org
> >> http://lists.xen.org/xen-devel
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > http://lists.xen.org/xen-devel
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-03-12 18:11 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-24  6:43 [PATCH 0 of 2] [RFC] Patches to work with processor-passthru driver (v1) Konrad Rzeszutek Wilk
2012-02-24  6:43 ` [PATCH 1 of 2] traps: AMD PM RDMSRs (MSR_K8_PSTATE_CTRL, etc) Konrad Rzeszutek Wilk
2012-02-24 10:38   ` Jan Beulich
2012-03-01 16:33     ` Keir Fraser
2012-03-12 18:11       ` Konrad Rzeszutek Wilk
2012-02-24  6:43 ` [PATCH 2 of 2] linux-xencommons: Load processor-passthru Konrad Rzeszutek Wilk
2012-02-24 10:44   ` Jan Beulich
2012-02-27  1:46     ` Konrad Rzeszutek Wilk
2012-02-27  8:31       ` Jan Beulich
2012-03-05  2:49         ` Konrad Rzeszutek Wilk
2012-03-05 11:43           ` Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).