Linux Power Management development


* Re: [PATCH 4/6 v4] arm highbank: add support for pl320 IPC
From: Rob Herring @ 2012-11-14 14:03 UTC (permalink / raw)
  To: Mark Langsdorf; +Cc: linux-kernel, cpufreq, linux-pm
In-Reply-To: <1352313166-28980-5-git-send-email-mark.langsdorf@calxeda.com>

On 11/07/2012 12:32 PM, Mark Langsdorf wrote:
> From: Rob Herring <rob.herring@calxeda.com>
> 
> The pl320 IPC allows for interprocessor communication between the highbank A9
> and the EnergyCore Management Engine. The pl320 implements a straightforward
> mailbox protocol.
> 
> Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
> Signed-off-by: Rob Herring <rob.herring@calxeda.com>
> ---
> Changes from v3, v2
> 	None
> Changes from v1
>         Removed erroneous changes for cpufreq Kconfig
> 
>  arch/arm/include/asm/pl320-ipc.h                |  20 ++

asm/hardware/ is probably more appropriate.

>  arch/arm/mach-highbank/Makefile                 |   2 +
>  arch/arm/mach-highbank/include/mach/pl320-ipc.h |  20 ++

Need to delete this file.

>  arch/arm/mach-highbank/pl320-ipc.c              | 232 ++++++++++++++++++++++++
>  4 files changed, 274 insertions(+)
>  create mode 100644 arch/arm/include/asm/pl320-ipc.h
>  create mode 100644 arch/arm/mach-highbank/include/mach/pl320-ipc.h
>  create mode 100644 arch/arm/mach-highbank/pl320-ipc.c
> 
> diff --git a/arch/arm/include/asm/pl320-ipc.h b/arch/arm/include/asm/pl320-ipc.h
> new file mode 100644
> index 0000000..a0e58ee
> --- /dev/null
> +++ b/arch/arm/include/asm/pl320-ipc.h
> @@ -0,0 +1,20 @@
> +/*
> + * Copyright 2010 Calxeda, Inc.

Update copyright.

> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +int ipc_call_fast(u32 *data);

We should get rid of fast and slow channels and just have a single tx
channel as it is all the same and we don't use the fast channel.
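
As a rough illustration of the single-channel idea, here is a minimal
user-space sketch. It assumes a hypothetical pl320_ipc_transmit() entry
point, a fake_regs[] array standing in for the ioremap()ed register window,
and a simulated remote side (fake_remote_ack()). None of these names come
from the patch, and locking and delays are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets copied from the patch; one mailbox spans 0x40 bytes. */
#define IPCMxSEND(m)	(((m) * 0x40) + 0x020)
#define IPCMxDR(m, dr)	(((m) * 0x40) + ((dr) * 4) + 0x024)

#define IPC_TX_MBOX	1	/* hypothetical: the only transmit mailbox */
#define IPC_WORDS	7

/* Hypothetical stand-in for the ioremap()ed register window. */
static uint32_t fake_regs[0x1000 / 4];
static int fake_remote_alive = 1;

static void reg_write(uint32_t val, unsigned int off)
{
	fake_regs[off / 4] = val;
}

static uint32_t reg_read(unsigned int off)
{
	return fake_regs[off / 4];
}

/* Hypothetical remote side: acknowledge a pending message with 0x2. */
static void fake_remote_ack(void)
{
	if (reg_read(IPCMxSEND(IPC_TX_MBOX)) == 0x1)
		reg_write(0x2, IPCMxSEND(IPC_TX_MBOX));
}

/* Single transmit path: load the data, ring, poll for the ack, read reply. */
int pl320_ipc_transmit(uint32_t *data)
{
	int i, timeout, ret = -ETIMEDOUT;

	assert(data != NULL);

	for (i = 0; i < IPC_WORDS; i++)
		reg_write(data[i], IPCMxDR(IPC_TX_MBOX, i));
	reg_write(0x1, IPCMxSEND(IPC_TX_MBOX));

	for (timeout = 500; timeout > 0; timeout--) {
		if (fake_remote_alive)
			fake_remote_ack();
		if (reg_read(IPCMxSEND(IPC_TX_MBOX)) == 0x2) {
			for (i = 0; i < IPC_WORDS; i++)
				data[i] = reg_read(IPCMxDR(IPC_TX_MBOX, i));
			ret = (int)data[1];
			break;
		}
	}

	/* Release the mailbox whether or not the remote side answered. */
	reg_write(0, IPCMxSEND(IPC_TX_MBOX));
	return ret;
}
```

A caller would pass the same seven-word buffer it hands to ipc_call_slow()
in the patch; the sketch returns data[1] of the reply or -ETIMEDOUT.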

> +int ipc_call_slow(u32 *data);
> +
> +extern int pl320_ipc_register_notifier(struct notifier_block *nb);
> +extern int pl320_ipc_unregister_notifier(struct notifier_block *nb);
> diff --git a/arch/arm/mach-highbank/Makefile b/arch/arm/mach-highbank/Makefile
> index 3ec8bdd..b894708 100644
> --- a/arch/arm/mach-highbank/Makefile
> +++ b/arch/arm/mach-highbank/Makefile
> @@ -7,3 +7,5 @@ obj-$(CONFIG_DEBUG_HIGHBANK_UART)	+= lluart.o
>  obj-$(CONFIG_SMP)			+= platsmp.o
>  obj-$(CONFIG_HOTPLUG_CPU)		+= hotplug.o
>  obj-$(CONFIG_PM_SLEEP)			+= pm.o
> +
> +obj-y					+= pl320-ipc.o
> diff --git a/arch/arm/mach-highbank/include/mach/pl320-ipc.h b/arch/arm/mach-highbank/include/mach/pl320-ipc.h
> new file mode 100644
> index 0000000..a0e58ee
> --- /dev/null
> +++ b/arch/arm/mach-highbank/include/mach/pl320-ipc.h
> @@ -0,0 +1,20 @@
> +/*
> + * Copyright 2010 Calxeda, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +int ipc_call_fast(u32 *data);
> +int ipc_call_slow(u32 *data);
> +
> +extern int pl320_ipc_register_notifier(struct notifier_block *nb);
> +extern int pl320_ipc_unregister_notifier(struct notifier_block *nb);
> diff --git a/arch/arm/mach-highbank/pl320-ipc.c b/arch/arm/mach-highbank/pl320-ipc.c
> new file mode 100644
> index 0000000..0eb92e4
> --- /dev/null
> +++ b/arch/arm/mach-highbank/pl320-ipc.c
> @@ -0,0 +1,232 @@
> +/*
> + * Copyright 2012 Calxeda, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + *
> + * You should have received a copy of the GNU General Public License along with
> + * this program.  If not, see <http://www.gnu.org/licenses/>.
> + */
> +#include <linux/types.h>
> +#include <linux/err.h>
> +#include <linux/delay.h>
> +#include <linux/export.h>
> +#include <linux/io.h>
> +#include <linux/interrupt.h>
> +#include <linux/completion.h>
> +#include <linux/mutex.h>
> +#include <linux/notifier.h>
> +#include <linux/spinlock.h>
> +#include <linux/device.h>
> +#include <linux/amba/bus.h>
> +
> +#include <asm/pl320-ipc.h>
> +
> +#define IPCMxSOURCE(m)		((m) * 0x40)
> +#define IPCMxDSET(m)		(((m) * 0x40) + 0x004)
> +#define IPCMxDCLEAR(m)		(((m) * 0x40) + 0x008)
> +#define IPCMxDSTATUS(m)		(((m) * 0x40) + 0x00C)
> +#define IPCMxMODE(m)		(((m) * 0x40) + 0x010)
> +#define IPCMxMSET(m)		(((m) * 0x40) + 0x014)
> +#define IPCMxMCLEAR(m)		(((m) * 0x40) + 0x018)
> +#define IPCMxMSTATUS(m)		(((m) * 0x40) + 0x01C)
> +#define IPCMxSEND(m)		(((m) * 0x40) + 0x020)
> +#define IPCMxDR(m, dr)		(((m) * 0x40) + ((dr) * 4) + 0x024)
> +
> +#define IPCMMIS(irq)		(((irq) * 8) + 0x800)
> +#define IPCMRIS(irq)		(((irq) * 8) + 0x804)
> +
> +#define MBOX_MASK(n)		(1 << (n))
> +#define IPC_FAST_MBOX		0
> +#define IPC_SLOW_MBOX		1
> +#define IPC_RX_MBOX		2
> +
> +#define CHAN_MASK(n)		(1 << (n))
> +#define A9_SOURCE		1
> +#define M3_SOURCE		0
> +
> +static void __iomem *ipc_base;
> +static int ipc_irq;
> +static DEFINE_SPINLOCK(ipc_m0_lock);
> +static DEFINE_MUTEX(ipc_m1_lock);
> +static DECLARE_COMPLETION(ipc_completion);
> +static ATOMIC_NOTIFIER_HEAD(ipc_notifier);
> +
> +static inline void set_destination(int source, int mbox)
> +{
> +	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxDSET(mbox));
> +	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxMSET(mbox));
> +}
> +
> +static inline void clear_destination(int source, int mbox)
> +{
> +	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxDCLEAR(mbox));
> +	__raw_writel(CHAN_MASK(source), ipc_base + IPCMxMCLEAR(mbox));
> +}
> +
> +static void __ipc_send(int mbox, u32 *data)
> +{
> +	int i;
> +	for (i = 0; i < 7; i++)
> +		__raw_writel(data[i], ipc_base + IPCMxDR(mbox, i));
> +	__raw_writel(0x1, ipc_base + IPCMxSEND(mbox));
> +}
> +
> +static u32 __ipc_rcv(int mbox, u32 *data)
> +{
> +	int i;
> +	for (i = 0; i < 7; i++)
> +		data[i] = __raw_readl(ipc_base + IPCMxDR(mbox, i));
> +	return data[1];
> +}
> +
> +/* non-blocking implementation from the A9 side, interrupt safe in theory */
> +int ipc_call_fast(u32 *data)
> +{
> +	int timeout, ret;
> +
> +	spin_lock(&ipc_m0_lock);
> +
> +	__ipc_send(IPC_FAST_MBOX, data);
> +
> +	for (timeout = 500; timeout > 0; timeout--) {
> +		if (__raw_readl(ipc_base + IPCMxSEND(IPC_FAST_MBOX)) == 0x2)
> +			break;
> +		udelay(100);
> +	}
> +	if (timeout == 0) {
> +		ret = -ETIMEDOUT;
> +		goto out;
> +	}
> +
> +	ret = __ipc_rcv(IPC_FAST_MBOX, data);
> +out:
> +	__raw_writel(0, ipc_base + IPCMxSEND(IPC_FAST_MBOX));
> +	spin_unlock(&ipc_m0_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ipc_call_fast);
> +
> +/* blocking implementation from the A9 side, not usable in interrupts! */
> +int ipc_call_slow(u32 *data)
> +{
> +	int ret;
> +
> +	mutex_lock(&ipc_m1_lock);
> +
> +	init_completion(&ipc_completion);
> +	__ipc_send(IPC_SLOW_MBOX, data);
> +	ret = wait_for_completion_timeout(&ipc_completion,
> +					  msecs_to_jiffies(1000));
> +	if (ret == 0) {
> +		ret = -ETIMEDOUT;
> +		goto out;
> +	}
> +
> +	ret = __ipc_rcv(IPC_SLOW_MBOX, data);
> +out:
> +	mutex_unlock(&ipc_m1_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(ipc_call_slow);
> +
> +irqreturn_t ipc_handler(int irq, void *dev)
> +{
> +	u32 irq_stat;
> +	u32 data[7];
> +
> +	irq_stat = __raw_readl(ipc_base + IPCMMIS(1));
> +	if (irq_stat & MBOX_MASK(IPC_SLOW_MBOX)) {
> +		__raw_writel(0, ipc_base + IPCMxSEND(IPC_SLOW_MBOX));
> +		complete(&ipc_completion);
> +	}
> +	if (irq_stat & MBOX_MASK(IPC_RX_MBOX)) {
> +		__ipc_rcv(IPC_RX_MBOX, data);
> +		atomic_notifier_call_chain(&ipc_notifier, data[0], data + 1);
> +		__raw_writel(2, ipc_base + IPCMxSEND(IPC_RX_MBOX));
> +	}
> +
> +	return IRQ_HANDLED;
> +}
> +
> +int pl320_ipc_register_notifier(struct notifier_block *nb)
> +{
> +	return atomic_notifier_chain_register(&ipc_notifier, nb);
> +}
> +
> +int pl320_ipc_unregister_notifier(struct notifier_block *nb)
> +{
> +	return atomic_notifier_chain_unregister(&ipc_notifier, nb);
> +}
> +
> +static int __devinit pl320_probe(struct amba_device *adev,
> +				const struct amba_id *id)
> +{
> +	int ret;
> +
> +	ipc_base = ioremap(adev->res.start, resource_size(&adev->res));
> +	if (ipc_base == NULL)
> +		return -ENOMEM;
> +
> +	__raw_writel(0, ipc_base + IPCMxSEND(IPC_FAST_MBOX));
> +	__raw_writel(0, ipc_base + IPCMxSEND(IPC_SLOW_MBOX));
> +
> +	ipc_irq = adev->irq[0];
> +	ret = request_irq(ipc_irq, ipc_handler, 0, dev_name(&adev->dev), NULL);
> +	if (ret < 0)
> +		goto err;
> +
> +	/* Init fast mailbox */
> +	__raw_writel(CHAN_MASK(A9_SOURCE),
> +			ipc_base + IPCMxSOURCE(IPC_FAST_MBOX));
> +	set_destination(M3_SOURCE, IPC_FAST_MBOX);
> +
> +	/* Init slow mailbox */
> +	__raw_writel(CHAN_MASK(A9_SOURCE),
> +			ipc_base + IPCMxSOURCE(IPC_SLOW_MBOX));
> +	__raw_writel(CHAN_MASK(M3_SOURCE),
> +			ipc_base + IPCMxDSET(IPC_SLOW_MBOX));
> +	__raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
> +		     ipc_base + IPCMxMSET(IPC_SLOW_MBOX));
> +
> +	/* Init receive mailbox */
> +	__raw_writel(CHAN_MASK(M3_SOURCE),
> +			ipc_base + IPCMxSOURCE(IPC_RX_MBOX));
> +	__raw_writel(CHAN_MASK(A9_SOURCE),
> +			ipc_base + IPCMxDSET(IPC_RX_MBOX));
> +	__raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
> +		     ipc_base + IPCMxMSET(IPC_RX_MBOX));
> +
> +	return 0;
> +err:
> +	iounmap(ipc_base);
> +	return ret;
> +}
> +
> +static struct amba_id pl320_ids[] = {
> +	{
> +		.id	= 0x00041320,
> +		.mask	= 0x000fffff,
> +	},
> +	{ 0, 0 },
> +};
> +
> +static struct amba_driver pl320_driver = {
> +	.drv = {
> +		.name	= "pl320",
> +	},
> +	.id_table	= pl320_ids,
> +	.probe		= pl320_probe,
> +};
> +
> +static int __init ipc_init(void)
> +{
> +	return amba_driver_register(&pl320_driver);
> +}
> +module_init(ipc_init);
> 


* Re: [RFC PATCH v2 3/6] usb: add runtime pm support for usb port device
From: Lan Tianyu @ 2012-11-14 14:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Lan Tianyu, gregkh, sarah.a.sharp, stern, oneukum, linux-usb,
	Linux PM list
In-Reply-To: <50A39285.80305@gmail.com>

On 2012/11/14 20:45, Lan Tianyu wrote:
> On 2012/11/14 17:49, Rafael J. Wysocki wrote:
>> On Wednesday, November 14, 2012 02:34:37 PM Lan Tianyu wrote:
>>> On 2012-11-14 08:08, Rafael J. Wysocki wrote:
>>>> On Tuesday, November 13, 2012 04:00:02 PM Lan Tianyu wrote:
>>>>> This patch is to add runtime pm callback for usb port device.
>>>>> Set/clear PORT_POWER feature in the resume/suspend callback.
>>>>> Add portnum for struct usb_port to record port number.
>>>>>
>>>>> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
>>>>
>>>> This one looks reasonable to me.  From the PM side
>>>>
>>>> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> Hi Rafael and Alan:
>>>     This patch has a collaboration problem with pm qos. The pm core calls
>>> pm_runtime_get_sync/put(port_dev) when the pm qos flags are changed, and the
>>> port's suspend callback clears the PORT_POWER feature without any check. This
>>> causes the PORT_POWER feature to be cleared every time the pm qos flags
>>> are changed.
>>>
>>>     My idea is to add a check in the port's runtime idle callback.
>>> First check the NO_POWER_OFF flag; if it is set, return. Second, for a port
>>> without a device, suspend the port directly. For a port with a device,
>>> increase the child device's runtime usage count without resuming it and do a
>>> barrier to ensure all suspend processing has finished, then check the child's
>>> runtime status. If it is suspended, suspend the port; otherwise do nothing.
>>>
>>> static int usb_port_runtime_idle(struct device *dev)
>>> {
>>>     struct usb_port *port_dev = to_usb_port(dev);
>>>     int retval = 0;
>>>
>>>     if (dev_pm_qos_flags(&port_dev->dev, PM_QOS_FLAG_NO_POWER_OFF)
>>>             == PM_QOS_FLAGS_ALL)
>>>         return 0;
>>>
>>>     if (!port_dev->child) {
>>>         retval = pm_runtime_suspend(&port_dev->dev);
>>>         if (!retval)
>>>             port_dev->power_on =false;
>>>     }
>>>     else {
>>
>> This usually is written as "} else {" in the kernel code.
>>
>>>         pm_runtime_get_noresume(&port_dev->child->dev);
>>>         pm_runtime_barrier(&port_dev->child->dev);
>>>         if (port_dev->child->dev.power.runtime_status
>>>                 == RPM_SUSPENDED) {
>>>             retval = pm_runtime_suspend(&port_dev->dev);
>>>             if (!retval)
>>>                 port_dev->power_on =false;
>>>         }
>>>         pm_runtime_put_noidle(&port_dev->child->dev);
>>>     }
>>
>> Hmm.  If child->dev is not suspended, then our usage_count should be
>> at least 1, so pm_runtime_suspend(&port_dev->dev) shouldn't actually
>> suspend us.  Isn't that the case?
> No, because the child device is not under port device and so even if
> child->dev is not suspended, port device's usage still can be 0 and
> power off the port.

Maybe I should add pm_runtime_get_noresume(&port_dev->dev) before enabling the
port's runtime pm, as in the following. Then it will work as you said.

@@ -72,6 +109,8 @@ int usb_hub_create_port_device(struct device *intfdev,
  	if (retval)
  		goto error_register;

+	pm_runtime_set_active(&port_dev->dev);
+	pm_runtime_get_noresume(&port_dev->dev); /* new add */
+	pm_runtime_enable(&port_dev->dev);
>>
>> Rafael
>>
>>
>

-- 
Best regards
Tianyu Lan


* Re: [RFC PATCH v2 3/6] usb: add runtime pm support for usb port device
From: Lan Tianyu @ 2012-11-14 15:13 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Lan Tianyu, gregkh, sarah.a.sharp, stern, oneukum, linux-usb,
	Linux PM list
In-Reply-To: <50A39285.80305@gmail.com>

On 2012/11/14 20:45, Lan Tianyu wrote:
> On 2012/11/14 17:49, Rafael J. Wysocki wrote:
>> On Wednesday, November 14, 2012 02:34:37 PM Lan Tianyu wrote:
>>> On 2012-11-14 08:08, Rafael J. Wysocki wrote:
>>>> On Tuesday, November 13, 2012 04:00:02 PM Lan Tianyu wrote:
>>>>> This patch is to add runtime pm callback for usb port device.
>>>>> Set/clear PORT_POWER feature in the resume/suspend callback.
>>>>> Add portnum for struct usb_port to record port number.
>>>>>
>>>>> Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
>>>>
>>>> This one looks reasonable to me.  From the PM side
>>>>
>>>> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>> Hi Rafael and Alan:
>>>     This patch has a collaboration problem with pm qos. The pm core calls
>>> pm_runtime_get_sync/put(port_dev) when the pm qos flags are changed, and the
>>> port's suspend callback clears the PORT_POWER feature without any check. This
>>> causes the PORT_POWER feature to be cleared every time the pm qos flags
>>> are changed.
>>>
>>>     My idea is to add a check in the port's runtime idle callback.
>>> First check the NO_POWER_OFF flag; if it is set, return. Second, for a port
>>> without a device, suspend the port directly. For a port with a device,
>>> increase the child device's runtime usage count without resuming it and do a
>>> barrier to ensure all suspend processing has finished, then check the child's
>>> runtime status. If it is suspended, suspend the port; otherwise do nothing.
>>>
>>> static int usb_port_runtime_idle(struct device *dev)
>>> {
>>>     struct usb_port *port_dev = to_usb_port(dev);
>>>     int retval = 0;
>>>
>>>     if (dev_pm_qos_flags(&port_dev->dev, PM_QOS_FLAG_NO_POWER_OFF)
>>>             == PM_QOS_FLAGS_ALL)
>>>         return 0;
>>>
>>>     if (!port_dev->child) {
>>>         retval = pm_runtime_suspend(&port_dev->dev);
>>>         if (!retval)
>>>             port_dev->power_on =false;
>>>     }
>>>     else {
>>
>> This usually is written as "} else {" in the kernel code.
>>
>>>         pm_runtime_get_noresume(&port_dev->child->dev);
>>>         pm_runtime_barrier(&port_dev->child->dev);
>>>         if (port_dev->child->dev.power.runtime_status
>>>                 == RPM_SUSPENDED) {
>>>             retval = pm_runtime_suspend(&port_dev->dev);
>>>             if (!retval)
>>>                 port_dev->power_on =false;
>>>         }
>>>         pm_runtime_put_noidle(&port_dev->child->dev);
>>>     }
>>
>> Hmm.  If child->dev is not suspended, then our usage_count should be
>> at least 1, so pm_runtime_suspend(&port_dev->dev) shouldn't actually
>> suspend us.  Isn't that the case?
> No, because the child device is not under port device and so even if
> child->dev is not suspended, port device's usage still can be 0 and
> power off the port.
>>
Please ignore the reply above; I may not have understood your comment correctly.
You are right: if the child->dev is not suspended, the usage count should be at
least 1. But what if the child->dev is suspended?

Assume the usb device was suspended and powered off, so the port's usage count
must be 0 since it has been suspended. If pm qos NO_POWER_OFF is set at this time,
the pm core resumes the port and suspends it again, so the usage count goes
0 -> 1 -> 0. The port is then powered off even though the NO_POWER_OFF flag is
set. Does this make sense?

>> Rafael
>>
>>
>

-- 
Best regards
Tianyu Lan

* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-14 16:06 UTC (permalink / raw)
  To: Huang Ying; +Cc: Rafael J. Wysocki, linux-kernel, linux-pm
In-Reply-To: <1352900114.5254.3.camel@yhuang-mobile.sh.intel.com>

On Wed, 14 Nov 2012, Huang Ying wrote:

> > What changes specifically do you mean to be precise?
> 
> I mean the following changes from Alan's email.
> 
>         pm_runtime_set_suspended should fail if dev->power.runtime_auto
>         is clear.
> 
>         pm_runtime_forbid should call pm_runtime_set_active if
>         dev->power.disable_depth > 0.  (This would run into a problem
>         if the parent is suspended and disabled.  Maybe 
>         pm_runtime_forbid should fail when this happens.)
> 
> For the second one, is it possible that the device is really in low
> power state when pm_runtime_forbid is called?  That situation is hard to
> deal with too.

Yes, it is possible.  I don't see what we can do about it.  By
disabling the device, the driver has said that it doesn't want to 
handle any runtime PM callbacks.  Without the driver's help, there 
isn't any good way to bring the device back to full power.

On the other hand, the PM core doesn't know the device's actual power 
state.  All it knows is the value of dev->power.runtime_status.  So it 
doesn't have any way to detect when this problem occurs.
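
To make the two proposed changes concrete, the following user-space model
shows how they might behave. The struct, the toy_ function names, and the
-EACCES error code are assumptions made for illustration, not the PM core's
actual code, and the suspended-parent case is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_rpm_status { TOY_RPM_ACTIVE, TOY_RPM_SUSPENDED };

/* Toy copy of the three dev->power fields discussed in the thread. */
struct toy_power {
	bool runtime_auto;		/* false once runtime PM is forbidden */
	int disable_depth;		/* > 0 while runtime PM is disabled */
	enum toy_rpm_status runtime_status;
};

/* Proposed rule 1: refuse to mark a forbidden device as suspended. */
int toy_set_suspended(struct toy_power *p)
{
	assert(p != NULL);
	if (!p->runtime_auto)
		return -EACCES;		/* error code is an assumption */
	p->runtime_status = TOY_RPM_SUSPENDED;
	return 0;
}

/* Proposed rule 2: forbidding a disabled device also marks it active. */
void toy_forbid(struct toy_power *p)
{
	assert(p != NULL);
	p->runtime_auto = false;
	if (p->disable_depth > 0)
		p->runtime_status = TOY_RPM_ACTIVE;
}
```

The model only ever updates runtime_status, which is the limitation
described above: nothing in it can tell whether the hardware is really at
full power.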

Alan Stern


* RE: [linux-pm] [PATCH 1/1] thermal: cpu cooling: allow module builds
From: R, Durgadoss @ 2012-11-14 16:21 UTC (permalink / raw)
  To: Eduardo Valentin, amit.kachhap@linaro.org
  Cc: eballetbo@gmail.com, linux-acpi@vger.kernel.org,
	linux-pm@lists.linux-foundation.org, Zhang, Rui,
	Linux PM list (linux-pm@vger.kernel.org)
In-Reply-To: <1352906610-549-1-git-send-email-eduardo.valentin@ti.com>


> -----Original Message-----
> From: linux-pm-bounces@lists.linux-foundation.org [mailto:linux-pm-bounces@lists.linux-foundation.org] On Behalf Of Eduardo Valentin
> Sent: Wednesday, November 14, 2012 8:54 PM
> To: amit.kachhap@linaro.org
> Cc: eballetbo@gmail.com; linux-acpi@vger.kernel.org; linux-pm@lists.linux-foundation.org
> Subject: [linux-pm] [PATCH 1/1] thermal: cpu cooling: allow module builds
> 
> As thermal drivers can be built as modules and also
> the thermal framework itself, building cpu cooling
> only as built-in can cause linking errors. For instance:
> * Generic Thermal sysfs driver
> *
> Generic Thermal sysfs driver (THERMAL) [M/n/y/?] m
>   generic cpu cooling support (CPU_THERMAL) [N/y/?] (NEW) y
> 
> with the following driver:
> CONFIG_OMAP_BANDGAP=m

Nice catch Eduardo :-)

Reviewed-by: Durgadoss R <durgadoss.r@intel.com>

Also, Ccing Rui to this e-mail and adding linux-pm.

Thanks,
Durga
> 
> generates:
> ERROR: "cpufreq_cooling_unregister" [drivers/staging/omap-thermal/omap-thermal.ko] undefined!
> ERROR: "cpufreq_cooling_register" [drivers/staging/omap-thermal/omap-thermal.ko] undefined!
> 
> This patch changes cpu cooling driver to allow it
> to be built as module.
> 
> Reported-by: Enric Balletbo i Serra <eballetbo@gmail.com>
> Signed-off-by: Eduardo Valentin <eduardo.valentin@ti.com>
> ---
>  drivers/thermal/Kconfig     |    2 +-
>  include/linux/cpu_cooling.h |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index e1cb6bd..3b03c8b 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -20,7 +20,7 @@ config THERMAL_HWMON
>  	default y
> 
>  config CPU_THERMAL
> -	bool "generic cpu cooling support"
> +	tristate "generic cpu cooling support"
>  	depends on THERMAL && CPU_FREQ
>  	select CPU_FREQ_TABLE
>  	help
> diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h
> index b30cc79c..40b4ef5 100644
> --- a/include/linux/cpu_cooling.h
> +++ b/include/linux/cpu_cooling.h
> @@ -29,7 +29,7 @@
>  #define CPUFREQ_COOLING_START		0
>  #define CPUFREQ_COOLING_STOP		1
> 
> -#ifdef CONFIG_CPU_THERMAL
> +#if defined(CONFIG_CPU_THERMAL) || defined(CONFIG_CPU_THERMAL_MODULE)
>  /**
>   * cpufreq_cooling_register - function to create cpufreq cooling device.
>   * @clip_cpus: cpumask of cpus where the frequency constraints will happen
> --
> 1.7.7.1.488.ge8e1c
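
The header change relies on how Kconfig exposes a tristate symbol: a =y
build defines CONFIG_CPU_THERMAL, while a =m build defines
CONFIG_CPU_THERMAL_MODULE instead. The user-space sketch below shows the
effect of the widened guard. TOY_COOLING_AVAILABLE, toy_cooling_register(),
and the -ENOSYS fallback are made-up names and values for illustration; the
CONFIG_ define is set by hand here to mimic a =m build.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: pretend Kconfig selected CPU_THERMAL=m for this build. */
#define CONFIG_CPU_THERMAL_MODULE 1

#if defined(CONFIG_CPU_THERMAL) || defined(CONFIG_CPU_THERMAL_MODULE)
/* Built-in or modular: the real implementation is available. */
#define TOY_COOLING_AVAILABLE 1
#else
/* Not configured: callers still compile but get an error at run time. */
#define TOY_COOLING_AVAILABLE 0
#endif

/* Hypothetical registration helper gated on the configuration above. */
int toy_cooling_register(int *handle)
{
	assert(handle != NULL);
	if (!TOY_COOLING_AVAILABLE)
		return -ENOSYS;
	*handle = 1;
	return 0;
}
```

With only the old "#ifdef CONFIG_CPU_THERMAL" test, the =m case would fall
into the unavailable branch. In kernel code the same check is commonly
spelled IS_ENABLED(CONFIG_CPU_THERMAL) from <linux/kconfig.h>, which is true
for both =y and =m.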



* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-14 16:42 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <1456950.61QZjXbNpt@vostro.rjw.lan>

On Wed, 14 Nov 2012, Rafael J. Wysocki wrote:

> On Thursday, November 08, 2012 12:07:54 PM Alan Stern wrote:
> > On Thu, 8 Nov 2012, Rafael J. Wysocki wrote:
> 
> [...]
> 
> I'd like to revisit this for a while if you don't mind.

Not at all.

> > Your revised patch does do the job, except for a few problems.  
> > Namely, while local_pci_probe() and pci_device_remove() are running,
> > the device _does_ have a driver.
> 
> Right.
> 
> > This means that local_pci_probe() should not call pm_runtime_get_sync(),
> > for example.  Doing so would invoke the driver's runtime_resume routine
> > before calling the driver's probe routine!
> > 
> > The USB subsystem solves this problem by carefully keeping track of the 
> > state of the device-driver binding:
> > 
> > 	Originally the device is UNBOUND.
> > 
> > 	At the start of the subsystem's probe routine, the state
> > 	changes to BINDING.
> > 
> > 	If the probe succeeds then it changes to BOUND; otherwise
> > 	it goes back to UNBOUND.
> > 
> > 	At the start of the subsystem's remove routine, the state
> > 	changes to UNBINDING.  At the end it goes to UNBOUND.
> > 
> > When the state is anything other than BOUND, the subsystem's runtime PM 
> > routines act as though there is no driver.
> 
> Well, that wouldn't help PCI, because some drivers want to use the
> pm_runtime_* stuff in their .probe() routines and actually expect it to
> work. :-)

PCI could do something like this:

	local_pci_probe() calls pm_runtime_get_sync() twice before
	it changes the binding state to BINDING.  It then calls 
	pm_runtime_put_sync() after the state is BOUND.

	pci_device_remove() calls pm_runtime_get_sync() before it
	changes the binding state to UNBINDING.  It then calls
	pm_runtime_put_sync() twice after the state is UNBOUND.

(Obviously some of those calls could be _get_noresume() or
_put_noidle().)

This has the side effect that when a driver unbinds, it can't leave the 
device in a special low-power state.  The device will always end up in 
the generic low-power state supported by the PCI core.
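
The binding-state tracking described for the USB subsystem can be sketched
as a small state machine. This is a user-space toy under stated assumptions:
the enum, struct toy_dev, and the toy_ functions are illustrative names, not
the USB or PCI core's actual code.

```c
#include <assert.h>
#include <stddef.h>

enum bind_state { UNBOUND, BINDING, BOUND, UNBINDING };

struct toy_dev {
	enum bind_state state;
	int driver_resume_calls;	/* how often the driver callback ran */
};

/* Subsystem-level runtime resume: forward to the driver only when BOUND. */
void toy_runtime_resume(struct toy_dev *dev)
{
	assert(dev != NULL);
	if (dev->state == BOUND)
		dev->driver_resume_calls++;
}

/* Subsystem probe; driver_probe_result is the driver's return value. */
int toy_probe(struct toy_dev *dev, int driver_probe_result)
{
	assert(dev != NULL);
	dev->state = BINDING;
	toy_runtime_resume(dev);	/* driver callback is skipped */
	dev->state = driver_probe_result ? UNBOUND : BOUND;
	return driver_probe_result;
}

/* Subsystem remove: callbacks are skipped again while UNBINDING. */
void toy_remove(struct toy_dev *dev)
{
	assert(dev != NULL);
	dev->state = UNBINDING;
	toy_runtime_resume(dev);
	dev->state = UNBOUND;
}
```

In this model a resume issued during probe or remove never reaches the
driver, which is the property the state tracking is meant to provide.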

> Perhaps we can introduce something like
> 
> pm_runtime_get[_put]_skip_callbacks()
> 
> that would treat the device as though it had the power.no_callbacks flag
> set and use that around the driver's .probe() in the PCI core?

That would prevent the PM core from invoking the PCI subsystem's own 
callback, not just the driver's callback.  So I don't think that's what 
you want.

Alan Stern



* Re: [RFC PATCH v2 3/6] usb: add runtime pm support for usb port device
From: Alan Stern @ 2012-11-14 17:07 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: Rafael J. Wysocki, Lan Tianyu, gregkh, sarah.a.sharp, oneukum,
	linux-usb, Linux PM list
In-Reply-To: <50A3B50D.3000408@gmail.com>

On Wed, 14 Nov 2012, Lan Tianyu wrote:

> >>> Hi Rafael and Alan:
> >>>     This patch has a collaboration problem with pm qos. The pm core calls
> >>> pm_runtime_get_sync/put(port_dev) when the pm qos flags are changed, and the
> >>> port's suspend callback clears the PORT_POWER feature without any check. This
> >>> causes the PORT_POWER feature to be cleared every time the pm qos flags
> >>> are changed.
> >>>
> >>>     My idea is to add a check in the port's runtime idle callback.
> >>> First check the NO_POWER_OFF flag; if it is set, return. Second, for a port
> >>> without a device, suspend the port directly. For a port with a device,
> >>> increase the child device's runtime usage count without resuming it and do a
> >>> barrier to ensure all suspend processing has finished, then check the child's
> >>> runtime status. If it is suspended, suspend the port; otherwise do nothing.

> >> Hmm.  If child->dev is not suspended, then our usage_count should be
> >> at least 1, so pm_runtime_suspend(&port_dev->dev) shouldn't actually
> >> suspend us.  Isn't that the case?
> > No, because the child device is not under port device and so even if
> > child->dev is not suspended, port device's usage still can be 0 and
> > power off the port.
> >>
> Please ignore the reply above; I may not have understood your comment correctly.
> You are right: if the child->dev is not suspended, the usage count should be at
> least 1. But what if the child->dev is suspended?
> 
> Assume the usb device was suspended and powered off, so the port's usage count
> must be 0 since it has been suspended. If pm qos NO_POWER_OFF is set at this time,
> the pm core resumes the port and suspends it again, so the usage count goes
> 0 -> 1 -> 0. The port is then powered off even though the NO_POWER_OFF flag is
> set. Does this make sense?

Suppose, as you say, the USB device is suspended and the port is
powered off.  Now the user wants to set the PM QOS NO_POWER_OFF flag.  
When this happens, the PM core will first do a runtime resume of the
port, then it will set the flag, and then it will do a runtime suspend
of the port.  The port's runtime_suspend callback should see that the
flag is set and return -EAGAIN, leaving the port powered on.
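
The sequence above can be modelled in a few lines of user-space C. This is
only a sketch: struct toy_port and the toy_ functions are hypothetical, with
the two booleans standing in for the PM QoS NO_POWER_OFF flag and the
PORT_POWER feature.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_port {
	bool no_power_off;	/* stands in for PM_QOS_FLAG_NO_POWER_OFF */
	bool powered;		/* stands in for the PORT_POWER feature */
};

/* Models the port's runtime_suspend callback. */
int toy_port_runtime_suspend(struct toy_port *port)
{
	assert(port != NULL);
	if (port->no_power_off)
		return -EAGAIN;	/* refuse: the port stays powered */
	port->powered = false;
	return 0;
}

/* Models the port's runtime_resume callback. */
int toy_port_runtime_resume(struct toy_port *port)
{
	assert(port != NULL);
	port->powered = true;
	return 0;
}

/* Models a flag update: resume, change the flag, try to suspend again. */
int toy_set_no_power_off(struct toy_port *port, bool value)
{
	assert(port != NULL);
	toy_port_runtime_resume(port);
	port->no_power_off = value;
	return toy_port_runtime_suspend(port);
}
```

Because the check sits in the suspend callback, the resume/suspend pair
triggered by the flag change cannot power the port off once the flag is set.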

Alan Stern



* Re: [PATCH] cpuidle: Measure idle state durations with monotonic clock
From: Julius Werner @ 2012-11-14 17:15 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: linux-kernel, Len Brown, Rafael J. Wysocki, Kevin Hilman,
	Andrew Morton, Srivatsa S. Bhat, linux-acpi, linux-pm,
	linuxppc-dev, Deepthi Dharwar, Trinabh Gupta, Sameer Nanda,
	Lists Linaro-dev
In-Reply-To: <50A37ADD.8040000@linaro.org>

> Maybe you can remove all these computations and set the flag
> en_core_tk_irqen for the driver ? That will be handled by the cpuidle
> framework, no ?
>
> Same comment for the intel_idle driver.

Yeah, I thought about that, too. I was a little too afraid of touching
the sched_clock_idle_wakeup_event() parameter that is tied to the
measurement, but it seems to have been vestigial for some time now and
other drivers also just set it to 0. I will whip up another version of
the patch (won't change the PPC further though, if this version works
I would just leave it at that... thanks for testing, Deepthi).


* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-14 19:42 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211141127440.1620-100000@iolanthe.rowland.org>

On Wednesday, November 14, 2012 11:42:33 AM Alan Stern wrote:
> On Wed, 14 Nov 2012, Rafael J. Wysocki wrote:
> 
> > On Thursday, November 08, 2012 12:07:54 PM Alan Stern wrote:
> > > On Thu, 8 Nov 2012, Rafael J. Wysocki wrote:
> > 
> > [...]
> > 
> > I'd like to revisit this for a while if you don't mind.
> 
> Not at all.
> 
> > > Your revised patch does do the job, except for a few problems.  
> > > Namely, while local_pci_probe() and pci_device_remove() are running,
> > > the device _does_ have a driver.
> > 
> > Right.
> > 
> > > This means that local_pci_probe() should not call pm_runtime_get_sync(),
> > > for example.  Doing so would invoke the driver's runtime_resume routine
> > > before calling the driver's probe routine!
> > > 
> > > The USB subsystem solves this problem by carefully keeping track of the 
> > > state of the device-driver binding:
> > > 
> > > 	Originally the device is UNBOUND.
> > > 
> > > 	At the start of the subsystem's probe routine, the state
> > > 	changes to BINDING.
> > > 
> > > 	If the probe succeeds then it changes to BOUND; otherwise
> > > 	it goes back to UNBOUND.
> > > 
> > > 	At the start of the subsystem's remove routine, the state
> > > 	changes to UNBINDING.  At the end it goes to UNBOUND.
> > > 
> > > When the state is anything other than BOUND, the subsystem's runtime PM 
> > > routines act as though there is no driver.
> > 
> > Well, that wouldn't help PCI, because some drivers want to use the
> > pm_runtime_* stuff in their .probe() routines and actually expect it to
> > work. :-)
> 
> PCI could do something like this:
> 
> 	local_pci_probe() calls pm_runtime_get_sync() twice before
> 	it changes the binding state to BINDING.  It then calls 
> 	pm_runtime_put_sync() after the state is BOUND.
> 
> 	pci_device_remove() calls pm_runtime_get_sync() before it
> 	changes the binding state to UNBINDING.  It then calls
> 	pm_runtime_put_sync() twice after the state is UNBOUND.
> 
> (Obviously some of those calls could be _get_noresume() or
> _put_noidle().)
> 
> This has the side effect that when a driver unbinds, it can't leave the 
> device in a special low-power state.  The device will always end up in 
> the generic low-power state supported by the PCI core.

Well, I'm not sure I'd like that.

Let's just go back even one step more and think what we'd like to have in
general terms and then how to implement it. :-)

Suppose that pci_pm_init() calls pm_runtime_enable() for all devices (in
addition to what it does currently).  The runtime PM status of each device is
RPM_SUSPENDED at this point.  Then:

(1) We want to keep the current semantics during probe, i.e. the device should
    (a) be RPM_ACTIVE and (b) have usage_count == (user space usage_count + 1)
    right before ddi->drv->probe() is executed.

(2) We don't want the driver's PM callbacks to be run before ddi->drv->probe().
    There's a question if we want the bus type's PM callbacks to be run at
    that point, but they are not run currently and IMO we shouldn't change
    that.

(4) If ddi->drv->probe() fails, we want the device's status to change to
    RPM_SUSPENDED and its usage_count to be equal to the user space part,
    so that the conditions are the same as before when probing is repeated.

(5) During ddi->drv->probe(), if the driver decrements the device's usage_count,
    which it is supposed to do if it supports runtime PM, then runtime PM
    should work for the device normally going forward (unless the .probe()
    eventually fails, but then the driver is supposed to do the cleanup).

(6) In pci_device_remove() we want the status to change to RPM_SUSPENDED and
    the device's usage_count to be equal to the user space part after
    drv->remove() has run.

(7) We want neither the driver's nor the PCI bus type's PM callbacks
    to be run after drv->remove() has returned (that's what happens now).
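
To make the probe-time conditions concrete, here is a minimal user-space
model of them.  It is only a sketch: struct model_dev, model_bus_probe() and
the two stub probe routines are hypothetical names made up for illustration,
not existing PCI or PM core code, and no PM callbacks are modelled at all.

```c
enum rpm_status { RPM_SUSPENDED, RPM_ACTIVE };

/* Hypothetical stand-in for the runtime PM fields of a device. */
struct model_dev {
	enum rpm_status status;
	int usage_count;	/* starts at the user space part */
};

typedef int (*model_probe_fn)(struct model_dev *dev);

/* Models what the bus type would do around the driver's probe routine. */
int model_bus_probe(struct model_dev *dev, model_probe_fn probe)
{
	int ret;

	/* (1): active with one extra reference, set without any callback */
	dev->usage_count++;
	dev->status = RPM_ACTIVE;

	ret = probe(dev);
	if (ret) {
		/* (4): same conditions as before, so probing can be repeated */
		dev->usage_count--;
		dev->status = RPM_SUSPENDED;
	}
	return ret;
}

/* Stub drivers used only to exercise the model. */
static int model_probe_ok(struct model_dev *dev)
{
	(void)dev;
	return 0;
}

static int model_probe_fail(struct model_dev *dev)
{
	(void)dev;
	return -1;
}
```

With a device that starts suspended with usage_count 1, a failing probe
leaves it suspended with usage_count 1 again, while a successful probe
leaves it active with usage_count 2 until the driver drops its reference.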

> > Perhaps we can introduce something like
> > 
> > pm_runtime_get[_put]_skip_callbacks()
> > 
> > that would treat the device as though it had the power.no_callbacks flag
> > set and use that around the driver's .probe() in the PCI core?
> 
> That would prevent the PM core from invoking the PCI subsystem's own 
> callback, not just the driver's callback.  So I don't think that's what 
> you want.

Actually, looking at the above, I think that's pretty much what I want. :-)

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* [PATCH 1/7] tools/power turbostat: Repair Segmentation fault when using -i option
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Len Brown
In-Reply-To: <1352925804-6746-1-git-send-email-lenb@kernel.org>

From: Len Brown <len.brown@intel.com>

Fix regression caused by commit 8e180f3cb6b7510a3bdf14e16ce87c9f5d86f102
(tools/power turbostat: add [-d MSR#][-D MSR#] options to print counter
deltas)

Signed-off-by: Len Brown <len.brown@intel.com>
---
 tools/power/x86/turbostat/turbostat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 2655ae9..9942dee 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -1594,7 +1594,7 @@ void cmdline(int argc, char **argv)
 
 	progname = argv[0];
 
-	while ((opt = getopt(argc, argv, "+pPSvisc:sC:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "+pPSvi:sc:sC:m:M:")) != -1) {
 		switch (opt) {
 		case 'p':
 			show_core_only++;
-- 
1.8.0


* [PATCH 2/7] tools/power turbostat: graceful fail on garbage input
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Len Brown
In-Reply-To: <39300ffb9b6666714c60735cf854e1280e4e75f4.1352925508.git.len.brown@intel.com>

From: Len Brown <len.brown@intel.com>

When invalid MSRs are specified on the command line,
turbostat should simply print an error and exit.

Signed-off-by: Len Brown <len.brown@intel.com>
---
 tools/power/x86/turbostat/turbostat.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 9942dee..ea095ab 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -206,8 +206,10 @@ int get_msr(int cpu, off_t offset, unsigned long long *msr)
 	retval = pread(fd, msr, sizeof *msr, offset);
 	close(fd);
 
-	if (retval != sizeof *msr)
+	if (retval != sizeof *msr) {
+		fprintf(stderr, "%s offset 0x%zx read failed\n", pathname, offset);
 		return -1;
+	}
 
 	return 0;
 }
@@ -1101,7 +1103,9 @@ void turbostat_loop()
 
 restart:
 	retval = for_all_cpus(get_counters, EVEN_COUNTERS);
-	if (retval) {
+	if (retval < -1) {
+		exit(retval);
+	} else if (retval == -1) {
 		re_initialize();
 		goto restart;
 	}
@@ -1114,7 +1118,9 @@ restart:
 		}
 		sleep(interval_sec);
 		retval = for_all_cpus(get_counters, ODD_COUNTERS);
-		if (retval) {
+		if (retval < -1) {
+			exit(retval);
+		} else if (retval == -1) {
 			re_initialize();
 			goto restart;
 		}
@@ -1126,7 +1132,9 @@ restart:
 		flush_stdout();
 		sleep(interval_sec);
 		retval = for_all_cpus(get_counters, EVEN_COUNTERS);
-		if (retval) {
+		if (retval < -1) {
+			exit(retval);
+		} else if (retval == -1) {
 			re_initialize();
 			goto restart;
 		}
@@ -1545,8 +1553,11 @@ void turbostat_init()
 int fork_it(char **argv)
 {
 	pid_t child_pid;
+	int status;
 
-	for_all_cpus(get_counters, EVEN_COUNTERS);
+	status = for_all_cpus(get_counters, EVEN_COUNTERS);
+	if (status)
+		exit(status);
 	/* clear affinity side-effect of get_counters() */
 	sched_setaffinity(0, cpu_present_setsize, cpu_present_set);
 	gettimeofday(&tv_even, (struct timezone *)NULL);
@@ -1556,7 +1567,6 @@ int fork_it(char **argv)
 		/* child */
 		execvp(argv[0], argv);
 	} else {
-		int status;
 
 		/* parent */
 		if (child_pid == -1) {
@@ -1568,7 +1578,7 @@ int fork_it(char **argv)
 		signal(SIGQUIT, SIG_IGN);
 		if (waitpid(child_pid, &status, 0) == -1) {
 			perror("wait");
-			exit(1);
+			exit(status);
 		}
 	}
 	/*
@@ -1585,7 +1595,7 @@ int fork_it(char **argv)
 
 	fprintf(stderr, "%.6f sec\n", tv_delta.tv_sec + tv_delta.tv_usec/1000000.0);
 
-	return 0;
+	return status;
 }
 
 void cmdline(int argc, char **argv)
-- 
1.8.0


* [PATCH 3/7] tools/power/x86/turbostat: use kernel MSR #defines
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Len Brown, x86
In-Reply-To: <39300ffb9b6666714c60735cf854e1280e4e75f4.1352925508.git.len.brown@intel.com>

From: Len Brown <len.brown@intel.com>

Now that turbostat is built in the kernel tree,
it can share MSR #defines with the kernel.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
---
 arch/x86/include/asm/msr-index.h      | 12 ++++++++++++
 tools/power/x86/turbostat/Makefile    |  1 +
 tools/power/x86/turbostat/turbostat.c | 26 +++++++-------------------
 3 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 7f0edce..c9775a3 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -35,6 +35,7 @@
 #define MSR_IA32_PERFCTR0		0x000000c1
 #define MSR_IA32_PERFCTR1		0x000000c2
 #define MSR_FSB_FREQ			0x000000cd
+#define MSR_NHM_PLATFORM_INFO		0x000000ce
 
 #define MSR_NHM_SNB_PKG_CST_CFG_CTL	0x000000e2
 #define NHM_C3_AUTO_DEMOTE		(1UL << 25)
@@ -55,6 +56,8 @@
 
 #define MSR_OFFCORE_RSP_0		0x000001a6
 #define MSR_OFFCORE_RSP_1		0x000001a7
+#define MSR_NHM_TURBO_RATIO_LIMIT	0x000001ad
+#define MSR_IVT_TURBO_RATIO_LIMIT	0x000001ae
 
 #define MSR_LBR_SELECT			0x000001c8
 #define MSR_LBR_TOS			0x000001c9
@@ -103,6 +106,15 @@
 #define MSR_IA32_MC0_ADDR		0x00000402
 #define MSR_IA32_MC0_MISC		0x00000403
 
+/* C-state Residency Counters */
+#define MSR_PKG_C3_RESIDENCY		0x000003f8
+#define MSR_PKG_C6_RESIDENCY		0x000003f9
+#define MSR_PKG_C7_RESIDENCY		0x000003fa
+#define MSR_CORE_C3_RESIDENCY		0x000003fc
+#define MSR_CORE_C6_RESIDENCY		0x000003fd
+#define MSR_CORE_C7_RESIDENCY		0x000003fe
+#define MSR_PKG_C2_RESIDENCY		0x0000060d
+
 #define MSR_AMD64_MC0_MASK		0xc0010044
 
 #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
diff --git a/tools/power/x86/turbostat/Makefile b/tools/power/x86/turbostat/Makefile
index f856495..51880e8 100644
--- a/tools/power/x86/turbostat/Makefile
+++ b/tools/power/x86/turbostat/Makefile
@@ -1,5 +1,6 @@
 turbostat : turbostat.c
 CFLAGS +=	-Wall
+CFLAGS +=	-I../../../../arch/x86/include/
 
 clean :
 	rm -f turbostat
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index ea095ab..3c063a0 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -20,6 +20,7 @@
  */
 
 #define _GNU_SOURCE
+#include <asm/msr.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <sys/types.h>
@@ -35,19 +36,6 @@
 #include <ctype.h>
 #include <sched.h>
 
-#define MSR_NEHALEM_PLATFORM_INFO	0xCE
-#define MSR_NEHALEM_TURBO_RATIO_LIMIT	0x1AD
-#define MSR_IVT_TURBO_RATIO_LIMIT	0x1AE
-#define MSR_APERF	0xE8
-#define MSR_MPERF	0xE7
-#define MSR_PKG_C2_RESIDENCY	0x60D	/* SNB only */
-#define MSR_PKG_C3_RESIDENCY	0x3F8
-#define MSR_PKG_C6_RESIDENCY	0x3F9
-#define MSR_PKG_C7_RESIDENCY	0x3FA	/* SNB only */
-#define MSR_CORE_C3_RESIDENCY	0x3FC
-#define MSR_CORE_C6_RESIDENCY	0x3FD
-#define MSR_CORE_C7_RESIDENCY	0x3FE	/* SNB only */
-
 char *proc_stat = "/proc/stat";
 unsigned int interval_sec = 5;	/* set with -i interval_sec */
 unsigned int verbose;		/* set with -v */
@@ -674,9 +662,9 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 	t->tsc = rdtsc();	/* we are running on local CPU of interest */
 
 	if (has_aperf) {
-		if (get_msr(cpu, MSR_APERF, &t->aperf))
+		if (get_msr(cpu, MSR_IA32_APERF, &t->aperf))
 			return -3;
-		if (get_msr(cpu, MSR_MPERF, &t->mperf))
+		if (get_msr(cpu, MSR_IA32_MPERF, &t->mperf))
 			return -4;
 	}
 
@@ -742,10 +730,10 @@ void print_verbose_header(void)
 	if (!do_nehalem_platform_info)
 		return;
 
-	get_msr(0, MSR_NEHALEM_PLATFORM_INFO, &msr);
+	get_msr(0, MSR_NHM_PLATFORM_INFO, &msr);
 
 	if (verbose > 1)
-		fprintf(stderr, "MSR_NEHALEM_PLATFORM_INFO: 0x%llx\n", msr);
+		fprintf(stderr, "MSR_NHM_PLATFORM_INFO: 0x%llx\n", msr);
 
 	ratio = (msr >> 40) & 0xFF;
 	fprintf(stderr, "%d * %.0f = %.0f MHz max efficiency\n",
@@ -808,10 +796,10 @@ print_nhm_turbo_ratio_limits:
 	if (!do_nehalem_turbo_ratio_limit)
 		return;
 
-	get_msr(0, MSR_NEHALEM_TURBO_RATIO_LIMIT, &msr);
+	get_msr(0, MSR_NHM_TURBO_RATIO_LIMIT, &msr);
 
 	if (verbose > 1)
-		fprintf(stderr, "MSR_NEHALEM_TURBO_RATIO_LIMIT: 0x%llx\n", msr);
+		fprintf(stderr, "MSR_NHM_TURBO_RATIO_LIMIT: 0x%llx\n", msr);
 
 	ratio = (msr >> 56) & 0xFF;
 	if (ratio)
-- 
1.8.0


* [PATCH 4/7] x86 power: define RAPL MSRs
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Len Brown, x86
In-Reply-To: <39300ffb9b6666714c60735cf854e1280e4e75f4.1352925508.git.len.brown@intel.com>

From: Len Brown <len.brown@intel.com>

The Run Time Average Power Limiting interface
is currently model specific, present on Sandy Bridge
and Ivy Bridge processors.

These #defines correspond to documentation in the latest
"Intel® 64 and IA-32 Architectures Software Developer Manual",
plus some typos in that document corrected.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
---
 arch/x86/include/asm/msr-index.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index c9775a3..7d05006 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -115,6 +115,29 @@
 #define MSR_CORE_C7_RESIDENCY		0x000003fe
 #define MSR_PKG_C2_RESIDENCY		0x0000060d
 
+/* Run Time Average Power Limiting (RAPL) Interface */
+
+#define MSR_RAPL_POWER_UNIT		0x00000606
+
+#define MSR_PKG_POWER_LIMIT		0x00000610
+#define MSR_PKG_ENERGY_STATUS		0x00000611
+#define MSR_PKG_PERF_STATUS		0x00000613
+#define MSR_PKG_POWER_INFO		0x00000614
+
+#define MSR_DRAM_POWER_LIMIT		0x00000618
+#define MSR_DRAM_ENERGY_STATUS		0x00000619
+#define MSR_DRAM_PERF_STATUS		0x0000061b
+#define MSR_DRAM_POWER_INFO		0x0000061c
+
+#define MSR_PP0_POWER_LIMIT		0x00000638
+#define MSR_PP0_ENERGY_STATUS		0x00000639
+#define MSR_PP0_POLICY			0x0000063a
+#define MSR_PP0_PERF_STATUS		0x0000063b
+
+#define MSR_PP1_POWER_LIMIT		0x00000640
+#define MSR_PP1_ENERGY_STATUS		0x00000641
+#define MSR_PP1_POLICY			0x00000642
+
 #define MSR_AMD64_MC0_MASK		0xc0010044
 
 #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
-- 
1.8.0


* [PATCH 5/7] tools: Allow tools to be installed in a user specified location
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Josh Boyer, Len Brown
In-Reply-To: <39300ffb9b6666714c60735cf854e1280e4e75f4.1352925508.git.len.brown@intel.com>

From: Josh Boyer <jwboyer@redhat.com>

When building x86_energy_perf_policy or turbostat within the confines of
a packaging system such as RPM, we need to be able to have it install to
the buildroot and not the root filesystem of the build machine.  This
adds a DESTDIR variable that when set will act as a prefix for the
install location of these tools.

Signed-off-by: Josh Boyer <jwboyer@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>
---
 tools/power/x86/turbostat/Makefile              | 6 ++++--
 tools/power/x86/x86_energy_perf_policy/Makefile | 6 ++++--
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/tools/power/x86/turbostat/Makefile b/tools/power/x86/turbostat/Makefile
index 51880e8..e79f794 100644
--- a/tools/power/x86/turbostat/Makefile
+++ b/tools/power/x86/turbostat/Makefile
@@ -1,3 +1,5 @@
+DESTDIR ?=
+
 turbostat : turbostat.c
 CFLAGS +=	-Wall
 CFLAGS +=	-I../../../../arch/x86/include/
@@ -6,5 +8,5 @@ clean :
 	rm -f turbostat
 
 install :
-	install turbostat /usr/bin/turbostat
-	install turbostat.8 /usr/share/man/man8
+	install turbostat ${DESTDIR}/usr/bin/turbostat
+	install turbostat.8 ${DESTDIR}/usr/share/man/man8
diff --git a/tools/power/x86/x86_energy_perf_policy/Makefile b/tools/power/x86/x86_energy_perf_policy/Makefile
index f458237..971c9ff 100644
--- a/tools/power/x86/x86_energy_perf_policy/Makefile
+++ b/tools/power/x86/x86_energy_perf_policy/Makefile
@@ -1,8 +1,10 @@
+DESTDIR ?=
+
 x86_energy_perf_policy : x86_energy_perf_policy.c
 
 clean :
 	rm -f x86_energy_perf_policy
 
 install :
-	install x86_energy_perf_policy /usr/bin/
-	install x86_energy_perf_policy.8 /usr/share/man/man8/
+	install x86_energy_perf_policy ${DESTDIR}/usr/bin/
+	install x86_energy_perf_policy.8 ${DESTDIR}/usr/share/man/man8/
-- 
1.8.0


* [PATCH 6/7] tools/power turbostat: prevent infinite loop on migration error path
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Len Brown
In-Reply-To: <39300ffb9b6666714c60735cf854e1280e4e75f4.1352925508.git.len.brown@intel.com>

From: Len Brown <len.brown@intel.com>

Turbostat assumed that if it can't migrate to a CPU, then the CPU
must have gone off-line and turbostat should re-initialize
with the new topology.

But if turbostat cannot migrate because it is restricted by
a cpuset, then it will fail to migrate even after re-initialization,
resulting in an infinite loop.

Spit out a warning when we can't migrate
and endure only 2 re-initialize cycles in a row
before giving up and exiting.
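
The control flow can be sketched outside turbostat as follows.  This is an
illustration only: sample_with_bounded_restart(), struct fake_state and the
fake_* stubs are hypothetical names, and the error convention (0 for success,
-1 for a failed migration, below -1 for anything else) is taken from the
return values used in the diff below.

```c
/* Hypothetical sampler type: 0 on success, -1 if migration failed,
 * below -1 for any other error. */
typedef int (*sample_fn)(void *ctx);

/* Returns 0 once a sample succeeds, otherwise the error to exit with. */
int sample_with_bounded_restart(sample_fn sample, void (*reinit)(void *),
				void *ctx)
{
	int restarted = 0;
	int retval;

	for (;;) {
		restarted++;
		retval = sample(ctx);
		if (retval == 0)
			return 0;
		if (retval < -1)
			return retval;	/* hard error: give up at once */
		if (restarted > 1)
			return retval;	/* failed again right after a re-init */
		reinit(ctx);
	}
}

/* Stubs that fail a fixed number of times, used to exercise the loop. */
struct fake_state {
	int failures_left;
	int samples;
	int reinits;
};

static int fake_sample(void *ctx)
{
	struct fake_state *s = ctx;

	s->samples++;
	if (s->failures_left > 0) {
		s->failures_left--;
		return -1;
	}
	return 0;
}

static void fake_reinit(void *ctx)
{
	struct fake_state *s = ctx;

	s->reinits++;
}
```

A sampler that fails once recovers after a single re-initialization; one that
keeps failing is abandoned on the second consecutive failure instead of
looping forever.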

Signed-off-by: Len Brown <len.brown@intel.com>
---
 tools/power/x86/turbostat/turbostat.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 3c063a0..77e76b1 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -656,8 +656,10 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 {
 	int cpu = t->cpu_id;
 
-	if (cpu_migrate(cpu))
+	if (cpu_migrate(cpu)) {
+		fprintf(stderr, "Could not migrate to CPU %d\n", cpu);
 		return -1;
+	}
 
 	t->tsc = rdtsc();	/* we are running on local CPU of interest */
 
@@ -1088,15 +1090,22 @@ int mark_cpu_present(int cpu)
 void turbostat_loop()
 {
 	int retval;
+	int restarted = 0;
 
 restart:
+	restarted++;
+
 	retval = for_all_cpus(get_counters, EVEN_COUNTERS);
 	if (retval < -1) {
 		exit(retval);
 	} else if (retval == -1) {
+		if (restarted > 1) {
+			exit(retval);
+		}
 		re_initialize();
 		goto restart;
 	}
+	restarted = 0;
 	gettimeofday(&tv_even, (struct timezone *)NULL);
 
 	while (1) {
-- 
1.8.0


* [PATCH 7/7] tools/power turbostat: print Watts
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel, Len Brown
In-Reply-To: <39300ffb9b6666714c60735cf854e1280e4e75f4.1352925508.git.len.brown@intel.com>

From: Len Brown <len.brown@intel.com>

Intel's Sandy Bridge and Ivy Bridge processor generations support RAPL (Run-Time-Average-Power-Limiting).
Per the Intel SDM (Intel® 64 and IA-32 Architectures Software Developer Manual)
RAPL provides hardware power information and control via MSRs (Model Specific Registers).
RAPL MSRs are designed primarily as a method to implement power capping.
However, even if power capping is not enabled, the RAPL registers
are useful for monitoring system power and operation.

Turbostat now displays the information provided by reading RAPL MSRs.
As always, turbostat never writes any MSRs.

turbostat's default display now includes Watts for hardware that
supports RAPL:

[root@sandy]# turbostat
cor CPU    %c0  GHz  TSC    %c1    %c3    %c6    %c7   %pc2   %pc3   %pc6   %pc7  Pkg_W  Cor_W GFX_W
          0.07 0.80 2.29   0.13   0.00   0.00  99.80   0.43   0.00   0.72  98.16   3.49   0.12  0.14
  0   0   0.14 0.80 2.29   0.12   0.00   0.00  99.74   0.43   0.00   0.72  98.16   3.49   0.12  0.14
  0   4   0.04 0.80 2.29   0.22
  1   1   0.06 0.80 2.29   0.08   0.00   0.00  99.86
  1   5   0.03 0.80 2.29   0.10
  2   2   0.17 0.80 2.29   0.14   0.00   0.00  99.69
  2   6   0.03 0.79 2.29   0.28
  3   3   0.03 0.80 2.29   0.07   0.00   0.00  99.90
  3   7   0.04 0.80 2.29   0.06

The Pkg_W column shows Watts for each package (socket) in the system.
On multi-socket systems, the system summary on the 1st row shows the total.

The Cor_W column shows Watts due to processor cores.
Cor_W is included in Pkg_W.

The optional GFX_W column shows Watts due to the graphics "un-core".
GFX_W is included in Pkg_W.

The optional PKG_% column shows the % of time in the measurement interval that
RAPL power limiting is in effect.

Note that the RAPL energy counters have some limitations.

First, hardware updates the counters about once every millisecond.
This is fine for typical turbostat measurement intervals > 1 sec.
However, when turbostat is used to measure events that approach
1ms, the counters are less useful.

Second, the energy counters are 32 bits long and subject to wrapping.
For example, the counter increments in 15 micro-Joule units on my
local server, and the part could (in theory) consume energy at
its TDP specification of 130 Watts.  Here the 32-bit Joule counter
could wrap in as little as 8 minutes.
Turbostat detects and handles up to 1 counter overflow per interval.
But when the measurement interval exceeds the guaranteed
counter range, we can't detect if more than 1 overflow occurred.
So in this case turbostat indicates that the results are
in question by replacing the fractional part of the result
with "**":

Pkg_W  Cor_W GFX_W
  3**    0**   0**
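
The arithmetic behind the 8 minute figure and the single-overflow handling
can be sketched as below.  This is an illustration, not the code in the
patch: rapl_counter_range_sec() and rapl_delta32() are hypothetical names,
and the range uses 2^32 where the patch uses 0xFFFFFFFF, which makes no
practical difference.

```c
#include <stdint.h>

/* Seconds until a 32-bit energy counter can wrap at a given power draw. */
double rapl_counter_range_sec(double joules_per_count, double watts)
{
	return 4294967296.0 * joules_per_count / watts;
}

/* Counts elapsed between two reads, correct across at most one wrap. */
uint32_t rapl_delta32(uint32_t newer, uint32_t older)
{
	return newer - older;	/* unsigned arithmetic is modulo 2^32 */
}
```

With 15 micro-Joule units and 130 Watts the range works out to roughly
495 seconds, i.e. a little over 8 minutes.  If the counter wraps more than
once between two reads, the delta is silently too small, which is why longer
intervals are flagged with "**".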

Third, the RAPL counters are energy (Joule) counters -- they sum up
weighted events in the package to estimate energy consumed.  They are
not analog power (Watt) meters.  In practice, they tend to under-count
because they don't cover every possible use of energy in the package.
Also, the accuracy of the RAPL counters will vary between product generations,
and between SKUs in the same product generation.
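
Because the counters report energy, the Watts shown are an average over the
measurement interval: the counter delta is scaled to Joules and divided by
the elapsed time.  A minimal sketch of that conversion, with
rapl_avg_watts() as a hypothetical name:

```c
/* Average Watts over an interval, from an energy-counter delta. */
double rapl_avg_watts(unsigned int delta_counts, double joules_per_count,
		      double interval_sec)
{
	return delta_counts * joules_per_count / interval_sec;
}
```

For example, a delta of 1,000,000 counts at 15 micro-Joules per count over a
5 second interval is 15 Joules, or about 3 Watts on average.  Short power
spikes inside the interval are averaged away.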

turbostat's -v option now displays per-Package Thermal Design Power (TDP).
This is the specification for the part's maximum power consumption.
e.g. on a 2-package SNB-Xeon system:

cpu0: 130.00 Watts Pkg Thermal Design Spec
cpu8: 130.00 Watts Pkg Thermal Design Spec

Finally, turbostat's -R option enables decoding and output of all RAPL registers
on turbostat startup.

Increment turbostat version number to 3.

Signed-off-by: Len Brown <len.brown@intel.com>
---
 tools/power/x86/turbostat/turbostat.8 |  35 ++--
 tools/power/x86/turbostat/turbostat.c | 339 +++++++++++++++++++++++++++++++++-
 2 files changed, 350 insertions(+), 24 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.8 b/tools/power/x86/turbostat/turbostat.8
index e4d0690..8094caa 100644
--- a/tools/power/x86/turbostat/turbostat.8
+++ b/tools/power/x86/turbostat/turbostat.8
@@ -31,6 +31,8 @@ The \fB-S\fP option limits output to a 1-line System Summary for each interval.
 .PP
 The \fB-v\fP option increases verbosity.
 .PP
+The \fB-R\fP option enables verbose RAPL register decoding on startup.
+.PP
 The \fB-s\fP option prints the SMI counter, equivalent to "-c 0x34"
 .PP
 The \fB-c MSR#\fP option includes the delta of the specified 32-bit MSR counter.
@@ -58,6 +60,10 @@ Note that multiple CPUs per core indicate support for Intel(R) Hyper-Threading T
 \fBTSC\fP average GHz that the TSC ran during the entire interval.
 \fB%c1, %c3, %c6, %c7\fP show the percentage residency in hardware core idle states.
 \fB%pc2, %pc3, %pc6, %pc7\fP percentage residency in hardware package idle states.
+\fBPkg_W\fP Watts consumed by the whole package.
+\fBCor_W\fP Watts consumed by the core part of the package.
+\fBGFX_W\fP Watts consumed by the Graphics part of the package.
+\fBPKG_%\fP percent of the interval that RAPL throttling was active.
 .fi
 .PP
 .SH EXAMPLE
@@ -66,25 +72,22 @@ Without any parameters, turbostat prints out counters ever 5 seconds.
 for turbostat to fork).
 
 The first row of statistics is a summary for the entire system.
-Note that the summary is a weighted average.
+For residency % columns, the summary is a weighted average.
+For Watts columns, the summary is a system total.
 Subsequent rows show per-CPU statistics.
 
 .nf
-[root@x980]# ./turbostat
-cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
-          0.09 1.62 3.38   1.83   0.32  97.76   1.26  83.61
-  0   0   0.15 1.62 3.38  10.23   0.05  89.56   1.26  83.61
-  0   6   0.05 1.62 3.38  10.34
-  1   2   0.03 1.62 3.38   0.07   0.05  99.86
-  1   8   0.03 1.62 3.38   0.06
-  2   4   0.21 1.62 3.38   0.10   1.49  98.21
-  2  10   0.02 1.62 3.38   0.29
-  8   1   0.04 1.62 3.38   0.04   0.08  99.84
-  8   7   0.01 1.62 3.38   0.06
-  9   3   0.53 1.62 3.38   0.10   0.20  99.17
-  9   9   0.02 1.62 3.38   0.60
- 10   5   0.01 1.62 3.38   0.02   0.04  99.92
- 10  11   0.02 1.62 3.38   0.02
+[root@sandy]# ./turbostat
+cor CPU    %c0  GHz  TSC    %c1    %c3    %c6    %c7   %pc2   %pc3   %pc6   %pc7  Pkg_W  Cor_W GFX_W
+          0.07 0.80 2.29   0.13   0.00   0.00  99.80   0.43   0.00   0.72  98.16   3.49   0.12  0.14
+  0   0   0.14 0.80 2.29   0.12   0.00   0.00  99.74   0.43   0.00   0.72  98.16   3.49   0.12  0.14
+  0   4   0.04 0.80 2.29   0.22
+  1   1   0.06 0.80 2.29   0.08   0.00   0.00  99.86
+  1   5   0.03 0.80 2.29   0.10
+  2   2   0.17 0.80 2.29   0.14   0.00   0.00  99.69
+  2   6   0.03 0.79 2.29   0.28
+  3   3   0.03 0.80 2.29   0.07   0.00   0.00  99.90
+  3   7   0.04 0.80 2.29   0.06
 .fi
 .SH SUMMARY EXAMPLE
 The "-s" option prints the column headers just once,
diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index 77e76b1..7315c41 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -39,6 +39,7 @@
 char *proc_stat = "/proc/stat";
 unsigned int interval_sec = 5;	/* set with -i interval_sec */
 unsigned int verbose;		/* set with -v */
+unsigned int rapl_verbose;	/* set with -R */
 unsigned int summary_only;	/* set with -s */
 unsigned int skip_c0;
 unsigned int skip_c1;
@@ -62,6 +63,17 @@ unsigned int show_cpu;
 unsigned int show_pkg_only;
 unsigned int show_core_only;
 char *output_buffer, *outp;
+unsigned int has_rapl;
+unsigned int do_rapl;
+double rapl_power_units, rapl_energy_units, rapl_time_units;
+double rapl_joule_counter_range;
+
+#define RAPL_PKG	(1 << 0)
+#define RAPL_CORES	(1 << 1)
+#define RAPL_GFX	(1 << 2)
+#define RAPL_DRAM	(1 << 3)
+#define RAPL_PKG_PERF_STATUS	(1 << 4)
+#define RAPL_DRAM_PERF_STATUS	(1 << 5)
 
 int aperf_mperf_unstable;
 int backwards_count;
@@ -98,6 +110,13 @@ struct pkg_data {
 	unsigned long long pc6;
 	unsigned long long pc7;
 	unsigned int package_id;
+	unsigned int energy_pkg;	/* MSR_PKG_ENERGY_STATUS */
+	unsigned int energy_dram;	/* MSR_DRAM_ENERGY_STATUS */
+	unsigned int energy_cores;	/* MSR_PP0_ENERGY_STATUS */
+	unsigned int energy_gfx;	/* MSR_PP1_ENERGY_STATUS */
+	unsigned int rapl_pkg_perf_status;	/* MSR_PKG_PERF_STATUS */
+	unsigned int rapl_dram_perf_status;	/* MSR_DRAM_PERF_STATUS */
+
 } *package_even, *package_odd;
 
 #define ODD_COUNTERS thread_odd, core_odd, package_odd
@@ -244,6 +263,19 @@ void print_header(void)
 	if (do_snb_cstates)
 		outp += sprintf(outp, "   %%pc7");
 
+	if (do_rapl & RAPL_PKG)
+		outp += sprintf(outp, "  Pkg_W");
+	if (do_rapl & RAPL_CORES)
+		outp += sprintf(outp, "  Cor_W");
+	if (do_rapl & RAPL_GFX)
+		outp += sprintf(outp, " GFX_W");
+	if (do_rapl & RAPL_DRAM)
+		outp += sprintf(outp, " RAM_W");
+	if (do_rapl & RAPL_PKG_PERF_STATUS)
+		outp += sprintf(outp, " PKG_%%");
+	if (do_rapl & RAPL_DRAM_PERF_STATUS)
+		outp += sprintf(outp, " RAM_%%");
+
 	outp += sprintf(outp, "\n");
 }
 
@@ -281,6 +313,12 @@ int dump_counters(struct thread_data *t, struct core_data *c,
 		fprintf(stderr, "pc3: %016llX\n", p->pc3);
 		fprintf(stderr, "pc6: %016llX\n", p->pc6);
 		fprintf(stderr, "pc7: %016llX\n", p->pc7);
+		fprintf(stderr, "Joules PKG: %0X\n", p->energy_pkg);
+		fprintf(stderr, "Joules COR: %0X\n", p->energy_cores);
+		fprintf(stderr, "Joules GFX: %0X\n", p->energy_gfx);
+		fprintf(stderr, "Joules RAM: %0X\n", p->energy_dram);
+		fprintf(stderr, "Throttle PKG: %0X\n", p->rapl_pkg_perf_status);
+		fprintf(stderr, "Throttle RAM: %0X\n", p->rapl_dram_perf_status);
 	}
 	return 0;
 }
@@ -290,14 +328,20 @@ int dump_counters(struct thread_data *t, struct core_data *c,
  * package: "pk" 2 columns %2d
  * core: "cor" 3 columns %3d
  * CPU: "CPU" 3 columns %3d
+ * Pkg_W: %6.2
+ * Cor_W: %6.2
+ * GFX_W: %5.2
+ * RAM_W: %5.2
  * GHz: "GHz" 3 columns %3.2
  * TSC: "TSC" 3 columns %3.2
  * percentage " %pc3" %6.2
+ * Perf Status percentage: %5.2
  */
 int format_counters(struct thread_data *t, struct core_data *c,
 	struct pkg_data *p)
 {
 	double interval_float;
+	char *fmt5, *fmt6;
 
 	 /* if showing only 1st thread in core and this isn't one, bail out */
 	if (show_core_only && !(t->flags & CPU_IS_FIRST_THREAD_IN_CORE))
@@ -337,7 +381,6 @@ int format_counters(struct thread_data *t, struct core_data *c,
 		if (show_cpu)
 			outp += sprintf(outp, " %3d", t->cpu_id);
 	}
-
 	/* %c0 */
 	if (do_nhm_cstates) {
 		if (show_pkg || show_core || show_cpu)
@@ -414,6 +457,31 @@ int format_counters(struct thread_data *t, struct core_data *c,
 		outp += sprintf(outp, " %6.2f", 100.0 * p->pc6/t->tsc);
 	if (do_snb_cstates)
 		outp += sprintf(outp, " %6.2f", 100.0 * p->pc7/t->tsc);
+
+	/*
+ 	 * If measurement interval exceeds minimum RAPL Joule Counter range,
+ 	 * indicate that results are suspect by printing "**" in fraction place.
+ 	 */
+	if (interval_float < rapl_joule_counter_range) {
+		fmt5 = " %5.2f";
+		fmt6 = " %6.2f";
+	} else {
+		fmt5 = " %3.0f**";
+		fmt6 = " %4.0f**";
+	}
+
+	if (do_rapl & RAPL_PKG)
+		outp += sprintf(outp, fmt6, p->energy_pkg * rapl_energy_units / interval_float);
+	if (do_rapl & RAPL_CORES)
+		outp += sprintf(outp, fmt6, p->energy_cores * rapl_energy_units / interval_float);
+	if (do_rapl & RAPL_GFX)
+		outp += sprintf(outp, fmt5, p->energy_gfx * rapl_energy_units / interval_float); 
+	if (do_rapl & RAPL_DRAM)
+		outp += sprintf(outp, fmt5, p->energy_dram * rapl_energy_units / interval_float);
+	if (do_rapl & RAPL_PKG_PERF_STATUS )
+		outp += sprintf(outp, fmt5, 100.0 * p->rapl_pkg_perf_status * rapl_time_units / interval_float);
+	if (do_rapl & RAPL_DRAM_PERF_STATUS )
+		outp += sprintf(outp, fmt5, 100.0 * p->rapl_dram_perf_status * rapl_time_units / interval_float);
 done:
 	outp += sprintf(outp, "\n");
 
@@ -449,6 +517,13 @@ void format_all_counters(struct thread_data *t, struct core_data *c, struct pkg_
 	for_all_cpus(format_counters, t, c, p);
 }
 
+#define DELTA_WRAP32(new, old)			\
+	if (new > old) {			\
+		old = new - old;		\
+	} else {				\
+		old = 0x100000000 + new - old;	\
+	}
+
 void
 delta_package(struct pkg_data *new, struct pkg_data *old)
 {
@@ -456,6 +531,13 @@ delta_package(struct pkg_data *new, struct pkg_data *old)
 	old->pc3 = new->pc3 - old->pc3;
 	old->pc6 = new->pc6 - old->pc6;
 	old->pc7 = new->pc7 - old->pc7;
+
+	DELTA_WRAP32(new->energy_pkg, old->energy_pkg);
+	DELTA_WRAP32(new->energy_cores, old->energy_cores);
+	DELTA_WRAP32(new->energy_gfx, old->energy_gfx);
+	DELTA_WRAP32(new->energy_dram, old->energy_dram);
+	DELTA_WRAP32(new->rapl_pkg_perf_status, old->rapl_pkg_perf_status);
+	DELTA_WRAP32(new->rapl_dram_perf_status, old->rapl_dram_perf_status);
 }
 
 void
@@ -575,6 +657,13 @@ void clear_counters(struct thread_data *t, struct core_data *c, struct pkg_data
 	p->pc3 = 0;
 	p->pc6 = 0;
 	p->pc7 = 0;
+
+	p->energy_pkg = 0;
+	p->energy_dram = 0;
+	p->energy_cores = 0;
+	p->energy_gfx = 0;
+	p->rapl_pkg_perf_status = 0;
+	p->rapl_dram_perf_status = 0;
 }
 int sum_counters(struct thread_data *t, struct core_data *c,
 	struct pkg_data *p)
@@ -604,6 +693,13 @@ int sum_counters(struct thread_data *t, struct core_data *c,
 	average.packages.pc6 += p->pc6;
 	average.packages.pc7 += p->pc7;
 
+	average.packages.energy_pkg += p->energy_pkg;
+	average.packages.energy_dram += p->energy_dram;
+	average.packages.energy_cores += p->energy_cores;
+	average.packages.energy_gfx += p->energy_gfx;
+
+	average.packages.rapl_pkg_perf_status += p->rapl_pkg_perf_status;
+	average.packages.rapl_dram_perf_status += p->rapl_dram_perf_status;
 	return 0;
 }
 /*
@@ -655,6 +751,7 @@ static unsigned long long rdtsc(void)
 int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 {
 	int cpu = t->cpu_id;
+	unsigned long long msr;
 
 	if (cpu_migrate(cpu)) {
 		fprintf(stderr, "Could not migrate to CPU %d\n", cpu);
@@ -671,9 +768,9 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 	}
 
 	if (extra_delta_offset32) {
-		if (get_msr(cpu, extra_delta_offset32, &t->extra_delta32))
+		if (get_msr(cpu, extra_delta_offset32, &msr))
 			return -5;
-		t->extra_delta32 &= 0xFFFFFFFF;
+		t->extra_delta32 = msr & 0xFFFFFFFF;
 	}
 
 	if (extra_delta_offset64)
@@ -681,9 +778,9 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 			return -5;
 
 	if (extra_msr_offset32) {
-		if (get_msr(cpu, extra_msr_offset32, &t->extra_msr32))
+		if (get_msr(cpu, extra_msr_offset32, &msr))
 			return -5;
-		t->extra_msr32 &= 0xFFFFFFFF;
+		t->extra_msr32 = msr & 0xFFFFFFFF;
 	}
 
 	if (extra_msr_offset64)
@@ -721,6 +818,36 @@ int get_counters(struct thread_data *t, struct core_data *c, struct pkg_data *p)
 		if (get_msr(cpu, MSR_PKG_C7_RESIDENCY, &p->pc7))
 			return -12;
 	}
+	if (do_rapl & RAPL_PKG) {
+		if (get_msr(cpu, MSR_PKG_ENERGY_STATUS, &msr))
+			return -13;
+		p->energy_pkg = msr & 0xFFFFFFFF;
+	}
+	if (do_rapl & RAPL_CORES) {
+		if (get_msr(cpu, MSR_PP0_ENERGY_STATUS, &msr))
+			return -14;
+		p->energy_cores = msr & 0xFFFFFFFF;
+	}
+	if (do_rapl & RAPL_DRAM) {
+		if (get_msr(cpu, MSR_DRAM_ENERGY_STATUS, &msr))
+			return -15;
+		p->energy_dram = msr & 0xFFFFFFFF;
+	}
+	if (do_rapl & RAPL_GFX) {
+		if (get_msr(cpu, MSR_PP1_ENERGY_STATUS, &msr))
+			return -16;
+		p->energy_gfx = msr & 0xFFFFFFFF;
+	}
+	if (do_rapl & RAPL_PKG_PERF_STATUS) {
+		if (get_msr(cpu, MSR_PKG_PERF_STATUS, &msr))
+			return -16;
+		p->rapl_pkg_perf_status = msr & 0xFFFFFFFF;
+	}
+	if (do_rapl & RAPL_DRAM_PERF_STATUS) {
+		if (get_msr(cpu, MSR_DRAM_PERF_STATUS, &msr))
+			return -16;
+		p->rapl_dram_perf_status = msr & 0xFFFFFFFF;
+	}
 	return 0;
 }
 
@@ -1204,6 +1331,194 @@ int has_ivt_turbo_ratio_limit(unsigned int family, unsigned int model)
 	}
 }
 
+#define	RAPL_POWER_GRANULARITY	0x7FFF	/* 15 bit power granularity */
+#define	RAPL_TIME_GRANULARITY	0x3F /* 6 bit time granularity */
+
+/*
+ * rapl_probe()
+ *
+ * sets has_rapl
+ */
+void rapl_probe(unsigned int family, unsigned int model)
+{
+	unsigned long long msr;
+	double tdp;
+
+	if (!genuine_intel)
+		return;
+
+	if (family != 6)
+		return;
+
+	switch (model) {
+	case 0x2A:
+	case 0x3A:
+		has_rapl = RAPL_PKG | RAPL_CORES | RAPL_GFX;
+		break;
+	case 0x2D:
+	case 0x3E:
+		has_rapl = RAPL_PKG | RAPL_CORES | RAPL_PKG_PERF_STATUS;
+		break;
+	default:
+		return;
+	}
+
+	/* units on package 0, verify later other packages match */
+	if (get_msr(0, MSR_RAPL_POWER_UNIT, &msr))
+		return;
+
+	rapl_power_units = 1.0 / (1 << (msr & 0xF));
+	rapl_energy_units = 1.0 / (1 << (msr >> 8 & 0x1F));
+	rapl_time_units = 1.0 / (1 << (msr >> 16 & 0xF));
+
+	/* get TDP to determine energy counter range */
+	if (get_msr(0, MSR_PKG_POWER_INFO, &msr))
+		return;
+
+	tdp = ((msr >> 0) & RAPL_POWER_GRANULARITY) * rapl_power_units;
+
+	rapl_joule_counter_range = 0xFFFFFFFF * rapl_energy_units / tdp;
+
+	if (verbose || rapl_verbose)
+		fprintf(stderr, "%.0f sec RAPL Joule Counter Range\n", rapl_joule_counter_range);
+
+	return;
+}
+	
+void print_power_limit_msr(int cpu, unsigned long long msr, char *label)
+{
+	fprintf(stderr, "cpu%d: %s: %f Watts %sabled, %f sec clamp %sabled\n",
+		cpu, label,
+		((msr >> 0) & 0x7FFF) * rapl_power_units,
+		((msr >> 15) & 1) ? "EN" : "DIS",
+		((msr >> 17) & 0x7F) * rapl_time_units,
+		((msr >> 16) & 1) ? "EN" : "DIS");
+
+	return;
+}
+
+int print_rapl(struct thread_data *t, struct core_data *c, struct pkg_data *p)
+{
+	unsigned long long msr;
+	int cpu;
+	double local_rapl_power_units, local_rapl_energy_units, local_rapl_time_units;
+
+	if (!has_rapl)
+		return 0;
+
+	/* RAPL counters are per package, so print only for 1st thread/package */
+	if (!(t->flags & CPU_IS_FIRST_THREAD_IN_CORE) || !(t->flags & CPU_IS_FIRST_CORE_IN_PACKAGE))
+		return 0;
+
+	cpu = t->cpu_id;
+
+	if (get_msr(cpu, MSR_RAPL_POWER_UNIT, &msr))
+		return -1;
+
+	local_rapl_power_units = 1.0 / (1 << (msr & 0xF));
+	local_rapl_energy_units = 1.0 / (1 << (msr >> 8 & 0x1F));
+	local_rapl_time_units = 1.0 / (1 << (msr >> 16 & 0xF));
+
+	if (local_rapl_power_units != rapl_power_units)
+		fprintf(stderr, "cpu%d, ERROR: Power units mis-match\n", cpu);
+	if (local_rapl_energy_units != rapl_energy_units)
+		fprintf(stderr, "cpu%d, ERROR: Energy units mis-match\n", cpu);
+	if (local_rapl_time_units != rapl_time_units)
+		fprintf(stderr, "cpu%d, ERROR: Time units mis-match\n", cpu);
+
+	if (verbose > 1 || rapl_verbose) {
+		fprintf(stderr, "cpu%d: MSR_RAPL_POWER_UNIT: 0x%08llx "
+			"%f Watts, %f Joules, %f Seconds\n", cpu, msr,
+			local_rapl_power_units, local_rapl_energy_units, local_rapl_time_units);
+	}
+	if (has_rapl & RAPL_PKG) {
+		double tdp;
+
+		if (get_msr(cpu, MSR_PKG_POWER_INFO, &msr))
+			return -5;
+
+		tdp = ((msr >>  0) & RAPL_POWER_GRANULARITY) * rapl_power_units;
+
+		fprintf(stderr, "cpu%d: %.2f Watts Pkg Thermal Design Spec\n",
+			cpu, tdp);
+
+		if (verbose > 1 || rapl_verbose) {
+			fprintf(stderr, "cpu%d: MSR_PKG_POWER_INFO: 0x%016llx\n", cpu, msr);
+			fprintf(stderr, "%.2f Watts Pkg RAPL Minimum\n",
+				((msr >> 16) & RAPL_POWER_GRANULARITY) * rapl_power_units);
+			fprintf(stderr, "%.2f Watts Pkg RAPL Maximum\n",
+				((msr >> 32) & RAPL_POWER_GRANULARITY) * rapl_power_units);
+			fprintf(stderr, "%f Sec. Maximum Pkg RAPL Time Window\n",
+				((msr >> 48) & RAPL_TIME_GRANULARITY) * rapl_time_units);
+
+			if (get_msr(cpu, MSR_PKG_POWER_LIMIT, &msr))
+				return -9;
+			fprintf(stderr, "cpu%d: MSR_PKG_POWER_LIMIT: %llx %sLOCKED\n",
+					cpu, msr, (msr >> 63) & 1 ? "": "UN-");
+			print_power_limit_msr(cpu, msr, "PKG Limit #1");
+			fprintf(stderr, "cpu%d: PKG Limit #2: %f Watts %sabled, %f sec clamp %sabled\n",
+					cpu,
+					((msr >> 32) & 0x7FFF) * rapl_power_units,
+					((msr >> 47) & 1) ? "EN" : "DIS",
+					((msr >> 49) & 0x7F) * rapl_time_units,
+					((msr >> 48) & 1) ? "EN" : "DIS");
+		}
+	}
+
+	if (has_rapl & RAPL_DRAM) {
+		if (get_msr(cpu, MSR_DRAM_POWER_INFO, &msr))
+			return -6;
+
+		fprintf(stderr, "cpu%d: %.2f Watts DRAM Thermal Design Spec\n", cpu,
+			((msr >>  0) & RAPL_POWER_GRANULARITY) * rapl_power_units);
+
+		if (verbose > 1 || rapl_verbose) {
+			fprintf(stderr, "cpu%d: MSR_DRAM_POWER_INFO: 0x%016llx\n", cpu, msr);
+			fprintf(stderr, "%.2f Watts DRAM RAPL Minimum\n",
+				((msr >> 16) & RAPL_POWER_GRANULARITY) * rapl_power_units);
+			fprintf(stderr, "%.2f Watts DRAM RAPL Maximum\n",
+				((msr >> 32) & RAPL_POWER_GRANULARITY) * rapl_power_units);
+			fprintf(stderr, "%f Sec. Maximum DRAM RAPL Time Window\n",
+				((msr >> 48) & RAPL_TIME_GRANULARITY) * rapl_time_units);
+
+			if (get_msr(cpu, MSR_DRAM_POWER_LIMIT, &msr))
+				return -9;
+			fprintf(stderr, "cpu%d: MSR_DRAM_POWER_LIMIT: %llx %sLOCKED\n",
+					cpu, msr, (msr >> 31) & 1 ? "": "UN-");
+			print_power_limit_msr(cpu, msr, "DRAM Limit");
+		}
+	}
+	if (has_rapl & RAPL_CORES) {
+		if (verbose > 1 || rapl_verbose) {
+			if (get_msr(cpu, MSR_PP0_POLICY, &msr))
+				return -7;
+
+			fprintf(stderr, "cpu%d: MSR_PP0_POLICY: %lld\n", cpu, msr & 0xF);
+
+			if (get_msr(cpu, MSR_PP0_POWER_LIMIT, &msr))
+				return -9;
+			fprintf(stderr, "cpu%d: MSR_PP0_POWER_LIMIT: %llx %sLOCKED\n",
+					cpu, msr, (msr >> 31) & 1 ? "": "UN-");
+			print_power_limit_msr(cpu, msr, "Cores Limit");
+		}
+	}
+	if (has_rapl & RAPL_GFX) {
+		if (verbose > 1 || rapl_verbose) {
+			if (get_msr(cpu, MSR_PP1_POLICY, &msr))
+				return -8;
+
+			fprintf(stderr, "cpu%d: MSR_PP1_POLICY: %lld\n", cpu, msr & 0xF);
+
+			if (get_msr(cpu, MSR_PP1_POWER_LIMIT, &msr))
+				return -9;
+			fprintf(stderr, "cpu%d: MSR_PP1_POWER_LIMIT: %llx %sLOCKED\n",
+					cpu, msr, (msr >> 31) & 1 ? "": "UN-");
+			print_power_limit_msr(cpu, msr, "GFX Limit");
+		}
+	}
+	return 0;
+}
+
 
 int is_snb(unsigned int family, unsigned int model)
 {
@@ -1304,12 +1619,14 @@ void check_cpuid()
 
 	do_nehalem_turbo_ratio_limit = has_nehalem_turbo_ratio_limit(family, model);
 	do_ivt_turbo_ratio_limit = has_ivt_turbo_ratio_limit(family, model);
+	rapl_probe(family, model);
+	do_rapl = has_rapl; /* for now */
 }
 
 
 void usage()
 {
-	fprintf(stderr, "%s: [-v][-p|-P|-S][-c MSR# | -s]][-C MSR#][-m MSR#][-M MSR#][-i interval_sec | command ...]\n",
+	fprintf(stderr, "%s: [-v][-R][-p|-P|-S][-c MSR# | -s]][-C MSR#][-m MSR#][-M MSR#][-i interval_sec | command ...]\n",
 		progname);
 	exit(1);
 }
@@ -1545,6 +1862,9 @@ void turbostat_init()
 
 	if (verbose)
 		print_verbose_header();
+
+	if (verbose || rapl_verbose)
+		for_all_cpus(print_rapl, ODD_COUNTERS);
 }
 
 int fork_it(char **argv)
@@ -1601,7 +1921,7 @@ void cmdline(int argc, char **argv)
 
 	progname = argv[0];
 
-	while ((opt = getopt(argc, argv, "+pPSvi:sc:sC:m:M:")) != -1) {
+	while ((opt = getopt(argc, argv, "+pPSvi:sc:sC:m:M:R")) != -1) {
 		switch (opt) {
 		case 'p':
 			show_core_only++;
@@ -1633,6 +1953,9 @@ void cmdline(int argc, char **argv)
 		case 'M':
 			sscanf(optarg, "%x", &extra_msr_offset64);
 			break;
+		case 'R':
+			rapl_verbose++;
+			break;
 		default:
 			usage();
 		}
@@ -1644,7 +1967,7 @@ int main(int argc, char **argv)
 	cmdline(argc, argv);
 
 	if (verbose > 1)
-		fprintf(stderr, "turbostat v2.1 October 6, 2012"
+		fprintf(stderr, "turbostat v3.0 November 14, 2012"
 			" - Len Brown <lenb@kernel.org>\n");
 
 	turbostat_init();
-- 
1.8.0


* turbostat tool update for Linux-3.8
From: Len Brown @ 2012-11-14 20:43 UTC (permalink / raw)
  To: linux-pm; +Cc: linux-kernel

Here are some turbostat patches I have staged.
The 1st two I've requested be pulled into 3.7,
the rest are for 3.8

The final patch allows turbostat to print Watts
as measured by hardware RAPL counters -- something
that people have been asking for.
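
For readers who have not looked at the patch itself, here is a rough
sketch of how Watts can be derived from a RAPL energy counter. It is
not the code in the patch: the function names rapl_energy_unit() and
rapl_watts() are made up for illustration, and the only things taken
from the patch are the 5-bit energy unit field at bit 8 of
MSR_RAPL_POWER_UNIT and the 32-bit width of the energy counters.

```c
#include <stdint.h>

/*
 * Decode the energy unit field (bits 12:8) of MSR_RAPL_POWER_UNIT the
 * same way the patch does: one counter tick is 1 / 2^field Joules.
 */
double rapl_energy_unit(uint64_t power_unit_msr)
{
	return 1.0 / (double)(1ULL << ((power_unit_msr >> 8) & 0x1F));
}

/*
 * Average Watts between two samples of a 32-bit energy counter.
 * Unsigned 32-bit subtraction gives the right delta across one wrap;
 * more than one wrap per interval cannot be detected this way.
 */
double rapl_watts(uint32_t before, uint32_t after,
		  double energy_unit, double interval_sec)
{
	uint32_t delta = after - before;

	return delta * energy_unit / interval_sec;
}
```

With a unit of 1/65536 Joules, samples 0xFFFF0000 and 0x00010000 taken
one second apart are 131072 ticks apart, which works out to 2 Watts
even though the counter wrapped in between.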

Please let me know if you see any trouble with these patches.

thanks,
Len Brown, Intel Open Source Technology Center


* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Alan Stern @ 2012-11-14 21:45 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <1808040.vTl5iT6HSY@vostro.rjw.lan>

On Wed, 14 Nov 2012, Rafael J. Wysocki wrote:

> > This has the side effect that when a driver unbinds, it can't leave the 
> > device in a special low-power state.  The device will always end up in 
> > the generic low-power state supported by the PCI core.
> 
> Well, I'm not sure I'd like that.
> 
> Let's just go back even one step more and think what we'd like to have in
> general terms and then how to implement it. :-)
> 
> Suppose that pci_pm_init() calls pm_runtime_enable() for all devices (in
> addition to what it does currently).  The runtime PM status of each device is
> RPM_SUSPENDED at this point.  Then:

Wait a moment.  When the device is detected and initialized, it is in
D0, right?  Currently we don't care much because the device starts out
disabled for runtime PM.  But now you are going to enable it.  While
the device is enabled, its runtime status should match the physical
power level.

This means the initialization routine would have to call
pm_runtime_set_active() before pm_runtime_enable().  If you then wanted
to change the status to RPM_SUSPENDED, you would actually have to put
the device into D3 by calling pm_runtime_suspend() (or maybe
pm_runtime_schedule_suspend() to give drivers some time to get loaded 
and bind).

> (1) We want to keep the current semantics during probe, i.e. the device should
>     (a) be RPM_ACTIVE and (b) have usage_count == (user space usage_count + 1)
>     right before ddi->drv->probe() is executed.

In theory the usage_count could be higher and then adjusted back after
the probe is finished, if that would make anything easier.

> (2) We don't want the driver's PM callbacks to be run before ddi->drv->probe().
>     There's a question if we want the bus type's PM callbacks to be run at
>     that point, but they are not run currently and IMO we shouldn't change
>     that.

The device is supposed to be in D0 when it is probed.  Since we are
assuming that initialization is now going to leave it in D3, there's no
choice -- you _have_ to invoke pci_pm_runtime_resume(), which would
invoke the driver's callback, which we don't want.

Therefore you need to figure out a way to tell pci_pm_runtime_resume() 
(and presumably pci_pm_runtime_suspend() as well) when not to invoke 
the driver's callback.  Add a flag to the pci_device structure, maybe.

> (4) If ddi->drv->probe() fails, we want the device's status to change to
>     RPM_SUSPENDED and its usage_count to be equal to the user space part,
>     so that the conditions are the same as before when probing is repeated.
> 
> (5) During ddi->drv->probe(), if the driver decrements the device's usage_count,
>     which it is supposed to do if it supports runtime PM, then runtime PM
>     should work for the device normally going forward (unless the .probe()
>     eventually fails, but then the driver is supposed to do the cleanup).

It would be okay if the normal runtime PM doesn't kick in until after 
the probe routine returns.  For example, if the PCI core made an extra 
call to pm_runtime_get_noresume() before ddi->drv->probe() and a 
matching call to pm_runtime_put_sync() afterward.

> (6) In pci_device_remove() we want the status to change to RPM_SUSPENDED and
>     the device's usage_count to be equal to the user space part after
>     drv->remove() has run.

Basically, pci_device_remove() should undo the actions taken by 
local_pci_probe().

> (7) We want neither the driver's nor the PCI bus type's PM callbacks
>     to be run after drv->remove() has returned (that's what happens now).

What if the driver doesn't support runtime PM?  Then you have
contradictory requirements: The device is in D0 before drv->remove()  
is called, the driver's remove routine won't do any runtime PM, and
you don't want any PM callbacks after drv->remove() returns.  So how
can the device get put back into D3?

Are you suggesting that the unbound device should remain in D0, even
though runtime PM is enabled and the status is SUSPENDED?  I don't
think that would be a good idea.

> > > Perhaps we can introduce something like
> > > 
> > > pm_runtime_get[_put]_skip_callbacks()
> > > 
> > > that would treat the device as though it had the power.no_callbacks flag
> > > set and use that around the driver's .probe() in the PCI core?
> > 
> > That would prevent the PM core from invoking the PCI subsystem's own 
> > callback, not just the driver's callback.  So I don't think that's what 
> > you want.
> 
> Actually, looking at the above, I think that's pretty much what I want. :-)

I don't agree.  In my opinion it would be better to invoke the PCI
bus-type callbacks and tell them somehow to skip calling the driver's
callbacks before drv->probe() or after drv->remove().

Alan Stern


* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Rafael J. Wysocki @ 2012-11-14 23:10 UTC (permalink / raw)
  To: Alan Stern; +Cc: Huang Ying, linux-kernel, linux-pm
In-Reply-To: <Pine.LNX.4.44L0.1211141608200.1620-100000@iolanthe.rowland.org>

On Wednesday, November 14, 2012 04:45:01 PM Alan Stern wrote:
> On Wed, 14 Nov 2012, Rafael J. Wysocki wrote:
> 
> > > This has the side effect that when a driver unbinds, it can't leave the 
> > > device in a special low-power state.  The device will always end up in 
> > > the generic low-power state supported by the PCI core.
> > 
> > Well, I'm not sure I'd like that.
> > 
> > Let's just go back even one step more and think what we'd like to have in
> > general terms and then how to implement it. :-)
> > 
> > Suppose that pci_pm_init() calls pm_runtime_enable() for all devices (in
> > addition to what it does currently).  The runtime PM status of each device is
> > RPM_SUSPENDED at this point.  Then:
> 
> Wait a moment.  When the device is detected and initialized, it is in
> D0, right?  Currently we don't care much because the device starts out
> disabled for runtime PM.  But now you are going to enable it.  While
> the device is enabled, its runtime status should match the physical
> power level.

OK

> This means the initialization routine would have to call
> pm_runtime_set_active() before pm_runtime_enable().  If you then wanted
> to change the status to RPM_SUSPENDED, you would actually have to put
> the device into D3 by calling pm_runtime_suspend() (or maybe
> pm_runtime_schedule_suspend() to give drivers some time to get loaded 
> and bind).

No, I don't want that.  It may be RPM_ACTIVE all the time as long as the
device doesn't have a driver.  Which probably would even make things
simpler. :-)

> > (1) We want to keep the current semantics during probe, i.e. the device should
> >     (a) be RPM_ACTIVE and (b) have usage_count == (user space usage_count + 1)
> >     right before ddi->drv->probe() is executed.
> 
> In theory the usage_count could be higher and then adjusted back after
> the probe is finished, if that would make anything easier.

No, it wouldn't, because of (5).  Suppose that the driver wants to suspend
the device directly from .probe() and the user space doesn't mind.  We can't
prevent that from being doable.

> > (2) We don't want the driver's PM callbacks to be run before ddi->drv->probe().
> >     There's a question if we want the bus type's PM callbacks to be run at
> >     that point, but they are not run currently and IMO we shouldn't change
> >     that.
> 
> The device is supposed to be in D0 when it is probed.  Since we are
> assuming that initialization is now going to leave it in D3, there's no
> choice -- you _have_ to invoke pci_pm_runtime_resume(), which would
> invoke the driver's callback, which we don't want.

Let's say the device will stay in D0 after the initialization and then
we'll require that it be in D0 if .probe() fails or after .remove().

The only thing we'll need to do before .probe() in that case is to
bump up the usage counter and then to bump it down if .probe() fails
(and after .remove()).

The only problem we have in that case are buggy drivers that leave
devices in, say, D3cold after a failing .probe().  That doesn't
seem to be avoidable, though.
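
To make the counting concrete, here is a toy userspace model of the
"bump up before .probe(), bump down on failure and after .remove()"
rule. It is only a sketch: every name in it (toy_dev, toy_bus_probe()
and so on) is invented for illustration and none of it is PCI core or
PM core code.

```c
#include <stdbool.h>

/* Toy model only; none of these names are kernel APIs. */
struct toy_dev {
	int usage_count;	/* user space references plus core references */
	bool bound;		/* true while a driver is bound */
};

typedef int (*toy_probe_fn)(struct toy_dev *dev);

/* Example probe routines: one that fails, one that succeeds. */
int toy_probe_fail(struct toy_dev *dev)
{
	(void)dev;
	return -1;
}

int toy_probe_ok(struct toy_dev *dev)
{
	(void)dev;
	return 0;
}

/* Core-side wrapper: hold one reference across ->probe(). */
int toy_bus_probe(struct toy_dev *dev, toy_probe_fn probe)
{
	int ret;

	dev->usage_count++;		/* bump up before ->probe() */
	ret = probe(dev);
	if (ret) {
		dev->usage_count--;	/* probe failed: drop the reference */
		return ret;
	}
	dev->bound = true;
	return 0;
}

/* Core-side wrapper: drop the probe-time reference after ->remove(). */
void toy_bus_remove(struct toy_dev *dev)
{
	dev->bound = false;
	dev->usage_count--;
}
```

In this model a failed probe, or a successful probe followed by a
remove, leaves usage_count at the user space part it started with, so
a repeated probe sees the same conditions as the first one.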

Thanks,
Rafael

-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


* Re: [PATCH 1/1] thermal: Exynos: Add missing dependency
From: Zhang Rui @ 2012-11-15  0:11 UTC (permalink / raw)
  To: Sachin Kamat; +Cc: linux-pm, durgadoss.r, patches, akpm, Amit Daniel Kachhap
In-Reply-To: <1352875704-2178-1-git-send-email-sachin.kamat@linaro.org>

Hi, Sachin,

thanks for catching the problem.

On Wed, 2012-11-14 at 12:18 +0530, Sachin Kamat wrote:
> CPU_FREQ_TABLE depends on CPU_FREQ. Selecting CPU_FREQ_TABLE without checking
> for dependencies gives the following compilation warnings:
> warning: (ARCH_TEGRA_2x_SOC && ARCH_TEGRA_3x_SOC && UX500_SOC_DB8500 &&
> CPU_THERMAL && EXYNOS_THERMAL) selects CPU_FREQ_TABLE which has unmet
> direct dependencies (ARCH_HAS_CPUFREQ && CPU_FREQ)
> 
Amit,

how is the exynos driver supposed to work?
Do you want the exynos driver to still be loadable without CPU_THERMAL?
If yes, EXYNOS_THERMAL should not select CPU_FREQ_TABLE.
If no, EXYNOS_THERMAL should depend on CPU_THERMAL instead of THERMAL,
and CPU_THERMAL will select CPU_FREQ_TABLE instead.

IMO, either of the above solutions would be a more proper fix for this
warning.

thanks,
rui

> Cc: Amit Daniel Kachhap <amit.kachhap@linaro.org>
> Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
> ---
> Build tested using exynos4_defconfig on linux-next tree of 20121114.
> ---
>  drivers/thermal/Kconfig |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 266c15e..197b7db 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -50,7 +50,7 @@ config RCAR_THERMAL
>  
>  config EXYNOS_THERMAL
>  	tristate "Temperature sensor on Samsung EXYNOS"
> -	depends on (ARCH_EXYNOS4 || ARCH_EXYNOS5) && THERMAL
> +	depends on (ARCH_EXYNOS4 || ARCH_EXYNOS5) && THERMAL && CPU_FREQ
>  	select CPU_FREQ_TABLE
>  	help
>  	  If you say yes here you get support for TMU (Thermal Managment




* [PATCH] Thermal: Add Linux/Thermal subsystem info in MAINTAINER file
From: Zhang Rui @ 2012-11-15  0:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux PM list, LKML, ACPI Devel Maling List, Rafael J. Wysocki,
	Len Brown, durga, Zhang, Rui, Amit Kachhap, jhbird.choi,
	kuninori.morimoto.gx, eduardo.valentin, hongbo.zhang,
	Viresh Kumar, Sachin Kamat

Add Linux/Thermal subsystem info to the MAINTAINERS file.

All the changes made to the generic thermal layer, or
platform thermal drivers that make use of the thermal layer,
should be sent to linux-pm@vger.kernel.org for discussion.

And as the maintainer, I will only apply the patches that have been
sent to linux-pm@vger.kernel.org.

Signed-off-by: Zhang Rui <rui.zhang@intel.com>
---
 MAINTAINERS |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 59203e7..2d8512b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7210,6 +7210,14 @@ L:	linux-xtensa@linux-xtensa.org
 S:	Maintained
 F:	arch/xtensa/
 
+THERMAL
+M:      Zhang Rui <rui.zhang@intel.com>
+L:      linux-pm@vger.kernel.org
+T:      git git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux.git
+S:      Supported
+F:      drivers/thermal/
+F:      include/linux/thermal.h
+
 THINKPAD ACPI EXTRAS DRIVER
 M:	Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
 L:	ibm-acpi-devel@lists.sourceforge.net
-- 
1.7.9.5





* Re: [BUGFIX] PM: Fix active child counting when disabled and forbidden
From: Huang Ying @ 2012-11-15  1:03 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Alan Stern, linux-kernel, linux-pm
In-Reply-To: <37583314.JAxoSZqTsM@vostro.rjw.lan>

On Thu, 2012-11-15 at 00:10 +0100, Rafael J. Wysocki wrote:
> On Wednesday, November 14, 2012 04:45:01 PM Alan Stern wrote:
> > On Wed, 14 Nov 2012, Rafael J. Wysocki wrote:
> > 
> > > > This has the side effect that when a driver unbinds, it can't leave the 
> > > > device in a special low-power state.  The device will always end up in 
> > > > the generic low-power state supported by the PCI core.
> > > 
> > > Well, I'm not sure I'd like that.
> > > 
> > > Let's just go back even one step more and think what we'd like to have in
> > > general terms and then how to implement it. :-)
> > > 
> > > Suppose that pci_pm_init() calls pm_runtime_enable() for all devices (in
> > > addition to what it does currently).  The runtime PM status of each device is
> > > RPM_SUSPENDED at this point.  Then:
> > 
> > Wait a moment.  When the device is detected and initialized, it is in
> > D0, right?  Currently we don't care much because the device starts out
> > disabled for runtime PM.  But now you are going to enable it.  While
> > the device is enabled, its runtime status should match the physical
> > power level.
> 
> OK

If my memory is correct, RPM_SUSPENDED just means the device has stopped
working; it need not be put into a low-power state.  So for RPM_ACTIVE, PCI
devices should be in D0, but for RPM_SUSPENDED, PCI devices can be in any
power state.

Best Regards,
Huang Ying


> > This means the initialization routine would have to call
> > pm_runtime_set_active() before pm_runtime_enable().  If you then wanted
> > to change the status to RPM_SUSPENDED, you would actually have to put
> > the device into D3 by calling pm_runtime_suspend() (or maybe
> > pm_runtime_schedule_suspend() to give drivers some time to get loaded 
> > and bind).
> 
> No, I don't want that.  It may be RPM_ACTIVE all the time as long as the
> device doesn't have a driver.  Which probably would even make things
> simpler. :-)
> 
> > > (1) We want to keep the current semantics during probe, i.e. the device should
> > >     (a) be RPM_ACTIVE and (b) have usage_count == (user space usage_count + 1)
> > >     right before ddi->drv->probe() is executed.
> > 
> > In theory the usage_count could be higher and then adjusted back after
> > the probe is finished, if that would make anything easier.
> 
> No, it wouldn't, because of (5).  Suppose that the driver wants to suspend
> the device directly from .probe() and the user space doesn't mind.  We can't
> prevent that from being doable.
> 
> > > (2) We don't want the driver's PM callbacks to be run before ddi->drv->probe().
> > >     There's a question if we want the bus type's PM callbacks to be run at
> > >     that point, but they are not run currently and IMO we shouldn't change
> > >     that.
> > 
> > The device is supposed to be in D0 when it is probed.  Since we are
> > assuming that initialization is now going to leave it in D3, there's no
> > choice -- you _have_ to invoke pci_pm_runtime_resume(), which would
> > invoke the driver's callback, which we don't want.
> 
> Let's say the device will stay in D0 after the initialization and then
> we'll require that it be in D0 if .probe() fails or after .remove().
> 
> The only thing we'll need to do before .probe() in that case is to
> bump up the usage counter and then to bump it down if .probe() fails
> (and after .remove()).
> 
> The only problem we have in that case are buggy drivers that leave
> devices in, say, D3cold after a failing .probe().  That doesn't
> seem to be avoidable, though.
> 
> Thanks,
> Rafael
> 
> 




* [pm:acpi-dev-pm 9/10] include/linux/acpi.h:463:68: error: 'ENODEV' undeclared
From: kbuild test robot @ 2012-11-15  1:45 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: linux-pm

tree:   git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git acpi-dev-pm
head:   99926a8cd36b6088448fec41aed4a3b5b05b3679
commit: e5cc8ef31267317f3e177415c84e3f3602e5bfc9 [9/10] ACPI / PM: Provide ACPI PM callback routines for subsystems
config: make ARCH=arm tegra_defconfig

All error/warnings:

In file included from drivers/pci/irq.c:7:0:
include/linux/acpi.h: In function 'acpi_dev_pm_attach':
include/linux/acpi.h:463:68: error: 'ENODEV' undeclared (first use in this function)
include/linux/acpi.h:463:68: note: each undeclared identifier is reported only once for each function it appears in

vim +463 +/ENODEV include/linux/acpi.h

e5cc8ef3 Rafael J. Wysocki 2012-11-02  457  #endif
e5cc8ef3 Rafael J. Wysocki 2012-11-02  458  
e5cc8ef3 Rafael J. Wysocki 2012-11-02  459  #if defined(CONFIG_ACPI) && defined(CONFIG_PM)
e5cc8ef3 Rafael J. Wysocki 2012-11-02  460  int acpi_dev_pm_attach(struct device *dev);
e5cc8ef3 Rafael J. Wysocki 2012-11-02  461  int acpi_dev_pm_detach(struct device *dev);
e5cc8ef3 Rafael J. Wysocki 2012-11-02  462  #else
e5cc8ef3 Rafael J. Wysocki 2012-11-02 @463  static inline int acpi_dev_pm_attach(struct device *dev) { return -ENODEV; }
e5cc8ef3 Rafael J. Wysocki 2012-11-02  464  static inline void acpi_dev_pm_detach(struct device *dev) {}
e5cc8ef3 Rafael J. Wysocki 2012-11-02  465  #endif
e5cc8ef3 Rafael J. Wysocki 2012-11-02  466  
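
The error pattern above is what one would expect when a header uses
ENODEV without pulling in a definition for it. A minimal userspace
analogue is sketched below; it assumes that the remedy is simply to
include an errno header before the stub (in the kernel that would
presumably be <linux/errno.h>, which is an assumption and not something
this report states). The name stub_dev_pm_attach() is made up for the
example.

```c
#include <errno.h>	/* userspace stand-in for the kernel's errno header */

struct device;		/* opaque here, the stub never dereferences it */

/* Stub used when the feature is configured out: always "no such device". */
static inline int stub_dev_pm_attach(struct device *dev)
{
	(void)dev;
	return -ENODEV;
}
```

Without the #include, a compiler reports ENODEV as undeclared at the
return statement, just like in the log above.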

---
0-DAY kernel build testing backend         Open Source Technology Center
Fengguang Wu, Yuanhan Liu                              Intel Corporation


* [PATCH] cpuidle: Measure idle state durations with monotonic clock
From: Julius Werner @ 2012-11-15  1:56 UTC (permalink / raw)
  To: linux-kernel
  Cc: Len Brown, Rafael J. Wysocki, Kevin Hilman, Andrew Morton,
	Srivatsa S. Bhat, linux-acpi, linux-pm, linuxppc-dev,
	Deepthi Dharwar, Trinabh Gupta, Sameer Nanda, Lists Linaro-dev,
	Daniel Lezcano, Julius Werner
In-Reply-To: <CAODwPW873M32d9zFT9fpJT9+PuMbz8htzUxdF1TNE+J1zR3jYA@mail.gmail.com>

Many cpuidle drivers measure their time spent in an idle state by
reading the wallclock time before and after idling and calculating the
difference. This leads to erroneous results when the wallclock time gets
updated by another processor in the meantime, adding that clock
adjustment to the idle state's time counter.

If the clock adjustment was negative, the result is even worse due to an
erroneous cast from int to unsigned long long of the last_residency
variable. The negative 32 bit integer will zero-extend and result in a
forward time jump of roughly four billion microseconds, or about 1.2 hours,
on the idle state residency counter.

This patch changes all affected cpuidle drivers to either use the
monotonic clock for their measurements or make use of the generic time
measurement wrapper in cpuidle.c, which was already working correctly.
Some superfluous CLIs/STIs in the ACPI code are removed (interrupts
should always already be disabled before entering the idle function, and
not get reenabled until the generic wrapper has performed its second
measurement). It also removes the erroneous cast, making sure that
negative residency values are applied correctly even though they should
not appear anymore.
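
The difference between the two clocks can be illustrated in userspace.
The sketch below is not taken from the patch; it uses POSIX
clock_gettime() with CLOCK_MONOTONIC as a stand-in for ktime_get(), and
the helper names timespec_diff_us() and measure_us() are invented for
the example. The interval is kept in a signed 64-bit value so that a
negative difference would stay negative instead of turning into a large
positive number.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Signed microseconds from *start to *end. */
int64_t timespec_diff_us(const struct timespec *start,
			 const struct timespec *end)
{
	int64_t sec = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
	int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;

	return (sec * 1000000000LL + nsec) / 1000;
}

/*
 * Time a callback with CLOCK_MONOTONIC. Setting the wall clock while
 * fn() runs does not affect the result, which is the property the
 * idle residency measurement needs.
 */
int64_t measure_us(void (*fn)(void))
{
	struct timespec before, after;

	clock_gettime(CLOCK_MONOTONIC, &before);
	fn();
	clock_gettime(CLOCK_MONOTONIC, &after);
	return timespec_diff_us(&before, &after);
}
```

Swapping CLOCK_MONOTONIC for CLOCK_REALTIME in measure_us() would
reproduce the problem described above whenever the wall clock is set
between the two reads.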

Signed-off-by: Julius Werner <jwerner@chromium.org>
---
 arch/powerpc/platforms/pseries/processor_idle.c |    4 +-
 drivers/acpi/processor_idle.c                   |   57 +---------------------
 drivers/cpuidle/cpuidle.c                       |    3 +-
 drivers/idle/intel_idle.c                       |   14 +-----
 4 files changed, 7 insertions(+), 71 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/processor_idle.c b/arch/powerpc/platforms/pseries/processor_idle.c
index 45d00e5..4d806b4 100644
--- a/arch/powerpc/platforms/pseries/processor_idle.c
+++ b/arch/powerpc/platforms/pseries/processor_idle.c
@@ -36,7 +36,7 @@ static struct cpuidle_state *cpuidle_state_table;
 static inline void idle_loop_prolog(unsigned long *in_purr, ktime_t *kt_before)
 {
 
-	*kt_before = ktime_get_real();
+	*kt_before = ktime_get();
 	*in_purr = mfspr(SPRN_PURR);
 	/*
 	 * Indicate to the HV that we are idle. Now would be
@@ -50,7 +50,7 @@ static inline  s64 idle_loop_epilog(unsigned long in_purr, ktime_t kt_before)
 	get_lppaca()->wait_state_cycles += mfspr(SPRN_PURR) - in_purr;
 	get_lppaca()->idle = 0;
 
-	return ktime_to_us(ktime_sub(ktime_get_real(), kt_before));
+	return ktime_to_us(ktime_sub(ktime_get(), kt_before));
 }
 
 static int snooze_loop(struct cpuidle_device *dev,
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index e8086c7..f1a5da4 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -735,31 +735,18 @@ static inline void acpi_idle_do_entry(struct acpi_processor_cx *cx)
 static int acpi_idle_enter_c1(struct cpuidle_device *dev,
 		struct cpuidle_driver *drv, int index)
 {
-	ktime_t  kt1, kt2;
-	s64 idle_time;
 	struct acpi_processor *pr;
 	struct cpuidle_state_usage *state_usage = &dev->states_usage[index];
 	struct acpi_processor_cx *cx = cpuidle_get_statedata(state_usage);
 
 	pr = __this_cpu_read(processors);
-	dev->last_residency = 0;
 
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	local_irq_disable();
-
-
 	lapic_timer_state_broadcast(pr, cx, 1);
-	kt1 = ktime_get_real();
 	acpi_idle_do_entry(cx);
-	kt2 = ktime_get_real();
-	idle_time =  ktime_to_us(ktime_sub(kt2, kt1));
-
-	/* Update device last_residency*/
-	dev->last_residency = (int)idle_time;
 
-	local_irq_enable();
 	lapic_timer_state_broadcast(pr, cx, 0);
 
 	return index;
@@ -806,19 +793,12 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 	struct acpi_processor *pr;
 	struct cpuidle_state_usage *state_usage = &dev->states_usage[index];
 	struct acpi_processor_cx *cx = cpuidle_get_statedata(state_usage);
-	ktime_t  kt1, kt2;
-	s64 idle_time_ns;
-	s64 idle_time;
 
 	pr = __this_cpu_read(processors);
-	dev->last_residency = 0;
 
 	if (unlikely(!pr))
 		return -EINVAL;
 
-	local_irq_disable();
-
-
 	if (cx->entry_method != ACPI_CSTATE_FFH) {
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
@@ -829,7 +809,6 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 
 		if (unlikely(need_resched())) {
 			current_thread_info()->status |= TS_POLLING;
-			local_irq_enable();
 			return -EINVAL;
 		}
 	}
@@ -843,22 +822,12 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
 	if (cx->type == ACPI_STATE_C3)
 		ACPI_FLUSH_CPU_CACHE();
 
-	kt1 = ktime_get_real();
 	/* Tell the scheduler that we are going deep-idle: */
 	sched_clock_idle_sleep_event();
 	acpi_idle_do_entry(cx);
-	kt2 = ktime_get_real();
-	idle_time_ns = ktime_to_ns(ktime_sub(kt2, kt1));
-	idle_time = idle_time_ns;
-	do_div(idle_time, NSEC_PER_USEC);
 
-	/* Update device last_residency*/
-	dev->last_residency = (int)idle_time;
+	sched_clock_idle_wakeup_event(0);
 
-	/* Tell the scheduler how much we idled: */
-	sched_clock_idle_wakeup_event(idle_time_ns);
-
-	local_irq_enable();
 	if (cx->entry_method != ACPI_CSTATE_FFH)
 		current_thread_info()->status |= TS_POLLING;
 
@@ -883,13 +852,8 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 	struct acpi_processor *pr;
 	struct cpuidle_state_usage *state_usage = &dev->states_usage[index];
 	struct acpi_processor_cx *cx = cpuidle_get_statedata(state_usage);
-	ktime_t  kt1, kt2;
-	s64 idle_time_ns;
-	s64 idle_time;
-
 
 	pr = __this_cpu_read(processors);
-	dev->last_residency = 0;
 
 	if (unlikely(!pr))
 		return -EINVAL;
@@ -899,16 +863,11 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 			return drv->states[drv->safe_state_index].enter(dev,
 						drv, drv->safe_state_index);
 		} else {
-			local_irq_disable();
 			acpi_safe_halt();
-			local_irq_enable();
 			return -EBUSY;
 		}
 	}
 
-	local_irq_disable();
-
-
 	if (cx->entry_method != ACPI_CSTATE_FFH) {
 		current_thread_info()->status &= ~TS_POLLING;
 		/*
@@ -919,7 +878,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 
 		if (unlikely(need_resched())) {
 			current_thread_info()->status |= TS_POLLING;
-			local_irq_enable();
 			return -EINVAL;
 		}
 	}
@@ -934,7 +892,6 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 	 */
 	lapic_timer_state_broadcast(pr, cx, 1);
 
-	kt1 = ktime_get_real();
 	/*
 	 * disable bus master
 	 * bm_check implies we need ARB_DIS
@@ -965,18 +922,9 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 		c3_cpu_count--;
 		raw_spin_unlock(&c3_lock);
 	}
-	kt2 = ktime_get_real();
-	idle_time_ns = ktime_to_ns(ktime_sub(kt2, kt1));
-	idle_time = idle_time_ns;
-	do_div(idle_time, NSEC_PER_USEC);
-
-	/* Update device last_residency*/
-	dev->last_residency = (int)idle_time;
 
-	/* Tell the scheduler how much we idled: */
-	sched_clock_idle_wakeup_event(idle_time_ns);
+	sched_clock_idle_wakeup_event(0);
 
-	local_irq_enable();
 	if (cx->entry_method != ACPI_CSTATE_FFH)
 		current_thread_info()->status |= TS_POLLING;
 
@@ -987,6 +935,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
 struct cpuidle_driver acpi_idle_driver = {
 	.name =		"acpi_idle",
 	.owner =	THIS_MODULE,
+	.en_core_tk_irqen = 1,
 };
 
 /**
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 7f15b85..1536edd 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -109,8 +109,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 		/* This can be moved to within driver enter routine
 		 * but that results in multiple copies of same code.
 		 */
-		dev->states_usage[entered_state].time +=
-				(unsigned long long)dev->last_residency;
+		dev->states_usage[entered_state].time += dev->last_residency;
 		dev->states_usage[entered_state].usage++;
 	} else {
 		dev->last_residency = 0;
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index b0f6b4c..c49c04d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -56,7 +56,6 @@
 #include <linux/kernel.h>
 #include <linux/cpuidle.h>
 #include <linux/clockchips.h>
-#include <linux/hrtimer.h>	/* ktime_get_real() */
 #include <trace/events/power.h>
 #include <linux/sched.h>
 #include <linux/notifier.h>
@@ -72,6 +71,7 @@
 static struct cpuidle_driver intel_idle_driver = {
 	.name = "intel_idle",
 	.owner = THIS_MODULE,
+	.en_core_tk_irqen = 1,
 };
 /* intel_idle.max_cstate=0 disables driver */
 static int max_cstate = MWAIT_MAX_NUM_CSTATES - 1;
@@ -281,8 +281,6 @@ static int intel_idle(struct cpuidle_device *dev,
 	struct cpuidle_state_usage *state_usage = &dev->states_usage[index];
 	unsigned long eax = (unsigned long)cpuidle_get_statedata(state_usage);
 	unsigned int cstate;
-	ktime_t kt_before, kt_after;
-	s64 usec_delta;
 	int cpu = smp_processor_id();
 
 	cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
@@ -297,8 +295,6 @@ static int intel_idle(struct cpuidle_device *dev,
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, &cpu);
 
-	kt_before = ktime_get_real();
-
 	stop_critical_timings();
 	if (!need_resched()) {
 
@@ -310,17 +306,9 @@ static int intel_idle(struct cpuidle_device *dev,
 
 	start_critical_timings();
 
-	kt_after = ktime_get_real();
-	usec_delta = ktime_to_us(ktime_sub(kt_after, kt_before));
-
-	local_irq_enable();
-
 	if (!(lapic_timer_reliable_states & (1 << (cstate))))
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &cpu);
 
-	/* Update cpuidle counters */
-	dev->last_residency = (int)usec_delta;
-
 	return index;
 }
 
-- 
1.7.8.6
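
Note on the diff above: with `.en_core_tk_irqen = 1` set, the drivers stop timing the idle period themselves, and the cpuidle core is expected to take the timestamps around the driver's enter callback and fill in `last_residency` in microseconds. What follows is a minimal userspace sketch of that wrap-and-measure pattern, not the kernel implementation. The names `struct idle_dev`, `wrap_enter`, `fake_enter` and `now_ns` are hypothetical, and `clock_gettime()` stands in for the kernel's ktime helpers.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <limits.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the one cpuidle_device field used here. */
struct idle_dev {
	int last_residency;	/* microseconds spent in the last idle state */
};

typedef int (*enter_fn)(struct idle_dev *dev, int index);

/* Monotonic clock in nanoseconds (userspace substitute for ktime). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Time the enter callback once, centrally, instead of in every driver. */
static int wrap_enter(struct idle_dev *dev, int index, enter_fn enter)
{
	int64_t before, after, usec;
	int entered;

	before = now_ns();
	entered = enter(dev, index);
	after = now_ns();

	usec = (after - before) / 1000;
	if (usec > INT_MAX)
		usec = INT_MAX;	/* clamp so the int field cannot overflow */
	dev->last_residency = (int)usec;

	return entered;
}

/* Hypothetical "idle state": sleep for about 2 ms, return the state index. */
static int fake_enter(struct idle_dev *dev, int index)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = 2000000 };

	(void)dev;
	nanosleep(&req, NULL);
	return index;
}
```

Calling `wrap_enter(&dev, 3, fake_enter)` returns the index passed in and should leave roughly 2000 or slightly more in `dev.last_residency`. The point of the pattern is that the callback itself carries no timing code, which is what the removed `ktime_get_real()` blocks in the diff reflect.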



* Re: [PATCH 3/3] PM: Introduce Intel PowerClamp Driver
From: Paul E. McKenney @ 2012-11-15  3:22 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Jacob Pan, Linux PM, LKML, Rafael Wysocki, Len Brown,
	Thomas Gleixner, H. Peter Anvin, Ingo Molnar, Zhang Rui,
	Rob Landley
In-Reply-To: <50A308FA.40001@linux.intel.com>

On Tue, Nov 13, 2012 at 06:59:06PM -0800, Arjan van de Ven wrote:
> On 11/13/2012 5:34 PM, Paul E. McKenney wrote:
> > On Tue, Nov 13, 2012 at 05:14:50PM -0800, Jacob Pan wrote:
> >> On Tue, 13 Nov 2012 16:08:54 -0800
> >> Arjan van de Ven <arjan@linux.intel.com> wrote:
> >>
> >>>> I think I know, but I feel the need to ask anyway.  Why not tell
> >>>> RCU about the clamping?  
> >>>
> >>> I don't mind telling RCU, but what cannot happen is a bunch of CPU
> >>> time suddenly getting used (since that is the opposite of what is
> >>> needed at the specific point in time of going idle)
> > 
> > Another round of RCU_FAST_NO_HZ rework, you are asking for?  ;-)
> 
> well
> we can tell you we're about to mwait
> and we can tell you when we're done being idle.
> you could just do the actual work at that point, we don't care anymore ;-)
> just at the start of the mandated idle period we can't afford to have more
> jitter than we already have (which is more than I'd like, but it's manageable.
> More jitter means more performance hit, since during the time of the jitter, some cpus
> are forced idle, e.g. costing performance, without the actual big-step power savings
> kicking in yet....)

Fair enough -- but probably best to see what problems arise rather than
trying to guess too far ahead.  Who knows?  It might "just work".

> > If you are only having the system take 6-millisecond "vacations", probably
> 
> it's not all that different from running a while (1) loop for 6 msec inside
> a kernel thread.... other than the power level of course...

Well, a while (1) on all CPUs simultaneously, anyway.

							Thanx, Paul

