* Re: [PATCH RFC] PM/Devfreq: Add Exynos5-bus devfreq driver for Exynos5250.
From: Abhilash Kesavan @ 2012-12-05 3:05 UTC (permalink / raw)
To: jhbird.choi, myungjoo.ham, Kukjin Kim
Cc: kyungmin.park, rjw, linux-kernel, linux-pm
In-Reply-To: <loom.20121204T170759-102@post.gmane.org>
Cc'ing missed out recipients
---------- Forwarded message ----------
From: Abhilash Kesavan <kesavan.abhilash@gmail.com>
Date: Tue, Dec 4, 2012 at 9:48 PM
Subject: Re: [PATCH RFC] PM/Devfreq: Add Exynos5-bus devfreq driver
for Exynos5250.
To: linux-pm@vger.kernel.org
Jonghwan Choi <jhbird.choi <at> samsung.com> writes:
>
>
> Hi Abhilash Kesavan.
Hi,
>
> I compiled in 3.7-rc8
> I got a compile error & warning.
>
> Compile error.
>
> CC drivers/devfreq/exynos5_ppmu.o
> drivers/devfreq/exynos5_ppmu.c:56:14: error: 'S5P_VA_PPMU_DDR_C' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:59:14: error: 'S5P_VA_PPMU_DDR_R1' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:62:13: error: 'S5P_VA_PPMU_DDR_L' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:65:13: error: 'S5P_VA_PPMU_RIGHT' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:68:14: error: 'S5P_VA_PPMU_CPU' undeclared
> here (not in a function)
> make[2]: *** [drivers/devfreq/exynos5_ppmu.o] Error 1
I have submitted only the driver changes. As mentioned in the commit message I
will post the arch side changes after the drivers gets a complete review.
However, if you want to test the driver I will post it tommorrow.
[...]
> drivers/devfreq/exynos5_bus.c: In function 'exynos5_busfreq_int_probe':
> drivers/devfreq/exynos5_bus.c:388:9: warning: passing argument 3 of
> 'devfreq_add_device' from incompatible pointer type [enabled by default]
> include/linux/devfreq.h:170:24: note: expected 'struct devfreq_governor
> const *' but argument is of type 'char *'
[...]
> Need change?
These changes are based on
git://git.kernel.org/pub/scm/linux/kernel/git/mzx/devfreq.git : for-rafael
devfreq_add_device prototype has recently changed and that might be the cause of
the compilation error. Can you please try on this tree.
> How to change the INT clock? I think you have to change
> arch/arm/mach-exynos/clock-exynos5250.c
> (If you want to control via clock framework.)
[...]
> exynos5_int_clk.ops = &exynos5_clk_int_ops;
This is part of the arch side patch that has not been posted.
Regards,
Abhilash
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply
* Re: [RFC PATCH 01/10] CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers
From: Michael Wang @ 2012-12-05 2:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Srivatsa S. Bhat, tglx, peterz, paulmck, rusty, mingo, namhyung,
vincent.guittot, sbw, tj, amit.kucheria, rostedt, rjw,
xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204141017.94a559f1.akpm@linux-foundation.org>
On 12/05/2012 06:10 AM, Andrew Morton wrote:
> On Tue, 04 Dec 2012 14:23:41 +0530
> "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
>
>> From: Michael Wang <wangyun@linux.vnet.ibm.com>
>>
>> There are places where preempt_disable() is used to prevent any CPU from
>> going offline during the critical section. Let us call them as "atomic
>> hotplug readers" (atomic because they run in atomic contexts).
>>
>> Often, these atomic hotplug readers have a simple need : they want the cpu
>> online mask that they work with (inside their critical section), to be
>> stable, i.e., it should be guaranteed that CPUs in that mask won't go
>> offline during the critical section. An important point here is that they
>> don't really care if such a "stable" mask is a subset of the actual
>> cpu_online_mask.
>>
>> The intent of this patch is to provide such a "stable" cpu online mask
>> for that class of atomic hotplug readers.
>>
>> Fundamental idea behind the design:
>> -----------------------------------
>>
>> Simply put, have a separate mask called the stable cpu online mask; and
>> at the hotplug writer (cpu_down()), note down the CPU that is about to go
>> offline, and remove it from the stable cpu online mask. Then, feel free
>> to take that CPU offline, since the atomic hotplug readers won't see it
>> from now on. Also, don't start any new cpu_down() operations until all
>> existing atomic hotplug readers have completed (because they might still
>> be working with the old value of the stable online mask).
>>
>> Some important design requirements and considerations:
>> -----------------------------------------------------
>>
>> 1. The atomic hotplug readers should ideally *never* wait for the hotplug
>> writer (cpu_down()) for *anything*. Because, these atomic hotplug readers
>> can be in very hot-paths like interrupt handling/IPI and hence, if they
>> have to wait for an ongoing cpu_down() to complete, it would pretty much
>> introduce the same performance/latency problems as stop_machine().
>>
>> 2. Any synchronization at the atomic hotplug readers side must be highly
>> scalable - avoid global locks/counters etc. Because, these paths currently
>> use the extremely fast preempt_disable(); our replacement to
>> preempt_disable() should not become ridiculously costly.
>>
>> 3. preempt_disable() was recursive. The replacement should also be recursive.
>>
>> Implementation of the design:
>> ----------------------------
>>
>> Atomic hotplug reader side:
>>
>> We use per-cpu counters to mark the presence of atomic hotplug readers.
>> A reader would increment its per-cpu counter and continue, without waiting
>> for anything. And no locks are used in this path. Together, these satisfy
>> all the 3 requirements mentioned above.
>>
>> The hotplug writer uses (reads) the per-cpu counters of all CPUs in order to
>> ensure that all existing atomic hotplug readers have completed. Only after
>> that, it will proceed to actually take the CPU offline.
>>
>> [ Michael: Designed the synchronization for the IPI case ]
>
> Like this:
>
> [wangyun@linux.vnet.ibm.com: designed the synchronization for the IPI case]
>
>> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
>> [ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
>> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>>
>> ...
>>
>> --- a/include/linux/cpu.h
>> +++ b/include/linux/cpu.h
>> @@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
>>
>> extern void get_online_cpus(void);
>> extern void put_online_cpus(void);
>> +extern void get_online_cpus_stable_atomic(void);
>> +extern void put_online_cpus_stable_atomic(void);
>> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
>> #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
>> #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
>> @@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
>>
>> #define get_online_cpus() do { } while (0)
>> #define put_online_cpus() do { } while (0)
>> +#define get_online_cpus_stable_atomic() do { } while (0)
>> +#define put_online_cpus_stable_atomic() do { } while (0)
>
> static inline C functions would be preferred if possible. Feel free to
> fix up the wrong crufty surrounding code as well ;)
>
>>
>> ...
>>
>> @@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
>>
>> #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
>> #define for_each_online_cpu(cpu) for_each_cpu((cpu), cpu_online_mask)
>> +#define for_each_online_cpu_stable(cpu) \
>> + for_each_cpu((cpu), cpu_online_stable_mask)
>> #define for_each_present_cpu(cpu) for_each_cpu((cpu), cpu_present_mask)
>>
>> /* Wrappers for arch boot code to manipulate normally-constant masks */
>> void set_cpu_possible(unsigned int cpu, bool possible);
>> void set_cpu_present(unsigned int cpu, bool present);
>> void set_cpu_online(unsigned int cpu, bool online);
>> +void set_cpu_online_stable(unsigned int cpu, bool online);
>
> The naming is inconsistent. This is "cpu_online_stable", but
> for_each_online_cpu_stable() is "online_cpu_stable". Can we make
> everything the same?
>
>> void set_cpu_active(unsigned int cpu, bool active);
>> void init_cpu_present(const struct cpumask *src);
>> void init_cpu_possible(const struct cpumask *src);
>>
>> ...
>>
>> +void get_online_cpus_stable_atomic(void)
>> +{
>> + preempt_disable();
>> + this_cpu_inc(hotplug_reader_refcount);
>> +
>> + /*
>> + * Prevent reordering of writes to hotplug_reader_refcount and
>> + * reads from cpu_online_stable_mask.
>> + */
>> + smp_mb();
>> +}
>> +EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
>> +
>> +void put_online_cpus_stable_atomic(void)
>> +{
>> + /*
>> + * Prevent reordering of reads from cpu_online_stable_mask and
>> + * writes to hotplug_reader_refcount.
>> + */
>> + smp_mb();
>> + this_cpu_dec(hotplug_reader_refcount);
>> + preempt_enable();
>> +}
>> +EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
>> +
>> static struct {
>> struct task_struct *active_writer;
>> struct mutex lock; /* Synchronizes accesses to refcount, */
>> @@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
>> write_unlock_irq(&tasklist_lock);
>> }
>>
>> +/*
>> + * We want all atomic hotplug readers to refer to the new value of
>> + * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
>> + * to complete. Any new atomic hotplug readers will see the updated mask
>> + * and hence pose no problems.
>> + */
>> +static void sync_hotplug_readers(void)
>> +{
>> + unsigned int cpu;
>> +
>> + for_each_online_cpu(cpu) {
>> + while (per_cpu(hotplug_reader_refcount, cpu))
>> + cpu_relax();
>> + }
>> +}
>
> That all looks solid to me.
>
>> +/*
>> + * We are serious about taking this CPU down. So clear the CPU from the
>> + * stable online mask.
>> + */
>> +static void prepare_cpu_take_down(unsigned int cpu)
>> +{
>> + set_cpu_online_stable(cpu, false);
>> +
>> + /*
>> + * Prevent reordering of writes to cpu_online_stable_mask and reads
>> + * from hotplug_reader_refcount.
>> + */
>> + smp_mb();
>> +
>> + /*
>> + * Wait for all active hotplug readers to complete, to ensure that
>> + * there are no critical sections still referring to the old stable
>> + * online mask.
>> + */
>> + sync_hotplug_readers();
>> +}
>
> I wonder about the cpu-online case. A typical caller might want to do:
>
>
> /*
> * Set each online CPU's "foo" to "bar"
> */
>
> int global_bar;
>
> void set_cpu_foo(int bar)
> {
> get_online_cpus_stable_atomic();
> global_bar = bar;
> for_each_online_cpu_stable()
> cpu->foo = bar;
> put_online_cpus_stable_atomic()
> }
>
> void_cpu_online_notifier_handler(void)
> {
> cpu->foo = global_bar;
> }
>
> And I think that set_cpu_foo() would be buggy, because a CPU could come
> online before global_bar was altered, and that newly-online CPU would
> pick up the old value of `bar'.
>
> So what's the rule here? global_bar must be written before we run
> get_online_cpus_stable_atomic()?
>
> Anyway, please have a think and spell all this out?
That's right, actually this related to one question, should the hotplug
happen during get_online and put_online?
Answer will be YES according to old API which using mutex, the hotplug
won't happen in critical section, but the cost is get_online() will
block, which will kill the performance.
So we designed this mechanism to do acceleration, but as you pointed
out, although the result will never be wrong, but the 'stable' mask is
not stable since it could be changed in critical section.
And we have two solution.
One is from Srivatsa, using 'read_lock' and 'write_lock', it will
prevent hotplug happen just like the old rule, the cost is we need a
global 'rw_lock' which perform bad on NUMA system, and no doubt,
get_online() will block for short time when doing hotplug.
Another is to maintain a per-cpu cache mask, this mask will only be
updated in get_online(), and be used in critical section, then we will
get a real stable mask, but one flaw is, on different cpu in critical
section, online mask will be different.
We will be appreciate if we could collect some comments on which one to
be used in next version.
Regards,
Michael Wang
>
>> struct take_cpu_down_param {
>> unsigned long mod;
>> void *hcpu;
>> @@ -246,7 +351,9 @@ struct take_cpu_down_param {
>> static int __ref take_cpu_down(void *_param)
>> {
>> struct take_cpu_down_param *param = _param;
>> - int err;
>> + int err, cpu = (long)(param->hcpu);
>> +
>
> Like this please:
>
> int err;
> int cpu = (long)(param->hcpu);
>
>> + prepare_cpu_take_down(cpu);
>>
>> /* Ensure this CPU doesn't handle any more interrupts. */
>> err = __cpu_disable();
>>
>> ...
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
^ permalink raw reply
* [PATCH 3/3] ACPI / PM: Export power states of ACPI devices via sysfs
From: Rafael J. Wysocki @ 2012-12-05 0:42 UTC (permalink / raw)
To: ACPI Devel Maling List; +Cc: Linux PM list, LKML, Greg Kroah-Hartman, Len Brown
In-Reply-To: <1464990.iyrX0DeZyT@vostro.rjw.lan>
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Make it possible to retrieve the current power state of an ACPI
device from user space via sysfs by adding a new attribute
power_state to the power subdirectory of the sysfs directory
associated with the struct acpi_device representing the device's
ACPI node.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
Documentation/ABI/testing/sysfs-devices-power | 13 ++++++++
drivers/acpi/scan.c | 41 ++++++++++++++++++++++++++
2 files changed, 54 insertions(+)
Index: linux/drivers/acpi/scan.c
===================================================================
--- linux.orig/drivers/acpi/scan.c
+++ linux/drivers/acpi/scan.c
@@ -174,6 +174,43 @@ err_out:
}
EXPORT_SYMBOL(acpi_bus_hot_remove_device);
+#ifdef CONFIG_PM
+static ssize_t power_state_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct acpi_device *adev = to_acpi_device(dev);
+ int state;
+ int ret;
+
+ ret = acpi_device_get_power(adev, &state);
+ return ret ? ret : sprintf(buf, "%s\n", acpi_power_state_string(state));
+}
+
+static DEVICE_ATTR(power_state, 0444, power_state_show, NULL);
+
+static struct attribute *acpi_dev_pm_attrs[] = {
+ &dev_attr_power_state.attr,
+ NULL,
+};
+static struct attribute_group acpi_dev_pm_attr_group = {
+ .name = power_group_name,
+ .attrs = acpi_dev_pm_attrs,
+};
+
+static void acpi_dev_pm_sysfs_add(struct device *dev)
+{
+ sysfs_merge_group(&dev->kobj, &acpi_dev_pm_attr_group);
+}
+
+static void acpi_dev_pm_sysfs_remove(struct device *dev)
+{
+ sysfs_unmerge_group(&dev->kobj, &acpi_dev_pm_attr_group);
+}
+#else /* !CONFIG_PM */
+static inline void acpi_dev_pm_sysfs_add(struct device *dev) {}
+static inline void acpi_dev_pm_sysfs_remove(struct device *dev) {}
+#endif /* !CONFIG_PM */
+
static ssize_t
acpi_eject_store(struct device *d, struct device_attribute *attr,
const char *buf, size_t count)
@@ -367,6 +404,9 @@ static int acpi_device_setup_files(struc
status = acpi_get_handle(dev->handle, "_EJ0", &temp);
if (ACPI_SUCCESS(status))
result = device_create_file(&dev->dev, &dev_attr_eject);
+
+ acpi_dev_pm_sysfs_add(&dev->dev);
+
end:
return result;
}
@@ -376,6 +416,7 @@ static void acpi_device_remove_files(str
acpi_status status;
acpi_handle temp;
+ acpi_dev_pm_sysfs_remove(&dev->dev);
/*
* If device has _STR, remove 'description' file
*/
Index: linux/Documentation/ABI/testing/sysfs-devices-power
===================================================================
--- linux.orig/Documentation/ABI/testing/sysfs-devices-power
+++ linux/Documentation/ABI/testing/sysfs-devices-power
@@ -235,3 +235,16 @@ Description:
This attribute has no effect on system-wide suspend/resume and
hibernation.
+
+What: /sys/devices/.../power/power_state
+Date: December 2012
+Contact: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+Description:
+ The /sys/devices/.../power/power_state attribute is only present
+ for ACPI device nodes (i.e. objects of type struct acpi_device).
+
+ If present, it contains the string representation of the current
+ ACPI power state of the device represented by the given ACPI
+ device node.
+
+ This attribute is read-only.
^ permalink raw reply
* [PATCH 2/3] ACPI / PM: Common string representations of device power states
From: Rafael J. Wysocki @ 2012-12-05 0:41 UTC (permalink / raw)
To: ACPI Devel Maling List; +Cc: Linux PM list, LKML, Greg Kroah-Hartman, Len Brown
In-Reply-To: <1464990.iyrX0DeZyT@vostro.rjw.lan>
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function returning string representations of ACPI device power
states, state_string((), is now static, because it is only used
internally in drivers/acpi/bus.c. However, it will be used outside
of that file going forward, so rename it to
acpi_power_state_string(), add a kerneldoc comment to it and add its
header to acpi_bus.h.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/acpi/bus.c | 18 ++++++++++++------
include/acpi/acpi_bus.h | 1 +
2 files changed, 13 insertions(+), 6 deletions(-)
Index: linux/drivers/acpi/bus.c
===================================================================
--- linux.orig/drivers/acpi/bus.c
+++ linux/drivers/acpi/bus.c
@@ -182,7 +182,11 @@ EXPORT_SYMBOL(acpi_bus_get_private_data)
Power Management
-------------------------------------------------------------------------- */
-static const char *state_string(int state)
+/**
+ * acpi_power_state_string - String representation of ACPI device power state.
+ * @state: ACPI device power state to return the string representation of.
+ */
+const char *acpi_power_state_string(int state)
{
switch (state) {
case ACPI_STATE_D0:
@@ -260,7 +264,7 @@ int acpi_device_get_power(struct acpi_de
out:
ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Device [%s] power state is %s\n",
- device->pnp.bus_id, state_string(*state)));
+ device->pnp.bus_id, acpi_power_state_string(*state)));
return 0;
}
@@ -287,13 +291,13 @@ int acpi_device_set_power(struct acpi_de
if (state == device->power.state) {
ACPI_DEBUG_PRINT((ACPI_DB_INFO, "Device is already at %s\n",
- state_string(state)));
+ acpi_power_state_string(state)));
return 0;
}
if (!device->power.states[state].flags.valid) {
printk(KERN_WARNING PREFIX "Device does not support %s\n",
- state_string(state));
+ acpi_power_state_string(state));
return -ENODEV;
}
if (device->parent && (state < device->parent->power.state)) {
@@ -354,12 +358,14 @@ int acpi_device_set_power(struct acpi_de
if (result)
printk(KERN_WARNING PREFIX
"Device [%s] failed to transition to %s\n",
- device->pnp.bus_id, state_string(state));
+ device->pnp.bus_id,
+ acpi_power_state_string(state));
else {
device->power.state = state;
ACPI_DEBUG_PRINT((ACPI_DB_INFO,
"Device [%s] transitioned to %s\n",
- device->pnp.bus_id, state_string(state)));
+ device->pnp.bus_id,
+ acpi_power_state_string(state)));
}
return result;
Index: linux/include/acpi/acpi_bus.h
===================================================================
--- linux.orig/include/acpi/acpi_bus.h
+++ linux/include/acpi/acpi_bus.h
@@ -340,6 +340,7 @@ acpi_status acpi_bus_get_status_handle(a
unsigned long long *sta);
int acpi_bus_get_status(struct acpi_device *device);
int acpi_bus_set_power(acpi_handle handle, int state);
+const char *acpi_power_state_string(int state);
int acpi_device_get_power(struct acpi_device *device, int *state);
int acpi_device_set_power(struct acpi_device *device, int state);
int acpi_bus_update_power(acpi_handle handle, int *state_p);
^ permalink raw reply
* [PATCH 1/3] ACPI / PM: More visible function for retrieving device power states
From: Rafael J. Wysocki @ 2012-12-05 0:40 UTC (permalink / raw)
To: ACPI Devel Maling List; +Cc: Linux PM list, LKML, Greg Kroah-Hartman, Len Brown
In-Reply-To: <1464990.iyrX0DeZyT@vostro.rjw.lan>
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The function used for retrieving ACPI device power states,
__acpi_bus_get_power(), is now static, because it is only used
internally in drivers/acpi/bus.c. However, it will be used
outside of that file going forward, so rename it to
acpi_device_get_power(), in analogy with acpi_device_set_power(),
add a kerneldoc comment to it and add its header to acpi_bus.h.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/acpi/bus.c | 15 ++++++++++++---
include/acpi/acpi_bus.h | 1 +
2 files changed, 13 insertions(+), 3 deletions(-)
Index: linux/drivers/acpi/bus.c
===================================================================
--- linux.orig/drivers/acpi/bus.c
+++ linux/drivers/acpi/bus.c
@@ -200,7 +200,16 @@ static const char *state_string(int stat
}
}
-static int __acpi_bus_get_power(struct acpi_device *device, int *state)
+/**
+ * acpi_device_get_power - Get power state of an ACPI device.
+ * @device: Device to get the power state of.
+ * @state: Place to store the power state of the device.
+ *
+ * This function does not update the device's power.state field, but it may
+ * update its parent's power.state field (when the parent's power state is
+ * unknown and the device's power state turns out to be D0).
+ */
+int acpi_device_get_power(struct acpi_device *device, int *state)
{
int result = ACPI_STATE_UNKNOWN;
@@ -389,7 +398,7 @@ int acpi_bus_init_power(struct acpi_devi
device->power.state = ACPI_STATE_UNKNOWN;
- result = __acpi_bus_get_power(device, &state);
+ result = acpi_device_get_power(device, &state);
if (result)
return result;
@@ -413,7 +422,7 @@ int acpi_bus_update_power(acpi_handle ha
if (result)
return result;
- result = __acpi_bus_get_power(device, &state);
+ result = acpi_device_get_power(device, &state);
if (result)
return result;
Index: linux/include/acpi/acpi_bus.h
===================================================================
--- linux.orig/include/acpi/acpi_bus.h
+++ linux/include/acpi/acpi_bus.h
@@ -340,6 +340,7 @@ acpi_status acpi_bus_get_status_handle(a
unsigned long long *sta);
int acpi_bus_get_status(struct acpi_device *device);
int acpi_bus_set_power(acpi_handle handle, int state);
+int acpi_device_get_power(struct acpi_device *device, int *state);
int acpi_device_set_power(struct acpi_device *device, int state);
int acpi_bus_update_power(acpi_handle handle, int *state_p);
bool acpi_bus_power_manageable(acpi_handle handle);
^ permalink raw reply
* [PATCH 0/3] ACPI / PM: Export ACPI power states of devices via sysfs
From: Rafael J. Wysocki @ 2012-12-05 0:39 UTC (permalink / raw)
To: ACPI Devel Maling List; +Cc: Linux PM list, LKML, Greg Kroah-Hartman, Len Brown
Hi All,
In some situations it is very useful to know the current ACPI power states of
devices. For this reason, the following series of patches makes it possible
to retrieve that information via sysfs.
Patches [1-2/3] are trivial function renames, patch [3/3] is the real thing.
The series is on top of the master branch of linux-pm.git.
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
^ permalink raw reply
* Re: [PATCH] PCI/PM: Keep runtime PM enabled for unbound PCI devices
From: Bjorn Helgaas @ 2012-12-05 0:06 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Huang Ying, linux-kernel, linux-pci, linux-pm, Rafael J. Wysocki,
Alan Stern
In-Reply-To: <1962888.rQgbRUetgj@vostro.rjw.lan>
On Wed, Nov 28, 2012 at 3:43 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> On Wednesday, November 28, 2012 03:25:59 PM Bjorn Helgaas wrote:
>> On Tue, Nov 20, 2012 at 1:08 AM, Huang Ying <ying.huang@intel.com> wrote:
>> > For unbound PCI devices, what we need is:
>> >
>> > - Always in D0 state, because some devices does not work again after
>> > being put into D3 by the PCI bus.
>> >
>> > - In SUSPENDED state if allowed, so that the parent devices can still
>> > be put into low power state.
>> >
>> > To satisfy these requirement, the runtime PM for the unbound PCI
>> > devices are disabled and set to SUSPENDED state. One issue of this
>> > solution is that the PCI devices will be put into SUSPENDED state even
>> > if the SUSPENDED state is forbidden via the sysfs interface
>> > (.../power/control) of the device. This is not an issue for most
>> > devices, because most PCI devices are not used at all if unbounded.
>> > But there are exceptions. For example, unbound VGA card can be used
>> > for display, but suspend its parents make it stop working.
>> >
>> > To fix the issue, we keep the runtime PM enabled when the PCI devices
>> > are unbound. But the runtime PM callbacks will do nothing if the PCI
>> > devices are unbound. This way, we can put the PCI devices into
>> > SUSPENDED state without put the PCI devices into D3 state.
>>
>> Does this fix a reported problem? Is there a bug report URL? What
>> problem would a user notice?
>
> There is a BZ: https://bugzilla.kernel.org/show_bug.cgi?id=48201
I added the BZ link and put this in my -next branch. Thanks!
Bjorn
^ permalink raw reply
* Re: [RFC PATCH 02/10] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
From: Rusty Russell @ 2012-12-04 23:39 UTC (permalink / raw)
To: tglx, peterz, paulmck, mingo, akpm, namhyung, vincent.guittot
Cc: sbw, tj, amit.kucheria, rostedt, rjw, srivatsa.bhat, wangyun,
xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085419.25919.79543.stgit@srivatsabhat.in.ibm.com>
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> writes:
> From: Michael Wang <wangyun@linux.vnet.ibm.com>
>
> With stop_machine() gone from the CPU offline path, we can't depend on
> preempt_disable() to prevent CPUs from going offline from under us.
Minor gripe: I'd prefer this paragraph to use the future rather than
past tense, like: "Once stop_machine() is gone ... we won't be able to
depend".
Since you're not supposed to use the _stable() accessors without calling
get_online_cpus_stable_atomic(), why not have
get_online_cpus_stable_atomic() return a pointer to the stable cpumask?
(Which is otherwise static, at least for debug).
Might make the patches messier though...
Oh, and I'd love to see actual benchmarks to make sure we've actually
fixed a problem with this ;)
Rusty.
^ permalink raw reply
* Re: [RFC PATCH 02/10] smp, cpu hotplug: Fix smp_call_function_*() to prevent CPU offline properly
From: Andrew Morton @ 2012-12-04 22:17 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: tglx, peterz, paulmck, rusty, mingo, namhyung, vincent.guittot,
sbw, tj, amit.kucheria, rostedt, rjw, wangyun, xiaoguangrong,
nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085419.25919.79543.stgit@srivatsabhat.in.ibm.com>
On Tue, 04 Dec 2012 14:24:28 +0530
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> From: Michael Wang <wangyun@linux.vnet.ibm.com>
>
> With stop_machine() gone from the CPU offline path, we can't depend on
> preempt_disable() to prevent CPUs from going offline from under us.
>
> Use the get/put_online_cpus_stable_atomic() APIs to prevent CPUs from going
> offline, while invoking from atomic context.
>
> ...
>
> */
> - this_cpu = get_cpu();
> + get_online_cpus_stable_atomic();
> + this_cpu = smp_processor_id();
I wonder if get_online_cpus_stable_atomic() should return the local CPU
ID. Just as a little convenience thing. Time will tell.
> /*
> * Can deadlock when called with interrupts disabled.
>
> ...
>
> @@ -380,15 +383,15 @@ int smp_call_function_any(const struct cpumask *mask,
> nodemask = cpumask_of_node(cpu_to_node(cpu));
> for (cpu = cpumask_first_and(nodemask, mask); cpu < nr_cpu_ids;
> cpu = cpumask_next_and(cpu, nodemask, mask)) {
> - if (cpu_online(cpu))
> + if (cpu_online_stable(cpu))
> goto call;
> }
>
> /* Any online will do: smp_call_function_single handles nr_cpu_ids. */
> - cpu = cpumask_any_and(mask, cpu_online_mask);
> + cpu = cpumask_any_and(mask, cpu_online_stable_mask);
> call:
> ret = smp_call_function_single(cpu, func, info, wait);
> - put_cpu();
> + put_online_cpus_stable_atomic();
> return ret;
> }
> EXPORT_SYMBOL_GPL(smp_call_function_any);
So smp_call_function_any() has no synchronization against CPUs coming
online. Hence callers of smp_call_function_any() are responsible for
ensuring that CPUs which are concurrently coming online will adopt the
required state?
I guess that has always been the case...
>
> ...
>
^ permalink raw reply
* Re: [RFC PATCH 01/10] CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers
From: Andrew Morton @ 2012-12-04 22:10 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: tglx, peterz, paulmck, rusty, mingo, namhyung, vincent.guittot,
sbw, tj, amit.kucheria, rostedt, rjw, wangyun, xiaoguangrong,
nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085324.25919.53090.stgit@srivatsabhat.in.ibm.com>
On Tue, 04 Dec 2012 14:23:41 +0530
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:
> From: Michael Wang <wangyun@linux.vnet.ibm.com>
>
> There are places where preempt_disable() is used to prevent any CPU from
> going offline during the critical section. Let us call them as "atomic
> hotplug readers" (atomic because they run in atomic contexts).
>
> Often, these atomic hotplug readers have a simple need : they want the cpu
> online mask that they work with (inside their critical section), to be
> stable, i.e., it should be guaranteed that CPUs in that mask won't go
> offline during the critical section. An important point here is that they
> don't really care if such a "stable" mask is a subset of the actual
> cpu_online_mask.
>
> The intent of this patch is to provide such a "stable" cpu online mask
> for that class of atomic hotplug readers.
>
> Fundamental idea behind the design:
> -----------------------------------
>
> Simply put, have a separate mask called the stable cpu online mask; and
> at the hotplug writer (cpu_down()), note down the CPU that is about to go
> offline, and remove it from the stable cpu online mask. Then, feel free
> to take that CPU offline, since the atomic hotplug readers won't see it
> from now on. Also, don't start any new cpu_down() operations until all
> existing atomic hotplug readers have completed (because they might still
> be working with the old value of the stable online mask).
>
> Some important design requirements and considerations:
> -----------------------------------------------------
>
> 1. The atomic hotplug readers should ideally *never* wait for the hotplug
> writer (cpu_down()) for *anything*. Because, these atomic hotplug readers
> can be in very hot-paths like interrupt handling/IPI and hence, if they
> have to wait for an ongoing cpu_down() to complete, it would pretty much
> introduce the same performance/latency problems as stop_machine().
>
> 2. Any synchronization at the atomic hotplug readers side must be highly
> scalable - avoid global locks/counters etc. Because, these paths currently
> use the extremely fast preempt_disable(); our replacement to
> preempt_disable() should not become ridiculously costly.
>
> 3. preempt_disable() was recursive. The replacement should also be recursive.
>
> Implementation of the design:
> ----------------------------
>
> Atomic hotplug reader side:
>
> We use per-cpu counters to mark the presence of atomic hotplug readers.
> A reader would increment its per-cpu counter and continue, without waiting
> for anything. And no locks are used in this path. Together, these satisfy
> all the 3 requirements mentioned above.
>
> The hotplug writer uses (reads) the per-cpu counters of all CPUs in order to
> ensure that all existing atomic hotplug readers have completed. Only after
> that, it will proceed to actually take the CPU offline.
>
> [ Michael: Designed the synchronization for the IPI case ]
Like this:
[wangyun@linux.vnet.ibm.com: designed the synchronization for the IPI case]
> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
> [ Srivatsa: Generalized it to work for all cases and wrote the changelog ]
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
>
> ...
>
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -175,6 +175,8 @@ extern struct bus_type cpu_subsys;
>
> extern void get_online_cpus(void);
> extern void put_online_cpus(void);
> +extern void get_online_cpus_stable_atomic(void);
> +extern void put_online_cpus_stable_atomic(void);
> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> #define register_hotcpu_notifier(nb) register_cpu_notifier(nb)
> #define unregister_hotcpu_notifier(nb) unregister_cpu_notifier(nb)
> @@ -198,6 +200,8 @@ static inline void cpu_hotplug_driver_unlock(void)
>
> #define get_online_cpus() do { } while (0)
> #define put_online_cpus() do { } while (0)
> +#define get_online_cpus_stable_atomic() do { } while (0)
> +#define put_online_cpus_stable_atomic() do { } while (0)
static inline C functions would be preferred if possible. Feel free to
fix up the wrong crufty surrounding code as well ;)
>
> ...
>
> @@ -705,12 +707,15 @@ extern const DECLARE_BITMAP(cpu_all_bits, NR_CPUS);
>
> #define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)
> #define for_each_online_cpu(cpu) for_each_cpu((cpu), cpu_online_mask)
> +#define for_each_online_cpu_stable(cpu) \
> + for_each_cpu((cpu), cpu_online_stable_mask)
> #define for_each_present_cpu(cpu) for_each_cpu((cpu), cpu_present_mask)
>
> /* Wrappers for arch boot code to manipulate normally-constant masks */
> void set_cpu_possible(unsigned int cpu, bool possible);
> void set_cpu_present(unsigned int cpu, bool present);
> void set_cpu_online(unsigned int cpu, bool online);
> +void set_cpu_online_stable(unsigned int cpu, bool online);
The naming is inconsistent. This is "cpu_online_stable", but
for_each_online_cpu_stable() is "online_cpu_stable". Can we make
everything the same?
> void set_cpu_active(unsigned int cpu, bool active);
> void init_cpu_present(const struct cpumask *src);
> void init_cpu_possible(const struct cpumask *src);
>
> ...
>
> +void get_online_cpus_stable_atomic(void)
> +{
> + preempt_disable();
> + this_cpu_inc(hotplug_reader_refcount);
> +
> + /*
> + * Prevent reordering of writes to hotplug_reader_refcount and
> + * reads from cpu_online_stable_mask.
> + */
> + smp_mb();
> +}
> +EXPORT_SYMBOL_GPL(get_online_cpus_stable_atomic);
> +
> +void put_online_cpus_stable_atomic(void)
> +{
> + /*
> + * Prevent reordering of reads from cpu_online_stable_mask and
> + * writes to hotplug_reader_refcount.
> + */
> + smp_mb();
> + this_cpu_dec(hotplug_reader_refcount);
> + preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(put_online_cpus_stable_atomic);
> +
> static struct {
> struct task_struct *active_writer;
> struct mutex lock; /* Synchronizes accesses to refcount, */
> @@ -237,6 +304,44 @@ static inline void check_for_tasks(int cpu)
> write_unlock_irq(&tasklist_lock);
> }
>
> +/*
> + * We want all atomic hotplug readers to refer to the new value of
> + * cpu_online_stable_mask. So wait for the existing atomic hotplug readers
> + * to complete. Any new atomic hotplug readers will see the updated mask
> + * and hence pose no problems.
> + */
> +static void sync_hotplug_readers(void)
> +{
> + unsigned int cpu;
> +
> + for_each_online_cpu(cpu) {
> + while (per_cpu(hotplug_reader_refcount, cpu))
> + cpu_relax();
> + }
> +}
That all looks solid to me.
> +/*
> + * We are serious about taking this CPU down. So clear the CPU from the
> + * stable online mask.
> + */
> +static void prepare_cpu_take_down(unsigned int cpu)
> +{
> + set_cpu_online_stable(cpu, false);
> +
> + /*
> + * Prevent reordering of writes to cpu_online_stable_mask and reads
> + * from hotplug_reader_refcount.
> + */
> + smp_mb();
> +
> + /*
> + * Wait for all active hotplug readers to complete, to ensure that
> + * there are no critical sections still referring to the old stable
> + * online mask.
> + */
> + sync_hotplug_readers();
> +}
I wonder about the cpu-online case. A typical caller might want to do:
/*
* Set each online CPU's "foo" to "bar"
*/
int global_bar;
void set_cpu_foo(int bar)
{
get_online_cpus_stable_atomic();
global_bar = bar;
for_each_online_cpu_stable()
cpu->foo = bar;
put_online_cpus_stable_atomic()
}
void_cpu_online_notifier_handler(void)
{
cpu->foo = global_bar;
}
And I think that set_cpu_foo() would be buggy, because a CPU could come
online before global_bar was altered, and that newly-online CPU would
pick up the old value of `bar'.
So what's the rule here? global_bar must be written before we run
get_online_cpus_stable_atomic()?
Anyway, please have a think and spell all this out?
> struct take_cpu_down_param {
> unsigned long mod;
> void *hcpu;
> @@ -246,7 +351,9 @@ struct take_cpu_down_param {
> static int __ref take_cpu_down(void *_param)
> {
> struct take_cpu_down_param *param = _param;
> - int err;
> + int err, cpu = (long)(param->hcpu);
> +
Like this please:
int err;
int cpu = (long)(param->hcpu);
> + prepare_cpu_take_down(cpu);
>
> /* Ensure this CPU doesn't handle any more interrupts. */
> err = __cpu_disable();
>
> ...
>
^ permalink raw reply
* Re: [RFC PATCH 01/10] CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers
From: Srivatsa S. Bhat @ 2012-12-04 21:14 UTC (permalink / raw)
To: Tejun Heo
Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
vincent.guittot, sbw, amit.kucheria, rostedt, rjw, wangyun,
xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204151734.GB3885@mtj.dyndns.org>
Hi Tejun,
On 12/04/2012 08:47 PM, Tejun Heo wrote:
> Hello, Srivatsa.
>
> On Tue, Dec 04, 2012 at 02:23:41PM +0530, Srivatsa S. Bhat wrote:
>> extern const struct cpumask *const cpu_possible_mask;
>> extern const struct cpumask *const cpu_online_mask;
>> +extern const struct cpumask *const cpu_online_stable_mask;
>> extern const struct cpumask *const cpu_present_mask;
>> extern const struct cpumask *const cpu_active_mask;
>
> This is a bit nasty. The distinction between cpu_online_mask and the
> stable one is quite subtle and there's no mechanism to verify the
> right one is in use. IIUC, the only time cpu_online_mask and
> cpu_online_stable_mask can deviate is during the final stage CPU take
> down, right?
No, actually they deviate in the initial stage itself. We flip the bit
in the stable mask right in the beginning, and then flip the bit in the
online mask slightly later, in __cpu_disable().
...which makes it look stupid to have a separate "stable" mask in the
first place! Hmm...
Thinking in this direction a bit more, I have written a patchset that
doesn't need a separate stable mask, but which works with the existing
cpu_online_mask itself. I'll post it tomorrow after testing and updating
the patch descriptions.
One of the things I'm trying to achieve is to identify 2 types of
hotplug readers:
1. Readers who care only about synchronizing with the updates to
cpu_online_mask (light-weight readers)
2. Readers who really want full synchronization with the entire CPU
tear-down sequence.
The reason for doing this, instead of assuming every reader to be of
type 2 is that, if we don't make this distinction, we can end up in the
very same latency issues and performance problems that we hit when
using stop_machine(), without even using stop_machine()!
[The readers can be in very hot paths, like interrupt handlers. So if
there is no distinction between light-weight readers and full-readers,
we can potentially slow down the entire machine unnecessarily, effectively
creating the same effect as stop_machine()]
IOW, IMHO, one of the goals of the replacement to stop_machine() should
be that it should not indirectly induce the "stop_machine() effect".
The new patchset that I have written takes care of this requirement
and provides APIs for both types of readers, and also doesn't use
any extra cpu masks. I'll post this patchset tomorrow, after taking a
careful look at it again.
Regards,
Srivatsa S. Bhat
^ permalink raw reply
* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Sarah Sharp @ 2012-12-04 18:14 UTC (permalink / raw)
To: Andy Green
Cc: Alan Stern, Ming Lei, Greg Kroah-Hartman, Lan Tianyu,
Rafael J. Wysocki, linux-pm, Oliver Neukum, linux-omap, linux-usb,
Roger Quadros, Felipe Balbi
In-Reply-To: <50BD7095.5090103@linaro.org>
On Tue, Dec 04, 2012 at 11:40:05AM +0800, Andy Green wrote:
> On 04/12/12 01:09, the mail apparently from Alan Stern included:
> >On Mon, 3 Dec 2012, Andy Green wrote:
> >
> >>Unless someone NAKs it for sure already (if you're already sure you're
> >>going to, please do so to avoid wasting time), I'll issue a try#2 of my
> >>code later which demonstrates what I mean. At least I guess it's useful
> >>for comparative purposes.
> >
> >Before you go writing a whole lot more code, we should discuss the
> >basics a bit more clearly. There are several unsettled issues here:
>
> > 1. Should the LAN95xx stuff be associated with the ehci-omap.0's
> > driver or with the hub port? The port would be more flexible,
> > offering the ability to turn the power off and on while the
> > system is running without affecting anything else. But the
> > port code is currently in flux, which could cause this new
> > addition to be delayed.
>
> I think associating ULPI PHY reset and SMSC power and reset with the
> hub port power state is good. Then, you could have the driver in a
> device with multiple onboard USB devices, and individually control
> whether they're eating power or not. In the asset case, you'd
> associate mux assets with ehci-omap.0.
>
> Yesterday I studied the hub port code and have a couple of patches,
> one normalizes the hub port device to have a stub driver.
>
> The other then puts hub port power state signalling into runtime_pm
> handlers in the hub port device. Until now, actually there's no
> code in hub.c to switch off a port.
Did you take a look at the most recent patches from Tianyu to add
support to power off a port if a device is suspended?
Start of the series:
http://marc.info/?l=linux-usb&m=135314427413307&w=2
Patch that adds power off on device suspend:
http://marc.info/?l=linux-usb&m=135314431913321&w=2
Tianyu also added some code to the xHCI host controller driver to call
into the ACPI methods to power off a port when the USB hub driver clears
the port power feature.
Sarah Sharp
^ permalink raw reply
* Re: [RFC PATCH 4/5] arm: omap2: support port power on lan95xx devices
From: Alan Stern @ 2012-12-04 17:10 UTC (permalink / raw)
To: Andy Green, Rafael J. Wysocki
Cc: Ming Lei, Linux-pm mailing list, linux-omap, USB list
In-Reply-To: <50BD7095.5090103@linaro.org>
[CC: list trimmed; the people who were on it should be subscribed to at
least one of the lists anyway.]
On Tue, 4 Dec 2012, Andy Green wrote:
> I think associating ULPI PHY reset and SMSC power and reset with the hub
> port power state is good. Then, you could have the driver in a device
> with multiple onboard USB devices, and individually control whether
> they're eating power or not. In the asset case, you'd associate mux
> assets with ehci-omap.0.
>
> Yesterday I studied the hub port code and have a couple of patches, one
> normalizes the hub port device to have a stub driver.
>
> The other then puts hub port power state signalling into runtime_pm
> handlers in the hub port device. Until now, actually there's no code in
> hub.c to switch off a port.
In fact that's not quite true. You simply weren't aware of the new
code; you can find a series of patches starting here:
http://marc.info/?l=linux-usb&m=135314427413307&w=2
The parts of interest to us begin in patch 7/10.
> Assuming that's not insane, what should the user interface to disable a
> port power look like, something in sysfs? Until now it doesn't seem to
> exist.
It will be implemented through PM QOS.
> > (On the other hand, since the LAN95xx is the only thing
> > connected to the root hub, it could be powered off and on by
> > unbinding the ehci-omap.0 device from its driver and rebinding
> > it.)
>
> We shouldn't get to tied up with Panda case, this will be there for all
> cases like PCs etc. It should work well if there are multiple ports
> with onboard assets.
Okay, I'm fine with tying this to the port.
> > 2. If we do choose the port, do we want to identify it by matching
> > against a device name string or by matching a sequence of port
> > numbers (in this case, a length-1 sequence)? The port numbers
> > are fixed by the board design, whereas the device name strings
> > might get changed in the future. On the other hand, the port
> > numbers apply only to USB whereas device names can be used by
> > any subsystem.
>
> USB device names contain the port information. The matching scheme I
> have currently just uses the right-hand side of the path information and
> nothing that is not defined by the USB subsystem. It uses a
> platform_device ancestor to restrict matches to descendants of the right
> host controller. So unlike try#1 the names are as stable as the
> subsystem code alone, however stable that is, it's not exposed to
> changes from anywhere else. As you mention it's then workable on any
> dynamically probed bus.
>
> > 3. Should the matching mechanism go into the device core or into
> > the USB port code? (This is related to the previous question.)
>
> Currently I am experimenting with having the asset pointer in struct
> device, but migrating the events into runtime_resume and
> runtime_suspend. If it works out that has advantages that assets follow
> not just the logical device existence but the pm state of the device
> closely.
>
> It also allows leveraging assets directly to the hub port runtime_pm
> state, so they follow enable state of the port without any additional code.
If we use a PM domain then there won't be any need to hook the runtime
PM events. The domain will automatically be notified about power
changes.
> > 4. Should this be implemented simply as a regulator (or a pair of
> > regulators)? Or should it be generalized to some sort of PM
> > domain thing? The generic_pm_domain structure defined in
> > include/linux/pm_domain.h seems like overkill, but maybe it's
> > the most appropriate thing to use.
>
> They should be regulators for that I think. But it's only part the
> problem since clocks and mux are also going to be commonly associated
> with device power state, and indeed are in Panda case.
>
> I realize restricting the scope is desirable to get something done, but
> I'm not sure supporting regulators only is enough of the job.
Then why not use a PM domain? It will allow us to do whatever we want
in the callbacks.
On Tue, 4 Dec 2012, Ming Lei wrote:
> Alos, the same ehci-omap driver and same LAN95xx chip is used in
> beagle board and panda board with different power control
> approach, does port driver can distinguish these two cases?
> Port device is a general device(not platform device), how does
> port driver get platform/board dependent info?
This is the part that Andy has been working on. The board-dependent
info will be registered by the board file, and it will take effect
either when the port is registered or when it is bound to a driver.
The details of this aren't clear yet. For instance, should the device
core try to match the port with the asset info, or should this be done
by the USB code when the port is created?
> Not only regulators involved, clock or other things might be involved too.
> Also the same power domain might be shared with more than one port,
> so it is better to introduce power domain to the problem. Looks
> generic_pm_domain is overkill, so I introduced power controller which
> only focuses on power on/off and being shared by multiple devices.
Even though it is overkill, it may be better than introducing a new
abstraction. After all, this is exactly the sort of thing that PM
domains were originally created to handle.
Rafael, do you have any advice on this? The generic_pm_domain
structure is fairly complicated, there's a lot of code in
drivers/base/power/domain.c (it's the biggest source file in its
directory), and I'm not aware of any documentation.
Alan Stern
^ permalink raw reply
* Re: [PATCH 6/6 v7] cpufreq, highbank: add support for highbank cpufreq
From: Shawn Guo @ 2012-12-04 16:21 UTC (permalink / raw)
To: Mark Langsdorf
Cc: linux-kernel, cpufreq, linux-pm, linux-arm-kernel, mturquette
In-Reply-To: <1354631642-30433-7-git-send-email-mark.langsdorf@calxeda.com>
On Tue, Dec 04, 2012 at 08:34:02AM -0600, Mark Langsdorf wrote:
> Highbank processors depend on the external ECME to perform voltage
> management based on a requested frequency. Communication between the
> A9 cores and the ECME happens over the pl320 IPC channel.
>
> Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
> Cc: shawn.guo@linaro.org
> Cc: mturquette@ti.com
> ---
> Changes from v6
> Removed devicetree bindings documentation.
> Restructured driver to use clk notifications.
> Core driver logic is now cpufreq-clk0.
Great. It saves some codes :)
> Changes from v5
> Changed ipc_transmit() to pl320_ipc_transmit().
> Changes from v4
> Removed erroneous changes to arch/arm/Kconfig.
> Removed unnecessary changes to drivers/cpufreq/Kconfig.arm
> Alphabetized additions to arch/arm/mach-highbank/Kconfig
> Changed ipc call and header to match new ipc location in
> drivers/mailbox.
> Changes from v3
> None.
> Changes from v2
> Changed transition latency binding in code to match documentation.
> Changes from v1
> Added highbank specific Kconfig changes.
>
> arch/arm/boot/dts/highbank.dts | 10 ++++
> arch/arm/mach-highbank/Kconfig | 2 +
> drivers/cpufreq/Kconfig.arm | 16 ++++++
> drivers/cpufreq/Makefile | 1 +
> drivers/cpufreq/highbank-cpufreq.c | 106 +++++++++++++++++++++++++++++++++++++
> 5 files changed, 135 insertions(+)
> create mode 100644 drivers/cpufreq/highbank-cpufreq.c
>
> diff --git a/arch/arm/boot/dts/highbank.dts b/arch/arm/boot/dts/highbank.dts
> index 0c6fc34..7c4c27d 100644
> --- a/arch/arm/boot/dts/highbank.dts
> +++ b/arch/arm/boot/dts/highbank.dts
> @@ -36,6 +36,16 @@
> next-level-cache = <&L2>;
> clocks = <&a9pll>;
> clock-names = "cpu";
> + operating-points = <
> + /* kHz ignored */
> + 1300000 1000000
> + 1200000 1000000
> + 1100000 1000000
> + 800000 1000000
> + 400000 1000000
> + 200000 1000000
> + >;
> + clock-latency = <100000>;
> };
>
> cpu@1 {
> diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig
> index 2896881..b7862da 100644
> --- a/arch/arm/mach-highbank/Kconfig
> +++ b/arch/arm/mach-highbank/Kconfig
> @@ -1,5 +1,7 @@
> config ARCH_HIGHBANK
> bool "Calxeda ECX-1000 (Highbank)" if ARCH_MULTI_V7
> + select ARCH_HAS_CPUFREQ
> + select ARCH_HAS_OPP
> select ARCH_WANT_OPTIONAL_GPIOLIB
> select ARM_AMBA
> select ARM_GIC
> diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
> index 5961e64..7aaac9f 100644
> --- a/drivers/cpufreq/Kconfig.arm
> +++ b/drivers/cpufreq/Kconfig.arm
> @@ -76,3 +76,19 @@ config ARM_EXYNOS5250_CPUFREQ
> help
> This adds the CPUFreq driver for Samsung EXYNOS5250
> SoC.
> +
> +config ARM_HIGHBANK_CPUFREQ
> + tristate "Calxeda Highbank-based"
> + depends on ARCH_HIGHBANK
> + select CPU_FREQ_TABLE
> + select GENERIC_CPUFREQ_CPU0
> + select PM_OPP
> + select REGULATOR
> +
> + default m
> + help
> + This adds the CPUFreq driver for Calxeda Highbank SoC
> + based boards.
> +
> + If in doubt, say N.
> +
> diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
> index 1bc90e1..9e8f12a 100644
> --- a/drivers/cpufreq/Makefile
> +++ b/drivers/cpufreq/Makefile
> @@ -50,6 +50,7 @@ obj-$(CONFIG_ARM_EXYNOS4210_CPUFREQ) += exynos4210-cpufreq.o
> obj-$(CONFIG_ARM_EXYNOS4X12_CPUFREQ) += exynos4x12-cpufreq.o
> obj-$(CONFIG_ARM_EXYNOS5250_CPUFREQ) += exynos5250-cpufreq.o
> obj-$(CONFIG_ARM_OMAP2PLUS_CPUFREQ) += omap-cpufreq.o
> +obj-$(CONFIG_ARM_HIGHBANK_CPUFREQ) += highbank-cpufreq.o
>
> ##################################################################################
> # PowerPC platform drivers
> diff --git a/drivers/cpufreq/highbank-cpufreq.c b/drivers/cpufreq/highbank-cpufreq.c
> new file mode 100644
> index 0000000..25ef437
> --- /dev/null
> +++ b/drivers/cpufreq/highbank-cpufreq.c
> @@ -0,0 +1,106 @@
> +/*
> + * Copyright (C) 2012 Calxeda, Inc.
> + *
> + * derived from cpufreq-cpu0 by Freescale Semiconductor
It's not any more, right?
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/clk.h>
> +#include <linux/cpu.h>
> +#include <linux/err.h>
> +#include <linux/of.h>
> +#include <linux/opp.h>
Do you need this header?
> +#include <linux/slab.h>
> +#include <linux/mailbox.h>
> +
> +#define HB_CPUFREQ_CHANGE_NOTE 0x80000001
> +
> +static struct device *cpu_dev;
> +static struct clk *cpu_clk;
These two now can be local variables hb_cpufreq_driver_init().
> +
> +static int hb_voltage_change(unsigned int freq)
> +{
> + int i;
> + u32 msg[7];
> +
> + msg[0] = HB_CPUFREQ_CHANGE_NOTE;
> + msg[1] = freq / 1000000;
> + for (i = 2; i < 7; i++)
> + msg[i] = 0;
> +
> + return pl320_ipc_transmit(msg);
> +}
> +
> +static int hb_cpufreq_clk_notify(struct notifier_block *nb,
> + unsigned long action, void *hclk)
> +{
> + struct clk_notifier_data *clk_data = hclk;
> + int i = 0;
> +
> + if (action == PRE_RATE_CHANGE) {
> + if (clk_data->new_rate > clk_data->old_rate)
> + while (hb_voltage_change(clk_data->new_rate))
> + if (i++ > 15)
> + return NOTIFY_STOP;
> + } else if (action == POST_RATE_CHANGE)
Add a {} pair for else block or remove {} for if?
> + if (clk_data->new_rate < clk_data->old_rate)
> + while (hb_voltage_change(clk_data->new_rate))
> + if (i++ > 15)
> + break;
> +
> + return NOTIFY_DONE;
> +}
> +
> +static struct notifier_block hb_cpufreq_clk_nb = {
> + .notifier_call = hb_cpufreq_clk_notify,
> +};
> +
> +static int __devinit hb_cpufreq_driver_init(void)
Isn't there a big series removing __devinit from the kernel tree
as CONFIG_HOTPLUG is going away?
Shawn
> +{
> + struct device_node *np;
> + int ret;
> +
> + np = of_find_node_by_path("/cpus/cpu@0");
> + if (!np) {
> + pr_err("failed to find highbank cpufreq node\n");
> + return -ENOENT;
> + }
> +
> + cpu_dev = get_cpu_device(0);
> + if (!cpu_dev) {
> + pr_err("failed to get highbank cpufreq device\n");
> + ret = -ENODEV;
> + goto out_put_node;
> + }
> +
> + cpu_dev->of_node = np;
> +
> + cpu_clk = clk_get(cpu_dev, NULL);
> + if (IS_ERR(cpu_clk)) {
> + ret = PTR_ERR(cpu_clk);
> + pr_err("failed to get cpu0 clock: %d\n", ret);
> + goto out_put_node;
> + }
> +
> + ret = clk_notifier_register(cpu_clk, &hb_cpufreq_clk_nb);
> + if (ret) {
> + pr_err("failed to register clk notifier: %d\n", ret);
> + goto out_put_node;
> + }
> +
> +out_put_node:
> + of_node_put(np);
> + return ret;
> +}
> +late_initcall(hb_cpufreq_driver_init);
> +
> +MODULE_AUTHOR("Mark Langsdorf <mark.langsdorf@calxeda.com>");
> +MODULE_DESCRIPTION("Calxeda Highbank cpufreq driver");
> +MODULE_LICENSE("GPL");
> --
> 1.7.11.7
>
^ permalink raw reply
* Re: [PATCH RFC] PM/Devfreq: Add Exynos5-bus devfreq driver for Exynos5250.
From: Abhilash Kesavan @ 2012-12-04 16:18 UTC (permalink / raw)
To: linux-pm
In-Reply-To: <000101cdd202$dc6e8310$954b8930$%choi@samsung.com>
Jonghwan Choi <jhbird.choi <at> samsung.com> writes:
>
>
> Hi Abhilash Kesavan.
Hi,
>
> I compiled in 3.7-rc8
> I got a compile error & warning.
>
> Compile error.
>
> CC drivers/devfreq/exynos5_ppmu.o
> drivers/devfreq/exynos5_ppmu.c:56:14: error: 'S5P_VA_PPMU_DDR_C' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:59:14: error: 'S5P_VA_PPMU_DDR_R1' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:62:13: error: 'S5P_VA_PPMU_DDR_L' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:65:13: error: 'S5P_VA_PPMU_RIGHT' undeclared
> here (not in a function)
> drivers/devfreq/exynos5_ppmu.c:68:14: error: 'S5P_VA_PPMU_CPU' undeclared
> here (not in a function)
> make[2]: *** [drivers/devfreq/exynos5_ppmu.o] Error 1
I have submitted only the driver changes. As mentioned in the commit message I
will post the arch side changes after the drivers gets a complete review.
However, if you want to test the driver I will post it tommorrow.
[...]
> drivers/devfreq/exynos5_bus.c: In function 'exynos5_busfreq_int_probe':
> drivers/devfreq/exynos5_bus.c:388:9: warning: passing argument 3 of
> 'devfreq_add_device' from incompatible pointer type [enabled by default]
> include/linux/devfreq.h:170:24: note: expected 'struct devfreq_governor
> const *' but argument is of type 'char *'
[...]
> Need change?
These changes are based on
git://git.kernel.org/pub/scm/linux/kernel/git/mzx/devfreq.git : for-rafael
devfreq_add_device prototype has recently changed and that might be the cause of
the compilation error. Can you please try on this tree.
> How to change the INT clock? I think you have to change
> arch/arm/mach-exynos/clock-exynos5250.c
> (If you want to control via clock framework.)
[...]
> exynos5_int_clk.ops = &exynos5_clk_int_ops;
This is part of the arch side patch that has not been posted.
Regards,
Abhilash
^ permalink raw reply
* Re: [RFC PATCH 01/10] CPU hotplug: Introduce "stable" cpu online mask, for atomic hotplug readers
From: Tejun Heo @ 2012-12-04 15:17 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: tglx, peterz, paulmck, rusty, mingo, akpm, namhyung,
vincent.guittot, sbw, amit.kucheria, rostedt, rjw, wangyun,
xiaoguangrong, nikunj, linux-pm, linux-kernel
In-Reply-To: <20121204085324.25919.53090.stgit@srivatsabhat.in.ibm.com>
Hello, Srivatsa.
On Tue, Dec 04, 2012 at 02:23:41PM +0530, Srivatsa S. Bhat wrote:
> extern const struct cpumask *const cpu_possible_mask;
> extern const struct cpumask *const cpu_online_mask;
> +extern const struct cpumask *const cpu_online_stable_mask;
> extern const struct cpumask *const cpu_present_mask;
> extern const struct cpumask *const cpu_active_mask;
This is a bit nasty. The distinction between cpu_online_mask and the
stable one is quite subtle and there's no mechanism to verify the
right one is in use. IIUC, the only time cpu_online_mask and
cpu_online_stable_mask can deviate is during the final stage CPU take
down, right? If so, why not just make cpu_online_mask the stable one
and the few cases where the actual online state matters deal with the
internal version? Anyone outside cpu hotplug proper should be happy
with the stable version anyway, no?
Thanks.
--
tejun
^ permalink raw reply
* [PATCH 5/6 v7] power: export opp cpufreq functions
From: Mark Langsdorf @ 2012-12-04 14:34 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel; +Cc: Mark Langsdorf
In-Reply-To: <1354631642-30433-1-git-send-email-mark.langsdorf@calxeda.com>
These functions are needed to make the cpufreq-core0 and highbank-cpufreq
drivers loadable as modules.
Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Acked-by: Nishanth Menon <nm@ti.com>
---
Changes from v4, v5, v6
None.
Changes from v3
includes linux/export.h instead of module.h.
Changes from v2
None.
Changes from v1
Added Nishanth Menon's ack.
Clarified the purpose of the change in the commit message.
drivers/base/power/opp.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/base/power/opp.c b/drivers/base/power/opp.c
index d946864..4062ec3 100644
--- a/drivers/base/power/opp.c
+++ b/drivers/base/power/opp.c
@@ -23,6 +23,7 @@
#include <linux/rcupdate.h>
#include <linux/opp.h>
#include <linux/of.h>
+#include <linux/export.h>
/*
* Internal data structure organization with the OPP layer library is as
@@ -643,6 +644,7 @@ int opp_init_cpufreq_table(struct device *dev,
return 0;
}
+EXPORT_SYMBOL(opp_init_cpufreq_table);
/**
* opp_free_cpufreq_table() - free the cpufreq table
@@ -660,6 +662,7 @@ void opp_free_cpufreq_table(struct device *dev,
kfree(*table);
*table = NULL;
}
+EXPORT_SYMBOL(opp_free_cpufreq_table);
#endif /* CONFIG_CPU_FREQ */
/**
@@ -720,4 +723,5 @@ int of_init_opp_table(struct device *dev)
return 0;
}
+EXPORT_SYMBOL(of_init_opp_table);
#endif
--
1.7.11.7
^ permalink raw reply related
* [PATCH 4/6 v7] arm highbank: add support for pl320 IPC
From: Mark Langsdorf @ 2012-12-04 14:34 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
Cc: Omar Ramirez Luna, Arnd Bergmann, Rob Herring, Mark Langsdorf
In-Reply-To: <1354631642-30433-1-git-send-email-mark.langsdorf@calxeda.com>
From: Rob Herring <rob.herring@calxeda.com>
The pl320 IPC allows for interprocessor communication between the highbank A9
and the EnergyCore Management Engine. The pl320 implements a straightforward
mailbox protocol.
This patch depends on Omar Ramirez Luna's <omar.luna@linaro.org>
mailbox driver patch series.
Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Cc: Omar Ramirez Luna <omar.luna@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
---
Changes from v6
None.
Changes from v5
Renamed ipc_transmit() to pl320_ipc_transmit().
Properly exported pl320_ipc_{un}register_notifier().
Changes from v4
Moved pl320-ipc.c from arch/arm/mach-highbank to drivers/mailbox.
Moved header information to include/linux/mailbox.h.
Added Kconfig options to reflect the new code location.
Change drivers/mailbox/Makefile to build the omap mailboxes only
when they are configured.
Removed ipc_call_fast and renamed ipc_call_slow ipc_transmit.
Changes from v3, v2
None.
Changes from v1
Removed erroneous changes for cpufreq Kconfig.
arch/arm/mach-highbank/Kconfig | 2 +
drivers/mailbox/Kconfig | 9 ++
drivers/mailbox/Makefile | 4 +
drivers/mailbox/pl320-ipc.c | 199 +++++++++++++++++++++++++++++++++++++++++
include/linux/mailbox.h | 19 +++-
5 files changed, 232 insertions(+), 1 deletion(-)
create mode 100644 drivers/mailbox/Makefile
create mode 100644 drivers/mailbox/pl320-ipc.c
diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig
index 0e1d0a4..2896881 100644
--- a/arch/arm/mach-highbank/Kconfig
+++ b/arch/arm/mach-highbank/Kconfig
@@ -11,5 +11,7 @@ config ARCH_HIGHBANK
select GENERIC_CLOCKEVENTS
select HAVE_ARM_SCU
select HAVE_SMP
+ select MAILBOX
+ select PL320_MBOX
select SPARSE_IRQ
select USE_OF
diff --git a/drivers/mailbox/Kconfig b/drivers/mailbox/Kconfig
index be8cac0..e89fdb4 100644
--- a/drivers/mailbox/Kconfig
+++ b/drivers/mailbox/Kconfig
@@ -34,4 +34,13 @@ config OMAP_MBOX_KFIFO_SIZE
This can also be changed at runtime (via the mbox_kfifo_size
module parameter).
+config PL320_MBOX
+ bool "ARM PL320 Mailbox"
+ help
+ An implementation of the ARM PL320 Interprocessor Communication
+ Mailbox (IPCM), tailored for the Calxeda Highbank. It is used to
+ send short messages between Highbank's A9 cores and the EnergyCore
+ Management Engine, primarily for cpufreq. Say Y here if you want
+ to use the PL320 IPCM support.
+
endif
diff --git a/drivers/mailbox/Makefile b/drivers/mailbox/Makefile
new file mode 100644
index 0000000..c9f14c3
--- /dev/null
+++ b/drivers/mailbox/Makefile
@@ -0,0 +1,4 @@
+obj-$(CONFIG_OMAP1_MBOX) += mailbox.o mailbox-omap1.o
+obj-$(CONFIG_OMAP2PLUS_MBOX) += mailbox.o mailbox-omap2.o
+obj-$(CONFIG_PL320_MBOX) += pl320-ipc.o
+
diff --git a/drivers/mailbox/pl320-ipc.c b/drivers/mailbox/pl320-ipc.c
new file mode 100644
index 0000000..1a9d8e4
--- /dev/null
+++ b/drivers/mailbox/pl320-ipc.c
@@ -0,0 +1,199 @@
+/*
+ * Copyright 2012 Calxeda, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+#include <linux/types.h>
+#include <linux/err.h>
+#include <linux/delay.h>
+#include <linux/export.h>
+#include <linux/io.h>
+#include <linux/interrupt.h>
+#include <linux/completion.h>
+#include <linux/mutex.h>
+#include <linux/notifier.h>
+#include <linux/spinlock.h>
+#include <linux/device.h>
+#include <linux/amba/bus.h>
+
+#include <linux/mailbox.h>
+
+#define IPCMxSOURCE(m) ((m) * 0x40)
+#define IPCMxDSET(m) (((m) * 0x40) + 0x004)
+#define IPCMxDCLEAR(m) (((m) * 0x40) + 0x008)
+#define IPCMxDSTATUS(m) (((m) * 0x40) + 0x00C)
+#define IPCMxMODE(m) (((m) * 0x40) + 0x010)
+#define IPCMxMSET(m) (((m) * 0x40) + 0x014)
+#define IPCMxMCLEAR(m) (((m) * 0x40) + 0x018)
+#define IPCMxMSTATUS(m) (((m) * 0x40) + 0x01C)
+#define IPCMxSEND(m) (((m) * 0x40) + 0x020)
+#define IPCMxDR(m, dr) (((m) * 0x40) + ((dr) * 4) + 0x024)
+
+#define IPCMMIS(irq) (((irq) * 8) + 0x800)
+#define IPCMRIS(irq) (((irq) * 8) + 0x804)
+
+#define MBOX_MASK(n) (1 << (n))
+#define IPC_TX_MBOX 1
+#define IPC_RX_MBOX 2
+
+#define CHAN_MASK(n) (1 << (n))
+#define A9_SOURCE 1
+#define M3_SOURCE 0
+
+static void __iomem *ipc_base;
+static int ipc_irq;
+static DEFINE_MUTEX(ipc_m1_lock);
+static DECLARE_COMPLETION(ipc_completion);
+static ATOMIC_NOTIFIER_HEAD(ipc_notifier);
+
+static inline void set_destination(int source, int mbox)
+{
+ __raw_writel(CHAN_MASK(source), ipc_base + IPCMxDSET(mbox));
+ __raw_writel(CHAN_MASK(source), ipc_base + IPCMxMSET(mbox));
+}
+
+static inline void clear_destination(int source, int mbox)
+{
+ __raw_writel(CHAN_MASK(source), ipc_base + IPCMxDCLEAR(mbox));
+ __raw_writel(CHAN_MASK(source), ipc_base + IPCMxMCLEAR(mbox));
+}
+
+static void __ipc_send(int mbox, u32 *data)
+{
+ int i;
+ for (i = 0; i < 7; i++)
+ __raw_writel(data[i], ipc_base + IPCMxDR(mbox, i));
+ __raw_writel(0x1, ipc_base + IPCMxSEND(mbox));
+}
+
+static u32 __ipc_rcv(int mbox, u32 *data)
+{
+ int i;
+ for (i = 0; i < 7; i++)
+ data[i] = __raw_readl(ipc_base + IPCMxDR(mbox, i));
+ return data[1];
+}
+
+/* blocking implmentation from the A9 side, not usuable in interrupts! */
+int pl320_ipc_transmit(u32 *data)
+{
+ int ret;
+
+ mutex_lock(&ipc_m1_lock);
+
+ init_completion(&ipc_completion);
+ __ipc_send(IPC_TX_MBOX, data);
+ ret = wait_for_completion_timeout(&ipc_completion,
+ msecs_to_jiffies(1000));
+ if (ret == 0) {
+ ret = -ETIMEDOUT;
+ goto out;
+ }
+
+ ret = __ipc_rcv(IPC_TX_MBOX, data);
+out:
+ mutex_unlock(&ipc_m1_lock);
+ return ret;
+}
+EXPORT_SYMBOL(pl320_ipc_transmit);
+
+irqreturn_t ipc_handler(int irq, void *dev)
+{
+ u32 irq_stat;
+ u32 data[7];
+
+ irq_stat = __raw_readl(ipc_base + IPCMMIS(1));
+ if (irq_stat & MBOX_MASK(IPC_TX_MBOX)) {
+ __raw_writel(0, ipc_base + IPCMxSEND(IPC_TX_MBOX));
+ complete(&ipc_completion);
+ }
+ if (irq_stat & MBOX_MASK(IPC_RX_MBOX)) {
+ __ipc_rcv(IPC_RX_MBOX, data);
+ atomic_notifier_call_chain(&ipc_notifier, data[0], data + 1);
+ __raw_writel(2, ipc_base + IPCMxSEND(IPC_RX_MBOX));
+ }
+
+ return IRQ_HANDLED;
+}
+
+int pl320_ipc_register_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_register(&ipc_notifier, nb);
+}
+EXPORT_SYMBOL(pl320_ipc_register_notifier);
+
+int pl320_ipc_unregister_notifier(struct notifier_block *nb)
+{
+ return atomic_notifier_chain_unregister(&ipc_notifier, nb);
+}
+EXPORT_SYMBOL(pl320_ipc_unregister_notifier);
+
+static int __devinit pl320_probe(struct amba_device *adev,
+ const struct amba_id *id)
+{
+ int ret;
+
+ ipc_base = ioremap(adev->res.start, resource_size(&adev->res));
+ if (ipc_base == NULL)
+ return -ENOMEM;
+
+ __raw_writel(0, ipc_base + IPCMxSEND(IPC_TX_MBOX));
+
+ ipc_irq = adev->irq[0];
+ ret = request_irq(ipc_irq, ipc_handler, 0, dev_name(&adev->dev), NULL);
+ if (ret < 0)
+ goto err;
+
+ /* Init slow mailbox */
+ __raw_writel(CHAN_MASK(A9_SOURCE),
+ ipc_base + IPCMxSOURCE(IPC_TX_MBOX));
+ __raw_writel(CHAN_MASK(M3_SOURCE),
+ ipc_base + IPCMxDSET(IPC_TX_MBOX));
+ __raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
+ ipc_base + IPCMxMSET(IPC_TX_MBOX));
+
+ /* Init receive mailbox */
+ __raw_writel(CHAN_MASK(M3_SOURCE),
+ ipc_base + IPCMxSOURCE(IPC_RX_MBOX));
+ __raw_writel(CHAN_MASK(A9_SOURCE),
+ ipc_base + IPCMxDSET(IPC_RX_MBOX));
+ __raw_writel(CHAN_MASK(M3_SOURCE) | CHAN_MASK(A9_SOURCE),
+ ipc_base + IPCMxMSET(IPC_RX_MBOX));
+
+ return 0;
+err:
+ iounmap(ipc_base);
+ return ret;
+}
+
+static struct amba_id pl320_ids[] = {
+ {
+ .id = 0x00041320,
+ .mask = 0x000fffff,
+ },
+ { 0, 0 },
+};
+
+static struct amba_driver pl320_driver = {
+ .drv = {
+ .name = "pl320",
+ },
+ .id_table = pl320_ids,
+ .probe = pl320_probe,
+};
+
+static int __init ipc_init(void)
+{
+ return amba_driver_register(&pl320_driver);
+}
+module_init(ipc_init);
diff --git a/include/linux/mailbox.h b/include/linux/mailbox.h
index e8e4131..e7829e5 100644
--- a/include/linux/mailbox.h
+++ b/include/linux/mailbox.h
@@ -1,4 +1,16 @@
-/* mailbox.h */
+/*
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program. If not, see <http://www.gnu.org/licenses/>.
+ */
typedef u32 mbox_msg_t;
struct omap_mbox;
@@ -20,3 +32,8 @@ void omap_mbox_save_ctx(struct omap_mbox *mbox);
void omap_mbox_restore_ctx(struct omap_mbox *mbox);
void omap_mbox_enable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq);
void omap_mbox_disable_irq(struct omap_mbox *mbox, omap_mbox_irq_t irq);
+
+int pl320_ipc_transmit(u32 *data);
+int pl320_ipc_register_notifier(struct notifier_block *nb);
+int pl320_ipc_unregister_notifier(struct notifier_block *nb);
+
--
1.7.11.7
^ permalink raw reply related
* [PATCH 3/6 v7] cpufreq: tolerate inexact values when collecting stats
From: Mark Langsdorf @ 2012-12-04 14:33 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel; +Cc: Mark Langsdorf
In-Reply-To: <1354631642-30433-1-git-send-email-mark.langsdorf@calxeda.com>
This patch is withdrawn due to a need for severe rework.
Changes from v4
Withdrawn.
Changes from v3, v2
None.
Changes from v1
Implemented a simple round-up algorithm instead of the over/under
method that could cause errors on Intel processors with boost mode.
^ permalink raw reply
* [PATCH 2/6 v7] clk, highbank: Prevent glitches in non-bypass reset mode
From: Mark Langsdorf @ 2012-12-04 14:33 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
Cc: Mark Langsdorf, Rob Herring
In-Reply-To: <1354631642-30433-1-git-send-email-mark.langsdorf@calxeda.com>
The highbank clock will glitch with the current code if the
clock rate is reset without relocking the PLL. Program the PLL
correctly to prevent glitches.
Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Acked-by: Mike Turquette <mturquette@linaro.org>
---
Changes from v6
None.
Changes from v5
Added Mike Turquette's ack.
Changes from v4
None.
Changes from v3
Changelog text and patch name now correspond to the actual patch.
was clk, highbank: remove non-bypass reset mode.
Changes from v2
None.
Changes from v1:
Removed erroneous reformating.
drivers/clk/clk-highbank.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/clk/clk-highbank.c b/drivers/clk/clk-highbank.c
index 52fecad..3a0b723 100644
--- a/drivers/clk/clk-highbank.c
+++ b/drivers/clk/clk-highbank.c
@@ -182,8 +182,10 @@ static int clk_pll_set_rate(struct clk_hw *hwclk, unsigned long rate,
reg |= HB_PLL_EXT_ENA;
reg &= ~HB_PLL_EXT_BYPASS;
} else {
+ writel(reg | HB_PLL_EXT_BYPASS, hbclk->reg);
reg &= ~HB_PLL_DIVQ_MASK;
reg |= divq << HB_PLL_DIVQ_SHIFT;
+ writel(reg | HB_PLL_EXT_BYPASS, hbclk->reg);
}
writel(reg, hbclk->reg);
--
1.7.11.7
^ permalink raw reply related
* [PATCH 1/6 v7] arm: use devicetree to get smp_twd clock
From: Mark Langsdorf @ 2012-12-04 14:33 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
Cc: Mark Langsdorf, Rob Herring
In-Reply-To: <1354631642-30433-1-git-send-email-mark.langsdorf@calxeda.com>
From: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Rob Herring <rob.herring@calxeda.com>
Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
---
Changes from v4, v5, v6
None.
Changes from v3
No longer setting *clk to NULL in twd_get_clock().
Changes from v2
Turned the check for the node pointer into an if-then-else statement.
Removed the second, redundant clk_get_rate.
Changes from v1
None.
arch/arm/kernel/smp_twd.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/arch/arm/kernel/smp_twd.c b/arch/arm/kernel/smp_twd.c
index b22d700..af46b80 100644
--- a/arch/arm/kernel/smp_twd.c
+++ b/arch/arm/kernel/smp_twd.c
@@ -237,12 +237,15 @@ static irqreturn_t twd_handler(int irq, void *dev_id)
return IRQ_NONE;
}
-static struct clk *twd_get_clock(void)
+static struct clk *twd_get_clock(struct device_node *np)
{
struct clk *clk;
int err;
- clk = clk_get_sys("smp_twd", NULL);
+ if (np)
+ clk = of_clk_get(np, 0);
+ else
+ clk = clk_get_sys("smp_twd", NULL);
if (IS_ERR(clk)) {
pr_err("smp_twd: clock not found: %d\n", (int)PTR_ERR(clk));
return clk;
@@ -263,6 +266,7 @@ static struct clk *twd_get_clock(void)
return ERR_PTR(err);
}
+ twd_timer_rate = clk_get_rate(clk);
return clk;
}
@@ -273,12 +277,7 @@ static int __cpuinit twd_timer_setup(struct clock_event_device *clk)
{
struct clock_event_device **this_cpu_clk;
- if (!twd_clk)
- twd_clk = twd_get_clock();
-
- if (!IS_ERR_OR_NULL(twd_clk))
- twd_timer_rate = clk_get_rate(twd_clk);
- else
+ if (IS_ERR_OR_NULL(twd_clk))
twd_calibrate_rate();
__raw_writel(0, twd_base + TWD_TIMER_CONTROL);
@@ -349,6 +348,8 @@ int __init twd_local_timer_register(struct twd_local_timer *tlt)
if (!twd_base)
return -ENOMEM;
+ twd_clk = twd_get_clock(NULL);
+
return twd_local_timer_common_register();
}
@@ -383,6 +384,8 @@ void __init twd_local_timer_of_register(void)
goto out;
}
+ twd_clk = twd_get_clock(np);
+
err = twd_local_timer_common_register();
out:
--
1.7.11.7
^ permalink raw reply related
* [PATCH 0/6 v7] cpufreq: add support for Calxeda ECX-1000 (highbank)
From: Mark Langsdorf @ 2012-12-04 14:33 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
In-Reply-To: <1351631056-25938-1-git-send-email-mark.langsdorf@calxeda.com>
This patch series adds cpufreq support for the Calxeda
ECX-1000 (highbank) SoCs. The EnergyCore Management Engine (ECME) on
the ECX-1000 manages the voltage for the part and communications with
Linux through a pl320 mailbox. clk notifications are used to control
when to send messages to the ECME.
--Mark Langsdorf
Calxeda, Inc.
^ permalink raw reply
* [PATCH 6/6 v7] cpufreq, highbank: add support for highbank cpufreq
From: Mark Langsdorf @ 2012-12-04 14:34 UTC (permalink / raw)
To: linux-kernel, cpufreq, linux-pm, linux-arm-kernel
Cc: Mark Langsdorf, shawn.guo, mturquette
In-Reply-To: <1354631642-30433-1-git-send-email-mark.langsdorf@calxeda.com>
Highbank processors depend on the external ECME to perform voltage
management based on a requested frequency. Communication between the
A9 cores and the ECME happens over the pl320 IPC channel.
Signed-off-by: Mark Langsdorf <mark.langsdorf@calxeda.com>
Cc: shawn.guo@linaro.org
Cc: mturquette@ti.com
---
Changes from v6
Removed devicetree bindings documentation.
Restructured driver to use clk notifications.
Core driver logic is now cpufreq-clk0.
Changes from v5
Changed ipc_transmit() to pl320_ipc_transmit().
Changes from v4
Removed erroneous changes to arch/arm/Kconfig.
Removed unnecessary changes to drivers/cpufreq/Kconfig.arm
Alphabetized additions to arch/arm/mach-highbank/Kconfig
Changed ipc call and header to match new ipc location in
drivers/mailbox.
Changes from v3
None.
Changes from v2
Changed transition latency binding in code to match documentation.
Changes from v1
Added highbank specific Kconfig changes.
arch/arm/boot/dts/highbank.dts | 10 ++++
arch/arm/mach-highbank/Kconfig | 2 +
drivers/cpufreq/Kconfig.arm | 16 ++++++
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/highbank-cpufreq.c | 106 +++++++++++++++++++++++++++++++++++++
5 files changed, 135 insertions(+)
create mode 100644 drivers/cpufreq/highbank-cpufreq.c
diff --git a/arch/arm/boot/dts/highbank.dts b/arch/arm/boot/dts/highbank.dts
index 0c6fc34..7c4c27d 100644
--- a/arch/arm/boot/dts/highbank.dts
+++ b/arch/arm/boot/dts/highbank.dts
@@ -36,6 +36,16 @@
next-level-cache = <&L2>;
clocks = <&a9pll>;
clock-names = "cpu";
+ operating-points = <
+ /* kHz ignored */
+ 1300000 1000000
+ 1200000 1000000
+ 1100000 1000000
+ 800000 1000000
+ 400000 1000000
+ 200000 1000000
+ >;
+ clock-latency = <100000>;
};
cpu@1 {
diff --git a/arch/arm/mach-highbank/Kconfig b/arch/arm/mach-highbank/Kconfig
index 2896881..b7862da 100644
--- a/arch/arm/mach-highbank/Kconfig
+++ b/arch/arm/mach-highbank/Kconfig
@@ -1,5 +1,7 @@
config ARCH_HIGHBANK
bool "Calxeda ECX-1000 (Highbank)" if ARCH_MULTI_V7
+ select ARCH_HAS_CPUFREQ
+ select ARCH_HAS_OPP
select ARCH_WANT_OPTIONAL_GPIOLIB
select ARM_AMBA
select ARM_GIC
diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index 5961e64..7aaac9f 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -76,3 +76,19 @@ config ARM_EXYNOS5250_CPUFREQ
help
This adds the CPUFreq driver for Samsung EXYNOS5250
SoC.
+
+config ARM_HIGHBANK_CPUFREQ
+ tristate "Calxeda Highbank-based"
+ depends on ARCH_HIGHBANK
+ select CPU_FREQ_TABLE
+ select GENERIC_CPUFREQ_CPU0
+ select PM_OPP
+ select REGULATOR
+
+ default m
+ help
+ This adds the CPUFreq driver for Calxeda Highbank SoC
+ based boards.
+
+ If in doubt, say N.
+
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 1bc90e1..9e8f12a 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_ARM_EXYNOS4210_CPUFREQ) += exynos4210-cpufreq.o
obj-$(CONFIG_ARM_EXYNOS4X12_CPUFREQ) += exynos4x12-cpufreq.o
obj-$(CONFIG_ARM_EXYNOS5250_CPUFREQ) += exynos5250-cpufreq.o
obj-$(CONFIG_ARM_OMAP2PLUS_CPUFREQ) += omap-cpufreq.o
+obj-$(CONFIG_ARM_HIGHBANK_CPUFREQ) += highbank-cpufreq.o
##################################################################################
# PowerPC platform drivers
diff --git a/drivers/cpufreq/highbank-cpufreq.c b/drivers/cpufreq/highbank-cpufreq.c
new file mode 100644
index 0000000..25ef437
--- /dev/null
+++ b/drivers/cpufreq/highbank-cpufreq.c
@@ -0,0 +1,106 @@
+/*
+ * Copyright (C) 2012 Calxeda, Inc.
+ *
+ * derived from cpufreq-cpu0 by Freescale Semiconductor
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/clk.h>
+#include <linux/cpu.h>
+#include <linux/err.h>
+#include <linux/of.h>
+#include <linux/opp.h>
+#include <linux/slab.h>
+#include <linux/mailbox.h>
+
+#define HB_CPUFREQ_CHANGE_NOTE 0x80000001
+
+static struct device *cpu_dev;
+static struct clk *cpu_clk;
+
+static int hb_voltage_change(unsigned int freq)
+{
+ int i;
+ u32 msg[7];
+
+ msg[0] = HB_CPUFREQ_CHANGE_NOTE;
+ msg[1] = freq / 1000000;
+ for (i = 2; i < 7; i++)
+ msg[i] = 0;
+
+ return pl320_ipc_transmit(msg);
+}
+
+static int hb_cpufreq_clk_notify(struct notifier_block *nb,
+ unsigned long action, void *hclk)
+{
+ struct clk_notifier_data *clk_data = hclk;
+ int i = 0;
+
+ if (action == PRE_RATE_CHANGE) {
+ if (clk_data->new_rate > clk_data->old_rate)
+ while (hb_voltage_change(clk_data->new_rate))
+ if (i++ > 15)
+ return NOTIFY_STOP;
+ } else if (action == POST_RATE_CHANGE)
+ if (clk_data->new_rate < clk_data->old_rate)
+ while (hb_voltage_change(clk_data->new_rate))
+ if (i++ > 15)
+ break;
+
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block hb_cpufreq_clk_nb = {
+ .notifier_call = hb_cpufreq_clk_notify,
+};
+
+static int __devinit hb_cpufreq_driver_init(void)
+{
+ struct device_node *np;
+ int ret;
+
+ np = of_find_node_by_path("/cpus/cpu@0");
+ if (!np) {
+ pr_err("failed to find highbank cpufreq node\n");
+ return -ENOENT;
+ }
+
+ cpu_dev = get_cpu_device(0);
+ if (!cpu_dev) {
+ pr_err("failed to get highbank cpufreq device\n");
+ ret = -ENODEV;
+ goto out_put_node;
+ }
+
+ cpu_dev->of_node = np;
+
+ cpu_clk = clk_get(cpu_dev, NULL);
+ if (IS_ERR(cpu_clk)) {
+ ret = PTR_ERR(cpu_clk);
+ pr_err("failed to get cpu0 clock: %d\n", ret);
+ goto out_put_node;
+ }
+
+ ret = clk_notifier_register(cpu_clk, &hb_cpufreq_clk_nb);
+ if (ret) {
+ pr_err("failed to register clk notifier: %d\n", ret);
+ goto out_put_node;
+ }
+
+out_put_node:
+ of_node_put(np);
+ return ret;
+}
+late_initcall(hb_cpufreq_driver_init);
+
+MODULE_AUTHOR("Mark Langsdorf <mark.langsdorf@calxeda.com>");
+MODULE_DESCRIPTION("Calxeda Highbank cpufreq driver");
+MODULE_LICENSE("GPL");
--
1.7.11.7
^ permalink raw reply related
* Re: [PATCH v9 06/10] ata: zpodd: check zero power ready status
From: James Bottomley @ 2012-12-04 12:11 UTC (permalink / raw)
To: Tejun Heo
Cc: Aaron Lu, Rafael J. Wysocki, linux-pm, Jeff Garzik, Alan Stern,
Jeff Wu, Aaron Lu, linux-ide, linux-scsi, linux-acpi
In-Reply-To: <20121203162323.GB19802@htj.dyndns.org>
On Mon, 2012-12-03 at 08:23 -0800, Tejun Heo wrote:
> Hello, James.
>
> On Mon, Dec 03, 2012 at 08:25:43AM +0000, James Bottomley wrote:
> > > diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> > > index e65c62e..1756151 100644
> > > --- a/include/scsi/scsi_device.h
> > > +++ b/include/scsi/scsi_device.h
> > > @@ -160,6 +160,7 @@ struct scsi_device {
> > > unsigned can_power_off:1; /* Device supports runtime power off */
> > > unsigned wce_default_on:1; /* Cache is ON by default */
> > > unsigned no_dif:1; /* T10 PI (DIF) should be disabled */
> > > + unsigned event_driven:1; /* No need to poll the device */
> > >
> > > DECLARE_BITMAP(supported_events, SDEV_EVT_MAXBITS); /* supported events */
> > > struct list_head event_list; /* asserted events */
> >
> > Yes, but if we can get away with doing that, it should be in genhd
> > because it's completely generic.
> >
> > I was imagining we'd have to fake the reply to the test unit ready or
> > some other commands, which is why it would need to be in sr.c
> >
> > The check events code is Tejun's baby, if he's OK with it then just do
> > it in genhd.c
>
> The problem here is there's no easy to reach genhd from libata (or the
> other way around) without going through sr. I think we're gonna have
> to have something in sr one way or the other.
Can't we do that via an event? It's a bit clunky because we need the
callback in the layer that sees the sdev, which is libata-scsi, we just
need an analogue of ata_scsi_media_change_notify, but ignoring and
allowing polling is essentially event driven as well, so it should all
work. We'll need a listener in genhd, which might be trickier.
This may also work as the more generic route for stuff where we can't
connect the bottom to the top of the stack (which looks like a problem
we'll begin wrestling with a lot now).
James
^ permalink raw reply
* Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management
From: wujianguo @ 2012-12-04 10:51 UTC (permalink / raw)
To: Srivatsa S. Bhat
Cc: akpm, mgorman, mjg59, paulmck, dave, maxime.coquelin,
loic.pallardy, arjan, kmpark, kamezawa.hiroyu, lenb, rjw,
gargankita, amit.kachhap, svaidy, thomas.abraham,
santosh.shilimkar, linux-pm, linux-mm, linux-kernel
In-Reply-To: <20121106195026.6941.24662.stgit@srivatsabhat.in.ibm.com>
Hi Srivatsa,
I applied this patchset, and run genload(from LTP) test: numactl --membind=1 ./genload -m 100,
then got a "general protection fault", and system was going to reboot.
If I revert [RFC PATCH 7/8], and run this test again, genload will be killed due to OOM,
but the system is OK, no coredump.
ps: node1 has 8G memory.
[ 3647.020666] general protection fault: 0000 [#1] SMP
[ 3647.026232] Modules linked in: edd cpufreq_conservative cpufreq_userspace cpu
freq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_
intel ixgbe ipv6 i7core_edac igb iTCO_wdt i2c_i801 iTCO_vendor_support ioatdma e
dac_core tpm_tis joydev lpc_ich i2c_core microcode mfd_core rtc_cmos pcspkr sr_m
od tpm sg dca hid_generic mdio tpm_bios cdrom button ext3 jbd mbcache usbhid hid
uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hw
mon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_
piix libata megaraid_sas scsi_mod
[ 3647.084565] CPU 19
[ 3647.086709] Pid: 33708, comm: genload Not tainted 3.7.0-rc7-mem-region+ #11 Q
CI QSSC-S4R/QSSC-S4R
[ 3647.096799] RIP: 0010:[<ffffffff8110979c>] [<ffffffff8110979c>] add_to_freel
ist+0x8c/0x100
[ 3647.106125] RSP: 0000:ffff880a7f6c3e58 EFLAGS: 00010086
[ 3647.112042] RAX: dead000000200200 RBX: 0000000000000001 RCX: 0000000000000000
[ 3647.119990] RDX: ffffea001211a3a0 RSI: ffffea001211ffa0 RDI: 0000000000000001
[ 3647.127936] RBP: ffff880a7f6c3e58 R08: ffff88067ff6d240 R09: ffff88067ff6b180
[ 3647.135884] R10: 0000000000000002 R11: 0000000000000001 R12: 00000000000007fe
[ 3647.143831] R13: 0000000000000001 R14: 0000000000000001 R15: ffffea001211ff80
[ 3647.151778] FS: 00007f0b2a674700(0000) GS:ffff880a7f6c0000(0000) knlGS:00000
00000000000
[ 3647.160790] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3647.167188] CR2: 00007f0b1a000000 CR3: 0000000484723000 CR4: 00000000000007e0
[ 3647.175136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3647.183083] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3647.191030] Process genload (pid: 33708, threadinfo ffff8806852bc000, task ff
ff880688288000)
[ 3647.200428] Stack:
[ 3647.202667] ffff880a7f6c3f08 ffffffff8110e9c0 ffff88067ff66100 0000000000000
7fe
[ 3647.210954] ffff880a7f6d5bb0 0000000000000030 0000000000002030 ffff88067ff66
168
[ 3647.219244] 0000000000000002 ffff880a7f6d5b78 0000000e88288000 ffff88067ff66
100
[ 3647.227530] Call Trace:
[ 3647.230252] <IRQ>
[ 3647.232394] [<ffffffff8110e9c0>] free_pcppages_bulk+0x350/0x450
[ 3647.239297] [<ffffffff8110f0d0>] ? drain_pages+0xd0/0xd0
[ 3647.245313] [<ffffffff8110f0c3>] drain_pages+0xc3/0xd0
[ 3647.251135] [<ffffffff8110f0e6>] drain_local_pages+0x16/0x20
[ 3647.257540] [<ffffffff810a3bce>] generic_smp_call_function_interrupt+0xae/0x
260
[ 3647.265783] [<ffffffff810282c7>] smp_call_function_interrupt+0x27/0x40
[ 3647.273156] [<ffffffff8147f272>] call_function_interrupt+0x72/0x80
[ 3647.280136] <EOI>
[ 3647.282278] [<ffffffff81077936>] ? mutex_spin_on_owner+0x76/0xa0
[ 3647.289292] [<ffffffff81473116>] __mutex_lock_slowpath+0x66/0x180
[ 3647.296181] [<ffffffff8113afe7>] ? try_to_unmap_one+0x277/0x440
[ 3647.302872] [<ffffffff81472b93>] mutex_lock+0x23/0x40
[ 3647.308595] [<ffffffff8113b657>] rmap_walk+0x137/0x240
[ 3647.314417] [<ffffffff8115c230>] ? get_page+0x40/0x40
[ 3647.320133] [<ffffffff8115d036>] move_to_new_page+0xb6/0x110
[ 3647.326526] [<ffffffff8115d452>] __unmap_and_move+0x192/0x230
[ 3647.333023] [<ffffffff8115d612>] unmap_and_move+0x122/0x140
[ 3647.339328] [<ffffffff8115d6c9>] migrate_pages+0x99/0x150
[ 3647.345433] [<ffffffff81129f10>] ? isolate_freepages+0x220/0x220
[ 3647.352220] [<ffffffff8112ace2>] compact_zone+0x2f2/0x5d0
[ 3647.358332] [<ffffffff8112b4a0>] try_to_compact_pages+0x180/0x240
[ 3647.365218] [<ffffffff8110f1e7>] __alloc_pages_direct_compact+0x97/0x200
[ 3647.372780] [<ffffffff810a45a3>] ? on_each_cpu_mask+0x63/0xb0
[ 3647.379279] [<ffffffff8110f84f>] __alloc_pages_slowpath+0x4ff/0x780
[ 3647.386349] [<ffffffff8110fbf1>] __alloc_pages_nodemask+0x121/0x180
[ 3647.393430] [<ffffffff811500d6>] alloc_pages_vma+0xd6/0x170
[ 3647.399737] [<ffffffff81162198>] do_huge_pmd_anonymous_page+0x148/0x210
[ 3647.407203] [<ffffffff81132f6b>] handle_mm_fault+0x33b/0x340
[ 3647.413609] [<ffffffff814799d3>] __do_page_fault+0x2a3/0x4e0
[ 3647.420017] [<ffffffff8126316a>] ? trace_hardirqs_off_thunk+0x3a/0x6c
[ 3647.427290] [<ffffffff81479c1e>] do_page_fault+0xe/0x10
[ 3647.433208] [<ffffffff81475f68>] page_fault+0x28/0x30
[ 3647.438921] Code: 8d 78 01 48 89 f8 48 c1 e0 04 49 8d 04 00 48 8b 50 08 48 83
40 10 01 48 85 d2 74 1b 48 8b 42 08 48 89 72 08 48 89 16 48 89 46 08 <48> 89 30
c9 c3 0f 1f 80 00 00 00 00 4d 3b 00 74 4b 83 e9 01 79
[ 3647.460607] RIP [<ffffffff8110979c>] add_to_freelist+0x8c/0x100
[ 3647.467308] RSP <ffff880a7f6c3e58>
[ 0.000000] Linux version 3.7.0-rc7-mem-region+ (root@linux-intel) (gcc versi
on 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #11 SMP Tue Dec 4 15:23
:15 CST 2012
.
Thanks,
Jianguo Wu
On 2012-11-7 3:52, Srivatsa S. Bhat wrote:
> Hi,
>
> This is an alternative design for Memory Power Management, developed based on
> some of the suggestions[1] received during the review of the earlier patchset
> ("Hierarchy" design) on Memory Power Management[2]. This alters the buddy-lists
> to keep them region-sorted, and is hence identified as the "Sorted-buddy" design.
>
> One of the key aspects of this design is that it avoids the zone-fragmentation
> problem that was present in the earlier design[3].
>
>
> Quick overview of Memory Power Management and Memory Regions:
> ------------------------------------------------------------
>
> Today memory subsystems are offer a wide range of capabilities for managing
> memory power consumption. As a quick example, if a block of memory is not
> referenced for a threshold amount of time, the memory controller can decide to
> put that chunk into a low-power content-preserving state. And the next
> reference to that memory chunk would bring it back to full power for read/write.
> With this capability in place, it becomes important for the OS to understand
> the boundaries of such power-manageable chunks of memory and to ensure that
> references are consolidated to a minimum number of such memory power management
> domains.
>
> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> the firmware can expose information regarding the boundaries of such memory
> power management domains to the OS in a standard way.
>
> How can Linux VM help memory power savings?
>
> o Consolidate memory allocations and/or references such that they are
> not spread across the entire memory address space. Basically area of memory
> that is not being referenced, can reside in low power state.
>
> o Support targeted memory reclaim, where certain areas of memory that can be
> easily freed can be offlined, allowing those areas of memory to be put into
> lower power states.
>
> Memory Regions:
> ---------------
>
> "Memory Regions" is a way of capturing the boundaries of power-managable
> chunks of memory, within the MM subsystem.
>
>
> Short description of the "Sorted-buddy" design:
> -----------------------------------------------
>
> In this design, the memory region boundaries are captured in a parallel
> data-structure instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions. (The freelists need not be fully
> address-sorted, they just need to be region-sorted. Patch 6 explains this
> in more detail).
>
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform page reclaim in the reverse order, as illustrated
> below.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation---> <---Direction of reclaim
>
>
> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> lies in the page-free path and not the page-allocation path and hence the
> critical page allocation paths remain fast. Moreover, the heart of the page
> allocation algorithm itself remains largely unchanged, and the region-related
> data-structures are optimized to avoid unnecessary updates during the
> page-allocator's runtime.
>
> Advantages of this design:
> --------------------------
> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
> hence we avoid its associated problems (like too many zones, extra page
> reclaim threads, question of choosing watermarks etc).
> [This is an advantage over the "Hierarchy" design]
>
> 2. Performance overhead is expected to be low: Since we retain the simplicity
> of the algorithm in the page allocation path, page allocation can
> potentially remain as fast as it would be without memory regions. The
> overhead is pushed to the page-freeing paths which are not that critical.
>
>
> Results:
> =======
>
> Test setup:
> -----------
> This patchset applies cleanly on top of 3.7-rc3.
>
> x86 dual-socket quad core HT-enabled machine booted with mem=8G
> Memory region size = 512 MB
>
> Functional testing:
> -------------------
>
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages.
>
> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
> of allocations from pagetest.
>
> Present pages | Free pages at various allocations |
> | start | 512 MB | 1024 MB | 2048 MB |
> Region 0 16 | 0 | 0 | 0 | 0 |
> Region 1 131072 | 87219 | 8066 | 7892 | 7387 |
> Region 2 131072 | 131072 | 79036 | 0 | 0 |
> Region 3 131072 | 131072 | 131072 | 79061 | 0 |
> Region 4 131072 | 131072 | 131072 | 131072 | 0 |
> Region 5 131072 | 131072 | 131072 | 131072 | 79051 |
> Region 6 131072 | 131072 | 131072 | 131072 | 131072 |
> Region 7 131072 | 131072 | 131072 | 131072 | 131072 |
> Region 8 131056 | 105475 | 105472 | 105472 | 105472 |
>
> This shows that page allocation occurs in the order of increasing region
> numbers, as intended in this design.
>
> Performance impact:
> -------------------
>
> Kernbench results didn't show much of a difference between the performance
> of vanilla 3.7-rc3 and this patchset.
>
>
> Todos:
> =====
>
> 1. Memory-region aware page-reclamation:
> ----------------------------------------
>
> We would like to do page reclaim in the reverse order of page allocation
> within a zone, ie., in the order of decreasing region numbers.
> To achieve that, while scanning lru pages to reclaim, we could potentially
> look for pages belonging to higher regions (considering region boundaries)
> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
> reclaim candidates.
>
> 2. Compile-time exclusion of Memory Power Management, and extending the
> support to also work with other features such as Mem cgroups, kexec etc.
>
> References:
> ----------
>
> [1]. Review comments suggesting modifying the buddy allocator to be aware of
> memory regions:
> http://article.gmane.org/gmane.linux.power-management.general/24862
> http://article.gmane.org/gmane.linux.power-management.general/25061
> http://article.gmane.org/gmane.linux.kernel.mm/64689
>
> [2]. Patch series that implemented the node-region-zone hierarchy design:
> http://lwn.net/Articles/445045/
> http://thread.gmane.org/gmane.linux.kernel.mm/63840
>
> Summary of the discussion on that patchset:
> http://article.gmane.org/gmane.linux.power-management.general/25061
>
> Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
> http://thread.gmane.org/gmane.linux.kernel.mm/89202
>
> [3]. Disadvantages of having memory regions in the hierarchy between nodes and
> zones:
> http://article.gmane.org/gmane.linux.kernel.mm/63849
>
> [4]. Estimate of potential power savings on Samsung exynos board
> http://article.gmane.org/gmane.linux.kernel.mm/65935
>
> [5]. ACPI 5.0 and MPST support
> http://www.acpi.info/spec.htm
> Section 5.2.21 Memory Power State Table (MPST)
>
> Srivatsa S. Bhat (8):
> mm: Introduce memory regions data-structure to capture region boundaries within node
> mm: Initialize node memory regions during boot
> mm: Introduce and initialize zone memory regions
> mm: Add helpers to retrieve node region and zone region for a given page
> mm: Add data-structures to describe memory regions within the zones' freelists
> mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
> mm: Add an optimized version of del_from_freelist to keep page allocation fast
> mm: Print memory region statistics to understand the buddy allocator behavior
>
>
> include/linux/mm.h | 38 +++++++
> include/linux/mmzone.h | 52 +++++++++
> mm/compaction.c | 8 +
> mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
> mm/vmstat.c | 59 ++++++++++-
> 5 files changed, 390 insertions(+), 30 deletions(-)
>
>
> Thanks,
> Srivatsa S. Bhat
> IBM Linux Technology Center
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox